Large array conversion #30

Open
FeatureMan2 opened this issue Dec 14, 2022 · 8 comments
Comments

@FeatureMan2

Hi,

I'm facing the issue that when converting Java data to the R representation, my datasets are too large and it fails.

Specifically, it fails here: https://github.com/s-u/REngine/blame/master/Rserve/protocol/REXPFactory.java#L476 where cont.asDoubles().length is about 700 million, which at 8 bytes per element is about 5.6 billion bytes, larger than the maximum int value of roughly 2.1 billion. Shouldn't we be using long instead of int throughout this package to support larger datasets?

Thanks for any thoughts on this.
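The arithmetic behind the failure can be reproduced in isolation (a minimal sketch; the ~700-million element count is taken from the report above). Multiplying the element count by 8 bytes in int arithmetic silently wraps modulo 2^32, and even when computed in long, the result exceeds what a single int-indexed Java array can address:

```java
public class IntOverflow {
    public static void main(String[] args) {
        int n = 700_000_000;             // ~700 million doubles, as in the report
        int wrong = n * 8;               // int arithmetic wraps modulo 2^32
        long right = (long) n * 8;       // widen to long before multiplying
        System.out.println(wrong);       // 1305032704 (silently wrapped, no error)
        System.out.println(right);       // 5600000000
        System.out.println(Integer.MAX_VALUE); // 2147483647: also the cap on array lengths
    }
}
```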

@FeatureMan2 changed the title from "Large arrays" to "Large array conversion" on Dec 14, 2022
@s-u
Owner

s-u commented Dec 14, 2022

@FeatureMan2 Unfortunately, this is a Java limitation, because in Java arrays can have at most 2^31-1 elements and are indexed by signed integers. Java does not support long as array indices.

You could split the dataset and then c or rbind it back together. Note that even if we changed the whole way packets are assembled in REngine, say by using arrays of arrays instead, it would only increase the capacity by at most a factor of 8, because your dataset arrays are subject to the same limitation.

Put simply, Java is not a language designed for data analysis, especially not large data, so you may have more luck reading the data directly from R instead if you can.
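The split-and-reassemble approach could look like this on the Java side (a minimal sketch using only the standard library; the chunk helper is hypothetical, and the actual transfer of each chunk to R, e.g. via RConnection.assign followed by c() on the R side, is omitted):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ChunkDemo {
    // Split a large array into slices of at most chunkSize elements,
    // each small enough that 8 * length stays within int range when
    // serialized.
    static List<double[]> chunk(double[] data, int chunkSize) {
        List<double[]> chunks = new ArrayList<>();
        for (int start = 0; start < data.length; start += chunkSize) {
            int end = Math.min(start + chunkSize, data.length);
            chunks.add(Arrays.copyOfRange(data, start, end));
        }
        return chunks;
    }

    public static void main(String[] args) {
        double[] data = new double[10_000];
        Arrays.fill(data, Math.PI);
        List<double[]> parts = chunk(data, 3_000);
        // 10,000 elements in chunks of 3,000 -> 4 chunks, last one 1,000 long
        System.out.println(parts.size());                        // 4
        System.out.println(parts.get(parts.size() - 1).length);  // 1000
        // Each parts.get(i) would then be assigned to R (hypothetically
        // conn.assign("chunk" + i, parts.get(i))) and reassembled with
        // x <- c(chunk0, chunk1, ...) on the R side.
    }
}
```

Any chunk size strictly below 2^28 (about 268 million) elements keeps 8 × length within int range for the serializer.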

@FeatureMan2
Author

Thanks for your insights. I think I'll use your fallback method with c/rbind.

@FeatureMan2
Author

FeatureMan2 commented Jan 23, 2023

I tried the c approach to some initial success but then ultimately got stuck again.

I split the large array into chunks then reassembled it in R with c. But after that the engine would fail on random statements - even simple string assignment operations.

So then I switched to using the Renjin engine to export the Java dataset to its embedded R engine, storing it to an .RData file, and then loading it back into the REngine environment. The idea was to bypass sending so much data over sockets. It worked, but then I faced the same random error issue...

I'm not quite clear on what is producing these errors. I thought it was RAM usage, but I can successfully run

a <- rep(pi, 2^31-1)
b <- rep(pi, 2^31-1)
c <- c(a,b)

while using up >64 GB of Rserve memory...

@s-u
Owner

s-u commented Jan 23, 2023

Can you post your actual code? We can't really help you based on the anecdotes above.

@FeatureMan2
Author

FeatureMan2 commented Jan 24, 2023

The code is long and unfortunately I'm not able to share it.

It essentially fails at this point:

if (rp!=null && rp.isOk()) return;
(this is with the old version, REngine 0.9.2 under R v3.6, but I get the same errors under 2.1.1 under R 4.2) where rp.isOk() returns false. I'm simply assigning a string to a variable here. I don't think it matters what statement I execute; it falls over regardless. I cannot manually execute any statement in debug mode after such an error.

This error comes after loading an R numeric array of 650 million elements like so: load(file="myfile1.RData"), then executing a few more trivial assignment statements, then trying to load another similar 650-million-element file. What's weird is that it doesn't fail on loading the RData file itself, but just before, on assigning the statement load(file="myfile2.RData") to a string variable before running R's eval() on it (just as it had loaded myfile1.RData previously). The Linux Rserve process is using about 15 GB of RAM at this point. The error codes I get are typically 17, or sometimes 98.

Note that this same code works just fine on smaller datasets, say if I have 5 million elements...

I was getting similar types of errors when splitting arrays into smaller chunks and concatenating with c().

@s-u
Owner

s-u commented Jan 24, 2023

It looks as if something goes wrong with the previous statement. It could also be that R simply runs out of memory. First, run the debug version of Rserve (either set debug=TRUE in Rserve() or run R CMD Rserve.dbg instead of R CMD Rserve); that way you can see what is happening.
Second, can you create a reproducible example you can share? You could, for example, mimic the dataset with something like data.frame(a=1:1e9/2), creating similar sizes and types without revealing your data.

@FeatureMan2
Author

Our problem was solved by splitting the datasets Java -> R using the suggested c(a,b) trick, and also splitting the return values R -> Java (as these were large too). This, in combination with judicious use of RConnection.voidEval() instead of the default REngine.parseAndEval() - otherwise R statements with large return values were automatically transmitted back to Java, leading to misleading errors due to what I assume was a compromised state of the REngine after executing such statements.

I would suggest making a voidEval() method available on REngine, since it is not by default. Hope this helps someone. It may also be worth updating the Javadoc or main documentation to state that large arrays (with ~200M+ elements) will not work, and to suggest work-arounds.

@s-u
Owner

s-u commented Feb 23, 2023

Thanks. Since this is a Java limitation, different applications may use different solutions depending on the use case.

RConnection.voidEval(x) is the same as REngine.eval(x, null, false). Besides, you always have control over what you return from the evaluation - most commonly you return some indicator of success. I have a feeling you would probably be better off evaluating compound statements, since that is more efficient than using multiple voidEval calls.
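The compound-statement suggestion could be sketched like this (a minimal illustration using only the standard library; only the string assembly is shown, and the eventual conn.eval(...) call and file names are assumptions based on the thread above):

```java
public class CompoundEval {
    // Join several R statements into one block whose last expression is a
    // small success flag, so large intermediate values are never serialized
    // back to Java - one round trip instead of several voidEval calls.
    static String compound(String... stmts) {
        return "{ " + String.join("; ", stmts) + "; TRUE }";
    }

    public static void main(String[] args) {
        String r = compound(
            "load(file=\"myfile1.RData\")",
            "load(file=\"myfile2.RData\")",
            "x <- c(a, b)");
        System.out.println(r);
        // Hypothetically evaluated as conn.eval(r); only TRUE comes back.
    }
}
```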


2 participants