This repository has been archived by the owner on May 22, 2024. It is now read-only.

Results from by processing limited to 100 rows #43

Open
ashokrags opened this issue Jun 28, 2016 · 8 comments

Comments

ashokrags commented Jun 28, 2016

Hi,
When I run by processing in HAWQ using a custom function, I only get results for 100 rows. Any idea why that happens?

If I do something like this:
samples_tissue_mu_logcpm <- by(gtex_df[, "logCpm"], c(gtex_df$gene, gtex_df$Tissue_type), mean)

then I get the correct number of rows, 112636.
However, if I submit my own function or a non-standard function, as below, only 100 results are retrieved:
samples_tissue_logcpm_q_lo <- by(gtex_df[, "logCpm"], c(gtex_df$gene, gtex_df$Tissue_type), FUN=function(x) { y <- lookat(x, nrows=NULL); return(quantile(y, probs=0.25)) })

Contributor

iyerr3 commented Jun 28, 2016

lookat returns 100 rows by default; set nrows=-1 to get all rows.

Author

ashokrags commented Jun 28, 2016

Thanks. So nrows=NULL will not work? It works under other circumstances. I will test with nrows=-1 and let you know whether it works.

Contributor

iyerr3 commented Jun 28, 2016

I missed the NULL input in your call. That is supposed to work - I'll have to debug this further.

Contributor

orhankislal commented Sep 16, 2016

It seems to work as intended. I have tried the following commands to replicate the issue.

> lof2
[[1]]
Table : "abalone"
Database : madlib-gpdb43
Host : 127.0.0.1
Connection : 1
> lapply(lof2, FUN=function(x) { y <- lookat(x); return(quantile(y$id, probs=0.25 )) })
[[1]]
25%
102
> lapply(lof2, FUN=function(x) { y <- lookat(x, nrows=100); return(quantile(y$id, probs=0.25 )) })
[[1]]
25%
100
> lapply(lof2, FUN=function(x) { y <- lookat(x, nrows=NULL); return(quantile(y$id, probs=0.25 )) })
[[1]]
25%
1045
> lapply(lof2, FUN=function(x) { y <- lookat(x, nrows=-1); return(quantile(y$id, probs=0.25 )) })
[[1]]
25%
1045
> quantile(abalone$id, prob=0.25)
25%
1045

I also tried tapply (the function that by wraps):

> m <- matrix(c(db.abalone,db.abalone2,1,2), nrow=2)
> fac <- factor(rep(1:2, length = 2), levels = 1:2)
> tapply(m[,1], fac, FUN=function(x) { y <- lookat(x[[1]], nrows=100); return(quantile(y$id, probs=0.25 )) })
  1   2
102 101
> tapply(m[,1], fac, FUN=function(x) { y <- lookat(x[[1]], nrows=NULL); return(quantile(y$id, probs=0.25 )) })
   1    2
1045 1045
> tapply(m[,1], fac, FUN=function(x) { y <- lookat(x[[1]], nrows=-1); return(quantile(y$id, probs=0.25 )) })
   1    2
1045 1045

@ashokrags Do you have a by example that I can use to reproduce this error?

Author

ashokrags commented Sep 17, 2016

@orhankislal I gave a by example above. How many rows does the function return when you apply it over the entire data frame? For example, if you gave fac 1000 levels and then ran by processing to get a summary for each level, would it return 1000 rows? The inbuilt mean function does.

@orhankislal
Contributor

@ashokrags I saw the example you gave but it is not clear what the gtex_df structure looks like. I recently started looking into PivotalR but my understanding is that lookat requires a db.table type object. That is why the matrix I created in my second example has multiple db.table objects (so that lookat can look at each one).

Contributor

fmcquillan99 commented Sep 28, 2016

This seems to be working OK from our perspective, so @ashokrags please let us know if you are still having issues.

Author

ashokrags commented Sep 28, 2016

@orhankislal @fmcquillan99 gtex_df is a table with 555 million rows and several columns (6 or 7, I think). All I want is an aggregate summary statistic by groups of a particular column (say, one with 300 different categories). As I mentioned before, if I use the inbuilt mean function it works, but when I use a custom function it only returns 100 rows. If it does not appear to be an issue on your side, then I think it could be some time-out issue in what gets returned within the implementation on our side. Thanks for checking into this. I will reopen this issue if I get any more information indicating that it is an issue on the R side. Thanks a lot again for taking the time to look into this.
