Don't use Spark's slow wholeTextFiles() for LDA data fetches from S3 #49

nh2 · 2016-08-10T02:39:04Z

sparkle-example-lda seems insanely slow: 7 minutes locally, and 2 minutes on an EMR cluster (1 master and 2 workers, m3.xlarge). For example, the first zipWithIndex takes multiple minutes when run locally, and I'm not sure why.

No CPU time is used in htop.


nh2 · 2016-08-10T02:46:34Z

Example of slowness on EMR:

mboes · 2016-08-10T12:13:15Z

Not sure what's going on here. Next step is to profile with the Scala version of the code on the same dataset to rule out sparkle being the culprit. zipWithIndex is a trivial wrapper around the Java RDD.zipWithIndex() method, so this could be an upstream issue.

I take it you tried this using the default nyt dataset? It could also be the latency of reading a bunch of files from an S3 bucket.

nh2 · 2016-08-10T12:40:33Z

> I take it you tried this using the default nyt dataset?

Yes.

> It could also be the latency of reading a bunch of files from an S3 bucket.

Is there a way to disable any potential file fetching laziness and ensure that the S3 files are downloaded at the beginning of the pipeline?

mboes · 2016-08-10T12:45:20Z

I don't know, but a way to test this is to download local copies of the dataset and run against those.

mboes · 2016-08-10T12:54:50Z

Ok, confirmed. From S3, the run takes 7:15 minutes on my laptop too. But if I download files locally, then it just takes exactly 1 minute. In the nyt dataset, there are 500+ files to process. Looks to me like Spark's S3 client could be faster at downloading many small files (the aws s3 CLI utility is pretty quick in comparison).
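A rough back-of-the-envelope reading of those two timings, as a sketch only. It assumes exactly 500 files (the count above is "500+") and that the whole difference between the two runs is S3 fetch overhead; neither assumption has been measured.

```python
# Timings reported above for laptop runs of sparkle-example-lda.
s3_run_seconds = 7 * 60 + 15      # 7:15 when reading from S3
local_run_seconds = 1 * 60        # 1:00 when reading local copies

# Assumption: exactly 500 files (the thread only says "500+"), and the
# entire difference between the two runs is S3 fetch overhead.
assumed_file_count = 500

extra_seconds = s3_run_seconds - local_run_seconds
overhead_per_file = extra_seconds / assumed_file_count

print(f"extra time spent on S3: {extra_seconds} s")
print(f"upper bound on overhead per file: {overhead_per_file:.2f} s")
```

Under those assumptions the S3 run spends 375 extra seconds, or at most about 0.75 s per file. That is what you would expect if the files are fetched one after another and each fetch pays its own round-trip latency.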

mboes · 2016-08-10T13:10:00Z

Looks like a known issue: http://tech.kinja.com/how-not-to-pull-from-s3-using-apache-spark-1704509219. Haven't yet found an upstream ticket to track a resolution though.

mboes · 2016-08-15T11:05:19Z

The above-mentioned link has sample code for fetching data from S3 using the AmazonS3 lib directly rather than through Spark, as a workaround: https://gist.githubusercontent.com/pjrt/f1cad93b154ac8958e65/raw/7b0b764408f145f51477dc05ef1a99e8448bce6d/S3Puller.scala. Feel free to submit a PR Haskellizing that.
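To illustrate the general shape of that workaround (fetch each object through an S3 client yourself, with many fetches in flight at once), here is a minimal sketch in plain Python. It is not the Scala code from the gist and not a sparkle API. `fetch_all` and `make_s3_fetcher` are hypothetical names, the thread pool stands in for Spark's workers, and the boto3 part assumes boto3 is installed and credentials are configured.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def fetch_all(keys: Iterable[str],
              fetch_one: Callable[[str], bytes],
              max_workers: int = 32) -> Dict[str, bytes]:
    """Fetch every key concurrently and return {key: contents}."""
    key_list = list(keys)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so zip pairs each key with its body.
        bodies = list(pool.map(fetch_one, key_list))
    return dict(zip(key_list, bodies))


def make_s3_fetcher(bucket: str) -> Callable[[str], bytes]:
    """Build a fetch_one backed by boto3 (assumed installed and configured)."""
    import boto3  # imported lazily so the rest of the sketch runs without it

    client = boto3.client("s3")

    def fetch_one(key: str) -> bytes:
        # Download one object and return its raw bytes.
        return client.get_object(Bucket=bucket, Key=key)["Body"].read()

    return fetch_one


if __name__ == "__main__":
    # Stand-in fetcher so the sketch runs without AWS credentials.
    fake_store = {f"doc-{i}.txt": f"text {i}".encode() for i in range(5)}
    docs = fetch_all(fake_store, fake_store.__getitem__, max_workers=4)
    print(len(docs))
```

Against a real bucket the call would look like `fetch_all(keys, make_s3_fetcher("some-bucket"))`, where the bucket name and the key list are placeholders. The point of the design is that per-object latency overlaps across fetches instead of adding up.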

mboes changed the title ~~sparkle-example-lda is extremely slow~~ Don't use Spark's slow wholeTextFiles() for LDA data fetches from S3 Aug 15, 2016
