Don't use Spark's slow wholeTextFiles() for LDA data fetches from S3 #49
Comments
Not sure what's going on here. Next step is to profile with the Scala version of the code on the same dataset to rule out sparkle being the culprit. I take it you tried this using the default |
Yes.
Is there a way to disable any potential file fetching laziness and ensure that the S3 files are downloaded at the beginning of the pipeline? |
I don't know, but one way to test is to download local copies of the dataset
and test using that.
|
Ok, confirmed. From S3, the run takes 7:15 on my laptop too, but with the files downloaded locally it takes exactly 1 minute. The nyt dataset has 500+ files to process. Looks to me like Spark's S3 client could be faster at downloading many small files (the |
Looks like a known issue: http://tech.kinja.com/how-not-to-pull-from-s3-using-apache-spark-1704509219. Haven't yet found an upstream ticket to track a resolution though. |
The link above has sample code for fetching data from S3 using the AmazonS3 lib directly rather than through Spark, as a workaround: https://gist.githubusercontent.com/pjrt/f1cad93b154ac8958e65/raw/7b0b764408f145f51477dc05ef1a99e8448bce6d/S3Puller.scala. Feel free to submit a PR Haskellizing that. |
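For anyone picking this up, here is a rough sketch of the shape of that workaround, written in Python rather than Scala or Haskell purely for illustration. It is not the code from the gist: the helper names (`list_keys`, `s3_reader`, `fetch_partition`, `s3_text_files`) and the default partition count are made up, and it assumes `boto3` and `pyspark` are installed with AWS credentials configured. The idea is to list the keys once on the driver, distribute only the key names, and let each partition download its objects with a plain S3 client instead of going through wholeTextFiles().

```python
from typing import Callable, Iterable, Iterator, List, Tuple

# A reader maps an object key to the decoded text of that object.
Reader = Callable[[str], str]


def list_keys(bucket: str, prefix: str) -> List[str]:
    """List every object key under a prefix in one paginated pass on the driver."""
    import boto3  # assumption: boto3 is available and credentials are configured

    paginator = boto3.client("s3").get_paginator("list_objects_v2")
    keys: List[str] = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys


def s3_reader(bucket: str) -> Reader:
    """Build a key -> text function backed by a single S3 client."""
    import boto3  # imported lazily so it is only needed where objects are fetched

    client = boto3.client("s3")

    def read(key: str) -> str:
        body = client.get_object(Bucket=bucket, Key=key)["Body"]
        return body.read().decode("utf-8")

    return read


def fetch_partition(
    keys: Iterable[str], make_reader: Callable[[], Reader]
) -> Iterator[Tuple[str, str]]:
    """Fetch all keys of one partition, creating the reader once per partition."""
    read = make_reader()
    for key in keys:
        yield key, read(key)


def s3_text_files(sc, bucket: str, prefix: str, num_partitions: int = 16):
    """Return an RDD of (key, contents) pairs, similar in shape to wholeTextFiles().

    `sc` is an existing SparkContext; `num_partitions` is an arbitrary default.
    """
    keys = list_keys(bucket, prefix)
    return sc.parallelize(keys, num_partitions).mapPartitions(
        lambda part: fetch_partition(part, lambda: s3_reader(bucket))
    )
```

Creating the client inside `fetch_partition` rather than on the driver avoids having to serialize it, and one client per partition keeps connection setup cost low when there are hundreds of small files. Whether this actually closes the gap to the local run on the nyt dataset would still need to be measured.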
sparkle-example-lda seems insanely slow (7 minutes locally, 2 minutes on an EMR cluster with 1 master and 2 workers, m3.xlarge); e.g. the first zipWithIndex takes multiple minutes (when run locally), not sure why. No CPU time is used in htop.