SystemError when using ldamulticore on big corpus #646
Comments
LdaMulticore is a streaming algorithm, so it should be independent of corpus size. Please provide the full stack trace and code to reproduce; our issue reporting guidelines are at https://github.com/piskvorky/gensim/blob/develop/CONTRIBUTING.md
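"Streaming" here means gensim only needs an iterable that yields one bag-of-words document at a time, so the corpus never has to fit in RAM. A minimal stdlib-only sketch of that pattern (the class name is illustrative, not gensim API):

```python
class StreamingCorpus:
    """Yield one bag-of-words document at a time, never holding the full corpus."""

    def __init__(self, documents):
        # `documents` is any re-iterable source, e.g. lines read lazily from disk.
        self.documents = documents

    def __iter__(self):
        for doc in self.documents:
            # Convert each raw document (list of token ids) into (token_id, count) pairs.
            counts = {}
            for token_id in doc:
                counts[token_id] = counts.get(token_id, 0) + 1
            yield sorted(counts.items())

# Usage: memory stays roughly constant no matter how many documents exist.
corpus = StreamingCorpus([[0, 1, 1], [2, 2, 0]])
print(list(corpus))  # [[(0, 1), (1, 2)], [(0, 1), (2, 2)]]
```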
@hejunqing this sounds like an issue with sending large objects during parallelization. Python has limitations on the size of objects it can pickle (and multiprocessing uses pickle internally), so if your object is too large, the transfer to worker processes fails. See also the discussion in #377. There's not much we can do about this in gensim... I'd suggest you use a smaller model: either a smaller vocabulary (cut off more words) or fewer topics. I think the pickle limit is 2GB or so (it may depend on OS and Python build).
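Since multiprocessing serializes work items with pickle before pushing them through its queues, one quick diagnostic is to measure the pickled size of whatever is being shipped to the workers (the helper name below is an illustrative assumption; the exact size limit is build-dependent, as noted above):

```python
import pickle

def pickled_size(obj):
    """Return the number of bytes `obj` occupies once pickled,
    i.e. roughly what multiprocessing would push through its queues."""
    return len(pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL))

# A large dict of term counts grows quickly when pickled.
vocab = {i: float(i) for i in range(100_000)}
print(f"{pickled_size(vocab) / 1e6:.1f} MB")
```

If this number approaches the gigabyte range for your model state or corpus chunks, the SystemError above becomes plausible.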
@piskvorky Thank you, I think that's the point. I did some preprocessing to make the corpus smaller (450M), and ldamulticore finally ran without error. But it is confusing that when I run top, I see only one running Python process and the CPU usage doesn't go up; it should use 20 CPU cores as I set. For some reason it is not processing in parallel. Following StackOverflow, I tried ShardedCorpus, but it raised an error too. Here is my code; please tell me how to fix it if I didn't use ShardedCorpus in the right way. Thank you so much!
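On the parallelism question: LdaMulticore hands corpus chunks to worker processes, and each chunk must be picklable to cross the process boundary; if the feeder thread fails to push chunks onto the queue, only one Python process stays busy in top. The same chunk-to-workers pattern in a stdlib-only sketch (the per-chunk function is a hypothetical stand-in, not gensim code):

```python
from multiprocessing import Pool

def process_chunk(chunk):
    # Stand-in for per-chunk work (gensim would run E-step inference here).
    return sum(count for doc in chunk for _, count in doc)

def chunks(corpus, chunksize):
    # Split a list of bag-of-words documents into picklable chunks for workers.
    for i in range(0, len(corpus), chunksize):
        yield corpus[i:i + chunksize]

if __name__ == "__main__":
    corpus = [[(0, 1), (1, 2)], [(2, 3)], [(0, 4), (3, 1)]]
    with Pool(processes=2) as pool:
        totals = pool.map(process_chunk, chunks(corpus, 2))
    print(totals)  # → [6, 5]
```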
I don't think you need the ShardedCorpus. Instead, try trimming your vocabulary, e.g. with Dictionary.filter_extremes(); 1.1 million features sounds excessive.
I was running ldamulticore with default settings on a big corpus, and it came up with the following error:
Traceback (most recent call last):
File "/home/nlip/anaconda2/lib/python2.7/multiprocessing/queues.py", line 268, in _feed
send(obj)
SystemError: NULL result without error in PyObject_Call
The corpus was 533M, containing 5,837,527 documents, 1,037,945 features, and 77,704,086 non-zero entries. I have tried serializing it to MmCorpus and bzipping it, but that doesn't work. When I take only 1,000,000 documents, ldamulticore runs fine.
If it is a bug, please fix it as soon as possible and let me know. If not, please tell me how to work around it.