
SystemError when using ldamulticore on big corpus #646

Closed
hejunqing opened this issue Mar 29, 2016 · 4 comments

@hejunqing

I was running ldamulticore with default settings on a big corpus, and it came up with the following error:
Traceback (most recent call last):
File "/home/nlip/anaconda2/lib/python2.7/multiprocessing/queues.py", line 268, in _feed
send(obj)
SystemError: NULL result without error in PyObject_Call

The corpus was 533 MB, containing 5837527 documents, 1037945 features, and 77704086 non-zero entries. I tried serializing it to MmCorpus and bzipping it, but that didn't help. When I take only 1000000 documents, ldamulticore runs fine.

If it is a bug, please fix it as soon as possible and let me know. If not, please tell me how to work around it.
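
For reference, a minimal sketch of the setup described above (the corpus path is hypothetical):

from gensim import corpora, models

# MmCorpus streams documents from disk, so loading the corpus itself
# is not the problem; the error appears inside multiprocessing once
# training starts.
corpus = corpora.MmCorpus('./genlog/weibo.mm')  # hypothetical path

# default settings, as in the report: fails on the full 5.8M-document
# corpus, runs fine on a 1M-document subset
lda = models.LdaMulticore(corpus)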

@piskvorky piskvorky added the bug Issue described a bug label Mar 29, 2016
@tmylk
Contributor

tmylk commented Mar 30, 2016

LdaMulticore is a streaming algorithm, so it should be independent of corpus size.

Please provide the full stack trace and code to reproduce. Our issue reporting guidelines are at https://github.com/piskvorky/gensim/blob/develop/CONTRIBUTING.md
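
For reference, "streaming" here means the corpus only needs to be an iterable that yields one bag-of-words document at a time, so memory use does not grow with corpus size. A minimal sketch, assuming a made-up one-document-per-line "token_id:count" text format:

class StreamedCorpus(object):
    """Yield documents one at a time; nothing is kept in memory."""
    def __init__(self, fname):
        self.fname = fname

    def __iter__(self):
        with open(self.fname) as f:
            for line in f:
                # each document is a list of (token_id, count) pairs
                yield [(int(t), int(c)) for t, c in
                       (pair.split(':') for pair in line.split())]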

@piskvorky
Owner

@hejunqing this sounds like an issue with sending large objects during the parallelization.

Python has limitations on the size of objects it can pickle (and multiprocessing uses pickle internally), so if your object is too large, the multiprocessing parallelization will fail, with a rather unhelpful stack trace.

See also the discussion in #377.

There's not much we can do in gensim about this... I'd suggest you use a smaller model: either a smaller vocabulary (cut off more words) or fewer topics. I think the pickle limit is 2 GB or so (it may depend on OS and Python build).
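
As a rough back-of-envelope sketch (plain Python, not gensim internals): the biggest object being pickled scales with the topic-word matrix, num_topics x vocabulary_size float64 cells, so you can estimate how close a model is to that limit:

num_topics = 100       # LdaMulticore default
num_terms = 1037945    # vocabulary size reported above

# one float64 (8 bytes) per (topic, term) cell
matrix_bytes = num_topics * num_terms * 8
print('topic matrix: %.0f MB per copy' % (matrix_bytes / 1e6))
# ~830 MB here; more topics, or several copies in flight, can
# approach the ~2 GB pickle limit mentioned above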

@hejunqing
Author

@piskvorky Thank you, I think that is the point. I did some preprocessing to make the corpus smaller (450 MB), so ldamulticore finally ran without error.

But it is confusing that when I run top, I see only one running Python process, and CPU usage doesn't go up. It should use 20 CPU cores, as I set. For some reason it is not processing in parallel.
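
For reference, the number of workers is passed explicitly to the constructor; a minimal sketch (corpus path hypothetical):

from gensim import corpora, models

corpus = corpora.MmCorpus('./genlog/weibo.mm')  # hypothetical path

# workers = number of worker processes spawned in addition to the
# master; the gensim docs suggest physical cores minus one
lda = models.LdaMulticore(corpus, num_topics=100, workers=19)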

Following a Stack Overflow suggestion, I tried ShardedCorpus, but it raised an error too:
2016-03-30 17:12:39,850 : INFO : calculating IDF weights for 4641945 documents and 1148034 features (68951648 matrix non-zeros)
2016-03-30 17:12:40,942 : INFO : Initializing sharded corpus with prefix ./genlog/weibo
2016-03-30 17:12:40,942 : INFO : Building from corpus...
2016-03-30 17:12:40,942 : WARNING : Couldn't find number of features, trusting supplied dimension (1037945)
2016-03-30 17:12:40,942 : INFO : Running init from corpus.
2016-03-30 17:12:41,115 : INFO : Chunk no. 0 at 0.173318 s
read2TrainSet Finish!
Traceback (most recent call last):
File "genweibo_ms.py", line 68, in <module>
corpora.ShardedCorpus.serialize('./genlog/weibo',corpus_tfidf,dim=1037945)
File "/home/nlip/anaconda2/lib/python2.7/site-packages/gensim-0.12.4-py2.7-linux-x86_64.egg/gensim/corpora/sharded_corpus.py", line 848, in serialize
**kwargs)
File "/home/nlip/anaconda2/lib/python2.7/site-packages/gensim-0.12.4-py2.7-linux-x86_64.egg/gensim/corpora/sharded_corpus.py", line 828, in save_corpus
ShardedCorpus(fname, corpus, **kwargs)
File "/home/nlip/anaconda2/lib/python2.7/site-packages/gensim-0.12.4-py2.7-linux-x86_64.egg/gensim/corpora/sharded_corpus.py", line 242, in __init__
self.init_shards(output_prefix, corpus, shardsize)
File "/home/nlip/anaconda2/lib/python2.7/site-packages/gensim-0.12.4-py2.7-linux-x86_64.egg/gensim/corpora/sharded_corpus.py", line 282, in init_shards
current_shard = numpy.zeros((len(doc_chunk), self.dim), dtype=dtype)
MemoryError

Here is my code; please tell me how to fix it if I didn't use ShardedCorpus the right way. Thank you so much!
genweibo_mb (copy).txt

@piskvorky
Owner

I don't think you need the ShardedCorpus. Instead, try trimming your vocabulary by calling filter_extremes() to remove infrequent tokens: http://radimrehurek.com/gensim/corpora/dictionary.html#gensim.corpora.dictionary.Dictionary.filter_extremes

1.1 million features sounds excessive.
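
A minimal sketch of that (the thresholds shown are gensim's defaults; the texts placeholder stands in for your tokenized documents):

from gensim import corpora

texts = [['human', 'interface', 'computer']]  # placeholder: your tokenized docs

dictionary = corpora.Dictionary(texts)

# drop tokens that appear in fewer than 5 documents or in more than
# half of all documents, then keep only the 100000 most frequent
# of what remains
dictionary.filter_extremes(no_below=5, no_above=0.5, keep_n=100000)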
