
SystemError when using ldamulticore on big corpus #646

Closed
hejunqing opened this issue Mar 29, 2016 · 4 comments

@hejunqing

I was running ldamulticore with default settings on a big corpus, and it came up with the following error:
Traceback (most recent call last):
File "/home/nlip/anaconda2/lib/python2.7/multiprocessing/queues.py", line 268, in _feed
send(obj)
SystemError: NULL result without error in PyObject_Call

The corpus was 533 MB, containing 5837527 documents, 1037945 features, and 77704086 non-zero entries. I tried serializing it to MmCorpus and bzipping it, but that didn't help. When I take only 1000000 documents, ldamulticore runs fine.

If it is a bug, please fix it as soon as possible and let me know. If not, please tell me how to work around it.
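
For reference, a minimal sketch of the setup described above (the corpus path is hypothetical):

from gensim import corpora, models

# MmCorpus streams documents from disk, so loading the corpus itself
# is not the problem; the error appears inside multiprocessing once
# training starts.
corpus = corpora.MmCorpus('./genlog/weibo.mm')  # hypothetical path

# default settings, as in the report: fails on the full 5.8M-document
# corpus, runs fine on a 1M-document subset
lda = models.LdaMulticore(corpus)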

@piskvorky piskvorky added the bug Issue described a bug label Mar 29, 2016
@tmylk
Contributor

tmylk commented Mar 30, 2016

LdaMulticore is a streaming algorithm, so it should be independent of corpus size.

Please provide the full stack trace and code to reproduce. Our issue reporting guidelines are at https://github.com/piskvorky/gensim/blob/develop/CONTRIBUTING.md
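
For reference, "streaming" here means the corpus only needs to be an iterable that yields one bag-of-words document at a time, so memory use does not grow with corpus size. A minimal sketch, assuming a made-up one-document-per-line "token_id:count" text format:

class StreamedCorpus(object):
    """Yield documents one at a time; nothing is kept in memory."""
    def __init__(self, fname):
        self.fname = fname

    def __iter__(self):
        with open(self.fname) as f:
            for line in f:
                # each document is a list of (token_id, count) pairs
                yield [(int(t), int(c)) for t, c in
                       (pair.split(':') for pair in line.split())]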

@piskvorky
Owner

@hejunqing this sounds like an issue with sending large objects during the parallelization.

Python has limitations on the size of objects it can pickle (and multiprocessing uses pickle internally), so if your object is too large, the multiprocessing parallelization will fail, with a rather unhelpful stack trace.

See also the discussion in #377.

There's not much we can do in gensim about this... I'd suggest you use a smaller model: either a smaller vocabulary (cut off more words) or fewer topics. I think the pickle limit is 2 GB or so (it may depend on OS and Python build).
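
As a rough back-of-envelope sketch (plain Python, not gensim internals): the biggest object being pickled scales with the topic-word matrix, num_topics x vocabulary_size float64 cells, so you can estimate how close a model is to that limit:

num_topics = 100       # LdaMulticore default
num_terms = 1037945    # vocabulary size reported above

# one float64 (8 bytes) per (topic, term) cell
matrix_bytes = num_topics * num_terms * 8
print('topic matrix: %.0f MB per copy' % (matrix_bytes / 1e6))
# ~830 MB here; more topics, or several copies in flight, can
# approach the ~2 GB pickle limit mentioned above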

@hejunqing
Author

@piskvorky Thank you, I think that is the point. I did some preprocessing to make the corpus smaller (450 MB), so ldamulticore finally ran without error.

But it is confusing that when I run top, I see only one running Python process, and CPU usage doesn't go up. It should use 20 CPU cores, as I set. For some reason it is not processing in parallel.
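
For reference, the number of workers is passed explicitly to the constructor; a minimal sketch (corpus path hypothetical):

from gensim import corpora, models

corpus = corpora.MmCorpus('./genlog/weibo.mm')  # hypothetical path

# workers = number of worker processes spawned in addition to the
# master; the gensim docs suggest physical cores minus one
lda = models.LdaMulticore(corpus, num_topics=100, workers=19)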

Following a Stack Overflow suggestion, I tried ShardedCorpus, but it raised an error too:
2016-03-30 17:12:39,850 : INFO : calculating IDF weights for 4641945 documents and 1148034 features (68951648 matrix non-zeros)
2016-03-30 17:12:40,942 : INFO : Initializing sharded corpus with prefix ./genlog/weibo
2016-03-30 17:12:40,942 : INFO : Building from corpus...
2016-03-30 17:12:40,942 : WARNING : Couldn't find number of features, trusting supplied dimension (1037945)
2016-03-30 17:12:40,942 : INFO : Running init from corpus.
2016-03-30 17:12:41,115 : INFO : Chunk no. 0 at 0.173318 s
read2TrainSet Finish!
Traceback (most recent call last):
File "genweibo_ms.py", line 68, in <module>
corpora.ShardedCorpus.serialize('./genlog/weibo',corpus_tfidf,dim=1037945)
File "/home/nlip/anaconda2/lib/python2.7/site-packages/gensim-0.12.4-py2.7-linux-x86_64.egg/gensim/corpora/sharded_corpus.py", line 848, in serialize
**kwargs)
File "/home/nlip/anaconda2/lib/python2.7/site-packages/gensim-0.12.4-py2.7-linux-x86_64.egg/gensim/corpora/sharded_corpus.py", line 828, in save_corpus
ShardedCorpus(fname, corpus, **kwargs)
File "/home/nlip/anaconda2/lib/python2.7/site-packages/gensim-0.12.4-py2.7-linux-x86_64.egg/gensim/corpora/sharded_corpus.py", line 242, in __init__
self.init_shards(output_prefix, corpus, shardsize)
File "/home/nlip/anaconda2/lib/python2.7/site-packages/gensim-0.12.4-py2.7-linux-x86_64.egg/gensim/corpora/sharded_corpus.py", line 282, in init_shards
current_shard = numpy.zeros((len(doc_chunk), self.dim), dtype=dtype)
MemoryError

Here is my code; please tell me how to fix it if I didn't use ShardedCorpus the right way. Thank you so much!
genweibo_mb (copy).txt

@piskvorky
Owner

I don't think you need the ShardedCorpus. Instead, try trimming your vocabulary by calling filter_extremes() to remove infrequent tokens: http://radimrehurek.com/gensim/corpora/dictionary.html#gensim.corpora.dictionary.Dictionary.filter_extremes

1.1 million features sounds excessive.
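
A minimal sketch of that (the thresholds shown are gensim's defaults; the texts placeholder stands in for your tokenized documents):

from gensim import corpora

texts = [['human', 'interface', 'computer']]  # placeholder: your tokenized docs

dictionary = corpora.Dictionary(texts)

# drop tokens that appear in fewer than 5 documents or in more than
# half of all documents, then keep only the 100000 most frequent
# of what remains
dictionary.filter_extremes(no_below=5, no_above=0.5, keep_n=100000)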
