-
Notifications
You must be signed in to change notification settings - Fork 47
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RecordDAWG only stores a very small subset of data when reading from generator function #9
Comments
It actually looks like this works fine if I limit the number of lines read to be 1 million, so I guess it is not the generator function after all. I think the problem is because of the 536,870,912 (2^29) DAWG unit limit. I did not think this was a problem because there are 507,369,749 elements I would like to store in the DAWG, but I just now noticed this is based on the number of transitions, not elements. On a related note, it looks like >>> import dawg
>>> d = dawg.RecordDAWG('d')
>>> d.load('sorted_ngrams.txt.rankratios.dawg')
<dawg.RecordDAWG object at 0x2b1e3a53eb00>
>>> d['zurich\t#csum ,']
[(1.06259921167,)] |
Hi Dan, I think there could be 2 issues.
If you need only a single integer value then give IntCompletionDAWG a try: it will be several times faster and could use less memory (depending on data). It could also use less transitions, by the way. Or use marisa-trie.Trie and store values in an array (marisa-trie provides a perfect hash function that maps key to a number 0..len(trie)-1 and vice versa). (1) input_is_sorted is slightly complicated with RecordDAWG because you need to make sure the values are also sorted properly if there are several values. For integer values this means storing them in big-endian format and sorting by value. But it seems that this doesn't affect you because you're only interested in a single value. |
Once I was trying to store a huge number of bigrams, each bigram was associated with a float number. I was not interested in high precision, so the first approach was to store them in RecordDAWG and use 1-byte floats (floats converted to ints 0<=x<256 using some maths). But in practice marisa-trie (with an external array for values) worked better. |
Anyway, I don't know why DAWG saving didn't raise an exception; this looks like a bug. |
Are RecordDAWGs compatible with using generator functions to fill them?
I have a very large TSV file that contains: words, the contexts they appear in, and then some statistics about the words and contexts. The data is calculated based on Gigaword so the file is 29 GB with 507 million lines, and when I tried to build a dictionary that mapped from word-context pairs to their rank ratios it took up over 80 GB of RAM (and the machine I'm using only has 64), so I tried using RecordDAWGs.
Unfortunately, when I use code like the following Python churns away for a long time seemingly loading the DAWG (and using about 45GB of RAM, which is more reasonable), but then the DAWG it saves only has 6 elements in it and takes up 1.6 KB. Is this because the RecordDAWG constructor cannot read from a generator function?
The text was updated successfully, but these errors were encountered: