Word2Vec.load_word2vec_format should support python file like objects #372

Open
pgroth opened this issue Jun 29, 2015 · 17 comments
Labels: difficulty medium (Medium issue: required good gensim understanding & python skills); feature (Issue described a new feature)

Comments

@pgroth

pgroth commented Jun 29, 2015

Currently, the loading code requires a filename. It would be nice to allow a file object to be given instead. This would make it easier to load models in environments (e.g. Spark) that don't necessarily have access to a normal file system.

@piskvorky
Owner

Just use pickle manually then.

The point of save / load is they do some extra work for efficiency (storing large arrays in separate files, memory mapping, transparent compression).

None of that is possible with just a file descriptor, so you might as well pickle the object manually.
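For readers wondering what "pickle manually" could look like with a stream instead of a filename, here is a minimal sketch. It is only an illustration: a plain dict stands in for a trained model, `io.BytesIO` stands in for whatever file-like object the environment provides, and the helper names `dump_to_stream` and `load_from_stream` are made up for this example.

```python
import io
import pickle


def dump_to_stream(obj, stream):
    """Serialize obj into any writable binary file-like object."""
    pickle.dump(obj, stream, protocol=pickle.HIGHEST_PROTOCOL)


def load_from_stream(stream):
    """Deserialize an object from any readable binary file-like object.

    Only unpickle data from sources you trust.
    """
    return pickle.load(stream)


# Assumption: a dict of word -> vector stands in for a real model here.
toy_model = {"hello": [0.1, 0.2, 0.3], "world": [0.4, 0.5, 0.6]}

buffer = io.BytesIO()          # stand-in for a remote or in-memory stream
dump_to_stream(toy_model, buffer)
buffer.seek(0)                 # rewind before reading back
restored = load_from_stream(buffer)
```

This route gives up the extras mentioned above (separate array files, memory mapping, transparent compression), which is the trade-off being described.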

CC @gojomo @cscorley thoughts?

@pgroth
Author

pgroth commented Jun 30, 2015

OK. Maybe I'll look at what happens when I call open directly on an HDFS path.

Thanks!
Paul

@gojomo
Collaborator

gojomo commented Jul 1, 2015

There was a comment somewhere that pickle might encounter problems on very-large numpy arrays. Is that true? If so, it'd be good to ensure there's a workaround for that other than the multi-path filesystem-persistence, since it's always nice to have the option to load/save from arbitrary streams/pipe/file-likes.

@piskvorky
Owner

Yes, that's one reason why gensim has a dedicated save. Memory mapping is the other.

Some HDF5-like options for saving could be interesting too. Let me know if you write something in that direction, we could include that in gensim too (if it solves a common enough use-case with a sane enough API).

@piskvorky
Owner

Oops, I only just noticed this ticket is about load_word2vec_format, not normal save/load.

@pgroth that's a much easier proposition -- can you submit a PR that will allow file-like param in load_word2vec_format? Maybe we can just overload the same fname param, and dynamically check whether it's to be interpreted as a filename or not (=string or not).

By the way, allowing file handles in save/load was added in #292 (gensim 0.11.0), though I haven't used it much myself.
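The "overload the same fname param" idea could be sketched roughly as follows. This is not gensim's code: `open_source` and `read_header` are hypothetical helpers, and the example only parses the header line (`<vocab_size> <vector_size>`) that word2vec-format files start with, to show the string-or-not dispatch.

```python
import io
import os
from contextlib import contextmanager


@contextmanager
def open_source(fname_or_handle, mode="rb"):
    """Yield a binary file object for either a path or an open file-like.

    Paths are opened and closed here; file-like objects are passed
    through untouched, so the caller stays responsible for closing them.
    """
    if isinstance(fname_or_handle, (str, bytes, os.PathLike)):
        with open(fname_or_handle, mode) as handle:
            yield handle
    else:
        yield fname_or_handle


def read_header(fname_or_handle):
    """Return (vocab_size, vector_size) parsed from the first line."""
    with open_source(fname_or_handle) as fin:
        vocab_size, vector_size = (int(x) for x in fin.readline().split())
    return vocab_size, vector_size


# Usage with an in-memory stream instead of a filename.
stream = io.BytesIO(b"2 3\nhello 0.1 0.2 0.3\nworld 0.4 0.5 0.6\n")
header = read_header(stream)
```

One design question this leaves open is ownership: the sketch never closes a handle it did not open, which seems the least surprising choice for callers.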

@pgroth
Author

pgroth commented Jul 6, 2015

Sure. I'll probably do that next week or so.

@wikier

wikier commented Jan 30, 2017

What's the status of this?

@petabyte

Was this resolved? In what version?
This is what I get when I pass a file object to load:
File "Lib\site-packages\gensim\models\word2vec.py", line 1470, in load
File "Lib\site-packages\gensim\utils.py", line 264, in load
File "Lib\site-packages\gensim\utils.py", line 329, in _adapt_by_suffix
AttributeError: 'file' object has no attribute 'endswith'
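The traceback suggests the loader inspects the filename suffix, which fails as soon as the argument is not a string. A small reproduction of that failure mode, plus one possible guard, might look like this. Both functions are hypothetical illustrations and are not copied from gensim's `utils.py`.

```python
import io
import os


def compression_from_name(fname):
    """Hypothetical suffix check that assumes fname is a string."""
    if fname.endswith(".gz"):
        return "gzip"
    if fname.endswith(".bz2"):
        return "bz2"
    return "none"


def compression_from_source(source):
    """Same check, but tolerant of file-like objects.

    For a file-like object, fall back to its .name attribute if it has
    a string one; otherwise assume no compression.
    """
    if isinstance(source, (str, os.PathLike)):
        name = os.fspath(source)
    else:
        name = getattr(source, "name", "")
        if not isinstance(name, str):
            name = ""
    return compression_from_name(name)


# Passing a file object to the string-only check raises AttributeError,
# mirroring the error reported above.
try:
    compression_from_name(io.BytesIO())
    failed = False
except AttributeError:
    failed = True
```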

@pgroth
Author

pgroth commented Mar 10, 2017 via email

@yupbank

yupbank commented Jun 29, 2017

Hey, I also ran into the same problem. I'm loading a very big model (~1.6 GB) from HDFS in gz format,
and smart_open apparently doesn't provide readline for HDFS files. What is the recommended way to deal with this?

@yupbank

yupbank commented Jun 30, 2017

So I figured out what the problem is:

  • smart_open('hdfs://xx.gz'): gz decoding is not supported.
  • smart_open('hdfs://xx').readline() is not supported (or perhaps doesn't need to be), but load_word2vec_format then needs to support file objects that provide only read().
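Until both gaps are closed upstream, one possible workaround is to wrap a read-only stream using the standard library alone. The sketch below assumes nothing about smart_open: `ReadOnlyStream` is a made-up stand-in for a remote handle that exposes only `read()`, and `RawAdapter` and `with_readline` are hypothetical helper names.

```python
import gzip
import io


class ReadOnlyStream:
    """Hypothetical stand-in for a remote stream exposing only read()."""

    def __init__(self, data):
        self._data = data
        self._pos = 0

    def read(self, size=-1):
        if size is None or size < 0:
            size = len(self._data) - self._pos
        chunk = self._data[self._pos:self._pos + size]
        self._pos += len(chunk)
        return chunk


class RawAdapter(io.RawIOBase):
    """Expose a read()-only object through the io.RawIOBase interface."""

    def __init__(self, stream):
        super().__init__()
        self._stream = stream

    def readable(self):
        return True

    def readinto(self, buffer):
        chunk = self._stream.read(len(buffer))
        buffer[:len(chunk)] = chunk
        return len(chunk)  # 0 signals end of stream


def with_readline(stream, gzipped=False):
    """Return a buffered reader that supports readline() over stream."""
    if gzipped:
        # GzipFile decompresses on the fly and provides readline() itself.
        return gzip.GzipFile(fileobj=stream, mode="rb")
    return io.BufferedReader(RawAdapter(stream))


payload = b"2 3\nhello 0.1 0.2 0.3\nworld 0.4 0.5 0.6\n"
plain_first = with_readline(ReadOnlyStream(payload)).readline()
gz_first = with_readline(ReadOnlyStream(gzip.compress(payload)), gzipped=True).readline()
```

Both readers pull data sequentially, so this only suits loaders that never need to seek backwards.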

@piskvorky
Owner

@yupbank adding support for these two features (readline() and gz for HDFS) would be much appreciated!

@menshikh-iv menshikh-iv added feature Issue described a new feature difficulty medium Medium issue: required good gensim understanding & python skills labels Oct 3, 2017
@bhanu-sharma

Is anyone working on this? If not, I can pick it up. The same issue persists with Word2Vec.load when loading a model, since both use the same method to load files.

@pgroth
Author

pgroth commented Oct 5, 2018

I didn’t get to it.

@bhanu-sharma

Okay, I'd like to pick it up then: supporting file objects as input alongside filenames.

@menshikh-iv
Contributor

@bhanu546 feel free to pick it up.

@halflings

With 3.6 introducing file-based training, this change would be really welcome.
An additional anecdote from my side: I can't use file-based training because I can't make a full local copy of some data located in a data center; I can only iterate over it (which means I still need to use the queue-based system).


9 participants