Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use in-memory graph struct for storing and quering data #7

Closed
hailiang-wang opened this issue Oct 31, 2017 · 7 comments
Closed

Use in-memory graph struct for storing and quering data #7

hailiang-wang opened this issue Oct 31, 2017 · 7 comments

Comments

@hailiang-wang
Copy link
Member

hailiang-wang commented Oct 31, 2017

description

目前使用python dict, list 来存储,在建立多个词之间关系的时候,效率低。

solution

将word's vector, adjacent words and their distance, 存储在Graph中。

Possible solution:
方案1: graphlite
https://pypi.python.org/pypi/graphlite

方案2: projx
http://davebshow.github.io/projx/getting-started/

方案3:networkx
https://pypi.python.org/pypi/networkx/

image

支持 load 和 dump 文件 pickle 文件。
采用哪种格式? https://networkx.github.io/documentation/networkx-1.9.1/reference/readwrite.html

安装是否方便? 保持从 pip 安装,依赖少,跨平台

是否支持高级查询?比如 cypher

性能怎么样?

@hailiang-wang
Copy link
Member Author

question with networkx for C codes

networkx/networkx#2747

@huyingxi
Copy link
Collaborator

huyingxi commented Nov 11, 2017

networkx graph performance:
(testing on 10W vocabulary size, data structure: Graph vs. Dictionary)

  1. . Directed graph vs. Dictionary :
    ('current vocab size : ', 125792)
    ('graph time * 100000 : ', 0.2830028533935547)
    ('graph neighbors number : ', 10)
    ('dict time * 100000 : ', 0.013113021850585938)
    ('dict neighbors number : ', 10)

  2. .Undirected graph vs. Dictionary :
    ('current vocab size : ', 125792)
    ('graph time * 100000 : ', 0.2830028533935547)
    ('graph neighbors number : ', 32)
    ('dict time * 100000 : ', 0.012874603271484375)
    ('dict neighbors number : ', 10)

So, we can get two conclusions:
(1) Graph is slower than dictionary, and the time consumption of the directed graph and the undirected graph is similar.
(2) Undirected graph may get an indefinite number of synonymous.

@hailiang-wang
Copy link
Member Author

hailiang-wang commented Nov 11, 2017

So, maybe for the low level APIs in Synonyms, we just use dict for quick access. But networkx provides many fancy APIs for upper level tasks like textsum and NER, such as pagerank and textrank. https://networkx.github.io/documentation/latest/tutorial.html

Unless networkx can achieve better performance, we just rely on dict.

Another option is using levelDB. It may bring more inconvenience with installation, but it may help with both performance and abilities.

@huyingxi
Copy link
Collaborator

I read the source code of NetworkX where is about finding neighbors, as following:


    def successors_iter(self,n):
        """Return an iterator over successor nodes of n.

        neighbors_iter() and successors_iter() are the same.
        """
        try:
            return iter(self.succ[n])
        except KeyError:
            raise NetworkXError("The node %s is not in the digraph."%(n,))

    def predecessors_iter(self,n):
        """Return an iterator over predecessor nodes of n."""
        try:
            return iter(self.pred[n])
        except KeyError:
            raise NetworkXError("The node %s is not in the digraph."%(n,))

    def successors(self, n):
        """Return a list of successor nodes of n.

        neighbors() and successors() are the same function.
        """
        return list(self.successors_iter(n))

    def predecessors(self, n):
        """Return a list of predecessor nodes of n."""
        return list(self.predecessors_iter(n))

    # digraph definitions
    neighbors = successors
    neighbors_iter = successors_iter

So I guess there are two reasons for the slow query : (1) twice type conversion operation , (2)Multiple function calls

And then I just modified the code like this :

    def neighbors(self, n):
        try:
            return self.succ[n]
        except KeyError:
            raise NetworkXError("The node %s is not in the digraph."%(n,))

It runs faster, only two times slower than Dict.
However, I think they both run fast enough.

@hailiang-wang
Copy link
Member Author

Great job, does the modifications have side impacts for other functions in networks?

  • run test cases
  • create pull request to community

We may borrow the codes of networkx and make the changes right now and contribute back afterwards.

@tyrinwu
Copy link

tyrinwu commented Nov 17, 2017

https://github.com/snap-stanford/snap

FYI, this library has a C++ interface.

@hailiang-wang
Copy link
Member Author

@tyrinwu thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants