Update README.rst (#89)

* Update README.rst

  Update installation, example and publications
* Update README.rst
* Update README.rst
SINr-Embeddings · Jun 7, 2024 · 286ec76
1 parent 74039de
Showing 1 changed file with 76 additions and 37 deletions.
diff --git a/README.rst b/README.rst
@@ -27,7 +27,7 @@ Requirements
 -  As SINr relies on libraries implemented using C/C++, a modern C++
    compiler is required.
 -  OpenMP (required for `Networkit <https://networkit.github.io>`__ and
-   compiling *SINr*\ ’s Cython
+   compiling *SINr*\ ’s Cython)
 -  Python 3.9
 -  Pip
 -  Cython
@@ -36,8 +36,7 @@ Requirements
 Install
 =======
 
-SINr can be installed through ``pip`` or from source using ``poetry``
-directives.
+SINr can be installed through ``pip``.
 
 pip
 ---
@@ -47,17 +46,6 @@ pip
    conda activate sinr # activate conda environment
    pip install sinr
 
-from source
------------
-
-.. code:: bash
-
-   conda activate sinr # activate conda environment
-   git clone git@github.com:SINr-Embeddings/sinr.git
-   cd sinr
-   pip install poetry # poetry solves dependencies and installs SINr
-   poetry install # installs SINr based on the pyproject.toml file
-
 Usage example
 =============
 
@@ -68,30 +56,66 @@ Here is a minimum working example of *SINr*
 
 .. code:: python
 
-       import urllib
-       import io
-       import gzip
-       import networkit as nk
-       import sinr.graph_embeddings as ge
-
+       import nltk # For textual resources
 
-       url = "https://snap.stanford.edu/data/wiki-Vote.txt.gz"
-       graph_file = "wikipedia-votes.txt"
-       # Read a graph from SNAP
-       sock = urllib.request.urlopen(url)  # open URL
-       s = io.BytesIO(sock.read())  # read into BytesIO "file"
-       sock.close()
-       with gzip.open(s, "rt") as f_in:
-           with open(graph_file, "wt") as f_out:
-               f_out.writelines(f_in.readlines())
-       # Initialize a networkit.Graph object from SNAP graph
-       G = nk.readGraph(graph_file, nk.Format.SNAP)
-
-       # Build a SINr model and extract embeddings
-       model = ge.SINr.load_from_graph(G)
-       model.run(algo=nk.community.PLM(G))
-       embeddings = model.get_nr()
-       print(embeddings)
+       import sinr.text.preprocess as ppcs
+       from sinr.text.cooccurrence import Cooccurrence
+       from sinr.text.pmi import pmi_filter
+       import sinr.graph_embeddings as ge
+       import sinr.text.evaluate as ev
+
+       # Get a textual corpus
+       # For example, texts from the Project Gutenberg electronic text archive,
+       # hosted at http://www.gutenberg.org/
+       nltk.download('gutenberg')
+       gutenberg = nltk.corpus.gutenberg # a small selection of Project Gutenberg texts bundled with NLTK
+       file = open("my_corpus.txt", "w")
+       file.write(gutenberg.raw())
+       file.close()
+
+       # Preprocess corpus
+       vrt_maker = ppcs.VRTMaker(ppcs.Corpus(ppcs.Corpus.REGISTER_WEB,
+                                             ppcs.Corpus.LANGUAGE_EN,
+                                             "my_corpus.txt"),
+                                             ".", n_jobs=8)
+       vrt_maker.do_txt_to_vrt()
+       sentences = ppcs.extract_text("my_corpus.vrt", min_freq=20)
+
+       # Construct cooccurrence matrix
+       c = Cooccurrence()
+       c.fit(sentences, window=5)
+       c.matrix = pmi_filter(c.matrix)
+       c.save("my_cooc_matrix.pk")
+
+       # Train SINr model
+       model = ge.SINr.load_from_cooc_pkl("my_cooc_matrix.pk")
+       commu = model.detect_communities(gamma=10)
+       model.extract_embeddings(commu)
+
+       # Construct SINrVectors to manipulate the model
+       sinr_vec = ge.InterpretableWordsModelBuilder(model,
+                                                    'my_sinr_vectors',
+                                                    n_jobs=8,
+                                                    n_neighbors=25).build()
+       sinr_vec.save()
+
+       # Sparsify vectors for better interpretability and performance
+       sinr_vec.sparsify(100)
+
+       # Evaluate the model with the similarity task
+       print('\nResults of the similarity evaluation:')
+       print(ev.similarity_MEN_WS353_SCWS(sinr_vec))
+
+       # Explore word vectors and dimensions of the model
+       print("\nDimensions activated by the word 'apple':")
+       print(sinr_vec.get_obj_stereotypes('apple', topk_dim=5, topk_val=3))
+
+       print("\nWords similar to 'apple':")
+       print(sinr_vec.most_similar('apple'))
+
+       # Load an existing SINrVectors object
+       sinr_vec = ge.SINrVectors('my_sinr_vectors')
+       sinr_vec.load()
 
 Documentation
 =============
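
Note on the new usage example: it fits a ``Cooccurrence`` matrix and then passes it through ``pmi_filter``, but the diff does not show what either step computes. The sketch below is one possible reading, written in plain Python rather than with SINr's own code: it counts word pairs inside a sliding window, then keeps only pairs whose pointwise mutual information is positive. The names ``cooccurrence_counts`` and ``pmi_filter_sketch`` are hypothetical, and the positive-PMI threshold is an assumption, not something this commit confirms.

```python
import math
from collections import Counter


def cooccurrence_counts(sentences, window=5):
    """Count co-occurrences of word pairs within a forward window.

    Each pair is stored in both orders, so the result behaves like a
    symmetric matrix. Hypothetical helper, not part of SINr.
    """
    counts = Counter()
    for sentence in sentences:
        for i, word in enumerate(sentence):
            for other in sentence[i + 1 : i + 1 + window]:
                if other != word:
                    counts[(word, other)] += 1
                    counts[(other, word)] += 1
    return counts


def pmi_filter_sketch(counts):
    """Keep only the pairs whose PMI is strictly positive.

    Assumption: PMI(a, b) = log(p(a, b) / (p(a) * p(b))), with all
    probabilities estimated from the counts themselves.
    """
    total = sum(counts.values())
    row_totals = Counter()
    for (a, _b), n in counts.items():
        row_totals[a] += n
    kept = {}
    for (a, b), n in counts.items():
        p_ab = n / total
        p_a = row_totals[a] / total
        p_b = row_totals[b] / total
        if math.log(p_ab / (p_a * p_b)) > 0:
            kept[(a, b)] = n
    return kept
```

Under these assumptions, a pair that co-occurs less often than its words' overall frequencies would predict is dropped, which removes weak edges from the graph before community detection.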
@@ -136,6 +160,21 @@ documentation <https://sinr-embeddings.github.io/sinr/_build/html/publications.h
    Are Embedding Spaces Interpretable? Results of an Intrusion Detection
    Evaluation on a Large French Corpus. LREC 2022, Jun 2022, Marseille,
    France. `⟨hal-03770444⟩ <https://hal.science/hal-03770444>`__
+
+**Sparsity of SINr embedding**
+
+-  Simon Guillot, Thibault Prouteau, Nicolas Dugué.
+   Sparser is better: one step closer to word embedding interpretability.
+   IWCS 2023, Nancy, France.
+   `⟨hal-04321407⟩ <https://hal.science/hal-04321407>`__
+
+**Filtering dimensions of SINr embedding**
+
+-  Anna Béranger, Nicolas Dugué, Simon Guillot, Thibault Prouteau.
+   Filtering communities in word co-occurrence networks to foster the
+   emergence of meaning. Complex Networks 2023, Menton, France.
+   `⟨hal-04398742⟩ <https://hal.science/hal-04398742>`__
+
 
 
 .. |languages| image:: https://img.shields.io/github/languages/count/SINr-Embeddings/sinr
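
Note on ``sinr_vec.sparsify(100)`` in the new example and on the "Sparser is better" reference added above: the diff does not say what the argument means. Assuming it is the number of components kept per vector, with all others set to zero, the operation could be sketched as follows. ``sparsify_sketch`` is a hypothetical stand-alone function on a plain list, not SINr's method.

```python
import heapq


def sparsify_sketch(vector, k):
    """Return a copy of `vector` with all but its k largest values zeroed.

    Assumption: "largest" means largest raw value; ties are resolved by
    heapq.nlargest in favor of the earlier index.
    """
    if k <= 0:
        return [0.0] * len(vector)
    # Indices of the k largest components.
    keep = set(heapq.nlargest(k, range(len(vector)), key=lambda i: vector[i]))
    return [value if i in keep else 0.0 for i, value in enumerate(vector)]
```

For example, ``sparsify_sketch([0.1, 0.7, 0.0, 0.4, 0.2], 2)`` keeps only the components at indices 1 and 3. Fewer active dimensions per word means fewer communities to inspect when interpreting a vector.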