Update README.rst (#89)
* Update README.rst

Update installation, example and publications

* Update README.rst

* Update README.rst
aberanger authored Jun 7, 2024
1 parent 74039de commit 286ec76
Showing 1 changed file with 76 additions and 37 deletions.
113 changes: 76 additions & 37 deletions README.rst
@@ -27,7 +27,7 @@ Requirements
- As SINr relies on libraries implemented using C/C++, a modern C++
compiler is required.
- OpenMP (required for `Networkit <https://networkit.github.io>`__ and
-  compiling *SINr*\ ’s Cython
+  compiling *SINr*\ ’s Cython)
- Python 3.9
- Pip
- Cython
@@ -36,8 +36,7 @@ Requirements
Install
=======

-SINr can be installed through ``pip`` or from source using ``poetry``
-directives.
+SINr can be installed through ``pip``.

pip
---
@@ -47,17 +46,6 @@ pip
conda activate sinr # activate conda environment
pip install sinr
-from source
------------
-
-.. code:: bash
-   conda activate sinr # activate conda environment
-   git clone git@github.com:SINr-Embeddings/sinr.git
-   cd sinr
-   pip install poetry # poetry solves dependencies and installs SINr
-   poetry install # installs SINr based on the pyproject.toml file
Usage example
=============

@@ -68,30 +56,66 @@ Here is a minimum working example of *SINr*

.. code:: python
import urllib.request
import io
import gzip
import networkit as nk
import sinr.graph_embeddings as ge
import nltk # For textual resources
url = "https://snap.stanford.edu/data/wiki-Vote.txt.gz"
graph_file = "wikipedia-votes.txt"
# Read a graph from SNAP
sock = urllib.request.urlopen(url) # open URL
s = io.BytesIO(sock.read()) # read into BytesIO "file"
sock.close()
with gzip.open(s, "rt") as f_in:
    with open(graph_file, "wt") as f_out:
        f_out.writelines(f_in.readlines())
# Initialize a networkit.Graph object from SNAP graph
G = nk.readGraph(graph_file, nk.Format.SNAP)
# Build a SINr model and extract embeddings
model = ge.SINr.load_from_graph(G)
model.run(algo=nk.community.PLM(G))
embeddings = model.get_nr()
print(embeddings)
import sinr.text.preprocess as ppcs
from sinr.text.cooccurrence import Cooccurrence
from sinr.text.pmi import pmi_filter
import sinr.graph_embeddings as ge
import sinr.text.evaluate as ev
# Get a textual corpus
# For example, texts from the Project Gutenberg electronic text archive,
# hosted at http://www.gutenberg.org/
nltk.download('gutenberg')
gutenberg = nltk.corpus.gutenberg # a small selection of Project Gutenberg texts
with open("my_corpus.txt", "w") as file:
    file.write(gutenberg.raw())
# Preprocess corpus
vrt_maker = ppcs.VRTMaker(ppcs.Corpus(ppcs.Corpus.REGISTER_WEB,
                                      ppcs.Corpus.LANGUAGE_EN,
                                      "my_corpus.txt"),
                          ".", n_jobs=8)
vrt_maker.do_txt_to_vrt()
sentences = ppcs.extract_text("my_corpus.vrt", min_freq=20)
# Construct cooccurrence matrix
c = Cooccurrence()
c.fit(sentences, window=5)
c.matrix = pmi_filter(c.matrix)
c.save("my_cooc_matrix.pk")
# Train SINr model
model = ge.SINr.load_from_cooc_pkl("my_cooc_matrix.pk")
commu = model.detect_communities(gamma=10)
model.extract_embeddings(commu)
# Construct SINrVectors to manipulate the model
sinr_vec = ge.InterpretableWordsModelBuilder(model,
                                             'my_sinr_vectors',
                                             n_jobs=8,
                                             n_neighbors=25).build()
sinr_vec.save()
# Sparsify vectors for better interpretability and performance
sinr_vec.sparsify(100)
# Evaluate the model on the similarity task
print('\nResults of the similarity evaluation:')
print(ev.similarity_MEN_WS353_SCWS(sinr_vec))
# Explore word vectors and dimensions of the model
print("\nDimensions activated by the word 'apple':")
print(sinr_vec.get_obj_stereotypes('apple', topk_dim=5, topk_val=3))
print("\nWords similar to 'apple':")
print(sinr_vec.most_similar('apple'))
# Load an existing SINrVectors object
sinr_vec = ge.SINrVectors('my_sinr_vectors')
sinr_vec.load()
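
The co-occurrence and PMI filtering steps of the example can be pictured with a small, library-free sketch. It is not SINr's implementation: ``cooccurrence_counts`` and ``pmi_filter_pairs`` are hypothetical helpers written for illustration, the PMI normalisation (probabilities taken over the symmetric matrix) is one possible convention, and the threshold of 0 is an assumption.

```python
import math
from collections import Counter


def cooccurrence_counts(sentences, window=5):
    """Count unordered word pairs that occur within `window` tokens of each other."""
    pair_counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            # Look only to the right; sorting the pair makes the count symmetric.
            for other in tokens[i + 1 : i + 1 + window]:
                if other != word:
                    pair_counts[tuple(sorted((word, other)))] += 1
    return pair_counts


def pmi_filter_pairs(pair_counts, threshold=0.0):
    """Keep the pairs whose pointwise mutual information exceeds `threshold`."""
    # Each unordered pair fills two cells of the symmetric co-occurrence matrix.
    matrix_total = 2 * sum(pair_counts.values())
    word_totals = Counter()
    for (a, b), n in pair_counts.items():
        word_totals[a] += n
        word_totals[b] += n
    kept = {}
    for (a, b), n in pair_counts.items():
        # PMI(a, b) = log( p(a, b) / (p(a) * p(b)) )
        pmi = math.log(n * matrix_total / (word_totals[a] * word_totals[b]))
        if pmi > threshold:
            kept[(a, b)] = n
    return kept


# Toy corpus (hypothetical data): two tokenised sentences.
toy_sentences = [["the", "cat", "sat"], ["the", "dog", "sat"]]
counts = cooccurrence_counts(toy_sentences, window=2)
filtered = pmi_filter_pairs(counts)
print(filtered)
```

The pairs that survive the filter would become the weighted edges of the word co-occurrence graph on which communities are then detected.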
Documentation
=============
@@ -136,6 +160,21 @@ documentation <https://sinr-embeddings.github.io/sinr/_build/html/publications.h
Are Embedding Spaces Interpretable? Results of an Intrusion Detection
Evaluation on a Large French Corpus. LREC 2022, Jun 2022, Marseille,
France. `⟨hal-03770444⟩ <https://hal.science/hal-03770444>`__

**Sparsity of SINr embedding**

- Simon Guillot, Thibault Prouteau, Nicolas Dugué.
Sparser is better: one step closer to word embedding interpretability.
IWCS 2023, Nancy, France.
`⟨hal-04321407⟩ <https://hal.science/hal-04321407>`__

**Filtering dimensions of SINr embedding**

- Anna Béranger, Nicolas Dugué, Simon Guillot, Thibault Prouteau.
Filtering communities in word co-occurrence networks to foster the
emergence of meaning. Complex Networks 2023, Menton, France.
`⟨hal-04398742⟩ <https://hal.science/hal-04398742>`__



.. |languages| image:: https://img.shields.io/github/languages/count/SINr-Embeddings/sinr