* Update README.rst

  Update installation, example and publications

* Update README.rst
* Update README.rst
Showing 1 changed file with 76 additions and 37 deletions.
@@ -27,7 +27,7 @@ Requirements | |
- As SINr relies on libraries implemented using C/C++, a modern C++ | ||
compiler is required. | ||
- OpenMP (required for `Networkit <https://networkit.github.io>`__ and | ||
compiling *SINr*\ ’s Cython | ||
compiling *SINr*\ ’s Cython) | ||
- Python 3.9 | ||
- Pip | ||
- Cython | ||
|
@@ -36,8 +36,7 @@ Requirements | |
Install | ||
======= | ||
|
||
SINr can be installed through ``pip`` or from source using ``poetry`` | ||
directives. | ||
SINr can be installed through ``pip``. | ||
|
||
pip | ||
--- | ||
|
@@ -47,17 +46,6 @@ pip | |
conda activate sinr # activate conda environment | ||
pip install sinr | ||
from source | ||
----------- | ||
|
||
.. code:: bash | ||
conda activate sinr # activate conda environment | ||
git clone git@github.com:SINr-Embeddings/sinr.git | ||
cd sinr | ||
pip install poetry # poetry solves dependencies and installs SINr | ||
poetry install # installs SINr based on the pyproject.toml file | ||
Usage example | ||
============= | ||
|
||
|
@@ -68,30 +56,66 @@ Here is a minimum working example of *SINr* | |
|
||
.. code:: python | ||
import urllib.request | ||
import io | ||
import gzip | ||
import networkit as nk | ||
import sinr.graph_embeddings as ge | ||
import nltk # For textual resources | ||
url = "https://snap.stanford.edu/data/wiki-Vote.txt.gz" | ||
graph_file = "wikipedia-votes.txt" | ||
# Read a graph from SNAP | ||
sock = urllib.request.urlopen(url) # open URL | ||
s = io.BytesIO(sock.read()) # read into BytesIO "file" | ||
sock.close() | ||
with gzip.open(s, "rt") as f_in: | ||
with open(graph_file, "wt") as f_out: | ||
f_out.writelines(f_in.readlines()) | ||
# Initialize a networkit.Graph object from SNAP graph | ||
G = nk.readGraph(graph_file, nk.Format.SNAP) | ||
# Build a SINr model and extract embeddings | ||
model = ge.SINr.load_from_graph(G) | ||
model.run(algo=nk.community.PLM(G)) | ||
embeddings = model.get_nr() | ||
print(embeddings) | ||
import sinr.text.preprocess as ppcs | ||
from sinr.text.cooccurrence import Cooccurrence | ||
from sinr.text.pmi import pmi_filter | ||
import sinr.graph_embeddings as ge | ||
import sinr.text.evaluate as ev | ||
# Get a textual corpus | ||
# For example, texts from the Project Gutenberg electronic text archive, | ||
# hosted at http://www.gutenberg.org/ | ||
nltk.download('gutenberg') | ||
gutenberg = nltk.corpus.gutenberg # a small selection of texts from Project Gutenberg | ||
file = open("my_corpus.txt", "w") | ||
file.write(gutenberg.raw()) | ||
file.close() | ||
# Preprocess corpus | ||
vrt_maker = ppcs.VRTMaker(ppcs.Corpus(ppcs.Corpus.REGISTER_WEB, | ||
ppcs.Corpus.LANGUAGE_EN, | ||
"my_corpus.txt"), | ||
".", n_jobs=8) | ||
vrt_maker.do_txt_to_vrt() | ||
sentences = ppcs.extract_text("my_corpus.vrt", min_freq=20) | ||
# Construct cooccurrence matrix | ||
c = Cooccurrence() | ||
c.fit(sentences, window=5) | ||
c.matrix = pmi_filter(c.matrix) | ||
c.save("my_cooc_matrix.pk") | ||
# Train SINr model | ||
model = ge.SINr.load_from_cooc_pkl("my_cooc_matrix.pk") | ||
commu = model.detect_communities(gamma=10) | ||
model.extract_embeddings(commu) | ||
# Construct SINrVectors to manipulate the model | ||
sinr_vec = ge.InterpretableWordsModelBuilder(model, | ||
'my_sinr_vectors', | ||
n_jobs=8, | ||
n_neighbors=25).build() | ||
sinr_vec.save() | ||
# Sparsify vectors for better interpretability and performance | ||
sinr_vec.sparsify(100) | ||
# Evaluate the model with the similarity task | ||
print('\nResults of the similarity evaluation :') | ||
print(ev.similarity_MEN_WS353_SCWS(sinr_vec)) | ||
# Explore word vectors and dimensions of the model | ||
print("\nDimensions activated by the word 'apple' :") | ||
print(sinr_vec.get_obj_stereotypes('apple', topk_dim=5, topk_val=3)) | ||
print("\nWords similar to 'apple' :") | ||
print(sinr_vec.most_similar('apple')) | ||
# Load an existing SINrVectors object | ||
sinr_vec = ge.SINrVectors('my_sinr_vectors') | ||
sinr_vec.load() | ||
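The co-occurrence and PMI-filtering steps of the example above can be illustrated in plain Python. This is only a minimal sketch of the general technique, not *SINr*'s implementation: the function names ``cooccurrence_counts`` and ``positive_pmi_filter`` are hypothetical, and keeping only pairs with strictly positive PMI is an assumption about what a PMI filter typically does, not a description of ``pmi_filter``.

```python
import math
from collections import Counter


def cooccurrence_counts(sentences, window=5):
    """Count unordered word pairs that occur within `window` tokens of each other."""
    pair_counts = Counter()
    for tokens in sentences:
        for i, left in enumerate(tokens):
            # Look at most `window` tokens to the right of position i.
            for right in tokens[i + 1 : i + 1 + window]:
                if left != right:
                    pair_counts[tuple(sorted((left, right)))] += 1
    return pair_counts


def positive_pmi_filter(pair_counts):
    """Keep only the pairs whose pointwise mutual information is > 0.

    The counts are treated as a symmetric matrix, so the grand total is
    twice the number of unordered pair occurrences.
    """
    grand_total = 2 * sum(pair_counts.values())
    # Marginal count of each word (row sum of the symmetric matrix).
    word_totals = Counter()
    for (a, b), n in pair_counts.items():
        word_totals[a] += n
        word_totals[b] += n
    kept = {}
    for (a, b), n in pair_counts.items():
        pmi = math.log(n * grand_total / (word_totals[a] * word_totals[b]))
        if pmi > 0:
            kept[(a, b)] = n
    return kept


if __name__ == "__main__":
    # Toy corpus (hypothetical data, for illustration only).
    toy_sentences = [["a", "b"]] * 5 + [["a", "c"]] * 5 + [["b", "c"]]
    counts = cooccurrence_counts(toy_sentences, window=5)
    print(positive_pmi_filter(counts))
```

In this toy corpus the pair ``("b", "c")`` is observed less often than the marginal frequencies of ``b`` and ``c`` would predict, so its PMI is negative and the filter drops it; the surviving pairs are the edges one would keep in a word co-occurrence graph.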
Documentation | ||
============= | ||
|
@@ -136,6 +160,21 @@ documentation <https://sinr-embeddings.github.io/sinr/_build/html/publications.h | |
Are Embedding Spaces Interpretable? Results of an Intrusion Detection | ||
Evaluation on a Large French Corpus. LREC 2022, Jun 2022, Marseille, | ||
France. `⟨hal-03770444⟩ <https://hal.science/hal-03770444>`__ | ||
|
||
**Sparsity of SINr embedding** | ||
|
||
- Simon Guillot, Thibault Prouteau, Nicolas Dugué. | ||
Sparser is better: one step closer to word embedding interpretability. | ||
IWCS 2023, Nancy, France. | ||
`⟨hal-04321407⟩ <https://hal.science/hal-04321407>`__ | ||
|
||
**Filtering dimensions of SINr embedding** | ||
|
||
- Anna Béranger, Nicolas Dugué, Simon Guillot, Thibault Prouteau. | ||
Filtering communities in word co-occurrence networks to foster the | ||
emergence of meaning. Complex Networks 2023, Menton, France. | ||
`⟨hal-04398742⟩ <https://hal.science/hal-04398742>`__ | ||
|
||
|
||
|
||
.. |languages| image:: https://img.shields.io/github/languages/count/SINr-Embeddings/sinr | ||
|