Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
* Dev (#94)
* factory method to load embeddings at the w2v format (#53)
* Revert "factory method to load embeddings at the w2v format (#53)"
  This reverts commit 7da0e99.
* 52 loading vectors such as w2v or spine ones (#57)
* factory method to load embeddings at the w2v format
* Update graph_embeddings.py, small fix
* Adding distRatio (#59)
* moving dist ratio in sinr.text.evaluate, adding unit tests (#61)
* commenting cosine_dist, pick_intruder, dist_ratio, dist_ratio_dim, intra_sim, inter_sim methods
* moving distRatio from graph_embeddings to text/evaluate
* distRatio unit tests
* cleaning comments
* adding creation and deletion of w2v file for distRatio unit tests
* fix with_value() argument (#65)
  Co-authored-by: Simon Guillot <[email protected]>
* load_from_word2vec model's name bug fixed (#70)
* missing word list bug fixed (#68)
* 73 wrong community memberships update when filtering dimensions (#75)
* update of community_membership when filtering dimensions
* sinr filtered: removing dimensions and updating communities_sets
* fixed code to pass tests
* comments
* 76 preprocessing multiple documents (#77)
* preprocessing by documents
* Tests: preprocessing by sentences and by documents
* adding size indicator for spacy model and downloading spacy model in tests workflow
* downloading spacy
* 78 classification (#80)
* preprocess: minimal length of documents kept + tests
* vectorizer + test
* classification's methods + tests
* xgboost interpretable dimensions
* adding xgboost for test workflow
* classification, fit and score test modification
* get_dimension_stereotypes on removed community fixed (#82)
* Filtering words using a dictionary (#84)
* Exceptions list, path to save / load SINrVectors (#86)
* not removing words when in exceptions list
* add path to method save
* exceptions list to set + test exceptions list
* path parameter method load
* new exceptions list for similarity
* optional parameter path for load and save methods
* 90 notebooks (#91)
* fix save, load, dim_nnz_thresholds + add obj_nnz_count
* add notebook with gutenberg example
* bnc model for notebook
* notebook bnc
* notebook frwac
* remove nb evaluate
* add tqdm to sparsify method
---------
Co-authored-by: Thibault PROUTEAU <[email protected]>
Co-authored-by: Anna B <[email protected]>
Co-authored-by: Simon Guillot <[email protected]>
Co-authored-by: Simon Guillot <[email protected]>
* Update deploy.yml
* [AUTO-COMMIT] Update release version to v1.3.1. (#96)
  Files changed: M pyproject.toml M sinr/__init__.py
  Co-authored-by: nicolasdugue <[email protected]>
* doc update (#97)
* doc update
* links update
* remove doc, quality, build
* fix LICENSE link
* Update conf.py
---------
Co-authored-by: Thibault PROUTEAU <[email protected]>
* Diachronic features
* Update publications.rst
---------
Co-authored-by: Thibault PROUTEAU <[email protected]>
Co-authored-by: Anna B <[email protected]>
Co-authored-by: Simon Guillot <[email protected]>
Co-authored-by: Simon Guillot <[email protected]>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: nicolasdugue <[email protected]>
Co-authored-by: Thibault PROUTEAU <[email protected]>
1 parent 4804015, commit 8611e19
Showing 12 changed files with 1,027 additions and 62 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -4,7 +4,7 @@ | |
Overview | ||
============ | ||
|
||
|languages| |downloads| |license| |version| |cpython| |wheel| |python| |docs| |activity| |contributors| |quality| |build| | ||
|languages| |downloads| |license| |version| |cpython| |wheel| |python| |activity| |contributors| | ||
|
||
*SINr* is an open-source tool to efficiently compute graph and word | ||
embeddings. Its aim is to provide sparse interpretable vectors from a | ||
|
@@ -38,24 +38,12 @@ Requirements | |
Install | ||
------- | ||
|
||
**SINr** can be installed through ``pip`` or from source using ``poetry`` directives. | ||
**SINr** can be installed through ``pip``. | ||
|
||
.. tabs:: | ||
.. code:: bash | ||
.. code-tab:: zsh pip | ||
|
||
#Activate conda environment | ||
conda activate sinr | ||
pip install sinr | ||
|
||
.. code-tab:: zsh from source | ||
|
||
#Activate conda environment | ||
conda activate sinr | ||
git clone [email protected]:SINr-Embeddings/sinr.git | ||
cd sinr | ||
pip install poetry #poetry solves dependencies and installs SINr | ||
poetry install #Installs SINr based on the pyproject.toml file | ||
conda activate sinr # activate conda environment | ||
pip install sinr | ||
Usage example | ||
|
@@ -69,30 +57,66 @@ Here is a minimum working example of SINr : | |
|
||
.. code:: python | ||
import urllib | ||
import io | ||
import gzip | ||
import networkit as nk | ||
import sinr.graph_embeddings as ge | ||
url = "https://snap.stanford.edu/data/wiki-Vote.txt.gz" | ||
graph_file = "wikipedia-votes.txt" | ||
# Read a graph from SNAP | ||
sock = urllib.request.urlopen(url) # open URL | ||
s = io.BytesIO(sock.read()) # read into BytesIO "file" | ||
sock.close() | ||
with gzip.open(s, "rt") as f_in: | ||
with open(graph_file, "wt") as f_out: | ||
f_out.writelines(f_in.readlines()) | ||
# Initialize a networkit.Graph object from SNAP graph | ||
G = nk.readGraph(graph_file, nk.Format.SNAP) | ||
# Build a SINr model and extract embeddings | ||
model = ge.SINr.load_from_graph(G) | ||
model.run(algo=nk.community.PLM(G)) | ||
embeddings = model.get_nr() | ||
print(embeddings) | ||
import nltk # For textual resources | ||
import sinr.text.preprocess as ppcs | ||
from sinr.text.cooccurrence import Cooccurrence | ||
from sinr.text.pmi import pmi_filter | ||
import sinr.graph_embeddings as ge | ||
import sinr.text.evaluate as ev | ||
# Get a textual corpus | ||
# For example, texts from the Project Gutenberg electronic text archive, | ||
# hosted at http://www.gutenberg.org/ | ||
nltk.download('gutenberg') | ||
gutenberg = nltk.corpus.gutenberg # a small selection of Project Gutenberg texts | |
file = open("my_corpus.txt", "w") | ||
file.write(gutenberg.raw()) | ||
file.close() | ||
# Preprocess corpus | ||
vrt_maker = ppcs.VRTMaker(ppcs.Corpus(ppcs.Corpus.REGISTER_WEB, | ||
ppcs.Corpus.LANGUAGE_EN, | ||
"my_corpus.txt"), | ||
".", n_jobs=8) | ||
vrt_maker.do_txt_to_vrt() | ||
sentences = ppcs.extract_text("my_corpus.vrt", min_freq=20) | ||
# Construct cooccurrence matrix | ||
c = Cooccurrence() | ||
c.fit(sentences, window=5) | ||
c.matrix = pmi_filter(c.matrix) | ||
c.save("my_cooc_matrix.pk") | ||
# Train SINr model | ||
model = ge.SINr.load_from_cooc_pkl("my_cooc_matrix.pk") | ||
commu = model.detect_communities(gamma=10) | ||
model.extract_embeddings(commu) | ||
# Construct SINrVectors to manipulate the model | ||
sinr_vec = ge.InterpretableWordsModelBuilder(model, | ||
'my_sinr_vectors', | ||
n_jobs=8, | ||
n_neighbors=25).build() | ||
sinr_vec.save() | ||
# Sparsify vectors for better interpretability and performances | ||
sinr_vec.sparsify(100) | ||
# Evaluate the model with the similarity task | ||
print('\nResults of the similarity evaluation :') | ||
print(ev.similarity_MEN_WS353_SCWS(sinr_vec)) | ||
# Explore word vectors and dimensions of the model | ||
print("\nDimensions activated by the word 'apple' :") | ||
print(sinr_vec.get_obj_stereotypes('apple', topk_dim=5, topk_val=3)) | ||
print("\nWords similar to 'apple' :") | ||
print(sinr_vec.most_similar('apple')) | ||
# Load an existing SinrVectors object | ||
sinr_vec = ge.SINrVectors('my_sinr_vectors') | ||
sinr_vec.load() | ||
Contributing | ||
|
@@ -115,13 +139,35 @@ Publications can also be found on :ref:`Publications`. | |
|
||
**Initial SINr paper, 2021** | ||
|
||
- Thibault Prouteau, Victor Connes, Nicolas Dugué, Anthony Perez, | ||
Jean-Charles Lamirel, et al.. SINr: Fast Computing of Sparse | ||
Interpretable Node Representations is not a Sin!. Advances in | ||
Intelligent Data Analysis XIX, 19th International Symposium on | ||
Intelligent Data Analysis, IDA 2021, Apr 2021, Porto, Portugal. | ||
pp.325-337, | ||
⟨\ `10.1007/978-3-030-74251-5_26 <https://dx.doi.org/10.1007/978-3-030-74251-5_26>`__\ ⟩. | ||
`⟨hal-03197434⟩ <https://hal.science/hal-03197434>`__ | ||
|
||
**Interpretability of SINr embedding** | ||
|
||
- Thibault Prouteau, Nicolas Dugué, Nathalie Camelin, Sylvain Meignier. | ||
Are Embedding Spaces Interpretable? Results of an Intrusion Detection | ||
Evaluation on a Large French Corpus. LREC 2022, Jun 2022, Marseille, | ||
France. `⟨hal-03770444⟩ <https://hal.science/hal-03770444>`__ | ||
|
||
**Sparsity of SINr embedding** | ||
|
||
- Thibault Prouteau, Victor Connes, Nicolas Dugué, Anthony Perez, Jean-Charles Lamirel, et al.. SINr: Fast Computing of Sparse Interpretable Node Representations is not a Sin!. Advances in Intelligent Data Analysis XIX, 19th International Symposium on Intelligent Data Analysis, IDA 2021, Apr 2021, Porto, Portugal. pp.325-337, ⟨`10.1007/978-3-030-74251-5_26 <https://dx.doi.org/10.1007/978-3-030-74251-5_26>`_⟩. `⟨hal-03197434⟩ <https://hal.science/hal-03197434>`_ | ||
- Simon Guillot, Thibault Prouteau, Nicolas Dugué. | ||
Sparser is better: one step closer to word embedding interpretability. | ||
IWCS 2023, Nancy, France. | ||
`⟨hal-04321407⟩ <https://hal.science/hal-04321407>`__ | ||
|
||
**Interpretability of SINr embeddings, 2022** | ||
**Filtering dimensions of SINr embedding** | ||
|
||
- Thibault Prouteau, Nicolas Dugué, Nathalie Camelin, Sylvain Meignier. Are Embedding Spaces Interpretable? Results of an Intrusion Detection Evaluation on a Large French Corpus. LREC 2022, Jun 2022, Marseille, France. `⟨hal-03770444⟩ <https://hal.science/hal-03770444>`_ | ||
- Anna Béranger, Nicolas Dugué, Simon Guillot, Thibault Prouteau. | ||
Filtering communities in word co-occurrence networks to foster the | ||
emergence of meaning. Complex Networks 2023, Menton, France. | ||
`⟨hal-04398742⟩ <https://hal.science/hal-04398742>`__ | ||
|
||
.. |languages| image:: https://img.shields.io/github/languages/count/SINr-Embeddings/sinr | ||
.. |downloads| image:: https://img.shields.io/pypi/dm/sinr | ||
|
@@ -130,8 +176,5 @@ Publications can also be found on :ref:`Publications`. | |
.. |cpython| image:: https://img.shields.io/pypi/implementation/sinr | ||
.. |wheel| image:: https://img.shields.io/pypi/wheel/sinr | ||
.. |python| image:: https://img.shields.io/pypi/pyversions/sinr | ||
.. |docs| image:: https://img.shields.io/website?url=https%3A%2F%2Fsinr-embeddings.github.io%2Fsinr%2F_build%2Fhtml%2Findex.html | ||
.. |activity| image:: https://img.shields.io/github/commit-activity/y/SINr-Embeddings/sinr | ||
.. |contributors| image:: https://img.shields.io/github/contributors/SINr-Embeddings/sinr | ||
.. |quality| image:: https://scrutinizer-ci.com/g/SINr-Embeddings/sinr/badges/quality-score.png?b=main | ||
.. |build| image:: https://scrutinizer-ci.com/g/SINr-Embeddings/sinr/badges/build.png?b=main |
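The co-occurrence and PMI-filtering step of the new usage example (``Cooccurrence().fit(sentences, window=5)`` followed by ``pmi_filter``) can be illustrated with a small standard-library sketch. This is not SINr's implementation: the function names ``cooccurrence_counts`` and ``pmi_filter_pairs``, the symmetric-window counting, and the rule of keeping only pairs whose PMI exceeds a threshold of 0 are all assumptions made for illustration.

```python
import math
from collections import Counter


def cooccurrence_counts(sentences, window=5):
    """Count unordered word pairs that occur within `window` tokens of each other.

    Hypothetical helper for illustration; not part of the sinr package.
    """
    pair_counts = Counter()
    for sentence in sentences:
        for i, word in enumerate(sentence):
            # Look only to the right and store pairs sorted,
            # so that (a, b) and (b, a) share one key.
            for other in sentence[i + 1 : i + 1 + window]:
                if other != word:
                    pair_counts[tuple(sorted((word, other)))] += 1
    return pair_counts


def pmi_filter_pairs(pair_counts, threshold=0.0):
    """Keep pairs whose pointwise mutual information is above `threshold`.

    Probabilities are estimated from the symmetric co-occurrence matrix:
    each unordered pair contributes to both (a, b) and (b, a).
    Hypothetical helper for illustration; not part of the sinr package.
    """
    matrix_total = 2 * sum(pair_counts.values())
    row_sums = Counter()
    for (a, b), n in pair_counts.items():
        row_sums[a] += n
        row_sums[b] += n
    kept = {}
    for (a, b), n in pair_counts.items():
        # PMI = log( p(a, b) / (p(a) * p(b)) )
        pmi = math.log(n * matrix_total / (row_sums[a] * row_sums[b]))
        if pmi > threshold:
            kept[(a, b)] = n
    return kept


if __name__ == "__main__":
    toy = [["a", "x"]] * 5 + [["b", "y"]] * 5 + [["a", "b"]]
    counts = cooccurrence_counts(toy, window=5)
    print(pmi_filter_pairs(counts))
```

In the toy corpus, ``a`` and ``b`` each co-occur mostly with another word, so the single ``("a", "b")`` pair has negative PMI and is dropped, while the frequent pairs are kept. The surviving weighted pairs play the role of graph edges on which community detection would then run.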