99 diachronic features (#100)
* Dev (#94)

* factory method to load embeddings at the w2v format (#53)

* Revert "factory method to load embeddings at the w2v format (#53)"

This reverts commit 7da0e99.

* 52 loading vectors such as w2v or spine ones (#57)

* factory method to load embeddings at the w2v format

* Update graph_embeddings.py, small fix

* Adding distRatio (#59)

* moving dist ratio in sinr.text.evaluate, adding unit tests (#61)

* commenting cosine_dist, pick_intruder, dist_ratio, dist_ratio_dim, intra_sim, inter_sim methods

* moving distRatio from graph_embeddings to text/evaluate

* distRatio unit tests

* cleaning comments

* adding creation and deletion of w2v file for distRatio unit tests

* fix with_value() argument (#65)

Co-authored-by: Simon Guillot <[email protected]>

* load_from_word2vec model's name bug fixed (#70)

* missing word list bug fixed (#68)

* 73 wrong community memberships update when filtering dimensions (#75)

* update of community_membership when filtering dimensions

* sinr filtered: removing dimensions and updating communities_sets

* fixed code to pass tests

* comments

* 76 preprocessing multiple documents (#77)

* preprocessing by documents

* Tests : preprocessing by sentences and by documents

* adding size indicator for spacy model and downloading spacy model in tests workflow

* downloading spacy

* 78 classification (#80)

* preprocess : minimal length of documents kept + tests

* vectorizer + test

* classification's methods + tests

* xgboost interpretable dimensions

* adding xgboost for test workflow

* classification, fit and score test modification

* get_dimension_stereotypes on removed community fixed (#82)

* Filtering words using a dictionary (#84)

* Exceptions list, path to save / load SINrVectors (#86)

* not removing words when in exceptions list

* add path to method save

* exceptions list to set + test exceptions list

* path parameter method load

* new exceptions list for similarity

* optional parameter path for load and save methods

* 90 notebooks (#91)

* fix save, load, dim_nnz_thresholds + add obj_nnz_count

* add notebook with gutenberg example

* bnc model for notebook

* notebook bnc

* notebook frwac

* remove nb evaluate

* add tqdm to sparsify method

---------

Co-authored-by: Thibault PROUTEAU <[email protected]>
Co-authored-by: Anna B <[email protected]>
Co-authored-by: Simon Guillot <[email protected]>
Co-authored-by: Simon Guillot <[email protected]>

* Update deploy.yml

* [AUTO-COMMIT] Update release version to v1.3.1. (#96)

Files changed:
M	pyproject.toml
M	sinr/__init__.py

Co-authored-by: nicolasdugue <[email protected]>

* doc update (#97)

* doc update

* links update

* remove doc, quality, build

* fix LICENSE link

* Update conf.py

---------

Co-authored-by: Thibault PROUTEAU <[email protected]>

* Diachronic features

* Update publications.rst

---------

Co-authored-by: Thibault PROUTEAU <[email protected]>
Co-authored-by: Anna B <[email protected]>
Co-authored-by: Simon Guillot <[email protected]>
Co-authored-by: Simon Guillot <[email protected]>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: nicolasdugue <[email protected]>
Co-authored-by: Thibault PROUTEAU <[email protected]>
8 people authored Jul 22, 2024
1 parent 4804015 commit 8611e19
Showing 12 changed files with 1,027 additions and 62 deletions.
3 changes: 2 additions & 1 deletion .github/workflows/deploy.yml
@@ -25,7 +25,8 @@ jobs:
       - name: Publish
         run: |
-          poetry publish -u ${{ secrets.PYPI_UNAME }} -p ${{ secrets.PYPI_PWD }}
+          poetry config pypi-token.pypi ${{ secrets.PYPI_TOKEN }}
+          poetry publish
       - name: Upload binaries to release
         uses: softprops/action-gh-release@v1
         if: ${{startsWith(github.ref, 'refs/tags/') }}
13 changes: 6 additions & 7 deletions README.rst
@@ -1,7 +1,7 @@
 =====
 SINr
 =====
-|languages| |downloads| |license| |version| |cpython| |wheel| |python| |docs| |activity| |contributors| |quality| |build|
+|languages| |downloads| |license| |version| |cpython| |wheel| |python| |activity| |contributors|

 *SINr* is an open-source tool to efficiently compute graph and word
 embeddings. Its aim is to provide sparse interpretable vectors from a
@@ -50,7 +50,8 @@ Usage example
 =============

 To get started using *SINr* to build graph and word embeddings, have a
-look at the `notebook <./notebooks>`__ directory.
+look at the `notebook <https://github.com/SINr-Embeddings/sinr/tree/main/notebooks>`_
+directory.

 Here is a minimum working example of *SINr*

@@ -132,7 +133,7 @@ to disccus the changes to be made.
 License
 =======

-Released under `CeCILL 2.1 <https://cecill.info/>`__, see `LICENSE <./LICENSE>`__ for more details.
+Released under `CeCILL 2.1 <https://cecill.info/>`__, see `LICENSE <https://github.com/SINr-Embeddings/sinr/blob/main/LICENSE>`__ for more details.

 Publications
 ============
@@ -141,7 +142,7 @@ Publications
 find *SINr* useful for your own research, please cite the appropriate
 papers from the list below. Publications can also be found on
 `publications page in the
-documentation <https://sinr-embeddings.github.io/sinr/_build/html/publications.html>`__.
+documentation <https://sinr-embeddings.github.io/sinr/publications.html>`__.

 **Initial SINr paper, 2021**

@@ -184,8 +185,6 @@ documentation <https://sinr-embeddings.github.io/sinr/_build/html/publications.h
 .. |cpython| image:: https://img.shields.io/pypi/implementation/sinr
 .. |wheel| image:: https://img.shields.io/pypi/wheel/sinr
 .. |python| image:: https://img.shields.io/pypi/pyversions/sinr
-.. |docs| image:: https://img.shields.io/website?url=https%3A%2F%2Fsinr-embeddings.github.io%2Fsinr%2F_build%2Fhtml%2Findex.html
 .. |activity| image:: https://img.shields.io/github/commit-activity/y/SINr-Embeddings/sinr
 .. |contributors| image:: https://img.shields.io/github/contributors/SINr-Embeddings/sinr
-.. |quality| image:: https://scrutinizer-ci.com/g/SINr-Embeddings/sinr/badges/quality-score.png?b=main
-.. |build| image:: https://scrutinizer-ci.com/g/SINr-Embeddings/sinr/badges/build.png?b=main

4 changes: 2 additions & 2 deletions docs/source/conf.py
@@ -7,8 +7,8 @@
 # https://www.sphinx-doc.org/en/master/usage/configuration.html#project-information

 project = 'SINr'
-copyright = '2023, Thibault Prouteau, Nicolas Dugué, Simon Guillot'
-author = 'Thibault Prouteau, Nicolas Dugué, Simon Guillot'
+copyright = '2024, Thibault Prouteau, Nicolas Dugué, Simon Guillot, Anna Béranger'
+author = 'Thibault Prouteau, Nicolas Dugué, Simon Guillot, Anna Béranger'

 # -- General configuration ---------------------------------------------------
 # https://www.sphinx-doc.org/en/master/usage/configuration.html#general-configuration
137 changes: 90 additions & 47 deletions docs/source/presentation.rst
@@ -4,7 +4,7 @@
 Overview
 ============

-|languages| |downloads| |license| |version| |cpython| |wheel| |python| |docs| |activity| |contributors| |quality| |build|
+|languages| |downloads| |license| |version| |cpython| |wheel| |python| |activity| |contributors|

 *SINr* is an open-source tool to efficiently compute graph and word
 embeddings. Its aim is to provide sparse interpretable vectors from a
@@ -38,24 +38,12 @@ Requirements
 Install
 -------

-**SINr** can be installed through ``pip`` or from source using ``poetry`` directives.
+**SINr** can be installed through ``pip``.

-.. tabs::
+.. code:: bash
-    .. code-tab:: zsh pip

-        #Activate conda environment
-        conda activate sinr
-        pip install sinr
-
-    .. code-tab:: zsh from source
-
-        #Activate conda environment
-        conda activate sinr
-        git clone [email protected]:SINr-Embeddings/sinr.git
-        cd sinr
-        pip install poetry #poetry solves dependencies and installs SINr
-        poetry install #Installs SINr based on the pyproject.toml file
+    conda activate sinr # activate conda environment
+    pip install sinr
 Usage example
@@ -69,30 +57,66 @@ Here is a minimum working example of SINr :

 .. code:: python
-    import urllib
-    import io
-    import gzip
-    import networkit as nk
-    import sinr.graph_embeddings as ge
-    url = "https://snap.stanford.edu/data/wiki-Vote.txt.gz"
-    graph_file = "wikipedia-votes.txt"
-    # Read a graph from SNAP
-    sock = urllib.request.urlopen(url) # open URL
-    s = io.BytesIO(sock.read()) # read into BytesIO "file"
-    sock.close()
-    with gzip.open(s, "rt") as f_in:
-        with open(graph_file, "wt") as f_out:
-            f_out.writelines(f_in.readlines())
-    # Initialize a networkit.Graph object from SNAP graph
-    G = nk.readGraph(graph_file, nk.Format.SNAP)
-    # Build a SINr model and extract embeddings
-    model = ge.SINr.load_from_graph(G)
-    model.run(algo=nk.community.PLM(G))
-    embeddings = model.get_nr()
-    print(embeddings)
+    import nltk # For textual resources
+    import sinr.text.preprocess as ppcs
+    from sinr.text.cooccurrence import Cooccurrence
+    from sinr.text.pmi import pmi_filter
+    import sinr.graph_embeddings as ge
+    import sinr.text.evaluate as ev
+    # Get a textual corpus
+    # For example, texts from the Project Gutenberg electronic text archive,
+    # hosted at http://www.gutenberg.org/
+    nltk.download('gutenberg')
+    gutenberg = nltk.corpus.gutenberg # contains 25,000 free electronic books
+    file = open("my_corpus.txt", "w")
+    file.write(gutenberg.raw())
+    file.close()
+    # Preprocess corpus
+    vrt_maker = ppcs.VRTMaker(ppcs.Corpus(ppcs.Corpus.REGISTER_WEB,
+                                          ppcs.Corpus.LANGUAGE_EN,
+                                          "my_corpus.txt"),
+                              ".", n_jobs=8)
+    vrt_maker.do_txt_to_vrt()
+    sentences = ppcs.extract_text("my_corpus.vrt", min_freq=20)
+    # Construct cooccurrence matrix
+    c = Cooccurrence()
+    c.fit(sentences, window=5)
+    c.matrix = pmi_filter(c.matrix)
+    c.save("my_cooc_matrix.pk")
+    # Train SINr model
+    model = ge.SINr.load_from_cooc_pkl("my_cooc_matrix.pk")
+    commu = model.detect_communities(gamma=10)
+    model.extract_embeddings(commu)
+    # Construct SINrVectors to manipulate the model
+    sinr_vec = ge.InterpretableWordsModelBuilder(model,
+                                                 'my_sinr_vectors',
+                                                 n_jobs=8,
+                                                 n_neighbors=25).build()
+    sinr_vec.save()
+    # Sparsify vectors for better interpretability and performances
+    sinr_vec.sparsify(100)
+    # Evaluate the model with the similarity task
+    print('\nResults of the similarity evaluation :')
+    print(ev.similarity_MEN_WS353_SCWS(sinr_vec))
+    # Explore word vectors and dimensions of the model
+    print("\nDimensions activated by the word 'apple' :")
+    print(sinr_vec.get_obj_stereotypes('apple', topk_dim=5, topk_val=3))
+    print("\nWords similar to 'apple' :")
+    print(sinr_vec.most_similar('apple'))
+    # Load an existing SinrVectors object
+    sinr_vec = ge.SINrVectors('my_sinr_vectors')
+    sinr_vec.load()
 Contributing
@@ -115,13 +139,35 @@ Publications can also be found on :ref:`Publications`.

 **Initial SINr paper, 2021**

+- Thibault Prouteau, Victor Connes, Nicolas Dugué, Anthony Perez,
+  Jean-Charles Lamirel, et al.. SINr: Fast Computing of Sparse
+  Interpretable Node Representations is not a Sin!. Advances in
+  Intelligent Data Analysis XIX, 19th International Symposium on
+  Intelligent Data Analysis, IDA 2021, Apr 2021, Porto, Portugal.
+  pp.325-337,
+  \ `10.1007/978-3-030-74251-5_26 <https://dx.doi.org/10.1007/978-3-030-74251-5_26>`__\ ⟩.
+  `⟨hal-03197434⟩ <https://hal.science/hal-03197434>`__
+
+**Interpretability of SINr embedding**
+
+- Thibault Prouteau, Nicolas Dugué, Nathalie Camelin, Sylvain Meignier.
+  Are Embedding Spaces Interpretable? Results of an Intrusion Detection
+  Evaluation on a Large French Corpus. LREC 2022, Jun 2022, Marseille,
+  France. `⟨hal-03770444⟩ <https://hal.science/hal-03770444>`__
+
+**Sparsity of SINr embedding**
+
-- Thibault Prouteau, Victor Connes, Nicolas Dugué, Anthony Perez, Jean-Charles Lamirel, et al.. SINr: Fast Computing of Sparse Interpretable Node Representations is not a Sin!. Advances in Intelligent Data Analysis XIX, 19th International Symposium on Intelligent Data Analysis, IDA 2021, Apr 2021, Porto, Portugal. pp.325-337, ⟨`10.1007/978-3-030-74251-5_26 <https://dx.doi.org/10.1007/978-3-030-74251-5_26>`_⟩. `⟨hal-03197434⟩ <https://hal.science/hal-03197434>`_
+- Simon Guillot, Thibault Prouteau, Nicolas Dugué.
+  Sparser is better: one step closer to word embedding interpretability.
+  IWCS 2023, Nancy, France.
+  `⟨hal-04321407⟩ <https://hal.science/hal-04321407>`__

-**Interpretability of SINr embeddings, 2022**
+**Filtering dimensions of SINr embedding**

-- Thibault Prouteau, Nicolas Dugué, Nathalie Camelin, Sylvain Meignier. Are Embedding Spaces Interpretable? Results of an Intrusion Detection Evaluation on a Large French Corpus. LREC 2022, Jun 2022, Marseille, France. `⟨hal-03770444⟩ <https://hal.science/hal-03770444>`_
+- Anna Béranger, Nicolas Dugué, Simon Guillot, Thibault Prouteau.
+  Filtering communities in word co-occurrence networks to foster the
+  emergence of meaning. Complex Networks 2023, Menton, France.
+  `⟨hal-04398742⟩ <https://hal.science/hal-04398742>`__

 .. |languages| image:: https://img.shields.io/github/languages/count/SINr-Embeddings/sinr
 .. |downloads| image:: https://img.shields.io/pypi/dm/sinr
@@ -130,8 +176,5 @@ Publications can also be found on :ref:`Publications`.
 .. |cpython| image:: https://img.shields.io/pypi/implementation/sinr
 .. |wheel| image:: https://img.shields.io/pypi/wheel/sinr
 .. |python| image:: https://img.shields.io/pypi/pyversions/sinr
-.. |docs| image:: https://img.shields.io/website?url=https%3A%2F%2Fsinr-embeddings.github.io%2Fsinr%2F_build%2Fhtml%2Findex.html
 .. |activity| image:: https://img.shields.io/github/commit-activity/y/SINr-Embeddings/sinr
 .. |contributors| image:: https://img.shields.io/github/contributors/SINr-Embeddings/sinr
-.. |quality| image:: https://scrutinizer-ci.com/g/SINr-Embeddings/sinr/badges/quality-score.png?b=main
-.. |build| image:: https://scrutinizer-ci.com/g/SINr-Embeddings/sinr/badges/build.png?b=main
16 changes: 13 additions & 3 deletions docs/source/publications.rst
@@ -6,10 +6,20 @@ Publications
 **Initial SINr paper, 2021**


-- Thibault Prouteau, Victor Connes, Nicolas Dugué, Anthony Perez, Jean-Charles Lamirel, et al.. SINr: Fast Computing of Sparse Interpretable Node Representations is not a Sin!. Advances in Intelligent Data Analysis XIX, 19th International Symposium on Intelligent Data Analysis, IDA 2021, Apr 2021, Porto, Portugal. pp.325-337, ⟨`10.1007/978-3-030-74251-5_26 <https://dx.doi.org/10.1007/978-3-030-74251-5_26>`_⟩. `⟨hal-03197434⟩ <https://hal.science/hal-03197434>`_
+- Thibault Prouteau, Victor Connes, Nicolas Dugué, Anthony Perez, Jean-Charles Lamirel, et al.. SINr: Fast Computing of Sparse Interpretable Node Representations is not a Sin!. Advances in Intelligent Data Analysis XIX, 19th International Symposium on Intelligent Data Analysis, IDA 2021, Apr 2021, Porto, Portugal. pp.325-337, ⟨\ `10.1007/978-3-030-74251-5_26 <https://dx.doi.org/10.1007/978-3-030-74251-5_26>`__\ ⟩.
+  `⟨hal-03197434⟩ <https://hal.science/hal-03197434>`__
+
+**Interpretability of SINr embedding**

-**Interpretability of SINr embeddings, 2022**

+- Thibault Prouteau, Nicolas Dugué, Nathalie Camelin, Sylvain Meignier. Are Embedding Spaces Interpretable? Results of an Intrusion Detection Evaluation on a Large French Corpus. LREC 2022, Jun 2022, Marseille, France. `⟨hal-03770444⟩ <https://hal.science/hal-03770444>`__

-- Thibault Prouteau, Nicolas Dugué, Nathalie Camelin, Sylvain Meignier. Are Embedding Spaces Interpretable? Results of an Intrusion Detection Evaluation on a Large French Corpus. LREC 2022, Jun 2022, Marseille, France. `⟨hal-03770444⟩ <https://hal.science/hal-03770444>`_
+**Sparsity of SINr embedding**

+
+- Simon Guillot, Thibault Prouteau, Nicolas Dugué. Sparser is better: one step closer to word embedding interpretability. IWCS 2023, Nancy, France. `⟨hal-04321407⟩ <https://hal.science/hal-04321407>`__
+
+**Filtering dimensions of SINr embedding**
+
+
+- Anna Béranger, Nicolas Dugué, Simon Guillot, Thibault Prouteau. Filtering communities in word co-occurrence networks to foster the emergence of meaning. Complex Networks 2023, Menton, France. `⟨hal-04398742v1⟩ <https://hal.science/hal-04398742v1>`__
8 changes: 8 additions & 0 deletions docs/source/sinr.text.rst
@@ -33,6 +33,14 @@ Preprocess Text
     :members:
     :undoc-members:
     :show-inheritance:

+Evaluate
+---------------------------
+
+.. automodule:: sinr.text.evaluate
+    :members:
+    :undoc-members:
+    :show-inheritance:
+
 Module contents
 ---------------
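
Note on the usage example added to docs/source/presentation.rst: the example calls ``sinr_vec.sparsify(100)`` "for better interpretability and performances" without showing what sparsifying a vector involves. The sketch below is only an illustration under an assumption: it treats sparsification with parameter ``k`` as keeping the ``k`` largest components of a vector and setting the others to zero. ``sparsify_top_k`` is a hypothetical plain-Python helper, not part of the SINr API, and the library's actual behaviour may differ.

```python
def sparsify_top_k(vector, k):
    """Return a copy of `vector` where only the k largest values are kept.

    Hypothetical helper for illustration; not the SINr implementation.
    """
    if k <= 0:
        return [0.0] * len(vector)
    # Rank positions by value, largest first; sorted() is stable, so ties
    # are resolved in favour of the earlier position.
    ranked = sorted(range(len(vector)), key=lambda i: vector[i], reverse=True)
    keep = set(ranked[:k])
    # Zero out every component that is not among the k largest.
    return [value if i in keep else 0.0 for i, value in enumerate(vector)]


dense = [0.1, 0.7, 0.0, 0.3, 0.5]
sparse = sparsify_top_k(dense, 2)
print(sparse)
```

With ``k`` at least as large as the vector length, the vector comes back unchanged, which is why a value such as 100 only has an effect on vectors with more than 100 active dimensions under this assumption.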