Semantic text similarity on short text segments

Summary

Building on the Semantic Text Similarity datasets of SemEval, this repository evaluates computationally efficient approaches for identifying short text segments that have nearly the same meaning in large-scale datasets.
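To make the task concrete, here is a minimal sketch of similarity matching over embedding vectors. It is not taken from this repository's code: the function names, the toy vectors, and the 0.9 threshold are all assumptions chosen for illustration, and real embeddings would come from one of the models evaluated here.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_u * norm_v)


def similar_pairs(embeddings, threshold=0.9):
    """Return (id_a, id_b, score) for every pair scoring at or above threshold.

    `embeddings` maps a segment id to its vector. The threshold is an
    illustrative assumption, not a value recommended by this repository.
    """
    pairs = []
    for (id_a, vec_a), (id_b, vec_b) in combinations(embeddings.items(), 2):
        score = cosine_similarity(vec_a, vec_b)
        if score >= threshold:
            pairs.append((id_a, id_b, score))
    # Highest-scoring pairs first.
    return sorted(pairs, key=lambda p: p[2], reverse=True)


# Toy 3-dimensional vectors standing in for real sentence embeddings.
toy_embeddings = {
    "a": [0.9, 0.1, 0.0],
    "b": [0.8, 0.2, 0.1],
    "c": [0.0, 0.1, 0.9],
}
matches = similar_pairs(toy_embeddings, threshold=0.9)
```

This brute-force comparison checks every pair, so its cost grows quadratically with the number of segments. For large-scale datasets it would likely need to be replaced by an approximate nearest-neighbor index.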

Getting started

The main entry point to the code in this repository is test_textsim.py. That file contains fuller comments describing the approaches being evaluated.

There are several similar files that build upon test_textsim.py.

test_unisent_multilingual.py evaluates the Multilingual Universal Sentence Encoder (requires tensorflow>=2.0.0; see the comments in the file).
test_flair.py evaluates various transformer embeddings using the flairNLP library.
test_hindi.py uses XLM-R embeddings (from flairNLP) to test performance on a new Hindi dataset. That dataset is not currently in this repository; please contact the maintainer (see below) if you need it.

density_graphs.R plots the results, and the mwe* files are minimal working examples for some approaches.

Contact

Further information is available from Scott Hale. Meedan team members can contact Scott via Slack; others can reach him through comments or issues on this repository, or by direct message on Twitter.

