cmpe493-term-project

How to Run

TF-IDF

To preprocess:

./preprocessing/cos_sim.py

To cluster:

./models/tf_idf2kmeans.py

To create files for evaluation:

./eval/eval.py

Okapi BM25

Preprocess and get scores:

./models/bm25/bm25.py

Create files for evaluation:

./models/bm25/eval_bm25.py

Produces three files: results_bm25.txt, results_bm25_L.txt, results_bm25_Plus.txt

The last program also outputs two files, myqrels.txt and myresults.txt, which are given to trec_eval.
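
For reference, a minimal sketch of the classic Okapi BM25 scoring function that these scripts are based on (BM25L and BM25+ modify the term-frequency normalization; the k1 and b values below are common defaults, not necessarily the ones used in bm25.py):

import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Rank documents (lists of tokens) against a query with classic Okapi BM25."""
    N = len(docs_tokens)
    avgdl = sum(len(d) for d in docs_tokens) / N
    # document frequency of each query term
    df = {t: sum(1 for d in docs_tokens if t in d) for t in set(query_tokens)}
    scores = []
    for i, doc in enumerate(docs_tokens):
        tf = Counter(doc)
        score = 0.0
        for t in query_tokens:
            if df.get(t, 0) == 0:
                continue
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
            norm = tf[t] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[t] * (k1 + 1) / norm
        scores.append((i, score))
    return sorted(scores, key=lambda s: s[1], reverse=True)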

cmpe493 fall 2020 term project : free text based ranked retrieval model

Drive : https://drive.google.com/drive/folders/1iHdwsrQdw_25uPSN5zqcxAgIv-maFBD4?usp=sharing

raw_data.csv


Preprocess

python ./preprocessing/preprocess.py [-l | --lemmatization]

The preprocessed data and queries will be extracted to the preprocessing folder.

Doc2Vec

Training

python ./models/doc2vec_train.py --size 10 --epochs 500

The trained model and the training evaluation results are extracted into the ./models/doc2vec/doc2vec_s10_e500_train.csv file.
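
As a rough illustration (not the project's actual code), training a Doc2Vec model with --size 10 --epochs 500 typically looks like this with gensim; the corpus and save path below are placeholders:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Placeholder corpus; the project builds it from the preprocessed documents.
corpus = [
    TaggedDocument(words=["virus", "transmission", "aerosol"], tags=["doc_0"]),
    TaggedDocument(words=["vaccine", "efficacy", "trial"], tags=["doc_1"]),
]

# vector_size and epochs correspond to the --size and --epochs arguments.
model = Doc2Vec(vector_size=10, epochs=500, min_count=1, workers=4)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)
model.save("doc2vec_s10_e500")  # hypothetical save path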

Testing

python ./models/doc2vec_test.py --path "doc2vec_s10_e500"

The results will be extracted into the ./models/doc2vec/doc2vec_s10_e500_test.csv file.
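
At test time the saved model can be loaded, query vectors inferred, and documents ranked by cosine similarity; a sketch assuming gensim 4.x (the query below is a placeholder):

from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec.load("doc2vec_s10_e500")  # the model name passed via --path

query_tokens = ["coronavirus", "origin"]  # a preprocessed query (placeholder)
query_vec = model.infer_vector(query_tokens)

# Rank the training documents by cosine similarity to the inferred query vector.
ranking = model.dv.most_similar([query_vec], topn=10)  # [(doc_tag, cosine), ...]
print(ranking)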

GloVe

Training

python ./models/GloVe/glove_clean_best.py 

The trained model is extracted into the ./models/GloVe/glove/pretrained_models/glove_500.model file. The results will be extracted into the ./models/GloVe/glove/preprocessing/cosine_similarity_matrix.csv file.

python ./models/GloVe/glove_clean_tfidf.py 

The trained model is extracted into the ./models/GloVe/glove/pretrained_models/glove_tfidf_500.model file. The results will be extracted into the ./models/GloVe/glove/preprocessing/cosine_similarity_matrix_tfidf.csv file.
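
The script names suggest that each document is represented as an average of its words' GloVe vectors, unweighted in glove_clean_best and tf-idf weighted in glove_clean_tfidf, and that documents are then compared with cosine similarity. A rough sketch of the tf-idf weighted variant (the word vectors and documents below are placeholders, not the project's data):

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder word vectors; the project loads its own trained GloVe model.
glove = {
    "virus": np.array([0.1, 0.3, 0.2]),
    "spread": np.array([0.4, 0.1, 0.0]),
    "mask": np.array([0.2, 0.2, 0.5]),
}
dim = 3

docs = ["virus spread", "mask virus"]  # placeholder documents

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)
vocab = vectorizer.get_feature_names_out()

def doc_vector(doc_index):
    """tf-idf weighted average of the GloVe vectors of a document's terms."""
    weights = tfidf[doc_index].toarray().ravel()
    vec, total = np.zeros(dim), 0.0
    for term_index, weight in enumerate(weights):
        term = vocab[term_index]
        if weight > 0 and term in glove:
            vec += weight * glove[term]
            total += weight
    return vec / total if total else vec

doc_matrix = np.vstack([doc_vector(i) for i in range(len(docs))])
print(cosine_similarity(doc_matrix))  # pairwise cosine similarity matrix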

Testing

python ./models/GloVe/glove_clean_tfidf-test.py 

The results will be extracted into the ./models/GloVe/glove/preprocessing/cosine_similarity_matrix_test.csv file.

python ./models/GloVe/glove_clean_best-test.py 

The results will be extracted into the ./models/GloVe/glove/preprocessing/cosine_similarity_matrix_tfidf_test.csv file.

Evaluating

For train results

python ./eval/eval.py ./models/doc2vec/doc2vec_s10_e500_train.csv

For testing results

python ./eval/eval.py ./models/doc2vec/doc2vec_s10_e500_test.csv --test

The myqrels.txt and myresults.txt files are written to the current directory.

Installing trec_eval:

  1. Download trec_eval from https://trec.nist.gov/trec_eval/.
  2. Extract the tar.gz file.
  3. In the trec_eval directory, open a terminal and run "make".
  4. The executable, named trec_eval, is created in the same directory. You can test it with:
  ./trec_eval test/qrels.test test/results.test

Using trec_eval to evaluate:

In a terminal:

<path-to-the-trec_eval> <path-to-qrel-file> <path-to-result-file>

Example:

~/trec_eval-9.0.7/trec_eval myqrels.txt myresults.txt

Qrels and results file:

  • The first argument of trec_eval (the qrels file) should be the file that contains the correct labels, i.e. the ground truth. It has the format:

query-id 0 document-id relevance

example:

0 0 005b2j4b 2

0 0 00fmeepz 1

...

  • The second argument of trec_eval (the results file) should be the file that contains our predictions. It has the format:

query-id Q0 document-id rank score explanation

example:

0 Q0 2b73a28n 0 0 STANDARD

0 Q0 zjufx4fo 0 0 STANDARD

...

"Q0" and rank is currently ignored. Explanation is any sequence of alphanumeric characters that is used for identifying the run.

Options:

-q: In addition to summary evaluation, give evaluation for each query

-l<num>: <num> indicates the minimum relevance judgement value needed for a document to be counted as relevant (all measures used by trec_eval are based on binary relevance). Used if the qrels file contains relevance judged on a multi-point scale. Default is 1.

-m <measure>: Print only the results for the specified measure. For example:

~/trec_eval-9.0.7/trec_eval -m ndcg myqrels.txt myresults.txt

~/trec_eval-9.0.7/trec_eval -m map myqrels.txt myresults.txt

Output

(screenshot of the evaluation measures reported by trec_eval)

References

http://www.rafaelglater.com/en/post/learn-how-to-use-trec_eval-to-evaluate-your-information-retrieval-system
https://github.com/usnistgov/trec_eval

Ideas:

Preprocessing (a sketch of these steps follows the list below):

  • Tokenization
  • Sentence Splitting (?)
  • Stemming
  • Lemmatization (?)
  • Normalization (punctuation removal, case folding, etc.)
  • Stopword Elimination
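
A minimal sketch of the steps above, assuming NLTK for stopwords, stemming, and lemmatization (the actual pipeline in preprocess.py may differ):

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

STOPWORDS = set(stopwords.words("english"))

def preprocess(text, lemmatize=False):
    """Tokenization and normalization (case folding, punctuation removal),
    stopword elimination, then stemming or, optionally, lemmatization."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())  # tokenize, case fold, drop punctuation
    tokens = [t for t in tokens if t not in STOPWORDS]
    if lemmatize:  # corresponds to the -l / --lemmatization flag
        lemmatizer = WordNetLemmatizer()
        return [lemmatizer.lemmatize(t) for t in tokens]
    stemmer = PorterStemmer()
    return [stemmer.stem(t) for t in tokens]

print(preprocess("The virus spreads via respiratory droplets."))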

Models:

  • Jaccard Coefficient: (Lec6)
    • jaccard(query.tokens, document.tokens) returns a score between 0 and 1
    • does not take multiple occurrences into account

  • Bag of Words: (Lec6)
    • counts the frequency of a token in a text
    • does not take token ordering into account
    • term frequency, tf(t,d) : the number of times that term t occurs in document d

  • Log freq. weighting: (Lec6)
    • w(t,d) = 1 + log10(tf(t,d)) if tf(t,d) > 0, else 0
    • score every document-query pair: score(q, d) = sum( for all term t in (query q INTERSECTION document d), w(t, d) )

  • idf weighting: (Lec6)
    • document freq. df(t) : the number of documents that contain t
    • df is an inverse measure of the informativeness of t
    • inverse document frequency idf(t) = log10(N/df(t)) where N is total # of docs
    • measures the informativeness of t

  • tf-idf weighting: (Lec6)
    • tf_idf(t,d) = w(t,d) * idf(t)
    • score(q,d) = sum( for all term t in (query q INTERSECTION document d), tf_idf(t, d))
    • see sklearn.feature_extraction.text.TfidfVectorizer (a sketch follows this list)

  • Vector representation: (Lec6)
    • Table : Rows are tokens, columns are documents -> calculate tf_idf(t,d) for each cell
    • Rows (terms) are axes of the space, documents are the vectors
    • Do the same for the queries
    • Euclidean distance is a bad idea because it is sensitive to vector length
    • Cosine Similarity:
      • Length Normalization: i.e. L2 norm (see sklearn.preprocessing.normalize)
      • Apply cosine similarity: cosine_similarity(q,d) where q and d are the vectors explained above (see sklearn.metrics.pairwise.cosine_similarity)
      • After normalization, cosine similarity is just the dot product
      • The score for each document can be divided by the document's length (maybe softmax?)
    • Practical considerations:
      • Consider w(t, q) = 1 for queries : faster cosine
      • Take only high-idf query terms : documents matched only by low-idf terms are eliminated
      • Take a doc only if the cardinality of the intersection between the query's terms and the doc's terms is above a threshold, say 3 or 4
      • champions list (top docs) : apply a threshold to the posting list of a term
      • we assume the authority (quality) of each document is the same, thus we don't need tiered indexes
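
A sketch of the Jaccard and tf-idf / cosine ideas above, using the sklearn classes mentioned in the list (note that TfidfVectorizer's default weighting uses a smoothed natural-log idf rather than the log10 formulas above; the documents and query are placeholders):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def jaccard(query_tokens, doc_tokens):
    """Jaccard coefficient: a score in [0, 1] that ignores multiple occurrences."""
    q, d = set(query_tokens), set(doc_tokens)
    return len(q & d) / len(q | d) if q | d else 0.0

docs = ["virus spreads through droplets", "masks reduce virus transmission"]
query = "virus transmission"

# tf-idf vectors for documents and query in the same term space.
# TfidfVectorizer applies L2 length normalization by default,
# so cosine similarity reduces to a dot product.
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(docs)
query_vector = vectorizer.transform([query])

scores = cosine_similarity(query_vector, doc_vectors).ravel()
ranking = sorted(enumerate(scores), key=lambda s: s[1], reverse=True)
print(ranking)                                   # [(doc_index, cosine score), ...]
print(jaccard(query.split(), docs[0].split()))   # 0.2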

Evaluation

  • Mean Average Precision (mAP)
  • Normalized Discounted Cumulative Gain (NDCG)
  • Precision of top 10 results (P@10) (see the sketch below)
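
trec_eval reports all of these, but for intuition, a small sketch of P@10 and average precision for a single query with binary relevance labels (NDCG is computed analogously from graded relevance; the judgements below are placeholders):

def precision_at_k(ranked_relevance, k=10):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(ranked_relevance[:k]) / k

def average_precision(ranked_relevance, num_relevant):
    """Sum of precision at each rank where a relevant document is retrieved,
    divided by the total number of relevant documents for the query."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / num_relevant if num_relevant else 0.0

# 1 = relevant, 0 = non-relevant, in ranked order (placeholder judgements)
ranking = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
print(precision_at_k(ranking))        # 0.4
print(average_precision(ranking, 4))  # ~0.75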
