To preprocess:
./preprocessing/cos_sim.py
To cluster:
./models/tf_idf2kmeans.py
To create files for evaluation:
./eval/eval.py
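For orientation, the sketch below shows the general TF-IDF + KMeans idea with scikit-learn; the toy documents and the cluster count are placeholders, and tf_idf2kmeans.py may differ in its parameters and preprocessing.

```python
# Minimal TF-IDF + KMeans sketch (illustrative only; the repo scripts may differ).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "coronavirus transmission in hospitals",
    "vaccine development and clinical trials",
    "hospital infection control measures",
]

# Vectorize documents with tf-idf weights.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

# Cluster the tf-idf vectors; n_clusters is a placeholder value.
kmeans = KMeans(n_clusters=2, random_state=42, n_init=10)
labels = kmeans.fit_predict(X)
print(labels)  # cluster id per document
```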
Preprocess and get scores:
./models/bm25/bm25.py
Create files for evaluation:
./models/bm25/eval_bm25.py
Produces three files: results_bm25.txt, results_bm25_L.txt, results_bm25_Plus.txt
The last program outputs two files, myqrels.txt and myresults.txt, which are passed to trec_eval.
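For reference, here is a minimal sketch of scoring the three BM25 variants with the rank_bm25 package; whether bm25.py actually uses that package is an assumption, and the corpus, query, and tokenization below are placeholders.

```python
# Minimal BM25 scoring sketch using the rank_bm25 package
# (an assumption -- bm25.py may implement BM25 differently).
from rank_bm25 import BM25Okapi, BM25L, BM25Plus

corpus = [
    "coronavirus transmission in hospitals",
    "vaccine development and clinical trials",
    "hospital infection control measures",
]
tokenized_corpus = [doc.lower().split() for doc in corpus]
query = "hospital transmission".lower().split()

# Score the query against every document with the three variants.
for name, model_cls in [("BM25", BM25Okapi), ("BM25L", BM25L), ("BM25+", BM25Plus)]:
    bm25 = model_cls(tokenized_corpus)
    print(name, bm25.get_scores(query))
```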
CMPE493 Fall 2020 Term Project: Free Text Based Ranked Retrieval Model
Drive : https://drive.google.com/drive/folders/1iHdwsrQdw_25uPSN5zqcxAgIv-maFBD4?usp=sharing
python ./preprocessing/preprocess.py [-l | --lemmatization]
The preprocessed data and queries will be extracted to the preprocessing folder.
Doc2Vec Training
python ./models/doc2vec_train.py --size 10 --epochs 500
The trained model and the training evaluation results are extracted into the ./models/doc2vec/doc2vec_s10_e500_train.csv file.
Doc2Vec Testing
python ./models/doc2vec_test.py --path "doc2vec_s10_e500"
The results will be extracted into the ./models/doc2vec/doc2vec_s10_e500_test.csv file.
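A rough sketch of the Doc2Vec train/infer steps with gensim, mirroring the --size and --epochs arguments above; the corpus, query, and save path are placeholders, and doc2vec_train.py / doc2vec_test.py may differ.

```python
# Minimal gensim Doc2Vec sketch (illustrative; the repo scripts may differ).
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = {
    "doc1": "coronavirus transmission in hospitals",
    "doc2": "vaccine development and clinical trials",
}
tagged = [TaggedDocument(words=text.lower().split(), tags=[doc_id])
          for doc_id, text in corpus.items()]

# vector_size/epochs mirror the --size/--epochs CLI arguments above.
model = Doc2Vec(tagged, vector_size=10, epochs=500, min_count=1, workers=2)
model.save("doc2vec_s10_e500")  # placeholder path, named after the --path argument

# At query time, infer a vector for the query and rank documents by similarity.
query_vec = model.infer_vector("hospital transmission".lower().split())
print(model.dv.most_similar([query_vec], topn=2))
```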
GloVe Training
python ./models/GloVe/glove_clean_best.py
The trained model is extracted into the ./models/GloVe/glove/pretrained_models/glove_500.model file.
The results will be extracted into the ./models/GloVe/glove/preprocessing/cosine_similarity_matrix.csv file.
python ./models/GloVe/glove_clean_tfidf.py
The trained model is extracted into the ./models/GloVe/glove/pretrained_models/glove_tfidf_500.model file.
The results will be extracted into the ./models/GloVe/glove/preprocessing/cosine_similarity_matrix_tfidf.csv file.
GloVe Testing
python ./models/GloVe/glove_clean_tfidf-test.py
The results will be extracted into the ./models/GloVe/glove/preprocessing/cosine_similarity_matrix_test.csv file.
python ./models/GloVe/glove_clean_best-test.py
The results will be extracted into the ./models/GloVe/glove/preprocessing/cosine_similarity_matrix_tfidf_test.csv file.
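A hedged sketch of the GloVe ranking idea: embed each document and query as the average of GloVe word vectors and compare them with cosine similarity. The pretrained model loaded below comes from gensim's downloader and is only for illustration; the repository trains its own glove_500 / glove_tfidf_500 models, and the tf-idf weighted variant is not shown.

```python
# Illustrative GloVe document ranking sketch using gensim's downloader
# (an assumption -- the repo trains its own GloVe models instead).
import numpy as np
import gensim.downloader as api
from sklearn.metrics.pairwise import cosine_similarity

glove = api.load("glove-wiki-gigaword-50")  # small pretrained model, for illustration

def embed(text):
    """Average the GloVe vectors of the in-vocabulary tokens."""
    vecs = [glove[w] for w in text.lower().split() if w in glove]
    return np.mean(vecs, axis=0) if vecs else np.zeros(glove.vector_size)

docs = ["coronavirus transmission in hospitals",
        "vaccine development and clinical trials"]
query = "hospital transmission"

doc_matrix = np.vstack([embed(d) for d in docs])
sims = cosine_similarity([embed(query)], doc_matrix)[0]
print(sims)  # one cosine score per document, analogous to the CSV matrices above
```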
For training results:
python ./eval/eval.py ./models/doc2vec/doc2vec_s10_e500_train.csv
For test results:
python ./eval/eval.py ./models/doc2vec/doc2vec_s10_e500_test.csv --test
myqrels.txt and myresults.txt files are extracted to the current path.
- Download trec_eval from https://trec.nist.gov/trec_eval/ .
- Extract the tar.gz file.
- In the trec_eval directory, open a terminal and type "make".
- The name of the executable is trec_eval in the same directory. You can test it with:
./trec_eval test/qrels.test test/results.test
In a terminal, run:
<path-to-the-trec_eval> <path-to-qrel-file> <path-to-result-file>
Example:
~/trec_eval-9.0.7/trec_eval myqrels.txt myresults.txt
- The first argument of trec_eval (the qrels file) should be the file that contains the correct labels; it represents the ground truth. It has the format:
query-id 0 document-id relevance
example:
0 0 005b2j4b 2
0 0 00fmeepz 1
...
- The second argument of trec_eval (the results file) should be the file that contains our predictions. It has the format:
query-id Q0 document-id rank score explanation
example:
0 Q0 2b73a28n 0 0 STANDARD
0 Q0 zjufx4fo 0 0 STANDARD
...
"Q0" and rank is currently ignored. Explanation is any sequence of alphanumeric characters that is used for identifying the run.
-q: In addition to the summary evaluation, give an evaluation for each query.
-l<num>: <num> indicates the minimum relevance judgment value needed for a document to be counted as relevant (all measures used by trec_eval are based on binary relevance). Used if the qrels file contains relevance judged on a multi-level scale. Default is 1.
-m <measure>: Print only the results of the specified measure, e.g.:
~/trec_eval-9.0.7/trec_eval -m ndcg myqrels.txt myresults.txt
~/trec_eval-9.0.7/trec_eval -m map myqrels.txt myresults.txt
References:
- http://www.rafaelglater.com/en/post/learn-how-to-use-trec_eval-to-evaluate-your-information-retrieval-system
- https://github.com/usnistgov/trec_eval
- Tokenization
- Sentence Splitting (?)
- Stemming
- Lemmatization (?)
- Normalization (punctuation removal, case folding, etc.)
- Stopword Elimination
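A minimal sketch of such a pipeline with NLTK; which steps preprocess.py applies, and in what order, may differ, and the lemmatization switch here simply mirrors the optional -l/--lemmatization flag above.

```python
# Illustrative preprocessing pipeline with NLTK (preprocess.py may differ).
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

STOPWORDS = set(stopwords.words("english"))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def preprocess(text, use_lemmatization=False):
    # Normalization: case folding and punctuation removal.
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    # Tokenization (a plain whitespace split keeps the sketch simple).
    tokens = text.split()
    # Stopword elimination.
    tokens = [t for t in tokens if t not in STOPWORDS]
    # Stemming, or lemmatization when the -l flag is given.
    if use_lemmatization:
        return [lemmatizer.lemmatize(t) for t in tokens]
    return [stemmer.stem(t) for t in tokens]

print(preprocess("The viruses were spreading rapidly in hospitals.", use_lemmatization=True))
```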
- Jaccard Coefficient: (Lec6)
  jaccard(query.tokens, document.tokens) returns a score between 0 and 1; it ignores multiple occurrences (see the worked sketch after this list)
- Bag of Words: (Lec6)
  - counts the frequency of each token in a text
  - ignores word ordering
  - term frequency tf(t,d): the number of times that term t occurs in document d
- Log freq. weighting: (Lec6)
w(t,d) = 1 + log10(tf(t,d)) if tf(t,d) > 0, else 0
- score every document-query pair:
score(q, d) = sum( for all terms t in (query q INTERSECTION document d), w(t, d) )
- idf weighting: (Lec6)
  - document frequency df(t): the number of documents that contain t; df is an inverse measure of the informativeness of t
  - inverse document frequency idf(t) = log10(N/df(t)), where N is the total number of documents; idf measures the informativeness of t
- tf-idf weighting: (Lec6)
tf_idf(t,d) = w(t,d) * idf(t)
score(q,d) = sum( for all terms t in (query q INTERSECTION document d), tf_idf(t, d) )
  - see sklearn.feature_extraction.text.TfidfVectorizer and the worked sketch after this list
- Vector representation: (Lec6)
- Table : Rows are tokens, columns are documents -> calculate tf_idf(t,d) for each cell
- Rows (terms) are axes of the space, documents are the vectors
- Do the same to the queries
- Euclidean distance is a bad idea because it is sensitive to the length of the vectors
- Cosine Similarity:
  - Length normalization: i.e. L2 norm (see sklearn.preprocessing.normalize)
  - Apply cosine similarity: cosine_similarity(q,d), where q and d are the vectors described above (see sklearn.metrics.pairwise.cosine_similarity)
  - After length normalization, cosine similarity is just the dot product
  - The score of each document can be divided by the document's length (maybe softmax?)
- Practical considerations:
  - Consider w(t, q) = 1 for query terms: faster cosine computation
  - Take only high-idf query terms: documents that match only low-idf terms are eliminated
  - Take a document only if the size of the intersection between the query's terms and the document's terms is above a threshold, say 3 or 4
  - Champions list (top docs): apply a threshold to the postings list of each term
  - We assume the authority (quality) of each document is the same, so we do not need tiered indexes
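A worked sketch of the lexical scoring ideas above (Jaccard coefficient, tf-idf weighting, cosine similarity) using scikit-learn on a toy corpus; note that sklearn's tf-idf differs in small details (natural log, smoothing) from the lecture formulas.

```python
# Worked sketch of Jaccard and tf-idf + cosine ranking (toy data, illustrative only).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["coronavirus transmission in hospitals",
        "vaccine development and clinical trials",
        "hospital infection control measures"]
query = "transmission in hospitals"

# Jaccard coefficient: set overlap, ignores multiple occurrences.
def jaccard(a, b):
    a, b = set(a.lower().split()), set(b.lower().split())
    return len(a & b) / len(a | b)

print([round(jaccard(query, d), 2) for d in docs])

# tf-idf weighting: sublinear_tf=True uses 1 + ln(tf), analogous to the
# log-frequency weighting above (sklearn uses ln and smoothing, not log10).
# Vectors are L2-normalized by default, so cosine similarity is a dot product.
vectorizer = TfidfVectorizer(sublinear_tf=True)
doc_vecs = vectorizer.fit_transform(docs)
query_vec = vectorizer.transform([query])

scores = cosine_similarity(query_vec, doc_vecs)[0]
ranking = sorted(enumerate(scores), key=lambda pair: -pair[1])
print(ranking)  # (document index, cosine score), best match first
```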
- Deep Learning:
- word embeddings (https://medium.com/deeper-learning/glossary-of-deep-learning-word-embedding-f90c3cec34ca)
- Word2Vec
- BERT
- LSTM/GRU/Attention
- GloVe
- Mean Average Precision (mAP)
- Normalized Discounted Cumulative Gain (NDCG)
- Precision of top 10 results (P@10)
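For intuition, a toy computation of these three measures for a single query; trec_eval is the authoritative implementation, and this sketch only illustrates the definitions.

```python
# Toy computation of P@10, AP and NDCG for one query (trec_eval is authoritative).
import math

# Graded relevance of the retrieved documents, in ranked order (0 = not relevant).
rels = [2, 0, 1, 0, 0, 1, 0, 0, 0, 0]

# Precision at 10: fraction of the top 10 results that are relevant.
p_at_10 = sum(1 for r in rels[:10] if r > 0) / 10

# Average precision (binary relevance): mean of precision@k over ranks of relevant docs.
hits, precisions = 0, []
for k, r in enumerate(rels, start=1):
    if r > 0:
        hits += 1
        precisions.append(hits / k)
ap = sum(precisions) / hits if hits else 0.0  # MAP is the mean of AP over all queries

# NDCG: discounted cumulative gain normalized by the ideal ordering.
dcg = sum(r / math.log2(i + 1) for i, r in enumerate(rels, start=1))
idcg = sum(r / math.log2(i + 1) for i, r in enumerate(sorted(rels, reverse=True), start=1))
ndcg = dcg / idcg if idcg else 0.0

print(round(p_at_10, 3), round(ap, 3), round(ndcg, 3))
```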