RZD_trainer

This repository contains our solution for the "Цифровой прорыв" hackathon. The project is an assistant that helps train railway workers: it generates questions from the source texts and also answers those questions.

The pipeline includes the following steps:

1. Parse the PDF documents into topics. The main topics, together with the raw text of their subtopics, are stored in ./data/generated_data.json. In total, we parsed 12 main topics and 69 subtopics.
2. Generate open-ended and multiple-choice questions with the LLM Saiga-Mistral 7B. In total, we generated over 1,200 questions.
3. To ground the answers in the documents, compare embeddings from the multilingual-e5-base model using cosine similarity.
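The retrieval idea in the last step can be illustrated with a minimal sketch. This is not the repository's code: the function names `cosine_similarity` and `rank_passages` are hypothetical, and the toy 3-dimensional vectors stand in for real embeddings, which in the actual pipeline would come from multilingual-e5-base.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # A zero vector has no direction; treat it as dissimilar to everything.
        return 0.0
    return dot / (norm_a * norm_b)


def rank_passages(question_vec, passage_vecs, top_k=2):
    """Return (index, score) pairs for the top_k most similar passages."""
    scored = [
        (i, cosine_similarity(question_vec, vec))
        for i, vec in enumerate(passage_vecs)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy vectors (assumption): placeholders for real passage and question embeddings.
passages = [
    [1.0, 0.0, 0.0],
    [0.6, 0.8, 0.0],
    [0.0, 0.0, 1.0],
]
question = [0.8, 0.6, 0.0]

top = rank_passages(question, passages, top_k=2)
for index, score in top:
    print(f"passage {index}: similarity {score:.2f}")
```

With real embeddings, the highest-ranked subtopic texts would be the ones passed to the model as context for answering. Cosine similarity and cosine distance carry the same information (distance is one minus similarity), so ranking by either gives the same order.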

The whole process is described in this video (in Russian).

This solution took 3rd place.

