PDF Search Engine in Python

A Python project that searches any PDF quickly and ranks the results by two signals: the number of occurrences of the search query on a page, and the page's 'page rank' (how many times other pages reference it). The app runs from a console menu; turning it into a GUI app would be an interesting extension.
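
The ranking idea can be sketched as follows. This is a minimal illustration, not the project's actual scoring code: the function name `rank_pages`, the shape of the two dictionaries, and the `link_weight` parameter are all assumptions made for the example.

```python
def rank_pages(occurrences, references, link_weight=0.5):
    """Order pages by a combined score (hypothetical formula).

    occurrences: {page: number of query hits on that page}
    references:  {page: set of pages that refer to it}
    link_weight: assumed tuning knob, not taken from the project
    """
    scores = {}
    for page, hits in occurrences.items():
        incoming = len(references.get(page, ()))
        scores[page] = hits + link_weight * incoming
    # Highest score first; ties are broken by page number for stable output
    return sorted(scores, key=lambda p: (-scores[p], p))


occurrences = {3: 4, 7: 2, 12: 4}
references = {7: {1, 2, 3, 5, 9, 11}, 12: {3}}
print(rank_pages(occurrences, references))  # [7, 12, 3]
```

Page 7 has the fewest hits but is referenced by six other pages, so with this weighting it ends up first.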

Technologies used

The project was built in Python 3. The pages of the PDF were organized in a graph, and a trie was used for word searching.
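
A trie for word search might look roughly like the sketch below. The class and method names (`TrieNode`, `Trie`, `insert`, `search`, `autocomplete`) are illustrative assumptions, as is the choice to store a per-page hit count at the node where a word ends; the project's own structures may differ.

```python
from collections import defaultdict


class TrieNode:
    def __init__(self):
        self.children = {}
        # page number -> how often the word ending here occurs on that page
        self.pages = defaultdict(int)


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word, page):
        node = self.root
        for ch in word.lower():
            node = node.children.setdefault(ch, TrieNode())
        node.pages[page] += 1

    def _find(self, prefix):
        # Walk down the trie; return None if the prefix is not present
        node = self.root
        for ch in prefix.lower():
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def search(self, word):
        node = self._find(word)
        return dict(node.pages) if node else {}

    def autocomplete(self, prefix):
        start = self._find(prefix)
        if start is None:
            return []
        found, stack = [], [(start, prefix.lower())]
        while stack:
            node, text = stack.pop()
            if node.pages:  # a complete word ends at this node
                found.append(text)
            for ch, child in node.children.items():
                stack.append((child, text + ch))
        return sorted(found)


trie = Trie()
for page, text in {1: "graph and tree", 2: "graph traversal"}.items():
    for word in text.split():
        trie.insert(word, page)

print(trie.search("graph"))     # {1: 1, 2: 1}
print(trie.autocomplete("tr"))  # ['traversal', 'tree']
```

Because lookups follow one node per character, the cost of a search depends on the length of the word rather than on the number of indexed words, and the same walk yields autocomplete candidates for a prefix.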

External libraries

PyMuPDF (imported as fitz) - used for parsing the PDF into one txt file per page, and for extracting pages and highlighting the query for the 'first 10 results' request
difflib - finding similar words for the 'did you mean' feature; its matching relies on a sequence similarity ratio (SequenceMatcher) rather than Levenshtein distance
collections.defaultdict - used as an alternative to a plain dictionary
os, re - used for creating and writing files (os) and for regular expressions (re)
Of these, only PyMuPDF is a third-party package; the others are part of the Python standard library.
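
As a rough illustration of the 'did you mean' lookup, the sketch below uses `difflib.get_close_matches`, which ranks candidates by `SequenceMatcher` similarity. The helper name `did_you_mean`, the sample vocabulary, and the `limit` and `cutoff` values are assumptions for the example, not the project's settings.

```python
import difflib


def did_you_mean(query, vocabulary, limit=3, cutoff=0.6):
    # Returns up to `limit` words whose similarity ratio to the query
    # is at least `cutoff`, best match first
    return difflib.get_close_matches(query.lower(), vocabulary, n=limit, cutoff=cutoff)


vocabulary = ["graph", "trie", "search", "page", "python"]
print(did_you_mean("grpah", vocabulary))
```

In a search engine the vocabulary would typically be the set of words already indexed, so every suggestion is guaranteed to return results.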

Features

The app supports single-word search, multi-word search, logical operators (and, or, and binary not), phrase search, autocomplete, and a 'did you mean' feature that is triggered when a search returns no results. As an additional feature, you can save the search results to a separate PDF file, which extracts the matching pages and highlights the term you searched for.
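
The logical operators can be pictured as set operations on the pages that contain each word. The sketch below is one possible reading, not the project's parser: the `evaluate` function, the `index` shape (word to set of pages), strict left-to-right evaluation, and treating bare consecutive words as an implicit or are all assumptions. Binary not is interpreted here as "pages matching the left word but not the right one".

```python
def evaluate(query, index):
    """Evaluate queries such as 'a and b', 'a or b', 'a not b' left to right.

    index maps a word to the set of pages containing it (hypothetical shape).
    """
    tokens = query.lower().split()
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))  # copy, so the index is not mutated
    i = 1
    while i < len(tokens):
        if tokens[i] in ("and", "or", "not") and i + 1 < len(tokens):
            op, operand = tokens[i], index.get(tokens[i + 1], set())
            i += 2
        else:
            # Assumption: a bare word is combined with an implicit 'or'
            op, operand = "or", index.get(tokens[i], set())
            i += 1
        if op == "and":
            result &= operand
        elif op == "or":
            result |= operand
        else:  # binary not: keep pages with the left side but not the right
            result -= operand
    return result


index = {"graph": {1, 2, 5}, "tree": {2, 3}, "trie": {5}}
print(evaluate("graph and tree", index))  # {2}
print(evaluate("graph not tree", index))  # {1, 5}
print(evaluate("graph or tree", index))   # {1, 2, 3, 5}
```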

Repository: sara-stojkov/Python_PDF_Search_Engine
