WIP: Metadata extraction using LLM API service #29
This draft PR contains an initial rough implementation of the LLM-based metadata extraction described in #21. Because some functionality is still missing and there are uncertainties in the implementation, I'm leaving it as a Draft PR.
The initial prototyping was performed in this Jupyter notebook that has essentially the same functionality, but in this PR the code from the notebook has been retrofitted into the Meteor codebase.
How it works
This code adds a new LLMExtractor class that performs the main work of metadata extraction by calling an LLM API service such as llama.cpp running locally. Here is an outline of the changes:
- `src/settings.py`, `src/util.py` and `.env.example` have been extended to handle the LLM configuration settings LLM_API_URL, LLM_API_KEY and LLM_MODEL
- The LLMExtractor backend is used when the `backend=LLMExtractor` parameter is given in the API method call (see the sketch after this list)
- MeteorDocument has a new method `extract_text_as_json` that returns the text and pdfinfo metadata in a JSON format that the LLMs expect
- `index.html` has a new select element for choosing the backend; the default value is Finder
- A new origin value `LLM` has been added and all information coming from the LLM is tagged as having that origin. The existing values didn't seem to fit because the LLM won't tell (at least currently) where in the document the information came from
- The unit tests mock the LLM API service using `unittest.mock` so that the tests don't have to set up a real LLM service and wait for its responses
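To make the flow concrete, here is a rough sketch of what the LLMExtractor does with these settings. This is illustrative only, not the actual implementation: the function name, the settings import path and the prompt are placeholders; what it relies on is that llama.cpp's llama-server exposes an OpenAI-compatible chat completions endpoint.

```python
import json
import requests

from src import settings  # assumed import path for LLM_API_URL, LLM_API_KEY, LLM_MODEL


def llm_extract(doc_json: str) -> dict:
    """Send the text+pdfinfo JSON to an OpenAI-compatible chat completions
    endpoint (llama.cpp's llama-server provides one) and parse the reply."""
    resp = requests.post(
        f"{settings.LLM_API_URL}/v1/chat/completions",
        headers={"Authorization": f"Bearer {settings.LLM_API_KEY}"},
        json={
            "model": settings.LLM_MODEL,
            "messages": [
                # The real prompt is whatever the fine-tuned model expects.
                {"role": "user", "content": doc_json},
            ],
            "temperature": 0.0,
        },
        timeout=120,
    )
    resp.raise_for_status()
    return json.loads(resp.json()["choices"][0]["message"]["content"])
```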
How it looks

There is a new select element for choosing the backend:
How to test it
1. Install llama.cpp (`git clone` the repository and run `make` to compile it)
2. Run `./llama-server -m Qwen2-0.5B-Instruct-FinGreyLit-Q4_K_M.gguf` and leave it running
3. `export LLM_API_URL=http://localhost:8080` (or edit the `.env` file)
4. Start Meteor and perform an extraction with `backend=LLMExtractor` (for example as sketched below).
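For step 4, the extraction can be triggered from the web UI or programmatically. Here is a hypothetical example of the latter; the host, port and the `fileUrl` field name are assumptions (check `src/routes/extract.py` for the real ones), only `backend=LLMExtractor` is taken from this PR:

```python
import requests

# Hypothetical call to a locally running Meteor instance.
resp = requests.post(
    "http://localhost:8000/json",  # port is an assumption
    data={
        "fileUrl": "https://example.org/some-report.pdf",  # field name is an assumption
        "backend": "LLMExtractor",
    },
    timeout=300,
)
print(resp.json())
```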
Example using a Norwegian document in the English language:
Here is the same metadata as shown in the Meteor web UI:
As far as I can tell, this metadata is correct, except that the LLM for some reason didn't pick up the ISSN on page 3. But this was using the relatively stupid Qwen2-0.5B-based small model, not the larger Mistral-7B-based model that gives much better quality responses.
Here is the same document again, but this time the LLM is the larger Mistral-7B-based model, quantized to the Q6_K GGUF format and running on a V100 GPU using llama.cpp, with all 33 layers offloaded to the GPU, requiring around 12.5 GB of VRAM.
Note that the request now completed in 3.4 seconds (including downloading the PDF) and this time the ISSN was successfully extracted as well.
Missing functionality
Code/implementation issues
`MeteorDocument.extract_text_as_json` is implemented separately from the text extraction that the class already performs, so there is some duplicated code and extra work. It should probably be integrated better with the existing code. One issue is that Meteor by default looks at the first 5 and last 5 pages, while the LLM extractors have been developed using text from the first 8 and last 2 pages. Also, text extraction for the LLMs uses the parameter `sorted=True`, while Meteor doesn't use that option. Such differences in details make it hard to reuse the existing code.
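To illustrate the shape of the problem, here is a self-contained sketch of what `extract_text_as_json` does. It assumes PyMuPDF and a simple payload layout; the real method may differ in both respects:

```python
import json

import fitz  # PyMuPDF -- an assumption; substitute whatever Meteor actually uses


def extract_text_as_json(pdf_path: str) -> str:
    """Sketch: serialize pdfinfo metadata plus the text of the first 8
    and last 2 pages (without duplicates for short documents) as JSON."""
    doc = fitz.open(pdf_path)
    first = list(range(min(8, doc.page_count)))
    last = [p for p in range(max(doc.page_count - 2, 0), doc.page_count)
            if p not in first]
    # sort=True is PyMuPDF's counterpart of the sorted=True option mentioned above
    pages = {str(p + 1): doc[p].get_text(sort=True) for p in first + last}
    return json.dumps({"pdfinfo": doc.metadata, "pages": pages})
```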
Other potential problems

`_underscore_methods` (or `__double_underscore_methods`) and underscore variables, used to signal that some information is internal to a class, aren't used consistently. Also, some classes have the habit of mutating objects and accessing them from inside another class when they could simply pass them around instead; as an example, `Finder.extract_metadata` could be given a MeteorDocument as a parameter and return a Metadata object. I'd perhaps like to make a few small changes in these areas, but that's outside the scope of this PR; here I just tried to follow the established style.
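As an illustration of that suggestion (hypothetical signatures, not current code):

```python
class Metadata:
    """Stand-in for Meteor's metadata object."""


class Finder:
    def extract_metadata(self, document: "MeteorDocument") -> Metadata:
        """Hypothetical signature: take the document as a parameter and
        return a new Metadata object instead of mutating shared state."""
        metadata = Metadata()
        # ... inspect `document` and fill in the metadata fields ...
        return metadata
```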
The `/` and `/json` methods in `src/routes/extract.py` don't properly declare the parameters supported by each method. Instead, the parameters are read at runtime from the request/form. As a consequence, the Swagger-UI documentation doesn't show the parameters and instead claims that these two methods take no parameters, which also means that they cannot be tested using Swagger-UI. I think I could fix this in a separate PR; a sketch of one possible fix is below.
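Assuming the routes are FastAPI (the mention of auto-generated Swagger-UI suggests so; if it's Flask with a Swagger extension, the fix is analogous), declaring the form fields explicitly would surface them in the generated OpenAPI schema. The field names here are placeholders:

```python
from fastapi import APIRouter, Form

router = APIRouter()


@router.post("/json")
def extract_json(
    fileUrl: str | None = Form(None),  # field names are placeholders, not Meteor's real ones
    backend: str = Form("Finder"),
):
    """Explicitly declared form fields end up in the OpenAPI schema,
    so Swagger-UI can both display and exercise them."""
    ...
```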
Fixes #21