WIP: Metadata extraction using LLM API service #29
This draft PR contains an initial rough implementation of the LLM-based metadata extraction described in #21. Because some functionality is still missing and there are uncertainties in the implementation, I'm leaving it as a Draft PR.
The initial prototyping was performed in this Jupyter notebook that has essentially the same functionality, but in this PR the code from the notebook has been retrofitted into the Meteor codebase.
How it works
This code adds a new LLMExtractor class that performs the main work of metadata extraction by calling an LLM API service such as llama.cpp running locally. Here is an outline of the changes:
- `src/settings.py`, `src/util.py` and `.env.example` have been extended to handle the LLM configuration settings LLM_API_URL, LLM_API_KEY and LLM_MODEL
- The LLMExtractor backend is used when the `backend=LLMExtractor` parameter is given in the API method call (see the sketch after this list)
- MeteorDocument has a new method `extract_text_as_json` that returns the text and pdfinfo metadata in a JSON format that the LLMs expect
- `index.html` has a new select element for choosing the backend; the default value is Finder
- A new origin value `LLM` has been added and all information coming from the LLM is tagged as having that origin. The existing values didn't seem to fit because the LLM won't tell (at least currently) where in the document the information came from
- The unit tests mock the LLM API service using `unittest.mock` so that the tests don't have to set up a real LLM service and wait for its responses
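To make the flow concrete, here is a rough sketch of what the LLMExtractor does with these settings. This is illustrative only, not the actual implementation: the function name, the settings import path and the prompt are placeholders; what it relies on is that llama.cpp's llama-server exposes an OpenAI-compatible chat completions endpoint.

```python
import json
import requests

from src import settings  # assumed import path for LLM_API_URL, LLM_API_KEY, LLM_MODEL


def llm_extract(doc_json: str) -> dict:
    """Send the text+pdfinfo JSON to an OpenAI-compatible chat completions
    endpoint (llama.cpp's llama-server provides one) and parse the reply."""
    resp = requests.post(
        f"{settings.LLM_API_URL}/v1/chat/completions",
        headers={"Authorization": f"Bearer {settings.LLM_API_KEY}"},
        json={
            "model": settings.LLM_MODEL,
            "messages": [
                # The real prompt is whatever the fine-tuned model expects.
                {"role": "user", "content": doc_json},
            ],
            "temperature": 0.0,
        },
        timeout=120,
    )
    resp.raise_for_status()
    return json.loads(resp.json()["choices"][0]["message"]["content"])
```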
How it looks

There is a new select element for choosing the backend:
How to test it
1. Install llama.cpp (`git clone` the repository and run `make` to compile it)
2. Run `./llama-server -m Qwen2-0.5B-Instruct-FinGreyLit-Q4_K_M.gguf` and leave it running
3. `export LLM_API_URL=http://localhost:8080` (or edit the `.env` file)
4. Start Meteor and perform an extraction with `backend=LLMExtractor` (for example as sketched below).
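For step 4, the extraction can be triggered from the web UI or programmatically. Here is a hypothetical example of the latter; the host, port and the `fileUrl` field name are assumptions (check `src/routes/extract.py` for the real ones), only `backend=LLMExtractor` is taken from this PR:

```python
import requests

# Hypothetical call to a locally running Meteor instance.
resp = requests.post(
    "http://localhost:8000/json",  # port is an assumption
    data={
        "fileUrl": "https://example.org/some-report.pdf",  # field name is an assumption
        "backend": "LLMExtractor",
    },
    timeout=300,
)
print(resp.json())
```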
Example using a Norwegian document in the English language:
Here is the same metadata as shown in the Meteor web UI:
As far as I can tell, this metadata is correct, except that the LLM for some reason didn't pick up the ISSN on page 3. But this was using the relatively stupid Qwen2-0.5B-based small model, not the larger Mistral-7B-based model that gives much better quality responses.
Here is the same document again, but this time the LLM is the larger Mistral-7B-based model, quantized to the Q6_K GGUF format and running on a V100 GPU using llama.cpp, with all 33 layers offloaded to the GPU, requiring around 12.5 GB of VRAM.
Note that the request now completed in 3.4 seconds (including downloading the PDF) and this time the ISSN was successfully extracted as well.
Missing functionality
Code/implementation issues
`MeteorDocument.extract_text_as_json` is implemented separately from the text extraction that the class already performs, so there is some duplicated code and extra work. It should probably be integrated better with the existing code. One issue is that Meteor by default looks at the first 5 and last 5 pages, while the LLM extractors have been developed using text from the first 8 and last 2 pages. Also, text extraction for the LLMs uses the parameter `sorted=True`, while Meteor doesn't use that option. Such differences in details make it hard to reuse the existing code.
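To illustrate the shape of the problem, here is a self-contained sketch of what `extract_text_as_json` does. It assumes PyMuPDF and a simple payload layout; the real method may differ in both respects:

```python
import json

import fitz  # PyMuPDF -- an assumption; substitute whatever Meteor actually uses


def extract_text_as_json(pdf_path: str) -> str:
    """Sketch: serialize pdfinfo metadata plus the text of the first 8
    and last 2 pages (without duplicates for short documents) as JSON."""
    doc = fitz.open(pdf_path)
    first = list(range(min(8, doc.page_count)))
    last = [p for p in range(max(doc.page_count - 2, 0), doc.page_count)
            if p not in first]
    # sort=True is PyMuPDF's counterpart of the sorted=True option mentioned above
    pages = {str(p + 1): doc[p].get_text(sort=True) for p in first + last}
    return json.dumps({"pdfinfo": doc.metadata, "pages": pages})
```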
Other potential problems

`_underscore_methods` (or `__double_underscore_methods`) and underscore variables, used to signal that some information is internal to a class, aren't used consistently. Also, some classes have the habit of mutating objects and accessing them from inside another class when they could simply pass them around instead; as an example, `Finder.extract_metadata` could be given a MeteorDocument as a parameter and return a Metadata object. I'd perhaps like to make a few small changes in these areas, but that's outside the scope of this PR; here I just tried to follow the established style.
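As an illustration of that suggestion (hypothetical signatures, not current code):

```python
class Metadata:
    """Stand-in for Meteor's metadata object."""


class Finder:
    def extract_metadata(self, document: "MeteorDocument") -> Metadata:
        """Hypothetical signature: take the document as a parameter and
        return a new Metadata object instead of mutating shared state."""
        metadata = Metadata()
        # ... inspect `document` and fill in the metadata fields ...
        return metadata
```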
The `/` and `/json` methods in `src/routes/extract.py` don't properly declare the parameters supported by each method. Instead, the parameters are read at runtime from the request/form. As a consequence, the Swagger-UI documentation doesn't show the parameters and instead claims that these two methods take no parameters, which also means that they cannot be tested using Swagger-UI. I think I could fix this in a separate PR; a sketch of one possible fix is below.
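Assuming the routes are FastAPI (the mention of auto-generated Swagger-UI suggests so; if it's Flask with a Swagger extension, the fix is analogous), declaring the form fields explicitly would surface them in the generated OpenAPI schema. The field names here are placeholders:

```python
from fastapi import APIRouter, Form

router = APIRouter()


@router.post("/json")
def extract_json(
    fileUrl: str | None = Form(None),  # field names are placeholders, not Meteor's real ones
    backend: str = Form("Finder"),
):
    """Explicitly declared form fields end up in the OpenAPI schema,
    so Swagger-UI can both display and exercise them."""
    ...
```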
Fixes #21