add langchain document support #56

Open
priamai opened this issue Jul 16, 2024 · 3 comments

priamai commented Jul 16, 2024

Description

Love the project!
We need to add a LangChain Document interface, which I am more than happy to do, but I have a few questions first:

  • each node will become a Document
  • the node's content will become the text field
  • the bbox and node_id can be added as metadata

What is the embedding field for? Will it eventually be filled with an OpenAI embedding vector?
What are tokens, and how are they calculated, based on which model? Are you using tiktoken?
Within each node there is something called Lines; is that basically the text split into detected lines?

Cheers.

Filimoa (Owner) commented Jul 16, 2024

Great!

embedding is used for semantic processing (combining chunks by similarity). Yes, it's a vector from OpenAI (long term it may become provider-agnostic).
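
Roughly, the merging step looks something like this (a simplified sketch, not our exact implementation; the threshold value is illustrative):

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def should_merge(emb_a, emb_b, threshold: float = 0.8) -> bool:
    # Combine neighbouring nodes whose embeddings are sufficiently similar.
    return cosine_similarity(np.asarray(emb_a), np.asarray(emb_b)) >= threshold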

Tiktoken is our current method for calculating tokens, since (unfortunately) semantic processing is OpenAI-centric at the moment.
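
For reference, counting tokens with tiktoken looks roughly like this (cl100k_base is just an example encoding; the right one depends on the model):

import tiktoken

# Count tokens the way an OpenAI model would tokenize the text.
enc = tiktoken.get_encoding("cl100k_base")
num_tokens = len(enc.encode("Some node text extracted from a PDF."))
print(num_tokens)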

I wouldn't worry about lines; they're used internally to assemble nodes, and once a node is created they're no longer needed.

Feel free to ask anything else!

priamai (Author) commented Jul 19, 2024

@Filimoa here is a simple loader class that is compatible:

from typing import Iterator

from langchain_core.document_loaders import BaseLoader
from langchain_core.documents import Document

import openparse

class OpenParseDocumentLoader(BaseLoader):
    """A document loader that parses a file with openparse and yields one Document per node."""

    def __init__(self, file_path: str) -> None:
        """Initialize the loader with a file path.

        Args:
            file_path: The path to the file to load.
        """
        self.file_path = file_path

    def lazy_load(self) -> Iterator[Document]:
        """Lazily parse the file, yielding one Document per openparse node."""
        parser = openparse.DocumentParser()
        parsed_basic_doc = parser.parse(self.file_path)

        for node in parsed_basic_doc.nodes:
            yield Document(
                page_content=node.text,
                metadata={
                    "tokens": node.tokens,
                    "num_pages": node.num_pages,
                    "node_id": node.node_id,
                    "start_page": node.start_page,
                    "end_page": node.end_page,
                    "source": self.file_path,
                },
            )

Usage:


from OpenTextLoader import OpenParseDocumentLoader

loader = OpenParseDocumentLoader("./sample_docs/companies-list.pdf")

## Test out the lazy load interface
for doc in loader.lazy_load():
    print()
    print(type(doc))
    print(doc)
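
Since the class subclasses BaseLoader, the eager load() method (which just collects lazy_load() into a list) should work out of the box as well:

## Eager variant inherited from BaseLoader
docs = loader.load()
print(len(docs), "documents loaded")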

Feel free to add to the code base.

ITHealer commented

How do I extract tables and images from a PDF?
