add langchain document support #56

Open
priamai opened this issue Jul 16, 2024 · 3 comments

priamai commented Jul 16, 2024

Description

Love the project!
We need to add a LangChain Document interface, which I am more than happy to do, but I have a few questions first:

  • each node will become a Document
  • the node's content will become the text field
  • the bbox and node_id can be added as metadata

What is the embedding field for? Will it eventually be filled with an OpenAI embedding vector?
What are tokens, and how are they calculated, based on which model? Are you using tiktoken?
Within each node there is something called Lines; is that basically the text split into detected lines?

Cheers.

Filimoa (Owner) commented Jul 16, 2024

Great!

embedding is used for semantic processing (combining chunks by similarity). Yes, it's a vector from OpenAI (long term it may become provider-agnostic).
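
Roughly, the merging step looks something like this (a simplified sketch, not our exact implementation; the threshold value is illustrative):

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def should_merge(emb_a, emb_b, threshold: float = 0.8) -> bool:
    # Combine neighbouring nodes whose embeddings are sufficiently similar.
    return cosine_similarity(np.asarray(emb_a), np.asarray(emb_b)) >= threshold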

Tiktoken is our current method for calculating tokens, since (unfortunately) semantic processing is OpenAI-centric at the moment.
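
For reference, counting tokens with tiktoken looks roughly like this (cl100k_base is just an example encoding; the right one depends on the model):

import tiktoken

# Count tokens the way an OpenAI model would tokenize the text.
enc = tiktoken.get_encoding("cl100k_base")
num_tokens = len(enc.encode("Some node text extracted from a PDF."))
print(num_tokens)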

I wouldn't worry about lines; they're used internally to assemble nodes, and once a node is created they're no longer needed.

Feel free to ask anything else!

priamai (Author) commented Jul 19, 2024

@Filimoa here is a simple loader class that is compatible:

from typing import Iterator

from langchain_core.document_loaders import BaseLoader
from langchain_core.documents import Document

import openparse

class OpenParseDocumentLoader(BaseLoader):
    """A document loader that parses a file with openparse and yields one Document per node."""

    def __init__(self, file_path: str) -> None:
        """Initialize the loader with a file path.

        Args:
            file_path: The path to the file to load.
        """
        self.file_path = file_path

    def lazy_load(self) -> Iterator[Document]:
        """Lazily parse the file, yielding one Document per openparse node."""
        parser = openparse.DocumentParser()
        parsed_basic_doc = parser.parse(self.file_path)

        for node in parsed_basic_doc.nodes:
            yield Document(
                page_content=node.text,
                metadata={
                    "tokens": node.tokens,
                    "num_pages": node.num_pages,
                    "node_id": node.node_id,
                    "start_page": node.start_page,
                    "end_page": node.end_page,
                    "source": self.file_path,
                },
            )

Usage:


from OpenTextLoader import OpenParseDocumentLoader

loader = OpenParseDocumentLoader("./sample_docs/companies-list.pdf")

## Test out the lazy load interface
for doc in loader.lazy_load():
    print()
    print(type(doc))
    print(doc)
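
Since the class subclasses BaseLoader, the eager load() method (which just collects lazy_load() into a list) should work out of the box as well:

## Eager variant inherited from BaseLoader
docs = loader.load()
print(len(docs), "documents loaded")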

Feel free to add to the code base.

ITHealer commented

How do I extract tables and images from a PDF?
