Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ospp/new llm embedding #19727

Open
wants to merge 16 commits into
base: main
Choose a base branch
from

Conversation

charleschile
Copy link
Contributor

@charleschile charleschile commented Oct 31, 2024

User description

What type of PR is this?

  • API-change
  • BUG
  • Improvement
  • Documentation
  • Feature
  • Test and CI
  • Code Refactoring

Which issue(s) this PR fixes:

issue #18664

What this PR does / why we need it:

As part of our document LLM support, we are introducing the LLM_EMBEDDING function. This function can a string or embed the content from a specified txt file by using LLM platforms like Ollama and LLM models like llama3.

Three global variables are introduced for users to customize their own LLM platforms, proxy and models.

Three global variables:

  • llm_embedding_platform: default is ollama
  • llm_server_proxy: default is http://localhost:11434/api/embed
  • llm_model: default is llama3

Usage: llm_embedding(<input txt datalink>); or llm_embedding(<input string>);

Return Value: a vector of 4096 32-bit floating point.

Note:

  • we only support ollama for the platform right now.
  • if you want to use the Llama 3 model, ensure that the model is running by executing ollama run llama3 in the shell before using the embedding function.

Example SQL:

-- datalink
SELECT llm_embedding(cast('file:///Users/charles/Desktop/codes/testData/embeddingTest.txt' as datalink));

-- string
SELECT llm_embedding("This is text... bla ");

Example return:

mysql> SELECT embedding("This is text... bla ");

embedding(This is text... bla )                                                                                                                                                                                                                                |
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
 [-0.021955958, -0.024112409, 0.010994452, -0.011615187, 0.03107838, -0.024938945, -0.029811664, -0.018000955, -0.01751671, -0.0007840055, -0.021467123, -0.014996716, -0.0045815744, -0.026652282, -0.016832603, 0.009745739, -0.016642446, 0.00514828, -0.015859831, -0.010222016, -0.010362088, 0.0056358385, -0.015048031, -0.0031982209, -0.010094931, 0.016682236, 0.008610655, -0.010177104, 0.0035889482, -0.011772896, 0.0020790452, 0.0064958483, 0.007662446, -0.011871371, -0.0060495464, -0.00031009244 |

1 row(s) fetched.

PR Type

Enhancement, Tests


Description

  • Implemented new LLM functions: LLM_CHUNK, LLM_EXTRACT_TEXT, and LLM_EMBEDDING.
  • Added unit and SQL tests for LLM functions, covering chunking, text extraction, and embedding.
  • Integrated Ollama service for embedding operations and defined related system variables.
  • Updated dependencies to include a PDF processing library.

Changes walkthrough 📝

Relevant files
Enhancement
6 files
func_llm.go
Implement LLM functions for chunking, text extraction, and embedding

pkg/sql/plan/function/func_llm.go

  • Implemented LLM_CHUNK, LLM_EXTRACT_TEXT, and LLM_EMBEDDING functions.
  • Added chunking strategies: fixed width, sentence, paragraph, and
    document.
  • Integrated PDF text extraction and embedding operations.
  • Managed file reading and error handling for LLM operations.
  • +471/-0 
    list_builtIn.go
    Register new LLM functions and define overloads                   

    pkg/sql/plan/function/list_builtIn.go

  • Registered new LLM functions: LLM_EXTRACT_TEXT, LLM_CHUNK, and
    LLM_EMBEDDING.
  • Defined overloads and argument types for each function.
  • +76/-2   
    ollama_service.go
    Implement Ollama service interaction for embeddings           

    pkg/sql/plan/function/ollama_service.go

  • Implemented Ollama service interaction for embeddings.
  • Defined request and response structures for Ollama API.
  • +96/-0   
    embedding_service.go
    Define EmbeddingService interface and implement Ollama service

    pkg/sql/plan/function/embedding_service.go

  • Defined EmbeddingService interface and OllamaEmbeddingService.
  • Implemented method to retrieve embeddings using Ollama API.
  • +70/-0   
    variables.go
    Add system variables for LLM embedding configuration         

    pkg/frontend/variables.go

  • Added system variables for LLM embedding configuration.
  • Defined defaults for platform, proxy, and model.
  • +24/-0   
    function_id.go
    Add function IDs for new LLM functions                                     

    pkg/sql/plan/function/function_id.go

  • Added function IDs for new LLM functions.
  • Updated function ID registry with LLM functions.
  • +8/-0     
    Tests
    7 files
    func_llm_test.go
    Add unit tests for LLM chunking and extraction functions 

    pkg/sql/plan/function/func_llm_test.go

  • Added unit tests for LLM chunking and text extraction functions.
  • Tested various chunking modes and error scenarios.
  • +237/-0 
    func_llm_chunk.result
    Add expected results for LLM chunking function tests         

    test/distributed/cases/function/func_llm_chunk.result

  • Added expected results for LLM chunking function tests.
  • Verified chunking outputs for various modes.
  • +30/-0   
    func_llm_chunk.sql
    Add SQL test cases for LLM chunking function                         

    test/distributed/cases/function/func_llm_chunk.sql

  • Added SQL test cases for LLM chunking function.
  • Tested different chunking strategies and data inputs.
  • +24/-0   
    func_llm_extract_file.result
    Add expected results for LLM text extraction function tests

    test/distributed/cases/function/func_llm_extract_file.result

  • Added expected results for LLM text extraction function tests.
  • Verified text extraction from PDF files.
  • +6/-0     
    func_llm_extract_file.sql
    Add SQL test cases for LLM text extraction function           

    test/distributed/cases/function/func_llm_extract_file.sql

  • Added SQL test cases for LLM text extraction function.
  • Tested extraction from PDF files to text files.
  • +2/-0     
    4.txt
    Add text file for LLM chunking test resources                       

    test/distributed/resources/llm_test/chunk/4.txt

    • Added text file for LLM chunking test resources.
    +3/-0     
    func_llm_embedding.sql
    Add SQL test case for LLM embedding function                         

    test/distributed/cases/function/func_llm_embedding.sql

  • Added SQL test case for LLM embedding function.
  • Included a skip directive for known issue.
  • +2/-0     
    Dependencies
    2 files
    go.sum
    Update dependencies for PDF processing                                     

    go.sum

    • Updated dependencies to include PDF processing library.
    +2/-0     
    go.mod
    Add PDF processing library to module dependencies               

    go.mod

    • Added PDF processing library to module dependencies.
    +1/-0     
    Additional files (token-limit)
    1 files
    func_llm_embedding.result
    ...                                                                                                           

    test/distributed/cases/function/func_llm_embedding.result

    ...

    +7/-1     

    💡 PR-Agent usage: Comment /help "your question" on any pull request to receive relevant information

    @matrix-meow matrix-meow added the size/XL Denotes a PR that changes [1000, 1999] lines label Oct 31, 2024
    Copy link
    Contributor

    mergify bot commented Oct 31, 2024

    ⚠️ The sha of the head commit of this PR conflicts with #18470. Mergify cannot evaluate rules on this PR. ⚠️

    Copy link

    PR-Agent was enabled for this repository. To continue using it, please link your git user with your CodiumAI identity here.

    PR Reviewer Guide 🔍

    Here are some key observations to aid the review process:

    ⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
    🧪 PR contains tests
    🔒 Security concerns

    Sensitive information exposure:
    The code includes hardcoded default values for LLM server proxy and model in the pkg/frontend/variables.go file. While these are just defaults, it's important to ensure that production deployments don't accidentally expose internal LLM services to the public internet.

    ⚡ Recommended focus areas for review

    Resource Management
    The defer r.Close() inside the for loop may not close the file descriptor until the function ends. This could lead to resource leaks if processing a large number of rows.

    Error Handling
    Some error cases return errors directly without proper wrapping or context. Consider using moerr.NewInvalidInputf consistently for better error handling and debugging.

    Error Handling
    Error messages in getLLMGlobalVariable function use different error creation methods. Standardize error handling for consistency.

    Copy link

    PR-Agent was enabled for this repository. To continue using it, please link your git user with your CodiumAI identity here.

    PR Code Suggestions ✨

    Explore these optional code suggestions:

    CategorySuggestion                                                                                                                                    Score
    Best practice
    Implement proper error handling for deferred Close() calls to prevent resource leaks and capture important errors

    Consider implementing proper error handling and logging for the defer r.Close()
    statement. Ignoring potential errors from Close() could lead to resource leaks or
    mask important issues.

    pkg/sql/plan/function/func_llm.go [162-166]

     r, err := ReadFromFile(moUrl, fs)
     if err != nil {
       return err
     }
    -defer r.Close()
    +defer func() {
    +  if cerr := r.Close(); cerr != nil {
    +    // Log the error or update the return error if it's nil
    +    if err == nil {
    +      err = cerr
    +    } else {
    +      // Log cerr or combine with existing error
    +    }
    +  }
    +}()
    • Apply this suggestion
    Suggestion importance[1-10]: 9

    Why: Proper error handling for deferred Close() calls is crucial to prevent resource leaks and capture errors that might otherwise be ignored, making this a highly impactful suggestion.

    9
    Performance
    Use a more efficient string concatenation method for improved performance when handling large texts or many chunks

    Consider using a more efficient string concatenation method, such as
    strings.Builder, for building the result string in the ChunkString function,
    especially when dealing with large texts or many chunks.

    pkg/sql/plan/function/func_llm.go [128-133]

    -var chunkStrings []string
    -for _, chunk := range chunks {
    -  chunkStrings = append(chunkStrings, fmt.Sprintf("[%d, %d, %q]", chunk.Start, chunk.Length, chunk.Text))
    +var builder strings.Builder
    +builder.WriteString("[")
    +for i, chunk := range chunks {
    +  if i > 0 {
    +    builder.WriteString(", ")
    +  }
    +  fmt.Fprintf(&builder, "[%d, %d, %q]", chunk.Start, chunk.Length, chunk.Text)
     }
    +builder.WriteString("]")
    +return builder.String(), nil
     
    -return "[" + strings.Join(chunkStrings, ", ") + "]", nil
    -
    • Apply this suggestion
    Suggestion importance[1-10]: 8

    Why: The suggestion to use strings.Builder for string concatenation is a well-known optimization that can significantly improve performance when dealing with large texts or many chunks, making it a valuable enhancement.

    8
    Enhancement
    Implement context-aware file reading to improve resource management and responsiveness

    Consider using a context-aware file reader that respects cancellation signals. This
    can help prevent resource leaks and improve responsiveness when handling large files
    or slow I/O operations.

    pkg/sql/plan/function/func_llm.go [162-172]

    -r, err := ReadFromFile(moUrl, fs)
    +ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
    +defer cancel()
    +r, err := ReadFromFileWithContext(ctx, moUrl, fs)
     if err != nil {
       return err
     }
     defer r.Close()
    -ctx, err := io.ReadAll(r)
    +content, err := io.ReadAll(r)
     if err != nil {
       return err
     }
    • Apply this suggestion
    Suggestion importance[1-10]: 7

    Why: The suggestion to use context-aware file reading can enhance resource management and responsiveness, especially for large files or slow I/O operations. However, it requires significant changes to the existing code, including introducing a timeout and modifying the ReadFromFile function to accept a context.

    7
    Expand test cases for the embedding function to cover more scenarios

    Consider adding more diverse test cases with different input types and lengths to
    thoroughly test the llm_embedding function.

    test/distributed/cases/function/func_llm_embedding.sql [1-2]

     -- @skip:issue#18664
     SELECT llm_embedding("This is text... bla ");
    +SELECT llm_embedding("A longer piece of text to test embedding generation with more context.");
    +SELECT llm_embedding("");  -- Test empty string
    +SELECT llm_embedding(NULL);  -- Test NULL input
    • Apply this suggestion
    Suggestion importance[1-10]: 7

    Why: Adding more diverse test cases for the llm_embedding function is a good suggestion as it would help ensure the function handles various input scenarios, including edge cases like empty strings and NULL values, thereby improving test coverage and reliability.

    7
    Implement a more comprehensive error handling strategy to collect and report multiple errors instead of failing on the first error

    Consider implementing a more robust error handling strategy for the LLMExtractText
    function. Instead of returning early on the first error, collect all errors and
    return them together, possibly using a custom error type or a slice of errors.

    pkg/sql/plan/function/func_llm.go [202-253]

    +var errors []error
     for i := uint64(0); i < rowCount; i++ {
       inputBytes, nullInput := input.GetStrValue(i)
       if nullInput {
         if err := rs.AppendMustNullForBytesResult(); err != nil {
    -      return err
    +      errors = append(errors, fmt.Errorf("row %d: %w", i, err))
         }
         continue
       }
       
       // ... (similar error checking for other inputs)
       
       success, err := extractTextFromPdfAndWriteToFile(moUrl, outputPathUrl, proc)
       if err != nil {
    -    return err
    +    errors = append(errors, fmt.Errorf("row %d: %w", i, err))
    +    continue
       }
       
       // ... (result appending)
     }
    +if len(errors) > 0 {
    +  return fmt.Errorf("multiple errors occurred: %v", errors)
    +}
    • Apply this suggestion
    Suggestion importance[1-10]: 6

    Why: Collecting and reporting multiple errors can improve the robustness of the LLMExtractText function by providing more comprehensive error feedback. However, it adds complexity to error handling and may not be necessary for all use cases.

    6
    Add error handling to the text extraction queries to improve robustness

    Consider adding error handling or validation checks to ensure the PDF files exist
    and are accessible before attempting to extract text.

    test/distributed/cases/function/func_llm_extract_file.sql [1-2]

    -select llm_extract_text(cast('file://$resources/llm_test/extract_text/MODocs1.pdf?offset=0&size=4' as datalink), cast('file://$resources/llm_test/extract_text/MODocs1.txt' as datalink), "pdf");
    -select llm_extract_text(cast('file://$resources/llm_test/extract_text/example.pdf?offset=0&size=4' as datalink), cast('file://$resources/llm_test/extract_text/example.txt' as datalink), "pdf");
    +SELECT CASE
    +  WHEN llm_extract_text(cast('file://$resources/llm_test/extract_text/MODocs1.pdf?offset=0&size=4' as datalink), cast('file://$resources/llm_test/extract_text/MODocs1.txt' as datalink), "pdf") THEN 'Success'
    +  ELSE 'Failure'
    +END AS extraction_result;
     
    +SELECT CASE
    +  WHEN llm_extract_text(cast('file://$resources/llm_test/extract_text/example.pdf?offset=0&size=4' as datalink), cast('file://$resources/llm_test/extract_text/example.txt' as datalink), "pdf") THEN 'Success'
    +  ELSE 'Failure'
    +END AS extraction_result;
    +
    • Apply this suggestion
    Suggestion importance[1-10]: 4

    Why: The suggestion to add error handling for file existence and accessibility is valid and could improve robustness. However, the proposed SQL code does not actually check for file existence or accessibility, it merely wraps the function call in a CASE statement, which doesn't address the underlying issue.

    4

    💡 Need additional feedback ? start a PR chat

    @mergify mergify bot added the kind/feature label Oct 31, 2024
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    kind/feature Review effort [1-5]: 4 size/XL Denotes a PR that changes [1000, 1999] lines
    Projects
    None yet
    Development

    Successfully merging this pull request may close these issues.

    4 participants