getting upstream changes #1
Open: ptorru wants to merge 537 commits into octoml:octoai from Unstructured-IO:main
…ing (#3287)

## Summary
This PR addresses an issue where the code could attempt to run `soffice` in multiple processes; closes #3284. The fix is to add a wait mechanism when another `soffice` process is already running.

## Diagnosis of issue
- `soffice` can only have one process running when using the command `soffice` as is.
- On the main branch the function `partition.common.convert_office_doc` simply spawns a subprocess to run the `soffice` command to convert a `doc` or `ppt` file into `docx` or `pptx` format.
- If there are multiple partition calls processing `doc` or `ppt` files and they all want to spawn `soffice` subprocesses, only one will succeed while the other processes simply fail and return 1 from the subprocess.
- Downstream this leads to errors like `PackageNotFoundError: Package not found at '/tmp/tmpac6lcu4w/document.docx'`.

## Solution
While there are [ways](https://www.reddit.com/r/libreoffice/comments/agk3os/how_to_open_more_than_one_calc_instance_under/) to circumvent the limit of `soffice` by setting a tmp file as the user installation env, these kinds of solutions rely on the internals of `soffice` and add maintenance cost to track its changes. This PR solves the problem by adding a wait mechanism (sketched below):
- We first spawn a subprocess to run `soffice`.
- If the `stdout` is empty and we still have wait-time budget left, the function first checks whether another `soffice` process is running:
  * if yes, the function waits for 0.01s before checking again;
  * if no, the function spawns a subprocess to run `soffice` and returns to the beginning of this step;
  * we need to return to the beginning and check whether `stdout` is empty because we could hit another collision right after `soffice` becomes available.

## Test
This PR adds two unit tests. Additionally this can be tested by running partition of `.doc` files locally with multiprocessing.
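A minimal sketch of the wait-and-retry approach described above. The helper names, the use of `psutil` to detect a running `soffice` process, and the budget values are assumptions for illustration, not the actual code in `unstructured.partition.common`:

```python
import subprocess
import time

import psutil  # assumed here as a convenient way to detect a running soffice process


def _soffice_running() -> bool:
    """Return True if another soffice process is currently running."""
    return any("soffice" in (p.info["name"] or "") for p in psutil.process_iter(["name"]))


def convert_with_wait(command: list[str], wait_budget: float = 30.0, poll: float = 0.01) -> None:
    """Run the soffice conversion, retrying while another soffice instance blocks it."""
    deadline = time.monotonic() + wait_budget
    result = subprocess.run(command, capture_output=True, text=True)
    while not result.stdout.strip() and time.monotonic() < deadline:
        if _soffice_running():
            # another conversion is in flight: back off briefly and re-check
            time.sleep(poll)
        else:
            # soffice became available: try again, then re-check stdout, since we could
            # collide with another process that started in the meantime
            result = subprocess.run(command, capture_output=True, text=True)
    if not result.stdout.strip():
        raise RuntimeError("soffice conversion did not succeed within the wait budget")
```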
Moved the numpy pin to `base.in` where it will be picked up by packaging.

Side note: `constraints.txt` (formerly `constraints.in`) is a really useful pattern: you put a constraint there, add that file as a `-c` requirement in other files, and the constraint will be applied when pip-compiling *only when needed*, because the library is required by something else. Neat! Unfortunately, I've never found a similar pattern for packaging, so any pins we want to propagate to user installs need to be explicitly placed in the `.in` files.

So what is `constraints.txt` really doing for us? In the past there have been instances where something is temporarily broken in an upstream dependency but we expect it to be patched soon; in the meantime we want things to work in our CI builds and development installs, so it's not worth pinning everywhere it's used. Having said that, I'm coming to the conclusion that `constraints.txt` causes more harm than good in the confusion it causes WRT packaging -- maybe we should remove that pattern at some point.
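For illustration, the pip-tools pattern described above looks roughly like this; the file names and the `numpy<2` pin are hypothetical, not the repo's actual contents:

```
# constraints.txt -- applied during pip-compile only if something else requires the library
numpy<2

# base.in -- pins placed here are explicit, so they propagate to user installs via packaging
-c constraints.txt
numpy<2
```

Running `pip-compile base.in` then resolves `base.txt` with the constraint applied.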
**Summary**
Remedy gap where `strategy` argument passed to `partition()` was not forwarded to `partition_doc()` or `partition_odt()` and so was not making its way to `partition_docx()`.
This PR adds new capabilities for drawing bboxes for each layout (extracted, inferred, OCR, and final) plus an OD model output dump as a JSON file for better analysis.

---------
Co-authored-by: Christine Straub <[email protected]>
Co-authored-by: Michal Martyniak <[email protected]>
### Description
Isolate all log statements that happen per record and make them debug level to avoid bloating the console output.
### Summary
Bumps to the latest `langchain-community` version to resolve [CVE-2024-2965](https://nvd.nist.gov/vuln/detail/CVE-2024-2965).

---------
Co-authored-by: ryannikolaidis <[email protected]>
Co-authored-by: MthwRobinson <[email protected]>
### Description
Migrate the onedrive source connector to v2, adding richer content pulled from the SDK response to provide further metadata on the FileData produced by the indexer.

---------
Co-authored-by: ryannikolaidis <[email protected]>
Co-authored-by: rbiseck3 <[email protected]>
**Summary**
The `python-docx` error `docx.opc.exceptions.PackageNotFoundError` arises both when no file exists at the given path and when the file exists but is not a ZIP archive (and so is not a DOCX file). This ambiguity is unwelcome when diagnosing the error, as the two possible conditions generally indicate different courses of action to resolve it. Add detailed validation to `DocxPartitionerOptions` to distinguish these two cases and provide more precise exception messages.

**Additional Context**
- `python-pptx` shares the same OPC-package (file) loading code used by `python-docx`, so the same ambiguity will be present in `python-pptx`.
- It would be preferable for this distinguished exception behavior to be upstream in `python-docx` and `python-pptx`. If we're willing to take the version bump it might be worth considering doing that instead.
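A rough sketch of the kind of validation described above. The function name and messages are illustrative, not the actual `DocxPartitionerOptions` code:

```python
import os
import zipfile


def validate_docx_path(file_path: str) -> None:
    """Distinguish 'no such file' from 'file is not a ZIP/DOCX' before python-docx sees it."""
    if not os.path.isfile(file_path):
        raise FileNotFoundError(f"no such file or directory: {file_path!r}")
    if not zipfile.is_zipfile(file_path):
        raise ValueError(
            f"{file_path!r} is not a ZIP archive (so not a DOCX file); "
            "it may have the wrong extension or be corrupt"
        )
```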
### Description
Using an `isinstance` check on the destination registry mapping breaks when inheritance is used for the associated uploader types. This adds a connector type field to all uploaders so that the entry can be deterministically fetched when running the check for the associated stager in the pipeline.
### Summary
Release for `0.14.9`.
Migrates the OpenSearch destination connector to V2. Relies heavily on the Elasticsearch connector where possible (this is expected).
### Summary
Adds links to the serverless API. README updates look like the following:
<img width="904" alt="image" src="https://github.com/Unstructured-IO/unstructured/assets/1635179/fcb2b0c5-0dff-4612-8f18-62836ca6de8b">
When we switched community Slack from paid to free we lost the CI test bot. Also, since messages are deleted after 90 days, our expected test data will disappear.
- Created a new bot in our paid company Slack (test_unstructured_ingest_bot).
- Added a new private channel (test-ingest).
- Invited the bot to the channel.
- Adjusted the end datetime of the test to cover the first few messages in the channel.

Still to do:
- Update the CI secrets with the new bot token.
- Update LastPass with the new bot token (I don't have write access... :().
…3310)

### Summary
Updates to the latest version of the `wolfi-base` image. Changes include:
- Version bumps to address CVEs.
- `libreoffice` is now included in the `arm64` image, so `.doc` files are now supported for `arm64`. `.ppt` files do not work with the `libreoffice` package currently available on `wolfi-os`; we have follow-on work to look into that.
- Updates the location of the `tesseract` `tessdata` files on the `arm64` build. Closes #3290.
- Closes #3319 and adds `psutil` to the base dependencies.

### Testing
- `test_dockerfile` should continue to pass with the updates.
Updates the OpenSearch source connector to v2. Leverages Elasticsearch v2 heavily. Expected tests are renamed because that's how Elasticsearch names them.
This PR adds a V2 version of the Pinecone destination connector
…#3300)

This pull request fixes the table-counting metric for three cases:
- False negatives: when a table exists in the ground truth but none of the predicted tables matches it, the table should count as 0 and the file should not be completely skipped (before it was np.NaN).
- False positives: when a predicted table doesn't match any ground-truth table, it should be counted as 0; right now it is skipped in processing (matched_indices == -1).
- The file should be completely skipped only if there are no tables in the ground truth and none in the prediction.

In short, the previous metric calculation didn't account for OD mistakes. A sketch of these counting rules follows below.
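A hedged sketch of the counting rules described above, assuming `matched_indices` holds one ground-truth index per prediction with -1 for unmatched predictions; the function name and skip signal are illustrative only:

```python
import numpy as np


def table_counting_scores(matched_indices: np.ndarray, n_ground_truth: int):
    """Per-table scores for one file; returns None when the file should be skipped entirely."""
    if n_ground_truth == 0 and len(matched_indices) == 0:
        return None  # no tables in ground truth or prediction: skip the file
    matched_gt = {int(i) for i in matched_indices if i >= 0}
    scores = []
    # false negatives: ground-truth tables that no prediction matched count as 0
    scores += [0.0 for gt_idx in range(n_ground_truth) if gt_idx not in matched_gt]
    # false positives: predictions that matched nothing (matched_indices == -1) also count as 0
    scores += [0.0 for idx in matched_indices if idx == -1]
    # matched pairs would contribute their actual table-structure scores here
    return scores
```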
### Description
This PR handles two things:
* Exposing all the connectors via the connector registries by simply importing the connector module. This should be safe, assuming all connector-specific dependencies are themselves imported in the methods where they are used and wrapped in the `@requires_dependencies` decorator.
* Removing any import that pulls from the v2 ingest.cli package.
This PR provides support for the V2 MongoDB destination connector.
Change the unstructured-client pin to set a minimum version instead of a maximum version and run `make pip-compile`. Integration tests that were dependent on the old version of the client are removed. These tests should be replicated in/moved to the SDK repo(s).
### Description
Adds a [SingleStore](https://www.singlestore.com/) database destination connector with an associated ingest test.
### Description
Allow users to pass in a reference to a custom-defined stager via the CLI. Checks are run to verify that the instance passed in is a subclass of the UploadStager interface.
The purpose of this PR is to help investigate #3202.
## Summary
Version bumps for 2024-07-08.
This pull request adds table detection metrics. One case I considered:

Case: two tables are predicted and matched with one table in the ground truth.
Question: is this matching correct in both cases or just for one table?

There are two subcases:
- the table was predicted by OD as two sub-tables (split in half, so there are two non-overlapping sub-tables) -> in my opinion both are correct
- it is a false positive from the table-matching script in get_table_level_alignment -> 1 good, 1 wrong

As we don't have bounding boxes, I followed the notebook calculation script and assumed the pessimistic, second subcase.
The table metric considering spans is not used and it messes with the output, so I have removed it from the code. I have left table_as_cells in the source code, though; it may still be useful for users.
**Summary**
In preparation for further work on auto-partitioning (`partition()`), improve typing and organize `test_auto.py` by introducing categories.
### Summary
Addresses [CVE-2024-39705](https://nvd.nist.gov/vuln/detail/CVE-2024-39705), which highlights the risk of remote code execution when running `nltk.download`. Removes `nltk.download` in favor of a `.tgz` file with the appropriate NLTK data files, checking the SHA256 hash to validate the download. An error is now raised if `nltk.download` is invoked. The logic for determining the NLTK download directory is borrowed from `nltk`, so users can still set `NLTK_DATA` as they did previously.

### Testing
1. Create a directory called `~/tmp/nltk_test`. Set `NLTK_DATA=${HOME}/tmp/nltk_test`.
2. From a Python interactive session, run:
   ```python
   from unstructured.nlp.tokenize import download_nltk_packages

   download_nltk_packages()
   ```
3. Run `ls /tmp/nltk_test/nltk_data`. You should see the downloaded data.

---------
Co-authored-by: Steve Canny <[email protected]>
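A minimal sketch of the download-and-verify approach described in the summary above. The function name, URL, and hash are placeholders, not the real values or code used by `unstructured.nlp.tokenize`:

```python
import hashlib
import tarfile
import urllib.request
from pathlib import Path

NLTK_DATA_URL = "https://example.com/nltk_data.tgz"  # placeholder URL
EXPECTED_SHA256 = "0" * 64                           # placeholder hash


def fetch_nltk_data(target_dir: str) -> None:
    """Download the packaged NLTK data, verify its SHA256, then unpack it."""
    archive = Path(target_dir) / "nltk_data.tgz"
    archive.parent.mkdir(parents=True, exist_ok=True)
    urllib.request.urlretrieve(NLTK_DATA_URL, archive)

    # refuse to unpack anything whose hash does not match the expected value
    sha256 = hashlib.sha256(archive.read_bytes()).hexdigest()
    if sha256 != EXPECTED_SHA256:
        raise ValueError(f"SHA256 mismatch for {archive}: got {sha256}")

    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(path=target_dir)
```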
This pull request adds NLTK data to the Docker image by pre-packaging the data to ensure a more reliable and efficient deployment process, as the required NLTK resources are readily available within the container.

**Current updated solution:**
- Dockerfile update: integrated the NLTK data directly into the Docker image, ensuring that the API can operate independently of external data sources. The data is stored at /home/notebook-user/nltk_data.
- Environment variable setup: configured the NLTK_PATH environment variable, enabling Python scripts to automatically locate and use the embedded NLTK data. This eliminates the need for manual configuration in deployment environments.
- Code cleanup: removed outdated code in tokenize.py and related scripts that previously downloaded NLTK data from S3. This streamlines the codebase and removes unnecessary dependencies.
- Script updates: updated tokenize.py and test_tokenize.py to use the NLTK_PATH variable, ensuring consistent access to the embedded data across all environments.
- Dependency elimination: fully eliminated reliance on the S3 bucket for NLTK data, mitigating risks from network failures or access changes.
- Improved system reliability: by embedding assets within the Docker image, the API now has a self-contained setup that ensures consistent behavior regardless of deployment location.
- Updated the Dockerfile to copy the local NLTK data to the appropriate directory within the container.
- Adjusted the application setup to verify the presence of NLTK assets during the container build process.
This change adds the ability to filter out characters predicted by Tesseract with low confidence scores.

Some notes:
- I intentionally disabled it by default; I think some low threshold (like 0.9-0.95 for Tesseract) could be a safe choice, though.
- I wanted to use character bboxes and combine them into word bboxes later. However, in some specific scenarios a bug in Tesseract returns incorrect character bboxes (unit tests caught it 🥳). More in a comment in the code.
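An illustrative sketch of confidence-based filtering using `pytesseract` word-level output. The function name, the 0-1 threshold scale, and mapping it onto Tesseract's 0-100 confidence values are assumptions, not the library's actual implementation:

```python
import pytesseract
from PIL import Image
from pytesseract import Output


def ocr_words_above_confidence(image_path: str, min_confidence: float = 0.0) -> list[dict]:
    """Return OCR words whose Tesseract confidence meets the threshold (threshold in 0-1)."""
    data = pytesseract.image_to_data(Image.open(image_path), output_type=Output.DICT)
    words = []
    for text, conf, left, top, width, height in zip(
        data["text"], data["conf"], data["left"], data["top"], data["width"], data["height"]
    ):
        # Tesseract reports confidence on a 0-100 scale (-1 for non-word boxes)
        if text.strip() and float(conf) >= min_confidence * 100:
            words.append({"text": text, "bbox": (left, top, left + width, top + height)})
    return words
```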
Co-authored-by: Kamil Plucinski <[email protected]>
This PR fixes a bug when using `partition` to partition an email with image attachments with `hi_res` and table structure inference enabled -> the partitioning of the image would hit a value error: `got multiple values for keyword argument 'infer_table_structure'`.

This is because when we pass `kwargs` into partitioning "other" types of files in this [block](https://github.com/Unstructured-IO/unstructured/blob/50ea6fe7fc324efa09398898dc35d0cd4e78b1cf/unstructured/partition/auto.py#L270-L280), `infer_table_structure` is packaged into `partitioning_kwargs`. Then, for email at least, when there are attachments that can be partitioned with `hi_res` we pass that dict of `kwargs` right back into the `partition` entry point -> so when we get [here](https://github.com/Unstructured-IO/unstructured/blob/50ea6fe7fc324efa09398898dc35d0cd4e78b1cf/unstructured/partition/auto.py#L222-L235) we are both specifying `infer_table_structure` explicitly and passing it inside the `kwargs` variable.

The fix is to first detect whether `kwargs` already contains `infer_table_structure` and, if so, use that value and pop it from `kwargs` (see the sketch below).

---------
Co-authored-by: Kamil Plucinski <[email protected]>
Co-authored-by: christinestraub <[email protected]>
Co-authored-by: ryannikolaidis <[email protected]>
Co-authored-by: christinestraub <[email protected]>
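A small sketch of the described fix; the wrapper and downstream function names are hypothetical, not the actual `partition` code:

```python
def do_partition(filename: str, **kwargs) -> dict:
    """Stand-in for the downstream partitioner that accepts explicit keyword arguments."""
    return {"filename": filename, **kwargs}


def partition_attachment(filename: str, infer_table_structure: bool = False, **kwargs) -> dict:
    # If infer_table_structure was already packed into kwargs (as happens when an email's
    # attachment kwargs are passed back into partition), prefer that value and pop it so
    # it is not supplied both explicitly and via **kwargs.
    infer_table_structure = kwargs.pop("infer_table_structure", infer_table_structure)
    return do_partition(filename, infer_table_structure=infer_table_structure, **kwargs)
```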
Update `unstructured-inference` to 0.8.6 in the requirements (`extra-pdf-image.in`); 0.8.6 has pdfminer 20240706 (a newer version).
This PR releases 0.16.5, which has the updates below:
- **Update `unstructured-inference`** to 0.8.6 in requirements, which removed `layoutparser` dependency libs
- **Update `pdfminer-six` to 20240706**
…3881)

This PR refactors the data structure for `list[LayoutElement]` and `list[TextRegion]` used in partitioning pdf/image files.
- The new data structure replaces a list of objects with one object holding `numpy` arrays to store the data.
- This only affects partition-internal steps; it doesn't change the input or output signature of the `partition` function itself, i.e., `partition` still returns `list[Element]`.
- Internally `list[LayoutElement]` -> `LayoutElements`; `list[TextRegion]` -> `TextRegions`.
- The current refactor stops before the cleanup of pdfminer elements inside inferred layout elements -> the cleanup algorithm needs to be refactored before the data structure refactor can move forward. So the current refactor converts the array data structure back into a list data structure with the `element_array.as_list()` call. This is the last step before turning `list[LayoutElement]` into `list[Element]` as the return value; a future PR will update this last step so that we build `list[Element]` from the `LayoutElements` data structure instead.

The goal of this PR is to replace the data structure as much as possible without changing the underlying logic. There are a few places where the slicing or filtering logic was simple enough to be converted into vector data structure operations; those are refactored to be vector based. As a result, there are some small improvements observed in the ingest tests, likely because the vector operations cleaned up some previous inconsistency in data types and operations. A simplified sketch of the idea follows below.

---------
Co-authored-by: ryannikolaidis <[email protected]>
Co-authored-by: badGarnet <[email protected]>
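Roughly, the idea is to store all elements of a page column-wise in arrays instead of one Python object per element. A simplified sketch, assuming hypothetical attribute names and not the actual `LayoutElements` class:

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class SimpleLayoutElements:
    """Column-oriented storage: one row per element, one array per attribute."""

    coords: np.ndarray         # shape (n, 4): x1, y1, x2, y2
    texts: np.ndarray          # shape (n,): element texts
    element_types: np.ndarray  # shape (n,): e.g. "Table", "Title"

    def slice(self, mask: np.ndarray) -> "SimpleLayoutElements":
        """Vectorized filtering instead of a Python-level list comprehension."""
        return SimpleLayoutElements(self.coords[mask], self.texts[mask], self.element_types[mask])

    def as_list(self) -> list[dict]:
        """Convert back to per-element records for the legacy list-based code path."""
        return [
            {"bbox": tuple(c), "text": t, "type": k}
            for c, t, k in zip(self.coords.tolist(), self.texts.tolist(), self.element_types.tolist())
        ]
```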
### Description
Avoid using the ndjson dependency due to its restrictive license.
…ly repairing PDFs with long content streams, causing needless and endless OCR (#3822)

Fixes: #3815

Verified on my very large documents that it doesn't unnecessarily and unsuccessfully "repair" them. You may or may not wish to keep the version check in `patch_psparser`. Since ~you're pinning the version of pdfminer.six and since it isn't guaranteed that the bug in question will be fixed in the next pdfminer.six release (but it is rather serious, so I should hope so), then perhaps you just want to unconditionally patch it.~ it seems like pinning of versions is only operative when running from Docker (good!) so never mind! Keep that version check!

Also corrected an import so that if you do feel like using a newer version of pdfminer.six, it won't break on you.

---------
Authored-by: David Huggins-Daines <[email protected]>
**Add auto-download for NLTK for the Python environment**

When a user imports `tokenize`, it will automatically download the NLTK data.
- Added an `AUTO_DOWNLOAD_NLTK` flag in `tokenize.py` to download `NLTK_DATA`.
E.g., you can now run:
```bash
# extracts base64 encoded image data for `Table` and `Image` elements
$ unstructured-get-json.sh --trace --verbose --images /t/docs/Captur-1317-5_ENG-p5.pdf

# also extracts `Title` elements (see screenshot)
$ IMAGE_BLOCK_TYPES='"title","table","image"' unstructured-get-json.sh --trace --verbose --images /t/docs/Captur-1317-5_ENG-p5.pdf
```
It was discovered during testing that "narrativetext" does not work, probably due to camel casing of NarrativeText 😬

![image](https://github.com/user-attachments/assets/e6414a57-81e1-4560-b1b2-dce3b1c2c804)
Using VoyageAI's python package directly, allowing more features than through langchain
This PR fixes a bug in `build_layout_elements_from_ocr_regions` where texts are joined in incorrect order. The bug is due to incorrect masking of the `ocr_regions` after some are already selected as one of the final groups. The fix uses a simpler method to mask the indices: use the same indices that added the regions to the final groups to mask them, so they are not considered again (see the sketch below).

## Testing
This PR adds a unit test specifically aimed at this bug. Without the fix the test would fail. Additionally, any PDF file with repeated texts has the potential to trigger this bug. E.g., create a simple pdf using the test text
```python
"LayoutParser: \n\nA Unified Toolkit for Deep Learning Based Document Image\n\nLayoutParser for Deep Learning"
```
and partitioning with `ocr_only` mode on the main branch would hit this bug and output text where the position of the second "LayoutParser" is incorrect:
```python
[
    'LayoutParser:',
    'A Unified Toolkit for Deep Learning Based Document Image',
    'for Deep Learning LayoutParser',
]
```
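A tiny sketch of the masking idea described above, in illustrative numpy only (the variable names and sizes are made up, not the library's code):

```python
import numpy as np

n_regions = 6
available = np.ones(n_regions, dtype=bool)  # regions not yet assigned to a final group

# indices chosen for the current final group (however they were selected)
group_indices = np.array([1, 3, 4])

# the fix: mask out exactly the indices that were added to the group,
# so the same regions cannot be picked again in a later iteration
available[group_indices] = False
```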
Co-authored-by: Yao You <[email protected]>
I noticed that `make tidy` wasn't working in my development environment. This happens if you, a developer, forget to follow the specific instructions in `README.md` and install exactly the right versions of the necessary tools, including a *quite old* version of Ruff. This version will nonetheless warn you:

    warning: `ruff <path>` is deprecated. Use `ruff check <path>` instead.

So this fixes that, in order to future-proof and avoid confusion!
Although Python 3.13 is not officially supported or tested in CI just yet.
Small minor-version change to trigger workflows and fix the open CVEs we had.
- There is a bug in deciding whether a page has tables before performing table extraction. This logic checks if the id associated with a Table-type element is truthy; however, it should check whether the id is not `None`, because the id can sometimes be 0 (the first element of that type on the page).
- The fix updates the logic (see the minimal illustration below).
- Adds a unit test for this specific case.
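The gist of the fix, as a minimal illustration (variable names are hypothetical):

```python
table_element_id = 0  # a perfectly valid id for the first element on the page

# buggy check: treats id 0 the same as "no table on this page"
page_has_tables_buggy = bool(table_element_id)

# fixed check: only a missing id means there is no table
page_has_tables_fixed = table_element_id is not None

assert page_has_tables_buggy is False and page_has_tables_fixed is True
```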
Resolves #3791 by setting a default timeout of 10 seconds.
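For example, a default timeout with `requests` looks like this; this is a sketch assuming the underlying HTTP call uses `requests`, and the actual call site in the fix may differ:

```python
import requests

DEFAULT_TIMEOUT_SECONDS = 10

# without a timeout, a stalled server can hang the request indefinitely
response = requests.get("https://example.com/resource", timeout=DEFAULT_TIMEOUT_SECONDS)
```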
This PR:
- Fixes removing HTML tags that exist in <td> cells.
- A stripping function was in general problematic to implement in an easy and straightforward way (you can't modify `descendants` in place). So instead of patching something in the table cell, I added stripping everywhere in the same consistent way. This is why some tests needed small edits removing one whitespace in each tag. I believe this won't cause any problems for downstream tasks.

Tested HTML:
```html
<table class="Table">
  <tbody>
    <tr>
      <td colspan="2"> Some text </td>
      <td>
        <input checked="" class="Checkbox" type="checkbox"/>
      </td>
    </tr>
  </tbody>
</table>
```

Before & After
```html
'<table class="Table" id="..."> <tbody> <tr> <td colspan="2">Some text</td><td></td></tr></tbody></table>'
'<table class="Table" id="..."><tbody><tr><td colspan="2">Some text</td><td><input checked="" type="checkbox"/></td></tr></tbody></table>'
```
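A hedged sketch of consistent whitespace stripping on text nodes with BeautifulSoup; this is illustrative only and not the library's actual implementation:

```python
from bs4 import BeautifulSoup

html = '<table class="Table"><tbody><tr><td colspan="2"> Some text </td></tr></tbody></table>'
soup = BeautifulSoup(html, "html.parser")

# collect text nodes first: descendants cannot be modified while iterating over them
for text_node in list(soup.find_all(string=True)):
    stripped = text_node.strip()
    if stripped:
        text_node.replace_with(stripped)
    else:
        text_node.extract()  # drop whitespace-only nodes

print(soup)  # <table class="Table"><tbody><tr><td colspan="2">Some text</td></tr></tbody></table>
```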
#### Summary
A recent security review showed that it was possible to partition arbitrary local files in cases where the filetype supports an "include" functionality that brings in the content of files external to the partitioned file. This affects `rst` and `org` files.

#### Fix
This PR fixes the above issue by passing the parameter `sandbox=True` in all cases where `pypandoc.convert_file` is called. Note I also added the parameter to a call to this method in the ODT code. I haven't investigated whether there was a security issue with ODT files, but it seems better to use pandoc in sandbox mode given the security issues we know about.

#### Testing
To verify that the tests added with this PR find the relevant issue:
- Remove the `sandbox=True` text from `unstructured/file_utils/file_conversion.py` line 17.
- Run the tests `test_unstructured.partition.test_rst.test_rst_wont_include_external_files` and `test_unstructured.partition.test_org.test_org_wont_include_external_files`. Both should fail due to the partitioned output containing the word "wombat", which only appears in a file external to the partitioned file.
- Add the parameter back in, and the tests pass.
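The change amounts to passing `sandbox=True` wherever `pypandoc.convert_file` is called; a minimal example, with the file name and output format chosen only for illustration:

```python
import pypandoc

# sandbox=True runs pandoc in sandboxed mode, so directives like RST's
# ".. include::" cannot pull in arbitrary local files during conversion
html = pypandoc.convert_file("document.rst", "html", format="rst", sandbox=True)
```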
This PR rewrites the logic in `unstructured_inference` that merges the extracted layout with the inferred layout, using vectorized operations. The goals are to:
- vectorize the operation to improve memory and CPU efficiency
- apply the logic equally, without order being a factor in determining the merging results (the `unstructured_inference` version uses loops and modifies the content of the inner loop on the fly, so the order of the outer loop, which is the order of the extracted elements, becomes a factor)
- rewrite the loop into clear steps with clear rules
- set the stage for follow-up improvements

While this PR aims to reproduce the existing behavior as much as possible, it is not an exact replica of the looped version. Because order is no longer a factor, some extracted elements that used to not be considered part of a larger inferred element (due to the processing order not being optimal) are now properly merged. This led to changes in one ingest test. For example, the change shows that we now properly merge the section number with the section title into the full title element. A simplified sketch of the vectorized idea follows below.

## Test
Since the goal of this refactor is to preserve as much existing behavior as possible, we rely on existing tests. As mentioned above, the one file that changed output during the ingest test is a net positive change.

---------
Co-authored-by: ryannikolaidis <[email protected]>
Co-authored-by: badGarnet <[email protected]>
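The flavor of the vectorized approach, sketched with numpy broadcasting; this is a simplification under an assumed containment rule and threshold, not the actual merge rules in `unstructured_inference`:

```python
import numpy as np


def containment_matrix(extracted: np.ndarray, inferred: np.ndarray, threshold: float = 0.9) -> np.ndarray:
    """Boolean matrix: entry (i, j) is True when extracted box i sits mostly inside inferred box j."""
    # boxes are rows of (x1, y1, x2, y2); broadcast to shape (n_extracted, n_inferred)
    ex = extracted[:, None, :]
    inf = inferred[None, :, :]
    ix1 = np.maximum(ex[..., 0], inf[..., 0])
    iy1 = np.maximum(ex[..., 1], inf[..., 1])
    ix2 = np.minimum(ex[..., 2], inf[..., 2])
    iy2 = np.minimum(ex[..., 3], inf[..., 3])
    inter = np.clip(ix2 - ix1, 0, None) * np.clip(iy2 - iy1, 0, None)
    ex_area = (ex[..., 2] - ex[..., 0]) * (ex[..., 3] - ex[..., 1])
    # all pairs are evaluated at once, so processing order cannot change the result
    return inter / np.maximum(ex_area, 1e-9) > threshold
```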
### Description
NDJSON files were being detected as JSON due to having the same mime-type. This adds additional logic to skip mime-type based detection when the extension is `.ndjson`.
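In spirit, the extension check happens before any mime-type sniffing; a simplified sketch with a hypothetical function name, not the actual detection code:

```python
from pathlib import Path


def detect_json_like(file_path: str, mime_type: str) -> str:
    """Prefer the .ndjson extension over the (shared) application/json mime type."""
    if Path(file_path).suffix.lower() == ".ndjson":
        return "ndjson"
    if mime_type == "application/json":
        return "json"
    return "unknown"
```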
Add password support for PDF files. Must be combined with [PR 392 in unstructured-inference](Unstructured-IO/unstructured-inference#392).

---------
Co-authored-by: John J <[email protected]>