Unable to get document widgets #74

Open
Bonn2018 opened this issue Jun 8, 2023 · 11 comments

Comments

@Bonn2018
Collaborator

Bonn2018 commented Jun 8, 2023

Can't get widgets from this document
Maryland Bill of Sale for Vehicle Transactions (Form VR-181).pdf

For example, a widget with id 673R exists in this document, but the document method getComponentById called with the value 673 returns null.

@microshine
Contributor

It returns null because this component is marked as removed (type: 'f'). I don't think we should allow updating such components. But maybe it would be better to throw an exception in this case. @Bonn2018 what do you think?


@Bonn2018
Collaborator Author

Bonn2018 commented Jun 9, 2023

> It returns null because this component is marked as removed (type: 'f'). I don't think we should allow updating such components. But maybe it would be better to throw an exception in this case. @Bonn2018 what do you think?

Let's look at how Adobe behaves with the same fields. Maybe we need some kind of auto-fix that removes this flag, or something else. We can be sure that Adobe allows these fields to be filled in, so we should too.

@microshine
Contributor

The document has incorrect object indexing in the XRef table. The objects are mistakenly marked as deleted objects, making it impossible to retrieve the position of the object within the document. Our current implementation relies solely on the indices specified in the XRef table, which speeds up the document loading process and avoids line-by-line reading.

Here is an example of the XRef entries from this document:

0000205176 00000 n
0000207305 00000 n
0000209026 00000 n
0000210373 00000 n
0000211931 00000 n
0000214166 00000 n
0000214831 00000 n
0000215684 00000 n
0000216733 00000 n
0000216795 00000 n
0000219355 00000 n
0000000000 65535 f
0000000000 65535 f
0000000000 65535 f
0000000000 65535 f
0000000000 65535 f
0000000000 65535 f
0000000000 65535 f
0000000000 65535 f
0000000000 65535 f
0000000000 65535 f
0000000000 65535 f
0000000000 65535 f
0000000000 65535 f
0000000000 65535 f
0000000000 65535 f
0000000000 65535 f
0000000000 65535 f
0000000000 65535 f
0000000000 65535 f
0000000000 65535 f
0000000000 65535 f
0000000000 65535 f
0000000000 65535 f
0000000000 65535 f
0000000000 65535 f
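
To make the listing above easier to read, here is a minimal sketch of parsing one entry of a classic cross-reference table. Each entry is a 10-digit number, a 5-digit generation number, and a keyword: `n` for an in-use object (the first number is a byte offset) or `f` for a free object (the first number is the next free object number). The names `XRefEntry`, `parseXRefEntry`, and `countFreeEntries` are hypothetical and are not taken from the module discussed in this issue.

```typescript
// Illustrative sketch only; not the module's actual parser.
interface XRefEntry {
  // Byte offset for in-use entries, next free object number for free entries.
  value: number;
  generation: number;
  inUse: boolean;
}

// Parses a single entry such as "0000205176 00000 n".
// Returns null when the line does not look like an XRef entry.
function parseXRefEntry(line: string): XRefEntry | null {
  const match = /^(\d{10}) (\d{5}) ([nf])\s*$/.exec(line);
  if (match === null) {
    return null;
  }
  const [, rawValue = "", rawGeneration = "", kind = ""] = match;
  return {
    value: parseInt(rawValue, 10),
    generation: parseInt(rawGeneration, 10),
    inUse: kind === "n",
  };
}

// Counts the entries marked as free ("f") in a block of entry lines.
function countFreeEntries(lines: string[]): number {
  let free = 0;
  for (const line of lines) {
    const entry = parseXRefEntry(line);
    if (entry !== null && !entry.inUse) {
      free += 1;
    }
  }
  return free;
}
```

A long run of `f` entries for objects that are actually present in the file, as in the listing above, is the kind of pattern such a count could flag for closer inspection.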

Considering the severity of the issue and the impact it has on the document's integrity, I recommend treating this particular format as corrupted. However, if it's necessary to support this format, we will need to modify the document reading approach in our module, moving away from relying solely on the XRef indices.

@rmhrisk What do you think?

@Bonn2018
Collaborator Author

Bonn2018 commented Jun 9, 2023

@microshine Just my thoughts on this:
As I understand it, we are considering two strategies for reading documents:

  1. According to the xref table
  2. Line-by-line reading

We have now found a case where the xref table let us down: it contains wrong information and leads to incorrect behavior (we can't find a field that actually exists).
From this, one could quickly conclude that line-by-line reading is better because it is safer.
However, line-by-line reading is also much slower, so we want to avoid it.

I'll leave some questions about it:

1) Could we add "line-by-line reading" as a fallback method? Is it hard to implement?

In Hancock, we do not make random requests for widgets. If we ask for a widget by id, it means we are sure that this widget exists in the document. I think we can keep line-by-line reading as a fallback method and use it only in exceptional cases, so that it has no negative effect on documents with a good structure.

2) Could we add a review step for the xref table and rebuild the table as needed?

We have already implemented auto-fixes for documents with some unusual formats. Maybe we can do the same for this problem: a method on the document instance that reviews the xref table and creates a new one if it finds a problem. This strategy would let us fix the table and avoid line-by-line reading from the start of working with a document. It could work at least for all documents that have not been signed previously.

@microshine
Contributor

> 1) Could we add "line-by-line reading" as a fallback method? Is it hard to implement?

Supporting line-by-line reading should not be difficult. The main question is how to determine when to apply this approach.

> 2) Could we add a review step for the xref table and rebuild the table as needed?

Updating the indices in the XRef table can be quite problematic. It would be easier to suggest re-saving the document. Fortunately, our module provides the capability to resave documents.

@Bonn2018
Collaborator Author

Bonn2018 commented Jun 9, 2023

> Supporting line-by-line reading should not be difficult. The main question is how to determine when to apply this approach.

When we can't get an object using the xref table.
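
The rule proposed here (use the slow path only when the xref-based lookup fails) can be sketched as a small wrapper. This is only an illustration under assumptions: `Lookup` and `withFallback` are hypothetical names, and the two lookups passed in stand for whatever xref-based and line-by-line readers the module ends up with.

```typescript
// Illustrative sketch only; the real lookups would read from a PDF document.
type Lookup<T> = (id: number) => T | null;

// Builds a lookup that tries the fast (xref-based) path first and calls the
// slow (line-by-line) path only when the fast path returns null.
function withFallback<T>(fast: Lookup<T>, slow: Lookup<T>): Lookup<T> {
  return (id: number): T | null => {
    const viaXRef = fast(id);
    return viaXRef !== null ? viaXRef : slow(id);
  };
}
```

With this shape, documents with a correct xref table never pay for the scan, because the slow lookup is not invoked when the fast one succeeds.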

@Bonn2018
Collaborator Author

Bonn2018 commented Jun 9, 2023

> Updating the indices in the XRef table can be quite problematic. It would be easier to suggest re-saving the document. Fortunately, our module provides the capability to resave documents.

This approach works for us, but it would be strange to do it for every document. I think we need a method that reviews the document first.

@microshine
Contributor

I discussed this matter with @Romashine, and we came up with an idea on how it could be implemented.

After reading the XRef indices in the document, we can perform a check for deleted objects. If a deleted object doesn't have a preceding version (i.e., it was created as deleted), we can search for that object throughout the original document using its header obj. We will read the found object and use it as a reference within the document.
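
The search step described above might look roughly like the sketch below. It is an assumption-laden illustration, not the planned implementation: `findObjectHeaders` is a hypothetical name, the input is treated as a plain string (so the results are string indices, not byte offsets in a binary PDF), and a real scanner would also have to skip matches inside stream data and comments.

```typescript
// Illustrative sketch only: finds every "<objNum> <gen> obj" header in a text.
// Returns the start index of each header, in document order.
function findObjectHeaders(text: string, objNum: number, gen: number = 0): number[] {
  // The leading group requires start of input or a non-digit, so that
  // "1673 0 obj" is not reported when searching for object 673.
  const pattern = new RegExp(`(^|[^0-9])(${objNum}\\s+${gen}\\s+obj)\\b`, "g");
  const offsets: number[] = [];
  let match: RegExpExecArray | null = pattern.exec(text);
  while (match !== null) {
    const prefix = match[1] ?? "";
    // Skip the one-character prefix (if any) to point at the object number.
    offsets.push(match.index + prefix.length);
    match = pattern.exec(text);
  }
  return offsets;
}
```

The function returns all matches rather than the first one, so the caller can see when a header occurs more than once.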

At first glance, the implementation doesn't seem too complicated. I will try to work on it over the weekend.

However, there is one challenging aspect to consider. What should be done if the same deleted object appears twice in the document? In this case, it becomes difficult to determine the most recent version because, unfortunately, in PDFs, objects are not always written in the sequence they were created.

@microshine
Contributor

It turns out that the document uses a hybrid XRef. Unfortunately, our current version does not support this XRef format.

trailer
<</Size 1131/Root 1083 0 R/Info 173 0 R/ID[<5822179FD54F55489CDF1CB430BF4866><FF637C10A2A4434BBBED709DCC73345A>]/Prev 219538/XRefStm 1792>>
startxref

I have created a new issue #78 to implement support for this format.
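
For context, a hybrid-reference file can be recognized by the `XRefStm` key in a classic trailer: it gives the offset of a cross-reference stream that, per the PDF specification, is consulted before the section pointed to by `Prev`. The sketch below only extracts the direct integer entries from a trailer dictionary given as a string. `TrailerInfo`, `readTrailerInt`, and `readTrailerInfo` are hypothetical names, and the regular expression is a simplification that does not handle indirect references or nested dictionaries.

```typescript
// Illustrative sketch only; not a full PDF dictionary parser.
interface TrailerInfo {
  size: number | null;
  prev: number | null;
  xrefStm: number | null;
  isHybrid: boolean;
}

// Reads "/Key <integer>" from the trailer text, or returns null if absent.
// Only meant for keys whose value is a direct integer (Size, Prev, XRefStm).
function readTrailerInt(trailer: string, key: string): number | null {
  const match = new RegExp("/" + key + "\\s+(\\d+)").exec(trailer);
  const digits = match === null ? undefined : match[1];
  return digits === undefined ? null : parseInt(digits, 10);
}

function readTrailerInfo(trailer: string): TrailerInfo {
  const xrefStm = readTrailerInt(trailer, "XRefStm");
  return {
    size: readTrailerInt(trailer, "Size"),
    prev: readTrailerInt(trailer, "Prev"),
    xrefStm,
    // A classic trailer that also carries XRefStm marks a hybrid-reference file.
    isHybrid: xrefStm !== null,
  };
}
```

Applied to the trailer quoted above, this would report `Size` 1131, `Prev` 219538, and `XRefStm` 1792.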

@Bonn2018
Collaborator Author

Washington Vehicle_Vessel Bill of Sale Form (1).pdf

@microshine here is another document with the same issue, but for a different reason. Please check it.

This was referenced Jun 14, 2023
@MarikTar

There is another problem with the Maryland Bill of Sale for Vehicle Transactions Form VR-181.pdf file, which may also be related to the XRef table.

Some signature images can be mispositioned and compressed vertically after signing. This happens if the document has several signature fields assigned to different recipients.


Steps to reproduce in Hancock:

  1. Create a 'me and other' transaction with this document
  2. Assign any two signature fields to the first recipient
  3. Add another signature field
  4. Assign the new field and the last unassigned signature field to the second recipient
  5. Sign the transaction for both recipients
