Can't read PDF-file #30

Open
FredrikBrandt opened this issue Aug 20, 2018 · 13 comments

Comments

@FredrikBrandt

Hi,
I have a PDF file (version 1.7) that is read correctly.
I have another PDF file (version 1.3, Producer: Amyuni PDF Converter version 5.0.0.3) that is not read at all. It fails with "Error, this is not a valid PDF: ...", although it is in fact a readable PDF file.
What could be the problem?

I am using this call:
// Check the PDF-file for information
$uri = $dir.'/'.$fileName;
$pdf = new Pdf();
$pdfdata = $pdf->getPdfInfo($uri);

In my class:
...
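public function getPdfInfo($uri = "") {

    $error = '';
    try {
        // The constructor is where loading fails for the problem files
        $pdf = new PdfToText($uri);
    } catch (Exception $e) {
        // Keep the message so the caller can inspect it
        $this->error = 'Caught exception: ' . $e->getMessage() . "\n";
        return false;
    }
    // Extracted text of the document
    $pdftext = $pdf->Text;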

(calling class/PdfToText.phpclass).

The error log shows:
[Text] => �����
�� �����
...

Please, can you help me out?
It is working with other PDF-files.
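
For a "not a valid PDF" error on a file that viewers open fine, one quick check is where the `%PDF-x.y` header actually sits in the file. Some producers write a few bytes before it, and viewers commonly tolerate that. The sketch below only reports the header position and version. `findPdfHeader` is a hypothetical helper, not part of PdfToText, and whether PdfToText requires the header at offset 0 is an assumption to verify, not something this thread confirms.

```php
<?php
// Hypothetical diagnostic helper, not part of PdfToText.
// Looks for the "%PDF-x.y" header within the first 1024 bytes of a file.
function findPdfHeader(string $path): ?array
{
    $head = file_get_contents($path, false, null, 0, 1024);
    if ($head === false) {
        return null; // file could not be read
    }
    if (!preg_match('/%PDF-(\d\.\d)/', $head, $m, PREG_OFFSET_CAPTURE)) {
        return null; // no header found in the first 1024 bytes
    }
    // offset > 0 means there are extra bytes before the header
    return ['offset' => $m[0][1], 'version' => $m[1][0]];
}
```

Calling `findPdfHeader($uri)` on a working and a failing file and comparing the results would show whether the header position differs between them.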

@FredrikBrandt
Author

Hi,
I have another file that does not work either. It begins with:
%PDF-1.5
%���
3 0 obj
<< /Length 4 0 R
/Filter /FlateDecode

stream
(compressed binary stream data, unreadable as text)
...

@FredrikBrandt
Author

Hi,
I tried to send a PDF-file to ([email protected]), but it bounced back.
Where can I send it?
Regards,
/Fredrik.

@mjblacker

I'm happy to take a look at it if you can send it over.

@FredrikBrandt
Author

FredrikBrandt commented Aug 24, 2018 via email

@mjblacker

Thanks, the problem is the CID Identity-H fonts.

Using only the Unicode map on the font object, you get around a third of the text out, but the rest isn't mapped to characters properly.

I'm working on a change that will read the CID fonts' CMaps, which will hopefully make reading international PDFs much better.
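
To illustrate what reading a CMap involves, here is a minimal sketch that parses the `beginbfchar ... endbfchar` sections of a ToUnicode CMap and uses them to decode a hex string of 2-byte character codes (as used with Identity-H). It is not the change described above and not PdfToText code. The function names are hypothetical, and `bfrange` sections, surrogate pairs, and codes of other widths are deliberately left out.

```php
<?php
// Convert a hex string of UTF-16BE code units (BMP only) to UTF-8.
// Assumption: no surrogate pairs; each unit is exactly 4 hex digits.
function utf16HexToUtf8(string $hex): string
{
    $out = '';
    foreach (str_split($hex, 4) as $unit) {
        $cp = hexdec($unit);
        if ($cp < 0x80) {
            $out .= chr($cp);
        } elseif ($cp < 0x800) {
            $out .= chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F));
        } else {
            $out .= chr(0xE0 | ($cp >> 12))
                  . chr(0x80 | (($cp >> 6) & 0x3F))
                  . chr(0x80 | ($cp & 0x3F));
        }
    }
    return $out;
}

// Build a map from source code (uppercase hex) to UTF-8 text
// using only the bfchar sections of a ToUnicode CMap.
function parseBfChars(string $cmap): array
{
    $map = [];
    preg_match_all('/beginbfchar(.*?)endbfchar/s', $cmap, $sections);
    foreach ($sections[1] as $section) {
        preg_match_all('/<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>/', $section, $pairs, PREG_SET_ORDER);
        foreach ($pairs as $pair) {
            $map[strtoupper($pair[1])] = utf16HexToUtf8($pair[2]);
        }
    }
    return $map;
}

// Decode a hex string of 2-byte codes; unmapped codes become "?".
function decodeHexString(string $hex, array $map): string
{
    $out = '';
    foreach (str_split(strtoupper($hex), 4) as $code) {
        $out .= $map[$code] ?? '?';
    }
    return $out;
}

// Made-up CMap fragment for illustration only.
$cmap = "3 beginbfchar\n<0024> <0046>\n<0044> <0061>\n<0099> <00E5>\nendbfchar";
$map  = parseBfChars($cmap);
echo decodeHexString('00240044', $map), "\n"; // prints "Fa"
```

Codes that have no entry come out as `?`, which matches the symptom described above: text without a usable mapping cannot be turned into readable characters.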

@FredrikBrandt
Author

Hi,
Faktura-1587.pdf

This file does not work either.
When do you think the change will be done?
Regards,
/Fredrik.

@FredrikBrandt
Author

Hi,
How is it going?
When do you think a solution can be available?

This file cannot be read at all.
Faktura20541.pdf

I use the code below.
The first log output is printed, but a second log call placed after the function call never shows.

    // Check the PDF-file for information
    $uri = $dir.'/'.$fileName;
    $pdf = new Pdf();
    //
    error_log(print_r(array(
        'uri' => $uri,
        'pdf' => $pdf,
        '' => ''
    ), true));
    $pdfdata = $pdf->getPdfInfo($uri); // nothing logged after this call shows up

Otherwise the tool is great.

Regards,
/Fredrik

@FredrikBrandt
Author

Hi,
Please, I need this urgently.
Can I at least get an answer on when the change is expected?
It is much appreciated :).
Regards,
/Fredrik.

@FredrikBrandt
Author

FredrikBrandt commented Sep 17, 2018

Hi,
Maybe I am not using the complete set of files?
I am using:
class/PdfToText.phpclass
class/Maps/adobe-charsets.map
class/Maps/unicode-to-ansi.map

Do I also need the CIDTables directory (class/CIDTables/)?

By the way, I tried adding these directories to the class library, without any effect:
class/CIDTables
class/contributions
class/FontMetrics
class/FormTemplates

@mjblacker

Hey Fredrik,

We are still working out the best way to resolve the issues with CID fonts.

We've made a few changes to the fork on our GitHub. If you check that out, you should get some information out of the PDF from the Unicode map we process, even with CID fonts.

No time frame currently, as this is very much a side project.

@FredrikBrandt
Author

FredrikBrandt commented Sep 18, 2018

Hi, and thanks a lot,
It almost suits my purpose.
Can this be adjusted a little more?
I seem to get part of the invoice, but not the part that I want.
Great otherwise.

I actually only need two values from the PDF files.
One is the number of pages, and the other is whether the text in the PDF contains
Invoice (Faktura) or Credit invoice (Kreditfaktura).
Can this be obtained somehow?

Yes, this is solving my problems for now.
Thank you very much :).
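
The two values described above can be derived from per-page text once extraction works. The following is a minimal sketch, assuming the page texts are already available as an array of strings; how they are obtained from PdfToText is left open here. `classifyInvoice` and its return keys are hypothetical names. The credit check runs first because "Kreditfaktura" also contains "faktura".

```php
<?php
// Hypothetical helper: returns the page count and the detected document type.
// $pageTexts is assumed to hold one extracted text string per page.
function classifyInvoice(array $pageTexts): array
{
    $text = implode("\n", $pageTexts);

    if (preg_match('/kredit\s*faktura|credit\s*invoice/i', $text)) {
        $type = 'credit_invoice';   // must be tested before the plain invoice case
    } elseif (preg_match('/faktura|invoice/i', $text)) {
        $type = 'invoice';
    } else {
        $type = null;               // neither keyword found
    }

    return ['pages' => count($pageTexts), 'type' => $type];
}

$result = classifyInvoice(["Faktura nr 1587", "Sida 2"]);
echo $result['pages'], ' ', $result['type'], "\n"; // prints "2 invoice"
```

If the keyword sits in text drawn with a font that has no usable mapping, the type stays `null` even though the word is visible in a viewer.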

@FredrikBrandt
Author

Hi again,
I am having a problem with this type of invoice. Is it because of the QR code?
It doesn't load anything at all.
This line of code does not run correctly:
$pdf = new PdfToText($uri);

The dropzone will respond with:
Server responded with 0 code.

Can this be fixed?

Here is the invoice.
Faktura20541.pdf
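
When a request dies with no output and the upload widget reports code 0, the script may be ending on something that `catch (Exception $e)` does not catch. In PHP 7 and later, engine errors such as `TypeError` are `Error` objects, which are `Throwable` but not `Exception`. Below is a minimal sketch of a wrapper that catches both. `runGuarded` is a hypothetical helper, and whether this is what happens with this particular file is an assumption. Memory-limit and time-limit failures are fatal errors that no `try`/`catch` will intercept, so the PHP error log is still the place to look for those.

```php
<?php
// Hypothetical wrapper: runs $work and reports any Throwable instead of dying.
// Returns the callback's result, or false with $error set on failure.
function runGuarded(callable $work, ?string &$error = null)
{
    try {
        return $work();
    } catch (\Throwable $e) {
        // Covers Exception and Error (TypeError, ArithmeticError, ...)
        $error = get_class($e) . ': ' . $e->getMessage();
        return false;
    }
}

// Usage sketch: the closure body stands in for the failing constructor call.
$result = runGuarded(function () {
    throw new \Error('simulated engine error');
}, $error);

if ($result === false) {
    error_log($error); // logs "Error: simulated engine error"
}
```

In the class shown earlier, the same effect comes from changing `catch (Exception $e)` to `catch (\Throwable $e)` around `new PdfToText($uri)`.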


2 participants