Undefined map and font notices #12
Comments
Hi Anthony,
Thanks for your report; your proposals make perfect sense. Would it be
possible for you to send me the PDF file that generated these notices? (My
email is: [email protected].)
First, because I suspect that the origin of the warning at line #5517 is
more complex than that, and should not be solved by calling the isset()
function (I said "should not be solved", but it may turn out that this is
the right solution after all!).
To explain why: I have deliberately left some parts of the code without any
checks at all, except for the cases I have already encountered so far, which
I hold to be "truths". The idea behind not checking unusual cases is that
the PDF format is so versatile and can take so many different forms that I
cannot predict every possible case. So, when such an unpredicted case
happens to a user, a warning is issued. With that warning and the
corresponding PDF file, I can investigate the problem. Most of the time,
this has helped me discover new ways of building PDF parts that I was
unaware of, or that were too poorly documented by Adobe for me to notice.
In the end, this has helped me enhance the class and provide more reliable
results.
The second reason why I would like your PDF sample is that it contains CID
fonts, and I'm badly lacking example PDF files using CID fonts. CID
("Character ID") fonts are internal fonts developed by Adobe long before the
Unicode standard emerged. Characters defined in a CID font are just glyphs
containing instructions on how to draw them, but even Adobe itself is not
aware of which Unicode character corresponds to a given CID character entry!
Using Acrobat Reader, you cannot perform searches on text written using CID
fonts, because it does not know which character is behind each glyph; and
if you copy and paste such text into Notepad or Word, you will only see a
series of rectangles.
I'm trying to implement correspondence tables for some CID fonts (i.e.,
character ID <=> Unicode character), but it's a tough task because I do not
have enough examples.
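To illustrate what such a correspondence table amounts to, here is a minimal PHP sketch. Everything in it is made up for the example: the CID values, the characters they map to, and the function name `cids_to_text`. It does not reflect the format of the `.cid` files used by PdfToText.

```php
<?php
// Hypothetical CID => UTF-8 character table (values invented for illustration).
$cid_to_unicode = [
    1 => 'H',
    2 => 'i',
    3 => "\u{2713}",   // CHECK MARK
];

// Translate a sequence of CIDs into text. CIDs that are missing from the
// table become U+FFFD (REPLACEMENT CHARACTER) instead of raising a notice.
function cids_to_text(array $cids, array $table): string
{
    $text = '';
    foreach ($cids as $cid) {
        $text .= $table[$cid] ?? "\u{FFFD}";
    }
    return $text;
}

$text = cids_to_text([1, 2, 3, 99], $cid_to_unicode);
```

CID 99 is absent from the table, so the last character of `$text` is the replacement character, which is roughly the "rectangle" effect described above.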
Regarding the warnings at lines #4720 and #4724, your proposal is correct and
will be integrated in version 1.3.8. Here again, a PDF sample would be
highly welcome for testing this case!
Many thanks for your proactive help!
With kind regards,
Christian.
_____
From: Anthony Bolognese [mailto:[email protected]]
Sent: Friday, 23 December 2016 21:41
To: christian-vigh-phpclasses/PdfToText
Cc: Subscribed
Subject: [christian-vigh-phpclasses/PdfToText] Undefined map and font notices (#12)
Version : 1.3.7
Notice: Undefined variable: map in PdfToText.phpclass on line 4720
Notice: Undefined variable: map in PdfToText.phpclass on line 4724
Notice: Undefined index: fonts in PdfToText.phpclass on line 5517
Warning: Invalid argument supplied for foreach() in PdfToText.phpclass on line 5517
proposed change #1
$file = PdfToText::$CIDTablesDirectory . DIRECTORY_SEPARATOR . $map_name . '.cid' ;
change to:
$map = array(); $file = PdfToText::$CIDTablesDirectory . DIRECTORY_SEPARATOR . $map_name . '.cid' ;
proposed change #2
foreach ( $page_contents [ 'fonts' ] as $font_name => $font_object )
change to:
`if ( ! isset ( $page_contents [ 'fonts' ] ) )
continue ;
foreach ( $page_contents [ 'fonts' ] as $font_name => $font_object )`
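The guard in proposed change #2 can be tried in isolation with the sketch below. It assumes that the `continue` sits inside an outer loop over pages. The `$pages` array and its contents are hypothetical and only mimic the shape implied by `$page_contents [ 'fonts' ]`; the real PdfToText internals may differ.

```php
<?php
// Hypothetical page data; page 2 has no 'fonts' entry, which is the
// situation that produced the notice and the foreach() warning.
$pages = [
    1 => ['fonts' => ['F1' => 'font object 1', 'F2' => 'font object 2']],
    2 => [],
];

$seen = [];
foreach ($pages as $page_number => $page_contents) {
    // Skip pages without a usable 'fonts' array before iterating over it.
    if (!isset($page_contents['fonts']) || !is_array($page_contents['fonts'])) {
        continue;
    }
    foreach ($page_contents['fonts'] as $font_name => $font_object) {
        $seen[] = "$page_number:$font_name";
    }
}
```

The extra `is_array()` test is an addition to the proposal: `isset()` alone silences the "Undefined index" notice, but `foreach` would still complain if the entry existed and was not an array.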
Christian, I regret that I can't share the PDF sample with you. It contains proprietary information that I signed a Non Disclosure Agreement for. I truly wish I could share it so you can continue to improve the project. Unfortunately I cannot. If I can help in another way please let me know! Kind Regards,
Hi Anthony,
Thanks! I understand that you are bound by an NDA, so there are not many
options. The only one I can see is for your customer to generate the same
kind of document with anonymized (dummy) data; but if he is not himself the
owner (generator) of the document, for example because it comes from a third
party, that won't be possible. And I don't think he would agree to sign an
NDA with me!
Anyway, I will implement your suggested corrections in version 1.3.8 this
weekend, so that you can move forward.
Meanwhile, one thing that might help me is for you to open the PDF file in
Notepad or Notepad++ (if you're running Windows), then:
- Search for the string /Encoding
- It should be followed by either /WinAnsiEncoding, /Unicode or
/Identity-H. If it's followed by a different "/keyword" option, I would be
interested in knowing what this keyword is (for example, /Encoding
/some_keyword_not_listed_here)
I would also like to know, if possible, the written language of the
document, especially if you find the string "/Encoding" followed by
"/Identity-H". But don't spend too much time on it.
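The manual search described above could also be scripted. The following is only a sketch: the function name `list_encoding_names` and the sample string are invented for illustration, and the regular expression only sees `/Encoding /Name` pairs in uncompressed parts of the file. It will miss entries inside compressed object streams, as well as `/Encoding` entries followed by an indirect reference or a dictionary instead of a name.

```php
<?php
// Hypothetical helper (not part of PdfToText): list the names that
// directly follow /Encoding in a chunk of raw PDF data.
function list_encoding_names(string $pdf_bytes): array
{
    // Whitespace between /Encoding and the following /Name is optional.
    preg_match_all('#/Encoding\s*/([^\s/<>\[\]()]+)#', $pdf_bytes, $matches);
    return array_values(array_unique($matches[1]));
}

// The three keywords mentioned in the message above.
$known = ['WinAnsiEncoding', 'Unicode', 'Identity-H'];

// Made-up sample; for a real file one would read it with
// file_get_contents() on the PDF path instead.
$sample = "<</Type/Font/Subtype/Type1/Encoding/WinAnsiEncoding>>\n"
        . "<</Type/Font /Encoding /MacRomanEncoding>>";

$names   = list_encoding_names($sample);
$unusual = array_values(array_diff($names, $known));
```

Here `$unusual` holds the encoding names that are not in the list of three, which are the ones worth reporting.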
Anyway, if you encounter any further issue with text extraction, please feel
free to contact me.
With kind regards,
Christian.
_____
From: Anthony Bolognese [mailto:[email protected]]
Sent: Saturday, 24 December 2016 00:01
To: christian-vigh-phpclasses/PdfToText
Cc: christian-vigh-phpclasses; Comment
Subject: Re: [christian-vigh-phpclasses/PdfToText] Undefined map and font notices (#12)
Christian,
Thank you for building this awesome pdf extraction class. Well done!
I regret that I can't share the PDF sample with you. It contains proprietary
information that I signed a Non Disclosure Agreement for. I truly wish I
could share it so you can continue to improve the project. Unfortunately I
cannot. If I can help in another way please let me know!
Kind Regards,
Anthony
Christian, so I checked the document in Sublime Text and found the following line:
The document is written in English. Let me know if I can do any more searching for you; I'm happy to help! Kind Regards,
Sorry, I should have included the line above also. Here it is!
Best,
Hello Anthony,
Merry Christmas to you as well!
Thanks for your feedback; it does confirm that your PDF file is using CID
fonts for a non-Far-East language. My current implementation regarding CID
fonts could be described as several steps below "experimental". I hope that
in the near future I will be able to collect more information and sample
files to figure out how this really strange thing works...
With kind regards,
Christian.
_____
From: Anthony Bolognese [mailto:[email protected]]
Sent: Monday, 26 December 2016 14:19
To: christian-vigh-phpclasses/PdfToText
Cc: christian-vigh-phpclasses; Comment
Subject: Re: [christian-vigh-phpclasses/PdfToText] Undefined map and font notices (#12)
Sorry, I should have included the line above also. Here it is!
<</BaseFont/GHTXAC+Wingdings-Identity-H/Type/Font
/Encoding /Identity-H/DescendantFonts[181 0 R]/Subtype/Type0>>
Best,
Anthony