Retrieving the PDF Text in Array. #16
Hi Manuel,
Exactly, this is what the BlockSeparator property was intended for.
In fact, unlike HTML, the PDF file format has absolutely no notion of what a
table is; it's just a set of instructions for drawing shapes and displaying
text. Sometimes, when extracting text, it's hard to decide whether a space
should be inserted between two strings of text or not.
However, I've noticed that setting this property to something else (like you
did) works fine for most PDF files presenting "tabular" data, maybe because
they have been generated by some software that "draws" the data as it comes.
Please feel free to contact me if you have any other question or issue.
Christian.
…_____
From: Manuel Osuna
Sent: Sunday, 16 April 2017 10:00
To: christian-vigh-phpclasses/PdfToText
Cc: Subscribed
Subject: Re: [christian-vigh-phpclasses/PdfToText] Retrieving the PDF Text in Array. (#16)
I solved it by setting the value of BlockSeparator in main class like this:
public $BlockSeparator = '#$' ;
And using that value for exploding the string.
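// Assumes the PdfToText class file has already been included.
$file = 'sample2' ;
$pdf = new PdfToText ( "$file.pdf" ) ;
$rows = array ( ) ;
foreach( $pdf -> Pages as $page_number => $page_contents){
    $lines = explode(PHP_EOL,$page_contents);
    foreach($lines as $key=>$line){
        // Split each line on the custom separator and keep the result,
        // so that $texts is not overwritten on every iteration.
        $texts = explode('#$',$line);
        $rows[] = $texts;
    }
}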
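The splitting step above can be sketched as a standalone helper that works on any already-extracted page text. This is only a minimal sketch, assuming PHP 7.4 or later: the function name `splitPageText`, the sample string, and the choice to drop empty cells are illustrative assumptions, not part of PdfToText. It flattens the result into the kind of array asked for in the original question.

```php
<?php
/**
 * Split extracted page text into rows of cells.
 * $separator is whatever marker was configured as the block separator.
 * (Hypothetical helper, not part of the PdfToText class.)
 */
function splitPageText(string $pageText, string $separator = '#$'): array
{
    $rows = [];
    // \R matches any line-ending sequence, so the split does not depend on PHP_EOL.
    foreach (preg_split('/\R/', $pageText) as $line) {
        if (trim($line) === '') {
            continue; // skip blank lines
        }
        // Trim every cell, then drop cells that are empty after trimming.
        $cells = array_map('trim', explode($separator, $line));
        $rows[] = array_values(array_filter($cells, fn ($cell) => $cell !== ''));
    }
    return $rows;
}

// Made-up page text; single quotes keep PHP from treating "$Emisor" as a variable.
$sample = 'Asesor#$Emisor#$Carpeta#$Cis' . "\n" . '13315#$29036#$20001310#$20001178';

$rows = splitPageText($sample);      // one array of cells per line
$flat = array_merge([], ...$rows);   // single flat array of all cells
print_r($flat);
```

Note that dropping empty cells is a deliberate simplification here; if empty columns carry meaning, the `array_filter` call should be removed so that cell positions are preserved.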
I'm happy to hear that!
Regarding your question: it's curious, since the birth of my class one year
ago, you are the second person to ask me for such a feature.
However, I'm not sure of the real need behind this question. I'm just
guessing that you may have documents that all contain some useful
information in the same area of the page (a rectangle), and that you want to
retrieve only the information that is within this rectangle. Is this
correct?
If I'm guessing right, that could be a really good feature to implement!
If I'm guessing wrong, could you explain further what you would do with
such a feature?
With kind regards,
Christian.
…_____
From: Manuel Osuna
Sent: Monday, 17 April 2017 17:46
To: christian-vigh-phpclasses/PdfToText
Cc: christian-vigh-phpclasses; Comment
Subject: Re: [christian-vigh-phpclasses/PdfToText] Retrieving the PDF Text in Array. (#16)
Thanks for your quick response. I'm practically new trying your powerful
library and so far it's been the best I've tried.
I was wondering about something else: is it possible to get the text from
specific document areas? For example, by giving the coordinates of two points
to draw a rectangle and then getting all the text inside it.
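The idea of selecting text by rectangle can be sketched as follows. This is a rough illustration only, assuming each piece of text is available as a fragment with an (x, y) anchor point; the fragment array shape and the function name `textInRectangle` are hypothetical and do not describe how PdfToText stores or exposes positions.

```php
<?php
/**
 * Return the text of every fragment whose anchor point lies inside the
 * rectangle spanned by two opposite corners ($x1, $y1) and ($x2, $y2).
 * Fragment shape ['x' => ..., 'y' => ..., 'text' => ...] is hypothetical.
 */
function textInRectangle(array $fragments, float $x1, float $y1, float $x2, float $y2): array
{
    // Normalize the corners so they can be given in any order.
    [$left, $right] = [min($x1, $x2), max($x1, $x2)];
    [$bottom, $top] = [min($y1, $y2), max($y1, $y2)];

    $hits = [];
    foreach ($fragments as $f) {
        if ($f['x'] >= $left && $f['x'] <= $right && $f['y'] >= $bottom && $f['y'] <= $top) {
            $hits[] = $f['text'];
        }
    }
    return $hits;
}

// Made-up fragments for illustration.
$fragments = [
    ['x' => 100, 'y' => 700, 'text' => 'Asesor'],
    ['x' => 200, 'y' => 700, 'text' => 'Emisor'],
    ['x' => 100, 'y' => 680, 'text' => '13315'],
];

// Corners may be passed in either order.
$inside = textInRectangle($fragments, 150, 710, 90, 670);
print_r($inside);
```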
Great! But I have to admit that I didn't understand the actual need of the
previous person who contacted me a few months ago about this. Now, thanks to
you, it has become really clear to me.
Before version 1.5, I would have answered that it would take some time,
because I didn't really handle (x,y) coordinates, except for x-coordinates
that followed one another sequentially in the PDF flow. Starting from
version 1.5, I implemented some layout rendering (with option
PDFOPT_BASIC_LAYOUT).
This new layout feature will be of great help in implementing the one you
want. However, it does not yet work on all the samples I've tried, meaning
that I have not completely understood some positioning instructions that are
part of the PDF PostScript-like language.
The good news is that it has worked for at least half of the samples (those
that did not include tricky PDF instructions).
So, maybe the PDF files you have to process fall into this category (I mean,
they work with the PDFOPT_BASIC_LAYOUT option). It would be really useful to
me if you could send me a few of them at the following address, and explain
to me which area(s) you want to capture:
[email protected]
If my "basic layout" option gives the results I expect with your sample PDF
files, then I could start implementing the feature you suggested right now,
before I finish solving my other (x,y) positioning issues (which should
take a little time).
With kind regards,
Christian.
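To illustrate what layout rendering from (x,y) coordinates involves in general, here is a minimal sketch that groups positioned fragments into lines and orders each line left to right. It is not the library's PDFOPT_BASIC_LAYOUT implementation; the fragment shape, the tolerance value, and the function name `fragmentsToLines` are assumptions made for the example, which needs PHP 7.4 or later.

```php
<?php
/**
 * Arrange positioned fragments into text lines. A fragment joins the current
 * line when its y value is within $tolerance of that line's first fragment;
 * each line is then ordered left to right.
 * Fragment shape ['x' => ..., 'y' => ..., 'text' => ...] is hypothetical.
 */
function fragmentsToLines(array $fragments, float $tolerance = 2.0): array
{
    // Sort top to bottom (in PDF user space, y grows upward).
    usort($fragments, fn (array $a, array $b): int => $b['y'] <=> $a['y']);

    $lines = [];
    $lineY = null;
    foreach ($fragments as $f) {
        if ($lineY === null || abs($lineY - $f['y']) > $tolerance) {
            $lines[] = [];       // start a new line
            $lineY = $f['y'];
        }
        $lines[count($lines) - 1][] = $f;
    }

    // Order each line by x and join its fragments with a single space.
    return array_map(function (array $line): string {
        usort($line, fn (array $a, array $b): int => $a['x'] <=> $b['x']);
        return implode(' ', array_column($line, 'text'));
    }, $lines);
}

// Made-up fragments, deliberately out of order and slightly misaligned.
$fragments = [
    ['x' => 200, 'y' => 700, 'text' => 'Emisor'],
    ['x' => 100, 'y' => 680, 'text' => '13315'],
    ['x' => 100, 'y' => 701, 'text' => 'Asesor'],
    ['x' => 200, 'y' => 680, 'text' => '29036'],
];

$lines = fragmentsToLines($fragments);
print_r($lines);
```

Joining with a single space is the simplest possible choice; deciding whether a gap between two fragments really is a space is the hard part mentioned earlier in the thread.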
…_____
From: Manuel Osuna
Sent: Tuesday, 18 April 2017 16:46
To: christian-vigh-phpclasses/PdfToText
Cc: christian-vigh-phpclasses; Comment
Subject: Re: [christian-vigh-phpclasses/PdfToText] Retrieving the PDF Text in Array. (#16)
Yeah, you couldn't explain it better. That is exactly what I'm looking for
and it's nice to know I'm not the only one who thought about it. That would
be a really useful feature because we could design some templates for our
document pages and we could expect some information in those areas, so if we
don't get anything then we would be pretty sure that the value is empty
instead of having to infer it by other methods. This is something that has to do
with tables where some column values may be empty.
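The template idea described above could look roughly like this: each expected field is bound to a rectangle, and a field whose rectangle contains no text is reported as empty. This is a sketch under stated assumptions; the fragment shape, the template shape (`[left, bottom, right, top]` per field), and the function name `readTemplate` are all hypothetical and not part of PdfToText.

```php
<?php
/**
 * Read each named area of a template; an area containing no text yields null.
 * $template maps a field name to [left, bottom, right, top].
 * Both the fragment shape and the template shape are hypothetical.
 */
function readTemplate(array $fragments, array $template): array
{
    $values = [];
    foreach ($template as $field => [$left, $bottom, $right, $top]) {
        $texts = [];
        foreach ($fragments as $f) {
            $inside = $f['x'] >= $left && $f['x'] <= $right
                   && $f['y'] >= $bottom && $f['y'] <= $top;
            if ($inside) {
                $texts[] = $f['text'];
            }
        }
        // No fragment in the area means the column value is empty.
        $values[$field] = $texts === [] ? null : implode(' ', $texts);
    }
    return $values;
}

// Made-up data: the "emisor" column has no value on this row.
$fragments = [
    ['x' => 100, 'y' => 680, 'text' => '13315'],
    ['x' => 300, 'y' => 680, 'text' => '20001310'],
];
$template = [
    'asesor'  => [90, 670, 150, 690],
    'emisor'  => [190, 670, 250, 690],
    'carpeta' => [290, 670, 350, 690],
];

$values = readTemplate($fragments, $template);
var_dump($values);
```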
Hi Manuel,
Thanks for sending me those materials; I will come back to you when I
have a solution ready.
Christian.
…_____
From: Manuel Osuna
Sent: Wednesday, 19 April 2017 01:58
To: christian-vigh-phpclasses/PdfToText
Cc: christian-vigh-phpclasses; Comment
Subject: Re: [christian-vigh-phpclasses/PdfToText] Retrieving the PDF Text in Array. (#16)
Hi Christian,
I sent you the email with the sample PDF files hoping they work for the
tests you need to do.
Thanks for your support.
Hello Manuel,
Your files helped me a lot! I've published version 1.6.0, which should be a
first answer to your needs.
With kind regards,
Christian.
Hi Christian. How are you? I'm back again with my project where I'm using your library and I have some issues with empty values. Thanks in advance for your support. Greetings.
Hi there, is there a way to retrieve the data separated in an array?
I have a PDF with a table like this:
Asesor   Emisor   Carpeta    Cis
13315    29036    20001310   20001178
But I get the output like this:
AsesorEmisorCarpetaCis13315290362000131020001178
I want to store the data in a database but getting that output doesn't help at all. I want to get an array like this:
Array(
[0] => "Asesor",
[1] => "Emisor",
[2] => "Carpeta",
[3] => "Cis",
[4] => "13315",
[5] => "29036",
[6] => "20001310",
[7] => "20001178"
)
Any help will be appreciated, thanks in advance.