Extracting Text from a PDF


Abhishek Kolluru

Aug 29, 2022, 3:21:18 PM
to live...@googlegroups.com
Hello everyone, I am exploring options for using the current AEM Forms on JEE services to extract text from a PDF and then parse it to get the desired result. I went through the DDX reference document, which explains changing PDF content with DDX, but I did not see any section about extracting content from a PDF using DDX, so I believe this is not possible.

I know that when I convert an image to PDF, I see an option to enable OCR. That makes the text readable, but how do I actually extract it? Any thoughts?

Regards,
Abhishek

fred.pantalone

Aug 30, 2022, 9:38:03 AM
to Adobe LiveCycle Developers
What type of PDF are you working with? Is it an XFA PDF (i.e. created with Designer), an AcroForm, or, as you described, a PDF from a scanned document?

Duane Nickull

Aug 30, 2022, 9:49:50 AM
to live...@googlegroups.com
Abhishek:

Just to clarify, if the PDF is created by scanning, it can be very difficult to get the text unless you use a higher-quality scan and OCR. PDFs distilled from Word or other documents do have the text within the documents.

Please provide the full context of the workflow and maybe we can help.

Duane


Adam D.

Aug 30, 2022, 10:16:22 AM
to live...@googlegroups.com, Duane Nickull
Thanks for getting back to him, Duane. I have an idea, but it may rely on other software, and I am not sure it would help in the environment unless, as you stated, we get a better look at the workflow process.



- Adam

Duane Nickull

Aug 30, 2022, 6:41:12 PM
to Adam D., live...@googlegroups.com
I actually use PDFBox for this.

Duane

Adam D.

Aug 30, 2022, 7:52:12 PM
to Duane Nickull, live...@googlegroups.com
Great minds think alike. However, does it work well with OCR import to XML or LCD forms?



- Adam

Duane Nickull

Aug 31, 2022, 11:10:16 AM
to Adam D., live...@googlegroups.com
Correct.

Has Abhishek let anyone know more about how his PDFs are being created yet?

Duane

Adam D.

Aug 31, 2022, 11:31:58 AM
to Duane Nickull, live...@googlegroups.com
I haven't yet, but will wait. By the way, Duane, your contributions and knowledge are awesome and much appreciated. Have a great day!


- Adam

Adam D.

Aug 31, 2022, 11:36:37 AM
to Duane Nickull, live...@googlegroups.com
Oh, before I forget: Fred, you are awesome as well. Back in the early days I used quite a few of your suggestions. As a community we kick butt!



- Adam



Abhishek Kolluru

Aug 31, 2022, 1:50:49 PM
to live...@googlegroups.com, Duane Nickull
Good to see responses from some of the greats of LiveCycle after a long time :-). This definitely is a great community to be part of.

Coming back to my question: my task is to extract the user info from a PDF (these are payslips, mostly scanned ones) and then find the employee ID somewhere in the extracted text. The second part is to disassemble the PDF and name each piece with the extracted employee ID. I was thinking about using DDX to achieve this and get the text into an XML; however, I was not sure whether DDX can be applied to non-XFA documents.

Thanks,
Abhishek
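
Editor's sketch: once the text of a payslip page is available (from OCR or from embedded text), finding the employee ID could be as simple as a regular expression. This is a minimal illustration, not a LiveCycle API. The class name `EmployeeIdExtractor`, the label variants ("Employee ID", "Emp No", "Emp #") and the 4 to 10 digit ID format are all assumptions; the real ADP layout would have to be checked.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pulls an employee ID out of the text of one payslip page.
class EmployeeIdExtractor {

    // Assumption: the ID follows a label such as "Employee ID", "Emp No" or "Emp #"
    // and consists of 4 to 10 digits. Adjust this to the actual payslip layout.
    private static final Pattern ID_PATTERN = Pattern.compile(
            "(?i)\\bEmp(?:loyee)?\\s*(?:ID|No\\.?|#)\\s*[:\\-]?\\s*(\\d{4,10})\\b");

    // Returns the first ID found in the page text, or empty if there is none.
    static Optional<String> extract(String pageText) {
        if (pageText == null) {
            return Optional.empty();
        }
        Matcher m = ID_PATTERN.matcher(pageText);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String sample = "ACME Corp\nEmployee ID: 00123456\nPay Period: 08/2022";
        System.out.println(extract(sample).orElse("no ID found"));
    }
}
```

Returning an `Optional` rather than an empty string makes pages with no recognizable ID explicit, so they can be routed to manual review instead of being silently misnamed.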

Adam D.

Aug 31, 2022, 1:57:21 PM
to Abhishek Kolluru, live...@googlegroups.com, Duane Nickull
Reading a small field via OCR is doable, as long as the target region is easy to identify (say, a 2 x 3 area to scan). Adobe Acrobat has the functionality built in; you just need to line up the scans folder on the server and create an output folder with unique IDs (file name based on employee number), and LCD / AEM can handle the rest. A backend SQL server with aftermarket PDF OCR can achieve the same results; it is just more work.

My best bet, if the other experts agree, is to go all Adobe (Acrobat with OCR) and import to AEM (LCD).

 





- Adam

Adam D.

Aug 31, 2022, 2:02:32 PM
to Abhishek Kolluru, live...@googlegroups.com, Duane Nickull
I should note that the file name is based on the OCR text and the date-created metadata, so that you get a proper naming convention. With SQL being so flexible it's doable, but you'd not be giving a PDF back to the end user, just a text file with the particular need-to-knows, if I understand right.

The key is implementation, and how much the company is willing to spend on a whim. Sometimes it's easier for a company to pay someone a fair living wage to handle pay statements (from Crystal Reports, QuickBooks, etc.) than to try to build a complex system.



- Adam

Adam D.

Aug 31, 2022, 2:07:38 PM
to Abhishek Kolluru, live...@googlegroups.com, Duane Nickull
Sorry to spam so much. Keep in mind that XFA-based forms can only be read on Windows, so it is unlikely that the received files would be in that format. CamScanner and similar tools can take data in raw or image format and wrap it into a finished PDF. XFA files are only generated natively by LiveCycle, and they aren't compatible with Unix-based operating systems (Apple, Linux, Android...).

I am now done chiming in unless asked. Fred or Duane may well have an easier idea that I don't know of. Have a great day, everyone!



- Adam

Abhishek Kolluru

Aug 31, 2022, 2:20:23 PM
to Adam D., live...@googlegroups.com, Duane Nickull
One other option I was exploring is the DDX below. This is just out of curiosity, to see how precise the text returned from the PDF would be:

<DDX xmlns="http://ns.adobe.com/DDX/1.0/">
  <PDF result="doc1">
    <PDF source="doc2"/>
  </PDF>
  <DocumentText result="words.xml">
    <PDF source="doc1"/>
  </DocumentText>
</DDX>

The challenge I am having is that I map words.xml from the above DDX to an XML process variable, but it always comes back empty, with no errors in the logs. I am not sure that is what the above DDX is supposed to do. Any thoughts on what I am missing? Meanwhile, I will try this on an XFA document and see if that gives me any different results.

Thanks,
Abhishek

Duane Nickull

Aug 31, 2022, 3:09:08 PM
to Abhishek Kolluru, Adam D., live...@googlegroups.com
Abhishek:

I would still like to know more about the process. For example, is this a batch scan of multiple employee paycheque stubs, or is there a link to the overall process where the employee ID could be grabbed? I found in my LC ES days that many times people already had the information they sought in different areas.

Can you also describe the scanning process? Is it a feed scanner using a non-deterministic order/approach, or is there some order to it?

Adam is correct, you can probably do this with OCR, but you might not have to if the data can be grabbed from somewhere else. Many Fortune 500 companies still do this digital -> paper -> digital dance, and it costs a ton of money.

Until we have the details, I agree with Adam.

Duane

Adam D.

Aug 31, 2022, 3:41:13 PM
to Abhishek Kolluru, live...@googlegroups.com, Duane Nickull
The lead DDX address is wrong. It needs to point to a file name (convention) in a directory, not a URL. The target URL from Adobe is valid provided you pay for their OCR/archiving services.



- Adam

Abhishek Kolluru

Aug 31, 2022, 3:43:13 PM
to Duane Nickull, Adam D., live...@googlegroups.com
This is a single stitched PDF with almost 25k pages. Each page is an ADP-format payslip of one employee, and the file gets sent to me from an upstream system. Each page (basically each payslip) has an employee ID, just like any payslip. I need to break this 25k-page PDF into 25k separate PDFs and name each PDF with the employee ID it contains. Does that make sense? At this point I am trying to achieve this entirely with LiveCycle's own capabilities.

If I am able to achieve this with LiveCycle, I will set up a watch folder for the upstream system to drop the PDF, process it and then probably send the separated PDFs back to them.

Thanks,
Abhishek
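
Editor's sketch: for the splitting step described above, the output file names can be planned separately from the PDF handling. The example below assumes the employee ID of each page has already been extracted (with `null` for pages where nothing was found). The class name `PayslipFileNamer`, the `unmatched_page_` fallback and the `_2` suffix for repeated IDs are illustrative choices, not LiveCycle behaviour.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: decides the output file name for every page of the stitched PDF.
class PayslipFileNamer {

    // idsByPage.get(i) is the employee ID found on page i + 1, or null if none was found.
    static List<String> planNames(List<String> idsByPage) {
        Map<String, Integer> timesSeen = new HashMap<>();
        List<String> names = new ArrayList<>();
        for (int i = 0; i < idsByPage.size(); i++) {
            String id = idsByPage.get(i);
            if (id == null || id.isEmpty()) {
                // No ID: keep the page, but name it so it can be reviewed by hand.
                names.add(String.format("unmatched_page_%05d.pdf", i + 1));
            } else {
                // Repeated ID: add a counter so an earlier file is never overwritten.
                int count = timesSeen.merge(id, 1, Integer::sum);
                names.add(count == 1 ? id + ".pdf" : id + "_" + count + ".pdf");
            }
        }
        return names;
    }

    public static void main(String[] args) {
        List<String> ids = Arrays.asList("104582", null, "104582", "200317");
        System.out.println(planNames(ids));
    }
}
```

With 25k pages, some unreadable or duplicated IDs are likely, so it seems safer to give those pages a predictable name than to drop or overwrite them.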

Duane Nickull

Aug 31, 2022, 6:23:56 PM
to Abhishek Kolluru, Adam D., live...@googlegroups.com
Abhishek:

I understand. My first question would be: "Is there any way that ADP can just send you non-scanned (i.e. embedded-text) documents?" That would be the best option, IMO, from a 50,000 ft architectural perspective.

If not (which is probably the case), your best bet will be using OCR and then, as a safety step, matching the OCR values against a master list to verify them. I am sure the error rate will be relatively low.

If ADP provides these in a semi-deterministic format (i.e. the employee data is always at the top left, rather than in a random page orientation), you may be able to write your own custom LC module as well. My good friend Scott MacDonald, who wrote most of the LCES examples, works for Amazon now, but he probably has an example similar to that somewhere (not sure if Adobe still publishes them).

Duane
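
Editor's sketch: the safety step of matching OCR values against a master list could look roughly like this. It assumes purely numeric employee IDs, so letters that OCR commonly confuses with digits (O/0, I/1, l/1, S/5, B/8) can be mapped back before the lookup. The class name `OcrIdVerifier` and the confusion table are assumptions for illustration only.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Optional;
import java.util.Set;

// Hypothetical helper: checks an OCR-read employee ID against a master list.
class OcrIdVerifier {

    // Assumption: IDs are all digits, so look-alike letters can be mapped to digits.
    static String normalize(String raw) {
        StringBuilder sb = new StringBuilder();
        for (char c : raw.trim().toCharArray()) {
            switch (c) {
                case 'O': case 'o': sb.append('0'); break;
                case 'I': case 'l': sb.append('1'); break;
                case 'S': sb.append('5'); break;
                case 'B': sb.append('8'); break;
                default: sb.append(c);
            }
        }
        return sb.toString();
    }

    // Returns the matching master ID, or empty if the value cannot be verified.
    static Optional<String> verify(String ocrValue, Set<String> masterIds) {
        if (ocrValue == null) {
            return Optional.empty();
        }
        String trimmed = ocrValue.trim();
        if (masterIds.contains(trimmed)) {
            return Optional.of(trimmed);
        }
        String normalized = normalize(trimmed);
        return masterIds.contains(normalized) ? Optional.of(normalized) : Optional.empty();
    }

    public static void main(String[] args) {
        Set<String> master = new HashSet<>(Arrays.asList("104582", "200317"));
        System.out.println(verify("1O4S82", master)); // letter O and S misread by OCR
        System.out.println(verify("999999", master)); // not in the master list
    }
}
```

Values that still fail the lookup after normalization are better flagged for manual review than guessed at.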