PDF to text

Hugh Myrie

Jan 22, 2025, 11:08:51 AM
to golang-nuts
I want to extract text from a PDF and preserve any tables, or at least convert them to CSV. I am using the PDFtoText package (which uses the Poppler software). The text is extracted vertically (i.e. one column at a time), with each piece of text separated by a space. There are no line breaks, which makes the output difficult to manipulate. I want to extract the text horizontally, row by row, to preserve the table rows and possibly add line breaks to allow for further manipulation.

Your help in this matter is appreciated. Suggest alternatives if available.

Here is the Go code:

package main

import (
    "fmt"
    "log"
    "os"

    pdftotext "github.com/heussd/pdftotext-go"
)

func main() {
    // Replace "test.pdf" with the path to your PDF file
    pdfPath := "test.pdf"
    // Read the whole file into memory; Extract takes the raw bytes,
    // so a separate os.Open call is not needed
    content, err := os.ReadFile(pdfPath)
    if err != nil {
        log.Fatalf("Failed to read PDF file: %v", err)
    }
    // Extract text from the PDF file
    text, err := pdftotext.Extract(content)
    if err != nil {
        log.Fatalf("Failed to extract text from PDF file: %v", err)
    }
    // Print the extracted text
    fmt.Println(text)
}
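
One possible alternative to the Extract call above, as a minimal sketch: Poppler's own pdftotext command has a -layout option that tries to keep the physical layout of the page, so table rows tend to come out as lines rather than one column at a time. The sketch assumes the pdftotext binary is installed and on PATH. The helper names (layoutArgs, splitLines, extractLayout) are made up for illustration, and how well the layout survives depends on the PDF.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"os"
	"os/exec"
	"strings"
)

// layoutArgs builds the arguments for Poppler's pdftotext CLI.
// "-layout" asks it to keep the physical page layout and the
// trailing "-" sends the text to stdout instead of a file.
func layoutArgs(pdfPath string) []string {
	return []string{"-layout", pdfPath, "-"}
}

// splitLines turns the raw output into one string per line and drops
// the form-feed characters that pdftotext uses as page separators.
func splitLines(out []byte) []string {
	var lines []string
	sc := bufio.NewScanner(bytes.NewReader(out))
	for sc.Scan() {
		lines = append(lines, strings.ReplaceAll(sc.Text(), "\f", ""))
	}
	return lines
}

// extractLayout runs pdftotext (assumed to be on PATH) and returns
// the extracted text line by line.
func extractLayout(pdfPath string) ([]string, error) {
	out, err := exec.Command("pdftotext", layoutArgs(pdfPath)...).Output()
	if err != nil {
		return nil, fmt.Errorf("pdftotext failed: %w", err)
	}
	return splitLines(out), nil
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: layout file.pdf")
		return
	}
	lines, err := extractLayout(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for i, line := range lines {
		fmt.Printf("%3d: %s\n", i+1, line)
	}
}
```

With line breaks preserved, each line can then be split into cells in a later step.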

Edgar Madrigal

Jan 22, 2025, 5:47:17 PM
to golang-nuts
The documentation for the Extract function (https://pkg.go.dev/github.com/heussd/pdftotext-go#Extract) says: "Extract PDF text content in simplified format".
That might mean it returns text only, not tables etc. You might get better support by raising an issue at https://github.com/heussd/pdftotext-go/issues to get more information.
Also, an LLM like Gemini or ChatGPT might point you in a good direction:

Mike Schinkel

Jan 22, 2025, 10:26:25 PM
to Hugh Myrie, GoLang Nuts Mailing List
Hi Hugh,

I have been planning to do some Go work with PDF files, so your email triggered me to do some research.

Not sure if using heussd/pdftotext-go is critical to you, or if you are just trying to read text in a PDF? I tried to get pdftotext installed, but my dev laptop is still running macOS Monterey and I couldn't get it working, so I looked for other options.

If you are just interested in reading PDF text and do not have a specific need to use pdftotext, then one of the others I looked at might work. I came across a package originally developed by Russ Cox that was forked by many others. To evaluate it, I forked one of those forks and converted it from using a reader to returning a slice of strings so I could easily split out the newlines. (I could probably have made it work with the reader, but I was just going for quick.)

If you think it can help your use-case, please check it out (but be aware, my additions to the forked code are rather hacky):


-Mike

--
You received this message because you are subscribed to the Google Groups "golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/golang-nuts/c19e212d-a81f-4525-ae0d-a9abb0b292fbn%40googlegroups.com.

Hugh Myrie

Jan 23, 2025, 7:30:33 AM
to Mike Schinkel, GoLang Nuts Mailing List
Hi Mike,

Thanks for the suggestion! I'm interested in checking out your forked code. It seems like a good alternative to what I'm currently using.

Hugh

Michael Bright

Jan 23, 2025, 12:49:17 PM
to golang-nuts

Hi Mike,

Not wanting to suggest that you take the Python route, but just sharing my experience.

I've tried Acrobat Reader's "Save as Text" functionality, and also one or two Python libraries to extract text from PDFs (PyPDF2 is the one I've settled on).

But what I learnt, without really digging into the issue, is that PDF is a pretty weird format where text from the same sentence/paragraph "floats around" as separate objects.
Bottom line: no matter what tool you use, you may find it really tricky to get polished text from what seems like a simple PDF.

That said ... please, please prove me wrong!
If anyone has a good PDF extraction tool in an easy-to-use form, I'm interested.

My own use case is to extract text from some partner training materials which I regularly deliver, so that I can do a diff to see what changed between releases (obviously if they actually summarized the changes that would be ideal, but they point me to pdfdiff ... yuk).

I have some scripting that just about works, but it's an absolute pain having
- sentences/paragraphs broken up into multiple lines (and not the same way across releases)
- embedded code (in boxes) that is indistinguishable from other text.
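
For the first pain point, one common workaround is to undo the hard line wraps before diffing, so that a paragraph wrapped differently in two releases still compares equal. This is only a sketch: normalizeParagraphs is a made-up helper, and it assumes the extractor leaves a blank line between paragraphs, which is not guaranteed for every PDF.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeParagraphs re-joins hard-wrapped lines so that each paragraph
// becomes a single line with single spaces between words. A blank line
// is treated as a paragraph break.
func normalizeParagraphs(text string) []string {
	var paras []string
	var cur []string
	flush := func() {
		if len(cur) > 0 {
			paras = append(paras, strings.Join(cur, " "))
			cur = nil
		}
	}
	for _, line := range strings.Split(text, "\n") {
		words := strings.Fields(line)
		if len(words) == 0 {
			flush() // blank line ends the current paragraph
			continue
		}
		cur = append(cur, words...)
	}
	flush()
	return paras
}

func main() {
	// The same content, wrapped differently in two hypothetical releases.
	v1 := "The quick brown\nfox jumps.\n\nSecond paragraph."
	v2 := "The quick\nbrown fox jumps.\n\nSecond  paragraph."

	a, b := normalizeParagraphs(v1), normalizeParagraphs(v2)
	for i := range a {
		fmt.Printf("paragraph %d equal: %v\n", i+1, i < len(b) && a[i] == b[i])
	}
}
```

The normalized paragraphs can be written one per line and fed to an ordinary diff tool. This does nothing for the second pain point, since the box around embedded code is lost at extraction time.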

My 2cts.py,
another Mike.

robert engels

Jan 23, 2025, 12:56:29 PM
to Michael Bright, golang-nuts
You typically can’t convert a PDF to text and do what you are trying to do.

Look for PDF to XML converters - you need the “blocks” and the hierarchy in order to interpret most PDFs with any sort of complex formatting.

But even with XML, tables may not work, because there is no guarantee that the PDF authoring tool provided the table metadata, which is why most really good PDF -> XML converters use OCR and try and find the tables that way. There are several AI/ML based automated OCR tools that do a pretty good job.

Often though, a user/system creates a “parsing template” for the various documents it wants to parse (i.e. forms) and adds the additional metadata (e.g. identifies fields, and tables) for how it should be interpreted.
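
A parsing template of this kind could be sketched as a list of named rectangles on the page, with each positioned word assigned to the rectangle it falls in. Everything here (the Region type, the field names, the coordinates) is hypothetical and only meant to show the idea, not any particular product's template format.

```go
package main

import "fmt"

// Word is a piece of text with the coordinates of its top-left corner.
type Word struct {
	X, Y float64
	Text string
}

// Region is one entry of a hypothetical parsing template:
// a named rectangle on the page.
type Region struct {
	Name                   string
	XMin, YMin, XMax, YMax float64
}

// applyTemplate collects, per region name, the words that fall inside
// that region, in input order. Words outside every region are ignored.
func applyTemplate(words []Word, tmpl []Region) map[string][]string {
	out := make(map[string][]string)
	for _, w := range words {
		for _, r := range tmpl {
			if w.X >= r.XMin && w.X < r.XMax && w.Y >= r.YMin && w.Y < r.YMax {
				out[r.Name] = append(out[r.Name], w.Text)
				break
			}
		}
	}
	return out
}

func main() {
	// Template for one known form layout (made-up coordinates).
	tmpl := []Region{
		{Name: "invoice_no", XMin: 400, YMin: 40, XMax: 560, YMax: 60},
		{Name: "customer", XMin: 50, YMin: 100, XMax: 300, YMax: 140},
	}
	words := []Word{
		{X: 410, Y: 45, Text: "INV-001"},
		{X: 60, Y: 105, Text: "Acme"},
		{X: 95, Y: 105, Text: "Ltd"},
		{X: 60, Y: 300, Text: "footer"},
	}
	fields := applyTemplate(words, tmpl)
	fmt.Println("invoice_no:", fields["invoice_no"])
	fmt.Println("customer:", fields["customer"])
}
```

The trade-off is that a template only works for documents that share the layout it was drawn for.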


Sharon Mafgaoker

Jan 23, 2025, 1:57:34 PM
to robert engels, Michael Bright, golang-nuts
Hey,

I’m using 

I send my PDF and get back the extracted text as a JSON object.

It works fast and is not expensive 🙏

I hope this helps.

Sharon Mafgaoker – Senior Solutions Architect 

M. 050 995 99 16 | sha...@cloud5.co.il




Duncan Harris

Jan 23, 2025, 6:31:44 PM
to golang-nuts
Amusingly we wrote our PDF table extractor largely in Go: https://pdftables.com/
It identifies tables and cells by looking at the statistical distribution of glyph boundaries on the pages
rather than inferring anything from the way the text is logically grouped within the PDF.
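
As a much-simplified illustration of the general idea (not the actual pdftables.com algorithm, which is not described in detail here), one can count how many lines have a character at each horizontal offset and treat offsets that are empty on every line as gutters between columns. The sketch works on fixed-width text such as layout-preserving extractor output and assumes plain ASCII; columnStarts and splitAt are made-up names.

```go
package main

import (
	"fmt"
	"strings"
)

// columnStarts returns the character offsets where columns begin.
// A column starts at the first occupied offset and after every run of
// at least minGap offsets that are blank on all lines.
// Offsets are bytes, so this assumes ASCII text.
func columnStarts(lines []string, minGap int) []int {
	width := 0
	for _, l := range lines {
		if len(l) > width {
			width = len(l)
		}
	}
	occupied := make([]int, width) // lines with a non-space at each offset
	for _, l := range lines {
		for i := 0; i < len(l); i++ {
			if l[i] != ' ' {
				occupied[i]++
			}
		}
	}
	var starts []int
	gap := minGap // treat the left margin as a gutter
	for i, n := range occupied {
		if n == 0 {
			gap++
			continue
		}
		if gap >= minGap {
			starts = append(starts, i)
		}
		gap = 0
	}
	return starts
}

// splitAt cuts one line at the given offsets and trims each cell.
func splitAt(line string, starts []int) []string {
	cells := make([]string, len(starts))
	for i, s := range starts {
		end := len(line)
		if i+1 < len(starts) && starts[i+1] < end {
			end = starts[i+1]
		}
		if s < end {
			cells[i] = strings.TrimSpace(line[s:end])
		}
	}
	return cells
}

func main() {
	lines := []string{
		"Item name      Qty   Price",
		"Blue widget      2    9.50",
		"Red gear        10   12.00",
	}
	starts := columnStarts(lines, 2)
	for _, l := range lines {
		fmt.Println(strings.Join(splitAt(l, starts), "|"))
	}
}
```

Requiring a gutter of at least two blank offsets keeps the single spaces inside "Item name" or "Blue widget" from being mistaken for column boundaries.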

There are many approaches including at least one which just renders all PDFs as an image and then passes that to an LLM vision model API.

Duncan

Robert Engels

Jan 23, 2025, 6:45:16 PM
to Duncan Harris, golang-nuts
Glyph boundaries maintain the positional information, but you still need to effectively treat the page as an image; it’s just very coarse. Which leads to the OCR/vision AI model.

If the PDF author is intentionally hindering the ability to “grab the data”, then there is no text at all: the page is an image that must go through OCR or a vision model to be decoded.

I have a service that converts regular PDFs to this format for those that are interested :)

Hugh Myrie

Jan 23, 2025, 7:29:13 PM
to Michael Bright, golang-nuts
Hi Michael,

You're absolutely right, PDF extraction can be a real headache!
I've tried Mike's suggestion, but unfortunately, it didn't quite work as I'd hoped – it put each character on a separate line, which made it just as difficult to work with.

I think I'll give OCR a shot and see if that yields better results. If that doesn't pan out, I might explore some Python libraries, as you suggested.

Thanks again for your input, it's much appreciated!

Best regards,
Hugh



Robert Solomon

Jan 25, 2025, 11:36:10 AM
to golang-nuts
Adobe's Acrobat can extract to docx and xlsx. Not a cheap option, but it does work.