OCR multiple pngs into one PDF


jollysalmon

Jun 14, 2025, 9:04:20 AM
to tesseract-ocr
# Steps to reproduce

```bash
tesseract pngs.txt "$name" -l ita pdf
```

# Error

```
Page 0 : /storage/emulated/0/Download/tmp/CIRCOLARE-SPOSTAMENTO-CLASSI-DAL-30-09-2024-_-1D-AG-EN-_-2D-AG-EN-_-1LA-_-2LA-_-3LA.pdf-1.png
pdf to convert: /storage/emulated/0/Download/tmp/CIRCOLARE-SPOSTAMENTO-CLASSI-DAL-30-09-2024-_-1D-AG-EN-_-2D-AG-EN-_-1LA-_-2LA-_-3LA.pdf.txt
Syntax Warning: May not be a PDF file (continuing anyway)
Syntax Error: Couldn't find trailer dictionary
Syntax Error: Couldn't find trailer dictionary
Syntax Error: Couldn't read xref table
Error in findFileFormatStream: truncated file
Error during processing.
pdf to convert: /storage/emulated/0/Download/tmp/CIRCOLARE-SPOSTAMENTO-CLASSI-DAL-30-09-2024-_-1D-AG-EN-_-2D-AG-EN-_-1LA-_-2LA-_-3LA.pdf.pdf
converting CIRCOLARE-SPOSTAMENTO-CLASSI-DAL-30-09-2024-_-1D-AG-EN-_-2D-AG-EN-_-1LA-_-2LA-_-3LA.pdf
.pdf to png...
Syntax Error: Document stream is empty
Error, could not create PDF output file: Operation not permitted
```

# Thoughts

I'm passing tesseract a list of PNG files. This worked while I was outputting text (`tesseract pngs.txt "$name" -l ita txt`), but when I tried the same thing with `pdf` as the output it didn't work :/

I know there are lots of tools that use tesseract for this, but I prefer doing it with tesseract + combo of other tools if necessary so that I get better/easier control over tesseract itself.
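For reference, the list-file approach described above could be sketched roughly as follows. The function names `make_page_list` and `ocr_list_to_pdf` are hypothetical, and the sketch assumes a `sort` that supports `-V` (as GNU sort does) so that page 10 sorts after page 2 even when page numbers are not zero-padded.

```shell
# Write every PNG directly inside a directory to a list file,
# one path per line, in page order.
make_page_list() {
  local dir=$1 list=$2
  find "$dir" -maxdepth 1 -type f -name '*.png' | sort -V > "$list"
  [ -s "$list" ]   # non-zero exit status when no page was found
}

# OCR all listed pages into one searchable PDF named "<outbase>.pdf".
# tesseract treats a text file of image paths as a multi-page input.
ocr_list_to_pdf() {
  local list=$1 outbase=$2 langs=$3
  tesseract "$list" "$outbase" -l "$langs" pdf
}
```

Something like `make_page_list "$PWD" pngs.txt && ocr_list_to_pdf pngs.txt "$name" ita` would then replace the single command from the steps above. Passing an output base without a `.pdf` suffix avoids a result named `something.pdf.pdf`.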

Thank you in advance, I'm sure this must be something common but I just can't seem to get it right!


Zdenko Podobny

Jun 14, 2025, 2:13:31 PM
to tesser...@googlegroups.com
Is this "AI" generated?

The errors you provided are not the output of the "Steps to reproduce".

Zdenko



Alessandro Griseta

Jul 6, 2025, 10:17:47 AM
to tesseract-ocr
Sorry about that, looks like it wasn't so clear at all. Anyway, I ended up completing a script, so here it is:

```bash
lang="mt"
#pp replace original file
# Ask for a language only when none is preset above.
if [ -z "$lang" ]; then
  read -r -p "Choose lang (gr/la/it/en/dz-la/dz-gr/mt): " lang
fi #x cp } deb
# Reject anything the case statement in the loop does not handle.
case $lang in
  gr|la|it|en|dz-la|dz-gr|mt) ;;
  *) echo "unknown lang: $lang" >&2; exit 1 ;;
esac
# b as well as presets, allow direct input > this will make cases statements less verbose / eliminate them entirely
#check if $lang has a value first before asking <- will need to accept cmd parameters if it turns into a shell script
#pp add option to choose output (pdf/txt etc.)

root="$HOME/Downloads/" #root="/storage/emulated/0/Download/tmp/" #pp cross-platform <- use $HOME env var
mkdir -p "$root"; cd "$root" || exit 1
find "$root" -name "*.png" -type f -delete

# Match real PDFs only: grepping for "\.pdf" also picked up pdfs.txt and
# earlier tesseract outputs such as NAME.pdf.txt and NAME.pdf.pdf.
find "$root" -type f -name "*.pdf" ! -name "*.pdf.pdf" > pdfs.txt
while IFS= read -r pdf; do

echo "pdf to convert: $pdf"
# basename keeps the ".pdf" suffix, so tesseract writes "$name.pdf" (NAME.pdf.pdf)
name="$(basename "$pdf")"
echo "converting $name to png..."
pdftoppm "$pdf" "$name" -png #-f 12 -l 17 #i# ^ cpu cores
# List only this PDF's pages; PNGs from earlier PDFs are still on disk.
find "$root" -type f -name "*.png" | grep -F "/$name-" | sort > pngs.txt
case $lang in

  gr)
    tesseract pngs.txt "$name" -l lat+grc+ita pdf
    ;;

  la)
    tesseract pngs.txt "$name" -l lat+ita pdf
    ;;

  it)
    tesseract pngs.txt "$name" -l ita pdf
    ;;
  en)
    tesseract pngs.txt "$name" -l eng+ita pdf
    ;;
  dz-la)
    tesseract pngs.txt "$name" -l lat+eng pdf
    ;;
  dz-gr)
    tesseract pngs.txt "$name" -l grc+eng pdf
    ;;
  mt)
    tesseract pngs.txt "$name" -l ita+equ --tessdata-dir "$HOME/Downloads/tessdata" pdf
    ;;
esac
#find $root -name "*.png" -type f -delete
done <pdfs.txt
```
It's not much, but it's good enough for me to paste into a terminal (I have also tested it on Termux).
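The note in the script about allowing direct input alongside the presets could be sketched like this. `resolve_langs` is a hypothetical helper, not part of the script above: it maps each preset to the same `-l` string the script uses and passes anything else through unchanged. The extra `--tessdata-dir` option that the `mt` preset needs is left out of this sketch.

```shell
# Map a preset name to a tesseract -l string.
# Any other non-empty value is treated as a ready-made language
# string (for example "deu+eng") and returned unchanged.
resolve_langs() {
  case "$1" in
    gr)    echo "lat+grc+ita" ;;
    la)    echo "lat+ita" ;;
    it)    echo "ita" ;;
    en)    echo "eng+ita" ;;
    dz-la) echo "lat+eng" ;;
    dz-gr) echo "grc+eng" ;;
    mt)    echo "ita+equ" ;;
    "")    return 1 ;;      # nothing given: let the caller decide what to do
    *)     echo "$1" ;;     # direct input
  esac
}

# Possible use inside the loop, replacing the whole case statement:
#   langs="$(resolve_langs "$lang")" || exit 1
#   tesseract pngs.txt "$name" -l "$langs" pdf
```

With this shape, adding a preset is a one-line change, and the command that actually runs tesseract appears only once.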