I'm trying to get started with Tesseract and OCR to make my life a bit easier. I'll try to be as descriptive as possible.
Basically what I'm trying to do:
My friends and I are playing F1 together on PS5, and I have a Google Sheets document with all the stats from our races. Link to document: F1 Google Sheets stats
Right now I'm typing in all the data myself, which is super tedious and time-consuming. I want to load a screenshot into Tesseract and get the data ready to copy-paste into the document, to make the process more automatic. (Example at the bottom of this post.)
What I want to do:
I want to parse the data from the screenshots. All the data is already known, and the screenshots will be clear 1080p pictures. I know the names of all the drivers and teams. The lap times are in the format d:dd.ddd
and the gap times are in the format +d.ddd (possibly +dd.ddd),
where d = a single digit.
For every position 1-20, I want the output to contain the position, the name of the driver, the team, the lap time and the gap time to the leader.
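As a rough sketch of the parsing step described above, the following assumes that Tesseract already returns one text line per row in the same column order as the expected output at the bottom of this post. The names `KNOWN_TEAMS`, `ROW` and `parse_row` are made up for illustration, and treating the tyre column as a single capital letter is an assumption based on the example.

```python
import re

# Team names taken from the example at the bottom of the post.
KNOWN_TEAMS = [
    "Mercedes-AMG Petronas", "Red Bull", "Ferrari", "McLaren", "AlphaTauri",
    "Alpine", "Aston Martin", "Alfa Romeo", "Haas", "Williams",
]

LAP_TIME = r"\d:\d{2}\.\d{3}"      # d:dd.ddd
GAP_TIME = r"\+\d{1,2}\.\d{3}"     # +d.ddd or +dd.ddd

# Position, then "driver + team" as one chunk, tyre letter, best lap, gap.
ROW = re.compile(
    rf"^(?P<pos>\d{{1,2}})\s+(?P<body>.+?)\s+(?P<tyre>[A-Z])\s+"
    rf"(?P<best>{LAP_TIME}|No Time)\s+(?P<gap>{GAP_TIME}|-)$"
)


def parse_row(line):
    """Split one OCR'd row into its fields, or return None if it does not fit."""
    match = ROW.match(line.strip())
    if match is None:
        return None
    body = match.group("body")
    # The team is known in advance, so strip it from the end of the chunk.
    # Longest names are tried first in case one team name ends with another.
    for team in sorted(KNOWN_TEAMS, key=len, reverse=True):
        if body.endswith(" " + team):
            return {
                "pos": int(match.group("pos")),
                "driver": body[: -len(team)].strip(),
                "team": team,
                "tyre": match.group("tyre"),
                "best": match.group("best"),
                "gap": match.group("gap"),
            }
    return None
```

For example, `parse_row("10 Fernando Alonso Alpine W 1:27.662 +1.515")` would give a dictionary with the driver and team separated. Rows that do not match the known formats come back as `None`, so they can be checked by hand.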
What I've tried to do:
I'm on Windows, so I installed Tesseract 5.1.0 from the pre-built binaries. After some googling I got the feeling that Tesseract is easier to use on Linux, so I installed Ubuntu via WSL and installed Tesseract there as well.
But I'm very confused about what "LSTM" is and which training modules are deprecated/unsupported in Tesseract 5.
The tesstrain repo has "ocrd-testset.zip" with .tif files and text files that describe the expected output, so I did the same for my case. (I've included my F1 training files as a zip with this post.) I created a "data/foo-ground-truth" folder as described in the tesstrain README and ran "make training".
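To rule out problems with the ground-truth folder itself, a small check like the one below could be run before "make training". It is only a sketch and assumes, as I read the tesstrain README, that every .tif line image needs a .gt.txt file with the same base name and that each transcription is a single line of text. The function names are hypothetical.

```python
from pathlib import Path


def find_unpaired(ground_truth_dir):
    """Return (images without a .gt.txt, transcriptions without a .tif)."""
    root = Path(ground_truth_dir)
    images = {p.name[: -len(".tif")] for p in root.glob("*.tif")}
    texts = {p.name[: -len(".gt.txt")] for p in root.glob("*.gt.txt")}
    return sorted(images - texts), sorted(texts - images)


def multi_line_transcriptions(ground_truth_dir):
    """Return the .gt.txt files that do not hold exactly one non-empty line."""
    bad = []
    for path in sorted(Path(ground_truth_dir).glob("*.gt.txt")):
        content = path.read_text(encoding="utf-8")
        lines = [ln for ln in content.splitlines() if ln.strip()]
        if len(lines) != 1:
            bad.append(path.name)
    return bad
```

Calling `find_unpaired("data/foo-ground-truth")` and `multi_line_transcriptions("data/foo-ground-truth")` should both return empty lists if the folder follows that layout.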
Output:
find -L data/foo-ground-truth -name '*.gt.txt' | xargs paste -s > "data/foo/all-gt"
unicharset_extractor --output_unicharset "data/foo/unicharset" --norm_mode 2 "data/foo/all-gt"
Bad box coordinates in boxfile string! 10 Fernando Alonso Alpine W 1:27.662 +1.515
Extracting unicharset from plain text file data/foo/all-gt
Other case I of i is not in unicharset
Other case U of u is not in unicharset
Other case Z of z is not in unicharset
Other case Ä of ä is not in unicharset
Other case Ö of ö is not in unicharset
Other case X of x is not in unicharset
Wrote unicharset file data/foo/unicharset
PYTHONIOENCODING=utf-8 python3 generate_line_box.py -i "data/foo-ground-truth/alonso.tif" -t "data/foo-ground-truth/alonso.gt.txt" > "data/foo-ground-truth/alonso.box"
Traceback (most recent call last):
  File "generate_line_box.py", line 6, in <module>
    from PIL import Image
ModuleNotFoundError: No module named 'PIL'
Makefile:218: recipe for target 'data/foo-ground-truth/alonso.box' failed
make: *** [data/foo-ground-truth/alonso.box] Error 1
I'm quite stuck and don't know how to train Tesseract 5. Is it deprecated? Should I downgrade Tesseract to version 4 or 3? Am I missing some dependencies? Can anyone guide me on how to train Tesseract to do what I want?
Tesseract version:
Output in the terminal: (tesseract --version)
tesseract 5.1.0-32-gf36c0
leptonica-1.78.0
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
Found AVX2
Found AVX
Found FMA
Found SSE4.1
Found OpenMP 201511
Found libarchive 3.2.2 zlib/1.2.11 liblzma/5.2.2 bz2lib/1.0.6 liblz4/1.7.1
Found libcurl/7.58.0 OpenSSL/1.1.1 zlib/1.2.11 libidn2/2.0.4 libpsl/0.19.1 (+libidn2/2.0.4) nghttp2/1.30.0 librtmp/2.3
Python version:
Output in the terminal: (py --version)
Pythonpy ???
Python 3.6.9
Example:
The screenshots look like this:
Expected output:
Pos Driver Team Tyre Best Gap
1 Lewis Hamilton Mercedes-AMG Petronas W 1:26.147 -
2 Max Verstappen Red Bull W 1:26.383 +0.236
3 Bottas Mercedes-AMG Petronas W 1:26.431 +0.284
4 Sergio Perez Red Bull W 1:26.538 +0.391
5 Charles Leclerc Ferrari W 1:26.981 +0.834
6 Lando Norris McLaren W 1:27.274 +1.127
7 Daniel Ricciardo McLaren W 1:27.387 +1.240
8 Carlos Sainz Ferrari W 1:27.390 +1.243
9 Pierre Gasly AlphaTauri W 1:27.427 +1.280
10 Fernando Alonso Alpine W 1:27.662 +1.515
11 Yuki Tsunoda AlphaTauri W 1:27.812 +1.665
12 Esteban Ocon Alpine W 1:27.877 +1.730
13 Sebastian Vettel Aston Martin W 1:27.966 +1.819
14 Lance Stroll Aston Martin W 1:28.119 +1.972
15 Kimi Räikkönen Alfa Romeo W 1:28.561 +2.414
16 Antonio Giovinazzi Alfa Romeo W 1:28.632 +2.485
17 Mick Schumacher Haas W 1:28.694 +2.547
18 George Russell Williams W 1:28.981 +2.834
19 Nikita Mazepin Haas W 1:29.388 +3.241
20 Nicholas Latifi Williams W No Time -
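Since the driver names are known in advance, one idea is to snap whatever Tesseract reads to the closest known name instead of training a new model. This is only a sketch of that idea using `difflib` from the Python standard library. The list below is a subset of the names from the expected output, and the helper name `snap_to_known` and the cutoff of 0.6 are assumptions that would need tuning on real screenshots.

```python
import difflib

# Subset of the known driver names; the full list of 20 would go here.
KNOWN_DRIVERS = [
    "Lewis Hamilton", "Max Verstappen", "Charles Leclerc",
    "Fernando Alonso", "Kimi Räikkönen", "Sebastian Vettel",
]


def snap_to_known(text, candidates, cutoff=0.6):
    """Return the candidate most similar to text, or None if nothing is close."""
    matches = difflib.get_close_matches(text, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None


if __name__ == "__main__":
    # OCR often drops diacritics or confuses similar glyphs such as i/l/1.
    for raw in ["Kimi Raikkonen", "Lewls Hami1ton"]:
        print(raw, "->", snap_to_known(raw, KNOWN_DRIVERS))
```

The same helper could be used for team names by passing a list of teams as `candidates`.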