Hello!
I'm trying to reinstall some code that I wrote with the help of artificial intelligence and that uses Tesseract.
I had everything working (OCR scanning of minutes so that the files can later be renamed and moved into folders), but I have since changed offices and I'm on a different machine. When I try to set up the libraries there, Windows does not recognize the Tesseract installation.
I downloaded it from https://github.com/UB-Mannheim/tesseract/wiki and tried various versions. When I run tesseract -v or where tesseract on the command line, I get: 'tesseract' is not recognized as an internal or external command, operable program or batch file.
I also tried editing the environment variable (PATH), and that doesn't work either. I added the installation path C:\Program Files\Tesseract-OCR and checked that the files are there.
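To make the PATH check repeatable, here is a minimal sketch of what I could run from Python. It assumes the install folder is the one mentioned above; the function name diagnose_tesseract is just something I made up for this post.

```python
import os
import shutil

# Assumed install folder (the one mentioned above); adjust if it differs.
TESSERACT_DIR = r"C:\Program Files\Tesseract-OCR"


def diagnose_tesseract(install_dir=TESSERACT_DIR):
    """Report whether tesseract.exe exists and whether PATH can reach it."""
    exe_path = os.path.join(install_dir, "tesseract.exe")
    wanted = os.path.normcase(install_dir.rstrip("\\/"))
    # Split PATH into entries, dropping quotes and trailing slashes.
    entries = [
        os.path.normcase(p.strip().strip('"').rstrip("\\/"))
        for p in os.environ.get("PATH", "").split(os.pathsep)
        if p.strip()
    ]
    return {
        "exe_exists": os.path.isfile(exe_path),
        "dir_in_path": wanted in entries,
        "found_by_which": shutil.which("tesseract"),
    }


for key, value in diagnose_tesseract().items():
    print(f"{key}: {value}")
```

If exe_exists is True but dir_in_path is False, the process probably started before PATH was edited. As far as I know, a command prompt or PyCharm that was already open keeps the old PATH until it is restarted.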
I tried opening tesseract outside of Python: a window (like a black console) opens and closes immediately.
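As far as I understand, a console program that is double-clicked will open a window and close it as soon as it finishes, so that alone may not mean the install is broken. A rough sketch of how I could call the executable by its full path and capture what it prints (the path is my assumed default location, and tesseract_version is a hypothetical helper name):

```python
import os
import subprocess


def tesseract_version(exe_path):
    """Run `<exe_path> --version` and return its output, or None if the file is missing."""
    if not os.path.isfile(exe_path):
        return None
    completed = subprocess.run(
        [exe_path, "--version"], capture_output=True, text=True
    )
    # Some builds print the version to stderr, so both streams are combined.
    return (completed.stdout + completed.stderr).strip()


# Assumed default location; not yet confirmed on the new machine.
print(tesseract_version(r"C:\Program Files\Tesseract-OCR\tesseract.exe"))
```

If this prints version text, the executable itself works and only the PATH lookup is failing. If it prints None, the file is not where I think it is.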
I also tried running it from PyCharm (where I keep the code), and it gives me this error:
C:\Users\UNTREF\PycharmProjects\obtencion_de_valores\venv\Scripts\python.exe C:\Users\UNTREF\PycharmProjects\obtencion_de_valores\main.py
Introduce la ruta de la carpeta que contiene los archivos PDF: C:\Users\UNTREF\Desktop\prueba_excel
Procesando archivo: 1-49.pdf
Traceback (most recent call last):
  File "C:\Users\UNTREF\PycharmProjects\obtencion_de_valores\venv\Lib\site-packages\pytesseract\pytesseract.py", line 255, in run_tesseract
    proc = subprocess.Popen(cmd_args, **subprocess_args())
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\UNTREF\AppData\Local\Programs\Python\Python311\Lib\subprocess.py", line 1024, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "C:\Users\UNTREF\AppData\Local\Programs\Python\Python311\Lib\subprocess.py", line 1509, in _execute_child
    hp, ht, pid, tid = _winapi.CreateProcess(executable, args,
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [WinError 2] El sistema no puede encontrar el archivo especificado

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\UNTREF\PycharmProjects\obtencion_de_valores\main.py", line 33, in <module>
    text = pytesseract.image_to_string(page, lang='spa', config='--psm 4 --oem 1')
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\UNTREF\PycharmProjects\obtencion_de_valores\venv\Lib\site-packages\pytesseract\pytesseract.py", line 423, in image_to_string
    return {
           ^
  File "C:\Users\UNTREF\PycharmProjects\obtencion_de_valores\venv\Lib\site-packages\pytesseract\pytesseract.py", line 426, in <lambda>
    Output.STRING: lambda: run_and_get_output(*args),
                           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\UNTREF\PycharmProjects\obtencion_de_valores\venv\Lib\site-packages\pytesseract\pytesseract.py", line 288, in run_and_get_output
    run_tesseract(**kwargs)
  File "C:\Users\UNTREF\PycharmProjects\obtencion_de_valores\venv\Lib\site-packages\pytesseract\pytesseract.py", line 260, in run_tesseract
    raise TesseractNotFoundError()
pytesseract.pytesseract.TesseractNotFoundError: tesseract is not installed or it's not in your PATH. See README file for more information.
Process finished with exit code 1

Here is the Python code, although I don't think it's the problem:

import os
import re

import pytesseract
from pdf2image import convert_from_path

# Path of the folder that contains the PDF files to process
pdf_folder = input("Introduce la ruta de la carpeta que contiene los archivos PDF: ")

# Define the search patterns
libro_pattern = r'LIBRO:\s+(\d+)'
folio_pattern = r'FOLIO:\s+(\d+)'
no_pattern = r'No\s+(\d+)'
materia_pattern = r'MATERIA\s+:\s+(.*)\n'
docente_pattern = r'DOCENTE\s+:\s+(.*)\n'
fecha_pattern = r'FECHA\s+:\s+(\d{2}/\d{2}/\d{4})\s+'

# Iterate over every PDF file in the folder
for filename in os.listdir(pdf_folder):
    if filename.endswith('.pdf'):
        print(f"Procesando archivo: {filename}")

        # Path of the PDF file to process
        pdf_path = os.path.join(pdf_folder, filename)

        # Convert each page of the PDF to an image
        pages = convert_from_path(pdf_path)

        # List that stores the OCR result of each page
        results = []

        # Process each image with pytesseract
        for page in pages:
            text = pytesseract.image_to_string(page, lang='spa', config='--psm 4 --oem 1')
            results.append(text)

        # Search the first page's text for LIBRO, FOLIO, No, MATERIA, DOCENTE and FECHA
        libro_match = re.search(libro_pattern, results[0])
        folio_match = re.search(folio_pattern, results[0])
        no_match = re.search(no_pattern, results[0])
        materia_match = re.search(materia_pattern, results[0])
        docente_match = re.search(docente_pattern, results[0])
        fecha_match = re.search(fecha_pattern, results[0])

        # Extract the values that were found and print them
        if libro_match:
            libro = libro_match.group(1)
            print(f"LIBRO: {libro}")
        if folio_match:
            folio = folio_match.group(1)
            print(f"FOLIO: {folio}")
        if no_match:
            no = no_match.group(1)
            print(f"No: {no}")
        if materia_match:
            materia = materia_match.group(1)
            print(f"MATERIA: {materia}")
        if docente_match:
            docente = docente_match.group(1)
            print(f"DOCENTE: {docente}")
        if fecha_match:
            fecha = fecha_match.group(1)
            print(f"FECHA: {fecha}")

        print(f"Archivo {filename} procesado correctamente.\n")
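
One workaround I'm considering, in case it helps whoever answers: pytesseract reads the executable location from pytesseract.pytesseract.tesseract_cmd, so it can be pointed at the file directly instead of relying on PATH. This is only a sketch under my own assumptions. The candidate path is the install folder I mentioned, and resolve_tesseract_cmd and configure_pytesseract are hypothetical helper names, not part of pytesseract.

```python
import os
import shutil

# Assumed location of the executable; extend this list if it is installed elsewhere.
CANDIDATES = [
    r"C:\Program Files\Tesseract-OCR\tesseract.exe",
]


def resolve_tesseract_cmd(candidates=CANDIDATES):
    """Return a usable tesseract path: PATH lookup first, then the candidates."""
    on_path = shutil.which("tesseract")
    if on_path:
        return on_path
    for candidate in candidates:
        if os.path.isfile(candidate):
            return candidate
    return None


def configure_pytesseract():
    """Point pytesseract at the resolved executable, if one was found."""
    import pytesseract  # imported here so the lookup above works without it

    cmd = resolve_tesseract_cmd()
    if cmd is None:
        raise FileNotFoundError("tesseract.exe not found on PATH or in CANDIDATES")
    pytesseract.pytesseract.tesseract_cmd = cmd
    return cmd
```

The idea would be to call configure_pytesseract() once at the top of main.py, before the first image_to_string call. Since the script uses lang='spa', I assume the Spanish language data also has to be present in the new installation.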