Parsing CGWB Groundwater Data PDFs


Saloni Taneja

Jul 21, 2025, 4:34:51 AM
to datameet
Hi everyone,

I’ve been trying to parse the compiled PDFs uploaded by the CGWB here (specifically the ones under “4. Water Level Data”) which contain four readings per monitoring well per year. However, I’ve run into an issue with overlapping text across columns, which is leading to jumbled or misaligned outputs.

For instance, on page 5 of the file titled “August Ground Water Level 1994–2023”, the district “Dr. B.R. Ambedkar Konaseema” appears as “Dr. B.R. Ambedkar Konaseem”, with the missing "a" mistakenly attached to the start of the following block name. Camelot (Python) detects these characters but struggles to place them correctly, likely because overlapping text layers in the PDF are assigned nearly identical coordinates, causing cell misassignments. Another example is every row corresponding to "Dadra and Nagar Haveli and Daman and Diu".
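To make the failure mode concrete, here is a minimal post-processing sketch (not a fix for the extraction itself). It assumes you already have the district cell and the following block cell as strings, plus a list of correct district names from some other source. The function name `repair_split_cells` and the block name in the usage note are hypothetical and used only for illustration.

```python
def repair_split_cells(left, right, known_names):
    """Move leading characters of `right` back onto `left` when doing so
    completes a name from `known_names`. Returns the (left, right) pair,
    unchanged if no repair applies."""
    if not left or left in known_names:
        return left, right
    for name in known_names:
        # `left` must be a strict prefix of a known name ...
        if name.startswith(left) and len(name) > len(left):
            missing = name[len(left):]
            # ... and the missing tail must sit at the start of `right`.
            if right.startswith(missing):
                return name, right[len(missing):]
    return left, right
```

For example, `repair_split_cells("Dr. B.R. Ambedkar Konaseem", "aExampleBlock", ["Dr. B.R. Ambedkar Konaseema"])` would return the full district name and `"ExampleBlock"`. This only helps when the stray characters land in the adjacent cell in order; rows where text from several columns is interleaved would need a coordinate-based approach instead.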

I wanted to check:
  1. Has anyone here successfully parsed this dataset before?
  2. Am I understanding the complexity of scraping this correctly?
  3. Does anyone have a contact at CGWB who might be able to share the original Excel files? The PDFs appear to have been exported via iLovePDF from XLSX files. Since these files are already publicly available, I’m hoping the CGWB might be open to sharing the source formats directly, but I'm worried the turnaround times might vary.
Any help, advice, or pointers would be really appreciated. Thanks so much!

Best,

sreeram kandimalla

Jul 21, 2025, 5:31:50 AM
to data...@googlegroups.com
Camelot is nice and lightweight but is currently unmaintained. https://github.com/datalab-to/marker is a good alternative. It's a mix of OCR and PDF parsing and can use LLMs to correct thorny cases. Here is an example invocation for a different dataset: https://github.com/publicmap/amche-atlas/issues/104#issuecomment-2842058569

--
Datameet is a community of Data Science enthusiasts in India. Know more about us by visiting http://datameet.org

sreeram kandimalla

Jul 21, 2025, 5:38:07 AM
to data...@googlegroups.com
Amazon Textract (paid) and MuPDF are a couple of other alternatives to consider. In my experience, Amazon Textract is the best available tool.
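
Whichever extractor is used, one common way to sidestep automatic table detection is to take word-level bounding boxes and assign each word to a column by its horizontal midpoint. The sketch below is pure Python and assumes the words arrive as `(x0, y0, x1, y1, text)` tuples, which is the shape of the first five fields that PyMuPDF's `page.get_text("words")` returns. The function name `words_to_rows`, the `column_edges` values, and the `row_tolerance` default are assumptions to be tuned per file, and this may not help if the overlapping glyphs carry wrong coordinates in the PDF itself.

```python
from bisect import bisect_right
from collections import defaultdict


def words_to_rows(words, column_edges, row_tolerance=3.0):
    """Group (x0, y0, x1, y1, text) word boxes into table rows.

    column_edges: sorted x positions separating adjacent columns.
    row_tolerance: max difference in y0 (from the row's first word)
                   for a word to count as part of the same row.
    """
    rows = []  # each entry: (reference y0, {column index: [(x0, text), ...]})
    for x0, y0, x1, y1, text in sorted(words, key=lambda w: (w[1], w[0])):
        # Column index is decided by the word's horizontal midpoint.
        col = bisect_right(column_edges, (x0 + x1) / 2)
        if rows and abs(rows[-1][0] - y0) <= row_tolerance:
            cells = rows[-1][1]
        else:
            cells = defaultdict(list)
            rows.append((y0, cells))
        cells[col].append((x0, text))

    n_cols = len(column_edges) + 1
    # Within each cell, join words left to right.
    return [
        [" ".join(t for _, t in sorted(cells.get(c, []))) for c in range(n_cols)]
        for _, cells in rows
    ]
```

With a single boundary at x = 100, words whose midpoints fall left of 100 go to the first column and the rest to the second. The boundaries would have to be read off the page layout by hand for each CGWB file.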