Hi everyone,
I’ve been trying to parse the compiled PDFs uploaded by the CGWB here (specifically the ones under “4. Water Level Data”), which contain four readings per monitoring well per year. However, I’ve run into overlapping text across columns, which produces jumbled or misaligned output.
For instance, on page 5 of the file titled “August Ground Water Level 1994–2023”, the district “Dr. B.R. Ambedkar Konaseema” appears as “Dr. B.R. Ambedkar Konaseem”, with the missing “a” mistakenly attached to the start of the following block name. Camelot (Python) detects these characters but struggles to assign them to the right cells, likely because overlapping text layers in the PDF have nearly identical coordinates. The same problem affects every row corresponding to “Dadra and Nagar Haveli and Daman and Diu”.
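In case it helps anyone reproduce this, here is a minimal sketch of how the overlap hypothesis could be checked. It is not part of Camelot: `find_overlapping_chars` is a hypothetical helper, the tolerances are guesses, and the sample coordinates are made up for illustration. It assumes each character is a dict with `text`, `x0`, `x1` and `top` keys, which is the shape pdfplumber uses for `page.chars`.

```python
from itertools import combinations


def find_overlapping_chars(chars, y_tol=1.0, min_overlap=0.5):
    """Return (char_a, char_b, x0_of_a) for characters on the same text
    line whose boxes overlap horizontally.

    y_tol and min_overlap are illustrative defaults, not tuned values.
    """
    overlaps = []
    for a, b in combinations(chars, 2):
        if abs(a["top"] - b["top"]) > y_tol:
            continue  # different text lines
        # Width of the horizontal intersection (negative if the boxes are disjoint)
        shared = min(a["x1"], b["x1"]) - max(a["x0"], b["x0"])
        narrower = min(a["x1"] - a["x0"], b["x1"] - b["x0"])
        if narrower > 0 and shared / narrower >= min_overlap:
            overlaps.append((a["text"], b["text"], round(a["x0"], 1)))
    return overlaps


# Made-up coordinates: the trailing "a" of one cell sits almost exactly
# on top of the first letter of the next cell.
sample = [
    {"text": "m", "x0": 95.0, "x1": 101.0, "top": 200.0},
    {"text": "a", "x0": 101.0, "x1": 105.0, "top": 200.0},
    {"text": "K", "x0": 101.2, "x1": 106.0, "top": 200.1},
    {"text": "o", "x0": 106.0, "x1": 110.0, "top": 200.1},
]
print(find_overlapping_chars(sample))  # [('a', 'K', 101.0)]
```

With pdfplumber installed, the input could come from something like `pdfplumber.open(path).pages[4].chars` (page 5, zero-indexed); what it reports on the CGWB file is untested here. The pairwise comparison is quadratic in the number of characters, so it is only meant for inspecting one page at a time.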
I wanted to check:
- Has anyone here successfully parsed this dataset before?
- Am I understanding the complexity of scraping this correctly?
- Does anyone have a contact at CGWB who might be able to share the original Excel files? The PDFs appear to have been exported via iLovePDF from XLSX files. Since these files are already publicly available, I’m hoping the CGWB might be open to sharing the source formats directly, though I’m worried the turnaround time could be long or unpredictable.
Any help, advice, or pointers would be really appreciated. Thanks so much!
Best,