Data Scraping from e-courts website

Lovish Sharma

Nov 8, 2020, 10:05:09 PM
to datameet
Hi,

I am working as an associate for an NGO working in the field of crimes against women. Currently I am doing research on crimes against women in prominent cities. For that, I need to scrape data from the e-courts website, https://districts.ecourts.gov.in/.

Kindly help me with that.

Nikhil VJ

Nov 9, 2020, 1:34:05 AM
to datameet
Hi Lovish,

Is there any link on this site where you are able to navigate through different data items? I only see pages where you have to enter a specific case code or similar details.
Please share more details with directions and examples - links or screenshots - of what you need.
Then somebody might be able to help.

--
Cheers,
Nikhil VJ
https://nikhilvj.co.in


--
Datameet is a community of Data Science enthusiasts in India. Know more about us by visiting http://datameet.org
---
You received this message because you are subscribed to the Google Groups "datameet" group.
To unsubscribe from this group and stop receiving emails from it, send an email to datameet+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/datameet/4c1b2005-8594-4afe-ba88-a8e7147a5798n%40googlegroups.com.

pmay...@gmail.com

Nov 9, 2020, 2:11:52 AM
to datameet
Hi Lovish,
My experience (only for district courts in MP) is that scraping is not possible. It is possible to look for all cases in which a specific IPC offence is involved (e.g. 376(D) Gang rape). But to find out what happened in each case, you must go to each seriatim, check what decision was made by the court and--if you're lucky--access the judgement made in the case. In MP, those judgements are in Hindi, rendered in KritiDev.
I've written a paper looking at some rape cases. Feel free to contact me directly.
best wishes,
Peter Mayer

Nikhil VJ

Nov 9, 2020, 3:20:14 AM
to datameet
Hi Peter,

Can you share a sample instruction (click this -> click that) or link on how to reach a place on the website where we can see a listing under the IPC code?

<digressing>
About KritiDev - do you mean KrutiDev?

There are converters available now to convert from legacy ASCII fonts (where a custom font makes A's glyph look like one akshar, B's like another akshar, and so on) to Unicode (where different languages have their own character codes and co-exist).

I found various websites by searching online for "hindi to unicode converter", but there's also this open source collection of HTML files containing JavaScript that I have worked with earlier: https://sites.google.com/site/hindifontconverters/files. It has simple web page files with JavaScript to do the conversions.

A budget document I was working with 5 years back had its own version of a legacy font - I hacked into one of the JavaScript files here, added new mappings and customised my own converter.
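
A minimal sketch of how such a mapping-based converter can work, written in Python rather than the JavaScript used by the converters linked above. The table entries and the `convert` function are hypothetical placeholders for illustration, not a verified KrutiDev mapping; a real converter needs the full table for the specific legacy font and typically also reordering rules for some vowel signs, which are omitted here.

```python
# Hypothetical mapping from legacy-font character sequences to Unicode
# Devanagari. These few entries are illustrative placeholders only, NOT a
# verified table for KrutiDev or any other legacy font.
LEGACY_TO_UNICODE = {
    "[k": "\u0916",  # a two-character legacy sequence -> one Unicode letter
    "d": "\u0915",
    "k": "\u093E",
    "e": "\u092E",
}


def convert(text, table=LEGACY_TO_UNICODE):
    """Replace legacy sequences with Unicode, trying longer keys first."""
    # Longest keys first, so "[k" is matched before the single "k".
    keys = sorted(table, key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(table[key])
                i += len(key)
                break
        else:
            # No mapping for this character: keep it unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)
```

Customising a converter for a different legacy font then amounts to adding or changing entries in the table, which is the kind of change described above.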

Sorry to digress, but just sharing in case the legacy font thing was a blocker for anyone. Also, if someone wants to build a full solution out of this that takes, say, Word docs and converts them to Unicode without losing formatting, and can bring in some resources, let me know. I didn't have the skills to programmatically work with office docs 5 years ago; I do now.

And there was one surprise finding related to this: I've found that legacy fonts survive the journey through PDFs better than Unicode. So if an institution insists on sharing documents as PDF, I'd rather have them stick to their old legacy fonts and use one of these converter tools at my end to get the text out into Unicode.

--
Cheers,
Nikhil VJ
https://nikhilvj.co.in

sanjay singh

Nov 9, 2020, 4:18:42 AM
to data...@googlegroups.com

pmay...@gmail.com

Nov 9, 2020, 9:13:49 PM
to datameet
Hi Nikhil,
Here's a pdf I put together a while ago which gives some indication of the deeply nested nature of the e-courts website (it's very similar to what Sanjay has already posted, but dives a bit deeper). If one opens a case in a new window, the original search is still available to return to, but otherwise, each case entails a fresh search.
And yes, I did mean KrutiDev. I found a very useful site which converts to Unicode: https://www.fontconverter.in/hindi.php?q=Krutidev-to-Unicode
best wishes,
Peter
Jabalpur rape cases - extract example.pdf

sanjay singh

Nov 9, 2020, 11:30:53 PM
to data...@googlegroups.com
Hi,
And as Peter said, it looks difficult to scrape. There’s a recaptcha.
Regards,
Sanjay 

Lovish Sciences

Nov 10, 2020, 11:13:46 AM
to data...@googlegroups.com
Hi,

Would it be possible to scrape data by using this link? Instead of scraping through the Court Complex option, can we do it from the Court Establishment option?



--
Thanks & Regards,
Lovish

Rahul Gupta

Nov 10, 2020, 11:19:39 AM
to data...@googlegroups.com
Hi Lovish,

The link you are sharing can be worked with using Selenium, I believe. One complex challenge you'd face is decoding the captcha every time and sending it back correctly.

The way I solved it for one of my work projects was to download the captcha image every time, pre-process it, and run OCR on it to get a result. I got to a level where 9 out of 10 requests worked perfectly. Another option we tried was to go the deep learning route and train a model to recognize and classify the letters, which took longer but still worked well enough.

P.S. The form can easily be filled with Selenium. The captcha and the next steps might be another challenge.

Thanks & Regards-
Rahul Gupta

Lovish Sciences

Nov 12, 2020, 1:59:53 PM
to data...@googlegroups.com

nikh...@gmail.com

Nov 12, 2020, 2:59:28 PM
to datameet
Hi,

The purpose of a captcha is to keep out automated scraping.
By defeating that purpose, we trigger a nuclear arms race which will end in one of these outcomes:
- The institution will (at great cost) find a way to make scraping effectively impossible.
- The institution will shut down its public information portal and tell people to use offline methods instead.
- The institution will push the government to hunt down and prosecute those who scrape it, and will "make an example" by destroying some people's lives.

Just sharing the logical conclusions for your kind consideration.

In one scraping activity in the past I followed a hybrid approach: through automated keystrokes, I automated everything up to the captcha stage, typed in the captcha manually, and then started off another set of keystrokes. It sped up the process a lot.
I was using "AutoHotKey" on Windows years ago; there have been a lot of innovations since then in this kind of automation. This approach is also subject to its own arms race of course, but with less chance of going nuclear, because with this you're not crashing the institution's servers.
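
A minimal sketch of that hybrid flow in Python, under stated assumptions: `run_batch`, `prepare`, `submit` and `ask_human` are hypothetical names introduced here, not part of AutoHotKey or of the e-courts site. The automated steps are passed in as functions, a person types each captcha, and a pause between cases keeps the load on the server low.

```python
import time


def run_batch(case_ids, prepare, submit, ask_human=input,
              delay_seconds=5.0, sleep=time.sleep):
    """Process cases one by one with a human solving each captcha.

    prepare(case_id)        -> automated steps up to the captcha; returns
                               the prompt to show the person (hypothetical)
    ask_human(prompt)       -> the captcha text typed by the person
    submit(case_id, answer) -> automated steps after the captcha; returns
                               whatever was collected (hypothetical)
    """
    results = []
    for n, case_id in enumerate(case_ids):
        if n > 0:
            # Pause between cases so the server is not flooded.
            sleep(delay_seconds)
        prompt = prepare(case_id)
        answer = ask_human(prompt).strip()
        results.append(submit(case_id, answer))
    return results
```

The captcha itself is never bypassed in this arrangement; only the repetitive steps before and after it are automated, and the delay is a deliberate choice to stay well below the rate a fully automated scraper would reach.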


Regards
Nikhil VJ

Praachi Misra

Nov 13, 2020, 12:08:20 AM
to datameet
Is there any way to initiate a more widespread discussion on opening judicial data?

Some context on the use of captcha with respect to judicial data in India.

Among common law countries (US, EU, Australia, UK), India is unique in using a captcha.

Some reasons I attribute for this are (in random order):
  1. Prior well developed relationships with digest publishers
  2. Inability to understand the rationale behind open judicial data
  3. Security measure (not certain about this, as other sites do not need it)
  4. Secrecy over existing court practices
Whatever might be the cause, there are very significant real world consequences to the sub-par working of Indian courts, extending beyond the litigators.

The Economic Survey for year 2018-19 carried a full chapter on the drag that the court system is. (Also see)

Praachi



Apoorv Anand

Nov 13, 2020, 9:03:54 AM
to datameet
Hey Lovish,

I work with CivicDataLab and we have been assisting a few NGOs in working with data from eCourts. It is definitely possible. Can you share your requirements (acts/sections/districts/type of cases/variables required, etc.) with me over email, if possible, so we can take this conversation forward? The data itself is sensitive in nature, especially if we're dealing with rape cases, so we'll have to handle it in a responsible manner.