Accessing data from a 30 GB file in json format


Nibil Ashraf

Jul 1, 2019, 6:07:39 AM
to django...@googlegroups.com
Hey,

I have a file of around 30 GB in JSON format. I need to read the data and write it to a CSV file. When I tried this on my laptop, which has 4 GB of RAM, I got an error. I tried to load the JSON file like this: json_parsed = json.loads(json_data)

Can someone help me with this? How should I do it? If I should use a server, please let me know what specifications I would need.

John Bagiliko

Jul 1, 2019, 6:11:03 AM
to django...@googlegroups.com
What is the error message?

--
You received this message because you are subscribed to the Google Groups "Django users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to django-users...@googlegroups.com.
To post to this group, send email to django...@googlegroups.com.
Visit this group at https://groups.google.com/group/django-users.
To view this discussion on the web visit https://groups.google.com/d/msgid/django-users/CAPBZ7vD6Ai5DOAudytO7QeW1ejUqdqyB5YH3F7aTg4YoXtF-uw%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

John Bagiliko

Jul 1, 2019, 6:11:43 AM
to django...@googlegroups.com
A screenshot may be better

Nibil Ashraf

Jul 1, 2019, 6:22:38 AM
to django...@googlegroups.com
It's a MemoryError. Please find the screenshot attached.

image.png



Brian Mcnabola

Jul 1, 2019, 7:35:29 AM
to Django users

John Bagiliko

Jul 1, 2019, 7:36:00 AM
to django...@googlegroups.com
Try this: 

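import json

# json.load() parses from the file object; it still builds the whole document in memory.
with open('part-oooo.json') as f:
    data = json.load(f)
print(data)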




--
Regards
JOHN BAGILIKO
MSc. Mathematical Sciences (Big Data and Computer Security)
African Institute for Mathematical Sciences (AIMS) | AIMS Senegal

Nibil Ashraf

Jul 1, 2019, 8:03:55 AM
to django...@googlegroups.com
Thanks for the help, but I am still getting the error. If I move to a server, how much RAM would I need to get this done?
image (1).png

John Bagiliko

Jul 1, 2019, 8:14:12 AM
to django...@googlegroups.com
I can see you didn't include the file extension. It may not be strictly necessary, but try adding the extension to the file name and running it again.



Cornelis Poppema

Jul 1, 2019, 9:38:50 AM
to Django users
To be able to traverse the JSON structure you'd normally need the entire structure in memory. For this reason you can't (easily) apply the usual suggestions for iterating over a file efficiently to a JSON file: you can perhaps read the file efficiently, but the parsed structure will still grow in memory. After a quick search I found these packages made for reading large JSON files efficiently: https://github.com/ICRAR/ijson and https://github.com/kashifrazzaqui/json-streamer. https://stackoverflow.com/a/17326199/248891 shows a simple example using ijson.
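
As a rough illustration of the streaming idea, here is a sketch that uses only the standard library rather than ijson or json-streamer. It assumes (the thread does not say) that the file is one top-level array of flat JSON objects. The names iter_array_items and json_array_to_csv, and the column names, are made up for the example.

```python
import csv
import io
import json


def iter_array_items(fp, chunk_size=65536):
    """Yield the objects of one top-level JSON array, one at a time.

    Assumption: every array element is a JSON object. Only the current
    chunk and the current element are held in memory.
    """
    decoder = json.JSONDecoder()
    buf = fp.read(chunk_size).lstrip()
    if not buf.startswith("["):
        raise ValueError("expected a top-level JSON array")
    buf = buf[1:]
    while True:
        # Skip whitespace and element separators (lenient about commas).
        buf = buf.lstrip(" \t\r\n,")
        if not buf:
            more = fp.read(chunk_size)
            if not more:
                raise ValueError("unexpected end of file inside array")
            buf = more
            continue
        if buf[0] == "]":
            return
        if buf[0] != "{":
            raise ValueError("this sketch only handles arrays of objects")
        try:
            item, end = decoder.raw_decode(buf)
        except json.JSONDecodeError:
            # The object is probably cut off at the chunk boundary: read more.
            more = fp.read(chunk_size)
            if not more:
                raise
            buf += more
            continue
        yield item
        buf = buf[end:]


def json_array_to_csv(src, dst, fieldnames):
    """Write the chosen fields of each record to dst; return the row count."""
    writer = csv.DictWriter(dst, fieldnames=fieldnames, extrasaction="ignore")
    writer.writeheader()
    count = 0
    for record in iter_array_items(src):
        writer.writerow(record)
        count += 1
    return count


if __name__ == "__main__":
    # Hypothetical sample data standing in for the real 30 GB file.
    sample = io.StringIO('[{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]')
    out = io.StringIO()
    json_array_to_csv(sample, out, ["id", "name"])
    print(out.getvalue())
```

With a real file, src would be the opened JSON file and dst a file opened with newline="" for the csv module. A malformed element makes this sketch keep reading until the end of the file before it fails, so a dedicated streaming parser is likely the better choice for production use.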

PASCUAL Eric

Jul 1, 2019, 11:07:23 AM
to django...@googlegroups.com
Hi,
> To be able to traverse the JSON structure you'd normally need the entire structure in memory.
Not necessarily; it depends on what is to be done with the data.

The same problem exists with XML, and this is the reason why SAX parsers have been created in addition to DOM ones.
 
If the processing can be done on the fly, implementing a callback-based parser could solve the problem. Maybe have a look at projects such as Naya, UltraJSON and the like; they could be time (and memory 😉) savers.

HTH

Eric
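
If the file turns out to be newline-delimited JSON (one object per line), which is an assumption the thread does not confirm, record-at-a-time handling needs nothing beyond the standard library. The sketch below is a record-level callback, not a true SAX-style event parser. The function process_json_lines and the "country" field are hypothetical.

```python
import io
import json
from collections import Counter


def process_json_lines(fp, on_record):
    """Call on_record(obj) for every non-empty line and return the count."""
    handled = 0
    for line in fp:  # a file object yields one line at a time
        line = line.strip()
        if not line:
            continue
        on_record(json.loads(line))
        handled += 1
    return handled


if __name__ == "__main__":
    # Hypothetical sample data; a real run would pass an opened file instead.
    sample = io.StringIO(
        '{"country": "IN"}\n{"country": "SN"}\n\n{"country": "IN"}\n'
    )
    totals = Counter()
    process_json_lines(sample, lambda rec: totals.update([rec["country"]]))
    print(totals.most_common())
```

Because each record is discarded as soon as the callback returns, memory use depends on the largest single line, not on the size of the file.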


 

PASCUAL Eric

Jul 1, 2019, 11:16:33 AM
to django...@googlegroups.com
Hi again,

My bad, UltraJSON is not really suited to event-based parsing. I confused it with something else.

The Yajl C library (http://lloyd.github.io/yajl/) should be a better match, and it has a Python binding focused on stream-based processing (http://pykler.github.io/yajl-py/), for both parsing and generation.

Eric



Nibil Ashraf

Jul 1, 2019, 11:57:16 AM
to django...@googlegroups.com
Thanks Cornelis!
