Problem loading the dataset


Wing Wong

May 14, 2021, 12:25:57 PM
to Web Data Commons
Hi 

I would like to replicate the experiments from the WDC Product Corpus for Large-Scale Product Matching.

When I load the JSON file into pandas, it takes a very long time and then crashes.
Is there a better way to load the data?

Thanks
Wengha

kyle...@gmail.com

May 14, 2021, 1:03:20 PM
to Web Data Commons
I used the chunksize argument in read_json and then looped through the chunks. Here's my code:
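In outline it looks like this (the path, chunk size, and column names here are only placeholders; adapt them to your setup and the memory you have):

import pandas as pd

# Read the gzipped, newline-delimited JSON corpus in chunks instead of all at once.
# chunksize makes read_json return an iterator of DataFrames rather than one huge frame.
reader = pd.read_json(
    "offers_corpus_english_v2.json.gz",
    lines=True,            # one JSON object per line
    compression="gzip",
    chunksize=100_000,     # tune to your available memory
)

parts = []
for chunk in reader:
    # Reduce each chunk to what you actually need before keeping it,
    # e.g. a few columns or a filtered subset of rows.
    parts.append(chunk[["id", "cluster_id", "title"]])

offers = pd.concat(parts, ignore_index=True)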

rall...@googlemail.com

May 18, 2021, 10:55:01 AM
to Web Data Commons
Hi Wengha,

You can try using chunking as Kyle suggested. If you do not have enough memory on your machine, you can also consider libraries specifically designed for handling larger-than-memory datasets, such as Dask.
The pandas documentation also has a short guide on scaling to larger datasets, including an example that uses Dask.
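As a rough sketch of the Dask route: gzip files are not splittable, so you would decompress the corpus first (e.g. gunzip -k) and then let Dask partition the newline-delimited JSON. The path, block size, and column name below are only placeholders.

import dask.dataframe as dd

# Dask splits the file into partitions and only keeps a few of them
# in memory at a time, so the corpus never has to fit into RAM at once.
df = dd.read_json(
    "offers_corpus_english_v2.json",   # decompressed corpus file
    lines=True,
    blocksize=256_000_000,             # ~256 MB per partition
)

# Operations are lazy; only the reduced result is materialised here.
offers_per_cluster = df.groupby("cluster_id").size().compute()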

Which JSON file do you have a problem with, exactly?

Cheers,
Ralph

Wing Wong

May 20, 2021, 1:37:21 AM
to web-data...@googlegroups.com
Hi Ralph, 

I have a problem with the version 2 file, offers_corpus_english_v2.json.gz.
I tried the chunk size that Kyle advised, but it still fails.

[attachment: image.png]

Thanks
Wing


rall...@googlemail.com

May 21, 2021, 11:36:51 AM
to Web Data Commons
Hi Wing,

Is it really necessary for you to load this corpus file? All of the training/validation/test sets that were derived from the corpus are available for separate download on the website, and they can easily be handled on a personal machine. That should be enough to reproduce any of the experimental results.
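For example, one of the derived sets can be loaded directly in one go (the file name below is just an illustration; use whichever set you downloaded):

import pandas as pd

# The derived training/validation/test sets are gzipped JSON lines files
# and small enough to read without chunking.
pairs = pd.read_json("computers_train_xlarge.json.gz", lines=True, compression="gzip")
print(pairs.shape)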

Loading that corpus file into memory all at once is not possible with the usual 8-16 GB of RAM a personal laptop/desktop has. So if you really need to work with it, you will have to either (1) get access to a workstation/server with more memory, (2) process the data in chunks as Kyle suggested, or (3) use a solution like Dask, which uses local storage in addition to main memory to process the file.

Cheers,
Ralph