What's the best way to collect a large number of small files?


Tie hky

Oct 2, 2014, 10:33:36 AM
to web-data...@googlegroups.com
The Extraction framework generates a large number of small files, so we need to collect them.
As far as I know, there are different ways to do this:
1. A PIG script
2. S3DistCP
3. Java code, like what the "retrivedata" command in the Extraction framework does for RDF data.

Are there other ways I have missed? Which is the best?
Would you please give some suggestions? Thanks.

Regards,
Tie.


Robert Meusel

Oct 2, 2014, 11:36:58 AM
to web-data...@googlegroups.com
Hi Tie,

This depends on your use case. We used PIG to load, aggregate, and sort/index the data (which needs large amounts of RAM, so Hadoop was a good fit).
S3DistCP simply collects the data without doing any aggregation; it is just a way to download all the files to a cluster in parallel.
Java code can do anything you can think of that is manageable with the resources you have on the machine.
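The "collect" step that the framework's Java code performs is essentially concatenating many small part files into one output file. A minimal sketch of that idea (in Python rather than Java for brevity; the directory layout and file names here are hypothetical, not the framework's own):

```python
import os
import tempfile

def collect_parts(part_dir, out_path):
    """Concatenate every small part file in part_dir into one output file.
    Sorting the names keeps the merge order deterministic."""
    with open(out_path, "wb") as out:
        for name in sorted(os.listdir(part_dir)):
            with open(os.path.join(part_dir, name), "rb") as part:
                out.write(part.read())

# demo: three hypothetical part files, merged into a separate directory
src = tempfile.mkdtemp()
for i in range(3):
    with open(os.path.join(src, f"part-{i:05d}"), "wb") as f:
        f.write(f"record {i}\n".encode())
merged = os.path.join(tempfile.mkdtemp(), "merged")
collect_parts(src, merged)
with open(merged, "rb") as f:
    merged_bytes = f.read()
```

On a single machine this is all "collecting" amounts to; PIG/Hadoop only becomes necessary once the aggregation no longer fits in one machine's RAM.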

Without knowing the real case - I cannot really help.

Cheers,
Robert

Tie hky

Oct 2, 2014, 11:20:24 PM
to web-data...@googlegroups.com
Thanks Robert.

As I mentioned, I will collect some info from each .js file. I will probably use the Extraction Framework to do this, so it will generate one output file per ARC/WARC file.
When that is done, I need to collect the results.
I think Java code is the best option? What are your thoughts?

Tie hky

Oct 2, 2014, 11:24:58 PM
to web-data...@googlegroups.com
The Extraction Framework generates a file for each ARC/WARC file to store the collected info, and SimpleDB is used to store statistics and error info.
Is it possible to store the data collected from the ARC/WARC files in SimpleDB instead?

For example, I want to collect some info from each .js file; would it be OK to store that info in SimpleDB rather than in a file?

Robert Meusel

Oct 5, 2014, 5:16:28 AM
to web-data...@googlegroups.com
So you simply want to collect .js files. Then I would first parse some of the WAT files to get an idea of how many .js files there are. WAT files are far smaller, so this will be fast. Once you know how many files to expect, think about a cheap way to get the actual files from the WARC files: either parse all the WAT files first to build a list, or parse the WARC files directly. To collect the data you can use Java, and you can include some logic to aggregate your data or push it to a database.
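The WAT pre-scan suggested above boils down to counting how many record URLs end in ".js". A rough sketch (Python; it assumes the URLs have already been pulled out of the WAT metadata, which in reality is JSON wrapped in a WARC container, so this is only the counting step):

```python
from urllib.parse import urlparse

def count_js(urls):
    """Count URLs whose path ends in .js — a cheap estimate of how many
    JavaScript records the corresponding WARC files hold."""
    return sum(1 for u in urls if urlparse(u).path.lower().endswith(".js"))

# hypothetical sample of URLs taken from WAT metadata
sample = [
    "http://example.com/app.js",
    "http://example.com/index.html",
    "http://example.com/lib.JS?v=2",  # urlparse drops the query string
]
js_count = count_js(sample)
```

Checking the parsed path (rather than the raw URL string) avoids miscounting URLs where ".js" only appears in the query string.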

Robert Meusel

Oct 5, 2014, 5:17:22 AM
to web-data...@googlegroups.com
Yes, it is possible, but I would not recommend it: it is expensive, and you would need to change some code.

Tie hky

Oct 5, 2014, 8:05:59 AM
to web-data...@googlegroups.com
The percentage of .js files is less than 3%.

The problem is that I cannot find a cheap way to get the actual files out of the ARC/WARC files, because there is no way to seek into an ARC/WARC file and read a single record. Even if I had the info for all the .js files, I would still need to read every ARC/WARC file and scan all its records from beginning to end. What is your suggestion?

Tie hky

Oct 5, 2014, 8:07:06 AM
to web-data...@googlegroups.com
I am not familiar with SimpleDB. Would you please explain why it is expensive? Thanks.

Robert Meusel

Oct 9, 2014, 10:28:03 AM
to web-data...@googlegroups.com
Have a look here: http://aws.amazon.com/simpledb/

You can also find the pricing for SimpleDB here: http://aws.amazon.com/simpledb/pricing/

Also remember that you are always writing to a database, which has its pros and cons.

Robert Meusel

Oct 9, 2014, 10:29:55 AM
to web-data...@googlegroups.com
3% means that you will find at least one .js file in almost every archive, so you have to parse the ARC/WARC files; I don't see any other way. But since data transfer between S3 and EC2 is free, you simply need to download each (W)ARC file to EC2, check each record within the file to see whether it is a .js file, and parse only those that are. That should be really fast.
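The download-and-filter loop described above can be sketched as a streaming WARC reader that keeps only the .js records. This is a deliberately simplified parser over an uncompressed, well-formed file (real Common Crawl WARCs are gzipped, and a proper WARC library would be safer in practice); the two-record archive below is hand-built for illustration:

```python
import io

def iter_warc_records(stream):
    """Yield (headers, body) for each record in a simplified WARC stream.
    Assumes uncompressed input and well-formed records."""
    while True:
        line = stream.readline()
        if not line:
            return                      # end of stream
        if not line.startswith(b"WARC/"):
            continue                    # skip blank-line record separators
        headers = {}
        while True:
            raw = stream.readline().rstrip(b"\r\n")
            if not raw:
                break                   # blank line ends the header block
            key, _, value = raw.partition(b":")
            headers[key.decode().lower()] = value.strip().decode()
        body = stream.read(int(headers["content-length"]))
        yield headers, body

def js_records(stream):
    """Keep only the records whose target URI ends in .js."""
    for headers, body in iter_warc_records(stream):
        uri = headers.get("warc-target-uri", "")
        if uri.lower().endswith(".js"):
            yield uri, body

# demo on a tiny hand-built two-record "WARC"
raw = (b"WARC/1.0\r\n"
       b"WARC-Target-URI: http://example.com/app.js\r\n"
       b"Content-Length: 5\r\n"
       b"\r\n"
       b"x=1;\n"
       b"\r\n\r\n"
       b"WARC/1.0\r\n"
       b"WARC-Target-URI: http://example.com/index.html\r\n"
       b"Content-Length: 4\r\n"
       b"\r\n"
       b"html"
       b"\r\n\r\n")
hits = list(js_records(io.BytesIO(raw)))
```

Because the reader is a generator, each archive streams through in one pass and only the matching 3% of record bodies ever need to be held or parsed.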

Tie hky

Oct 10, 2014, 1:51:55 AM
to web-data...@googlegroups.com
I have two proposed approaches:

Option 1:
  1. Scan all ARC/WARC files and fetch all .js files.
  2. Parse the .js files.

Option 2:
  1. Collect all URLs for .js files and the related info, including ARC/WARC file name, offset, length, etc. (This can be done offline at no cost.)
  2. Partially fetch the .js files from the ARC/WARC files using those offsets.
  3. Parse the .js files.

Option 1 needs to read each whole ARC/WARC file, while Option 2 only reads a small part of each:

Option 1                                Option 2
1. Load the whole ARC/WARC file.        1. Load only the .js records.
2. Scan all records one by one.         2. Parse the .js files.
3. Parse the .js files.

I am not sure what percentage of the time in Option 1 is spent loading the whole ARC/WARC file and scanning the records one by one (even though 97% of the records are just skipped).
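The partial fetch in Option 2 is just a seek-and-read given a record's (offset, length). A minimal sketch using a local file as a stand-in for the archive (the stand-in archive and its record boundaries are invented for the demo; against S3, the equivalent is a GET with a `Range: bytes=offset-(offset+length-1)` header instead of a local seek):

```python
import os
import tempfile

def fetch_record(path, offset, length):
    """Read a single record given its byte offset and length — the
    partial-fetch idea behind Option 2."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)

# demo: a stand-in "archive" holding three fixed-size 5-byte records
archive = os.path.join(tempfile.mkdtemp(), "archive")
with open(archive, "wb") as f:
    f.write(b"AAAAABBBBBCCCCC")
record = fetch_record(archive, 5, 5)   # second record
```

Whether this beats Option 1 depends on how the archives are compressed: if each record is its own gzip member, a ranged read of the member can be decompressed on its own; with a single compressed stream, the file must be decompressed from the start anyway.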
