List of Valid Segment IDs


Chris Stephens

Aug 14, 2012, 2:39:55 PM
to common...@googlegroups.com
Hello Everyone,

Based on feedback from people on this list, we are planning to update our ARC files to add the detected charset and to adjust the ARC header content-length values.  Before we can do this update, we need to send out some information about a system for marking segments as active/inactive, or valid/invalid.

As many of you are aware, Amazon's S3 storage service does not provide an API to efficiently rename object keys or "move" objects between key prefixes.  Under a traditional filesystem, if we had to update a segment, we might generate new data for that segment and then swap the old folder for the new one.  Under S3, this kind of operation would need to be done one file at a time, and would involve literally copying each file - not just changing the "folder" (or key prefix).

To give us a way to distinguish between usable segments and segments that aren't ready, we'd like to ask everyone to read this file before each job:

  s3n://aws-publicdatasets/common-crawl/parse-output/valid_segments.txt

This file will contain a list of valid, active segment IDs from which you can safely read data.  If any segment does not appear on this list, it should be considered inactive - not ready for use.

For JAR users, you can use the valid segments list with the following code snippet:

  // Read the list of valid segment IDs from S3 and add each segment's ARC
  // files as an input path for the job.
  // (Uses org.apache.hadoop.fs.FileSystem/Path, java.net.URI, and the java.io readers.)
  String segmentListFile = "s3n://aws-publicdatasets/common-crawl/parse-output/valid_segments.txt";

  FileSystem fs = FileSystem.get(new URI(segmentListFile), job);
  BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(new Path(segmentListFile))));

  String segmentId;

  while ((segmentId = reader.readLine()) != null) {
    String inputPath = "s3n://aws-publicdatasets/common-crawl/parse-output/segment/" + segmentId + "/*.arc.gz";
    FileInputFormat.addInputPath(job, new Path(inputPath));
  }

  reader.close();

For streaming users, you may need to build a list of "-input" options or run your job one segment at a time:

  hadoop fs -get s3n://aws-publicdatasets/common-crawl/parse-output/valid_segments.txt valid_segments.txt
  xargs -a valid_segments.txt -I{} hadoop jar ... -input "s3n://aws-publicdatasets/common-crawl/parse-output/segment/{}/*.arc.gz"


If one day Amazon S3 provides an efficient MOVE or key RENAME command, we can stop using the "valid_segments" list.

If anyone has any suggestions on an easier approach, we're all ears (and eyes?  and fingers?).

Looking forward to hearing what you think ...

- Chris

Note:  As of right now, *all* segments are "valid".  So any analysis done on the existing corpus is valid without checking this file.

Amine MOUHOUB

Aug 15, 2012, 7:05:16 AM
to common...@googlegroups.com
Hi Chris,

I think the idea of a valid_segments list is an intuitive approach and easy to implement.
In fact, it's what I have been using since the release of the new dataset, back when there were still some invalid segments: I put together a list in a text file and included it in my jar to be read as a resource. The only long-term problem was keeping that file up to date, so now that you have published a single public list, I'm very happy to read it directly from its one canonical location.
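
Roughly, the resource-reading part looked something like this (just a sketch; the class name is a placeholder, and it assumes the list was bundled at the root of the jar as /valid_segments.txt):

  // Sketch: load a segment list bundled in the jar as a classpath resource.
  List<String> validSegments = new ArrayList<String>();
  BufferedReader reader = new BufferedReader(new InputStreamReader(
      MyJob.class.getResourceAsStream("/valid_segments.txt")));
  String segmentId;
  while ((segmentId = reader.readLine()) != null) {
    validSegments.add(segmentId.trim());
  }
  reader.close();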

Thanks for the idea
Best regards
Amine

Mat Kelcey

Aug 15, 2012, 10:55:33 AM
to common...@googlegroups.com
thanks chris!
i always find it most natural to work with manifest files like this,
so i think it's perfect
cheers,
mat

mARK bLOORE

Aug 28, 2012, 3:15:50 PM
to common...@googlegroups.com
will segments ever be removed from the valid list? until now i have
just listed the bucket to build a list of ARC files in a DB which
drives my scripts. i have modded that to skip segments not in the
valid list, but do i also need to remove segments from my DB if they
later drop off the list?
--
mARK bLOORE <mbl...@gmail.com>

Chris Stephens

Aug 28, 2012, 4:16:11 PM
to common...@googlegroups.com
Hi Mark,

Yes, we can see some cases where we would need to remove segments from the valid segments list.  For example, if we need to correct a problem with some ARC files, we'll (a) generate a new segment, (b) copy files to the new segment and correct them, (c) change the entry in the "valid_segments.txt" file from the old segment to the new segment, (d) eventually, delete the old segment.

We'll try to always notify the list when we do this - so perhaps you can then reload your database of files (or schedule a job every couple of days to do so).

- Chris

mARK bLOORE

Sep 11, 2012, 3:57:57 PM
to common...@googlegroups.com
are there really only 56 valid segments right now? that's what i see
in common-crawl/parse-output/valid_segments.txt.

Ahad Rana

Sep 11, 2012, 4:02:09 PM
to common...@googlegroups.com
Yes Mark,

There are 56 valid segments as of now.

Ahad.

mARK bLOORE

Sep 11, 2012, 4:32:45 PM
to common...@googlegroups.com
will segment numbers always increase? that makes it easier to get the
latest ones.


Ahad Rana

Sep 11, 2012, 5:20:27 PM
to common...@googlegroups.com
Hi Mark,

The segment ids always increase in value over time since they are based on the Unix time (in milliseconds). Also, segment id generation is atomic, so there is no chance of collision. 
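
Just to illustrate (a sketch; the id below is made up, not a real segment id):

  // A segment id is simply the Unix time in milliseconds at which the segment
  // was generated, so it can be decoded back into a date.
  long segmentId = 1346823845675L;           // hypothetical example id
  System.out.println(new java.util.Date(segmentId));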

Ahad.

mARK bLOORE

Sep 11, 2012, 6:04:14 PM
to common...@googlegroups.com
i was thinking of the case where segments are replaced due to, eg, bad
arc files. if the files are regenerated, would they go into a new,
higher-numbered segment, even if they were from an older crawl? is
that the unix time of the crawl, or of the arc file creation?

Ahad Rana

Sep 11, 2012, 6:24:59 PM
to common...@googlegroups.com
Hi Mark,

The bad splits get rolled over into a new segment. All the source RAW crawl data is removed from S3 once a split has been successfully processed, so regeneration will not be an option. The timestamp is generated during segment generation, and the segment directory exists and has valid data in it while the map-reduce job generating it is running, but it is not visible in the valid_segments.txt file until the job successfully completes and the segment is 'promoted' to a valid segment.

Ahad.

mARK bLOORE

Sep 11, 2012, 6:39:18 PM
to common...@googlegroups.com
ok, great. that means that the way i update my database of arc files
is valid -- i only look at keys which have a higher value than the
last one in my table.
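
in code it'd be something like this (a sketch only -- the variable names are made up), reusing the valid_segments reader from chris's earlier snippet:

  // sketch: keep only segment ids newer than the last one already in my table
  long lastIngestedId = 1345227356066L;      // hypothetical value, read from the DB
  List<String> newSegments = new ArrayList<String>();
  String segmentId;
  while ((segmentId = reader.readLine()) != null) {
    if (Long.parseLong(segmentId.trim()) > lastIngestedId) {
      newSegments.add(segmentId.trim());     // only these need to be listed and ingested
    }
  }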

thank you.

Robert Meusel

Sep 25, 2012, 10:34:07 AM
to common...@googlegroups.com
Hello everybody, I am currently working through everything you have already discussed about the valid segments, but I am still missing one piece of information: what actually belongs to the 2012 crawl? Is it all of the data in the parse-output/segment folder (on S3), or only the valid segments? I looked into the different folders under parse-output/segment and compared the included data with the timestamp of the parent folder, but they don't match at all. Most crawls are from February, but the timestamps are older. Thanks a lot for a short reply.

Ahad Rana

Sep 25, 2012, 4:48:43 PM
to common...@googlegroups.com
Hi Robert,

Only segments listed in the valid_segments.txt file should be considered valid. There are a bunch of old segments in the parse-output directory that we will be removing shortly. Sorry for the confusion.

Ahad.

Albert Chern

Sep 26, 2012, 1:08:22 AM
to common...@googlegroups.com
Hi Ahad,

When are you guys planning on removing the old segments?  I was planning on processing those since I already have code working for them and there are more pages.

Robert Meusel

Sep 26, 2012, 2:37:41 AM
to common...@googlegroups.com
Thanks Ahad! So to get all available data for the 2012 crawl, we will use the folders/segments listed in the valid_segments.txt file.

Robert

Ahad Rana

Sep 26, 2012, 5:04:09 AM
to common...@googlegroups.com
Hi Albert,

I can hold off deleting the old data for a little bit. When do you think you will be done? I am planning to run a secondary job that will hopefully pick up all or most of the missing data.

Best,

Ahad. 


Albert Chern

Sep 26, 2012, 1:44:54 PM
to common...@googlegroups.com
Hi Ahad,

I think we will finish in another week, give or take a few days, but we could go faster by allocating more machines if needed.

Matt Luker

Sep 26, 2012, 6:17:12 PM
to common...@googlegroups.com
Ahad,

Sorry if this is a stupid question, but how much data is present in the 56 segments?  I.e. approximately how many web pages?

I'm trying to set up a rather large run, and the segments I have tested with seem to have only around 6M metadata records per segment.  Unless I'm doing something really wrong, wouldn't that mean there is metadata for only about 336M records?  I mean, maybe I got some small segments or something, but that doesn't seem like as much data as I would have expected (i.e., all the docs and announcements seem to imply there are closer to 5B).

Again, I've only been at this for a little while, so I'm fully expecting to have missed something--if so, could you please just point me to the docs or relevant threads so I can get on the right page :-)

Thanks again for all your hard work!
Matt

Ahad Rana

Sep 27, 2012, 3:26:43 PM
to common...@googlegroups.com
Hi Matt,

That definitely doesn't sound right. We are using a new (modified) job to process the parse segments, and if a mapper takes too long, we fail it fast and process some of its data via a secondary (longer-running) job. So there is some reduction in the number of records per segment, but according to my records there should be 4,783,380,596 metadata records in the 56 (regenerated) segments, 3,172,895,475 HTML documents, 81,305,186 feed items, etc. On average, there should be 85 million metadata records per segment, with a standard deviation of 25 million records.

Can you give me the segment id for the segment which you believe yielded you only 6 million metadata records?

Thanks,

Ahad.

Matt Luker

Sep 28, 2012, 5:04:46 PM
to common...@googlegroups.com
Ahad,

I think it may be something I was doing wrong.  I'm currently wondering about how I was using globs.  I had been using a general glob of "$BUCKET/$SEGMENT/*metadata*", which would work--but not on some segments.  I switched to just "$BUCKET/$SEGMENT/metadata*", and now things seem to be running better.
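
Concretely, the change was roughly the following (a sketch with placeholder variables; the same glob semantics apply whether the path is passed via -input or added from a jar):

  // Sketch: the glob I had been using vs. the one that now works.
  // "*metadata*" matches any file with "metadata" anywhere in its name;
  // "metadata*" matches only files whose names begin with "metadata".
  FileInputFormat.addInputPath(job, new Path(bucket + "/" + segmentId + "/*metadata*"));   // before
  FileInputFormat.addInputPath(job, new Path(bucket + "/" + segmentId + "/metadata*"));    // after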

Weird, but I highly suspect it's just my own fault (maybe my code or my approach).  I've managed to process almost 10 segments, and I have an average of 94M records, which puts me in range of 5.2B for the whole set of valid segments.  I'll keep a look out for anything suspect, but again, I'm thinking it's just PEBKAC!

Matt

Robert Meusel

Oct 14, 2012, 7:11:07 AM
to common...@googlegroups.com
Hi Ahad,

we extracted the data from the files listed in valid_segments.txt and found some duplicates (the same URL appearing more than once in the crawl) across the different files. According to our first check, around 13% are not unique. Do you know whether these URLs were actually crawled more than once, or whether this is an "aggregation issue", i.e. the same crawl result was written to the files more than once?

Thanks,
Robert

Masumi Shirakawa

Jan 5, 2013, 6:59:00 AM
to common...@googlegroups.com
Hi,

According to the previous posts, there are 56 segments containing approximately 5 billion web pages in the 2012 crawl data.

But I found that valid_segments.txt now contains 177 segment IDs in total; IDs like 1350********* have been added.

What are these IDs? New datasets?

Best regards,
Masumi

Robert Meusel

Mar 15, 2013, 8:35:55 AM
to common...@googlegroups.com
*push*

any news on this? it would be really interesting to know whether this is new data or whether the existing data was split.