Chris Stephens
Aug 14, 2012, 2:39:55 PM
to common...@googlegroups.com
Hello Everyone,
Based on feedback from people on this list, we are planning to update our ARC files to add the detected charset and to adjust the ARC header content-length values. Before we can do this update, we need to send out some information about a system for marking segments as active/inactive, or valid/invalid.
As many of you are aware, Amazon's S3 storage service does not provide an API to efficiently rename object keys or "move" objects between key prefixes. Under a traditional filesystem, if we had to update a segment, we might generate new data for that segment and then swap the old folder for the new one. Under S3, this kind of operation must be done one file at a time, and it involves literally copying each file - not just changing the "folder" (or key).
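For anyone curious what that per-file operation looks like, here is a minimal sketch using the AWS SDK for Java (the key names below are illustrative placeholders, not real segment paths):

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3Client;
// "Renaming" a single S3 object is a server-side copy plus a delete,
// and this has to be repeated for every file under the segment's key prefix.
AmazonS3 s3 = new AmazonS3Client();
s3.copyObject("aws-publicdatasets", "common-crawl/parse-output/old-segment/SEGMENT_ID/1234.arc.gz",
              "aws-publicdatasets", "common-crawl/parse-output/segment/SEGMENT_ID/1234.arc.gz");
s3.deleteObject("aws-publicdatasets", "common-crawl/parse-output/old-segment/SEGMENT_ID/1234.arc.gz");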
To give us a way to delineate between usable segments and segments that aren't ready, we'd like to ask everyone to read this file before each job:
s3n://aws-publicdatasets/common-crawl/parse-output/valid_segments.txt
This file will contain a list of valid, active segment IDs from which you can safely read data. If any segment does not appear on this list, it should be considered inactive - not ready for use.
For JAR users, you can use the valid segments list with the following code snippet:
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
// Read the list of valid segment IDs and add each segment's ARC files as job input.
String segmentListFile = "s3n://aws-publicdatasets/common-crawl/parse-output/valid_segments.txt";
FileSystem fs = FileSystem.get(new URI(segmentListFile), job);
BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(new Path(segmentListFile))));
String segmentId;
while ((segmentId = reader.readLine()) != null) {
    String inputPath = "s3n://aws-publicdatasets/common-crawl/parse-output/segment/" + segmentId + "/*.arc.gz";
    FileInputFormat.addInputPath(job, new Path(inputPath));
}
reader.close();
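(The snippet assumes a JobConf named "job" has already been created and configured; once the loop has added the input paths, the job is submitted as usual, e.g. via JobClient.runJob(job).)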
For streaming users, you may need to build a list of "-input" options or run your job one segment at a time:
hadoop fs -cat s3n://aws-publicdatasets/common-crawl/parse-output/valid_segments.txt > valid_segments.txt
xargs -a valid_segments.txt -I{} hadoop jar ... -input 's3n://aws-publicdatasets/common-crawl/parse-output/segment/{}/*.arc.gz'
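Note that the -a option is specific to GNU xargs; on systems without it, you can pipe valid_segments.txt into xargs instead.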
If Amazon S3 one day provides an efficient MOVE or key RENAME operation, we can stop using the "valid_segments" list.
If anyone has any suggestions on an easier approach, we're all ears (and eyes? and fingers?).
Looking forward to hearing what you think ...
- Chris
Note: As of right now, *all* segments are "valid", so any analysis done on the existing corpus remains valid even without checking this file.