Filtering whole pages with ruby streaming


Christian Becker

Jan 11, 2015, 9:59:37 AM
to common...@googlegroups.com
Hi There,

I want to do something that I thought would be simple: search the Common Crawl corpus for keywords and write each page to the filesystem when the keyword filter matches. I want to do this with Ruby and Hadoop streaming on EMR.
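A minimal sketch of what such a keyword-filtering map script could look like. It assumes the input arrives as one page per line in the form "url<TAB>text", which is an assumption for illustration and not the actual Common Crawl file format. The keyword list and the names page_matches?, format_record and run_mapper are hypothetical.

```ruby
# Hypothetical keyword list; replace with the real search terms.
KEYWORDS = %w[hadoop crawl].freeze

# Escape sequences that keep a whole page on one output line.
ESCAPES = { "\\" => "\\\\", "\t" => "\\t", "\n" => "\\n", "\r" => "\\r" }.freeze

# True when the text contains any of the keywords (case-insensitive).
def page_matches?(text, keywords = KEYWORDS)
  lowered = text.downcase
  keywords.any? { |keyword| lowered.include?(keyword.downcase) }
end

# Builds one streaming record: the URL as key, the escaped page as value.
def format_record(url, body)
  escaped = body.gsub(/[\\\t\n\r]/) { |char| ESCAPES.fetch(char) }
  "#{url}\t#{escaped}"
end

# Assumed input: one page per line as "url<TAB>text".
# Lines without a tab are skipped; matching pages are written to output.
def run_mapper(input, output, keywords = KEYWORDS)
  input.each_line do |line|
    url, body = line.chomp.split("\t", 2)
    next if body.nil?
    output.puts(format_record(url, body)) if page_matches?(body, keywords)
  end
end
```

In a streaming job the script would end with a call such as run_mapper($stdin, $stdout), since Hadoop streaming passes records to the mapper on standard input and collects whatever it prints to standard output. This sketch only emits the matching pages as records; it does not write separate files.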

My problem is that I don't know how to write complete files to the filesystem. Is there a way to do this with Hadoop directly, or do I need to write each file to S3 from the map script?

Best regards
Pascal