Systematic approach to analyze the data

Cosmin Cabulea

Apr 6, 2011, 5:02:45 AM
to icwsm-data
Hi,

I would like to know if you have a systematic approach to analyzing the
data. Two terabytes of data are not easy to handle. How would you describe
your first steps in analyzing the data? Which tools are you using? Are
you using a cloud service to handle data of this size?

Looking forward to your replies.

Thanks,
Cosmin

Ian Soboroff

Apr 6, 2011, 9:31:52 AM
to icwsm...@googlegroups.com, Cosmin Cabulea
I can't answer directly since all our software scales to this data size pretty well.  We use Hadoop and Lucene.

You might consider operating on the data in a streaming fashion -- first sampling to train a classifier, then making a single pass over the full data.  Another approach is to consider a "realtime" architecture that only looks at a fixed sliding window over the data.
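
As a rough illustration of the two ideas above, here is a minimal Python sketch. It assumes the data can be read as an iterator of items; the names reservoir_sample and sliding_windows are hypothetical helpers, not part of any tool mentioned in this thread. Reservoir sampling is just one way to draw the training sample without loading everything into memory.

```python
import random
from collections import deque
from typing import Iterable, Iterator, List, Tuple, TypeVar

T = TypeVar("T")


def reservoir_sample(items: Iterable[T], k: int, seed: int = 0) -> List[T]:
    """Keep a uniform random sample of up to k items from a stream of unknown length."""
    rng = random.Random(seed)
    sample: List[T] = []
    for i, item in enumerate(items):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Item i replaces a reservoir slot with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = item
    return sample


def sliding_windows(items: Iterable[T], size: int) -> Iterator[Tuple[T, ...]]:
    """Yield every full window of the last `size` items; memory stays O(size)."""
    window: deque = deque(maxlen=size)
    for item in items:
        window.append(item)
        if len(window) == size:
            yield tuple(window)
```

The sample could be used to train a classifier offline, after which a second streaming pass applies it item by item. Both functions touch each item once and never hold more than k (or size) items in memory.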

You might also consider segmenting the data, say to only look at English items or at News providers.
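
Segmenting can also be done as a streaming filter. The sketch below assumes each item has already been parsed into a dict; the field names "lang" and "source_type" and the value "news" are placeholders for illustration, not the dataset's actual schema.

```python
from typing import Dict, Iterable, Iterator

Item = Dict[str, str]


def segment(items: Iterable[Item], lang: str = "en", source_type: str = "") -> Iterator[Item]:
    """Lazily yield only the items matching the language and, optionally, the source type."""
    for item in items:
        if item.get("lang") != lang:
            continue
        # An empty source_type means "do not filter on source".
        if source_type and item.get("source_type") != source_type:
            continue
        yield item
```

For example, segment(items, lang="en", source_type="news") would pass through only English news items, and the result can be fed straight into a later processing step without materializing the full dataset.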

Ian




Cosmin Cabulea

Apr 7, 2011, 5:15:43 AM
to icwsm-data
I think I have to upgrade my software to work with this data size.
Thanks, Ian.
