On 6 Apr., 15:31, Ian Soboroff <isobor...@gmail.com> wrote:
> I can't answer directly since all our software scales to this data size
> pretty well. We use Hadoop and Lucene.
>
> You might consider operating on the data in a streaming fashion -- first
> sampling to train up a classifier then passing over the data. Another
> approach is to consider a "realtime" architecture that only looks at a fixed
> sliding window over the data.
>
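A minimal sketch of the two streaming ideas quoted above, assuming Python and only the standard library. The names reservoir_sample and sliding_windows are hypothetical and are not taken from the Hadoop/Lucene setup mentioned in the quote. The classifier itself is left out: the sample returned by the first function is what one would train on before making the full pass.

```python
import random
from collections import deque
from typing import Iterable, Iterator, List, Tuple, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, seed: int = 0) -> List[T]:
    """Draw a uniform random sample of up to k items in one pass (hypothetical helper).

    Memory use is O(k) regardless of stream length, so the full data set
    never has to fit in memory.
    """
    rng = random.Random(seed)
    sample: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Keep the new item with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                sample[j] = item
    return sample


def sliding_windows(stream: Iterable[T], size: int) -> Iterator[Tuple[T, ...]]:
    """Yield each fixed-size window over the stream (hypothetical helper).

    Only the most recent `size` items are held at any time.
    """
    window = deque(maxlen=size)
    for item in stream:
        window.append(item)  # the oldest item is dropped automatically
        if len(window) == size:
            yield tuple(window)
```

Usage would look roughly like this: call reservoir_sample(items, 10000) once to get a training set, train on it, then iterate over the items again (or over sliding_windows(items, n) for the "realtime" variant) and classify as you go. The stream has to be re-readable, for example a file that is reopened, since the sampling pass consumes it.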
> You might also consider segmenting the data, say to only look at English
> items or at News providers.
>
> Ian
>
> On Wed, Apr 6, 2011 at 5:02 AM, Cosmin Cabulea