Sorting streams of data


Calmera

Jul 3, 2012, 4:41:11 AM
to bigd...@googlegroups.com
Hey Group,

Anyone have an idea how you can sort a stream of data without buffering?

D.

Cedric De Vroey

Jul 3, 2012, 5:09:15 AM
to bigd...@googlegroups.com
As far as I know you can't. You'll always need to keep track of previous records to know where the current one fits in.

--
Cedric De Vroey



Daan Gerits

Jul 3, 2012, 5:50:10 AM
to bigd...@googlegroups.com
I found an article by LinkedIn, but I am still thinking about how you could do something like this, or do it differently.

Wim Van Leuven

Jul 3, 2012, 6:12:26 AM
to bigd...@googlegroups.com
Is it about sorting the complete stream or just filtering out the top-10 heavy hitters for something?
--
Kind regards,

Wim Van Leuven

Daan Gerits

Jul 3, 2012, 6:20:07 AM
to bigd...@googlegroups.com
filtering the complete stream

Matt Casters

Jul 3, 2012, 6:31:10 AM
to bigd...@googlegroups.com
Companies like LinkedIn always think about how they can make something scale because they take massive future growth into account.
There are various ways to make sorting work in the general case but it always involves chopping up the data into pieces.

- External sort: if the data to sort doesn't fit into memory (size M), spool it off to disk (N blocks of size M), stream the N sorted blocks back in, repeatedly see which head row is the lowest (or highest), and pass it on as a sorted stream (see the sketch after this list).
- Parallel sort: same as the external sort but done on a group of servers, each potentially (but hopefully not) doing an external sort itself. (fairly good scaling)
- Time-based: take all the rows coming from a source (say a log file) in any given time-frame (e.g. 1 hour), sort those and pass them along. (incomplete sorting)
- Group-based: like time-based, it only looks at groups of the data, for example all records/rows belonging to a certain customer. (works if you know the streamed rows arrive grouped together)
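
To make the external sort a bit more concrete, here is a minimal sketch (plain Java over lines of text; the file names, the 100,000-line chunk size and the lexicographic ordering are just assumptions for the example). It spills sorted runs to temp files, then streams them back and always passes on the lowest head line:

import java.io.*;
import java.nio.file.*;
import java.util.*;

// Minimal external merge sort over lines of text (Java 8).
public class ExternalSort {

    // Phase 1: read the input in chunks that fit in memory, sort each chunk,
    // and spill it to a temporary "run" file on disk.
    static List<Path> spillRuns(Path input, int maxLines) throws IOException {
        List<Path> runs = new ArrayList<>();
        try (BufferedReader in = Files.newBufferedReader(input)) {
            List<String> chunk = new ArrayList<>();
            String line;
            while ((line = in.readLine()) != null) {
                chunk.add(line);
                if (chunk.size() == maxLines) {
                    runs.add(writeRun(chunk));
                    chunk.clear();
                }
            }
            if (!chunk.isEmpty()) runs.add(writeRun(chunk));
        }
        return runs;
    }

    static Path writeRun(List<String> chunk) throws IOException {
        Collections.sort(chunk);
        Path run = Files.createTempFile("sort-run-", ".txt");
        Files.write(run, chunk);
        return run;
    }

    // Phase 2: stream every run back in and repeatedly emit the lowest head line.
    static void merge(List<Path> runs, Path output) throws IOException {
        BufferedReader[] readers = new BufferedReader[runs.size()];
        String[] heads = new String[runs.size()];
        for (int i = 0; i < runs.size(); i++) {
            readers[i] = Files.newBufferedReader(runs.get(i));
            heads[i] = readers[i].readLine();
        }
        try (BufferedWriter out = Files.newBufferedWriter(output)) {
            while (true) {
                int min = -1;
                for (int i = 0; i < heads.length; i++) {
                    if (heads[i] != null && (min < 0 || heads[i].compareTo(heads[min]) < 0)) {
                        min = i;
                    }
                }
                if (min < 0) break;                   // every run is exhausted
                out.write(heads[min]);
                out.newLine();
                heads[min] = readers[min].readLine(); // advance only that run
            }
        }
        for (BufferedReader r : readers) r.close();
    }

    public static void main(String[] args) throws IOException {
        List<Path> runs = spillRuns(Paths.get(args[0]), 100_000);
        merge(runs, Paths.get(args[1]));
    }
}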

Also, for specific cases you could indeed easily determine a top 10 of records and keep just those in memory (small sketch below).
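
For that top-10 case a bounded min-heap of size 10 is all the memory you need; a small sketch (plain longs stand in for whatever you rank your records on):

import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Keep only the N largest values seen so far in a stream, using a bounded min-heap.
class TopN {
    private final int n;
    private final PriorityQueue<Long> heap = new PriorityQueue<>(); // smallest kept value on top

    TopN(int n) { this.n = n; }

    void offer(long value) {
        if (heap.size() < n) {
            heap.add(value);
        } else if (value > heap.peek()) {
            heap.poll();        // drop the current smallest of the top N
            heap.add(value);
        }
    }

    List<Long> current() {      // unsorted; sort these (at most) N values if order matters
        return new ArrayList<>(heap);
    }
}

Usage would be: create new TopN(10) and call offer() for every record as it streams past; memory stays constant no matter how long the stream is.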

Cheers,
Matt

--
Matt Casters <mcas...@pentaho.org>
Chief Data Integration, Kettle founder, Author of Pentaho Kettle Solutions (Wiley)
Pentaho  -  Powerful Analytics Made Easy

Davy Suvee

Jul 3, 2012, 8:21:41 AM
to bigd...@googlegroups.com
Another solution could be to simply use Cassandra for storing your streams ...

- Just use a column family where you specify a particular order in which the columns (by key) should be sorted.
- You could automatically let a particular column (or key) expire (i.e. be removed from the ordered stream) by setting its Time To Live.

If you want to get the top 100 of a particular stream, you simply query for the first 100 elements of the column family. All the rest is taken care of by Cassandra ...
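
A rough sketch of what that could look like, assuming the later CQL view of things and the DataStax Java driver (the keyspace, table and column names are invented; the clustering order does the sorting and the TTL makes entries fall out of the ordered stream by themselves):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class CassandraTopStream {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("demo");   // keyspace "demo" is assumed to exist

        // One partition per stream; rows stay sorted by score because of the clustering order.
        session.execute("CREATE TABLE IF NOT EXISTS stream_top (" +
                "  stream_id text, score bigint, item text," +
                "  PRIMARY KEY (stream_id, score, item)" +
                ") WITH CLUSTERING ORDER BY (score DESC, item ASC)");

        // Writes expire after an hour, so old entries drop out of the ordered stream.
        session.execute("INSERT INTO stream_top (stream_id, score, item) " +
                "VALUES ('clicks', 42, 'page-123') USING TTL 3600");

        // Top 100: read the first 100 rows of the partition, already in order.
        ResultSet rs = session.execute(
                "SELECT score, item FROM stream_top WHERE stream_id = 'clicks' LIMIT 100");
        for (Row row : rs) {
            System.out.println(row.getLong("score") + " " + row.getString("item"));
        }
        cluster.close();
    }
}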

I've been using Cassandra for storing time-based information and it works well and fast.

Davy
Datablend.be

Geoffrey De Smet

Jul 4, 2012, 2:31:34 AM
to bigd...@googlegroups.com
Use the same trick as a relational database?
When the data is written to the NoSQL database, keep an index on the "column" (= big data field) you're sorting on up to date in the same database.
When the data is read in sorted order, read the index instead, which will point to the records.

If you need to sort on multiple columns (say first age, then firstname), then that index key should be concatenated out of the fields.
And it should also escape the split character inside them: (age.replaceAll("-", "--") + "-" + firstname.replaceAll("-", "--"))
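
A tiny sketch of that idea (an in-memory TreeMap stands in for the index table, and all names are made up; age is zero-padded so plain string ordering works):

import java.util.Map;
import java.util.TreeMap;

// Keep a sorted index next to the data: the index key is the composite sort key,
// the value points at the record. Updated on every write, never sorted on read.
public class SortedIndexSketch {

    // Composite, lexicographically sortable key: escape the separator inside each field.
    static String indexKey(String age, String firstname) {
        return age.replaceAll("-", "--") + "-" + firstname.replaceAll("-", "--");
    }

    public static void main(String[] args) {
        Map<String, String> index = new TreeMap<>();   // index key -> record id

        index.put(indexKey("034", "Ann"), "record-17");
        index.put(indexKey("027", "Bob"), "record-42");
        index.put(indexKey("034", "Al"),  "record-03");

        // Reading "in the sorted way": walk the index and follow the record ids.
        for (Map.Entry<String, String> e : index.entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}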

Cedric De Vroey

Jul 4, 2012, 4:51:54 AM
to bigd...@googlegroups.com
But then you are actually using the database as a sorting buffer, no? :-)
--
Cedric De Vroey



Geoffrey De Smet

Jul 5, 2012, 1:31:27 AM
to bigd...@googlegroups.com
Depends what you call a buffer :) It stores the sort order permanently, so
if you do 1000 reads and 5 writes it will only have sorted on the 5 writes.
And if your data is too big to be loaded into memory, it won't go out of memory either, as the sorting index is built incrementally as the data is written.


