How to handle a rather large data set


tom lurge

May 10, 2013, 1:41:30 PM
to dc-js-us...@googlegroups.com
Hi,

I plan to visualize a rather large data set of network usage (providers, users, different services, bandwidth consumption, etc.) covering a timespan of about 5 years. The raw data is about 70 GB of uncompressed JSON and will be loaded into MongoDB.
I would like the visualization to cut across this data, making it possible to detect dependencies and correlations. dc.js seems to be the ideal candidate for this task.
I know there is no way I can load this amount of data into browser memory. In MongoDB I will aggregate smaller data sets at different scales of precision (for zooming in and out) and maybe also for smaller periods.
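The aggregation idea above can be sketched as follows; this is a plain-JavaScript stand-in for what a MongoDB `$group` stage with `$sum` would compute server-side, and the field names (`ts`, `provider`, `bytes`) are assumptions, not the actual schema:

```javascript
// Collapse raw usage records into day-level buckets per provider, so only the
// aggregate (not 70 GB of raw JSON) ever reaches the browser.
function rollUpByDay(records) {
  const buckets = new Map();
  for (const r of records) {
    const day = r.ts.slice(0, 10);               // "YYYY-MM-DD"
    const key = day + "|" + r.provider;
    const b = buckets.get(key) || { day, provider: r.provider, bytes: 0 };
    b.bytes += r.bytes;
    buckets.set(key, b);
  }
  return [...buckets.values()];
}

const raw = [
  { ts: "2013-05-10T01:00:00Z", provider: "A", bytes: 100 },
  { ts: "2013-05-10T02:00:00Z", provider: "A", bytes: 50 },
  { ts: "2013-05-11T01:00:00Z", provider: "B", bytes: 70 },
];
rollUpByDay(raw);  // three raw records collapse into two day buckets
```

Coarser levels (month, year) for zooming out would be produced the same way with a shorter key prefix.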

What strategies could I use in dc.js to handle as much data as possible?
Does dc.js cache data?
Is it possible to load data on demand?

Cheers,
Thomas

Nick Zhu

May 10, 2013, 7:50:18 PM
to tom lurge, dc-js-us...@googlegroups.com
dc.js actually came out of a project much like the one you describe, with a MongoDB back end and a d3 visualization front end. We did not use crossfilter, though; since the data set was just too big, MongoDB handled all the slice-and-dice as well as the drill-down and roll-up on the server side. If you want to use crossfilter, then some kind of drill-down and roll-up needs to be implemented by the server, since crossfilter only provides slice-and-dice capability.
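The drill-down half of that server-side split can be sketched in a few lines: the browser sends the period the user selected, and the server answers with the next finer buckets restricted to it. This is an illustration only, not code from the actual project; the field names (`ts`, `bytes`) and function name are made up.

```javascript
// Given the period the user clicked (a year "2013" or a month "2013-05"),
// return the next finer time buckets restricted to that parent bucket.
function drillDown(records, parentPeriod) {
  const childLen = parentPeriod.length === 7 ? 10 : 7; // month -> days, year -> months
  const totals = new Map();
  for (const r of records) {
    if (!r.ts.startsWith(parentPeriod)) continue;      // stay inside the parent bucket
    const key = r.ts.slice(0, childLen);
    totals.set(key, (totals.get(key) || 0) + r.bytes);
  }
  return [...totals].map(([period, bytes]) => ({ period, bytes }));
}

const records = [
  { ts: "2013-05-10T01:00:00Z", bytes: 100 },
  { ts: "2013-05-10T02:00:00Z", bytes: 50 },
  { ts: "2013-06-01T00:00:00Z", bytes: 70 },
];
drillDown(records, "2013-05");  // one bucket for 2013-05-10; the June record is excluded
```

Roll-up is the inverse direction: answer the same query at the coarser level, which the client requests when the user zooms back out.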


Nick

Ted Strauss

Oct 29, 2013, 1:53:06 PM
to dc-js-us...@googlegroups.com
Hi Tom,
I have requirements similar to the ones you described here.
Did you make progress on this project?
Did you learn anything about crossfilter.js, dc.js, or Mongo that you could share with the community?
Cheers
Ted

tom lurge

Oct 29, 2013, 3:30:20 PM
to dc-js-us...@googlegroups.com
Hi Ted,

I got distracted by other issues, but dc.js and crossfilter.js will be on my plate again in November. I'll most certainly come back then with more specific questions. So far I could share some experiences with MapReduce and MongoDB, but that's not really in scope for this group.
Oh, and, btw: thanks to Nick for his quick response in May!

Cheers,
Thomas

Mrugank Parikh

Oct 30, 2013, 7:05:13 AM
to dc-js-us...@googlegroups.com
Hi Tom,

We are also working on handling large amounts of data on the server side and duplicating crossfilter's features (for slicing and dicing). We are working on both push (real-time) and pull (near real-time) strategies, for not only JSON but other data formats as well. Can you share your experience with MapReduce and MongoDB in building crossfilter-like functionality on the server side?

tom lurge

Oct 30, 2013, 7:28:58 AM
to dc-js-us...@googlegroups.com
The project I'm working on is called "visionion" and is available on GitHub. The documentation also contains my thinking about the MapReduce strategies I chose. They are governed by limitations in the data we have and by what I think can usefully be visualized from that data.
I can't share much experience right now since implementation of the visualization front end hasn't started yet. You can follow my progress on GitHub but, as I said, I'll most probably come back to this group with more questions...

What's your project? Can you share more details?

Cheers,
Thomas

Mrugank Parikh

Oct 30, 2013, 7:37:51 AM
to tom lurge, dc-js-us...@googlegroups.com
We are coding a spreadsheet, and currently we store the data in MySQL. We are working on providing crossfilter-like functionality for the spreadsheet. Since our intention is to view large amounts of data in a browser-based spreadsheet, we are writing server-side code for slicing and dicing, mostly inspired by the map-reduce strategy adopted in crossfilter.
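For reference, the map-reduce strategy crossfilter adopts is incremental: a group is defined by an add function, a remove function, and an initial value, so aggregates stay current as rows enter and leave the filter without rescanning everything. A minimal sketch of the idea (not crossfilter's actual implementation, and the names are illustrative):

```javascript
// A "group" keeps one running value: add(value, row) folds a row in when it
// enters the current filter, remove(value, row) folds it back out when the
// filter excludes it again.
function makeGroup(add, remove, initial) {
  let value = initial();
  return {
    add(row)    { value = add(value, row); },
    remove(row) { value = remove(value, row); },
    value()     { return value; },
  };
}

// Running sum of bytes, kept current without rescanning all rows.
const sum = makeGroup(
  (v, r) => v + r.bytes,
  (v, r) => v - r.bytes,
  () => 0
);
sum.add({ bytes: 10 });
sum.add({ bytes: 5 });
sum.remove({ bytes: 10 });  // row filtered out: one subtraction, no rescan
sum.value();                // 5
```

The remove function is what makes porting this to a server attractive: un-filtering a row costs one update instead of a full re-aggregation.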

Bertrand Dechoux

Nov 5, 2013, 8:17:33 AM
to dc-js-us...@googlegroups.com
For your information, I built a private solution that
1) creates small OLAP cubes from Hadoop
2) uploads them to mongo
3) uses dc.js to visualize them
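Choosing which cubes to materialize means choosing group-by sets ahead of time. A tiny helper that enumerates the cuboid lattice for a handful of dimensions, from which one would pick the few combinations worth pre-computing in Hadoop (the dimension names here are illustrative, not from the actual project):

```javascript
// Every subset of the dimension list is one possible cuboid (group-by set):
// [] is the grand total, the full set is the finest-grained cube.
function cuboids(dims) {
  return dims.reduce(
    (acc, d) => acc.concat(acc.map(s => s.concat(d))),
    [[]]
  );
}

cuboids(["provider", "service", "day"]);  // 8 group-by sets, from [] up to all three
```

With n dimensions there are 2^n cuboids, which is exactly why materializing all of them up front stops being viable and a selection has to be made.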

Of course, the trick is to know which cubes to materialize up front. But that's not a new concept; it is almost the same as aggregated views for OLAP servers. For those interested in the subject, Avatara (http://engineering.linkedin.com/olap/avatara-olap-web-scale-analytics-products) is a nice read.

The real problem is to identify where the computation should be done: Hadoop versus database/server versus browser. The closer to the end user, the faster the 'animation' can be; but of course, the farther from the end user, the bigger the amount of data that can be processed and manipulated.

Crossfilter is really only useful for doing the computation in the browser.

Bertrand

Pravin Singh

Jul 4, 2014, 6:54:50 AM
to dc-js-us...@googlegroups.com
Hi Tom, 

I am facing very similar problems. Could you please give me more details about your experience, or any solution you found?

I have had this issue for quite a while.

Here are a couple of questions:

Let me know whenever you are free.

