DC.js + MongoDB


Kai Feng Chew

Nov 20, 2013, 3:31:23 PM
to dc-js-us...@googlegroups.com
Any example of visualizing a MongoDB database with dc.js?

Jacob Rideout

Nov 20, 2013, 5:16:02 PM
to dc-js-us...@googlegroups.com
Not generically. There are people using Mongo and dc together, but the usage is specific to their particular applications. Depending on what you want to accomplish, there are different approaches. One simple way is:

You have a server route that proxies some simple data set from Mongo and returns JSON
You have a frontend page that loads that data via d3.json and visualizes it via dc
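
Roughly, a minimal sketch of that approach (assuming Express and the node MongoDB driver; the route, database, and field names are just for illustration):

// server.js: a small Express route that proxies a query out of MongoDB
var express = require("express");
var MongoClient = require("mongodb").MongoClient;

var app = express();
MongoClient.connect("mongodb://localhost:27017/mydb", function(err, db) {
  if (err) throw err;
  app.get("/data.json", function(req, res) {
    // project only the fields the charts need, to keep the payload small
    db.collection("tweets")
      .find({}, {created_at: 1, retweets: 1, _id: 0})
      .toArray(function(err, docs) {
        if (err) return res.status(500).end();
        res.json(docs);
      });
  });
  app.listen(3000);
});

// frontend: load the JSON and hand it to crossfilter/dc
d3.json("/data.json", function(error, data) {
  var ndx = crossfilter(data);
  // ... define dimensions, groups, and dc charts against ndx ...
  dc.renderAll();
});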

Jacob Rideout

Nov 22, 2013, 10:46:44 AM
to dc-js-us...@googlegroups.com, Jacob Rideout
First, DC's niche is "small" data sets that can fit on a browser client and that have multiple dimensions you want to compare interactively. There are many use cases that do not fall into this category. Work with larger data sets can still make use of DC, but it will require some customization of the environment, if not of DC itself.

>Because it seems like I can't go beyond 16,105 rows...

You can probably handle more than that with some optimization. I would certainly remove unused columns and optimize the formats of the others. DC has two areas of resource constraint:

1) Client memory. You can have a data set as large as will fit in client memory. On my laptop, I've had 500MB data sets work fine. But I wouldn't want to distribute something like that widely.

2) The complexity of the DOM. The number of nodes created by the visualization will probably be a more visible bottleneck than the data set size. I've had complex visualizations of 20MB data sets perform worse than simpler visualizations of 100MB data sets.

On #1 above, the limiting factor is the JavaScript memory footprint, not the size of the CSV. If you are using d3.csv to load your data, remember it will convert

h1, h2
r1c1, r1c2
r2c1, r2c2

into [{h1: "r1c1", h2: "r1c2"}, {h1: "r2c1", h2: "r2c2"}]

Notice the redundant h1 and h2 keys. This makes it easier to reference columns in your code, but it does bloat the memory footprint. As an alternative you can use a matrix-style array of arrays, such as [["r1c1","r1c2"],["r2c1","r2c2"]], specifying dimensions like d[0] rather than d.h1, but that approach is much more prone to developer error.
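
For example, with d3 v3 you can do that parsing yourself (file and variable names are illustrative):

d3.text("data.csv", function(error, text) {
  var rows = d3.csv.parseRows(text); // arrays, not keyed objects
  var header = rows.shift();         // peel off the header row
  var ndx = crossfilter(rows);
  // columns are addressed by index now: d[0] is h1, d[1] is h2
  var h1Dim = ndx.dimension(function(d) { return d[0]; });
});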

Another thing to consider is partial aggregation of the raw data. Let's say you have something like this:

timestamp, account, amount
1385133212, 1, 10
1385133222, 2, 14
1385133232, 1, 12

you could aggregate against the smallest timeframe that you will visualize, say by day, and still keep the transaction count:

day, account, amount, count 
1122, 1, 22, 2
1122, 2, 14, 1

Notice I also dropped the year, since it isn't needed here.
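
A sketch of that roll-up in plain JavaScript, before the data ever reaches crossfilter (all names here are made up for illustration):

// raw is assumed to look like: [{timestamp: 1385133212, account: 1, amount: 10}, ...]
var buckets = {};
raw.forEach(function(r) {
  var day = Math.floor(r.timestamp / 86400); // UTC day bucket
  var key = day + "|" + r.account;
  var b = buckets[key] ||
          (buckets[key] = {day: day, account: r.account, amount: 0, count: 0});
  b.amount += r.amount;
  b.count += 1;
});
// one row per (day, account) pair; hand this to crossfilter instead of raw
var aggregated = Object.keys(buckets).map(function(k) { return buckets[k]; });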

Other optimizations include rounding numbers (0.1232423425 could become 0.12), or doing in-memory joins at read time for shared data common to multiple records.

Depending on your data, these kinds of changes can make a big impact on the memory footprint. Also useful are limits on the number of records in the data set. You can have a non-crossfilter filter that dynamically loads or removes data on some dimension (such as time). This adds yet more complexity to the UI, and it is very hard to manage if you also want to visualize aggregates of the data that hasn't been loaded, but it can be necessary in some cases. For instance, you could have a range chart that loads a by-day count summary for a long timeframe (say a year), but only loads the full day-level detail once the zoom window is 30 days or smaller, fetching no more than 30 days at a time; see the sketch below.
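
A rough sketch of that dynamic loading, assuming a hypothetical /detail endpoint and that no filters are active on the detail crossfilter when it is swapped out:

rangeChart.on("filtered", function(chart, filter) {
  if (!filter) return;                        // brush was cleared
  var days = (filter[1] - filter[0]) / 864e5; // brush extent in days
  if (days > 30) return;                      // only fetch detail for small windows
  d3.json("/detail?from=" + (+filter[0]) + "&to=" + (+filter[1]),
    function(error, rows) {
      if (error) return;
      detailNdx.remove(); // with no filters applied, this drops every record
      detailNdx.add(rows);
      dc.redrawAll();
    });
});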

As far as DOM complexity goes, favor multiple visualizations that each look at a small number of aggregates rather than fewer charts with larger aggregate ranges. Square solved this problem in cubism (http://square.github.io/cubism/) by A) aggressively limiting the data on screen, with a fixed display size and removal of old data, and B) using canvas rather than SVG to limit DOM node interactions. You could use technique A with DC today; B is interesting to explore as a future enhancement.

And last, you also need to consider the load time of the data set. Here you are limited by 1) the network latency for the duration of the file transfer, 2) the parse time it takes to transform the data, and 3) the time it takes to perform the initial aggregations.

#3 is a somewhat fixed cost. Optimizations that limit the number of rows and avoid unneeded calculations can help. For example, for averages, only calculate the total and count in the reduce functions, and then do the division in a chart accessor rather than redundantly in the reduce.
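
In crossfilter/dc terms, that pattern looks something like this (chart and field names are illustrative):

var byDay = ndx.dimension(function(d) { return d.day; });
var amountByDay = byDay.group().reduce(
  function(p, v) { p.total += v.amount; p.count++; return p; }, // add
  function(p, v) { p.total -= v.amount; p.count--; return p; }, // remove
  function() { return {total: 0, count: 0}; }                   // initial
);
chart.group(amountByDay)
     .valueAccessor(function(d) {
       // divide once per displayed value, not once per reduce call
       return d.value.count ? d.value.total / d.value.count : 0;
     });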

#1 and #2 have tradeoffs, but generally limiting #1 is going to be the focus, since even on the client, CPU and RAM access are generally much faster than network transfer. That is one reason to prefer TSV or CSV over JSON datasets. The CSV is often more concise, but it then carries a parsing cost. You can limit this cost by parsing as the file is transferred, using something like oboe.js or the approach I've taken here: https://gist.github.com/jrideout/7111894/raw/ed0eeb28c87b572e2b441dfc371036f36c0a3745/index.html
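
With oboe.js, for instance, you can start filling crossfilter before the transfer finishes; a sketch, assuming the response is one top-level JSON array:

oboe("/data.json")
  .node("!.*", function(row) {
    ndx.add([row]);     // feed each row into crossfilter as it arrives
    return oboe.drop;   // let oboe discard its own copy of the node
  })
  .done(function() {
    dc.renderAll();     // draw once the stream is complete
  });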

I should also note that it would be wise to profile things first.

Jacob


On Wednesday, November 20, 2013 10:09:11 PM UTC-5, Kai Feng Chew wrote:
Hi Jacob,

Thanks for that.
Do you know of any limitations of dc.js when visualizing CSV files?
Because it seems like I can't go beyond 16,105 rows...
Maybe I have too many columns... I'm visualizing social data like tweets and Facebook posts...
Any ideas about this?

--
#BetterMalaysia,

Chief Evangelist @ Cawcah

Jeff Friesen

Nov 23, 2013, 1:02:25 PM
to dc-js-us...@googlegroups.com
Jacob, this is a great writeup. Maybe this would be good to put in the wiki?

Jacob Rideout

Nov 24, 2013, 10:38:15 AM
to dc-js-us...@googlegroups.com
Having an Optimization and Performance Tuning section on the wiki sounds like a good idea. Feel free to copy the text of my reply into a new page.

Jeff Friesen

Nov 24, 2013, 1:40:10 PM
to dc-js-us...@googlegroups.com

Ted Strauss

Nov 27, 2013, 11:34:12 AM
to dc-js-us...@googlegroups.com
Great contribution, Jacob!
I am studying your performance tips.

Anupam Mediratta

Feb 17, 2014, 6:22:32 AM
to dc-js-us...@googlegroups.com
Hey Kai,

Did you manage to use MongoDB with dc.js, or to apply the other optimisations mentioned here? And is the code available to look at?

Thanks.

Kafe Chew

Oct 5, 2014, 12:42:18 AM
to dc-js-us...@googlegroups.com

Blair Nilsson

Oct 6, 2014, 10:48:26 PM
to dc-js-us...@googlegroups.com
You CAN, if you wish, hook dc.js up to server-side aggregators. It isn't easy, but it can be done.

That removes the limits on the size of dataset you can deal with.
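
The usual trick is to hand dc a "group-like" object whose .all() returns bins computed on the server, plus a stub dimension that turns filter calls into new queries. A bare-bones sketch (the endpoint and names are invented):

var bins = [];
var serverGroup = {
  all: function() { return bins; } // many dc charts only need .all() from a group
};
var serverDimension = {
  // stub out whichever filter methods your chart actually calls,
  // translating them into fresh server queries instead of local filtering
  filter: function(f) { /* issue a new query here */ },
  filterAll: function() { /* clear the server-side filter */ }
};
d3.json("/aggregate?by=day", function(error, data) {
  bins = data; // expected shape: [{key: ..., value: ...}, ...]
  chart.dimension(serverDimension)
       .group(serverGroup)
       .render();
});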

Also, if you are careful, you can get dc.js + crossfilter alone to handle over 100k rows. It will become sluggish, but not too badly, considering what you are asking it to do.

A lot of it is remembering to throw out stuff you are not using anymore, as soon as you can, in JavaScript.

One example: once you have loaded your dataset into crossfilter, throw the original away! It makes a huge difference.
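
Taken literally, something like this (a sketch; how much it helps depends on what else still references the parsed rows):

d3.csv("data.csv", function(error, rows) {
  var ndx = crossfilter(rows);
  // ... build all dimensions and groups here ...
  rows = null; // drop the last reference so anything crossfilter
               // hasn't retained can be garbage collected
});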

Matt Traynham

Oct 8, 2014, 10:50:23 AM
to dc-js-us...@googlegroups.com
One thing to point out: we (royal we) moved away from Crossfilter not because of the performance it was giving us, but because downloading 100k rows every time you open the client is incredibly slow.

Using a backend solution can be done, but it would be nice if there were a pluggable source framework for handing off data manipulation and translating filters.

Blair Nilsson

Nov 6, 2014, 5:32:07 PM
to dc-js-us...@googlegroups.com
I am building one... it is not as easy a thing as you may think.

Currently, we hand off processing to Elasticsearch, which scales really well, but you have to be very careful with the queries, and the query language is kinda horrid.

The plan is to open-source it when it works well enough that I am not embarrassed when people look at the source.

Bradley Spatz

Nov 21, 2014, 9:48:35 AM
to dc-js-us...@googlegroups.com
>A lot of it is remembering to throw out stuff you are not using anymore, as soon as you can, in JavaScript.

Could you be specific with a few examples here?  I apologize if this is obvious to most.