2d scatterplot aggregation ideas

indiemaps

May 6, 2011, 11:28:07 PM

to d3-js

Hello! I'm using d3 for some simple 2d scatterplotting and it's
working great! Now that I'm plotting 1000s of points, I'd really like
to aggregate (or bin) the points somehow. Basically I just want to
take a scatterplot and aggregate it to a heat chart or hexbin. In the
former case, I just need to divide the 2d x-y space into a 4x4 or 5x5
(etc.) grid of squares, coloring each according to how many points
fall in that area.

So, this wasn't that hard to do, but I'm wondering if I'm missing some
d3 methods that would make this more efficient, both code and
performance-wise. After determining the number of grid squares I
wanted (4x4), all I did was loop through my scatterplot data to
determine how many fell in each grid square. I then created a 1-
dimensional array containing this data, just listing the number of
points in each grid square, from top to bottom, left to right. For
example:

var data = [ 55, 44, 147, 161, 182, 174, 109, 20, 39, 84, 6, 12, 137,
88, 65, 101 ];
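
The counting loop described above could be sketched roughly as follows. The names `binPoints`, `points`, `n` and `extent` are hypothetical, and the sketch assumes each point has x and y values already in the range [0, extent], with y increasing downward as in SVG:

```javascript
// Count how many points fall in each square of an n-by-n grid.
// Returns a flat array of counts in row-major order (row by row,
// left to right within each row).
function binPoints(points, n, extent) {
  var counts = [];
  for (var k = 0; k < n * n; k++) counts.push(0);
  points.forEach(function(p) {
    // Clamp so a point exactly on the far edge lands in the last square.
    var col = Math.min(n - 1, Math.floor(p.x / extent * n));
    var row = Math.min(n - 1, Math.floor(p.y / extent * n));
    counts[row * n + col]++;
  });
  return counts;
}

var sample = [{x: 10, y: 10}, {x: 90, y: 10}, {x: 95, y: 5}, {x: 100, y: 100}];
var sampleCounts = binPoints(sample, 2, 100);
```

This is a single pass over the points, so the cost grows linearly with the number of points regardless of grid size.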

I then just calculated the max and created a color scheme from white
to red:

var max = d3.max( data );
var color = d3.interpolateRgb("#fff", "#f00");

Then I just had to create grid squares for each array value,
positioning and coloring according to array position and value:

var n = Math.sqrt(data.length); // grid squares per side
var cellSize = size / n; // side length of one square

heatChart.selectAll("rect")
.data(data)
.enter().append("svg:rect")
.attr("x", function(d, i) { return (i % n) * cellSize; })
.attr("y", function(d, i) { return Math.floor(i / n) * cellSize; })
.attr("width", cellSize)
.attr("height", cellSize)
.attr("fill", function(d) { return color(d / max); })
.attr("stroke", "#000")
.attr("stroke-width", 1);

This works fine. I'm just wondering if there's a more automatic way
of going from a scatterplot to such a heat chart. Further, I'm
wondering how I would make the simple adjustment of working from a
nested array, with rows separated into subarrays, like so:

var data =
[
[ 55, 44, 147, 161 ],
[ 182, 174, 109, 20 ],
[ 39, 84, 6, 12 ],
[ 137, 88, 65, 101 ]
];
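
For reference, going from the flat array to this nested form is a small reshaping step. A minimal sketch, where `toRows` is a hypothetical helper and the number of columns is assumed to be known:

```javascript
// Split a flat, row-major array into an array of rows.
function toRows(flat, numCols) {
  var rows = [];
  for (var i = 0; i < flat.length; i += numCols) {
    rows.push(flat.slice(i, i + numCols));
  }
  return rows;
}

var flat = [55, 44, 147, 161, 182, 174, 109, 20, 39, 84, 6, 12, 137, 88, 65, 101];
var nested = toRows(flat, 4);
```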

This seems simple but I'm just unsure of the way to use nested array
data in d3. Anyway, just seeing if anybody out there is doing any
work in d3 with 2d point aggregation.

Thanks in advance.

Zach

Mike Bostock

May 7, 2011, 1:22:58 PM

to d3...@googlegroups.com

Your approach to binning the data by looping over the elements sounds
reasonable. Hexagonal binning would be fun. I'd like to see that!

As for nested arrays, I'd take a look at the calendar and scatterplot
matrix examples:

http://mbostock.github.com/d3/ex/calendar.html
http://mbostock.github.com/d3/ex/splom.html

The calendar example has an array of years so that `vis` is a
selection of svg:svg elements, with one per year. Within each year,
there's one rect per day for that year. Similarly, in the scatterplot
matrix, there's one svg:g element per row, and then one svg:g element
per column for that row (making it a cell). The scatterplot matrix is
made a bit more complicated since I wanted to access the parent data
from the child cells, so I used a `cross` helper to copy the parent
data into the child data.

As a more general example, here's how you might construct a standard
HTML table from your data:

var table = d3.select("body").append("table");

var tr = table.selectAll("tr")
.data(data)
.enter().append("tr");

var td = tr.selectAll("td")
.data(function(d) { return d; }) // #1
.enter().append("td")
.text(function(d) { return d; }); // #2

This assumes that data is a two-dimensional array of numbers. Note
that the identity function, function(d) { return d; }, is used to
dereference each element in the array. In the first case (#1), the
data operator is evaluated on each element in `data`. For the first
row, this returns the array of numbers that we want for the cells:
[55, 44, 147, 161]. For the second row, it returns the second element
in data: [182, 174, 109, 20]. Then, the operators on the td are
evaluated per element. For the first cell in the first row, the text
operator returns 55. For the second cell in the first row, 44. And so
on.

If you want, you can run this whole thing as a single statement, and
make clever use of some JavaScript built-ins for the identity function
(#1 == Object, #2 == String):

d3.select("body").append("table")
.selectAll("tr").data(data).enter().append("tr")
.selectAll("td").data(Object).enter().append("td")
.text(String);
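
To see why those built-ins can stand in for the identity function here, it helps to check what they return in plain JavaScript. This is only an illustration of the language behavior, with hypothetical variable names; it does not use d3 itself:

```javascript
var row = [55, 44, 147, 161];

// Called as a function, Object returns an existing object (such as an
// array) unchanged, so it works as an identity for the row arrays.
var sameArray = Object(row) === row;

// String converts a number to its text form, which is what the cell needs.
var cellText = String(55);

// Caveat: Object wraps primitives, so it is not an identity for numbers.
var wrappedType = typeof Object(55);
```

So `Object` is a reasonable shortcut at #1, where each datum is an array, but would not be a drop-in identity if the datum were a bare number.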

Mike

indiemaps

May 11, 2011, 12:43:10 AM

to d3-js

Hey Mike, thanks a lot for this, tho I do have a question at the
bottom. My data ended up being a 2d array of JS objects
(data[ rowNum ][ colNum ]) of the form:

{
row : rowNum,
col : colNum,
density : 0
}

And I create my heat chart like so:

this.heatChart = this.plot.append("svg:g");

Then modeled after your table example above, I populate the chart like
so:

var heatChartRow = this.heatChart.selectAll( "g" )
.data(cells).enter().append( "svg:g" );

var _this = this;
var heatChartCell = heatChartRow.selectAll( "rect" )
.data(function(d) { return d; } )
.enter().append( "svg:rect" )
.attr("x", function(d,i) { return d.col * (_this.size/numCols); } )
.attr("y", function(d,i) { return d.row * (_this.size/numRows); } )
.attr("width", _this.size/numCols)
.attr("height", _this.size/numRows)
.attr("fill", function(d, i){ return color(d.density / max); })
.attr("stroke", "#000")
.attr("stroke-width", 1);

Boom, heat chart. I guess my questions concern how to update such a
heat chart, using an enter/update/exit strategy similar to what's
going on in the scatterplot (that you helped me with). So (as above)
this works:

var heatChartRow = this.heatChart.selectAll( "g" )
.data(cells).enter().append( "svg:g" );

In the above 'cells' is my 2d array of heat chart cells
(cells[ rowNum ][ colNum ]). I guess I'm just wondering how I'd
update the above with a completely new 2d array of cells?? I'm even
wondering how to split the above into two declarations. Why doesn't
this work?--

var heatChartRow = this.heatChart.selectAll( "g" )
.data(cells);

heatChartRow.enter().append( "svg:g" );

-- before declaring the 'heatChartCell' as above?

OK hopefully this makes sense. I'm really impressed by the d3
framework, just wish I had more time to investigate...

Mike Bostock

May 11, 2011, 2:07:35 AM

to d3...@googlegroups.com

> I guess I'm just wondering how I'd update the above with a
> completely new 2d array of cells?? I'm even wondering how
> to split the above into two declarations.

Sure. Let's start by looking at the simpler 1-dimensional case, say a
ul element with a variable number of child li elements. If we want to
handle the general case where we have a different number of items
across updates, then we'll need to use the `data` operator in
conjunction with the enter, update and exit selections. For example,
maybe the background-color style is a constant and only needs to be
set on enter, but the text content is derived from data and needs to
be set both on enter and update.

Let's start by doing the initial selection and data join:

var li = d3.select("ul").selectAll("li")
.data(data);

Breaking this down:

1. d3.select("ul") - select the (existing) ul element that will be the
parent node for the li elements.

2. selectAll("li") - select all the (existing) child li elements. The
first time this code is run, this selection will be empty. Subsequent
times, it will contain the previously-added elements.

3. data(data) - join the li elements with the specified array of data,
returning the update selection. This is set as the variable `li`.

Note that `li` refers to the updating selection. This means that if we
previously added li elements, the updating selection will contain
those elements joined to the new data. So we can
immediately update those existing elements using the new data. In this
case, that might just be the text content:

// update
li.text(function(d) { return d; });

We may also want to access the entering nodes, if there are more
elements in the data array than exist in the document. We can access
placeholder nodes for these elements using the `enter()` operator, and
then immediately create and append li elements for them using
`append("li")`. From there, we can specify the desired operators.
Generally, this is a superset of what we would do on update, because
it includes both the constants and the data-driven ones:

// enter
li.enter().append("li")
.style("background-color", "red")
.text(function(d) { return d; });

Similarly, we may want to access the exiting surplus nodes and remove them:

// exit
li.exit().remove();
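
As a rough mental model of how the three selections above are divided when no key function is given (so nodes and data are matched by index), here is a plain-JavaScript sketch. `joinByIndex` is a hypothetical helper for illustration, not part of d3, and the strings stand in for DOM nodes:

```javascript
// Model of an index-based data join: which data update existing nodes,
// which data need new nodes, and which nodes are left over.
function joinByIndex(nodes, data) {
  var shared = Math.min(nodes.length, data.length);
  return {
    update: data.slice(0, shared).map(function(d, i) {
      return {node: nodes[i], datum: d};
    }),
    enter: data.slice(shared), // data with no existing node
    exit: nodes.slice(shared)  // nodes with no remaining datum
  };
}

// First run: no li elements exist yet, so everything enters.
var first = joinByIndex([], ["a", "b", "c"]);

// Later run with fewer data: one node updates, two exit.
var second = joinByIndex(["li0", "li1", "li2"], ["x"]);
```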

If you want to get fancy, you can also define separate transitions for
these selections. You can see another example of this in the source of
the chart templates:

https://github.com/mbostock/d3/blob/master/src/chart/bullet.js

So, now if we want to return to the more elaborate nested case, we can
deal with nested enter, update and exit. However, we can simplify by
making some safe assumptions. First, if we are entering new table
rows, they won't yet have any child cells. Second, if we are exiting
table rows, we can just remove the rows and not deal with exiting the
children. This results in something like this:

var tr = table.selectAll("tr")
.data(data);

// enter
tr.enter().append("tr")
.selectAll("td")
.data(function(d) { return d; })
.enter().append("td")
.text(function(d) { return d; });

// update
var td = tr.selectAll("td")
.data(function(d) { return d; });

// update / enter
td.enter().append("td")
.text(function(d) { return d; });

// update / update
td.text(function(d) { return d; });

// update / exit
td.exit().remove();

// exit
tr.exit().remove();

Of course, this is the most general way of structuring your code. You
can reduce some of the code duplication by extracting your functions
and naming them, rather than using strictly anonymous functions. You
can also group your code into JavaScript functions (optionally using
the `call` operator for method chaining, if desired).

Another common case is that you have separate code paths for
initializing your visualization, and then you just want to run
updates. Even simpler if your updates never change the cardinality of
your data, so you don't need to deal with enter & exit on update. In
this case, your initialization code only needs to handle enter:

function enter() {
table.selectAll("tr")
.data(data)
.enter().append("tr")
.selectAll("td")
.data(function(d) { return d; })
.enter().append("td")
.text(function(d) { return d; });
}

And your update code only needs to handle update:

function update() {
table.selectAll("tr")
.data(data)
.selectAll("td")
.data(function(d) { return d; })
.text(function(d) { return d; });
}

You can make the update code even simpler if you just update the
attributes of existing objects in-place. This way, you don't have to
rebind the data:

function update() {
table.selectAll("tr td").text(function(d) { return d.name; });
}
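
The in-place approach depends on the bound data being objects that are mutated rather than replaced, so that each element's existing datum reflects the new values. A minimal sketch for the heat chart case, assuming the {row, col, density} cell objects described earlier in this thread; `recount`, `points` and `extent` are hypothetical names:

```javascript
// Recompute densities on the existing cell objects instead of building
// a new 2d array, so references held elsewhere stay valid.
function recount(cells, points, extent) {
  var numRows = cells.length, numCols = cells[0].length;
  cells.forEach(function(row) {
    row.forEach(function(cell) { cell.density = 0; });
  });
  points.forEach(function(p) {
    var r = Math.min(numRows - 1, Math.floor(p.y / extent * numRows));
    var c = Math.min(numCols - 1, Math.floor(p.x / extent * numCols));
    cells[r][c].density++;
  });
}

var cells = [
  [{row: 0, col: 0, density: 0}, {row: 0, col: 1, density: 0}],
  [{row: 1, col: 0, density: 0}, {row: 1, col: 1, density: 0}]
];
var bound = cells[0][1]; // stands in for the datum an element already holds
recount(cells, [{x: 80, y: 20}, {x: 60, y: 40}], 100);
```

After `recount` runs, `bound` is still the same object as `cells[0][1]` and carries the new density, which is why no rebinding is needed in this style of update.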

Mike

indiemaps

May 13, 2011, 1:29:35 AM

to d3-js

Mike,

This was incredibly helpful. I ended up going with the initialization/
enter + update/update approach you mentioned last since my density
data will change often but the cardinality (5x5, 4x4, 3x3) will only
change occasionally. I'll post a generalized example here soon.

Zach

indiemaps

May 17, 2011, 12:34:18 AM

to d3-js

Hello Mike (and all):

I ended up creating a scatterplot + heat density chart example over on
jsFiddle: http://jsfiddle.net/indiemaps/gzPDU/

I think it's pretty well optimized, but I'm sure I'm missing some
tricks. The scatterplot and associated heat chart are just shown side-
by-side and are live-linked together (that is, hovering over a
scatterplot point highlights the associated heat chart grid square;
and hovering over a grid square highlights all associated points on
the scatterplot). This isn't a very realistic use case, but looks
cool and aided in some debugging/testing.

It's interesting to turn the scatterplot off and go up to 50,000
random points, just to see how fast JS crunches all the numbers and
updates the heat chart. Going up to 50,000 with the scatterplot
visible can freeze up the browser for a bit.

Anyway, hopefully this helps someone out there!

Zach

Mike Bostock

May 17, 2011, 1:04:32 PM

to d3...@googlegroups.com

Hey, that's great! Thanks for sharing. Doing the aggregation into a
heatmap makes a lot of sense for large datasets, as rendering
thousands of elements usually becomes the bottleneck well before the
data processing itself does.

I thought of a couple optimizations… not that you need them, but if
you're curious:

You can improve the performance of interaction by using two different
layers (svg:g elements) in the scatterplot. The background layer
contains your dots with a fill, and the foreground layer contains the
same dots with an optional stroke. That stroke color can be "none" (or
equivalently, null) until you mouseover. This way, you don't remove
and re-append elements on interaction, or change the radius, which
causes more expensive redraws. You can even render the background
layer statically (say, as an image), since it doesn't change on
mouseover. I used that trick first in the Protovis parallel
coordinates example:

http://vis.stanford.edu/protovis/ex/cars.html

Another technique is to provide more explicit binding between related
elements, so you don't have to reselect based on data. For example, on
cell mouseover, you could have each cell's data store the associated
point selection in the scatterplot, so you don't have to reselect. You
can do that using the `each` operator. When you create the
scatterplot, you can store the reference to the circle element in the
data:

circle.each(function(d) {
d.element = this;
});

Then when you create the heatchart, collect those elements into a selection:

rect.each(function(d) {
d.elements = d3.selectAll(d.points.map(function(d) { return d.element; }));
});

Then, your onCellOver looks something like this:

d.elements.attr("r", 4).attr("stroke", "#f00").attr("stroke-width", 3);
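
The snippets above assume each cell's data already has a `points` array listing the points that fall in it. One way that might be set up during binning is sketched here in plain JavaScript; `attachPoints` and `cellFor` are hypothetical names, and strings stand in for the circle DOM elements:

```javascript
// Give every cell a `points` array. `cellFor` is an assumed function
// that maps a point to the cell containing it.
function attachPoints(cells, points, cellFor) {
  cells.forEach(function(cell) { cell.points = []; });
  points.forEach(function(p) { cellFor(p).points.push(p); });
}

// Two cells split at x = 50.
var cells = [{col: 0}, {col: 1}];
var points = [
  {x: 10, element: "circle-a"},
  {x: 70, element: "circle-b"},
  {x: 90, element: "circle-c"}
];
attachPoints(cells, points, function(p) { return cells[p.x < 50 ? 0 : 1]; });

// The array that would be handed to d3.selectAll for the second cell.
var secondCellElements = cells[1].points.map(function(p) { return p.element; });
```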

Lastly, one JavaScript tip: if you declare named functions, rather
than binding anonymous functions to a var, JavaScript "hoists" the
function definitions to the front, so you can call them out of order.
This gives you a little more flexibility in how you structure your
code. For example, this code throws an error because `foo`, although
declared, has not yet been assigned a function when it is called:

var a = foo();
var foo = function() { return 42; };

Whereas this code works fine, because the definition of `foo` is hoisted:

var a = foo();
function foo() { return 42; }

MDC says: "Another unusual thing about variables in JavaScript is that
you can refer to a variable declared later, without getting an
exception. This concept is known as hoisting; variables in JavaScript
are in a sense 'hoisted' or lifted to the top of the function or
statement. However, variables that aren't initialized yet will return
a value of undefined."
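
The difference between the two cases can be checked directly. In this small sketch (function names are hypothetical), the first function catches the error so the outcome can be inspected:

```javascript
function withFunctionExpression() {
  var result;
  try {
    result = foo(); // foo is hoisted as a var, but still undefined here
  } catch (e) {
    result = e.name; // calling undefined throws a TypeError
  }
  var foo = function() { return 42; };
  return result;
}

function withFunctionDeclaration() {
  var result = foo(); // works: the whole function declaration is hoisted
  function foo() { return 42; }
  return result;
}

var expressionResult = withFunctionExpression();   // "TypeError"
var declarationResult = withFunctionDeclaration(); // 42
```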

Mike

rs762

Aug 8, 2013, 4:01:56 PM

to d3...@googlegroups.com

Hi,

I clicked on your jsFiddle link, but can't seem to see your visualization?

Thug

Nov 12, 2013, 6:22:59 AM

to d3...@googlegroups.com

@rs762

To see the visualization, you need to select d3 from the Frameworks & Extensions menu on the left of the jsFiddle display.
