This note tries to summarize some of our thinking. Please help
us to spot flaws in our analysis and design before we start
committing them to code!
-r
Background
==========
Each input data set will have a "data format" (DF), typically
based on the way the data was collected, obtained, etc. Each
layout will expect to see its own "layout format" (LF).
Somehow, we need to convert the data from DF to LF. The direct
solution is to create a function for each combination of data
and layout formats. Unfortunately, this solution scales very
poorly, requiring DF * LF conversion functions. (Bleah!)
To avoid this, we can define an intermediate format (IF) that
handles all of the expected variations in DF and LF. This lets
us create one function (DF->IF) for each input and one function
(IF->LF) for each layout.
Because the number of DFs and LFs can be expected to grow, this
approach has real scaling advantages. For example, if we have
50 DFs and 15 LFs:
The direct approach requires 750 (50 * 15) filters.
The indirect approach requires 65 (50 + 15) filters.
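In code, the hub-and-spoke scheme is just a pair of lookup tables and one composition function. Here is a minimal sketch in plain JavaScript (all names are invented; the "pairs" reader and "edgeCount" writer are toy examples for illustration, not proposed formats):

```javascript
// Hub-and-spoke conversion: each reader converts one data format (DF)
// to the intermediate format (IF); each writer converts IF to one
// layout format (LF). Adding a new format means writing one function,
// not one per counterpart.
var readers = {};  // DF name -> function(data)  returning IF
var writers = {};  // LF name -> function(graph) returning LF

function convert(data, from, to) {
  return writers[to](readers[from](data));
}

// Toy reader: a list of [from, to] name pairs becomes a nodes + links IF.
readers.pairs = function(pairs) {
  var names = {}, nodes = [], links = [];
  pairs.forEach(function(p) {
    p.forEach(function(name) {
      if (!(name in names)) {
        names[name] = nodes.length;
        nodes.push({ name: name });
      }
    });
    links.push({ source: names[p[0]], target: names[p[1]] });
  });
  return { nodes: nodes, links: links };
};

// Toy writer: a "layout format" that only wants the edge count.
writers.edgeCount = function(graph) {
  return graph.links.length;
};
```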
Use cases
=========
Let's look at some graph representation use cases, based on Mike
Bostock's examples.
flare-imports.json
------------------
flare-imports.json, from Mike's Hive Plot demo, uses a list of
hashes to define a directed graph. Each hash defines a software
module, giving its name, size, and a list of modules it imports.
For example, the module "flare.animate.Transitioner" is imported
by the module "flare.analytics.cluster.AgglomerativeCluster":
nodes = [
  { "name": "flare.analytics.cluster.AgglomerativeCluster",
    "size": 3938,
    "imports": [ "flare.animate.Transitioner", ... ]
  }, ...
];
This format is easy for humans to use (eg, read, edit), but it
could be pretty inefficient in time and space (depending on the
JavaScript implementation and the size of the name strings).
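For concreteness, a DF->IF converter for this list-of-hashes format might look like the sketch below (plain JavaScript, no D3 dependency; the function name and the nodes-plus-links IF are ours, purely for illustration):

```javascript
// Convert the flare-imports format (a list of {name, size, imports}
// hashes) into a nodes + links representation, mapping module names
// to zero-based node indexes. Hypothetical helper, for illustration.
function importsToGraph(modules) {
  var index = {}, links = [];
  modules.forEach(function(m, i) { index[m.name] = i; });
  modules.forEach(function(m, i) {
    (m.imports || []).forEach(function(name) {
      if (name in index) links.push({ source: i, target: index[name] });
    });
  });
  return { nodes: modules, links: links };
}
```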
miserables.json
---------------
Mike has used the Les Miserables data set a few times, eg:
http://mbostock.github.com/protovis/ex/arc.html
http://mbostock.github.com/protovis/ex/miserables.js
miserables.js has some useful comments:
This file contains the weighted network of coappearances of
characters in Victor Hugo's novel "Les Miserables". Nodes
represent characters as indicated by the labels, and edges
connect any pair of characters that appear in the same
chapter of the book. The values on the edges are the number
of such coappearances. The data on coappearances were taken
from D. E. Knuth, The Stanford GraphBase: A Platform for
Combinatorial Computing, Addison-Wesley, Reading, MA (1993).
The group labels were transcribed from "Finding and evaluating
community structure in networks" by M. E. J. Newman and M.
Girvan. [http://arxiv.org/pdf/cond-mat/0308217.pdf]
var miserables = {
  nodes: [ { nodeName: "Myriel", group: 1 }, ... ],
  links: [ { source: 1, target: 0, value: 1 }, ... ]
};
miserables.json, from Mike's "force" example, has basically the
same format. A top-level hash contains a pair of arrays (nodes
and links) which define the graph. The extract below tells us
that Napoleon (source 1) appears once (value 1) with Myriel
(target 0) and that Napoleon and Myriel are in the same
coappearance group (1).
{
  "nodes": [ { "name": "Myriel",   "group": 1 },
             { "name": "Napoleon", "group": 1 }, ... ],
  "links": [ { "source": 1, "target": 0, "value": 1 }, ... ]
}
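Converting index-valued links into node references, as the force layout does at initialization, is a one-pass operation. A sketch (not D3's actual code):

```javascript
// Replace numeric source/target indexes with references to the
// corresponding node objects. Operates in place and tolerates links
// that already hold references. Illustrative sketch only.
function resolveLinks(graph) {
  graph.links.forEach(function(link) {
    if (typeof link.source === "number") link.source = graph.nodes[link.source];
    if (typeof link.target === "number") link.target = graph.nodes[link.target];
  });
  return graph;
}
```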
Discussion
==========
A generalized version of the Flare format might look something like:
{
  "...this_id...": [ [ "...that_id...", ... ], {...meta...} ], ...
}
The Les Miz format eliminates the strings, in favor of index values.
It might therefore have efficiency advantages. A generalized version
of this format might look something like:
{
  "nodes": [ [ "...this_id...", {...meta...} ], ... ],
  "links": [ [ this_ndx, that_ndx, {...meta...} ], ... ]
}
Neither of these formats has any direct support for N-ary graphs.
So, for example, they can't represent statements such as "Rich
drove his Scion to the San Bruno BART station on Saturday".
However, various diagramming techniques use artificial nodes to
resolve this deficiency, eg:
Conceptual Graphs (John Sowa)
Object-Role Modeling (Terry Halpin)
Ok, now for feedback. Are there any obvious problems with either
of these formats? Are there any reasons to prefer one of them (or
some completely different format)?
Inquiring minds need to know. (ducks :-)
--
http://www.cfcl.com/rdm Rich Morin
http://www.cfcl.com/rdm/resume r...@cfcl.com
http://www.cfcl.com/rdm/weblog +1 650-873-7841
Software system design, development, and documentation
Likewise, all of the layouts (perhaps with the exception of the stack
layout, which allows you to override the `out` accessor) have
well-defined output formats. This is useful for documentation purposes
(and understanding), and allows greater code reuse. Flexible input,
strict output.
To allow layouts to understand arbitrary input data, most D3 layouts
provide accessor functions. For example, hierarchy layouts have a
`children` accessor that is used to retrieve the array of child nodes
for each internal node, and a `value` accessor that returns the
quantitative value for leaf nodes.
https://github.com/mbostock/d3/wiki/Hierarchy-Layout
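The accessor pattern can be illustrated without D3 itself. A generic walker that takes `children` and `value` accessors can total leaf values for any input shape (a hypothetical helper in the spirit of the hierarchy layouts, not D3 code):

```javascript
// Generic leaf-value totaler: the caller supplies `children` and
// `value` accessors, so the walker never needs to know the input
// format. Internal nodes recurse; leaves contribute their value.
function totalValue(node, children, value) {
  var kids = children(node);
  if (!kids || !kids.length) return value(node);
  return kids.reduce(function(sum, kid) {
    return sum + totalValue(kid, children, value);
  }, 0);
}
```

The same walker works whether the data calls its arrays "children", "imports", or anything else; only the accessors change.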
Likewise, you have components such as D3's shape generators that are
agnostic about the input format and can be completely customized using
accessors. Consider the arc generator, for example:
https://github.com/mbostock/d3/wiki/SVG-Shapes#wiki-arc
In effect, D3 uses these accessors to perform an implicit map from an
arbitrary user-defined representation to a standard representation. In
the case of layouts, the standard representation can then be decorated
with properties computed by the layout. Or in the case of shape
generators, the standard representation is used to compute attributes
(path data) and never exposed externally.
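A toy generator in that closure-with-getter-setters style shows how the implicit map works (an invented example, not D3 code; it computes a polyline length instead of path data so it runs self-contained):

```javascript
// Minimal generator in the D3 closure style: accessors map arbitrary
// input objects onto a standard internal representation (x, y pairs),
// which is used to compute output and never exposed externally.
function lineLength() {
  var x = function(d) { return d.x; },  // default accessors
      y = function(d) { return d.y; };

  function length(points) {
    var total = 0;
    for (var i = 1; i < points.length; i++) {
      var dx = x(points[i]) - x(points[i - 1]),
          dy = y(points[i]) - y(points[i - 1]);
      total += Math.sqrt(dx * dx + dy * dy);
    }
    return total;
  }

  // Getter-setters, as in d3.svg.line().x(...).y(...).
  length.x = function(f) { if (!arguments.length) return x; x = f; return length; };
  length.y = function(f) { if (!arguments.length) return y; y = f; return length; };
  return length;
}
```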
Even with the ability to override accessors, some assumptions must
still be made regarding the input format. The force layout, for
example, doesn't currently allow you to override the source and target
accessor for link objects—those are required to either be references
to nodes, or zero-based indexes that are converted to node references
upon initialization. That said, force layouts do use accessors for
link strength, link distance, charge strength, etc.
The input assumptions of the force layout are as follows:
* there's an array of nodes
* there's an array of links
* nodes are objects
* links are objects
* links have source and target properties
* source and target are either a node index or reference
That seems to be a smaller set of assumptions than the "generalized
versions" you described. It's still useful to establish conventions,
but I think it's nice if people can distinguish conventions from
requirements.
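A minimal input satisfying all six assumptions, with both endpoint forms shown (plain data, no D3 required):

```javascript
// Smallest input the force layout's assumptions admit: nodes and
// links are arrays of objects; link endpoints may be zero-based
// indexes or direct node references.
var nodes = [{ name: "a" }, { name: "b" }];
var links = [
  { source: 0, target: 1 },              // by index
  { source: nodes[0], target: nodes[1] } // by reference
];
```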
The force layout's output format is:
* node.index
* node.x
* node.y
* node.px
* node.py
* node.fixed
* node.weight
And, in some cases:
* link.source (when converting from index)
* link.target (when converting from index)
Hierarchy layouts have output formats too, such as setting the
`parent` node reference and the `value` for internal nodes. But, it's
nice to try to keep these output formats small, as we'd like to avoid
colliding with other metadata that users ascribe to nodes.
Mike
https://github.com/d3/d3-plugins/tree/master/d3/data.graph
d3.data.graph accepts data in either the matrix (chord layout) or list
of links (force layout) formats. It stores the data internally as a
list of nodes and links. It needs a bit more work before it can handle
state for both the chord and force layout.
Another structure to target with pack/unpack would be this:
mbostock.github.com/d3/data/flare.json
The end goal would be that users can import data in any one of these
data structures and output the correct format for any d3.js graph layout
or visualization.
Filtering and graph traversal would make sense to include as well.
I see what you're getting at now. Yeah, it would be nice if the chord
layout and force layout were more interoperable in terms of the input
representation.
Nice work on the plugin. I wonder if it would be simpler as stateless
conversion methods, though. For example, consider:
d3.graph.nodes = function(matrix) {
  return […]; // array of {index: i} objects, perhaps?
};

d3.graph.links = function(matrix) {
  return […]; // array of {source: i, target: j}, perhaps?
};

d3.graph.matrix = function(nodes, links) {
  return [[…], …]; // two-dimensional array
};
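Filled in, those stateless conversions might read something like this (a sketch of the proposal, hung on a plain object so it runs without D3; nonzero matrix cells become links):

```javascript
// Sketch of the proposed stateless conversions. A nonzero matrix
// cell [i][j] becomes a link carrying its value; nodes are just
// {index: i} placeholders, as suggested above.
var graph = {};

graph.nodes = function(matrix) {
  return matrix.map(function(_, i) { return { index: i }; });
};

graph.links = function(matrix) {
  var links = [];
  matrix.forEach(function(row, i) {
    row.forEach(function(value, j) {
      if (value) links.push({ source: i, target: j, value: value });
    });
  });
  return links;
};
```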
Mike
However, I'm pretty sure that I could come up with input data formats
that would not play nicely with the accessor function approach (eg,
requiring an expensive lookup for each data value).
Also, I don't see any consistent common interface (ie, "well-defined
output format") in the examples I discussed. Lacking such an interface,
several things become more difficult than they might otherwise be, eg:
* If I want to feed the Flare data to the force-directed layout
(or the Les Miz data to the Hive Plot), I'll need different
sets of accessor functions for each data/layout combination.
* I see no easy way to write generalized data filters (eg, to
select sub-graphs) that can work with all combinations of
input data and layouts.
* In the Chord layout that Kai was showing me, the layout data
is stored in a (non-sparse) array. Filtering out nodes and
edges is far from trivial in this format, whereas it could be
trivial in an intermediate format.
In summary, an intermediate format seems to solve several problems.
It's quite possible, however, that there is a cleaner approach (eg,
a mixture of intermediate data formats and accessor functions).
Let's work toward finding such an approach, while satisfying all of
the use cases that we feel are important.
-r
For instance, consider going from one 3x3 matrix to another. What if one group had
been removed and another added? A data model could resolve this by
having the chart listen to remove then add events (in that order) so
charts could update properly:
var graph = d3.data.graph().matrix(my_matrix);
graph.on('add.node', function(entering_nodes) {
  chart.add(entering_nodes);
});
graph.on('remove.node', chart.remove); // no wrapping function needed
There may be a better solution with data-binding. I'm still a bit
fuzzy on how binding data works with layouts. The above pattern is
inspired by Backbone's models, which fire events when data is
modified. So that's my idea for the d3.data namespace: event-firing
stateful models that follow the reusable charts spec. Here are two
other libraries I'm drawing inspiration from:
https://github.com/tinkerpop/gremlin/wiki/Basic-Graph-Traversals
http://substance.io/michael/data-js
Gremlin's graph traversals and filter syntax are particularly appealing
to me. I watched Marko use Gremlin interactively at a meetup last
year, and was stunned by how easily he explored a subset of dbpedia.
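A bare-bones version of such an event-firing model, with the remove-before-add ordering, might look as follows (all names invented; real code would presumably build on d3.dispatch, and since a bare matrix carries no node identity, this sketch takes nodes with ids):

```javascript
// Bare-bones event-firing graph model: setting new nodes diffs them
// by id against the previous set and fires 'remove' before 'add', so
// a listening chart can update even when the total count is unchanged
// (the 3x3 -> 3x3 case above). All names are invented.
function graphModel() {
  var listeners = { add: [], remove: [] },
      current = {};                       // id -> node

  var model = {
    on: function(type, fn) { listeners[type].push(fn); return model; },
    nodes: function(nodes) {
      var next = {};
      nodes.forEach(function(n) { next[n.id] = n; });
      var removed = Object.keys(current).filter(function(id) { return !(id in next); }),
          added   = Object.keys(next).filter(function(id) { return !(id in current); });
      if (removed.length) fire("remove", removed);
      if (added.length)   fire("add", added);
      current = next;
      return model;
    }
  };

  function fire(type, ids) {
    listeners[type].forEach(function(fn) { fn(ids); });
  }
  return model;
}
```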
For now though, pure functions going from the different formats would
be much less complex. I'm not sure about this though:
d3.graph.matrix = function(nodes, links) {
return [[…], …]; // two-dimensional array
};
In the chord layout, chord.matrix expects a matrix. I think any matrix
method should always take/return a matrix.
Not sure I understand. I was intending my example d3.graph.matrix to
be a function which converted from a nodes + links (adjacency list)
representation to a matrix representation. By "two-dimensional array"
I mean a "matrix".
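Concretely, that nodes + links -> matrix conversion might be (illustrative sketch; endpoints may be indexes or node references, and link values default to 1):

```javascript
// Convert an adjacency-list (nodes + links) representation into the
// square matrix the chord layout expects. Not library code.
function toMatrix(nodes, links) {
  var matrix = nodes.map(function() {
    return nodes.map(function() { return 0; });
  });
  links.forEach(function(link) {
    var i = typeof link.source === "number" ? link.source : nodes.indexOf(link.source),
        j = typeof link.target === "number" ? link.target : nodes.indexOf(link.target);
    matrix[i][j] += link.value || 1;
  });
  return matrix;
}
```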
Mike
https://github.com/d3/d3-plugins/tree/master/d3/graph
I've also added the concept of a traversal, which is analogous to d3's
selections. Traversals could be used for getting a subgraph to bind to
a visualization, or to bind data to the graph itself.
I need to do a bit more research before implementing traversals.
Here's a resource I found on the topic:
http://opendatastructures.org/versions/edition-0.1d/ods-java/node59.html
This sounds interesting, so I did a bit of looking into it.
Here are some reactions and speculation, followed by some
resources that I found useful.
-r
Reactions:
D3 needs a flexible, extensible set of libraries to manipulate
data, create different kinds of charts, etc. Work is proceeding
on a variety of fronts:
* D3 and its examples are under active development.
* Efforts are under way to refactor working D3 examples into
driver scripts and (abstracted, generalized) libraries.
* The work on graphs started with some example data, but Kai
and I abstracted that a bit. Kai is now implementing Real
Code (TM) to do data conversion, add accessors, etc.
However, there is nothing to keep any of us from looking into
alternative approaches (eg, based on ggplot2) if we wish.
ggplot2 is basically a chart creation library for R, based on a
"Grammar of Graphics". So, it is likely to be nicely organized,
general, and theoretically grounded. It also comes with a user
community, examples, reference implementation, documentation, etc.
So, it may make sense to use the Grammar as a testbed and a way
of organizing and naming (at least some) libraries. In any case,
it can't hurt to give the Grammar a fair evaluation.
I haven't made a detailed comparison, but I believe that D3, SVG,
and the Grammar share many concepts and capabilities. Looking
over the ggplot2 examples, I didn't notice anything that seems to
be out of reach for D3 and SVG. So, a plausible fit.
That said, there appear to be some real differences:
* ggplot2 can take advantage of R's statistical tools, etc.
D3 (unless used as a front-end to R) cannot.
* The D3/SVG platform has features (eg, data manipulation,
some geometric objects, interactivity) that ggplot2 lacks.
Speculation:
Here's an entirely speculative scenario:
* Have ggplot2 generate JavaScript-friendly serializations of
plotting requests (eg, command scripts and supporting data).
* Using D3, accept and implement the plotting requests.
* Create a CI framework, driven by ggplot2 examples and tests.
Are there any R and/or ggplot2 enthusiasts who would like to
give some of this a try?
Resources:
ggplot2: Elegant Graphics for Data Analysis (Use R!)
http://www.amazon.com/dp/B0041KLFRW
The Grammar of Graphics
http://www.amazon.com/dp/0387245448
http://had.co.nz/ggplot2
http://had.co.nz/ggplot2/resources/2007-past-present-future.pdf
http://had.co.nz/ggplot2/resources/2007-vanderbilt.pdf
http://cran.r-project.org/web/packages/ggplot2/index.html
http://cran.r-project.org/web/packages/ggplot2/ggplot2.pdf
> Have you checked out http://polychart.com/js#about ?
I hadn't seen that. It would be lovely to take advantage of this work,
but the licensing is a poison pill for any sort of inclusion in D3:
For personal use, Polychart is licensed under Creative Commons
Attribution-NonCommercial. This means that Polychart is free for
personal, academic, and non-profit use.
We also provide a licensing options for commercial use. Since
Polychart is still in active development, our current licensing
package will allow your company or organization to use all
versions of Polychart up until version 1.0 at a discounted price.
-- http://polychart.com/js#license
-r
I've done a fair amount of investigating into ggplot2, read the book
and much of the R source. It's a really good framework for breaking
down graphics into reusable pieces. But, it doesn't explore any kind
of interactivity or transitions. It's also fairly static when it comes
to things like legend generation, so it depends what you're looking
for out of a graphing library.
-N
The main things d3/js lacks are the statistical functions that
power many of the plots - e.g. loess, quantile regression, density
estimation, boxplots, ... I don't think any of these are too hard to
write by themselves, but in aggregate it's a lot of work.
ggplot2 also tries harder to implement non-Cartesian coordinate
systems. This is rather attractive theoretically, but I'm not sure
how useful it is in practice.
> That said, there appear to be some real differences:
>
> * ggplot2 can take advantage of R's statistical tools, etc.
> D3 (unless used as a front-end to R) cannot.
>
> * The D3/SVG platform has features (eg, data manipulation,
> some geometric objects, interactivity) that ggplot2 lacks.
>
>
> Speculation:
>
> Here's an entirely speculative scenario:
>
> * Have ggplot2 generate JavaScript-friendly serializations of
> plotting requests (eg, command scripts and supporting data).
>
> * Using D3, accept and implement the plotting requests.
>
> * Create a CI framework, driven by ggplot2 examples and tests.
>
> Are there any R and/or ggplot2 enthusiasts who would like to
> give some of this a try?
This is very close to possible in the latest version of ggplot2,
because when plotting you also get an (invisible) data frame that has
(almost) all of the data you need to generate the plot. It would be
easy to serialise this to json and then have d3 render it. I've also
been thinking it might be possible to convert ggplot2 code
automatically into d3, thinking more along the lines of creating
something basic that could then be hand tweaked.
I'm going to be in SF for a month this summer - working with
metamarkets to figure out what d3 + ggplot2 equals.
Hadley
--
Assistant Professor / Dobelman Family Junior Chair
Department of Statistics / Rice University
http://had.co.nz/
Assuming that this serialization were in place, what would it take
to make R/ggplot2 act as a back-end server for D3? For example, I
wonder whether it might be possible to use a web browser (with D3)
as an interactive front end for R.
Getting even more carried away, it might be interesting to play
with some of Bret Victor's ideas (see http://worrydream.com) on
rapid interaction, etc.
See http://gabrielflor.it/water for an example of what this might
look like. FYI, although the page says to hold down the control
key, on my Mac I found that the option key seems to be needed.
> I've also been thinking it might be possible to convert ggplot2
> code automatically into d3, thinking more along the lines of
> creating something basic that could then be hand tweaked.
This could be very useful (eg, in transforming examples and tests
for use in a combined system). BTW, do you have any thoughts on
how to do automated testing? Does ggplot2 have a test suite?
-r
Almost nothing - R already has a built-in server (for documentation,
but it can be (mis-)used for other purposes), or there are more
standard options (like apache etc). I have a half-written port of
Ruby's Sinatra to R at https://github.com/hadley/sinartra
> Getting even more carried away, it might be interesting to play
> with some of Bret Victor's ideas (see http://worrydream.com) on
> rapid interaction, etc.
Yes, that would be really cool. See also Jeroen Ooms's web interface
for ggplot2: http://www.stat.ucla.edu/~jeroen/ggplot2/. It's not
quite rapid iteration, but it's a step in the right direction.
> See http://gabrielflor.it/water for an example of what this might
> look like. FYI, although the page says to hold down the control
> key, on my Mac I found that the option key seems to be needed.
That is cool, but I wonder if it makes it too easy to get distracted
by surface features of the visualisation, instead of thinking deeply
about the problem you are trying to solve.
>> I've also been thinking it might be possible to convert ggplot2
>> code automatically into d3, thinking more along the lines of
>> creating something basic that could then be hand tweaked.
>
>
> This could be very useful (eg, in transforming examples and tests
> for use in a combined system). BTW, do you have any thoughts on
> how to do automated testing? Does ggplot2 have a test suite?
ggplot2 has two test suites - a standard unit testing suite
(https://github.com/hadley/ggplot2/tree/master/inst/tests) which tests
data structures, and a visual (regression) testing suite
(https://github.com/wch/ggplot2/wiki/Visual-test-system), that
compares renderings across commits.
Automated testing for graphics is hard, but as I write more tests, the
underlying code becomes more amenable to testing. Some purely visual
tests will always be necessary, but since they require human
intervention, I'd rather keep them to a minimum.
ggplot2: Elegant Graphics for Data Analysis (Use R!)
http://www.amazon.com/dp/0387981403
Hadley Wickham; Springer
here are a couple of relevant pull quotes (readable in context
via Amazon's "First Pages" feature and on Google Books,
http://books.google.com/books?id=F_hwtlzPXBcC):
Wilkinson (2005) created the grammar of graphics to describe
the deep features that underlie all statistical graphics. The
grammar of graphics is an answer to a question: what is a
statistical graphic? The layered grammar of graphics (Wickham,
2009) builds on Wilkinson's grammar, focussing on the primacy
of layers and adapting it for embedding within R.
In brief, the grammar tells us that a statistical graphic is a
mapping from data to aesthetic attributes (colour, shape, size)
of geometric objects (points, lines, bars). The plot may also
contain statistical transformations of the data and is drawn on
a specific coordinate system. Faceting can be used to generate
the same plot for different subsets of the dataset. It is the
combination of these independent components that make up a
graphic.
and
It does not describe interaction: the grammar of graphics
describes only static graphics and there is essentially no
benefit to displaying on a computer screen as opposed to on a
piece of paper. ggplot2 can only create static graphics, so
for dynamic and interactive graphics you will have to look
elsewhere. ...
Like the Grammar, D3 thinks in terms of mappings "from data to
aesthetic attributes (colour, shape, size) of geometric objects
(points, lines, bars)".
Also, the same declarative style is found in both D3 and ggplot2
calls. In summary, D3 users will find quite a bit of familiar
thinking in the Grammar and its ggplot2 implementation.
Syntactic issues aside, the major differences seem to be that:
* D3 doesn't understand statistics and has no conceptual
framework (or preconceptions) about how data graphics can
and should be presented.
* The Grammar doesn't cover dynamic and interactive graphics,
nor do I see any support for graph and network analysis.
However, I see no reason why the Grammar could not be
extended to support these and other capabilities.
Following up on the question of rapid interaction, I'd love to see
a "workbench", based on D3, ggplot2, and the Grammar. It would let
the user experiment with both high- and low-level controls on how
the data is being presented.
Although I agree that this could be a danger:
> ... I wonder if it makes it too easy to get distracted by surface
> features of the visualisation, instead of thinking deeply about
> the problem you are trying to solve.
I believe that users would quickly evolve (or learn) an approach
which allows them to experiment with various aspects of selection,
aggregation, presentation, etc. ggplot2 reduces the hassle of
creating statistical graphics in R; the user need only type in a
few parameters to see a different presentation. The workbench I
have in mind would simply carry that to a new level of speed and
convenience.
Finally, a note on representation. I think it should be possible
to define a JSON serialization for ggplot2 calls. For example:
qplot(carat, price, data = dsmall, geom = c("point", "smooth"))
might look something like:
{
  'data': 'dsmall',
  'geom': 'c("point", "smooth")',
  'x':    'carat',
  'y':    'price'
}
Encoding the parameters in object form is not just a syntactic
change. It could let D3 and its users manipulate the graphing
parameters (and results) in a dynamic, interactive fashion.
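To make that concrete, here is a hypothetical receiver for such a request: it parses the JSON and normalizes it into an object a D3-based renderer could act on. The field names follow the example above; everything else, including the stopgap parsing of R's c(...) syntax, is invented for illustration:

```javascript
// Hypothetical receiver for a serialized plotting request. The geom
// string still carries R's c("...", "...") syntax, so we unwrap it
// into a plain array; data and aesthetics pass through by name.
function parsePlotRequest(json) {
  var spec = JSON.parse(json);
  return {
    dataset: spec.data,
    aesthetics: { x: spec.x, y: spec.y },
    geoms: spec.geom
      ? spec.geom.replace(/^c\(|\)$/g, "")
                 .split(",")
                 .map(function(s) { return s.trim().replace(/"/g, ""); })
      : []
  };
}
```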