crossfilter and missing data


Chris Withers

Nov 28, 2012, 2:54:03 AM
to d3...@googlegroups.com
Hi All,

I didn't see a mailing list for crossfilter, if there is one, please can
someone point me there?

In the meantime, I want to know what to do about some missing data.
The data I'm exploring is actually very similar to the data in the
example at http://square.github.com/crossfilter/, I'm looking at delays
and cancellations on the UK rail network.

So, I have data that looks like this:

origin,destination,departed,delay
PAD,RED,201211230716,0
PAD,RED,201211260701,CANCELLED
PAD,RED,201211260721,2

...and I want to show cancelled services as a red bar stacked on top of
the "date" histogram in the example.

I had thought to restructure the data to be like:

origin,destination,departed,delay,cancelled
PAD,RED,201211230716,0,0
PAD,RED,201211260701,?,1
PAD,RED,201211260721,2,0

...but, how do I sum up those cancels for each day to stack them onto
the "date" histogram?

More importantly, what do I put in place of the '?' for delay?
If I put "0" they'll show up in the "0 minutes delayed" bucket, which
wouldn't be correct ;-) How do I exclude cancelled services from the
"arrival delay" histogram?

cheers,

Chris

--
Simplistix - Content Management, Batch Processing & Python Consulting
- http://www.simplistix.co.uk

CG

Nov 28, 2012, 6:45:40 AM
to d3...@googlegroups.com
Facing same issue here.

For ordinal dimensions (PAD, RED in your dataset), it is pretty easy. Something like: dim = crossfilter.dimension(function(d) { return d.ordinal ? d.ordinal : '__missing'; }) should work.
What I am planning to do for interval dimensions (numbers) is to have my dimension function return Infinity when the value is missing/NaN. This still allows for correct crossfilter ordering. Then I'd add one group for disaggregating/filtering Infinity from the other, valid values.
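
A minimal sketch of the two value functions described above, written as plain JavaScript so it runs without any library. The names `MISSING`, `originValue`, `delayValue` and `sampleRows` are hypothetical, and the fourth sample row (empty origin) is invented purely to exercise the missing case. With crossfilter, functions like these would presumably be passed to `dimension(...)` on a crossfilter instance.

```javascript
// Placeholder category for missing ordinal values (hypothetical name).
const MISSING = '__missing';

// Ordinal accessor: empty or absent values become an explicit category.
function originValue(d) {
  return d.origin ? d.origin : MISSING;
}

// Numeric accessor: anything that does not parse to a finite number
// (e.g. "CANCELLED", "", undefined) becomes Infinity, so every record
// still has a value that sorts after all real delays.
function delayValue(d) {
  const n = Number.parseFloat(d.delay);
  return Number.isFinite(n) ? n : Infinity;
}

// First three rows are from the post; the last one is made up.
const sampleRows = [
  { origin: 'PAD', destination: 'RED', departed: '201211230716', delay: '0' },
  { origin: 'PAD', destination: 'RED', departed: '201211260701', delay: 'CANCELLED' },
  { origin: 'PAD', destination: 'RED', departed: '201211260721', delay: '2' },
  { origin: '',    destination: 'RED', departed: '201211260741', delay: '5' },
];

const sampleOrigins = sampleRows.map(originValue);
const sampleDelays = sampleRows.map(delayValue);
```

Using Infinity rather than NaN matters here because NaN does not compare consistently with other numbers, which would upset any sort-based ordering.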

my 2 cents
C.

Chris Withers

Dec 3, 2012, 3:15:38 AM
to d3...@googlegroups.com, CG
On 28/11/2012 11:45, CG wrote:
> for ordinal dimensions (PAD, RED in your dataset), it is pretty easy.
> something like : dim = crossfilter.dimension(function(d) {return
> !!d.ordinal ? d.ordinal: '__missing'}) should work.

Okay, but what do you do with the __missing value?

> What I am planning to do for interval dimension (numbers) is to have my
> dimension function return Infinity when the value is missing/NaN. This
> still allows for correct crossfilter ordering.

That makes sense, but if you have a lot of data in this category, how do
you stop that category from artificially reducing the height of the
histogram? (or are D3's histograms smarter than that?)

> Then I'd add one group
> for dis-aggregating/filtering Infinity from other correct values.

Not sure I follow here, can you give me an example?

Chris

CG

Dec 5, 2012, 6:37:01 AM
to d3...@googlegroups.com, CG
On 28/11/2012 11:45, CG wrote: 
> > for ordinal dimensions (PAD, RED in your dataset), it is pretty easy.
> > something like : dim = crossfilter.dimension(function(d) {return
> > !!d.ordinal ? d.ordinal: '__missing'}) should work.
>
> Okay, but what do you do with the __missing value?

I treat them as a "usual" group. For instance, for a question that allows 'Yes' or 'No' answers, the visualization I would implement would be a pieChart counting the Yes, No and Missing categories. Hope that makes sense.

> > What I am planning to do for interval dimension (numbers) is to have my
> > dimension function return Infinity when the value is missing/NaN. This
> > still allows for correct crossfilter ordering.
>
> That makes sense, but if you have a lot of data in this category, how do
> you stop that category from artificially reducing the height of the
> histogram? (or are D3's histograms smarter than that?)

DC's histograms are actually that smart ; ). histogram.elasticY(true) would rescale the chart on the basis of selected data.

> > Then I'd add one group
> > for dis-aggregating/filtering Infinity from other correct values.
>
> Not sure I follow here, can you give me an example?

For one interval dimension, you will have 2 visualizations: 1 histogram with Infinity values removed from the scale, and maybe one pieChart showing the proportion of missing values vs non-missing. I might have one example to show in a few days if that helps.
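
Until then, a rough library-free sketch of the two views, assuming the delays have already been mapped so that missing values are Infinity. The function names `delayHistogram` and `missingSplit` and the sample values are made up for illustration; with crossfilter the same split would be done on group results rather than on a raw array.

```javascript
// Made-up delay values; Infinity stands in for a missing/cancelled delay.
const delayValues = [0, Infinity, 2, 2, Infinity, 7];

// View 1: counts per delay value, with the Infinity placeholder left out
// so it cannot stretch the scale or dominate the bars.
function delayHistogram(values) {
  const counts = new Map();
  for (const v of values) {
    if (v === Infinity) continue;
    counts.set(v, (counts.get(v) || 0) + 1);
  }
  return counts;
}

// View 2: the two slices for a missing vs non-missing pie chart.
function missingSplit(values) {
  const missing = values.filter((v) => v === Infinity).length;
  return { missing: missing, present: values.length - missing };
}

const histogramCounts = delayHistogram(delayValues);
const split = missingSplit(delayValues);
```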
Cheers,
C.

Chris Withers

Dec 5, 2012, 7:17:44 AM
to d3...@googlegroups.com, CG
On 05/12/2012 11:37, CG wrote:
> > for ordinal dimensions (PAD, RED in your dataset), it is pretty easy.
> > something like : dim = crossfilter.dimension(function(d) {return
> > !!d.ordinal ? d.ordinal: '__missing'}) should work.
>
> Okay, but what do you do with the __missing value?
>
> I treat them as a "usual" group. For instance, for a question that
> allows 'Yes' or 'No' answers, the visualization I would implement would
> be a pieChart counting Yes, No and Missing categories. Hope that makes
> sense.

Yep.

> That makes sense, but if you have a lot of data in this category, how do
> you stop that category from artificially reducing the height of the
> histogram? (or are D3's histograms smarter than that?)
>
> DC's <http://nickqizhu.github.com/dc.js/> histograms are actually that
> smart ; ). histogram.elasticY(true) would rescale the chart on the basis
> of selected data.

Are you using these for histograms?

https://github.com/NickQiZhu/dc.js/wiki/API#wiki-bar-chart

If so, how do you get histograms with variable-width buckets?

> > Then I'd add one group
> > for dis-aggregating/filtering Infinity from other correct values.
>
> Not sure I follow here, can you give me an example?
>
> For one interval dimension, you will have 2 visualizations: 1 histogram
> with Infinity values removed from the scale, and maybe one pieChart
> showing the proportion of missing values vs non-missing. I might have
> one example to show in a few days if that helps.

What I really want to do is show cancelled services stacked on top of
the other bars, e.g. stacked on departure time, date and time of day. Any
idea if DC can do that kind of stacked bar?
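
Whatever draws the bars, the data it needs is two counts per day. Here is a rough sketch of how that might be computed, using the add/remove/initial shape that crossfilter's group.reduce() takes. The function names, the `cancelled` boolean field and the `groupByDay` helper are assumptions for illustration only; `groupByDay` is a library-free stand-in for a real grouping so the sketch runs on its own.

```javascript
// Reduce functions in the (p, v) form used by crossfilter's group.reduce().
function reduceInitial() {
  return { ran: 0, cancelled: 0 };
}
function reduceAdd(p, v) {
  if (v.cancelled) p.cancelled += 1; else p.ran += 1;
  return p;
}
function reduceRemove(p, v) {
  if (v.cancelled) p.cancelled -= 1; else p.ran -= 1;
  return p;
}

// Day key taken from a 'YYYYMMDDhhmm' departure string.
function dayOf(v) {
  return v.departed.slice(0, 8);
}

// Stand-in for a grouped dimension: one {ran, cancelled} pair per day.
function groupByDay(records) {
  const groups = new Map();
  for (const r of records) {
    const key = dayOf(r);
    if (!groups.has(key)) groups.set(key, reduceInitial());
    reduceAdd(groups.get(key), r);
  }
  return groups;
}

// The three rows from the first post, with the '?' delay stored as null.
const services = [
  { departed: '201211230716', delay: 0, cancelled: false },
  { departed: '201211260701', delay: null, cancelled: true },
  { departed: '201211260721', delay: 2, cancelled: false },
];

const perDay = groupByDay(services);
```

Each day's `ran` count would be the base bar and `cancelled` the red segment stacked on top of it.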