Is it possible to implement the Largest Triangle Thee Buckets downsampling algorithm in Vega?

234 views

Skip to first unread message

Job van der Zwan

unread,

Sep 21, 2016, 11:53:59 AM9/21/16

to vega-js

Hey all,

I work at a research group for single cell RNA sequencing, building a web-based data browser for this group. Currently we're exploring Vega - it looks really powerful, and also quite elegant in its declarative approach.

I have a test-case where I basically have a huge array of numerical values; much than we can show in a single bar chart. So instead we can opt to use the mean values instead. Now I think I figured out a decent way to do that using the following transform when importing the array:

"transform": [

{ "type": "rank", "field": "_id" },

{ "type": "bin", "field": "rank", "maxbins": 400, "output": { "mid": "idx" } },

{

"type": "aggregate",

"groupby": "idx",

"summarize": [{ "field": "data", "ops": ["mean"], "as": ["avg"] }]

}

]

First we rank it by unique _id (this _id is guaranteed to be in the order of the array, correct?), then bin by rank, then calculate the mean value over those bins.

However, given the type of data exploration that we are doing, the values that are visually most important are the outliers, which is exactly what is smoothed away when we take the mean values. Another issue is that our data tends to be sparse - that is, full of zeroes.

One fairly simple but effective algorithm to deal with this is Sven Steinnarsson's Largest Triangle Three Buckets algorithm.

Online demo: http://flot.base.is/
Original master thesis by Steinnarsson: http://skemman.is/handle/1946/15343

JS implementation: https://github.com/joshcarr/largest-triangle-three-buckets.js

We use a simplified variant of the algorithm, which works as follows:

partition data in bins
calculate average values of each bin
for each bin, select the value that has the highest difference with the average of the neighbouring averages

What this effectively does is sample the value with the highest local contrast, which is biased towards the outliers. Exactly what we want!

My problem is that I'm stumped with implementing step 3; I've described my approach to step 1 and 2 above. Is it even possible to compare data to elements next to it at the moment?

Of course, I could prepare the data before handing it over to Vega, but since this downsampling depends on the visual dimensions of the chart I'd rather leave that responsibility to Vega if at all possible.

Reply all

Reply to author

Forward

0 new messages