Use of D3.js for data visualization of 100 million records

Rishav Sharma

unread,
Jun 8, 2015, 10:10:45 AM6/8/15
to d3...@googlegroups.com

Hi all,

I am very new to data visualization and I have been given a requirement to solve, so I have some basic doubts. Please help me clarify them.

I have a requirement to visualize 100 million records that live in SQL Server. Currently we do the reporting with .NET, but it takes a couple of days to prepare all the reports, and I want to improve the performance. After some study I learned about using D3.js, or dc.js with crossfilter.js. Before starting the implementation I have some doubts:

1. I read somewhere that if the data set is very large (more than about 1 million records), we cannot use a static file like .json or .csv, because keeping that much data in the browser is very likely to crash it.


2. We want to make our visualization interactive, so that users can interact with the charts and see the data from different views, as in http://dc-js.github.io/dc.js/. In that approach, every time we want to change the view of a report we need to send a request to the server. My server-side code will still be C#, so how can we improve the performance? Every user request will hit the database, process millions of records, perform the DML logic there, and bring the result back. How can that finish in seconds when it previously took hours or days?

 

Everywhere I see D3.js mentioned alongside big data, but from what I can tell D3.js does not actually handle big data; it performs best when it receives small, aggregated data. We cannot process even 1 GB of data directly with D3.js, so how is it related to big data?

 

I want to visualize my data with D3.js only, but with faster performance.


Please suggest the right way to solve this problem.

Gil Luz

unread,
Jun 8, 2015, 2:02:00 PM6/8/15
to d3...@googlegroups.com
Hi Rishav.
D3 is great, but it doesn't solve the big-data problem on its own.
You need some kind of database (in the broad sense) that can handle the amount of data and the data processing you want.
It is a question of good architecture and good data modelling; that is what will give you the performance you want.
Just an example:
You can load your data into Amazon Redshift or Google BigQuery. Those tools can process billions of records; it is just a matter of scale (which translates to money).
Those tools, or any other DB or DB + web server, should return only the minimum data that is relevant to the user querying it. Then you can visualise that data with D3.
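
To illustrate that pattern, here is a minimal sketch in which the server returns a small, pre-aggregated result and D3 only draws it. The /api/sales-by-month endpoint and its {month, total} fields are made up for illustration, and the code uses the D3 v3 API that was current at the time:

    // The endpoint is assumed to return a few dozen pre-aggregated rows,
    // e.g. [{"month": "2015-01", "total": 1234567}, ...] - never the raw records.
    d3.json("/api/sales-by-month", function(error, data) {
      if (error) { console.error(error); return; }

      var width = 600, height = 300;

      var x = d3.scale.ordinal()
          .domain(data.map(function(d) { return d.month; }))
          .rangeRoundBands([0, width], 0.1);
      var y = d3.scale.linear()
          .domain([0, d3.max(data, function(d) { return d.total; })])
          .range([height, 0]);

      d3.select("body").append("svg")
          .attr("width", width)
          .attr("height", height)
        .selectAll("rect")
          .data(data)
        .enter().append("rect")
          .attr("x", function(d) { return x(d.month); })
          .attr("width", x.rangeBand())
          .attr("y", function(d) { return y(d.total); })
          .attr("height", function(d) { return height - y(d.total); });
    });

However many rows the database crunches, the browser only ever sees the handful of aggregated values it has to draw.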

Hope I helped.
Gil


Rishav Sharma

unread,
Jun 9, 2015, 2:26:10 AM6/9/15
to d3...@googlegroups.com
Hi Gil,
Thanks for your reply; it really helped me.
Tell me one more thing.
I am thinking about keeping the data in the company's own infrastructure, so I will not consider any cloud-based solution.
For on-premises infrastructure, which architecture would be good:
1. R/Python/Julia + D3.js + MongoDB
2. Hadoop, with or without MongoDB

Thanks

Gil Luz

unread,
Jun 9, 2015, 3:13:13 AM6/9/15
to d3...@googlegroups.com
Hi Rishav.
These days the architecture should be tailored to the use case.
There are a lot of variables to know before building the right architecture.
Just off the top of my head:
1. What is the type of your company? How many employees?
2. What data are you collecting?
3. How much data (in size) are you collecting every day? How can it grow/scale in the future?
4. Does the data have a schema?
5. Which questions/queries are you going to ask of this solution?
6. How many users are going to query the data, and are they internal company users or external users? Is that number expected to grow, and how?

To address the only two possibilities you gave in your question, it seems you are asking whether it is better to use Hadoop or Mongo (the other tools can be added to either option if you need them).
Hadoop is good at batch processing, and those processes can scale linearly by adding more machines.
If you have a lot of users who need to query the data, then yes, you should add some kind of NoSQL DB like Mongo to put Hadoop's results in (and if not, an RDBMS is still relevant).
Please note that Hadoop query time is very long by design. Put your results in another DB, or use Impala or the like.
Also, please note that MongoDB is a bit hard to maintain in production, and Hadoop is even worse. That is one of the reasons a lot of companies go to the cloud for managed services. If you have a good IT/DevOps team then you are fine, but get them on board.

I like to put an emphasis on data modelling: you can have a cluster of thousands of nodes and still fail with the wrong data model.

Hope I helped. I believe we have drifted quite far from this forum's topic :)

Cheers,
Gil

Rishav Kumar

unread,
Jun 9, 2015, 5:12:53 AM6/9/15
to d3...@googlegroups.com
Thanks Gil,
It was really helpful info.


nick

unread,
Jun 9, 2015, 1:48:20 PM6/9/15
to d3...@googlegroups.com
Yeah, the DOM, and therefore D3, is unlikely to meet your need. Hear, hear to understanding your schema.

If your data is big, but you want something that still feels reasonably interactive and iterative, Bokeh and VisPy (both Python) work very well in the IPython notebook.

Bokeh feels a bit like D3, but is already set up to do things efficiently over large data sets:
http://bokeh.pydata.org/en/latest/

VisPy is related, but is even better at animation and 3D, as it draws directly to the graphics card:
http://vispy.org/

As for your data pipeline, have a look at Blaze (still Python):
http://blaze.pydata.org/en/latest/

It has pluggable backends and makes intelligent decisions, so you can start with a small, local set of data and then scale up to big data sets without changing your code, and with only a moderate amount of coding.

Especially if you are in the thrall of Redmond, I recommend the Anaconda distribution, which ships all of the above except VisPy.

Curran

unread,
Jun 9, 2015, 3:07:38 PM6/9/15
to d3...@googlegroups.com
D3 can be used to represent "Big Data" visually only after it has been reduced. Here is a great paper about the imMens system that contains an overview of various data reduction methods that can be used in conjunction with visualization: http://vis.stanford.edu/files/2013-imMens-EuroVis.pdf (see section "3. Data Reduction Methods").

For example, rather than show a scatter plot with 100,000,000,000,000 points, you could compute a two-dimensional histogram, where each square in, say, a 100 × 100 grid shows the count of points in that bucket. This kind of data reduction is called "binned aggregation", and it preserves salient features of the data set (like the distribution) while reducing the number of visual marks that need to be drawn (and the amount of data that needs to be sent to the browser). This is also related to the concept of OLAP data cubes (http://en.wikipedia.org/wiki/OLAP_cube), which represent aggregated summaries of multidimensional data sets used for analysis and visualization. Another way to reduce your data is to pick a categorical dimension, compute counts, sums, or averages for each distinct value, then visualize the result as a bar chart. There will only be as many bars as there are distinct values in the column you choose to aggregate by, which is typically far fewer than the number of rows in your original data set.
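
As a rough illustration of binned aggregation (a hypothetical helper, not code from the paper), counting points into a 100 × 100 grid could look like this in plain JavaScript; in practice this reduction would usually run on the server (a SQL GROUP BY, a map-reduce job, etc.), and only the small grid of counts would be sent to D3:

    // Reduce raw (x, y) points to counts over a gridSize x gridSize grid.
    function binPoints(points, xExtent, yExtent, gridSize) {
      var counts = [];
      for (var i = 0; i < gridSize * gridSize; i++) counts.push(0);

      var xSpan = xExtent[1] - xExtent[0];
      var ySpan = yExtent[1] - yExtent[0];

      points.forEach(function(p) {
        var col = Math.min(gridSize - 1, Math.floor((p.x - xExtent[0]) / xSpan * gridSize));
        var row = Math.min(gridSize - 1, Math.floor((p.y - yExtent[0]) / ySpan * gridSize));
        counts[row * gridSize + col] += 1;
      });

      return counts; // 10,000 numbers for a 100 x 100 grid, however many points went in
    }

The 10,000 counts can then be drawn as a heatmap with one mark per cell, regardless of how many raw rows were aggregated into them.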

Hope this information is helpful, all the best with your work.

Best regards,
Curran

Zoe C

unread,
Jun 9, 2015, 9:46:48 PM6/9/15
to d3...@googlegroups.com
Hi, I have around 800,000 records to visualize on one screen.

Does that mean I cannot achieve it?

Do I need to do some data reduction before the visualization?

Henry McDaniel

unread,
Jun 9, 2015, 10:48:41 PM6/9/15
to d3...@googlegroups.com
How do you plan to get the data into the client web browser? You want to minimize the loading time, and you are easily looking at 7 megabytes (uncompressed) of data with 800k very simple records in a flat format like CSV. So you have to think about that too. Yes, you need to reduce the data.
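
A back-of-the-envelope version of that estimate, where the bytes-per-row figure is an assumption for a couple of small numeric columns rather than a measurement:

    // Rough payload estimate for a flat CSV.
    var rows = 800000;
    var bytesPerRow = 9;                          // assumed average, e.g. "123,4567\n"
    var mb = rows * bytesPerRow / (1024 * 1024);  // about 6.9 MB uncompressed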

cesar pachon

unread,
Jun 10, 2015, 10:06:26 AM6/10/15
to d3...@googlegroups.com
Hello. As other posts said, you need to balance client-side with server-side processing: the more chart-ready your data is, the more pressure you take off the browser.

That said, here are some thoughts from our experience with D3 and big data:
- A typical chart of 1024 × 768 pixels holds about 786k pixels. This is a good reference point, because you are probably not intending to plot each item as a 1px element.
- Consider using canvas instead of SVG rendering for some charts (see the sketch after this list). Canvas helps reduce the DOM overhead, at the price of losing the usual DOM interactivity; but if your items are close to 1px, typical mouse-over handlers will not make sense anyway.
- Think of ways to create multiple meaningful visualizations from top-level aggregated data and then drill down interactively to detailed views. Make the server side return just the aggregated data that each view requires, or at least groups of datasets of a reasonable size, reusable between a small set of charts.
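
A minimal sketch of the canvas approach, with made-up data and D3 v3-style scales used only for the coordinate mapping (the drawing itself bypasses the DOM entirely):

    // Example data: 200k random points in [0, 100] x [0, 100],
    // standing in for whatever reduced data set the server returns.
    var points = d3.range(200000).map(function() {
      return { x: Math.random() * 100, y: Math.random() * 100 };
    });

    var width = 800, height = 600;

    // One <canvas> element in the DOM, no matter how many points are drawn.
    var canvas = d3.select("body").append("canvas")
        .attr("width", width)
        .attr("height", height);
    var ctx = canvas.node().getContext("2d");

    var x = d3.scale.linear().domain([0, 100]).range([0, width]);
    var y = d3.scale.linear().domain([0, 100]).range([height, 0]);

    ctx.fillStyle = "steelblue";
    points.forEach(function(d) {
      ctx.fillRect(x(d.x), y(d.y), 1, 1);   // one 1px mark per point
    });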