Dataset sizes


Shrikant Narasimhan

Jul 6, 2015, 1:36:25 PM
to redash...@googlegroups.com
Hi,

What's the largest dataset size that can be returned and displayed, and subsequently downloaded as a CSV? Is this only constrained by the available memory on the server?

Thanks for a great piece of software!

Cheers,
shrikant

Eran Sandler

Jul 6, 2015, 1:42:26 PM
to Shrikant Narasimhan, redash...@googlegroups.com

It's a combination of both server memory and browser memory as this data is loaded to the page.

You can obtain a query URL with an API key that will generate JSON; then you are only constrained by the server memory, but you will have to convert it to CSV on your own.
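The JSON-to-CSV conversion mentioned above is a few lines of Python. This is a sketch that assumes the payload has the common `{"query_result": {"data": {"columns": ..., "rows": ...}}}` shape with each row a dict keyed by column name; verify against your instance's output, and note the URL in the comment is hypothetical.

```python
import csv
import io

def result_to_csv(payload):
    """Convert a Redash-style JSON query result to a CSV string.

    Assumes the payload shape {"query_result": {"data": {"columns": [...],
    "rows": [...]}}}, where each row is a dict keyed by column name.
    """
    data = payload["query_result"]["data"]
    names = [col["name"] for col in data["columns"]]
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=names)
    writer.writeheader()
    writer.writerows(data["rows"])
    return buf.getvalue()

# Fetching the JSON itself (hypothetical URL shape -- check your instance):
#   import json, urllib.request
#   payload = json.load(urllib.request.urlopen(
#       "https://redash.example.com/api/queries/42/results.json?api_key=..."))
```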

Eran


--
You received this message because you are subscribed to the Google Groups "re:dash users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to redash-users...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Arik Fraimovich

Jul 6, 2015, 2:54:55 PM
to Eran Sandler, Shrikant Narasimhan, redash...@googlegroups.com

-- 
Arik Fraimovich

Eran Sandler

Jul 6, 2015, 2:55:46 PM
to Arik Fraimovich, Shrikant Narasimhan, redash...@googlegroups.com

Cool. I wasn't aware of that :-)

Shrikant Narasimhan

Jul 7, 2015, 10:24:52 AM
to redash...@googlegroups.com, er...@sandler.co.il, ar...@arikfr.com
The dataset resulting from the query is a pretty fat table of about 1.6 million rows and 80-odd mostly text columns. This works out to about 800 MB.

Thanks Arik -- the API returning a CSV directly seems like it will serve the purpose.

Could you outline an approach I can use to allow users to export results of large queries directly instead of having the results displayed within Redash?

Regards,
Shrikant

Arik Fraimovich

Jul 12, 2015, 4:11:21 PM
to Shrikant Narasimhan, redash...@googlegroups.com, Eran Sandler
It might be even larger in re:dash, due to the way we serialize data. I want to improve handling of large data sets, and somehow allow "streaming" them from the source into our storage, but it was never a high enough priority.

If you manage to query the data and store it using re:dash, then avoiding showing it in the UI should be relatively easy. We just need to add a flag to the query result indicating whether it's a large data set (>100K rows?). In such cases we won't send the results to the UI automatically; instead we'll just show some info about the results and let the user download them.
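The flag idea above can be sketched as a small annotation step on the backend. This is only an illustration: the field names `row_count` and `is_large` are made up here, and the threshold is the ">100K rows?" suggestion from the paragraph above.

```python
LARGE_RESULT_THRESHOLD = 100_000  # the ">100K rows?" suggestion; tune as needed

def annotate_query_result(query_result, threshold=LARGE_RESULT_THRESHOLD):
    """Attach row_count and is_large to a query result dict so the UI can
    decide whether to auto-load the rows or only offer a download link.

    Assumes the {"data": {"columns": [...], "rows": [...]}} shape; the
    field names here are hypothetical, not existing Redash fields.
    """
    row_count = len(query_result["data"]["rows"])
    query_result["row_count"] = row_count
    query_result["is_large"] = row_count > threshold
    return query_result
```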

If you want to implement this, I can provide more elaborate details.

-- 
Arik Fraimovich

Shrikant Narasimhan

Jul 12, 2015, 4:17:02 PM
to Redash...@googlegroups.com

Thanks Arik, I'd love to implement this. Could you provide additional details?

Cheers,
Shrikant

Arik Fraimovich

Jul 12, 2015, 4:24:14 PM
to Shrikant Narasimhan, Redash...@googlegroups.com
Sure! I'll review the needed changes, and send my suggestions on how to implement this.

It's night in my time zone, so it will have to wait until tomorrow morning.

-- 
Arik Fraimovich

Arik Fraimovich

Jul 13, 2015, 1:29:50 AM
to Shrikant Narasimhan, Redash...@googlegroups.com
All the query runners collect the results first, convert them to our results format and then serialize them to JSON.

This has the following issues:
1. Our results format stores the column names with each row. For large data sets, this can result in significant overhead.
2. JSON is not a binary format, so this increases memory usage even more.
3. Collecting all the results means we need to keep them all in memory -- and in this case we keep both the unserialized and serialized versions in memory.
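Issue 1 can be made concrete: repeating the column names as dict keys in every row is pure overhead for wide tables. A minimal sketch of the more compact format (header list plus plain tuples), assuming the dict-per-row shape described above:

```python
def compact_rows(columns, rows):
    """Rewrite dict-per-row results (column names repeated as keys in
    every row) as one header list plus plain tuples, removing the
    per-row key overhead. Sketch only, not Redash's actual format."""
    names = [col["name"] for col in columns]
    return names, [tuple(row[n] for n in names) for row in rows]
```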

Ideally we would read a row from the source and write it to our storage as we go. I'm not entirely sure it's possible; it might be with Postgres's lobjects, but I need to look into it more.
Another option is not to store large datasets in PostgreSQL, but rather on some filesystem (either the server's local one, or something like S3).
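The row-at-a-time idea can be sketched with a plain DB-API cursor and a local file; writing to an lobject or S3 would follow the same loop. This is an illustration under those assumptions, not existing Redash code:

```python
import csv

def stream_rows_to_csv(cursor, path, batch_size=1000):
    """Stream rows from an already-executed DB-API cursor straight to a
    CSV file, fetching in batches so the full result set never sits in
    memory at once. Writing to an lobject or S3 would use the same loop."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow([desc[0] for desc in cursor.description])
        while True:
            rows = cursor.fetchmany(batch_size)
            if not rows:
                break
            writer.writerows(rows)
```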

What can be done without many changes is to change the results format to store rows without the column headers. On the backend this is straightforward, but it will require some changes on the client side, which should be done in a backward-compatible way.

Next issue to tackle is to avoid choking the UI with huge result sets. We need to change the backend to detect large datasets (for example with the cursor's rowcount attribute for DB-API based sources), and treat them differently:

1. Keep a flag on query_results object that indicates it has large dataset.
2. Store the results in a different way -- either a different table (the easiest), an lobject, or the filesystem.
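The detection step can be sketched with plain DB-API calls. One caveat worth hedging: `rowcount` is a cursor attribute, and many drivers report -1 for SELECTs until rows are fetched, so a fallback is needed. The helper name and the fetchall() at the end are illustrative only; a real implementation would divert large results to separate storage instead of materializing them.

```python
def execute_with_flag(cursor, sql, threshold=100_000):
    """Run sql and return (rows, is_large), using the DB-API rowcount
    attribute as a hint. Many drivers report -1 for SELECTs, in which
    case we fall back to the fetched length. Sketch only: a real
    implementation would stream large results elsewhere, not fetchall()."""
    cursor.execute(sql)
    hint = cursor.rowcount  # -1 means "unknown" on many drivers
    rows = cursor.fetchall()
    count = hint if hint >= 0 else len(rows)
    return rows, count > threshold
```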

Next, when requesting the results from the API, you will get the query result object along with the flag indicating that this is a big dataset and that the results should be fetched with an additional API call.

This is sort of a brain dump, so probably not so clear. Please feel free to follow up with questions.

-- 
Arik Fraimovich

Arik Fraimovich

Jul 13, 2015, 10:30:48 AM
to Shrikant Narasimhan, Redash...@googlegroups.com
Gave it some more thought, and it might be easier to have the data of a query returned on a separate API call for all queries, not only the large ones. This will make the logic simpler, and will require fewer changes.

You will only need to change the queries APIs a bit and add another field indicating the number of rows returned, based on which the UI can decide whether to automatically load the data.
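The client-side half of that decision is a one-liner. A sketch, assuming the metadata response gained the hypothetical row-count field described above (both the field name and cutoff are made up here):

```python
AUTO_LOAD_MAX_ROWS = 100_000  # hypothetical cutoff; tune per deployment

def should_auto_load(query_result_meta):
    """Client-side decision: fetch the data in the second API call only
    when the (hypothetical) row_count field says it is small enough;
    otherwise show summary info and a download link instead."""
    return query_result_meta.get("row_count", 0) <= AUTO_LOAD_MAX_ROWS
```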

Next you need to check whether your query result can fit in your instance's memory -- if it can, then you don't need to change anything in the way we process query results or store them. If it can't, then you can start improving this based on the notes above.

-- 
Arik Fraimovich

Lance N.

Jul 13, 2015, 4:06:03 PM
to redash...@googlegroups.com
About query sizes: I set up something like this at a previous job and included a subsampler to enforce maximum query sizes. Reservoir sampling is the right way to go.
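For reference, the standard streaming version of this is Algorithm R: a uniform random sample of k rows from a stream of unknown length, holding only k rows in memory. A minimal sketch (the function name is ours, not from any particular codebase):

```python
import random

def reservoir_sample(rows, k, rng=random):
    """Algorithm R: uniform random sample of k rows from a stream of
    unknown length, holding only k rows in memory -- one way to cap a
    result set's size before it ever reaches the browser."""
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            sample.append(row)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = row
    return sample
```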

As to formats: take a look at the jsonp format used by Afterquery. It would be great if re:dash could feed afterquery.

Arik Fraimovich

Jul 15, 2015, 1:08:21 AM
to Lance N., redash...@googlegroups.com
Sampling can definitely work for some use cases, although there are others where you need the complete data set. For example: we use queries to determine the users we want to target with a push notification campaign. We need the query to return all the device ids for the users we want to target, so we can't sample.

As for afterquery -- adding support should be quite simple. In what cases would you like to use afterquery w/ re:dash? To embed results?

-- 
Arik Fraimovich


Leonard Linde

Feb 17, 2016, 9:41:10 AM
to re:dash users, shrik...@gmail.com, Redash...@googlegroups.com, ar...@arikfr.com
Got here via issue #87 - this issue is a bit of a problem for us, because I have a couple of users who want big (500K+ rows) result sets that tend to cause long response times (or crashes) in their browsers. Simply turning off the display of results for queries over a certain threshold would be enough to make re:dash much more useful to us. I realize there are probably far more sophisticated ways to do it, but really there's no point in having a tabular display of 500K rows of data.

Thanks for a great piece of software, btw.

Arik Fraimovich

Feb 18, 2016, 3:59:34 AM
to Leonard Linde, re:dash users, Shrikant Narasimhan
What are they doing with these result sets? Downloading them for further processing in some other tool?

-- 
Arik Fraimovich

Leonard Linde

Feb 18, 2016, 7:42:35 AM
to Arik Fraimovich, re:dash users, Shrikant Narasimhan
Yes - this is mainly to download and analyze in another tool (Excel, etc.)