All the query runners collect the results first, convert them to our results format and then serialize it to JSON.
This has the following issues:
1. Our results format stores the column names with each row. For large data sets, this adds significant overhead.
2. JSON is not a binary format, so this increases memory usage even more.
3. Collecting all the results before serializing means keeping the entire set in memory, and during serialization we hold both the unserialized and serialized versions at once.
Ideally we would stream: read a row from the source and immediately write it to our storage. I'm not entirely sure this is possible; it might be with Postgres's large objects (lobjects), but it needs more investigation.
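To illustrate the streaming idea, here is a minimal sketch that reads one row at a time from a DB-API cursor and writes it out as a JSON line, so the full result set is never held in memory. It uses sqlite3 and an in-memory buffer purely for illustration; in practice the source would be whatever cursor the query runner provides, and the destination could be a Postgres large object or a file on S3. The `stream_results` helper is hypothetical, not an existing function.

```python
import io
import json
import sqlite3

def stream_results(cursor, destination):
    """Read one row at a time from `cursor` and append it to `destination`
    as a JSON line, never materializing the full result set."""
    columns = [col[0] for col in cursor.description]
    destination.write(json.dumps(columns) + "\n")  # header line, written once
    for row in cursor:  # DB-API cursors can be iterated row by row
        destination.write(json.dumps(list(row)) + "\n")

# Illustrative source: an in-memory sqlite3 table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(1, "a"), (2, "b")])
cur = conn.execute("SELECT id, name FROM t")

# Illustrative destination: an in-memory buffer standing in for a lobject/file.
buf = io.StringIO()
stream_results(cur, buf)
```

Note that with a real Postgres source you would also want a server-side (named) cursor, so the driver itself doesn't buffer the whole result set client-side before iteration begins.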
Another option is not to store large datasets in PostgreSQL at all, but rather on some filesystem (either the server's local disk, or something like S3).
What can be done without many changes is to alter the results format to store rows without repeating the column names. On the backend this is straightforward; the client side will require some changes, which should be done in a backward-compatible way.
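A sketch of what that format change could look like, assuming the current format keeps rows as dicts keyed by column name. The `to_compact`/`to_legacy` helpers are hypothetical names, not existing functions; `to_legacy` shows how an older client could be supported during the transition.

```python
def to_compact(results):
    """Convert {"columns": [...], "rows": [dict, ...]} so rows become plain
    lists ordered by the column list, storing each column name only once."""
    names = [col["name"] for col in results["columns"]]
    return {
        "columns": results["columns"],
        "rows": [[row[name] for name in names] for row in results["rows"]],
    }

def to_legacy(results):
    """Inverse conversion, for backward compatibility with older clients."""
    names = [col["name"] for col in results["columns"]]
    return {
        "columns": results["columns"],
        "rows": [dict(zip(names, row)) for row in results["rows"]],
    }

legacy = {
    "columns": [{"name": "id", "type": "integer"},
                {"name": "name", "type": "string"}],
    "rows": [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}],
}
compact = to_compact(legacy)
assert to_legacy(compact) == legacy  # round-trips cleanly
```

The saving grows with row count: each row drops one copy of every column name, which for wide tables and millions of rows is most of the payload.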
The next issue to tackle is avoiding choking the UI with huge result sets. We need to change the backend to detect large datasets (for example via the cursor's rowcount attribute for DB-API based sources) and treat them differently:
1. Keep a flag on the query_results object that indicates it holds a large dataset.
2. Store the results in a different way -- either a different table (the easiest), a lobject, or the filesystem.
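The detection step could be a small helper like the sketch below. Note that per DB-API 2.0, `rowcount` is `-1` when the driver cannot determine the count (common for SELECT statements on some drivers), so that case has to be treated as "unknown" rather than "large". `LARGE_THRESHOLD` and `FakeCursor` are illustrative stand-ins; the real cursor would come from the query runner.

```python
LARGE_THRESHOLD = 100_000  # illustrative cutoff, would need tuning

def is_large_dataset(cursor, threshold=LARGE_THRESHOLD):
    """Flag a result set as large based on the DB-API rowcount attribute.
    rowcount == -1 means the driver couldn't determine the count, so we
    conservatively treat that as not-large here."""
    return cursor.rowcount != -1 and cursor.rowcount > threshold

class FakeCursor:
    """Stub standing in for a real DB-API cursor."""
    def __init__(self, rowcount):
        self.rowcount = rowcount

assert is_large_dataset(FakeCursor(500_000)) is True
assert is_large_dataset(FakeCursor(10)) is False
assert is_large_dataset(FakeCursor(-1)) is False  # unknown count
```

For drivers where `rowcount` is unreliable, a fallback could be counting rows as they are fetched and switching to the large-dataset path once the threshold is crossed.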
Then, when requesting the results from the API, you will get the query result object along with a flag indicating that this is a big dataset and that the results should be fetched with an additional API call.
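The two-step flow could look roughly like this on the client side. Everything here is hypothetical: `api_get` is a stub standing in for real HTTP calls, and the endpoint paths and the `is_large_dataset` field name are placeholders, not an actual API.

```python
def api_get(path):
    """Stub standing in for an HTTP GET against the backend."""
    responses = {
        "/api/query_results/42": {"id": 42, "is_large_dataset": True,
                                  "data": None},
        "/api/query_results/42/data": {"rows": [[1, "a"], [2, "b"]]},
    }
    return responses[path]

def fetch_results(result_id):
    """First call returns the query result object with the flag; only if
    the dataset is flagged as large do we fetch the rows separately."""
    meta = api_get("/api/query_results/%d" % result_id)
    if meta.get("is_large_dataset"):
        meta["data"] = api_get("/api/query_results/%d/data" % result_id)
    return meta

result = fetch_results(42)
```

Keeping the rows out of the first response means list views and metadata lookups stay cheap even when the underlying result set is huge.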
This is sort of a brain dump, so it's probably not very clear. Please feel free to follow up with questions.