How to fix memory error loading largish amount of data

William Volkman

Mar 28, 2024, 3:47:53 PM
to Druid User
I'm using a 0.29 single-node local install. I'm investigating Druid for processing a stream of transaction data. The initial dataset is 1.5 billion records; I've already converted the Parquet files into .json files for the upload. Using the local file input there are 100 JSON files, each just over 15 million records.

Initially I just pointed at the directory and said GO, and was pleased to see it trundle through the data and report that it had found all 1.5 billion of them. Then it got to stage 2, hung for a bit, and said "Done". No data was loaded; in the Tasks view I found some tasks marked FAILED with part of an OutOfMemory error message, without details. The host system has 126GB of memory for user programs, and the task count that was running was 14. So I tried again with just 1 file (much slower, single-threaded ingest), and it finished with all 15 million+ records present. I then tried to load 5 files by specifying a unix wildcard, which is apparently not supported by the WildcardFileFilter class. So I moved 5 files into a separate directory, pointed there, and said go. It finished stage 1 again with the correct record count, then said done, and again no data was loaded; the task had that same OutOfMemory error.
1. How to find out what memory setting to boost to assist with this?
2. Is 1.5 billion records too much for Druid?
3. Recommendations on improved approach appreciated.
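
For reference, this is roughly the shape of the ioConfig the batch loader is using against the directory, as I understand the local input source; the paths here are illustrative, not my real ones:

"ioConfig": {
  "type": "index_parallel",
  "inputSource": {
    "type": "local",
    "baseDir": "/data/json_upload",
    "filter": "*.json"
  },
  "inputFormat": {
    "type": "json"
  }
}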

Thank you,

William Volkman

Mar 30, 2024, 5:24:46 PM
to Druid User
So I keep poking at this. I tried running the SQL batch loader with smaller datasets, hoping for an append-type operation, and it was able to load 45 million records; however, there has been no further progress from putting smaller sets of files into the directory. I created a second upload directory and tried the normal batch upload with 5 files, and it failed. I tried the SQL batch upload and it created an "upload" table, but the UI didn't show it until the 10th reload of the interface and some rummaging around the other tables in the query explorer. I went back to the normal batch upload and this time gave it just one additional file, and lowered the number of records in memory by dividing it by 10. It finished successfully; however, count(*) now returns 15 million additional records instead of the 5 million that were in the single file given to it. Kind of at a loss here.
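
For the record, the SQL batch loads I'm running look roughly like the statement below; the time column name and paths are placeholders I'm using here for illustration, exporter_company_canon_id is one of the real columns:

INSERT INTO "shipping_data"
SELECT
  TIME_PARSE("shipment_date") AS "__time",
  "exporter_company_canon_id"
FROM TABLE(
  EXTERN(
    '{"type":"local","baseDir":"/data/json_batch_2","filter":"*.json"}',
    '{"type":"json"}',
    '[{"name":"shipment_date","type":"string"},{"name":"exporter_company_canon_id","type":"long"}]'
  )
)
PARTITIONED BY DAY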

William Volkman

Mar 30, 2024, 6:07:13 PM
to Druid User
Yeah, the data import from JSON does not work. I gave it 2 files which together total just over 10 million records, ran the import, and it added 30 million records to the collection. The 0.29 version appears to be broken.

William Volkman

Mar 31, 2024, 4:15:45 AM
to Druid User
So I found this; how do I fix it?

2024-03-31T07:30:20,184 INFO [main] org.apache.zookeeper.ZooKeeper - Client environment:os.memory.free=464MB
2024-03-31T07:30:20,184 INFO [main] org.apache.zookeeper.ZooKeeper - Client environment:os.memory.max=512MB
2024-03-31T07:30:20,184 INFO [main] org.apache.zookeeper.ZooKeeper - Client environment:os.memory.total=512MB

A standard Parquet file size is 256MB, and Snappy compression means roughly a 5x expansion when decompressed, so about 1280MB; two copies are needed, so about 2560MB, and that has to be less than 40% of the processing heap (2560MB / 0.4 = 6400MB), so the heap should be around 8GB. It's not clear which process needs the value adjusted, as the documentation does not seem to address this.

2024-03-31T07:30:26,679 INFO [processing-0] org.apache.parquet.hadoop.InternalParquetRecordReader - RecordReader initialized will read a total of 1372160 records.
2024-03-31T07:30:26,679 INFO [processing-0] org.apache.parquet.hadoop.InternalParquetRecordReader - at row 0. reading next block
2024-03-31T07:30:28,968 INFO [processing-0] org.apache.hadoop.io.compress.CodecPool - Got brand-new decompressor [.snappy]
2024-03-31T07:30:29,048 INFO [processing-0] org.apache.parquet.hadoop.InternalParquetRecordReader - block read in memory in 2368 ms. row count = 679936
2024-03-31T07:31:18,635 INFO [processing-0] org.apache.parquet.hadoop.InternalParquetRecordReader - Assembled and processed 679936 records from 43 columns in 48854 ms: 13.917714 rec/ms, 598.46173 cel$
2024-03-31T07:31:18,635 INFO [processing-0] org.apache.parquet.hadoop.InternalParquetRecordReader - time spent so far 4% reading (2368 ms) and 95% processing (48854 ms)
2024-03-31T07:31:18,635 INFO [processing-0] org.apache.parquet.hadoop.InternalParquetRecordReader - at row 679936. reading next block
2024-03-31T07:31:19,310 INFO [processing-0] org.apache.parquet.hadoop.InternalParquetRecordReader - block read in memory in 675 ms. row count = 692224
Terminating due to java.lang.OutOfMemoryError: Java heap space

William Volkman

Mar 31, 2024, 5:29:42 AM
to Druid User
So the "auto" configuration grossly undersized the midleManager and javaOptsArray values, I multiplied by 10X and now it's making progress.

William Volkman

Mar 31, 2024, 3:37:04 PM
to Druid User
Changing the memory allocation to merge the xlarge JVM options in with the results of using "--compute" on the auto config has let the data set claim it has been imported:

1,515,037,217 rows inserted into "shipping_data".

Insert query took 5:04:32.


William Volkman

Mar 31, 2024, 4:00:57 PM
to Druid User
However, not all of the data is there.
Count company_count
1077792247 5196689

Those values should actually be:
 total_records | unique_companies
---------------+------------------
    1515037217 |          6879446

William Volkman

Mar 31, 2024, 6:44:38 PM
to Druid User
Okay, the Datasources tab showed that only 60% of the segments were available, so that count matches up.
I spent time digging through the docs; no mention of this issue that I could find. Asked AI and it hallucinated a set of actions, none of them related.
Found this post through a normal search: https://groups.google.com/g/druid-user/c/uEnJJo_7ODo
The issue is that another default setting is too small for a decent-sized dataset.

conf/druid/auto/historical/runtime.properties (was 300g)

druid.segmentCache.locations=[{"path":"var/druid/segment-cache","maxSize":"2000g"}]

That fixed it, so now the total count is correct; however, the distinct count of company ids is still incorrect.

total_records unique_companies
1515037217 6740965

So it's missing 138,481 values; data accuracy should not be a sizing problem but an implementation problem. It's been nice talking to myself; I hope this is of some value, but I'm marking Druid as not suitable for use in data analysis projects. Disappointed, to say the least, that I spent almost a week on this only to find it not viable.

William Volkman

Apr 1, 2024, 5:38:45 AM
to Druid User
Okay, a bug in distinct handling?
select count(*) as total_records
     , count(distinct(exporter_company_canon_id)) as unique_companies
from shipping_data
total_records unique_companies
1515037217 6740965

This query returns the correct results:
WITH base as (
select exporter_company_canon_id
     , count(*) as rec_count
from "shipping_data"
group by exporter_company_canon_id)
select count(*) as company_count
     , sum(rec_count) as total_transactions
from base
company_count total_transactions
6879446 1515037217

John Kowtko

Apr 1, 2024, 8:26:41 AM
to Druid User
Hi William, 

I just read through your thread ... sounds like you are slowly figuring things out.

If nobody has responded up to now, you might try the OS Druid Slack channel; there is a much larger community conversing there and you may be able to get more eyes on your issue.

I don't know much about Parquet so cannot comment on this.

Re: the distinct count issue, Druid by default uses HLL sketches for approximate count distinct ... the group by is forcing an exact distinct count ... so try adding the query context "useApproximateCountDistinct": false and see if the numbers match ...
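
For example, through the SQL API the context rides along with the query; roughly like this (check the endpoint and payload against the docs for your version):

POST /druid/v2/sql (router or broker)
{
  "query": "SELECT COUNT(*) AS total_records, COUNT(DISTINCT exporter_company_canon_id) AS unique_companies FROM shipping_data",
  "context": { "useApproximateCountDistinct": false }
}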

Thanks.  John

William Volkman

Apr 1, 2024, 2:30:42 PM
to Druid User
I would argue that having useApproximateCountDistinct default to true is a "Bad Thing"(TM); it will cause people to doubt that the tool is viable, as I did. How can I force it to false? At least with Mongo the direct way of querying does not approximate, as approximations are by definition always bad, and you have to explicitly declare that you are not interested in accuracy. How does one set a context value from the Web UI? If I'm going to have the customer look at this, that will be their access. I've only seen some references to it when forming an API payload.

John Kowtko

Apr 1, 2024, 3:29:36 PM
to Druid User
Since Druid began as a noSQL product and was focused on high-performance ingestion/analytics, I can understand why approximations were the default. But Druid is moving quickly into the SQL world, with a lot of effort put into cleaning up NULL handling in the past year ... so to me your suggestion is a valid one.

William Volkman

Apr 1, 2024, 5:31:21 PM
to Druid User
I understand the use case for approximate queries; I just want the semantics to be distinct from exact queries. Having a statistically generated measure for visual presentations in quasi-realtime is useful, just not when you want to use the values for business projections or to make other decisions such as monetization. I found this setting and I assume that it would do the correct thing: druid.query.default.useSketch.
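
The other knob I've seen referenced, which as far as I can tell changes the SQL planner default on the broker rather than per query, is the one below; I haven't verified it on my install, so treat it as a sketch:

conf/druid/auto/broker/runtime.properties
druid.sql.planner.useApproximateCountDistinct=false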
I've put Druid back on the list of recommendations for my client. The good news is that its ingestion is much better than Mongo and Redis, though not quite up to PostgreSQL 15 or Greenplum (single-node Druid was 89K r/s, 6-node Greenplum was 148K r/s, so getting a cluster set up would make sense to finalize the comparison; single-node PostgreSQL 15 was 155K r/s). Query times for Druid and Greenplum were within 20% of each other, around a minute for a full trundle through the data.

John Kowtko

Apr 1, 2024, 5:39:28 PM
to Druid User
Agreed ... in the SQL world you would expect "count(distinct X)" to be exact, and maybe something like "count(approximate distinct X)" or "approx_distinct_count(x)" to be used for the approximate form.

Re: competitive positioning ... I did a lot of competitive performance benchmarking when I was at Oracle back in the early days ... a lot of it has to do with how each product is tuned ... Druid is architected for some very specific performance areas so it is important to use "the right tool for the right job" :) .
