Druid Datasource size difference


Laxmikant Pandhare

Oct 21, 2025, 6:17:25 PM
to Druid User
Hi Team,

In my previous data load:

Step 1: I loaded the data into a temporary table.
Step 2: Using a multi-stage query (MSQ), I took some of the columns from the temporary table and combined them into one common field with JSON_OBJECT (see the sketch below).

This process produced fewer segments and a smaller data size, but it took longer because the data had to be loaded into two tables. For one day, it created around 150 segments and around 150 GB of data.
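For illustration, step 2 looked roughly like this (a simplified sketch, not the exact query; the temporary table name and the 'colA'/'colB' columns are placeholders):

INSERT INTO "datasource_name"
SELECT
  "__time",
  "abc",
  "pqr",
  "test",
  -- 'colA' and 'colB' stand in for the source columns folded into the common JSON field
  JSON_OBJECT(KEY 'colA' VALUE "colA", KEY 'colB' VALUE "colB") AS "otherSourceCols",
  "mnp"
FROM "temp_table"
PARTITIONED BY DAY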

In the current scenario:

Step 1: We process the data in Spark and create this JSON column there.
Step 2: We then load it into Druid via Kafka.

However, this process loads that JSON column as text and produces more segments and a larger data size. For one day, it creates around 580 segments and around 600 GB of data, almost four times the previous process.
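At query time the JSON functions can still be used on that string column with an explicit parse, for example (the key name here is made up):

SELECT JSON_VALUE(TRY_PARSE_JSON("otherSourceCols"), '$.someKey') AS "someKey"  -- 'someKey' is a placeholder
FROM "datasource_name"
LIMIT 10

But that is only a query-time workaround; it does not change how the data is stored.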

After moving to the current scenario, I found one workaround that uses MSQ to replace that JSON text column with a parsed JSON object, like below:


REPLACE INTO "datasource_name"
OVERWRITE WHERE "__time" >= TIMESTAMP '2025-10-01 00:00:00' AND "__time" < TIMESTAMP '2025-10-02 00:00:00'
SELECT "__time", "abc", "pqr", "test",  TRY_PARSE_JSON('otherSourceCols') AS "otherSourceCols", "mnp"
FROM "datasource_name"
WHERE "__time" >= TIMESTAMP '2025-10-01 00:00:00' AND "__time" < TIMESTAMP '2025-10-02 00:00:00'
PARTITIONED BY DAY

This conversion brings the segment count and data size back in line with the first approach.
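After the rewrite, the field can be read directly with the JSON functions, for example (again with a made-up key name):

SELECT JSON_VALUE("otherSourceCols", '$.someKey') AS "someKey"  -- 'someKey' is a placeholder
FROM "datasource_name"
LIMIT 10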

Is there any other workaround we can apply at the ingestion level? The REPLACE multi-stage query has to be scheduled separately.

Any help will be appreciated.

Thank You,
Laxmikant



Laxmikant Pandhare

Oct 22, 2025, 4:42:03 PM
to Druid User
Hi John/Team,

Any help or suggestions on the JSON_OBJECT option of the multi-stage query?

John Kowtko

Nov 8, 2025, 3:27:54 PM
to Druid User
Hi Laxmikant,

Streaming ingestion generally will not sort and/or colocate data very well because of the fragmented nature of the ingestion tasks.  That is why Compaction jobs were created -- to reorganize (and further compact) the data.   Batch jobs generally do this automatically, so the segments they generate usually do not need further compaction.

Seeing more and larger segments from streaming tasks does not surprise me. But there are some adjustments you can make in the Supervisor spec to try to minimize the up-front fragmentation. If you want to try that approach, please post your Supervisor spec and also, if you can, the task log from one of the ingestion tasks that has run based on this Supervisor spec. We can take a look at those and maybe identify some potential areas for optimization.

Thanks.  John
