Hi Team,
In my previous data load:
Step 1: I loaded the data into a temporary table.
Step 2: Using a multi-stage query (MSQ), I combined some of the columns from the temporary table into one common field with JSON_OBJECT (roughly as sketched below).
This process produced fewer segments and a smaller data size, but it took longer because the data was loaded into two tables. For one day, it created around 150 segments and around 150 GB of data.
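To illustrate, the Step 2 query looked roughly like the sketch below; the temporary table name and the columns inside JSON_OBJECT ("temp_table_name", "colA", "colB") are placeholders, not the real names:

INSERT INTO "datasource_name"
SELECT
  "__time",
  "abc",
  "pqr",
  "test",
  -- "colA" and "colB" stand in for the temporary-table columns packed into one common field
  JSON_OBJECT(KEY 'colA' VALUE "colA", KEY 'colB' VALUE "colB") AS "otherSourceCols",
  "mnp"
FROM "temp_table_name"
PARTITIONED BY DAY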
In the current scenario:
Step 1: We process the data in Spark and create this JSON column there (a rough sketch follows below).
Step 2: We then load it into Druid via Kafka.
However, this process loads that JSON column as text and creates more segments and a larger data size as well. For one day, it creates around 580 segments and around 600 GB of data, almost four times that of the previous process.
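In Spark SQL terms, Step 1 builds the JSON column roughly like this sketch; source_data, colA, and colB are placeholder names, and the timestamp column is left out for brevity:

SELECT
  "abc",
  "pqr",
  "test",
  "mnp",
  -- colA and colB stand in for the source columns that get packed into the JSON string
  to_json(named_struct('colA', colA, 'colB', colB)) AS otherSourceCols
FROM source_data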
After making these changes to the current scenario, I found one way that uses an MSQ REPLACE to re-parse that JSON column into a nested JSON object, like below:
REPLACE INTO "datasource_name"
OVERWRITE WHERE "__time" >= TIMESTAMP '2025-10-01 00:00:00' AND "__time" < TIMESTAMP '2025-10-02 00:00:00'
SELECT "__time", "abc", "pqr", "test", TRY_PARSE_JSON('otherSourceCols') AS "otherSourceCols", "mnp"
FROM "datasource_name"
WHERE "__time" >= TIMESTAMP '2025-10-01 00:00:00' AND "__time" < TIMESTAMP '2025-10-02 00:00:00'
PARTITIONED BY DAY
The above conversion brings the segment count and data size back in line with the first approach.
Is there any other workaround that we can apply at ingestion level, since I have to schedule the REPLACE multi-stage query separately?
Any help will be appreciated.
Thank You,
Laxmikant