Druid Datasource size difference


Laxmikant Pandhare

Oct 21, 2025, 6:17:25 PM
to Druid User
Hi Team,

In my previous data load:

Step 1: I loaded the data into a temporary table.
Step 2: Using a Multi-Stage Query, I combined some of the columns from the temporary table into one common JSON_OBJECT field.

This process produced fewer segments and a smaller data size, but it took longer because it loads two tables. For one day, it creates around 150 segments and around 150 GB of data.
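A minimal sketch of that second step, assuming a temporary table named "temp_table" and hypothetical field names, could look like this:

```sql
-- MSQ job that combines several columns from the temporary table
-- into one common JSON field (all table/column names are illustrative)
INSERT INTO "datasource_name"
SELECT
  "__time",
  "abc",
  "pqr",
  JSON_OBJECT(
    KEY 'hostname' VALUE "hostname",
    KEY 'bytes'    VALUE "bytes"
  ) AS "otherSourceCols"
FROM "temp_table"
PARTITIONED BY DAY
```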

In the current scenario:

Step 1: We process the data in Spark and create the JSON column there.
Step 2: We then load it into Druid via Kafka.

However, this process loads that JSON column as text, creating more segments and a larger data size as well. For one day, it creates around 580 segments and around 600 GB of data, almost four times the previous process.

After moving to the current scenario, I found one workaround, which uses MSQ to replace that JSON column and convert it to a parsed JSON value, like below:


REPLACE INTO "datasource_name"
OVERWRITE WHERE "__time" >= TIMESTAMP '2025-10-01 00:00:00' AND "__time" < TIMESTAMP '2025-10-02 00:00:00'
SELECT "__time", "abc", "pqr", "test", TRY_PARSE_JSON("otherSourceCols") AS "otherSourceCols", "mnp"
FROM "datasource_name"
WHERE "__time" >= TIMESTAMP '2025-10-01 00:00:00' AND "__time" < TIMESTAMP '2025-10-02 00:00:00'
PARTITIONED BY DAY

The above conversion brings segment count and data size down to levels similar to the first approach.

Is there any other workaround we can apply at the ingestion level? As a REPLACE Multi-Stage Query, I have to schedule it separately.

Any help will be appreciated.

Thank You,
Laxmikant



Laxmikant Pandhare

Oct 22, 2025, 4:42:03 PM
to Druid User
Hi John/Team,

Any help or suggestions on the JSON_OBJECT option of the Multi-Stage Query?

John Kowtko

Nov 8, 2025, 3:27:54 PM
to Druid User
Hi Laxmikant,

Streaming ingestion generally will not sort and/or colocate data very well because of the fragmented nature of the ingestion tasks. That is why compaction jobs were created: to reorganize (and further compact) the data. Batch jobs generally do this automatically, so the segments they generate usually do not need further compaction.
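For reference, a minimal auto-compaction spec for this kind of reorganization could look roughly like the following (the datasource name, partition dimension, and row target are illustrative):

```json
{
  "dataSource": "datasource_name",
  "skipOffsetFromLatest": "P1D",
  "granularitySpec": {
    "segmentGranularity": "DAY"
  },
  "tuningConfig": {
    "partitionsSpec": {
      "type": "range",
      "partitionDimensions": ["abc"],
      "targetRowsPerSegment": 5000000
    }
  }
}
```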

Seeing more and larger segments from streaming tasks does not surprise me. But there are some adjustments you can make in the Supervisor spec to try to minimize the up-front fragmentation. If you want to try that approach, please post your Supervisor spec, and also, if you can, the task log from one of the ingestion jobs that has run based on this Supervisor spec. We can take a look at that and maybe identify some potential areas for optimization.

Thanks.  John

Laxmikant Pandhare

Mar 31, 2026, 4:56:59 PM
to Druid User
Hi John,

Below is the spec I'm using for loading data from the streaming job.


{
  "type": "kafka",
  "spec": {
    "dataSchema": {
      "dataSource": "table_name",
      "timestampSpec": {
        "column": "time",
        "format": "d/M/yyyy:H:mm:ss",
        "missingValue": null
      },
      "dimensionsSpec": {
        "dimensions": [
          {
            "type": "string",
            "name": "abc",
            "multiValueHandling": "SORTED_ARRAY",
            "createBitmapIndex": true
          },
          {
            "type": "string",
            "name": "acx",
            "multiValueHandling": "SORTED_ARRAY",
            "createBitmapIndex": true
          },
          {
            "type": "string",
            "name": "sww",
            "multiValueHandling": "SORTED_ARRAY",
            "createBitmapIndex": true
          },
          {
            "type": "string",
            "name": "sss",
            "multiValueHandling": "SORTED_ARRAY",
            "createBitmapIndex": true
          },
          {
            "type": "string",
            "name": "otherSourceCols",
            "multiValueHandling": "SORTED_ARRAY",
            "createBitmapIndex": true
          },
          {
            "type": "string",
            "name": "qqq",
            "multiValueHandling": "SORTED_ARRAY",
            "createBitmapIndex": true
          },
          {
            "type": "string",
            "name": "kkk",
            "multiValueHandling": "SORTED_ARRAY",
            "createBitmapIndex": true
          },
          {
            "type": "string",
            "name": "lll",
            "multiValueHandling": "SORTED_ARRAY",
            "createBitmapIndex": true
          },
          {
            "type": "string",
            "name": "ppp",
            "multiValueHandling": "SORTED_ARRAY",
            "createBitmapIndex": true
          },
          {
            "type": "string",
            "name": "uuu",
            "multiValueHandling": "SORTED_ARRAY",
            "createBitmapIndex": true
          },
          {
            "type": "string",
            "name": "rrr",
            "multiValueHandling": "SORTED_ARRAY",
            "createBitmapIndex": true
          },
          {
            "type": "string",
            "name": "ooo",
            "multiValueHandling": "SORTED_ARRAY",
            "createBitmapIndex": true
          },
          {
            "type": "string",
            "name": "ppp",
            "multiValueHandling": "SORTED_ARRAY",
            "createBitmapIndex": true
          },
          {
            "type": "string",
            "name": "www",
            "multiValueHandling": "SORTED_ARRAY",
            "createBitmapIndex": true
          }
        ],
        "dimensionExclusions": [
          "__time",
          "time"
        ],
        "includeAllDimensions": false,
        "useSchemaDiscovery": false
      },
      "metricsSpec": [],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "queryGranularity": {
          "type": "none"
        },
        "rollup": false,
        "intervals": []
      },
      "transformSpec": {
        "filter": null,
        "transforms": []
      }
    },
    "ioConfig": {
      "topic": "abc_topic",
      "inputFormat": {
        "type": "kafka",
        "headerFormat": null,
        "keyFormat": null,
        "valueFormat": {
          "type": "json",
          "keepNullColumns": false,
          "assumeNewlineDelimited": false,
          "useJsonNodeReader": false
        },
        "headerColumnPrefix": "kafka.header.",
        "keyColumnName": "kafka.key",
        "timestampColumnName": "kafka.timestamp"
      },
      "replicas": 2,
      "taskCount": 5,
      "taskDuration": "PT3600S",
      "consumerProperties": {
        "security.protocol": "SASL_PLAINTEXT",
        "bootstrap.servers": "broker1:port,broker2:port,broker3:port,broker4:port,broker5:port"
      },
      "autoScalerConfig": null,
      "pollTimeout": 100,
      "startDelay": "PT5S",
      "period": "PT30S",
      "useEarliestOffset": true,
      "completionTimeout": "PT1800S",
      "lateMessageRejectionPeriod": null,
      "earlyMessageRejectionPeriod": null,
      "lateMessageRejectionStartDateTime": null,
      "configOverrides": null,
      "idleConfig": null,
      "stream": "abc_topic",
      "useEarliestSequenceNumber": true,
      "type": "kafka"
    },
    "tuningConfig": {
      "type": "kafka",
      "appendableIndexSpec": {
        "type": "onheap",
        "preserveExistingMetrics": false
      },
      "maxRowsInMemory": 1000000,
      "maxBytesInMemory": 0,
      "skipBytesInMemoryOverheadCheck": false,
      "maxRowsPerSegment": 1000000,
      "maxTotalRows": null,
      "intermediatePersistPeriod": "PT10M",
      "maxPendingPersists": 0,
      "indexSpec": {
        "bitmap": {
          "type": "roaring"
        },
        "dimensionCompression": "lz4",
        "stringDictionaryEncoding": {
          "type": "utf8"
        },
        "metricCompression": "lz4",
        "longEncoding": "longs"
      },
      "indexSpecForIntermediatePersists": {
        "bitmap": {
          "type": "roaring"
        },
        "dimensionCompression": "lz4",
        "stringDictionaryEncoding": {
          "type": "utf8"
        },
        "metricCompression": "lz4",
        "longEncoding": "longs"
      },
      "reportParseExceptions": false,
      "handoffConditionTimeout": 0,
      "resetOffsetAutomatically": false,
      "segmentWriteOutMediumFactory": null,
      "workerThreads": null,
      "chatThreads": null,
      "chatRetries": 8,
      "httpTimeout": "PT10S",
      "shutdownTimeout": "PT80S",
      "offsetFetchPeriod": "PT30S",
      "intermediateHandoffPeriod": "P2147483647D",
      "logParseExceptions": false,
      "maxParseExceptions": 2147483647,
      "maxSavedParseExceptions": 0,
      "skipSequenceNumberAvailabilityCheck": false,
      "repartitionTransitionDuration": "PT120S"
    }
  },
  "context": null
}


Is there any modification we can make to fix this storage issue?

John Kowtko

Mar 31, 2026, 8:24:43 PM
to druid...@googlegroups.com
Hi Laxmikant,

Here it looks like the main difference between the two is ingesting otherSourceCols as JSON instead of STRING. You should be able to use the JSON column type in the Supervisor spec ... did that not work for you?
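For example, the otherSourceCols entry in the Supervisor's dimensionsSpec could be declared as a nested JSON column instead of a string (a sketch based on the spec above):

```json
{
  "type": "json",
  "name": "otherSourceCols"
}
```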

Other than that, you can increase the number of rows per segment and allow segments larger than 1 GB, unless the segment build time becomes too long.

Let me know if you try either of the above two ideas.

Thanks.  John


Laxmikant Pandhare

Apr 3, 2026, 4:32:28 PM
to Druid User
Hi John,

Thank you for your reply. I tried the JSON option for otherSourceCols, but there is still not much difference in size.

What I am thinking is that otherSourceCols contains 30-40 fields combined.

Below, I've included just a few of those fields:

"{\"xyztime\":\"2026-04-03 00:00:02\",\"abclostnet\":\"-\",\"abclostnetm\":\"-\",\"bytes\":\"4144\",\"hostname\":\"abc.xyz.com\",\"httphost\":\"abc.xyz.com:443\"}"

What I suspect is that the data itself is small, but the field names (xyztime, abclostnet, abclostnetm, bytes, hostname, httphost) are the problem. We have around 500 million rows per day, and these field names are repeated that many times, which is creating the storage issue.

What do you think about the above?

John Kowtko

Apr 4, 2026, 8:15:37 AM
to druid...@googlegroups.com
The fields within the JSON are stored as columnar data, with values held in dictionaries within each segment. If you have high cardinality but high overlap in values for a given field across segments, then creating fewer, larger segments should help consolidate the dictionaries and reduce overlap ... which will reduce storage size. This happens even within the Supervisor, when it builds a segment.

I notice your maxRowsPerSegment is 1m on the Supervisor. I suggest trying to increase that to see if it reduces the overall size of the datasource.
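That would be a change to the Supervisor's tuningConfig, for example (the value here is illustrative and should be tuned to your data):

```json
"tuningConfig": {
  "type": "kafka",
  "maxRowsPerSegment": 2000000
}
```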

Thanks.  John

Laxmikant Pandhare

Apr 6, 2026, 10:25:15 AM
to Druid User
I can increase maxRowsPerSegment beyond 1 million, but the problem is that the segment size for 1 million records is itself already 1 GB. That's why I kept it at 1 million, so that the segment size does not grow any further.

John Kowtko

Apr 6, 2026, 10:32:50 AM
to druid...@googlegroups.com
1 GB is not "too big" ... I am now working with a cluster that is generating segments as large as 20 GB. So I would suggest trying 2m rows per segment to see if that shrinks the overall size at all.

Laxmikant Pandhare

Apr 6, 2026, 10:34:29 AM
to druid...@googlegroups.com
Let me try it and see if it shrinks the size.

Laxmikant Pandhare

Apr 9, 2026, 7:33:52 PM
to Druid User
I made the change to 2 million, but it is not helping.

Laxmikant Pandhare

Apr 16, 2026, 5:13:40 PM
to Druid User
Increasing maxRowsPerSegment to 2 million didn't help.

We are using Druid 27.0.0. Is there any enhancement in a newer version that I can upgrade to that would help with data size for JSON?

Muhammad Zaid Qadri

Apr 18, 2026, 8:37:56 PM
to druid...@googlegroups.com
Hello,

Kindly unsubscribe my email address from this group. 

Regards