segments


Marco Villalobos

Mar 2, 2022, 7:43:18 PM
to druid...@googlegroups.com
My ingestion spec includes the following granularitySpec and transformSpec.

"granularitySpec": {
  "segmentGranularity": "day",
  "queryGranularity": "fifteen_minute",
  "rollup": true
},
"transformSpec": {
  "transforms": [
    { "type": "expression", "name": "__time", "expression": "timestamp_ceil(__time, 'PT15M') + __time - timestamp_floor(__time,'PT15M')" }
  ]
}
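(For reference, a sketch of what this transform expression appears to compute, modeling Druid's `timestamp_floor`/`timestamp_ceil` over epoch milliseconds; the assumption that `timestamp_ceil` leaves on-boundary timestamps unchanged is mine, not confirmed by the thread:)

```python
# Model of the transform expression over epoch milliseconds.
# These helpers mimic Druid's timestamp_floor/timestamp_ceil for PT15M
# (assumption: timestamp_ceil leaves on-boundary timestamps unchanged).
PERIOD_MS = 15 * 60 * 1000  # PT15M

def timestamp_floor(t):
    return t - (t % PERIOD_MS)

def timestamp_ceil(t):
    return t if t % PERIOD_MS == 0 else timestamp_floor(t) + PERIOD_MS

def transform(t):
    # timestamp_ceil(__time,'PT15M') + __time - timestamp_floor(__time,'PT15M')
    return timestamp_ceil(t) + t - timestamp_floor(t)

# Under this model, any off-boundary row is shifted forward one full
# 15-minute period, while rows exactly on a boundary are left unchanged.
print(transform(7 * 60 * 1000) - 7 * 60 * 1000)  # -> 900000 (one PT15M period)
print(transform(0))                              # -> 0 (boundary unchanged)
```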

Although I have many days of data in my Kafka topic, I noticed that when I run this on a single nano server or a single small server, it only creates two segments.

I am expecting a segment for each day.

Am I missing something?

Mark Herrera

Mar 4, 2022, 6:47:53 PM
to Druid User
I'll try to reproduce this. For my own clarification: are the Kafka tasks running but not publishing segments?

Peter Marshall

Mar 7, 2022, 11:14:58 AM
to Druid User
Also, it may be worth checking the settings you have for `useEarliestOffset` and the `*messageRejection*` periods in your `ioConfig`:
https://druid.apache.org/docs/latest/development/extensions-core/kafka-ingestion.html#kafkasupervisorioconfig
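For example, these are the kinds of `ioConfig` settings to look at (field names are from the linked docs; the topic name and period values below are just illustrative):

```json
"ioConfig": {
  "topic": "my-topic",
  "useEarliestOffset": true,
  "lateMessageRejectionPeriod": "PT1H",
  "earlyMessageRejectionPeriod": "PT1H"
}
```

If either rejection period is set, messages whose timestamps fall too far outside the task's window are dropped rather than ingested.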

Marco Villalobos

Mar 7, 2022, 8:13:46 PM
to druid...@googlegroups.com
Hi Mark, yes, that was the behavior I observed: the segments were not published. I was running this on both the nano-quickstart and single-server small configurations, without deep storage.

Is there a limit to how much data it can ingest? Most of my settings are defaults; I only changed the ports and enabled the Kafka indexing service and Avro extensions.

--
You received this message because you are subscribed to the Google Groups "Druid User" group.
To unsubscribe from this group and stop receiving emails from it, send an email to druid-user+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/druid-user/c1c49ab4-fd25-41b0-9ed6-3fdb68a6e37cn%40googlegroups.com.

Marco Villalobos

Mar 7, 2022, 8:15:16 PM
to druid...@googlegroups.com
Hi Peter, 

Thank you for replying. I did not set any of the *messageRejection* configurations.


Ben Krug

Mar 8, 2022, 12:47:47 PM
to druid...@googlegroups.com
There shouldn't be a limit on how much data can be stored, up to your available disk space. If records aren't being read, you might check the task, MiddleManager, and Coordinator logs. If they are being read but you only see two segments, I'd also check the timestamp config in the spec. You said you have many days of data, but I'd verify that you're pulling the right field for the timestamp.
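For reference, the timestamp field is configured in the `timestampSpec` of the ingestion spec; a minimal sketch (the column name and format below are illustrative, not from the original spec):

```json
"timestampSpec": {
  "column": "timestamp",
  "format": "iso"
}
```

If `column` names the wrong field, every row falls back to the same parsed time, which can collapse many days of data into very few segments.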

Marco Villalobos

Mar 8, 2022, 1:45:51 PM
to druid...@googlegroups.com
Well, this only happens when `"useEarliestOffset": true`. If I use `"useEarliestOffset": false`, then I can see many different segments load.

This Kafka topic constantly receives time-series data.

Sergio Ferragut

Mar 14, 2022, 8:25:26 PM
to Druid User
Is the data in the topic published in time order?

It seems strange that useEarliestOffset = false would produce many segments. I would expect the opposite, because false is supposed to mean that you start reading the topic from the end, from the most recent messages, and only continue with new messages, while useEarliestOffset = true reads from the beginning of the topic, from the earliest message available, all the way up to now.

One other item that might come into play: if you ran a streaming ingestion spec that feeds a given datasource and then use another ingestion spec (or an updated one) that still targets the same datasource/topic, ingestion will continue from the last offset it completed before, disregarding the useEarliestOffset setting. So this might be messing with your expected results.

In order to reset this behavior you will need to do a Hard Reset on the supervisor task either through the UI or using the API:
  • /druid/indexer/v1/supervisor/<supervisorId>/reset
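A sketch of invoking that reset endpoint, assuming the Overlord listens on localhost:8081 and that the supervisor id matches the datasource name (both values are assumptions; substitute your own). Note that a hard reset clears the stored offsets, so the supervisor re-reads the topic according to useEarliestOffset:

```shell
# Build the hard-reset URL for the supervisor API (host/port and supervisor
# id below are assumptions -- adjust for your deployment).
OVERLORD="http://localhost:8081"
SUPERVISOR_ID="my-datasource"
RESET_URL="${OVERLORD}/druid/indexer/v1/supervisor/${SUPERVISOR_ID}/reset"

# The reset is a POST with no body; print the command to run:
echo curl -X POST "${RESET_URL}"
```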