Druid ingestion and handling adjustments


Yong Cheng Toh

Jun 5, 2017, 10:31:22 PM
to Druid User
Hi Druid devs/users! 

We want to use Druid as a fast data store for querying millions to billions of data points. We have thousands of files in S3, amounting to hundreds of gigabytes, generated every day. However, there are cases where event data for previous days arrives late, and we want to find a way to handle these adjustments/late arrivals. So we have the following questions:

  1. What are the usual ways to handle adjustments? We don't want to reingest a whole day's worth of event data whenever late data comes in; that would waste computing resources.
  2. An idea we are thinking of is: 
    • Store the late data in a separate Druid datasource called late_events, built from the files that arrived late.
    • Each row would then have two timestamps:
      1. the actual event time (late)
      2. the time it was ingested into Druid
    • The segments would be partitioned on the ingestion timestamp (no. 2).
    • The actual event time would exist as a regular dimension, so that we can query day-level adjustments.
    • However, the JSON in the files does not contain the timestamp at which the file was created.
    • The question is: is there a way to specify a fixed value for a dimension in the ingestion spec? For example, specifying a value of "2017-06-05 01:00:00" for all data ingested in one batch. Or is there a way to use the S3 object properties (namely the creation datetime)? A sketch of the fixed-value idea follows this list.
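
For reference, newer Druid releases (after the 0.10 line discussed in this thread) added a transformSpec with expression transforms, which can attach a constant-valued dimension at ingestion time. A minimal sketch of that approach, assuming a transform-capable Druid version; the dimension name batch_ingest_time and the constant value are illustrative only:

  {
    "dataSchema": {
      "dataSource": "late_events",
      "transformSpec": {
        "transforms": [
          {
            "type": "expression",
            "name": "batch_ingest_time",
            "expression": "'2017-06-05T01:00:00Z'"
          }
        ]
      }
    }
  }

The transform simply evaluates a string literal, so every row in the batch gets the same value; the new name typically also needs to be listed in the dimensionsSpec so it is stored as a queryable dimension.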

Thanks! :D

Cheers,
YC


Gian Merlino

Jun 6, 2017, 2:40:27 AM
to druid...@googlegroups.com
In Druid 0.10.0 you can set "appendToExisting" : true in your index tasks to avoid re-reading the entire day of data when you only need to insert a batch of late-arriving events. However, if you do this too often you can get segment fragmentation that affects your query performance, and you might want to reindex the whole day anyway to get rid of that fragmentation. That reindexing can be done by reading from Druid and writing back to Druid, so you don't have to hit the original raw data.
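
For reference, a minimal sketch of where that flag sits in a native index task, assuming the static-s3 firehose from druid-s3-extensions; the datasource name and S3 URI are placeholders:

  {
    "type": "index",
    "spec": {
      "dataSchema": { "dataSource": "events", ... },
      "ioConfig": {
        "type": "index",
        "firehose": {
          "type": "static-s3",
          "uris": ["s3://my-bucket/late-arrivals/2017-06-05.json"]
        },
        "appendToExisting": true
      },
      "tuningConfig": { "type": "index" }
    }
  }

The reindexing Gian describes (reading from Druid and writing back) can be done with a similar index task that uses the ingestSegment firehose instead of reading the raw S3 files.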

Gian


Yong Cheng Toh

Jun 6, 2017, 3:08:02 AM
to Druid User
Hi Gian,

Firstly, thank you for the response. I forgot to mention the reason why we don't want to reindex the whole day's data: it is going to be used for billing, so it is important that we keep as much information as possible to see day-level adjustments (a sketch of such a query is below).

Not sure if you saw the second question: is there a way to have Druid ingest the data while specifying a fixed value for a dimension that doesn't exist in the data?
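
For context on those day-level adjustments: once the actual event time is stored as a dimension, an adjustment roll-up could be queried with something like the following groupBy query; the dimension name event_day and the metric count are hypothetical:

  {
    "queryType": "groupBy",
    "dataSource": "late_events",
    "intervals": ["2017-06-01/2017-06-08"],
    "granularity": "all",
    "dimensions": ["event_day"],
    "aggregations": [
      { "type": "longSum", "name": "adjusted_count", "fieldName": "count" }
    ]
  }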

Thanks again!

YC

Vendrad

Jun 12, 2017, 10:55:45 AM
to Druid User
I would also like to understand if there is a way to have static fields during batch ingestion.