In Druid 0.10.0 you can set "appendToExisting": true in your index tasks to avoid re-reading the entire day of data when you just insert a batch of late-arriving events. However, if you do this too often you can get fragmentation that affects your query performance, and you might want to reindex the whole day anyway to get rid of it. That reindexing can be done by reading from Druid and writing back to Druid, so you don't have to hit the original raw data.
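For reference, a minimal sketch of such an index task (the datasource name, paths, and interval are placeholders, and I'm using a "local" firehose just for illustration); "appendToExisting" goes in the task's ioConfig:

  {
    "type": "index",
    "spec": {
      "dataSchema": {
        "dataSource": "events",
        "parser": {
          "type": "string",
          "parseSpec": {
            "format": "json",
            "timestampSpec": { "column": "timestamp", "format": "auto" },
            "dimensionsSpec": { "dimensions": [] }
          }
        },
        "granularitySpec": {
          "type": "uniform",
          "segmentGranularity": "DAY",
          "queryGranularity": "NONE",
          "intervals": ["2017-06-04/2017-06-05"]
        }
      },
      "ioConfig": {
        "type": "index",
        "firehose": { "type": "local", "baseDir": "/data/late-events", "filter": "*.json" },
        "appendToExisting": true
      }
    }
  }

And the defragmenting reindex pass can read straight from Druid with the ingestSegment firehose, something like this in place of the ioConfig above (with "appendToExisting" left at its default of false, so the day's segments are rewritten):

  "ioConfig": {
    "type": "index",
    "firehose": {
      "type": "ingestSegment",
      "dataSource": "events",
      "interval": "2017-06-04/2017-06-05"
    }
  }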
Gian
On Mon, Jun 5, 2017 at 7:31 PM, Yong Cheng Toh <tohyon...@gmail.com> wrote:
Hi Druid devs/users!

We want to use Druid as a fast data store able to query millions to billions of data points. We have thousands of files in S3, amounting to hundreds of gigabytes, generated every day. However, there are cases where event data for previous days comes in late, and we want to find a way to handle these adjustments/late arrivals. So we have the following questions:
- What are the usual ways to handle adjustments? We don't want to re-ingest a whole day's worth of event data every time late data comes in; that would be wasteful of computing resources.
- An idea we are thinking of (sketched below):
  - Store the late data in another table in Druid called late_events, using the files that came in late.
  - However, we will then have two timestamps:
    - the actual event time (late)
    - the time the data was ingested into Druid
  - The segments will be saved using the second timestamp, i.e. the time the data was ingested into Druid.
  - The actual event time will exist as a field to allow us to query day-level adjustments.
  - However, the JSON in the files does not contain the timestamp at which the file was created.
  - The question is: is there a way to specify a fixed value for a dimension in the ingestion spec? E.g., specifying a value of "2017-06-05 01:00:00" for all data ingested. Or is there a way to use the S3 file properties (namely the last-created datetime)?
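To make the idea concrete, here is the record shape we have in mind; the field names event_time and ingest_time are made up for illustration, and since we don't know of a spec option that synthesizes a fixed value, the ingestion timestamp would have to be injected into each record (or taken from the S3 object metadata) in a preprocessing step before handing the files to Druid. A late record, after preprocessing:

  {
    "ingest_time": "2017-06-05T01:00:00Z",
    "event_time": "2017-06-02T13:45:10Z",
    "user_id": "u-123",
    "value": 42
  }

And the matching parseSpec fragment, with ingest_time driving segment placement and event_time kept as a queryable dimension:

  "parseSpec": {
    "format": "json",
    "timestampSpec": { "column": "ingest_time", "format": "auto" },
    "dimensionsSpec": { "dimensions": ["event_time", "user_id"] }
  }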
Thanks! :D

Cheers,
YC