I'm trying to ingest data from 82 CSV files into Druid using the local firehose and the indexing service. Each CSV file has 100,000 rows in it, and each row is unique across all files. Here is the JSON template for the indexing tasks:

{
    "type": "index",
    "dataSource": "simple_outclick",
    "granularitySpec": {
        "type": "uniform",
        "gran": "DAY",
        "intervals": ["$STARTDATE/$ENDDATE"]
Now if I do a timeBoundary query on my data:

$ curl --silent --show-error -d @timeboundary_simple.json -H 'content-type: application/json' 'http://my-druid-broker-ip:8080/druid/v2/' --data-urlencode 'pretty' | python -mjson.tool
[
    {
        "result": {
            "maxTime": "2014-09-28T00:03:06.000Z",
            "minTime": "2014-09-26T23:40:01.000Z"
        },
        "timestamp": "2014-09-26T23:40:01.000Z"
    }
]
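For readers following along: the body of timeboundary_simple.json is not shown in the thread, but a minimal timeBoundary query against this datasource would look something like the following (the exact contents of the file are an assumption):

```json
{
    "queryType": "timeBoundary",
    "dataSource": "simple_outclick"
}
```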
--
You received this message because you are subscribed to the Google Groups "Druid Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email to druid-developm...@googlegroups.com.
To post to this group, send email to druid-de...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/druid-development/1df3ad5c-4151-4c06-8ad4-fa59e562acee%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
Hi, see inline.

On Tue, Sep 30, 2014 at 1:34 PM, Amy Troschinetz <atrosc...@rmn.com> wrote:

> "intervals": ["$STARTDATE/$ENDDATE"]

What are the start/end dates supposed to be?
2014-09-30 20:04:15,834 INFO [task-runner-0] io.druid.indexing.common.index.YeOldePlumberSchool - Spilling index[0] with rows[100000] to: /tmp/persistent/task/index_simple_outclick_2014-09-30T19:53:52.433Z/work/simple_outclick_2014-09-27T00:00:00.000Z_2014-09-28T00:00:00.000Z_2014-09-30T20:03:49.816Z_0/simple_outclick_2014-09-27T00:00:00.000Z_2014-09-28T00:00:00.000Z_2014-09-30T20:03:49.816Z/spill0

Note the number of rows: 100,000.
Note the segment: simple_outclick_2014-09-27T00:00:00.000Z_2014-09-28T00:00:00.000Z_2014-09-30T20:03:49.816Z_0
That is for a certain interval of data.
On further inspection, not all the files have 100,000 rows, but most of them do. I took a look at the druid_segments table in MySQL and it looks like some of the data has been set to used = 0. Maybe that's the issue?
mysql> select id, used from druid_segments where dataSource = "simple_outclick";
+---------------------------------------------------------------------------------------------+------+
| id                                                                                          | used |
+---------------------------------------------------------------------------------------------+------+
| simple_outclick_2014-09-26T00:00:00.000Z_2014-09-27T00:00:00.000Z_2014-09-30T18:22:38.684Z  |    0 |
[...]
| simple_outclick_2014-09-26T00:00:00.000Z_2014-09-27T00:00:00.000Z_2014-09-30T20:02:36.830Z  |    1 |
| simple_outclick_2014-09-27T00:00:00.000Z_2014-09-28T00:00:00.000Z_2014-09-30T19:53:42.350Z  |    0 |
[...]
| simple_outclick_2014-09-27T00:00:00.000Z_2014-09-28T00:00:00.000Z_2014-09-30T20:04:54.548Z  |    1 |
[...]
| simple_outclick_2014-09-28T00:00:00.000Z_2014-09-29T00:00:00.000Z_2014-09-30T19:57:58.369Z  |    0 |
| simple_outclick_2014-09-28T00:00:00.000Z_2014-09-29T00:00:00.000Z_2014-09-30T19:58:32.725Z  |    1 |
+---------------------------------------------------------------------------------------------+------+

Can I resolve this issue by changing the values for these rows to used = 1?
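On the used-flag question: mechanically the flag could be flipped with an UPDATE along these lines, but the coordinator sets used = 0 deliberately for segments whose interval is overshadowed by a newer version, so re-enabling them does not fix the underlying problem. A sketch only, for a test cluster, not production:

```sql
-- Sketch only: re-enable all unused segments for one datasource.
-- The coordinator will mark overshadowed segments unused again,
-- so this does not solve the underlying versioning issue.
UPDATE druid_segments
SET used = 1
WHERE dataSource = 'simple_outclick'
  AND used = 0;
```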
Based on this comment: https://groups.google.com/d/msg/druid-development/klPt_qiICMw/s5e0R7nTR-IJ

It seems that this behavior is expected. Am I not loading data correctly here? I have lots of CSV files (far more than just 82 of them) that have overlapping data in terms of time boundaries. I want to aggregate all the events in all the files.
See inline.
On Tuesday, September 30, 2014 3:49:51 PM UTC-5, Fangjin Yang wrote:
> Hi, see inline.
>
> On Tue, Sep 30, 2014 at 1:34 PM, Amy Troschinetz <atrosc...@rmn.com> wrote:
>> "intervals": ["$STARTDATE/$ENDDATE"]
>
> What are the start/end dates supposed to be?

I was just using the minimum and maximum days (year, month, and day) in each file.
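As a concrete illustration of that approach, per-file $STARTDATE/$ENDDATE values could be derived with standard shell tools. The column layout and file name below are assumptions; also note Druid intervals are end-exclusive, so you may want to bump the end date by one day.

```shell
# Assumed layout: the event timestamp is the first CSV column.
# Create a tiny sample file so the sketch is self-contained.
printf '2014-09-26T23:40:01Z,click\n2014-09-28T00:03:06Z,click\n' > sample.csv

# Min/max calendar day in the file become the interval endpoints.
STARTDATE=$(cut -d, -f1 sample.csv | sort | head -n1 | cut -dT -f1)
ENDDATE=$(cut -d, -f1 sample.csv | sort | tail -n1 | cut -dT -f1)
echo "$STARTDATE/$ENDDATE"   # 2014-09-26/2014-09-28
```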
How do I deal with the issue that sometimes I will have data for the same time boundary split across multiple files? I need to merge all the data together, not overwrite it.

> 2014-09-30 20:04:15,834 INFO [task-runner-0] io.druid.indexing.common.index.YeOldePlumberSchool - Spilling index[0] with rows[100000] to: /tmp/persistent/task/index_simple_outclick_2014-09-30T19:53:52.433Z/work/simple_outclick_2014-09-27T00:00:00.000Z_2014-09-28T00:00:00.000Z_2014-09-30T20:03:49.816Z_0/simple_outclick_2014-09-27T00:00:00.000Z_2014-09-28T00:00:00.000Z_2014-09-30T20:03:49.816Z/spill0
>
> Note the number of rows: 100,000.
> Note the segment: simple_outclick_2014-09-27T00:00:00.000Z_2014-09-28T00:00:00.000Z_2014-09-30T20:03:49.816Z_0
> That is for a certain interval of data.

I didn't realize each segment was defined by the intervals specified in the ingestion task; I figured it would be defined by the data itself, and that the intervals given in the index task were more or less arbitrary hints.
So how do I go about ingesting this data? Do I have to post-process it into ordered rows of monotonically increasing clickDate and then specify the indexing task intervals down to the second? I'd rather just merge all the data together automatically somehow, if that's possible.
Druid does atomic swaps of segments. This means that if you have 2 segments that cover the exact same interval, Druid queries for data from the segment with the most recent version identifier.

Segments are uniquely identified by datasource_interval_version_partitionNumber (optional). In your case, you have multiple segments for the same time range, which is fine. Druid automatically invalidates segments with data that has been obsoleted by newer segments.
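The version-wins rule can be sketched in a few lines. The parsing below assumes segment IDs without a partition-number suffix, as in the MySQL listing earlier in the thread:

```python
# Sketch: how Druid's versioning resolves overlapping segments.
# IDs follow dataSource_intervalStart_intervalEnd_version; the version
# is an ISO-8601 timestamp, so lexical order matches chronological order.

segment_ids = [
    "simple_outclick_2014-09-27T00:00:00.000Z_2014-09-28T00:00:00.000Z_2014-09-30T19:53:42.350Z",
    "simple_outclick_2014-09-27T00:00:00.000Z_2014-09-28T00:00:00.000Z_2014-09-30T20:04:54.548Z",
]

def parse(segment_id):
    # Split off the trailing three fields: interval start, interval end,
    # and version. The dataSource (which may itself contain underscores)
    # is everything before them.
    datasource, start, end, version = segment_id.rsplit("_", 3)
    return datasource, start, end, version

def winner(ids):
    # For the same dataSource and interval, the highest version wins;
    # the others are marked unused (used = 0) by the coordinator.
    return max(ids, key=lambda s: parse(s)[3])

print(winner(segment_ids))
```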
Are you creating unique indexing tasks per file, or a single indexing task that ingests all files? The former may lead to the behaviour you are seeing, and the latter, I believe, is what you actually want to do.
Looking at these two params:

"baseDir": "$(pwd)",
"filter": "$DATAFILE"

Does baseDir include every single file you want to ingest?
To view this discussion on the web visit https://groups.google.com/d/msgid/druid-development/07112628-F0A8-45CE-BD33-5C8BB22005DF%40rmn.com.
What you want to do is create a single indexing job for all your files and not indexing jobs per file.
com.metamx.common.ISE: Found no files to ingest! Check your schema.
The reason is that multiple indexing jobs will generate multiple segments for the same range of time and cause Druid to only use the most recently generated segments.
Apologies for the trouble you are having. The firehose should be documented much better.

I've created https://github.com/metamx/druid/pull/774/files to improve ingestion using the local firehose. You can now pass in wildcard patterns for your files. There are also docs in the PR.
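With that change, the firehose portion of a single index task covering all 82 CSVs could look roughly like this (the directory path is a placeholder; the field names match the params quoted earlier in the thread):

```json
"firehose": {
    "type": "local",
    "baseDir": "/data/simple_outclick",
    "filter": "*.csv"
}
```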