--
You received this message because you are subscribed to the Google Groups "Druid User" group.
To unsubscribe from this group and stop receiving emails from it, send an email to druid-user+...@googlegroups.com.
To post to this group, send email to druid...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/druid-user/d2dc3e16-3bcd-4504-8cb0-68eaf9530d6a%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
Hey Karol,

Parallel tasks are the way to get better use of resources. Each task has only a single thread devoted to indexing. If you're using Tranquility, you can set the parallelism through "task.partitions" (see https://github.com/druid-io/tranquility/blob/master/docs/configuration.md).

10GB of RAM per task sounds like a lot; you should be able to get by with a GB or two of heap and a GB or two offheap. That 8-core, 32GB machine should be able to run 4-8 tasks.

That should also help with the "wrote too many bytes" errors; those happen when output columns are too large (no single column can be more than 2GB).
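For reference, a rough sketch of where "task.partitions" lives in a Tranquility server configuration, per the configuration docs linked above. The dataSource name and counts are made up, and the dataSchema is abridged to a placeholder:

```json
{
  "dataSources": [
    {
      "spec": {
        "dataSchema": { "dataSource": "example" }
      },
      "properties": {
        "task.partitions": "4",
        "task.replicants": "1"
      }
    }
  ]
}
```

With "task.partitions": "4", Tranquility spreads each segment granularity interval across 4 tasks, so the 4-8 tasks mentioned above can index in parallel on one machine.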
On 17 May 2016 at 02:17, Gian Merlino <gi...@imply.io> wrote: [...]

Thanks Gian for the answer. Sorry for the long delay; we've been investigating based on your response.

It seems that Java is just hungry for RAM [surprise ;-)!]; if we set a lower limit, it is still able to keep up without using that much memory. We were able to run 4 tasks on the above machine and cut the ingestion time to about 6-7 hours per day, which is a significant improvement. Still looking for more optimizations, if anyone has any ideas :-).

Partitioning (aka sharding) indeed helps with the "wrote too many bytes" errors, but it makes HDD space usage a lot less efficient. We think that partitioning on IDs (that's the only "dimension" we have in this data) would yield better results than partitioning on timestamps (which, like I wrote, are pretty sparse).

We are not using Tranquility yet; we're mostly waiting for "windowPeriod" removal [also not enough time to transition current setups...] :-). For now, we've created our own "poor man's Tranquility", which uses "index_realtime" tasks directly.

I have a couple of implementation questions, if you don't mind. Hope you can help :-).

1. Is there a way to put CSV formatted data to the "index_realtime" endpoint, instead of JSON? We've tried setting the same parserSpec that works with the "index" task, but it still says it expects JSON.
2. How do we find out which port is assigned to which task?
We know this is stored in ZooKeeper, but the IDs in ZooKeeper are different from the task IDs in the Overlord. Looking at the Tranquility code, it seems to be using the "runningTasks" endpoint on the Overlord and a "location" property of the task. But when we query this, there's no "location" property. Is there anything else to set up to get these, or am I misunderstanding the code? Right now we've worked around this by setting a separate "serviceName" for each task, but that bloats ZooKeeper a lot.
3. As far as we understand, there's currently no way to finish a task other than by using the "timed" firehose, right? Any plans/ETAs on changing that?
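Back on the partitioning point above: Druid can shard segments on a dimension rather than relying only on time, via the partitionsSpec. A rough sketch of single-dimension partitioning, assuming the Hadoop batch indexing path; the "id" dimension name and target size are illustrative, not from our actual spec:

```json
"tuningConfig": {
  "type": "hadoop",
  "partitionsSpec": {
    "type": "dimension",
    "partitionDimension": "id",
    "targetPartitionSize": 5000000
  }
}
```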
See inline.

On Mon, 23 May 2016 at 19:05 Karol Woźniak <wozn...@gmail.com> wrote:

1. Is there a way to put CSV formatted data to the "index_realtime" endpoint, instead of JSON?

EventReceiverFirehose only supports JSON formatted data at present.

2. How do we find out which port is assigned to which task?

Current master has a way to discover the task port using the Overlord HTTP API, to make task discovery simpler (https://github.com/druid-io/druid/pull/2419).

3. As far as we understand, there's currently no way to finish a task other than by using the "timed" firehose, right? Any plans/ETAs on changing that?

We recently added a way to manually specify a shutdown time on the event receiver firehose (https://github.com/druid-io/druid/pull/2803).

We hope to release 0.9.1-rc1 in a week or two; until then, you can build a release from current master and try it out.
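To illustrate how task discovery could look once "location" is populated: a small sketch that extracts host:port per task from a runningTasks-style response. The response body here is made up (the exact field names and shape in your Druid version may differ); the parsing logic is the point:

```python
import json

# Hypothetical response body from GET /druid/indexer/v1/runningTasks on the
# Overlord, assuming each running task entry carries a "location" object.
# Task ids, hosts, and ports below are illustrative, not real output.
sample_response = json.dumps([
    {"id": "index_realtime_events_2016-05-23T00:00:00.000Z_0_0",
     "location": {"host": "10.0.0.5", "port": 8100}},
    {"id": "index_realtime_events_2016-05-23T00:00:00.000Z_1_0",
     "location": {"host": "10.0.0.5", "port": 8101}},
])

def task_locations(response_body):
    """Map each running task id to "host:port", skipping tasks whose
    location has not been published yet (missing or port == -1)."""
    tasks = json.loads(response_body)
    return {
        t["id"]: "{}:{}".format(t["location"]["host"], t["location"]["port"])
        for t in tasks
        if t.get("location") and t["location"].get("port", -1) != -1
    }

for task_id, addr in sorted(task_locations(sample_response).items()):
    print(task_id, "->", addr)
```

This avoids the per-task "serviceName" workaround entirely: one poll of the Overlord gives every task's address.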