Segments delete


Taras Puhol

Apr 16, 2014, 10:33:13 AM
to druid-de...@googlegroups.com
Hi.

I have a question:

1) How can I delete segments from a historical node that are older than X days?

2) On version 0.6.66, if I add to
config/historical/runtime.properties

druid.storage.type=local
druid.storage.storageDirectory=/var/lib/druid/localStorage

it does not use this directory and still writes to /tmp/druid/localStorage.

But another setting,
druid.segmentCache.locations=[{"path": "/var/lib/druid/indexCache", "maxSize"\: 10000000000}]
works normally.

Is there anything I missed?

Thanks,
Taras

Nishant Bangarwa

Apr 16, 2014, 11:14:51 AM
to druid-de...@googlegroups.com
Hi Taras, 
See Inline


On Wed, Apr 16, 2014 at 8:03 PM, Taras Puhol <eines...@gmail.com> wrote:
Hi.

I have a question:

1) How I can delete segments from historic node that are larger than X days?
You can add a PeriodLoadRule for X days to achieve this.
More details on rules are here: http://druid.io/docs/latest/Rule-Configuration.html
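For example, to keep only the last X = 30 days, a rule set along these lines could work (a sketch only: the field names follow the Rule-Configuration page above, and the replicant count and tier name are illustrative defaults to verify against your version):

```json
[
  { "type": "loadByPeriod", "period": "P30D", "replicants": 2, "tier": "_default_tier" },
  { "type": "dropForever" }
]
```

Segments older than 30 days are not matched by the load rule and fall through to the drop rule.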

2) on version 0.6.66 if I add
config/historical/runtime.properties

druid.storage.type=local
druid.storage.storageDirectory=/var/lib/druid/localStorage
 
it not use this directory and still write to /tmp/druid/localStorage

but another settings
druid.segmentCache.locations=[{"path": "/var/lib/druid/indexCache", "maxSize"\: 10000000000}]
is working normally

Historical nodes never generate segments; they only load the segments generated by batch or indexing tasks.
druid.storage.storageDirectory is used by the tasks generating the segments.
Historical nodes do not need this config at all; they figure out segment locations from the segment metadata stored in MySQL.

Historical nodes locally cache segments once they read them from deep storage.
druid.segmentCache.locations defines the locations to use for caching the segments.
 
anything I missed?

Thanks,
Taras

--
You received this message because you are subscribed to the Google Groups "Druid Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email to druid-developm...@googlegroups.com.
To post to this group, send email to druid-de...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/druid-development/4444025f-7f80-4efb-878a-c5e7d201bc73%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.




Will Lauer

Apr 16, 2014, 2:50:00 PM
to druid-de...@googlegroups.com

Is the documentation for the drop rules incorrect (or perhaps my reading of it is incorrect)? From what I see in the documentation, “The interval of a segment will be compared against the specified period. The period is from some time in the past to the current time. The rule matches if the period contains the interval.”, which implies that a rule with a period of “P1M” would drop all segments for the last month but keep segments older than that. This seems to be exactly the opposite of what someone would normally want. I would think that normally you would want a rule saying to drop segments older than a certain time (i.e. keep one month of segments and drop things as they get older than that).

 

Will

Will Lauer
Tech Yahoo, Software Sys Dev Eng, Sr
P: 217.255.4262  M: 508.561.6427
2021 S First St Suite 110, Champaign IL 61820

Taras Puhol

Apr 16, 2014, 3:04:45 PM
to druid-de...@googlegroups.com
Nishant Bangarwa,

Does it mean I need to specify this rule during batch indexing ingestion?
Or is there a way to send a drop rule { "type" : "dropByInterval", "interval" : "2012-01-01/2013-01-01" } to the indexing service so that the interval will be dropped?


thanks
Taras

Fangjin Yang

Apr 16, 2014, 11:52:49 PM
to druid-de...@googlegroups.com
Hi Will,

There are per-datasource rules that you specify and default rules that apply to all datasources. A common setup we use is something like this:
use a period load rule to load 1M of recent data into a "hot" tier,
use a period load rule to load 2Y of data into a "cold" tier,
use a period drop rule or a forever drop rule for a much larger interval; all other data is thus dropped.

Does this make sense?

Segments match the first rule that applies to them.
If you have a period load rule of P1M and a period drop rule of P2M, everything from now to 1 month in the past is loaded, and everything from 1 month in the past to 2 months in the past is dropped.


Fangjin Yang

Apr 16, 2014, 11:55:11 PM
to druid-de...@googlegroups.com
The general idea behind rules is that you have to be a bit explicit about the date ranges you have to manage. If you configure a per datasource rule that drops data for the current month, and there is a default rule where everything is loaded, then yes, data for the current month is dropped and all older data is loaded. If you instead configure a load rule for the current month followed by a drop rule for everything else, then the current month of data is kept, and all older data is dropped.
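To make the ordering concrete, here is a hedged sketch of the two rule chains described above (the rule types and field names follow the rule-configuration docs of that era and may differ by version). Dropping the current month while loading everything else:

```json
[
  { "type": "dropByPeriod", "period": "P1M" },
  { "type": "loadForever", "replicants": 2, "tier": "_default_tier" }
]
```

versus loading the current month and dropping everything else:

```json
[
  { "type": "loadByPeriod", "period": "P1M", "replicants": 2, "tier": "_default_tier" },
  { "type": "dropForever" }
]
```

In both cases each segment is handled by the first rule that matches it.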

Fangjin Yang

Apr 16, 2014, 11:55:43 PM
to druid-de...@googlegroups.com
Hi Taras,

There is no way to send rules during indexing. You have to specify them using the coordinator console.

Will Lauer

Apr 17, 2014, 10:01:21 AM
to druid-de...@googlegroups.com

OK, that makes more sense. I had missed the bit about the rules being ordered; once you add that, the behavior is clear.

 


 


Taras Puhol

Apr 17, 2014, 1:13:55 PM
to druid-de...@googlegroups.com
Fangjin Yang,

1) So as I understand it, I send a query to the coordinator node and it does what is specified in the JSON request, am I right?
Like:
curl -X POST "http://localhost:8082/druid/v2/?pretty" -H 'content-type: application/json' -d @query.body


And other questions:

2) When I inject 1 file (100k rows) it takes 5-6 secs.
When I inject 100 files (each file 1000 rows, the same 100k rows in total) it takes 2 min 40 secs.
How can that be? What can I change to inject the 100 files faster?

3) I put into config/overlord/runtime.properties
druid.storage.type=local
druid.storage.storageDirectory=/var/lib/druid/localStorage

and it is working for me now, thanks to you. But if I reinject data for the same period, I have the new and old data in separate folders. Is there any way to overwrite the old data and keep only the new data when I reinject?

Thanks
Taras


Fangjin Yang

Apr 18, 2014, 1:11:22 AM
to druid-de...@googlegroups.com
Hi Taras, see inline.


On Thursday, April 17, 2014 10:13:55 AM UTC-7, Taras Puhol wrote:
Fangjin Yang

1) so as I understand, I send query to coordinator node and it do what specified in json request, am I right?
like
curl -X POST "http://localhost:8082/druid/v2/?pretty" -H 'content-type: application/json' -d @query.body

You send queries to broker nodes, not coordinator nodes. Broker nodes route queries to historical and real-time nodes, which compute answers in parallel. Coordinator nodes are responsible for load balancing, assigning new data to historical nodes, and dropping old data. 
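For reference, a minimal sketch of what such a query.body might contain for a timeseries query (the datasource name, interval, and aggregator here are placeholders, not values from this thread):

```json
{
  "queryType": "timeseries",
  "dataSource": "example_datasource",
  "granularity": "hour",
  "intervals": ["2014-04-01/2014-04-02"],
  "aggregations": [
    { "type": "count", "name": "rows" }
  ]
}
```

The broker fans the query out to historical and realtime nodes and merges their partial results.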


and another questions:

2) when I inject 1 file (100k rows) it takes me 5-6 secs
when I inject 100 files (each file 1000 rows, the same 100k rows totally) it takes me 2 min 40 secs
how can be? what I can change to inject 100 files more faster?

What do you mean by inject 1 file? Are you ingesting the file and if so, how are you ingesting the file? 

3)  I put to config/overlord/runtime.properties
druid.storage.type=local
druid.storage.storageDirectory=/var/lib/druid/localStorage

and is working now for me, thanks to you. but if I reinject for the same period data I have new and old data in separated folders. is any way how to rewrite old data and have only new one if I do reinject?

If you use batch indexing, Druid will create immutable segments for a time period, each with an associated version identifier. If you reindex the same time period of data, segments will be created with new versions, and once loaded into Druid they invalidate the older segments for the same time period with older versions.

Does that make sense?

FJ 


Taras Puhol

Apr 18, 2014, 2:53:49 AM
to druid-de...@googlegroups.com
Hi Fangjin Yang,


On 18 April 2014 08:11, Fangjin Yang <fan...@metamarkets.com> wrote:
Hi Taras, see inline.


On Thursday, April 17, 2014 10:13:55 AM UTC-7, Taras Puhol wrote:
Fangjin Yang

1) so as I understand, I send query to coordinator node and it do what specified in json request, am I right?
like
curl -X POST "http://localhost:8082/druid/v2/?pretty" -H 'content-type: application/json' -d @query.body

You send queries to broker nodes, not coordinator nodes. Broker nodes route queries to historical and real-time nodes, which compute answers in parallel. Coordinator nodes are responsible for load balancing, assigning new data to historical nodes, and dropping old data. 


It is clear now, thanks.
 

and another questions:

2) when I inject 1 file (100k rows) it takes me 5-6 secs
when I inject 100 files (each file 1000 rows, the same 100k rows totally) it takes me 2 min 40 secs
how can be? what I can change to inject 100 files more faster?

What do you mean by inject 1 file? Are you ingesting the file and if so, how are you ingesting the file? 

I'm doing a benchmark test. I'm using batch ingestion to load segments into my datasources. For my system, I want to use a separate datasource for each user. But I see that loading 1 segment into 1 datasource takes 5 sec, while loading 100 segments into 100 datasources (same total size) takes 2 min 40 sec. Is there any way to make this faster?
 

3)  I put to config/overlord/runtime.properties
druid.storage.type=local
druid.storage.storageDirectory=/var/lib/druid/localStorage

and is working now for me, thanks to you. but if I reinject for the same period data I have new and old data in separated folders. is any way how to rewrite old data and have only new one if I do reinject?

If you use batch indexing, Druid will create immutable segments for a time period with a version identifier associated. If you reindex the same time period of data, segments will be created with new versions and once loaded into Druid, invalidate older segments for the same time period with older versions.

Yes, I understand that the older segments are invalidated, but are there rules so that they are actually deleted, so I do not use a lot of disk space if I need to reinject many times?


Thanks,
Taras






--
Taras Puhol
Linux Network Engineer,
Mobile: +380504300957
E-mail : eines...@gmail.com



Fangjin Yang

Apr 18, 2014, 2:51:14 PM
to druid-de...@googlegroups.com
Inline.


On Thursday, April 17, 2014 11:53:49 PM UTC-7, Taras Puhol wrote:
Hi Fangjin Yang,


On 18 April 2014 08:11, Fangjin Yang <fan...@metamarkets.com> wrote:
Hi Taras, see inline.



and another questions:

2) when I inject 1 file (100k rows) it takes me 5-6 secs
when I inject 100 files (each file 1000 rows, the same 100k rows totally) it takes me 2 min 40 secs
how can be? what I can change to inject 100 files more faster?

What do you mean by inject 1 file? Are you ingesting the file and if so, how are you ingesting the file? 

I'm doing benchmark test. I'm using batch ingestion to load segments to my datastores. For my system, I want to use like for each user separated datastore. but see that load 1 segment to 1 data store takes me 5 sec, and load 100 segments to 100 dataStores (total size the same) takes me 2 min 40 sec. is there any way how to make this faster?

Can you describe your system specs? E.g., are you using SSDs? What is your deep storage? Are you loading 100 segments onto 100 historical nodes? What are the sizes of the segments? FWIW, if you care about how fast data loads, you should look into realtime ingestion.
 

3)  I put to config/overlord/runtime.properties
druid.storage.type=local
druid.storage.storageDirectory=/var/lib/druid/localStorage

and is working now for me, thanks to you. but if I reinject for the same period data I have new and old data in separated folders. is any way how to rewrite old data and have only new one if I do reinject?

If you use batch indexing, Druid will create immutable segments for a time period with a version identifier associated. If you reindex the same time period of data, segments will be created with new versions and once loaded into Druid, invalidate older segments for the same time period with older versions.

yes, I understand that I can invalidate older segments, but is there are rules that they are deleted so I not use a lot hdd space if need reinject a lot of times?


Yes, drop rules will drop segments that don't match the rule. You can also drop entire datasources in Druid. You can do all of this using the coordinator console (http://druid.io/docs/latest/Coordinator.html).
 



Fangjin Yang

Apr 18, 2014, 2:51:50 PM
to druid-de...@googlegroups.com
Also, invalidation means the data is deleted and dropped from Druid, but not from deep storage.

Taras Puhol

Apr 18, 2014, 3:37:38 PM
to druid-de...@googlegroups.com
Hi FJ,

see inline


On 18 April 2014 21:51, Fangjin Yang <fan...@metamarkets.com> wrote:
Inline.


On Thursday, April 17, 2014 11:53:49 PM UTC-7, Taras Puhol wrote:
Hi Fangjin Yang,



and another questions:

2) when I inject 1 file (100k rows) it takes me 5-6 secs
when I inject 100 files (each file 1000 rows, the same 100k rows totally) it takes me 2 min 40 secs
how can be? what I can change to inject 100 files more faster?

What do you mean by inject 1 file? Are you ingesting the file and if so, how are you ingesting the file? 

I'm doing benchmark test. I'm using batch ingestion to load segments to my datastores. For my system, I want to use like for each user separated datastore. but see that load 1 segment to 1 data store takes me 5 sec, and load 100 segments to 100 dataStores (total size the same) takes me 2 min 40 sec. is there any way how to make this faster?

Can you describe your system specs? E.g. are you using SSDs? What is your deep storage? You are loading 100 segments to 100 historical nodes? What are the size of the segments? FWIW, if you care about how fast data loads, you should look into realtime ingestion. 
 
I'm using SSDs. All Druid nodes (broker, historical, overlord, coordinator) are on the same machine, with default settings.
My deep storage is local, /var/lib/druid.
Everything else is the default for Druid, all as in http://druid.io/docs/latest/Tutorial:-Loading-Your-Data-Part-1.html, just with my data.
Each segment is injected into a unique datasource. I want to test how much time I'll need to inject files for 1000 customers; for each customer I want to have a separate datasource.
What I mean is, I ran a benchmark:
1) 1 file, 100,000 rows: takes 5 sec
2) 100 files, 1000 rows each (total size the same as point 1): 2 min 40 sec
3) 1000 files, 100 rows each (total size still the same): 28 min

Why is the inject time so different? How can I speed it up?

 

3)  I put to config/overlord/runtime.properties
druid.storage.type=local
druid.storage.storageDirectory=/var/lib/druid/localStorage

and is working now for me, thanks to you. but if I reinject for the same period data I have new and old data in separated folders. is any way how to rewrite old data and have only new one if I do reinject?

If you use batch indexing, Druid will create immutable segments for a time period with a version identifier associated. If you reindex the same time period of data, segments will be created with new versions and once loaded into Druid, invalidate older segments for the same time period with older versions.

yes, I understand that I can invalidate older segments, but is there are rules that they are deleted so I not use a lot hdd space if need reinject a lot of times?


Yes, drop rules will drop segments that don't match the rule. You can also drop entire datasources in Druid. You can do all of this using the coordinator console (http://druid.io/docs/latest/Coordinator.html).
 
Sorry, perhaps I was not clear.
Why, when I reinject data, are the old and new segments both stored in deep storage (in my case druid.storage.storageDirectory=/var/lib/druid/localStorage)?
Is there any rule so that when I reinject data, deep storage keeps only the last version of a segment?
I mean, for example: after 1 hour I receive a log that was delayed, so I reinject, and this can happen a few times, so deep storage grows quickly.

And another related question: in the case where I keep the logs and the data has already been injected (so I see it on the historical node), as I understand it, if I delete everything in storageDirectory=/var/lib/druid/localStorage, will I have any issues?

Sorry if these questions are stupid, but I cannot fully understand this from the documentation.

Thanks
Taras

 

Nishant Bangarwa

Apr 20, 2014, 2:49:53 PM
to druid-de...@googlegroups.com
Hi Taras, 
see Inline


The difference between these is largely due to the number of segments being generated.
In the first case Druid generates only 1 segment (which includes writing the segment metadata to MySQL and persisting the segment to deep storage),
while in the last case 1000 segments are generated (1000x the number of indexes being generated and persisted to deep storage, and metadata entries written to MySQL).
Also, I wonder whether the amount of compression achieved differs when separating data into different datasources.
Do you see any noticeable difference between the size of a single segment in deep storage vs. the total size of the 1000 segments in case 3?

 

3)  I put to config/overlord/runtime.properties
druid.storage.type=local
druid.storage.storageDirectory=/var/lib/druid/localStorage

and is working now for me, thanks to you. but if I reinject for the same period data I have new and old data in separated folders. is any way how to rewrite old data and have only new one if I do reinject?

If you use batch indexing, Druid will create immutable segments for a time period with a version identifier associated. If you reindex the same time period of data, segments will be created with new versions and once loaded into Druid, invalidate older segments for the same time period with older versions.

yes, I understand that I can invalidate older segments, but is there are rules that they are deleted so I not use a lot hdd space if need reinject a lot of times?


Yes, drop rules will drop segments that don't match the rule. You can also drop entire datasources in Druid. You can do all of this using the coordinator console (http://druid.io/docs/latest/Coordinator.html).
 
sorry, perhaps I was not clear.
why if I reinject data old and new segments are stored in deep storage, in my case (druid.storage.storageDirectory=/var/lib/druid/localStorage)
is any rule that if I reinject data deepStorage only store last version of segment?
mean example, after 1hour I accepted log that was delayed, so doing reinject, than this can be few times. so deep storage is grow up quickly.

and other related question, in case I save logs, and data were injected, so I see in Historic, as I understand if I delete all in storageDirectory=/var/lib/druid/localStorage will I have any issues?
By deleting all in storage, I am assuming you are trying to delete all old segments from deep storage and then reinject newer data, and since the data is already loaded on historical nodes you expect it to be available for queries until your new data is ingested?
It's not a clean way of deleting things: if your historical node needs to be rebooted, or a segment needs to be unloaded and loaded again due to rule changes for that period, it will break and there will be issues loading the segments again until new segments are available.
Also, it is not recommended to manually delete from deep storage, as it leaves the MySQL metadata and deep storage in an inconsistent state.
Alternatively, once a segment is replaced by a newer version, the older version is marked as unused in the DB.
You can try using a KillTask to delete unused segments and remove them from both MySQL and deep storage.
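For reference, a kill task is a small JSON spec submitted to the indexing service; a hedged sketch (the datasource name and interval here are placeholders, and the exact task fields should be checked against the Tasks docs for your version):

```json
{
  "type": "kill",
  "dataSource": "example_datasource",
  "interval": "2014-01-01/2014-02-01"
}
```

It removes segments in that interval that are already marked unused, deleting both the MySQL metadata entries and the files in deep storage.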



Fangjin Yang

Apr 21, 2014, 1:19:49 PM
to druid-de...@googlegroups.com
Hi Taras, see inline.
This is pretty much what we do in production.
 
What I mean, I do benchmark
1) 1 file, 100 000 rows, takes 5 sec
2) 100 files, each 1000 rows. (total space the same  as point 1) 2 min 40 secs
3) 1000 files, each 100 rows (so total space still the same) - 28 mins

why inject time is so much different? how I can speedup ?

Are you running the Hadoop index task? If so, there is definitely overhead associated with each Hadoop task for both setup and teardown. Each file is given its own mapper. It doesn't really make sense to run Hadoop jobs on files smaller than at least a hundred megabytes.
 

Taras Puhol

Apr 23, 2014, 12:21:13 PM
to druid-de...@googlegroups.com
Hi Nishant,

see inline


On 20 April 2014 21:49, Nishant Bangarwa <nishant....@metamarkets.com> wrote:
Hi Taras, 
see Inline

I'm using ssd. all druids nodes are at the same machine. (broker, historic, overlord, coordinator) default settings
My deep storage is local, /var/lib/druid
all other things are default for druid. all like in http://druid.io/docs/latest/Tutorial:-Loading-Your-Data-Part-1.html
just my data
Each segments is injected to unique dataStore. I want to test how much time I'll need to inject files to 1000 customers. for each customer I want to have separate dataStore
What I mean, I do benchmark
1) 1 file, 100 000 rows, takes 5 sec
2) 100 files, each 1000 rows. (total space the same  as point 1) 2 min 40 secs
3) 1000 files, each 100 rows (so total space still the same) - 28 mins

why inject time is so much different? how I can speedup ?
The difference between these is largely due to the no. of segments being generated, 
In the first case druid is generating only 1 segment (which includes, writing segment metadatas to mysql, persisting segment to deep storage) 
while in the last case segments being generated are 1000 (1000X no of indexes being generated and persisted to deep storage, metadatas written to mysql) 
Also I wonder with separating data in different datasources the amount of compression being achieved might be different. 
Do you see any noticeable diff in the size of a single segment in deep storage Vs total size of 1000 segments in use case 3?


The total size of the 1000 segments and the size of the 1 file I sent to Druid were the same, but in Druid I see a huge difference.

For 1000 files I have:
106M    indexCache
59M     localStorage

For 1 file:
19M     indexCache
660K    localStorage

Also, I can see that about 11M is needed just for all the empty folders and subfolders under indexCache that are created when injecting 1000 segments (for example /var/lib/druid/indexCache/project_1/2014-04-23T16:00:00.000Z_2014-04-23T17:00:00.000Z/2014-04-23T15:49:26.353Z/0).

But I have a question: is there any strategy to speed up batch index ingestion? Maybe using multi-threading, the middle manager, etc.?

Thanks
Taras

Nishant Bangarwa

Apr 24, 2014, 2:08:18 AM
to druid-de...@googlegroups.com
Hi Taras, 
The dataset you are trying to benchmark with is very small, so the overhead of setting up the batch indexing tasks and creating the indexes outweighs the actual processing time.
Can you try benchmarks with a bigger data set? At least >1 GB in size.
In Hadoop batch ingestion, Druid already spawns multiple mappers and reducers to do the work in parallel.


Taras Puhol

Apr 24, 2014, 4:19:13 AM
to druid-de...@googlegroups.com
Hi Nishant

But how can I speed up the processing of small segments? Is it possible? Should I use the middle manager or some other approach? I just need to quickly inject a lot of small segments into separate datasources. With streaming I cannot dynamically manage configs for a lot of datasources.

Thanks
Taras


Nishant Bangarwa

Apr 24, 2014, 10:24:19 AM
to druid-de...@googlegroups.com
Hi Taras,
One optimization that can be done in batch ingestion is to skip determining partitions. Since your data set is so small, you can tell Druid not to shard the data: in your partitionsSpec, set targetPartitionSize to -1, and Druid will not attempt to shard the data.

Fundamentally, batch ingestion is designed to ingest large-scale data sets.
If you want to ingest data and make it available for queries as soon as you ingest it, you can try using realtime nodes; there you might need to set an appropriate rejectionPolicy that suits your needs.
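As a sketch of where that setting lives in a Hadoop-based ingestion spec (only the relevant fragment is shown; the surrounding spec fields are omitted):

```json
{
  "partitionsSpec": {
    "targetPartitionSize": -1
  }
}
```

With targetPartitionSize set to -1, Druid skips the determine-partitions phase and writes the data without sharding.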



Taras Puhol

Apr 24, 2014, 11:46:51 AM
to druid-de...@googlegroups.com
Hi Nishant

I'm using the Indexing Service as in http://druid.io/docs/0.6.96/Tutorial:-Loading-Your-Data-Part-1.html, so it doesn't have such settings; those are from the HadoopDruidIndexer.

Taras

