how to drop segments quickly?


Rui Wang

Oct 31, 2013, 7:25:08 PM
to druid-de...@googlegroups.com
Hi,

I was testing loading bulk data into Druid using batch ingestion. As a result, the same raw data was
loaded twice, each time under a different dataSource name. Now I want to delete one of them, but I'm not sure how
to do this quickly. I disabled the dataSource on the master console, but it seems to be removing the related
segments slowly.

I also tried to use 'permanent delete segments'; however, it seems to complain about the range setting.
I tried all kinds of syntax and only got an error every time. For example, I wrote

2013-08-01/2013-10-10
2013-08-01T00/2013-10-10T00
2013-08-01T00:00:00.000Z/2013-10-10T00:00:00.000Z

and many other forms, but none worked...am I missing something really easy?

Thanks,
Rui

Fangjin Yang

Oct 31, 2013, 9:34:42 PM
to druid-de...@googlegroups.com
Hi Rui,

Druid should take care of automatically dropping obsoleted segments for a time range. When you query for data, only the most recent data for a time range is scanned. If you really want to manually remove a segment, you can set the "used" flag for that segment to 'false' in your mysql database. Disabling a datasource should cause all segments of that datasource to be removed. There was a config parameter called "druid.master.millisToWaitBeforeDeleting" which determines how long Druid waits before starting to drop segments. In more recent versions of Druid, you should be able to dynamically configure this parameter from the master console under the dynamic configuration link.
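
A minimal sketch of the "set the used flag to false" step described above. It uses an in-memory SQLite database as a stand-in for the MySQL metadata store so that it runs anywhere. The table name `segments`, the column names `id`, `dataSource` and `used`, and the sample segment ids are assumptions for illustration; check them against your own schema before running anything similar against a real database.

```python
import sqlite3

# In-memory SQLite stands in for the MySQL metadata store.
# Table, column names and segment ids are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE segments (id TEXT PRIMARY KEY, dataSource TEXT, used INTEGER)"
)
conn.executemany(
    "INSERT INTO segments VALUES (?, ?, ?)",
    [
        ("events_a_2013-08-01", "events_a", 1),
        ("events_a_2013-08-02", "events_a", 1),
        ("events_b_2013-08-01", "events_b", 1),
    ],
)


def mark_unused(conn, data_source):
    """Flag every used segment of one datasource as unused; return rows changed."""
    cur = conn.execute(
        "UPDATE segments SET used = 0 WHERE dataSource = ? AND used = 1",
        (data_source,),
    )
    conn.commit()
    return cur.rowcount


changed = mark_unused(conn, "events_b")
still_used = [
    row[0]
    for row in conn.execute("SELECT id FROM segments WHERE used = 1 ORDER BY id")
]
print(changed, still_used)
```

Note that this only changes metadata. As described in this thread, the actual drop from the compute nodes happens some time later.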

"Permanently delete segments" will wipe all segments for a datasource from Druid, MySQL, and deep storage. This can only be done with disabled segments AND if you have an indexing service up and running. It is more of an internal feature right now and one we have not fully documented.

Does that help answer your questions?

Thanks,
FJ


--
You received this message because you are subscribed to the Google Groups "Druid Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email to druid-developm...@googlegroups.com.
To post to this group, send email to druid-de...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/druid-development/3b291c8c-5692-4da5-89b2-99698e292cc4%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Eric Tschetter

Nov 1, 2013, 12:34:30 AM
to druid-de...@googlegroups.com
For the really simple answer, if you are just in a POC/dev environment and are not worried about potentially doing it wrong, you can just go to the segments table in MySQL and remove all of the rows for the segments you don't want.

This won't delete the segments from deep storage, but it will get them out of the Druid system.


Rui Wang

Nov 6, 2013, 1:02:04 PM
to druid-de...@googlegroups.com
Hi Fangjin,

Thanks for the note.


Druid should take care of automatically dropping obsoleted segments for a time range. When you query for data, only the most recent data for a time range is scanned. If you really want to manually remove a segment, you can set the "used" flag for that segment to 'false' in your mysql database. Disabling a datasource should cause all segments of that datasource to be removed. There was a config parameter called "druid.master.millisToWaitBeforeDeleting" which determines how long Druid waits before starting to drop segments. In more recent versions of Druid, you should be able to dynamically configure this parameter from the master console under the dynamic configuration link.

For our situation, I need to remove those segments to make space for other segments. :-P I know it sounds funny, but we only had 3 compute nodes and a total of 2.4TB to start with. It's still an evaluation project and we are trying different things to learn more about Druid. Hopefully we can really use it for our needs.

Could you elaborate a little more on the permanent delete? Did you mean that I need to

1. disable a data source (so all segments for this DS will be disabled)
2. have an indexing service (which is http://druid.io/docs/0.5.48/Indexing-Service.html)

to make the "permanently delete segments" function on the master console work?
 
Thanks!
Rui

Rui Wang

Nov 6, 2013, 1:03:27 PM
to druid-de...@googlegroups.com


On Thursday, October 31, 2013 9:34:30 PM UTC-7, Eric Tschetter wrote:
For the really simple answer, if you are just in a POC/dev environment and are not worried about potentially doing it wrong, you can just go to the segments table in MySQL and remove all of the rows for the segments you don't want.

This won't delete the segments from deep storage, but it will get them out of the Druid system.

Thanks, Eric. In that case, will those segments become orphans? Could you get them back into Druid if you want to?

Rui Wang

Nov 6, 2013, 1:09:19 PM
to druid-de...@googlegroups.com



Btw, Fangjin, I have 2 additional questions:

1. By setting the 'used' flag to false in MySQL, does that
     a. make the segment disabled -- ready for deletion (or does disabling the datasource do this)?   ...or
     b. really remove the segment?
     c. and in either case, it won't be used in a query, right?

2. By disabling the datasource, I do see that segments are being dropped. This is what we want...but it is going
   very slowly. Is this the way it should be? It looks like in over 3 days, each machine dropped about 200GB of segments.

thanks,
Rui
 

Fangjin Yang

Nov 6, 2013, 10:04:09 PM
to druid-de...@googlegroups.com
Hi Rui, see inline.


On Wednesday, November 6, 2013 10:09:19 AM UTC-8, Rui Wang wrote:



Btw, Fangjin, I have 2 additional questions:

1. by set the 'used' flag to false in mysql, does that
     a. make the segment disabled -- ready for deletion(or disable the datasource does this)?   ...or
     b. really remove the segment?
     c. and in either case, it won't be used in a query right?

It tells Druid the segment is no longer valid and Druid at some point later will drop the segment.
 
2. by disabling the datasource, I do see that segments are being dropped. this is what we want...but it is going
   very slow. is it the way it should be? looks like in over 3 days, each machine dropped about 200gb of segments.

Dropping should be very fast. Are you saying that after 3 days, not all of the datasource's segments have been dropped? That should definitely not be the case. Do you see things not getting dropped at all, or getting dropped and loaded back?

Eric Tschetter

Nov 7, 2013, 2:03:25 PM
to druid-de...@googlegroups.com
For the really simple answer, if you are just in a POC/dev environment and are not worried about potentially doing it wrong, you can just go to the segments table in MySQL and remove all of the rows for the segments you don't want.

This won't delete the segments from deep storage, but it will get them out of the Druid system.

Thanks, Eric. In that case, will those segments become orphan? could you get them back in Druid if you want to?


The way things work is that segments listed in the segments table with used = true *should* be loaded by the system.  Those are the segments that the master/coordinator makes sure somebody is serving.

Druid's only mechanism of going back into deep storage and cleaning stuff out is to use a task on the Indexing service.  We chose to not delete things automatically because we didn't want bugs causing accidental deletions.  You can always "resurrect" an old segment by re-creating the metadata in the segments table (or just setting the "used" flag to true if you haven't deleted the entry from the table).

So, yes, if you just delete the rows from the segments table, those segments will become "orphans".
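
To make the difference between the two approaches concrete, here is a small sketch, again using in-memory SQLite as a stand-in for MySQL. The table layout, the segment ids, and the placeholder payload strings are assumptions for illustration, not the real schema. It shows that a segment whose row still exists can be brought back by flipping the flag, while a segment whose row was deleted cannot be brought back that way and would need its metadata re-created.

```python
import sqlite3

# Hypothetical stand-in for the segments table; names and payloads are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE segments (id TEXT PRIMARY KEY, used INTEGER, payload TEXT)")
conn.executemany(
    "INSERT INTO segments VALUES (?, ?, ?)",
    [
        ("seg_1", 0, "metadata for seg_1"),  # disabled earlier, row kept
        ("seg_2", 1, "metadata for seg_2"),  # currently served
    ],
)


def should_be_loaded(conn):
    """Segments with used = 1 are the ones the system is expected to serve."""
    rows = conn.execute("SELECT id FROM segments WHERE used = 1 ORDER BY id")
    return [row[0] for row in rows]


def resurrect(conn, segment_id):
    """Set used = 1 for one segment; return False if its row no longer exists."""
    cur = conn.execute("UPDATE segments SET used = 1 WHERE id = ?", (segment_id,))
    conn.commit()
    return cur.rowcount == 1


# The quick-and-dirty route: delete the row outright (seg_2 becomes an orphan).
conn.execute("DELETE FROM segments WHERE id = ?", ("seg_2",))
conn.commit()

revived_1 = resurrect(conn, "seg_1")  # row still there, flag flipped back
revived_2 = resurrect(conn, "seg_2")  # row is gone, nothing to flip
loaded = should_be_loaded(conn)
print(revived_1, revived_2, loaded)
```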

--Eric

 


Rui Wang

Nov 7, 2013, 2:13:49 PM
to druid-de...@googlegroups.com
Hi Fangjin,


 
2. by disabling the datasource, I do see that segments are being dropped. this is what we want...but it is going
   very slow. is it the way it should be? looks like in over 3 days, each machine dropped about 200gb of segments.

Dropping should be very fast, are you saying after 3 days all the segments of the datasource are not dropped? That should definitely not be the case. Do you see thing not get dropped at all or dropped and loaded back?
 

It was removing segments at a constant slow speed -- in 3 days, each machine removed about 200GB worth of segments. One thing I should mention -- these segments were created when we had OOM problems in the Hadoop ingestion, so each segment is about 14MB, quite small, and we need to drop 35000 of them. Is this what makes it slow?
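
As a rough back-of-the-envelope check of those numbers, the short calculation below turns them into a per-segment drop time. It assumes decimal units (1 GB = 1000 MB) and treats the quoted figures as exact, which they are not, so the result is only an order-of-magnitude estimate.

```python
# Figures quoted in this thread; all approximate.
SEGMENT_MB = 14
SEGMENT_COUNT = 35_000
DROPPED_GB_PER_NODE = 200
DAYS = 3

# Total size of the small segments that need to go (assumes 1 GB = 1000 MB).
total_gb = SEGMENT_COUNT * SEGMENT_MB / 1000

# How many 14MB segments fit into the 200GB each machine dropped.
segments_per_node = DROPPED_GB_PER_NODE * 1000 / SEGMENT_MB

# Average wall-clock time one machine spent per dropped segment.
seconds_per_drop = DAYS * 24 * 3600 / segments_per_node

print(total_gb, round(segments_per_node), round(seconds_per_drop, 1))
```

Under these assumptions each machine averaged roughly one drop every 18 seconds, so the number of segments, rather than their size, is what dominates the total time.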

thanks,
Rui

Eric Tschetter

Nov 7, 2013, 2:17:22 PM
to druid-de...@googlegroups.com
It was removing segments at a constant slow speed -- 3 days, each machine removed about 200GB worth of segments. One thing I should mention -- these segments were created when we had oom problems in the hadoop ingestion so each segment is about 14MB, quite small, and we need to drop 35000 of them. is this the reason that makes it slow?

It *shouldn't* be that slow, but it's also true that we haven't optimized for the "I want to drop thousands of segments right now" case, so it's very possible that you hit some sort of corner case or bottleneck that slowed things down.

Does it appear to have removed all of the small segments, or is it still working on them?

--Eric

 


Rui Wang

Nov 12, 2013, 2:41:42 PM
to druid-de...@googlegroups.com
Hi Eric,

 
Does it appear to have removed all of the small segments, or is it still working on them?

It took quite a few days...but now it is all clear. All those segments are gone.

One thing to note: while it was dropping the segments, it didn't load the new segments.
I ran a time boundary query on the large segments that were waiting to load, but these large
segments became available only after the smaller ones were all gone. Is this expected?
I thought that as soon as disk space became available, it would start to load...

Thanks,
Rui

Eric Tschetter

Nov 13, 2013, 12:55:56 AM
to druid-de...@googlegroups.com
Rui,

Yes, unfortunately.  The way the loading/dropping works is that it only does one segment at a time.  Due to some fun things with zookeeper, if the drops happen quickly, it can actually take quite a while for the coordinator node to notice and issue the next load/drop command.  Drops are higher priority than loads, so all of the drops were queued up in front of the loads and it was waiting for them to finish before loading.
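
The queueing behaviour described above can be illustrated with a toy model. This is not Druid's coordinator code, just a sketch of a queue that handles one request at a time and always prefers drops over loads; the function name and the segment names are made up for the example.

```python
import heapq
from itertools import count

DROP, LOAD = 0, 1  # lower number = higher priority


def processing_order(requests):
    """Return requests in the order a one-at-a-time, drop-first queue runs them."""
    tie = count()  # keeps arrival order among requests of equal priority
    heap = [(kind, next(tie), name) for kind, name in requests]
    heapq.heapify(heap)
    order = []
    while heap:
        kind, _, name = heapq.heappop(heap)  # exactly one request per step
        order.append(("drop" if kind == DROP else "load", name))
    return order


# Loads and drops arrive interleaved, but every drop runs before any load.
requests = [(LOAD, "big_1"), (DROP, "small_1"), (LOAD, "big_2"), (DROP, "small_2")]
order = processing_order(requests)
print(order)
```

With thousands of small drops queued, each taking a while to be acknowledged, the loads at the back of such a queue would wait for the whole backlog to clear, which matches what was observed here.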

--Eric

