How to drop a dataSource in Druid


Tridib Samanta

Sep 22, 2014, 1:51:04 PM
to druid-de...@googlegroups.com
Hello Experts,
I have started exploring Druid for our new application and am trying to load my own data. I set up Kafka ingestion and, while doing so, uploaded some dummy/invalid data. Now that everything is set up properly (I guess so), I want to clean up the data and start fresh.

How do I drop the dataSource?
How do I clean up corrupted data leaving the good data intact?

Druid Version: 0.6.152

Thank you!
Tridib

Fangjin Yang

Sep 22, 2014, 2:59:32 PM
to druid-de...@googlegroups.com
Hi Trilib,

Welcome! See my comments inline.


On Monday, September 22, 2014 10:51:04 AM UTC-7, Tridib Samanta wrote:
How do I drop the dataSource?

You can use rules to automatically drop data that is older than a certain period (http://druid.io/docs/latest/Rule-Configuration.html). You can also completely drop datasources from Druid using the coordinator console (http://druid.io/docs/latest/Coordinator.html), or do a complete wipe of the data everywhere with a kill task (http://druid.io/docs/latest/Tasks.html, see Kill Task).
 
How do I clean up corrupted data leaving the good data intact?

Generally we recommend rerunning batch processing here. Let me know if you'd like more details. 
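
For readers who want to try the kill task mentioned above, here is a minimal sketch in Python, not a definitive recipe. It assumes the task JSON has the fields "type", "dataSource", and "interval" and is POSTed to the overlord (indexing service) at /druid/indexer/v1/task, as described on the Tasks page. The datasource name "my_datasource", the interval, and the overlord address are placeholders. As far as I know, a kill task only removes segments that have already been disabled, so disable the datasource first.

```python
import json
import urllib.request


def build_kill_task(data_source, interval):
    """Build the JSON body for a kill task.

    `interval` is an ISO-8601 interval string such as
    "2014-09-01/2014-10-01"; segments inside it are the ones targeted.
    """
    return {"type": "kill", "dataSource": data_source, "interval": interval}


def submit_task(overlord_url, task):
    """POST a task to the overlord and return the raw response body.

    `overlord_url` is the base address of the overlord, for example
    "http://localhost:8087" (a placeholder, adjust to your setup).
    """
    request = urllib.request.Request(
        overlord_url.rstrip("/") + "/druid/indexer/v1/task",
        data=json.dumps(task).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


# Placeholder datasource and interval, for illustration only.
task = build_kill_task("my_datasource", "2014-09-01/2014-10-01")
print(json.dumps(task, indent=2))

# Uncomment only against a real cluster; this permanently deletes data.
# print(submit_task("http://localhost:8087", task))
```

The submit call is left commented out on purpose, since the deletion cannot be undone.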

Fangjin Yang

Sep 22, 2014, 3:00:07 PM
to druid-de...@googlegroups.com
I suck at typing and meant to spell Tridib :)

Tridib Samanta

Sep 22, 2014, 3:51:15 PM
to druid-de...@googlegroups.com
Hi Fangjin,
Thanks for your quick response. I have a POC demo on Wednesday. Hopefully I will be able to put everything together by then.

Thanks & Regards
Tridib

Tridib Samanta

Sep 22, 2014, 4:10:39 PM
to druid-de...@googlegroups.com
I am unable to figure out how to delete from the console. It only has an option to browse.
It does give an option to delete segments by time interval, but that page does not list my data source. It only lists the data source "Wikipedia". :(

Thanks
Tridib



Tridib Samanta

Sep 22, 2014, 4:26:08 PM
to druid-de...@googlegroups.com
I just figured out that the console does not work in IE. With Firefox I am able to see the data sources, but there is no way to delete them. Any ideas?

Thanks & Regards
Tridib

Fangjin Yang

Sep 23, 2014, 12:17:45 AM
to druid-de...@googlegroups.com
Go to ip:port/ of your coordinator node in a web browser.

You should see these links:

Configure Assignment Rules <-- go here to drop data based on recency
Enable/Disable Datasources <-- go here to drop a datasource from Druid
Permanent Segment Deletion <-- go here to remove everything about a datasource (only if you have the indexing service set up)

Tridib Samanta

Sep 23, 2014, 9:16:18 AM
to druid-de...@googlegroups.com
Unfortunately, I still don't see my datasource in the Enable/Disable section. I am seeing various weird things and am not sure whether my cluster is set up properly. I am using all default values, except that it is configured for Kafka eight. Here is my cluster configuration:
Real time node
Historical node
Coordinator
Broker
Overlord
ZooKeeper
MySQL

All these services are running. I created a new dataSource called "claim".

Now here is the inconsistency/observation:
1. When I go to the complete cluster view, I only see the Realtime node and the Historical node. Picture attached.
2. When I go to Enable/Disable Datasources to disable data sources, I only see "Wikipedia", not my own datasource "claim", even though the "claim" data source exists and is visible in the data source listing.
3. On the Delete Segment page I only see the "Wikipedia" datasource.
4. To create the "claim" data source, I only specified it in the realtime Kafka 8 firehose configuration. Is this the only place to configure data sources?
5. My Druid version is 0.6.152 and the Kafka extension version is "io.druid.extensions:druid-kafka-eight:0.6.147". Is that correct?
6. Is there a way to completely get rid of all data and start with a clean setup? Where does Druid store its data?

I have attached screenshots for better understanding.

Thanks & Regards
Tridib
data_source_disable_wikipedia_available_only.png
data_sources.PNG
delete_segments.png
full_cluster_view.PNG

Fangjin Yang

Sep 23, 2014, 2:31:26 PM
to druid-de...@googlegroups.com
Hi Tridib, it seems all your data is being served by real-time nodes and handoff is not actually working. Can you provide some details about your handoff process?

Tridib Samanta

Sep 23, 2014, 2:54:31 PM
to druid-de...@googlegroups.com
I just started with Druid and am not very familiar with the handoff process or its log. How can I capture it?

Fangjin Yang

Sep 24, 2014, 1:56:33 PM
to druid-de...@googlegroups.com
There's a longer description of handoff and Druid in general here: http://static.druid.io/docs/druid.pdf

For your use case, can you describe your cluster? Are all nodes running on a single machine right now? If so, can you share your runtime.properties configuration?

If not, do you have deep storage set up?

Thanks,
FJ

Tridib Samanta

Sep 24, 2014, 2:52:59 PM
to druid-de...@googlegroups.com
Yes, all nodes are running on a single machine. The runtime configuration is attached. Deep storage is the default temp directory.
I have a question: in the Full Cluster view, should I see all the nodes listed? I can only see the Realtime and Historical nodes.
druid-config.tar.gz

Tridib Samanta

Sep 25, 2014, 12:31:27 PM
to druid-de...@googlegroups.com
I looked at the realtime node configuration and found that druid.publish.type was set to noop. I changed it and also configured the MySQL db settings, which were commented out in the default configuration. Just wondering why handoff is turned off in the default configuration.

Thanks for all your help.

Fangjin Yang

Sep 26, 2014, 12:11:16 AM
to druid-de...@googlegroups.com
By default it is turned off to avoid having to set up more pieces during the first introductory tutorial. Is handoff working now with druid.publish.type=db?
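
As a quick way to check the setting discussed above, here is a minimal sketch in Python that reads a realtime runtime.properties and reports whether segments will be published for handoff. The sample text is illustrative rather than the shipped default file. The disabled value is assumed to be spelled noop, and the parser is a simplification that ignores Java properties escapes and line continuations.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def handoff_enabled(props):
    """Return True only when druid.publish.type is explicitly set to db.

    Treating a missing property as "not enabled" is an assumption of
    this sketch, not a statement about Druid's actual default.
    """
    return props.get("druid.publish.type") == "db"


# Abridged, illustrative realtime configuration (assumed contents).
example = """
druid.service=realtime
druid.publish.type=noop
# druid.db.connector.connectURI=jdbc:mysql://localhost:3306/druid
"""

props = parse_properties(example)
print("handoff enabled:", handoff_enabled(props))
```

With the sample text as written, the check reports that handoff is not enabled. Changing the value to db and uncommenting the MySQL connector lines corresponds to the fix described in this thread.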

Tridib Samanta

Sep 26, 2014, 8:06:46 AM
to druid-de...@googlegroups.com
I guess so. I now see segments on the historical node in the cluster view.

Fangjin Yang

Sep 26, 2014, 8:15:42 PM
to druid-de...@googlegroups.com
Cool, you should be able to disable and drop those segments in the historical cluster now.