HBase error caused UID mappings to be wiped and then recreated, so now we have duplicates with data we cannot access

Brett Hawn

Oct 6, 2017, 11:35:50 AM

to OpenTSDB

Long story short, due to an issue with our HBase, the UID mappings went missing for a period of time and the metrics ended up being recreated (and apparently all tagk and tagv mappings to boot), so now we have duplicate UID mappings for everything. Data that is mapped to the new UIDs is available, but data mapped to the old UIDs is not. The question is, is there a way to:

a) extract the data that's mapped to the old UIDs so it can be re-imported under the new UIDs or, even better, re-mapped

b) remove the old mappings afterwards

I'm not particularly concerned with the loss of the UIDs themselves. A fair number of UIDs would become unavailable, but not enough to be of concern. The data gap caused by the problem, however, is a real issue.

I did a scan of the tsdb-uid table (redirected to a file); an example is below:

grep net.circuit.ifhcinoctets screenlog.18

;\x7F\x15 column=name:metrics, timestamp=1506097504076, value=net.circuit.ifhcinoctets

net.circuit.ifhcinoctets column=id:metrics, timestamp=1506097504079, value=;\x7F\x15

net.circuit.ifhcinoctets column=id:metrics, timestamp=1507157614607, value=\x9A\xFF\x9D

\x9A\xFF\x9D column=name:metrics, timestamp=1507157614604, value=net.circuit.ifhcinoctets

You can see the old mapping and the new mapping. About 10 days of data are mapped to the old UID, and if we run tsdb uid fsck we see the following output:

2017-10-06 08:34:36,287 ERROR [main] UidManager: Duplicate forward metrics mapping: net.circuit.ifhcinoctets -> 3B7F15 and net.circuit.ifhcinoctets -> 9AFF9D. kv=KeyValue(key="net.circuit.ifhcinoctets", family="id", qualifier="metrics", value=[-102, -1, -99], timestamp=1507157614607)

2017-10-06 08:34:36,751 ERROR [main] UidManager: Inconsistent reverse metrics mapping 3B7F15 -> net.circuit.ifhcinoctets vs 9AFF9D -> net.circuit.ifhcinoctets / net.circuit.ifhcinoctets -> 9AFF9D
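For anyone matching the scan output against the fsck output: the two show the same UIDs in different notations. The shell scan prints printable bytes as characters and the rest as \xNN escapes, while fsck prints hex, and its kv=... value is a list of signed bytes. Below is a minimal Python sketch of the conversion. The helper names are my own (not part of OpenTSDB or HBase), and it assumes the escaped string contains only printable ASCII plus \xNN escapes.

```python
import re

_HEX_ESCAPE = re.compile(r"\\x([0-9A-Fa-f]{2})")

def shell_escaped_to_bytes(text: str) -> bytes:
    """Decode a shell-style string such as ';\\x7F\\x15' into raw bytes.

    Hypothetical helper: assumes only printable ASCII and \\xNN escapes.
    """
    out = bytearray()
    i = 0
    while i < len(text):
        match = _HEX_ESCAPE.match(text, i)
        if match:
            out.append(int(match.group(1), 16))
            i = match.end()
        else:
            out.append(ord(text[i]))  # printable byte shown as itself
            i += 1
    return bytes(out)

def signed_bytes_to_hex(values) -> str:
    """Convert Java-style signed bytes (e.g. [-102, -1, -99]) to upper-case hex."""
    return bytes(v & 0xFF for v in values).hex().upper()

old_uid = shell_escaped_to_bytes(r";\x7F\x15")
new_uid = shell_escaped_to_bytes(r"\x9A\xFF\x9D")

print(old_uid.hex().upper())                 # 3B7F15
print(new_uid.hex().upper())                 # 9AFF9D
print(signed_bytes_to_hex([-102, -1, -99]))  # 9AFF9D
```

So the older pair (timestamp 1506097504076) is UID 3B7F15 and the newer pair (timestamp 1507157614604) is 9AFF9D, which matches what fsck reports.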

As you can imagine, this is not a happy situation, and I'm looking for a resolution that doesn't involve wiping the cluster (not an option) and/or major data loss.
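To make option (a) concrete, here is a rough Python sketch of what re-mapping a single data row key might look like. It is only an illustration, not a tested recovery procedure. It assumes the default layout of metric UID, 4-byte base timestamp, then tagk/tagv UID pairs; default 3-byte UID widths; no salting; and that old-to-new UID maps can be built for metrics, tagk and tagv. The function name, the timestamp bytes and the tag UIDs in the example are hypothetical. Only 3B7F15 and 9AFF9D come from the output above.

```python
UID_WIDTH = 3  # assumption: default width for metric, tagk and tagv UIDs
TS_WIDTH = 4   # base timestamp bytes in the row key

def remap_row_key(row_key: bytes, metrics: dict, tagks: dict, tagvs: dict) -> bytes:
    """Swap old UIDs for new ones in one row key; unmapped UIDs pass through."""
    metric = row_key[:UID_WIDTH]
    base_ts = row_key[UID_WIDTH:UID_WIDTH + TS_WIDTH]
    tags = row_key[UID_WIDTH + TS_WIDTH:]
    pair_width = 2 * UID_WIDTH
    if len(tags) % pair_width != 0:
        raise ValueError("tag section is not a whole number of tagk/tagv pairs")

    pairs = []
    for i in range(0, len(tags), pair_width):
        tagk = tags[i:i + UID_WIDTH]
        tagv = tags[i + UID_WIDTH:i + pair_width]
        pairs.append(tagks.get(tagk, tagk) + tagvs.get(tagv, tagv))

    # Tag pairs are kept in byte order in the key, so re-sort after remapping.
    pairs.sort()
    return metrics.get(metric, metric) + base_ts + b"".join(pairs)

# Hypothetical row: old metric UID, made-up base timestamp, one made-up tag pair.
old_key = bytes.fromhex("3B7F15" "59C3D5F0" "000001" "000002")
new_key = remap_row_key(
    old_key,
    metrics={bytes.fromhex("3B7F15"): bytes.fromhex("9AFF9D")},
    tagks={bytes.fromhex("000001"): bytes.fromhex("0000A1")},
    tagvs={bytes.fromhex("000002"): bytes.fromhex("0000B2")},
)
print(new_key.hex().upper())  # 9AFF9D59C3D5F00000A10000B2
```

The cell values would be copied as-is under the new key. Whether that is safe on a live cluster (compactions, rows that already exist under the new UID) is exactly the part I don't know.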

ManOLamancha

Jan 29, 2018, 8:39:32 PM

to OpenTSDB

Yeah, this isn't an easy one to solve. Ping me via my git commit email and I can try to help in detail.

Brett Hawn

Feb 9, 2018, 12:38:30 PM

to OpenTSDB

Honestly, we gave up on rescuing the data some time ago and have since moved on without it.
