Parsing NaN in metric columns


bachr chi

Aug 28, 2015, 11:40:03 AM
to Druid User
Hi,
I'm ingesting a local TSV file with the indexing service, using the TSV parser spec.

Many of the metric columns that use a doubleSum aggregation may contain the string NaN, which Druid fails to parse:

2015-08-28T11:02:25,045 ERROR [task-runner-0] io.druid.indexing.overlord.ThreadPoolTaskRunner - Exception while running task[IndexTask{id=index_994_2015_08_12_12_standard_feed_2015-08-28T11:02:09.516Z, type=index, dataSource=994_2015_08_12_12_standard_feed}]
com.metamx.common.parsers.ParseException: Unable to parse metrics[served_tasks_time], value[NaN]
	at io.druid.data.input.MapBasedRow.getFloatMetric(MapBasedRow.java:112) ~[druid-api-0.3.8.jar:0.3.8]
	at io.druid.segment.incremental.IncrementalIndex$1$3.get(IncrementalIndex.java:113) ~[druid-processing-0.8.0.jar:0.8.0]
	at io.druid.query.aggregation.DoubleSumAggregator.aggregate(DoubleSumAggregator.java:60) ~[druid-processing-0.8.0.jar:0.8.0]
	at io.druid.segment.incremental.OnheapIncrementalIndex.addToFacts(OnheapIncrementalIndex.java:169) ~[druid-processing-0.8.0.jar:0.8.0]
	at io.druid.segment.incremental.IncrementalIndex.add(IncrementalIndex.java:452) ~[druid-processing-0.8.0.jar:0.8.0]
	at io.druid.segment.realtime.plumber.Sink.add(Sink.java:125) ~[druid-server-0.8.0.jar:0.8.0]
	at io.druid.indexing.common.index.YeOldePlumberSchool$1.add(YeOldePlumberSchool.java:115) ~[druid-indexing-service-0.8.0.jar:0.8.0]
	at io.druid.indexing.common.task.IndexTask.generateSegment(IndexTask.java:374) ~[druid-indexing-service-0.8.0.jar:0.8.0]
	at io.druid.indexing.common.task.IndexTask.run(IndexTask.java:205) ~[druid-indexing-service-0.8.0.jar:0.8.0]
	at io.druid.indexing.overlord.ThreadPoolTaskRunner$ThreadPoolTaskRunnerCallable.call(ThreadPoolTaskRunner.java:235) [druid-indexing-service-0.8.0.jar:0.8.0]
	at io.druid.indexing.overlord.ThreadPoolTaskRunner$ThreadPoolTaskRunnerCallable.call(ThreadPoolTaskRunner.java:214) [druid-indexing-service-0.8.0.jar:0.8.0]
	at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_51]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_51]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_51]
	at java.lang.Thread.run(Thread.java:745) [?:1.8.0_51]
Caused by: java.lang.NumberFormatException: For input string: "NULL"
	at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:2043) ~[?:1.8.0_51]
	at sun.misc.FloatingDecimal.parseFloat(FloatingDecimal.java:122) ~[?:1.8.0_51]
	at java.lang.Float.parseFloat(Float.java:451) ~[?:1.8.0_51]
	at java.lang.Float.valueOf(Float.java:416) ~[?:1.8.0_51]
	at io.druid.data.input.MapBasedRow.getFloatMetric(MapBasedRow.java:109) ~[druid-api-0.3.8.jar:0.3.8]
	... 14 more

Is there a way to tell Druid which default value to use for a given column when parsing fails (e.g. replace NaN with 0.0), or to skip the row?

Nishant Bangarwa

Aug 31, 2015, 10:13:58 AM
to druid...@googlegroups.com
Hi, 
IndexTask doesn't have any config option to replace NaN with 0.0.
I think the best way to handle this is to clean the data in the ETL layer, before it reaches Druid.
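
As an illustration of that kind of ETL cleanup, here is a minimal Java sketch that rewrites unparseable metric fields in a tab-separated line before the file is handed to the indexing task. Everything named here is hypothetical and not part of Druid: the class `TsvMetricCleaner`, its methods, the list of "missing" tokens, the metric column index, and the fallback value "0.0" are all assumptions to adapt to the real data.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical pre-ingestion helper; not part of Druid.
final class TsvMetricCleaner {
    // Tokens treated as "missing" (an assumption; extend as the data requires).
    private static final Set<String> MISSING =
            new HashSet<>(Arrays.asList("", "NaN", "NULL", "null"));

    /** Returns the trimmed value if it is a finite number, otherwise the fallback. */
    static String cleanMetric(String raw, String fallback) {
        String trimmed = raw.trim();
        if (MISSING.contains(trimmed)) {
            return fallback;
        }
        try {
            double parsed = Double.parseDouble(trimmed);
            // Reject NaN and +/-Infinity even when they parse successfully.
            return (Double.isNaN(parsed) || Double.isInfinite(parsed)) ? fallback : trimmed;
        } catch (NumberFormatException e) {
            return fallback;
        }
    }

    /** Cleans the given zero-based metric columns of one tab-separated line. */
    static String cleanLine(String line, Set<Integer> metricColumns, String fallback) {
        String[] fields = line.split("\t", -1); // -1 keeps trailing empty fields
        for (int i = 0; i < fields.length; i++) {
            if (metricColumns.contains(i)) {
                fields[i] = cleanMetric(fields[i], fallback);
            }
        }
        return String.join("\t", fields);
    }

    public static void main(String[] args) {
        Set<Integer> metricColumns = new HashSet<>(Arrays.asList(2));
        // The third column holds the metric in this made-up row.
        System.out.println(cleanLine("2015-08-28T11:00:00Z\thost1\tNaN", metricColumns, "0.0"));
    }
}
```

To skip bad rows instead of substituting a default, the same check could be used as a filter: drop any line in which a metric field fails to parse, rather than rewriting it.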


bachr chi

Aug 31, 2015, 12:22:09 PM
to Druid User
Why not return Float.NaN (which is a valid float) when getFloatMetric() hits a NumberFormatException?
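
A rough standalone sketch of that idea in Java follows. It is not Druid's actual code: the class `LenientMetricParser` and the method `parseFloatMetric` are hypothetical names, and mapping a null value to 0 is an assumption. Note that Java's Float.parseFloat already accepts the literal string "NaN"; the "Caused by" line in the trace above reports the input "NULL".

```java
// Hypothetical stand-in for a lenient metric parser; not Druid's MapBasedRow.
final class LenientMetricParser {
    /** Parses a metric value, returning Float.NaN instead of throwing on bad input. */
    static float parseFloatMetric(Object value) {
        if (value == null) {
            return 0.0f; // assumption: treat an absent metric as zero
        }
        if (value instanceof Number) {
            return ((Number) value).floatValue();
        }
        try {
            return Float.parseFloat(value.toString().trim());
        } catch (NumberFormatException e) {
            return Float.NaN; // unparseable input becomes NaN rather than an error
        }
    }

    public static void main(String[] args) {
        System.out.println(parseFloatMetric("NULL")); // prints NaN
        System.out.println(parseFloatMetric("12.5")); // prints 12.5
        // NaN propagates through addition, so one bad row turns a sum into NaN.
        double sum = 1.0 + parseFloatMetric("NULL");
        System.out.println(Double.isNaN(sum)); // prints true
    }
}
```

One trade-off to keep in mind: because NaN propagates through arithmetic, a single NaN fed into a sum-style aggregation makes the whole aggregate NaN, so a default of 0 or skipping the row may be preferable depending on the use case.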

Fangjin Yang

Sep 1, 2015, 6:57:24 PM
to Druid User
Hi Bachr, we'd love a PR with this change and some unit tests.

-- FJ