Best practice handling NaN in TimeSeriesRDDs and DateTimeIndex management


Geoffray Bories

Aug 25, 2017, 5:00:56 AM
to spar...@googlegroups.com, Robert Suhada, Simon Ondracek

Hello Spark-TS support team,


Following a previous question I had about handling parallelized ARIMA modeling, I’m facing a new issue now that I’ve switched to real, live data.


The issue is that my source data comes from several sensors which send their readings at irregular intervals. I have created an IrregularDateTimeIndex from the timestamps I have in my database, but because the devices don’t share the same timestamps, aligning every series on that common index produces a lot of NaN values, as you can observe in the sample below (I sketch how I build the index right after the sample):


({device:dev2},[NaN,NaN,NaN,0.342496961,NaN,NaN,NaN,NaN,NaN,0.676741958,NaN,NaN,NaN,NaN,NaN,1.01296675,NaN,NaN,NaN,NaN,NaN,1.35176611,NaN,NaN,NaN,NaN,NaN,1.68595779,NaN,NaN,NaN,NaN,NaN,2.02736568,NaN,NaN,NaN,NaN,NaN,2.37802577,NaN,NaN,NaN,NaN,NaN,2.71618986, ….

({device:dev3},[NaN,NaN,NaN,NaN,0.362333804,NaN,NaN,NaN,NaN,NaN,0.730960131,NaN,NaN,NaN,NaN,NaN,1.11360252,NaN,NaN,NaN,NaN,NaN,1.49264932,NaN,NaN,NaN,NaN,NaN,1.88771224,NaN,NaN,NaN,NaN,NaN,2.26302838,NaN,NaN,NaN,NaN,NaN,2.66522837,NaN,NaN,NaN,NaN,NaN,3.06329536,NaN, ….

({device:dev4},[NaN,NaN,NaN,NaN,NaN,0.543007612,NaN,NaN,NaN,NaN,NaN,1.0685935,NaN,NaN,NaN,NaN,NaN,1.61586177,NaN,NaN,NaN,NaN,NaN,2.15444374,NaN,NaN,NaN,NaN,NaN,2.69110751,NaN,NaN,NaN,NaN,NaN,3.22683144,NaN,NaN,NaN,NaN,NaN,3.74374008,NaN,NaN,NaN,NaN,NaN,4.26203299,NaN,NaN, ….

({device:dev5},[0.018188294,NaN,NaN,NaN,NaN,NaN,0.0353797004,NaN,NaN,NaN,NaN,NaN,0.0519058816,NaN,NaN,NaN,NaN,NaN,0.0694523975,NaN,NaN,NaN,NaN,NaN,0.0860962197,NaN,NaN,NaN,NaN,NaN,0.103950158,NaN,NaN,NaN,NaN,NaN,0.122693457,NaN,NaN,NaN,NaN,NaN,0.141855747,NaN,NaN,NaN,NaN,NaN, ….
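
For reference, this is roughly how I build the index (a simplified sketch only; allTimestampsFromDb is a placeholder for the timestamps I pull from my database, and I am assuming the spark-ts 0.4.x API, where an IrregularDateTimeIndex is built from epoch nanoseconds):

import java.time.ZoneId
import com.cloudera.sparkts.DateTimeIndex

// All distinct observation timestamps across all devices, sorted,
// as nanoseconds since the epoch (assumed spark-ts 0.4.x convention).
val instants: Array[Long] = allTimestampsFromDb.distinct.sorted
val index = DateTimeIndex.irregular(instants, ZoneId.of("Z"))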


Of course, I cannot use the fill function: it would leave me with more interpolated points than original ones, and, more importantly, it does not scale. The filled time series are long enough that fitting an ARIMA model on just one of them already hits 10,000 iterations once its parameters are greater than 1.

I didn’t see any built-in functions for dealing with NaNs in the spark-ts API. Do you have any ideas on how to handle this NaN issue?
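
For what it’s worth, the only workaround I can think of is to drop the NaN slots myself before fitting, along these lines (a rough sketch only, assuming the spark-ts 0.4.x API; note that dropping the NaNs also throws away the irregular spacing, so the values are treated as a plain, equally spaced sequence):

import org.apache.spark.mllib.linalg.Vectors
import com.cloudera.sparkts.models.ARIMA

// tsRdd is the TimeSeriesRDD[String] aligned on the shared
// IrregularDateTimeIndex, as in the sample above.
val models = tsRdd.map { case (device, series) =>
  // Keep only this device's real observations; the NaN slots belong
  // to timestamps contributed by the other devices.
  val dense = Vectors.dense(series.toArray.filter(v => !v.isNaN))
  (device, ARIMA.fitModel(1, 0, 1, dense))
}

Is that a reasonable direction, or is there something built in that I missed?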

Also, am I right in assuming that all the time series in a TimeSeriesRDD have to share the same time index? Is it a prerequisite that all devices send their data at the same time intervals?


Many thanks in advance for your feedback


Best regards,

Geoffray BORIES

Data Scientist


FOXCONN 4TECH (Foxconn CZ s.r.o.)

K Žižkovu 813/2, 190 00 Praha 9, Czech Republic

Mobile (CZ): +420 733 781 523 (INT VPN 527-22 523)

geoffra...@foxconn4tech.com



Luis Fernando Nicolás

Aug 25, 2017, 5:26:17 AM
to Geoffray Bories, spar...@googlegroups.com, Robert Suhada, Simon Ondracek
To the best of my knowledge, this library is not suitable for your problem, because it expects a single timestamp index shared across all time series. A TimeSeriesRDD is essentially a table where one of the columns is the index. It is true that the columns are processed in parallel thanks to Spark, but it is tabular data in the end.

Once you have irregular timestamps that generate very sparse data after alignment, as in your case, even the simplest computations become tricky.
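
To make it concrete, here is a toy illustration in plain Scala (the timestamps are made up): the shared index is the union of every device’s timestamps, so each aligned series is NaN at every slot contributed by another device.

val dev2Times = Seq(4L, 10L, 16L) // hypothetical epoch seconds
val dev3Times = Seq(5L, 11L, 17L)

// The shared index everything is aligned on: the sorted union.
val shared = (dev2Times ++ dev3Times).distinct.sorted

// dev2's series after alignment: half the slots are already NaN,
// and the sparsity grows with every additional device.
val dev2Aligned = shared.map(t => if (dev2Times.contains(t)) 1.0 else Double.NaN)
// shared      = List(4, 5, 10, 11, 16, 17)
// dev2Aligned = List(1.0, NaN, 1.0, NaN, 1.0, NaN)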

I want to mention that I have faced this problem at work in the automotive sector (cars are essentially a bunch of sensors talking to each other), and we developed a library to increase our productivity. Unfortunately, it will not be open sourced. On my GitHub (https://github.com/lnicalo/Sensors) you can see something similar that I developed in my personal time. It is at a very, very early stage, but it can give you an idea of how to approach the problem. As with anything else, there is no single solution. I am open to discussing this question, because it has been chasing me for quite a few years.


Geoffray Bories

Aug 28, 2017, 8:11:55 AM
to Luis Fernando Nicolás, spar...@googlegroups.com, Robert Suhada, Simon Ondracek

Hello Luis,


Thank you for this clarification; it confirms my first intuition.

Thank you for sharing your GitHub as well. The approach you took looks quite interesting, but as you said, with you as the only contributor it is still at an early stage of development; let’s hope your idea gets traction from the community soon.


Although I would like to participate, right now my priority isn’t developing this type of approach. For the goal I’m pursuing at the moment, I decided to go with 1-minute-aggregated values from my NoSQL database; time-wise this makes more sense to me right now. Those values will be fitted to a uniform DateTimeIndex, making my data somewhat tabular (a rough sketch of what I mean follows). I’m hoping this alternative will work.
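
Roughly what I have in mind, as a sketch (the DataFrame and column names are placeholders, and I am assuming the spark-ts 0.4.x API and Spark 2.x SQL functions):

import java.time.{ZoneId, ZonedDateTime}
import org.apache.spark.sql.functions.{avg, col, window}
import com.cloudera.sparkts.{DateTimeIndex, MinuteFrequency, TimeSeriesRDD}

// obsDf: raw observations with columns "timestamp", "device", "value".
// 1-minute averages per device, computed with plain Spark SQL.
val perMinute = obsDf
  .groupBy(col("device"), window(col("timestamp"), "1 minute"))
  .agg(avg("value").as("value"))
  .select(col("window.start").as("timestamp"), col("device"), col("value"))

// A uniform index at 1-minute frequency over the period of interest.
val index = DateTimeIndex.uniformFromInterval(
  ZonedDateTime.of(2017, 8, 1, 0, 0, 0, 0, ZoneId.of("Z")),
  ZonedDateTime.of(2017, 8, 2, 0, 0, 0, 0, ZoneId.of("Z")),
  new MinuteFrequency(1))

// Devices with no reading in a given minute still get NaN,
// but far fewer of them than with the irregular index.
val tsRdd = TimeSeriesRDD.timeSeriesRDDFromObservations(
  index, perMinute, "timestamp", "device", "value")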


Should I need a finer-grained, more detailed approach for future applications, I would be happy to discuss with you how we can improve the existing solution; I’ll get in touch if that happens.

Thanks again for your help and good luck for the future.



Best regards,

Geoffray BORIES

Data Scientist

Mobile (CZ): +420 733 781 523 (INT VPN 527-22 523)

