Hello Spark-TS support team,
Following a previous question I had about handling parallelized ARIMA modeling, I'm facing a new issue now that I've switched to real live data.
The issue is that my source data comes from several sensors that send their readings at irregular intervals. I created an IrregularDateTimeIndex from the timestamps I have in my database,
but since the various devices don't share the same timestamps, the resulting series contain a lot of NaN values, as you can see in the sample below:
({device:dev2},[NaN,NaN,NaN,0.342496961,NaN,NaN,NaN,NaN,NaN,0.676741958,NaN,NaN,NaN,NaN,NaN,1.01296675,NaN,NaN,NaN,NaN,NaN,1.35176611,NaN,NaN,NaN,NaN,NaN,1.68595779,NaN,NaN,NaN,NaN,NaN,2.02736568,NaN,NaN,NaN,NaN,NaN,2.37802577,NaN,NaN,NaN,NaN,NaN,2.71618986, ….
({device:dev3},[NaN,NaN,NaN,NaN,0.362333804,NaN,NaN,NaN,NaN,NaN,0.730960131,NaN,NaN,NaN,NaN,NaN,1.11360252,NaN,NaN,NaN,NaN,NaN,1.49264932,NaN,NaN,NaN,NaN,NaN,1.88771224,NaN,NaN,NaN,NaN,NaN,2.26302838,NaN,NaN,NaN,NaN,NaN,2.66522837,NaN,NaN,NaN,NaN,NaN,3.06329536,NaN, ….
({device:dev4},[NaN,NaN,NaN,NaN,NaN,0.543007612,NaN,NaN,NaN,NaN,NaN,1.0685935,NaN,NaN,NaN,NaN,NaN,1.61586177,NaN,NaN,NaN,NaN,NaN,2.15444374,NaN,NaN,NaN,NaN,NaN,2.69110751,NaN,NaN,NaN,NaN,NaN,3.22683144,NaN,NaN,NaN,NaN,NaN,3.74374008,NaN,NaN,NaN,NaN,NaN,4.26203299,NaN,NaN, ….
({device:dev5},[0.018188294,NaN,NaN,NaN,NaN,NaN,0.0353797004,NaN,NaN,NaN,NaN,NaN,0.0519058816,NaN,NaN,NaN,NaN,NaN,0.0694523975,NaN,NaN,NaN,NaN,NaN,0.0860962197,NaN,NaN,NaN,NaN,NaN,0.103950158,NaN,NaN,NaN,NaN,NaN,0.122693457,NaN,NaN,NaN,NaN,NaN,0.141855747,NaN,NaN,NaN,NaN,NaN, ….
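For reference, here is roughly how I build the index and the TimeSeriesRDD. This is only a simplified sketch assuming the 0.4.x API; observationsDF and the column names are placeholders for my real schema, and I am assuming here that timeSeriesRDDFromObservations accepts an irregular index:

import java.time.{ZoneId, ZonedDateTime}
import com.cloudera.sparkts.{DateTimeIndex, TimeSeriesRDD}

val zone = ZoneId.of("Europe/Prague")

// Every distinct timestamp seen across all devices, in ascending order.
// The index is the union of all devices' reporting times, so each device's
// vector ends up NaN wherever that particular device did not report.
val allTimestamps: Array[ZonedDateTime] = observationsDF
  .select("timestamp").distinct().collect()
  .map(r => ZonedDateTime.ofInstant(r.getTimestamp(0).toInstant, zone))
  .sortBy(_.toInstant.toEpochMilli)

val index = DateTimeIndex.irregular(allTimestamps)

// observationsDF has columns (timestamp, device, value)
val tsRdd = TimeSeriesRDD.timeSeriesRDDFromObservations(
  index, observationsDF, "timestamp", "device", "value")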
Of course, I cannot simply use the fill function: it would create more interpolated points than original ones, and the main concern is scale anyway. The time series are already long enough that fitting the ARIMA model on just one of them reaches the 10,000-iteration limit whenever I use parameters greater than 1.
I didn't see any built-in functions for dealing with NaNs in the spark-ts API. Do you have any ideas on how to handle this NaN issue?
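To make the question concrete, the only naive workaround I can picture looks like the sketch below (again assuming the 0.4.x API; the (p, d, q) values are just placeholders), but it throws away the irregular spacing entirely, which is exactly why I am asking:

import org.apache.spark.mllib.linalg.Vectors
import com.cloudera.sparkts.models.ARIMA

// Drop the NaNs from each device's vector and fit ARIMA on the remaining
// points only; this ignores the uneven gaps between those points.
val models = tsRdd.map { case (device, vector) =>
  val observed = vector.toArray.filter(v => !v.isNaN)
  val model = ARIMA.fitModel(1, 0, 1, Vectors.dense(observed)) // (p, d, q) are placeholders
  (device, model)
}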
Also, am I right in assuming that all the time series in a TimeSeriesRDD have to share the same time index? Is it a prerequisite that all devices send their data at the same time interval?
Many thanks in advance for your feedback.
Best regards,
Geoffray BORIES
Data Scientist
FOXCONN 4TECH (Foxconn CZ s.r.o.)
K Žižkovu 813/2, 190 00 Praha 9, Czech Republic
Mobile (CZ): +420 733 781 523 (INT VPN 527-22 523)
Hello Luis,
Thank you for this clarification; it confirms my first intuition.
Thank you for sharing your Github as well. The approach you took looks quite interesting, but as you said, since you are the only contributor it is still at an early stage of development. Let's hope your idea gets traction from the community soon.
Although I would like to participate, my priority right now isn't on developing this type of approach.
For the goal I'm pursuing at the moment, I decided to go with 1-minute-aggregated values from my NoSQL database; time-wise this makes more sense to me right now. Those values will be fitted to a defined uniform DateTimeIndex, making my data somewhat tabular, and I'm hoping this alternative will work; see the sketch below.
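Roughly, what I have in mind looks like this. It is only a simplified sketch: aggregatedDF, the dates, and the column names are placeholders for my actual setup, and I assume MinuteFrequency is available in the spark-ts version I'm on:

import java.time.{ZoneId, ZonedDateTime}
import com.cloudera.sparkts.{DateTimeIndex, MinuteFrequency, TimeSeriesRDD}

val zone = ZoneId.of("Europe/Prague")

// Uniform index covering the whole period at 1-minute resolution
// (start and end dates are placeholders).
val index = DateTimeIndex.uniformFromInterval(
  ZonedDateTime.of(2017, 1, 1, 0, 0, 0, 0, zone),
  ZonedDateTime.of(2017, 1, 31, 23, 59, 0, 0, zone),
  new MinuteFrequency(1))

// aggregatedDF holds the 1-minute averages coming out of the database,
// with columns (timestamp, device, value).
val tsRdd = TimeSeriesRDD.timeSeriesRDDFromObservations(
  index, aggregatedDF, "timestamp", "device", "value")

// The few minutes a device still misses can then be filled, e.g. linearly.
val filled = tsRdd.fill("linear")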
Should I need a finer-grained, more detailed approach for further applications,
I would be happy to discuss with you how the existing solution could be improved; I'll get in touch if that situation arises.
Thanks again for your help, and good luck going forward.
Best regards,
Geoffray BORIES
Data Scientist
Mobile (CZ): +420 733 781 523 (INT VPN 527-22 523)