Anomalies in raw data


Peter K

Dec 30, 2021, 8:32:36 AM
to golden-cheetah-users
Hi All,

Looking at anomalies in the raw data, I see "Data spike candidate". Can you point me to the documentation or the calculation that causes this message? Looking at the data, the jump in power between points 10039 and 10040 is larger than the jump between points 10040 and 10041.

Peter


DataSpikeGC.png

Peter K

Dec 30, 2021, 9:06:37 AM
to golden-cheetah-users
Here is a better picture of the data and the JSON for the ride.
DataSpikeGC2.png
Ride.json

Ale Martinez

Dec 30, 2021, 11:07:07 AM
to golden-cheetah-users
On Thursday, December 30, 2021 at 10:32:36 UTC-3, Peter K wrote:
Hi All,

Looking at anomalies in the raw data, I see "Data spike candidate". Can you point me to the documentation or the calculation that causes this message?

Peter K

Dec 31, 2021, 11:37:16 AM
to golden-cheetah-users
Thanks Ale. It appears that the "Data spike candidate" is set based on the "Fix Power Spikes" settings tool. The default setting is 200 watts so anything over that is setting the anomaly flag. 

Ale Martinez

Dec 31, 2021, 11:46:30 AM
to golden-cheetah-users
On Friday, December 31, 2021 at 13:37:16 UTC-3, Peter K wrote:
Thanks Ale. It appears that the "Data spike candidate" is set based on the "Fix Power Spikes" settings tool. The default setting is 200 watts so anything over that is setting the anomaly flag. 

I had the same idea when Eric introduced those changes, but he says it doesn't work that way: https://github.com/GoldenCheetah/GoldenCheetah/issues/3241#issue-534276843. Honestly, I don't use that feature.

Peter K

Dec 31, 2021, 12:50:29 PM
to golden-cheetah-users
I changed the default value of 200 to 458 (the PMax value from the CP chart) and left the variance at 20. Now I get the data spike anomaly for everything over 458, even for an entry at 458.377 watts. I'm not sure how the variance is used. I'll have to look into Eric's code and see if I can debug it. I was looking at the estimated power tool and started looking at the anomalies to see why I am getting power spikes over 1000 watts.

Ale Martinez

Dec 31, 2021, 5:13:49 PM
to golden-cheetah-users
I think 20 is too low a variability for cycling.

Eric Christoffersen

Jan 3, 2022, 7:24:58 PM
to golden-cheetah-users
Hey Peter,

The behavior is that a point is considered an anomaly if the power is above threshold X AND the value has more percent variance than Y. The variance is computed using a 30 second window.

The filter is used for anomaly detection as well as for the Fix Power Spikes tool. I really just use it to remove power spikes, and for that the actual threshold X doesn't really matter to me. I used 200 because I just don't care if I have spikes below 200. 20% variance worked OK for me to remove spikes: setting it higher didn't rewrite enough points, setting it lower rewrote too many, but that's just me. Mostly on my rides I just want to remove the billion watt values that come from wifi interference.

Variance of a set of points is the square of the sd. In this case it has units of watts^2. % of variance gives a threshold for what is considered an outlier. The behavior will be sort of weird to understand because it is dynamic based on noisiness of the current 30 second window: a 5 watt outlier in a perfectly stable region will get eliminated while a 20 watt outlier in a noisy region will not.
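To make the "dynamic" behaviour above concrete, here is a minimal Python sketch. It is not GoldenCheetah's code; the helper name `deviation_in_sd` and the sample windows are made up for illustration. It only shows how the same kind of outlier looks very different once it is measured against the spread of its own 30-sample window.

```python
from statistics import mean, pstdev

def deviation_in_sd(window, value):
    """Return how many window standard deviations `value` lies from the window mean.

    Hypothetical helper for illustration only; variance is simply sd squared.
    """
    sd = pstdev(window)
    centre = mean(window)
    if sd == 0:
        return 0.0 if value == centre else float("inf")
    return abs(value - centre) / sd

# Two made-up 30-sample windows, both with a mean of 200 W.
stable = [200, 201, 199, 200, 201, 199] * 5   # sd below 1 W
noisy = [150, 250, 180, 220, 160, 240] * 5    # sd of roughly 39 W

stable_score = deviation_in_sd(stable, 205)   # 5 W above the mean
noisy_score = deviation_in_sd(noisy, 220)     # 20 W above the mean

print(f"5 W outlier in stable window: {stable_score:.1f} sd")
print(f"20 W outlier in noisy window: {noisy_score:.1f} sd")
```

The 5 watt outlier sits several standard deviations from its window mean, while the 20 watt outlier is well inside one standard deviation of the noisy window, which matches the behaviour described in the paragraph above.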

I was skeptical that variance and threshold is the right way to do this, but it seems to work fine for me in practice. I think the right path for you is to raise the variance % instead of touching the power threshold.

Something that might improve behavior:
Suppose you have the threshold set to 200 watts and a noisy window with a mean of 400, and there are a few points at 5 watts. Those 5 watt points will be ignored because they don't exceed the threshold. Arguably they're just as much noise as an 800 watt sample. Maybe instead of requiring each point to exceed the threshold, we should test whether the mean of the window exceeds the threshold; once the mean exceeds the threshold, all points are candidates. This would let anomaly detection catch dropouts?
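The difference between the two gating rules can be sketched in a few lines of Python. This is only an illustration of the idea in the paragraph above, not code from GoldenCheetah; both function names and the sample window are hypothetical, and the variance test that would follow the gate is left out.

```python
from statistics import mean

def candidates_per_point(window, threshold):
    """Rule as described above: only points above the threshold are candidates."""
    return [p for p in window if p > threshold]

def candidates_by_window_mean(window, threshold):
    """Proposed rule: once the window mean exceeds the threshold, every point is a candidate."""
    return list(window) if mean(window) > threshold else []

# Made-up noisy window with a mean of 400 W, two 5 W dropouts and one 800 W sample.
window = [450, 460, 5, 440, 800, 470, 5, 450, 460, 460]

per_point = candidates_per_point(window, threshold=200)
by_mean = candidates_by_window_mean(window, threshold=200)

print(len(per_point), 5 in per_point)   # dropouts are never examined
print(len(by_mean), 5 in by_mean)       # dropouts become candidates
```

With the per-point rule the 5 watt samples never reach the variance test; with the window-mean rule they do, because the window as a whole is above the threshold.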

Peter K

Jan 4, 2022, 12:29:22 PM
to golden-cheetah-users
Thanks for the explanation, Eric. To be honest, I'll have to dig into the math a bit more to better understand what is going on. Also, it sounds like you developed the tool for a different use case. In your case, you are trying to filter out power spikes due to noise. In my case, I'm trying to use it to filter out power spikes generated by the "Estimate Power Values..." tool. I'm not sure if that is a correct use case, but here is why I was asking the questions.

I imported a ride from Strava and I see some large spikes in the slope. I applied the "Fix Elevation errors..." tool and that seems to fix the elevation and slope. At this point I'm not fixing the recording gaps, but I applied the "Estimate power values..." tool. I now see large spikes in power that are over 500 watts. I was trying to use the "Fix Power spikes..." tool to fix them. I set max = 458, variance = 200 and window = 13. You can see the result in the before and after pictures attached. First, I expected the spikes to drop closer to the 458 number, but that may be because I don't understand the math used. Second, I expected the anomalies to go away after the tool is applied, but as you can see the same rows are still marked as anomalies.

Again, I'm not sure I'm using the tool the way it was intended.

Peter

Peter K

Jan 4, 2022, 12:35:03 PM
to golden-cheetah-users
Sorry, forgot the attachments
After Fix Power Spikes.png
Before Fix Power Spikes.png

Paul J

Jan 4, 2022, 1:06:52 PM
to golden-cheetah-users
Hi Peter,

The median filtering option should remove large spikes. Tick the median box, set the variance to the largest step you expect in your data, and set the window to the number of samples for the median filter. Note that the Max field is not used for median filtering. If you use a 200 watt variance with the median filter, you will still have spikes of up to 200 watts away from the window's median value (the middle value of the window's samples ordered from lowest to highest). Try the default values in the dialog box below with the median filter set, and see if this removes the spikes.

median.PNG

The dialog description reads:

Power meters will occasionally report erroneously high values for power. For crank based power meters such as SRM and Quarq this is caused by an erroneous cadence reading as a result of triggering a reed switch whilst pushing off.

This function provides two algorithms that look for spikes/anomalies in power data and replace the erroneous data by:

i) Replacing the point in question with smoothed/interpolated data from either side of the point in question, it takes the following parameters:

Absolute Max (Watts)- this defines an absolute value for watts, and will smooth any values above this absolute value that have been identified as being anomalies (i.e. at odds with the data surrounding it)

Variance (Watts) - This determines the threshold beyond which a data point will be smoothed/interpolated, if the difference between the data point value and the 30 second rolling average wattage prior to the spike exceeds this parameter.

ii) Replacing the point in question with the median value of a window centred upon the erroneous data point. This approach is robust to local outliers, and preserves sharp edges, it takes the following parameters:

Window Size - this defines the number of neighbouring points used to determine a median value; the window size is always odd to ensure we have a central median value.

Variance (Watts) - Determines the threshold beyond which a data point will be fixed, if the difference between the data point value and the median value exceeds this parameter.
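The two approaches in the dialog description can be sketched roughly as follows. This is a reading of the dialog text only, not the actual GoldenCheetah implementation: the function names, the handling of the first samples and of window edges, the use of an absolute difference in the median test, and the parameter values in the example are all assumptions.

```python
from statistics import mean, median

def spikes_vs_rolling_average(watts, absolute_max, variance, window=30):
    """Approach (i): flag indices whose value is above `absolute_max` and also
    exceeds the mean of the preceding `window` samples by more than `variance`."""
    hits = []
    for i, w in enumerate(watts):
        prior = watts[max(0, i - window):i]
        if not prior:
            continue  # assumption: the first sample has no history to compare with
        if w > absolute_max and w - mean(prior) > variance:
            hits.append(i)
    return hits

def median_fix(watts, variance, window=5):
    """Approach (ii): replace points that differ from the median of a centred
    window by more than `variance`. The window is truncated at the edges."""
    half = window // 2
    fixed = list(watts)
    for i, w in enumerate(watts):
        neighbourhood = watts[max(0, i - half):i + half + 1]
        m = median(neighbourhood)
        if abs(w - m) > variance:
            fixed[i] = m
    return fixed

ride = [210, 215, 205, 1200, 220, 212, 208]   # made-up 1 s power samples

flagged = spikes_vs_rolling_average(ride, absolute_max=458, variance=200)
repaired = median_fix(ride, variance=20)

print(flagged)
print(repaired)
```

In this toy ride only the 1200 watt sample is flagged by the first approach, and the median approach replaces that same sample with the median of its neighbours while leaving the other samples untouched. Note that the median approach has no absolute maximum, which is consistent with the remark that the Max field is not used for median filtering.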

Paul.

Eric Christoffersen

Jan 4, 2022, 4:22:21 PM
to golden-cheetah-users
I didn't write the anomaly tool, I just fixed some of its bugs. It used to be really broken and it now seems to work OK for me. It's not perfect, but I think it does a reasonable enough job; it at least does what it says it will do. It looks like the interface has changed though, so maybe it works better now.

A problem with your approach: beware "Fix elevation errors". It goes out on the internet and gathers elevation data from a map service. That data is often noisy and sometimes has a really low resolution, so it often solidly damages your elevation data. I think that's the core issue: it is corrupting your ride.

For example I once received elevation data from a map service that only offered ~1m resolution. When the ride's lat/lon mapped to something near a whole meter the reported elevation would go up/down by a meter for each sample. Where the ride was supposed to see a continuous 8% gradient I was getting +/- 60%. This is called quantization and can be really difficult to remove.
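The quantization effect is easy to reproduce with made-up numbers. The sketch below assumes a steady 8% climb sampled once per second at 6 km/h (so about 1.67 m between samples) and an elevation source that rounds to whole metres; the speed and the rounding rule are assumptions chosen for illustration, not details from the ride discussed here.

```python
def slopes_percent(altitudes_m, step_m):
    """Slope in percent between consecutive samples spaced `step_m` metres apart."""
    return [100.0 * (b - a) / step_m for a, b in zip(altitudes_m, altitudes_m[1:])]

grade = 0.08            # true gradient: 8%
step_m = 5.0 / 3.0      # assumed distance per 1 s sample at 6 km/h

true_alt = [grade * step_m * k for k in range(31)]
quantized_alt = [float(round(a)) for a in true_alt]   # 1 m resolution

true_slopes = slopes_percent(true_alt, step_m)
quantized_slopes = slopes_percent(quantized_alt, step_m)

print(min(true_slopes), max(true_slopes))
print(min(quantized_slopes), max(quantized_slopes))
```

The true slope stays at 8% throughout, while the quantized altitude produces long runs of 0% interrupted by single samples of about 60%, each time the rounded altitude steps up by a metre.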

I think the better approach is to use Fix Gps Errors to filter your altitude signal. I wrote it and it works well for me for smoothing rides before re-riding them virtually. It's not perfect but at least it is fair.

If estimate power is generating spikes then it means your route data is damaged. You need to fix that, not just remove spikes after the fact. I was thinking that someday it might be interesting to filter route data by ensuring that the route was possible with reasonable power.

Peter K

Jan 5, 2022, 9:23:52 AM
to golden-cheetah-users
@Paul - Thanks for the suggestion. I did try the median option with the settings in the picture, but I still get a lot of spikes above 500 watts.

@Eric - I agree that the data is the cause of the spikes. I was hoping to use the anomalies to identify the issues with the data. I'll take a look at the code for the fix GPS errors. I didn't realize that the tool is fixing elevation too.

Peter

Ale Martinez

Jan 5, 2022, 10:37:03 AM
to golden-cheetah-users
On Tuesday, January 4, 2022 at 14:29:22 UTC-3, Peter K wrote:
Thanks for the explanation, Eric. To be honest, I'll have to dig into the math a bit more to better understand what is going on. Also, it sounds like you developed the tool for a different use case. In your case, you are trying to filter out power spikes due to noise. In my case, I'm trying to use it to filter out power spikes generated by the "Estimate Power Values..." tool. I'm not sure if that is a correct use case, but here is why I was asking the questions.

I think the intended use of Fix Spikes is to filter interference. The reason I don't use it is that I don't experience interference, riding mostly outdoors.

I see too much focus on indoor riding at continuous intensities, while riding outdoors is much more variable due to changing terrain, sprints, etc., so the defaults may not be adequate.

WRT Estimate Power, I think it is not very useful in general riding conditions, mainly because aerodynamic resistance is a big component which cannot be reliably estimated.
Anyway, the minimum requirements are good quality speed and altitude data, and GPS data is crap for that. Calibrated speed sensors and a temperature-compensated barometric altimeter may help, but again, I am not sure it is worth it except in very special conditions.
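To give a feel for how strongly an estimate depends on slope quality, here is a minimal sketch using the standard steady-state road cycling power balance (gravity, rolling resistance and aerodynamic drag). It is not the code behind GoldenCheetah's Estimate Power tool; the function name and every parameter value (80 kg system mass, Crr 0.005, CdA 0.3 m^2, air density 1.2 kg/m^3) are assumptions, and acceleration, wind and drivetrain losses are ignored.

```python
import math

def steady_power_watts(speed_mps, slope_percent, mass_kg=80.0, crr=0.005,
                       cda_m2=0.3, air_density=1.2):
    """Power needed to hold `speed_mps` on a constant slope, in still air."""
    g = 9.81
    theta = math.atan(slope_percent / 100.0)
    gravity = mass_kg * g * math.sin(theta)              # force along the slope
    rolling = mass_kg * g * crr * math.cos(theta)        # rolling resistance
    aero = 0.5 * air_density * cda_m2 * speed_mps ** 2   # aerodynamic drag
    return (gravity + rolling + aero) * speed_mps

flat = steady_power_watts(5.0, 0.0)     # roughly 42 W
climb = steady_power_watts(5.0, 11.0)   # roughly 470 W

print(f"0% slope: {flat:.0f} W, 11% slope: {climb:.0f} W")
```

At 5 m/s, a slope reading that is wrong by 11 percentage points changes the estimate by more than 400 watts under these assumptions, so a single bad slope sample is enough to produce a large power spike.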

Ale Martinez

Jan 6, 2022, 8:25:23 AM
to golden-cheetah-users
On Wednesday, January 5, 2022 at 12:37:03 UTC-3, Ale Martinez wrote:
WRT Estimate Power, I think it is not very useful in general riding conditions, mainly because aerodynamic resistance is a big component which cannot be reliably estimated.
Anyway, the minimum requirements are good quality speed and altitude data, and GPS data is crap for that. Calibrated speed sensors and a temperature-compensated barometric altimeter may help, but again, I am not sure it is worth it except in very special conditions.

I think an example of what I mean by "very special conditions" may help: if you have access to a sheltered road (or a trail with very good footing) without passing vehicles, ideally steep, where you can ride solo without using the brakes and without changing position, power estimation can give very good results with a speed sensor and calibrated altimetry. You can even coast without pedaling to a stop from different speeds to estimate CdA and Crr for better results.

OTOH, if there is variable footing, wind, obstacles or passing traffic, or if you change positions (e.g. tops/hoods/drops/aerobar), ride in a group, use the brakes, or use GPS data for second by second speed or altitude, there is no chance power estimation (especially second by second) can provide accurate results.

I understand this is a hard pill to swallow for folks using power on their trainers and wishfully thinking they can have accurate second by second power outside without buying a power meter, but everyone who has used a power meter outside, or has a good understanding of the physics of the problem, knows that will not happen.

Peter K

Jan 7, 2022, 9:47:09 AM
to golden-cheetah-users

I understand that virtual power has many variables and will not be accurate based on physics, and I'm OK with an estimate. My challenge is the data. It appears that FIT files provide the most data, which is collected from an HR sensor, a speed sensor, a cadence sensor and a Garmin 1030. Assuming lon, lat and elevation (barometric altimeter) are OK, I'm looking at the slope included in the FIT file and I have no idea how it is calculated. Looking at the picture below (data in metric system) you can see that the altitude doesn't change but the slope changes. The biggest change, from 00:41:37 to 00:41:38, goes from 11% to 0% while the altitude change is 0.19m and the distance is less than 5m. In addition, after 00:41:38 the elevation doesn't change but the slope changes. I looked at Garmin FIT files from 3 different units and I see the same issues with the slope. Does anyone else see this or is it just me? Or am I misinterpreting the data?
SlopeIssues.png

Ale Martinez

Jan 7, 2022, 10:10:41 AM
to golden-cheetah-users
On Friday, January 7, 2022 at 11:47:09 UTC-3, Peter K wrote:

I understand that virtual power has many variables and will not be accurate based on physics, and I'm OK with an estimate. My challenge is the data. It appears that FIT files provide the most data, which is collected from an HR sensor, a speed sensor, a cadence sensor and a Garmin 1030. Assuming lon, lat and elevation (barometric altimeter) are OK, I'm looking at the slope included in the FIT file and I have no idea how it is calculated. Looking at the picture below (data in metric system) you can see that the altitude doesn't change but the slope changes. The biggest change, from 00:41:37 to 00:41:38, goes from 11% to 0% while the altitude change is 0.19m and the distance is less than 5m. In addition, after 00:41:38 the elevation doesn't change but the slope changes. I looked at Garmin FIT files from 3 different units and I see the same issues with the slope. Does anyone else see this or is it just me? Or am I misinterpreting the data?
SlopeIssues.png

When you import a FIT file the slope computed by the device is used if present in record messages: https://github.com/GoldenCheetah/GoldenCheetah/blob/f4630ea078366227e2f2427d8928e253fd2079be/src/FileIO/FitRideFile.cpp#L1884, so your questions should be directed to Garmin in this case.


You can force the second behavior deleting the slope series in Editor and saving the activity, IIRC.

Peter K

Jan 7, 2022, 3:56:16 PM
to golden-cheetah-users
Thanks Ale, I like the second behavior. However, I have a question on the code: https://github.com/GoldenCheetah/GoldenCheetah/blob/f4630ea078366227e2f2427d8928e253fd2079be/src/FileIO/RideFile.cpp#L2543
Why is deltaDistance multiplied by 10?

Ale Martinez

Jan 7, 2022, 6:00:52 PM
to golden-cheetah-users
On Friday, January 7, 2022 at 17:56:16 UTC-3, Peter K wrote:
Thanks Ale, I like the second behavior. However, I have a question on the code: https://github.com/GoldenCheetah/GoldenCheetah/blob/f4630ea078366227e2f2427d8928e253fd2079be/src/FileIO/RideFile.cpp#L2543
Why is deltaDistance multiplied by 10?
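
No answer to this question survives in the thread. One plausible reading, offered only as an assumption about the linked code and not confirmed here, is a unit conversion: if the distance difference is in kilometres and the altitude difference in metres, a slope in percent is 100 * dAlt / (dKm * 1000), which simplifies to dAlt / (dKm * 10). The sketch below uses a hypothetical function name and the 0.19 m over 5 m figures quoted earlier in the thread.

```python
def slope_percent(delta_alt_m, delta_distance_km):
    """Slope in percent from an altitude change in metres and a distance in kilometres.

    100 * delta_alt_m / (delta_distance_km * 1000) == delta_alt_m / (delta_distance_km * 10)
    """
    if delta_distance_km <= 0:
        return 0.0  # assumption: no movement means no defined slope, report 0
    return delta_alt_m / (delta_distance_km * 10.0)

# 0.19 m of climb over 5 m (0.005 km) of travel
example = slope_percent(0.19, 0.005)
print(f"{example:.1f}%")
```

Under this reading, 0.19 m over 5 m is a slope of about 3.8%, and any shorter distance for the same altitude change gives a steeper value.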

Peter K

Jan 8, 2022, 8:23:12 AM
to golden-cheetah-users
Thanks, I tried the second behavior and I'm not sure I understand the results. After deleting the slope column and saving the ride I get the following. If the altitude doesn't change, shouldn't the slope be zero?

GC Slope.png

Ale Martinez

Jan 8, 2022, 12:13:14 PM
to golden-cheetah-users
On Saturday, January 8, 2022 at 10:23:12 UTC-3, Peter K wrote:
Thanks, I tried the second behavior and I'm not sure I understand the results.

Second case is when the file doesn’t include slope.
 
After deleting the slope column and saving the ride I get the following.

That was an educated guess; I have not tested that scenario.
 
If the altitude doesn't change, shouldn't the slope be zero?

Yes for the computed one, likely that’s the original slope imported from the file.

Peter K

Jan 8, 2022, 5:00:23 PM
to golden-cheetah-users
Just FYI, I imported a TCX file (there is no slope data in it) and I see the same result as in the picture above. Is it possible that something else is changing the slope in the raw data view?

Ale Martinez

Jan 8, 2022, 7:31:39 PM
to golden-cheetah-users
On Saturday, January 8, 2022 at 19:00:23 UTC-3, Peter K wrote:
Just FYI, I imported a TCX file (there is no slope data in it) and I see the same result as in the picture above. Is it possible that something else is changing the slope in the raw data view?

I don't think this is related to the Ride Editor (you can check the JSON file to confirm), but rather to the smoothing applied after slope is derived: https://github.com/GoldenCheetah/GoldenCheetah/blob/0e8702c3b0b82f7135cab1eca955ca8e10af03ec/src/FileIO/RideFile.cpp#L2673
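
A small sketch shows how smoothing alone can leave a non-zero slope on samples where the altitude no longer changes. The smoothing used at the linked line is not reproduced here; a centred moving average stands in for it, and the function name, window size, altitude values and 5 m sample spacing are all assumptions.

```python
def moving_average(values, window=5):
    """Centred moving average; the window is truncated at both ends of the list."""
    half = window // 2
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - half):i + half + 1]
        out.append(sum(chunk) / len(chunk))
    return out

# Made-up samples: a short climb followed by flat ground, 5 m between samples.
altitude_m = [100.0, 100.5, 101.0, 101.0, 101.0, 101.0, 101.0]
step_m = 5.0

# Raw derived slope in percent; the first sample has no predecessor, so it gets 0.
raw = [0.0] + [100.0 * (b - a) / step_m for a, b in zip(altitude_m, altitude_m[1:])]
smoothed = moving_average(raw)

for alt, r, s in zip(altitude_m, raw, smoothed):
    print(f"alt={alt:6.1f} m  raw={r:5.1f}%  smoothed={s:5.2f}%")
```

The raw slope drops to 0% as soon as the altitude stops changing, but the smoothed slope stays positive for a couple more samples because the window still contains the earlier climb.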

Peter K

Jan 10, 2022, 11:34:22 AM
to golden-cheetah-users
Thanks, that makes sense. I'll take a look at the smoothing. Is there any way to bypass any processing during import?

Ale Martinez

Jan 10, 2022, 11:55:45 AM
to golden-cheetah-users
On Monday, January 10, 2022 at 13:34:22 UTC-3, Peter K wrote:
Thanks, that makes sense. I'll take a look at the smoothing. Is there any way to bypass any processing during import?

There is no processing when slope is present in the imported file, only derived slope is smoothed.

This is not configurable, but it is possible to use a Python Data Processor for a different computation.