Dirty WSPR data?


Onno VK6FLAB

Feb 22, 2024, 11:52:13 PM
to ham...@googlegroups.com
Before I spend the next month (or year - ha!) attempting to categorise and quantify the nature and level of dirty data in the WSPR dataset, I thought I'd ask here first to see if anyone has already done this work.

Some examples:
  • Across 191 months there are 18,873 reports showing a TX power level greater than 60 dBm, several claiming 103 dBm (a minimal filter sketch follows this list).
  • I have made visualisations of grid squares for both transmit and receive and observed that *all* squares on the planet have been "claimed". Digging into this per band shows all manner of patterns. This might be due to poor decoding, rather than activation.
  • I know of at least one receiver reporting transmissions on the wrong band. I detected this because I operate a transmitter that was reported incorrectly. It appears that the receiver reported other stations on the wrong band.
  • I have previously simulated SNR impacts on WSPR decodes and have shown that decode behaviour changes between decoder versions, meaning that the 80% successful decode rate is not consistent from version to version.
  • I've found gaps in reporting. Specifically, shortly after the hybrid solar eclipse on the 20th of April, 2023 at 04:17:56 UTC, I went looking for data; the data for just under two hours and 12 minutes before the eclipse and the 38 minutes following it was missing, apparently due to an outage. I have not checked since to see if the data magically reappeared.
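
To make the first bullet concrete, here is a minimal sketch in Python/pandas of the kind of check I'm running. The file name and column names (id, band, power) are my own assumptions about a CSV export, not the actual archive schema:

    import pandas as pd

    # Assumed CSV export of the WSPR archive; file and column names are illustrative.
    spots = pd.read_csv("wspr_spots.csv", usecols=["id", "band", "power"])

    # Flag physically implausible TX power claims: anything above 60 dBm (1 kW).
    suspect = spots[spots["power"] > 60]
    print(len(suspect), "spots claim more than 60 dBm")
    print(suspect["power"].value_counts().head(10))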

This leads to some other questions.
  • Can the WSPR data-set be "trusted" and to what extent?
  • How should research take into account spurious data?
  • Is there any prior activity on this front?

Anyone?

--
73, de Onno VK6FLAB

Martin

Feb 23, 2024, 12:27:52 AM
to ham...@googlegroups.com
There may be other misusers, but in the amateur radio ballooning world, extended Type 2 or 3 WSPR messages are used with repurposed fields. The power field is often used to encode altitude, but there is a WSPR buoy in the Pacific, KQ6RS, that is using it for extra lat/long resolution. There are several competing encoding schemes. I think that some are sending several compound packets. The second packet has a hashed version of the call sign, and there are hash collisions. There are also collisions from stations transmitting on the same frequency.

Most of the balloon traffic is likely on 20 m, but there are Techs who operate on 10 m. See http://lu7aa.org/wsprx.asp or https://amateur.sondehub.org/

I'd like to see a better non-dirty system for this use case, but that's just me.

73 Martin W6MRR

      


Gwyn Griffiths

Feb 23, 2024, 7:26:45 AM
to HamSCI
Onno
I commend your intention to categorise and quantify the nature and level of dirty data in the WSPR dataset - it could be a valuable report for those using WSPR for propagation and other studies. Here are some responses to your three questions:

1. Can the WSPR data-set be "trusted" and to what extent?
  • I've always found the motto of the UK national science academy (the Royal Society), 'Nullius in verba', taken to mean 'take nobody's word for it', a good starting point. It certainly applies to WSPR.
  • To what extent one should trust the data is, I suspect, closely tied to the nature of the research approach, the questions being asked, and the resulting requirements placed on the data. A research approach based on 'big data' to study a propagation 'climate' question may be more tolerant of dirty data than a single-path study looking to identify and quantify less common propagation modes, e.g. chordal hops or Pedersen rays, where one might not be sure whether an outlier was bad data or a real propagation mode.
2. How should research take into account spurious data?
  • There are the first-cut filters, e.g. you mentioned transmitter power,  and Martin W6MRR mentioned balloon data.
  • More difficult questions to answer are those on the reliability of SNR and frequency. For example, can the time series of SNR at a station be taken to be a reliable proxy for signal level? Another example: I see two spots decoded 20 Hz apart with some 20 dB difference in SNR - am I seeing a spurious signal from the transmitter or am I seeing a Doppler shifted echo from aircraft scatter?
  • These and many other questions have come up in studies I've been involved with, for which presentations and reports are available. In many cases it has only been through correspondence with those transmitting and receiving WSPR (and FST4W) that I've had the confidence in the data subsets I ended up using.
3. Is there any prior activity on this front?
  • WsprDaemon, Rob Robinett AI6VN's decoding and reporting system for WSPR and FST4W, tackles several of the points you raise, including:
    a) Spots are cached locally on the receive-site computer and uploaded once service at wsprnet.org resumes after an outage.
    b) A separate upload to extended data tables on the triplicated WsprDaemon servers includes additional information from the wsprd and jt9 decoders that may be of help with analysis. For example, the osd_decode flag shows whether a spot was decoded with the Fano decoder, which can fail to decode (0), or the Ordered Statistics Decoder (1), which will always produce a decode but which may be wrong (there is a procedure to minimise, but not remove, the chance of a false spot being reported). See the sketch below.
    c) Measurement of frequency spread, as this is a source of variability in decode likelihood alongside SNR. Spread is made up of contributions from the transmitter, the receiver and propagation. Some receivers / transmitters may use up half the available frequency-spread budget. Unfortunately, it is only by personally gathering metadata from individual operators that one can identify, for example, those using GPS-disciplined oscillators.
For studies I'm involved with, what works well is: a) using WsprDaemon, b) asking questions that can be answered using single-path data, and c) corresponding with the operators to find the essential metadata. I've invariably found that they are pleased to hear that their transmissions / reception reports are being used.
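
As a rough illustration of point 3(b), here is the kind of first-cut filter I mean, sketched in Python/pandas. It assumes the extended spots have been exported to a CSV with an osd_decode column; the file name and everything else about the layout are illustrative assumptions, not the actual WsprDaemon schema:

    import pandas as pd

    # Illustrative export of a WsprDaemon extended spot table; names are assumed.
    spots = pd.read_csv("wsprdaemon_extended_spots.csv")

    # osd_decode == 0: Fano decoder (can fail to decode, rarely wrong).
    # osd_decode == 1: Ordered Statistics Decoder (always decodes, occasionally wrong).
    fano = spots[spots["osd_decode"] == 0]
    osd = spots[spots["osd_decode"] == 1]
    print(f"Fano decodes: {len(fano)}, OSD decodes: {len(osd)}")

    # A cautious single-path study might keep only the Fano decodes.
    conservative = fano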

regards
Gwyn G3ZIL

Bob Gerzoff

Feb 23, 2024, 7:45:08 AM
to ham...@googlegroups.com

Onno,

I have been looking at the WSPR SNR levels and will give a talk at the upcoming HamSCI conference titled “Extreme Values in Short-Term 2023 Twenty Meter Sequential Matched WSPR Observations.” This is a follow-up to a presentation I did a few years back. Without giving away the entire story, let me mention that I find enough generalizable patterns within the extreme-value SNR reports to make me believe that there are underlying causal phenomena that warrant investigation, and to suggest that the SNR data can be useful. Dealing with the “eccentricities,” however, will require some finesse.

I’m happy to go over my work with you one-on-one before or after the conference, whatever works best for you, and perhaps we can collaborate on some investigations.

73,

Bob, WK2Y

PS As I am about to send this off, I see Gwyn has sent an email with some great things to consider.


Black Michael

Feb 23, 2024, 7:45:11 AM
to ham...@googlegroups.com
Some operators do not use CAT control -- so they may show up as being on the wrong band/frequency.

hamspots.net, for example, blackballs those people who are obviously reporting the wrong band.

And WSPR does not contain CRC checking like the FT modes do, so bad grid decodes are more than likely, which is why grids should be tied to callsigns for analysis by preponderance (see the sketch below).
One would have to curate the data a fair bit, but it is less likely to have a bad call and a bad grid at the same time.

As for comparing SNR, you can only do that against the same station, so the WSPR version won't matter there.
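
Here is a rough sketch of the preponderance idea in Python/pandas; the file and column names (tx_call, tx_grid) are assumed, and genuinely mobile stations such as balloons would need to be set aside first:

    import pandas as pd

    # Assumed column names for an exported spot file.
    spots = pd.read_csv("wspr_spots.csv", usecols=["id", "tx_call", "tx_grid"])

    # Most frequently reported grid per transmitting callsign ("preponderance").
    usual_grid = spots.groupby("tx_call")["tx_grid"].agg(lambda g: g.mode().iloc[0])

    # Flag spots whose reported grid disagrees with that callsign's usual grid.
    spots["grid_suspect"] = spots["tx_grid"] != spots["tx_call"].map(usual_grid)
    print(spots["grid_suspect"].sum(), "spots disagree with the preponderant grid")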

Mike W9MDB


Bill Liles

Feb 23, 2024, 8:19:56 AM
to ham...@googlegroups.com
You might find parts of Sam Lo's PhD dissertation of interest as he had to clean up the WSPR data for his research.
Use of Novel Distributed Instrumentation in Ionospheric Research

He had a very specific research question so he only addressed the problems related to his research. For example, if a given transmitter was not received by a given receiver, he had to determine if the transmitter was off or the propagation did not support that path.

Bill

keith....@gmail.com

Feb 23, 2024, 9:26:36 AM
to HamSCI

Hello Onno,

I like your report of what you’re seeing in the database. For one, I’ve seen my beacon misreported as being on an incorrect band, and on occasion I’ve seen location data that is way off the mark.  I attribute some of that to the nature of transmitting and receiving signals that are so near the noise level.  Another possibility that I wonder about is collisions between my signal and stronger signals.  When two signals get mixed together, it may be difficult for this automated system to sort out which is which.  Still another explanation might be changes in the ionosphere that occur during the 110.6 seconds of each transmission. 

I’ve been a researcher for over 50 years. I’ve hardly ever found a perfect data set. There’s always some outlier. One of the most time-consuming tasks can be to either clean up the data or exclude the spurious outliers. Analysis of outliers can lead to the most important findings. On the other hand, outliers might be noise that is the result of natural processes. Knowing the difference between those that are important and those that are spurious can be subjective. Often it takes a lot of time to track down each outlier and determine its importance. One practice that I’ve followed is to not recode the original database. That’s because I might later find the recode was incorrect, and without an ability to restore the original, the findings could end up being incorrect.

When working with a new data set, the first thing I do is calculate distributions of key values.  Some data set values are normally distributed, others might have any of a variety of patterns.  They might be bimodal, or skewed.  It’s fascinating work, and very often there are outliers.  And then the next question is, what is the meaning of the outliers, and what meaning can be assigned to the patterns?     
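
For instance, a quick first pass over an exported spot file in Python/pandas (the file and column names here are just my assumptions) shows the shape of the key variables before any outlier hunting:

    import pandas as pd

    # Assumed CSV export with illustrative column names.
    spots = pd.read_csv("wspr_spots.csv", usecols=["snr", "power", "distance"])

    # Summary statistics with wide percentiles to expose the tails.
    print(spots.describe(percentiles=[0.01, 0.05, 0.5, 0.95, 0.99]))

    # A coarse binned distribution of SNR to spot skew or multiple modes.
    print(spots["snr"].value_counts(bins=20, sort=False))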

On the question of reports of 60+ dBm, I think the task is to sort out why that occurs. In the opposite direction, if a near receiver and a distant receiver both report an SNR of -29 dB, I wonder if that's because of the path, some other natural phenomenon, or maybe a difference in the receiving equipment. For me, the explanation is still outstanding.

Keith

Onno VK6FLAB

Feb 28, 2024, 10:44:50 PM
to ham...@googlegroups.com
Wow, what a response! Thank you all for your thoughtful comments, sample code, and advice, both sent here and privately. While I'm still working my way through the provided links, suggestions and kind offers, it appears that there isn't any form of structured representation of what I've until now called "dirty data".

To address this, I propose the creation of an annotation data-set, which I'm happy to create, for the sole purpose of tracking records in the WSPR data-set that have values that appear out of the norm. Given that each row has a unique ID, it seems useful to use it as the primary key.

I spent some time considering the best way to achieve this and, given my software development background and preference for version control, I arrived at publishing this as a GitHub repository [1]. This would allow people to contribute in a structured fashion and allow errors and omissions to be handled as issues, patches and commits - in other words, it creates a structured framework to manage change.

Given that there are (currently) 6.7 billion rows of data, this data-set could quickly grow to exceed the maximum size of a file on GitHub, 2 GB, so I propose to structure the "WSPR annotations repository" as a collection of folders, one for each type of "anomaly", each containing a README.md file with a full description of what the folder represents, and a CSV file with a column for each of the following items:
  • WSPR Row ID
  • Reason Code
I also propose a repository "meta file" that maps Reason Codes to something that humans can understand, perhaps as free-form text or an SQL statement, like SELECT ID FROM WSPR WHERE Power > 60;

This implies that you can have multiple reasons in a single annotation set, which seems fitting, since as the understanding of the data improves, this is likely to evolve.

This structure would allow a researcher to create a list of IDs matching a research criterion and use them to select the matching annotated rows from the full WSPR data-set or, alternatively, to exclude those rows from the data-set being investigated (a usage sketch follows below).
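
To make the intended workflow concrete, here is a minimal Python/pandas sketch of how a researcher might apply two hypothetical annotation folders to the full data-set. All file and column names (wspr_row_id, reason_code, id) are placeholders for the structure proposed above, not a finished design:

    import pandas as pd

    # Placeholder paths matching the proposed folder-per-anomaly layout.
    balloons = pd.read_csv("balloon_telemetry/annotations.csv")   # columns: wspr_row_id, reason_code
    wrong_band = pd.read_csv("wrong_band/annotations.csv")

    exclude_ids = set(balloons["wspr_row_id"]) | set(wrong_band["wspr_row_id"])

    # Full WSPR data-set, keyed on its unique row ID.
    spots = pd.read_csv("wspr_spots.csv")
    clean = spots[~spots["id"].isin(exclude_ids)]      # exclude annotated rows
    balloons_only = spots[spots["id"].isin(set(balloons["wspr_row_id"]))]  # or select them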

This type of structure will also allow a researcher to gather a list of reports from things like balloons and add them as an annotated set.

Before I embark on this process: have I missed anything? Is there something you'd do differently?

One final question: how should this be licensed? Should it be "Public Domain", one of the (currently six) "Creative Commons" licenses, or something else?

Anomalies I'm aware of:
  • Power Levels that appear to represent other information, like altitude associated with Balloons
  • Callsigns that identify balloons - allowing for them to be identified in the data
  • A reported bug where, if WSJT-X switches bands while decoding, the uploaded report uses the wrong band
  • Improbable grid square locators
Your comments and feedback are welcome.

Kind regards,
o

[1] Yes, I'm aware that Microsoft purchased GitHub and that there are implications in relation to hosting there. I've not yet found a suitable replacement, short of hosting it on my own infrastructure.

