8-12% missing or misattributed network annotations between 2020-03-10 and 2023-02-09


Stephen Soltesz

Mar 2, 2023, 3:50:04 PM
to discuss

This only affects “client.Network” (e.g. ASN and ASName) annotations on M-Lab data collected between 2020-03-10 and 2023-02-09. The “client.Geo” (e.g. Latitude, Longitude, Subdivision1ISOCode, City and Country) annotations are not affected. We are working to correct these annotations by early April or sooner.


Impact

The network annotations on all data collected between 2020-03-10 and 2023-02-09 may be incorrect. We estimate ~7-10% are missing and ~1-2% are attributed to an incorrect (larger) network address block. These errors were not random: they depended on the client IP being annotated. So, if a client IP was annotated incorrectly, it continued to receive the same incorrect annotation throughout this period.


We deployed a fix for new annotations on 2023-02-09. So, all data collected since 2023-02-10 will be correct. We are working on a plan to repair the historical network annotations between 2020-03-10 and 2023-02-09.


Unfortunately, until the historical data is reprocessed we will not know precisely which historical annotations are incorrect. We cannot identify present-but-incorrect annotations until we recreate the annotation correctly. For aggregate analysis using the ASNs, you should expect ~1-2% errors. For analysis targeting specific networks and depending on the ASN annotations, the impact is harder to quantify and could be much higher.


Context

On 2020-03-10, M-Lab introduced a measurement annotation process (uuid-annotator) that runs at measurement-time on nodes rather than during post-processing by the data pipeline. This architectural change decoupled the collection of annotations from the need to archive client IP addresses.


However, we recently discovered that the percentage of missing annotations was unexpectedly high, ~10%. After further investigation, we found a fundamental bug in the uuid-annotator's network annotations that caused both the missing annotations and the misattributed annotations. Based on a prototype reprocessor, we estimate that 1-2% of annotations carry an incorrect ASN because a shorter network prefix was chosen over the correct longer prefix, e.g. 12.0.0.0/8 vs 12.a.b.0/24.
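
To illustrate the prefix issue, here is a minimal Python sketch of how choosing a shorter covering prefix instead of the longest one changes the resulting ASN. This is not the uuid-annotator code. The table, the function names, and the concrete 12.34.56.0/24 prefix are assumptions made up for the example, and the ASNs are reserved documentation values.

```python
import ipaddress

# Hypothetical prefix-to-ASN table. The prefixes and ASNs are made up for
# illustration (64500 and 64501 are reserved documentation ASNs).
ROUTES = [
    (ipaddress.ip_network("12.0.0.0/8"), 64500),
    (ipaddress.ip_network("12.34.56.0/24"), 64501),
]


def covering_routes(ip, routes):
    """Return every (network, asn) pair whose network contains ip."""
    addr = ipaddress.ip_address(ip)
    return [(net, asn) for net, asn in routes if addr in net]


def least_specific_asn(ip, routes):
    """Illustrates the misattribution: picks the shortest covering prefix."""
    matches = covering_routes(ip, routes)
    if not matches:
        return None  # no covering prefix in the table
    return min(matches, key=lambda m: m[0].prefixlen)[1]


def longest_prefix_asn(ip, routes):
    """Expected behavior: picks the most specific (longest) covering prefix."""
    matches = covering_routes(ip, routes)
    if not matches:
        return None  # no covering prefix in the table
    return max(matches, key=lambda m: m[0].prefixlen)[1]
```

With this table, `least_specific_asn("12.34.56.7", ROUTES)` returns 64500 (the /8), while `longest_prefix_asn("12.34.56.7", ROUTES)` returns 64501 (the /24). Because the result depends only on the client IP and the table, the same IP would be misattributed the same way every time.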


Repair

Because the annotation and hopannotation1 datatypes are collected at measurement-time without the client IP, these annotations were originally intended to be created once and loaded directly into BigQuery by the data pipeline. Recreating the annotation was not part of the original design. So, to repair these network annotations we must build a new data processing utility to recreate the annotation archives and reprocess them with the existing pipeline.


We estimate this work will take four to six weeks, ideally finishing by early April.


More information and updates will be added here:


Please let us know if you have any questions or concerns.

Livingood, Jason

Mar 3, 2023, 8:51:20 AM
to discuss

Thanks for this disclosure! I assume that anyone who developed an analysis based on network identifiers should re-run that analysis once the data is corrected sometime in April 2023?

 

Jason


Stephen Soltesz

Mar 3, 2023, 2:08:54 PM
to Livingood, Jason, discuss
That's correct. We'll send a follow up note when reprocessing is complete.

Stephen Soltesz

Apr 3, 2023, 2:20:29 PM
to discuss
TL;DR
  • If you only use the unified NDT views, there will temporarily be fewer annotated tests for Apr 3rd. All annotations will be restored for Apr 3rd after the repair process is complete.
  • Today we are adding new datatypes, annotation2 and hopannotation2, to replace the earlier annotation and hopannotation1 datatypes. This prepares for the repair process, which starts this week and continues next week.
Summary of Changes

M-Lab is renaming the datatypes for "annotation" and "hopannotation1". The new datatype names are "annotation2" and "hopannotation2". These new datatypes will only include corrected annotations. The old datatypes will not be updated after April 4th. New and repaired data will only be added to the new datatypes.

The daily data pipeline for April 3rd will be partially annotated, since it will have a mixture of old and new datatypes. This will be temporary until the historical data can be reprocessed after the annotation repair and rename. Until then, this may appear in analysis as fewer annotated tests for April 3rd.

The daily data pipeline for April 4th will be fully annotated using the new datatypes.
Old tables (no longer updated):
  • measurement-lab.ndt_raw.annotation (last update Apr 4th)
  • measurement-lab.ndt_raw.hopannotation1 (last update Apr 4th)

New tables:
  • measurement-lab.ndt_raw.annotation2 (to be created Apr 4th)
  • measurement-lab.ndt_raw.hopannotation2 (to be created Apr 4th)

Best,
Stephen

Stephen Soltesz

Apr 17, 2023, 1:38:05 PM
to discuss, Stephen Soltesz
The repair and reprocessing of the annotation2 datatype for the complete history of ndt5 and ndt7 is up to date.
If you use the NDT unified views, you will have automatic access to these updates.

Any analysis that depended on the NDT network annotations can be rerun now.

The original annotations were missing at least 5% of the time (yellow/original) and are now available almost 100% of the time (green/production).

[Attached image: Screenshot 2023-04-13 at 4.46.26 PM.png]

We are continuing:
  • historical reprocessing of the NDT sidecar datatypes (tcpinfo, scamper1) with the updated annotations
  • repair and reprocessing of the hopannotation1 datatype

Stephen Soltesz

May 8, 2023, 3:17:53 PM
to discuss, Stephen Soltesz
The correction and reprocessing of all NDT network annotations is now complete.
  • The correction and reprocessing of the annotation2 datatype is complete.
  • The historical reprocessing of all NDT data (ndt5, ndt7) and sidecar datatypes (tcpinfo, scamper1) using annotation2 data is complete.
  • The correction and reprocessing of the traceroute hopannotation2 datatype is complete.
You can find a summary of data before and after in the final update to https://github.com/m-lab/data-annotations/issues/34#issuecomment-1538832929