Graph transcription errors


Brandon Heller

Jul 11, 2011, 11:37:03 PM
to The Internet Topology Zoo
Hi,

First off, many thanks to the developers of the Zoo - this is a great
resource for researchers who want to verify and analyze algorithms
that depend on physical topologies. I went through the same manual
process for one topology, and I can appreciate the effort required to
go through 200+.

My first step in understanding the Zoo was to run basic sanity
checks: does each node have a location? Is each node connected to the
graph? Surprisingly, 2/3 of topologies don't pass these basic tests.
I took a look at location first, finding 10 topologies that don't have
node locations, which is fine. Then I looked at connected components,
and found lots of errors: 32 topos have one or more disconnected
components; some of these are due to network map ambiguities, but a
whole bunch are due to what looks like transcription errors.
- 8 are network map ambiguities
- 11 appear to be human errors
- the rest either have dead sources or I'm not sure about.
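
For anyone who wants to repeat the two checks, here is a minimal sketch of how they could be done. It is not the script actually used for the counts above. It assumes the graph has already been loaded into a dict of node attributes plus a list of edge pairs, and that locations are stored under "Latitude" and "Longitude" keys (an assumption about the attribute names). The toy graph at the bottom is made up.

```python
from collections import defaultdict


def connected_components(nodes, edges):
    """Return a list of sets, one per connected component (undirected graph)."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:  # iterative depth-first search
            n = stack.pop()
            if n in comp:
                continue
            comp.add(n)
            stack.extend(adj[n] - comp)
        seen |= comp
        components.append(comp)
    return components


def sanity_check(nodes, edges):
    """nodes: dict id -> attribute dict; edges: list of (id, id) pairs.

    Returns (ids without a location, ids outside the largest component).
    """
    missing_location = sorted(
        n for n, attrs in nodes.items()
        if "Latitude" not in attrs or "Longitude" not in attrs
    )
    comps = sorted(connected_components(nodes, edges), key=len, reverse=True)
    stray = sorted(n for comp in comps[1:] for n in comp)
    return missing_location, stray


# Made-up toy graph: node 3 has no location, node 2 has no links.
nodes = {
    0: {"Latitude": 1.0, "Longitude": 1.0},
    1: {"Latitude": 2.0, "Longitude": 2.0},
    2: {"Latitude": 3.0, "Longitude": 3.0},
    3: {},
}
edges = [(0, 1), (1, 3)]
missing, stray = sanity_check(nodes, edges)
print("no location:", missing, "disconnected:", stray)
```

Treating everything outside the largest component as "disconnected" is a simplification; a topology with two large islands would need a closer look by hand.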

From this list of topologies w/>= 1 disconnected component:
['Bandcon', 'Bren', 'BtLatinAmerica', 'Colt', 'DeutscheTelekom',
'Dfn', 'DialtelecomCz', 'Eunetworks', 'Evolink',
'Garr199901', 'Garr199904', 'Garr199905', 'Garr200004', 'Garr200109',
'Garr200112', 'Garr200212', 'Globenet',
'GtsCe', 'HiberniaUs', 'Istar', 'KentmanApr2007', 'KentmanAug2005',
'Navigata', 'Nsfcnet', 'Ntelos', 'Ntt',
'Oteglobe', 'Padi', 'Telcove', 'Tw', 'Uninett', 'UsSignal']

The following are OK (disconnections indicated on the map):
Bandcon: NY/NJ disconnected
BtLatinAmerica: hopelessly disconnected
DialtelecomCz: some node junctions unclear
Eunetworks: primary map shows one lone node (Duct-only market)
Ntelos: OK, 47 is disconn (Washington DC). Intentional: network
partner.
Ntt: OK, peering points shown.
Oteglobe: OK, peering points shown.
Telcove: OK, 62 (Hickory) /66 (Wilmington) disconn on map.

These seem like errors:
** Dfn: node 30, WUP, is disconnected - should connect to DOR(29) and
BIR(31). 30 to 31.
Looking at the map (http://www-win.rrze.uni-erlangen.de/cgi-bin/ipqos/map.pl?config=win)
** Evolink: 21 and 38 are disconnected? doesn't look like it on the
map.
21: Sevlievo - should connect to Sofia.
38: Istanbul virtual connection - off-site connection, OK
** Globenet: 1 disconn comp, 13: Manaus, but Manaus is connected to
Boa Vista (node 11.)
** GtsCe: 149 (Bucuresti labeled 1) disconn, and only one node there.
Brasov (90) and Constanta (94) should be connected, at least.
Ploiesti should connect to the leftmost one, right?
Would be nice to fix - a really interesting topology.
PDF at: http://www.gtsce.com/file/en/maps/gts-ce-network.pdf
Easier to grab details than the flash map.
** HiberniaUs: 21 disconn - Internal 0
Shouldn't there be a green link connecting 21 to Halifax?
** KentmanApr2007: 11 (Medway ACL) is missing, yet it is connected to
UoG-M - unclear at what speed.
** Navigata: 1(Victoria) is disconn.
source has died:
http://www.navigata.ca/about-us/our-network/map_national_network.pdf
new map at http://www.navigata.ca/about-us/network-map.aspx
If you click on the map, it shows details for the area around
Victoria, and how it connects to Vancouver.
** Nsfcnet: 9 is disconn (CERNET). Should connect to Tsinghua
** Tw: 6 (Amarillo), 28 (Corpus Christi), 63 (Greenville),
66 (Spartanburg), 68 (Charleston)
http://www.twtelecom.com/about_us/networks.html
Labels for Greenville and Chattanooga are switched; Chattanooga is the
disconnected one.
** Uninett: 23 (HiAK Kjeller), 45 (HSM Molde), 47 (HSM Kristiansund)
23 (HiAK Kjeller) should connect to UNIK Kjeller in the bottom right
45 (HSM Molde) should connect to HiA Alesund and HSM Kristiansund
47 (HSM Kristiansund) should connect to HSM Molde and Uninett Teknobyen
** UsSignal: 6(Akron), 16(Evansville) disconn
6(Akron) connects to Cleveland, Lima, and Youngstown
16(Evansville) (OK, is disconn)

I'm not sure about the following; can someone take a look at these?
** Bren: why are there so many nodes in the gml, but not the graph?
** Colt: dashed lines and disconnections
** DeutscheTelekom: many connections not shown
** Garr199901: source field seems down.
node 7 (EUMED CONNECT) is missing a connection and location, but not
shown on the map.
Source yields 404. http://www.garr.it/reteGARR/mappa.php
** Garr200212: component 14 is disconnected (EUMED connect)
** other Garr* topos
** Istar: 9 disconn: Hamilton, CA
Why is the source MCI? The link in the note field doesn't seem to
connect to the topology either.
** Padi: lots of disconn.
padi2.ps domain appears dead:
http://www.padi2.ps/maps.php

What is the process for updating the Zoo? All I see is a download
link, with no indication of when it was last updated or what the
history of changes is. I also don't see a way to re-generate the GML
files myself after making edits to yEd sources. I would much rather
make modifications myself and re-build, so that I don't have to
manually edit GML files, and I'd really prefer revision-controlled
source access to track whether the fixes got made - or even just push
them myself. I doubt you'd like to send patches around.

What are your plans for this process? Something like github would be
the least effort for everyone - then I could update the latest changes
without having to do diffs and download a whole nother zip archive.
Plus, then I could send a pull request, and you guys could simply
merge it if it passed your check. This is exactly the kind of thing
that git works really well at - managing updates from lots of people
with minimal effort.

Also, would it be possible to define an attribute, disconnected, for
nodes that are intentionally disconnected? It would provide a nice
way to verify that the disconnected nodes are expected.
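
To make the idea concrete, here is a rough sketch of the kind of check such an attribute would allow. The attribute name "disconnected" and the value 1 are only the proposal above, not something that exists in the Zoo files today, and the toy graph is made up.

```python
from collections import defaultdict


def largest_component(nodes, edges):
    """Return the set of node ids in the largest connected component."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    best, seen = set(), set()
    for start in nodes:
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            n = stack.pop()
            if n not in comp:
                comp.add(n)
                stack.extend(adj[n] - comp)
        seen |= comp
        if len(comp) > len(best):
            best = comp
    return best


def check_disconnected_flags(nodes, edges, flag="disconnected"):
    """Compare the (hypothetical) flag against the actual connectivity.

    Returns (unexpected, stale):
      unexpected - nodes outside the main component that are NOT flagged
      stale      - nodes flagged as disconnected that are in fact connected
    """
    main = largest_component(nodes, edges)
    unexpected = sorted(n for n in nodes
                        if n not in main and not nodes[n].get(flag))
    stale = sorted(n for n in nodes if n in main and nodes[n].get(flag))
    return unexpected, stale


# Made-up toy graph: 2 is flagged and isolated (fine), 3 is isolated but
# not flagged (likely a transcription error), 4 is flagged but connected.
nodes = {0: {}, 1: {}, 2: {"disconnected": 1}, 3: {}, 4: {"disconnected": 1}}
edges = [(0, 1), (1, 4)]
unexpected, stale = check_disconnected_flags(nodes, edges)
print("unexpected:", unexpected, "stale flags:", stale)
```

With a flag like this, a topology passes the connectivity check when both lists come back empty.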

Thanks,
Brandon Heller
Stanford


Matthew Roughan

Jul 12, 2011, 4:03:01 AM
to Brandon Heller, topolo...@googlegroups.com
Hi

first off thanks hugely. This type of involvement is exactly what we
need to get this dataset to the level of accuracy that will make it
really useful to people. We did a bunch of checks on the data, but with
so much data, some errors were bound to creep in.

There are legitimate reasons some sanity checks will fail. For instance,
some networks are actually disconnected (as far as their map goes) and
there are some reasons this may make sense (e.g. they get transit from a
provider rather than having their own network connection). And there
are definitely some ambiguities in the data. However, it looks like you
have found a bunch of issues that we really need to fix.

We are using version control, but only locally. I like the idea of using
github or something similar to allow contributions to be merged directly
into the zoo. We'll have a talk about this in the next week or so and
get something more useful set up for contributors.

I guess another thing we need is an FAQ to answer some basic questions
about the data. I'll see about starting that as well.

Can I also ask where you came across the dataset?

Cheers,
Matt

Brandon Heller

Jul 12, 2011, 4:26:48 AM
to topolo...@googlegroups.com
On Tue, Jul 12, 2011 at 1:03 AM, Matthew Roughan <matthew...@adelaide.edu.au> wrote:
Hi

first off thanks hugely. This type of involvement is exactly what we need to get this dataset to the level of accuracy that will make it really useful to people. We did a bunch of checks on the data, but with so much data, some errors were bound to creep in.

There are legitimate reasons some sanity checks will fail. For instance, some networks are actually disconnected (as far as their map goes) and there are some reasons this may make sense (e.g. they get transit from a provider rather than having their own network connection).  And there are definitely some ambiguities in the data. However, it looks like you have found a bunch of issues that we really need to fix.

 
Exactly, it looks like a lot of network maps are ambiguous, even ones I've seen outside the Topology Zoo dataset.

We are using version control, but only locally. I like the idea of using github or something similar to allow contributions to be merged directly into the zoo. We'll have a talk about this in the next week or so and get something more useful set up for contributors.
I guess another thing we need is an FAQ to answer some basic questions about the data. I'll see about starting that as well.

 
That'd be great.  It would have been nice to know about some of the quirks up front, like no-location 'Internal 0' nodes, or unlabeled nodes marked 'hyperedge', as well as the standard policy when a link or node is ambiguously marked on a map.

I would suggest github because it has an integrated issues tracker and makes forking/pulling really easy.  Right now, I'm partially blocked because some of the topologies I'm interested in, like TataNld and Bellsouth, have hyperedges.  I would love to just fill in Lat/Long estimates and re-gen the GML files, but until those sources are accessible, I'll have to directly modify the GMLs, which feels wrong.

Can I also ask where you came across the dataset?


It was forwarded to me by an industry lab researcher who thought it would be useful for my project - who had seen it from a forward himself.  Word gets around when stuff is useful.

Thanks,
-b

Brandon Heller

Jul 25, 2011, 3:36:45 PM
to topolo...@googlegroups.com
Checking back on the feedback from two weeks ago:

- Are there any plans to make change tracking easier, via direct source access?
- Has anyone looked into or fixed the bugs I reported?
- Do you have any plans to add a 'schema doc' to explain (and standardize) the meaning of each attribute value?

Thanks,
-Brandon

Simon Knight

Jul 25, 2011, 11:21:14 PM
to topolo...@googlegroups.com
Hello Brandon,

Sorry about the delay in response.
I have created a GitHub repository, and will soon check the sources into it.
The sources differ slightly from the GML files in the zoo, as they
contain x-y co-ordinates from when the network was traced from the
source image. They also contain some extra information for the
geocoding script. These are then converted into the GML format used in
the zoo using the yed2zoo tool described at
http://topology-zoo.org/toolset.html

I have been working on a paper, so haven't had a chance to investigate
the full set of bugs you reported. I appreciate the report, and will
get onto them early next week.

I agree a schema is important, and will get onto this as well soon.

Is there something in the meantime that I could directly provide you
with that will allow you to get up and running quicker?

Thanks
Simon


Simon Knight

Jul 25, 2011, 11:54:46 PM
to topolo...@googlegroups.com
Hi Brandon,

if the biggest holdup for you is the lack of co-ordinates for the
hyperedges, we can probably provide a programmatic solution. We can
remove the hyperedge, and connect its neighbors together, either as a
clique or as a minimum spanning tree.

Another option is to retain the hyperedge, but approximate its
location, such as the midpoint of its neighbors.
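
As a rough illustration of the first two options (the clique replacement and the midpoint approximation), here is a minimal sketch. It assumes node attributes in a dict with "Latitude" and "Longitude" keys, which is an assumption about the attribute names, and the "approximate" marker is a hypothetical flag rather than an existing Zoo attribute. The minimum spanning tree variant is left out for brevity, and the toy graph is made up.

```python
from itertools import combinations


def neighbours_of(edges, node):
    """Sorted list of nodes that share an edge with `node`."""
    return sorted({v if u == node else u
                   for u, v in edges if node in (u, v)} - {node})


def replace_with_clique(edges, hyperedge):
    """Drop `hyperedge` and connect its former neighbours pairwise."""
    kept = [(u, v) for u, v in edges if hyperedge not in (u, v)]
    existing = {frozenset(e) for e in kept}
    for u, v in combinations(neighbours_of(edges, hyperedge), 2):
        if frozenset((u, v)) not in existing:  # avoid duplicate links
            kept.append((u, v))
    return kept


def approximate_location(nodes, edges, hyperedge):
    """Place `hyperedge` at the mean lat/long of its located neighbours.

    A plain arithmetic mean is crude (it misbehaves across the 180 degree
    meridian), so the node is tagged with a hypothetical 'approximate' flag.
    """
    located = [nodes[n] for n in neighbours_of(edges, hyperedge)
               if "Latitude" in nodes[n] and "Longitude" in nodes[n]]
    if not located:
        return False
    nodes[hyperedge]["Latitude"] = sum(a["Latitude"] for a in located) / len(located)
    nodes[hyperedge]["Longitude"] = sum(a["Longitude"] for a in located) / len(located)
    nodes[hyperedge]["approximate"] = 1
    return True


# Made-up toy graph: node 9 is a hyperedge joining nodes 0, 1 and 2.
nodes = {
    0: {"Latitude": 10.0, "Longitude": 20.0},
    1: {"Latitude": 20.0, "Longitude": 40.0},
    2: {"Latitude": 30.0, "Longitude": 60.0},
    9: {"label": "hyperedge"},
}
edges = [(0, 9), (9, 1), (2, 9)]

clique_edges = replace_with_clique(edges, 9)
approximate_location(nodes, edges, 9)
print(clique_edges)
print(nodes[9])
```

The two functions are alternatives: the first changes the graph structure and removes the need for a location, the second keeps the structure and only fills in coordinates.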

Finally, you could manually set the location. This could be done by
editing the GML. However, if I were doing it this way, I would write a
small Python script to update the relevant node information, and then
write it back to a GML file.

Does any of this help?

We will address the transcribing errors shortly. Once the sources are
in GitHub corrections will be simpler, and we will easily be able to
produce a list of diffs.
Thanks for your feedback, it is much appreciated.

Thanks
Simon

Brandon Heller

Jul 26, 2011, 2:19:48 AM
to topolo...@googlegroups.com
On Mon, Jul 25, 2011 at 8:54 PM, Simon Knight <simon....@gmail.com> wrote:
Hi Brandon,

if the biggest holdup for you is the lack of co-ordinates for the
hyperedges, we can probably provide a programmatic solution. We can
remove the hyperedge, and connect its neighbors together, either as a
clique or as a minimum spanning tree.


None of this stuff is actually a holdup - I can always manually edit the GML files - but it just feels dirty to edit generated files, because it means there's going to be extra effort later to change them from the source file.  Plus, there's the higher chance of making an error when they're manually edited. 
 
Another option is to retain the hyperedge, but approximate its
location, such as the midpoint of its neighbors.


Like you say, there are a whole bunch of ways to handle ambiguous hyperedges, but I think the best option is to look at a Google map and just estimate the position as closely as possible - there aren't _that_ many of these.  This would be close enough for most purposes, but it would be good to mark these nodes with an 'approximate' tag.
 
Finally, you could manually set the location. This could be done by
editing the GML. However, if I were doing it this way, I would write a
small Python script to update the relevant node information, and then
write it back to a GML file.

Does any of this help?

We will address the transcribing errors shortly. Once the sources are
in GitHub corrections will be simpler, and we will easily be able to
produce a list of diffs.
Thanks for your feedback, it is much appreciated.

Definitely, this will make it easy to track fixes and submit additional graphs (I have some to add).  There's some effort to set this up, but then the effort is lower for everyone, and it's easier to build a community around the data set to bring it to a higher quality.

Thanks for releasing this, and good luck with INFOCOM,
-b

Simon Knight

Sep 1, 2011, 4:30:03 AM
to topolo...@googlegroups.com
Hi Brandon,
Sorry about the long delay in replying.
The networks are now in GitHub. https://github.com/sk2/topologyzoo

I am still finalising the toolset for conversion, which will hopefully
be done in the next few days.

We have also merged in your corrections - thank you for spotting
these. We have started to use the issue tracker on GitHub, so that
should make it easier to report new networks and corrections in the
future: https://github.com/sk2/topologyzoo/issues

We are now using GraphML as our working file format, as this allows
properties (for nodes, edges, and the graph itself) to be directly
edited in yEd. Previously we were using GML with an external
properties CSV, which wasn't going to scale well with a public version
control system.

Thanks

Simon
