Draft of an email


Theo Armour

Apr 8, 2013, 11:08:05 PM
to urdacha
Hello Team Urdacha 

Although Saturday's events at Swissnex were totally disappointing for us, there may be a small matter of interest buried in the process.

Kindly have a look at the following draft and reply with comments regarding the suitability and appropriateness of this message, which I would like to send to the Urban Data Challenge Google Group.

And, yes, this message most likely would not have been written if the team had won a prize. Interestingly, the motivation has not changed an iota, and it is this: to help cities improve the way they operate and manage their public transport systems.

Theo

***

Hello Urban Data Challenge Participants

This is a very long email. You should delete it ASAP.

At last Saturday's prize-giving, representatives of the TPG, SFMTA and VBZ spoke at length about their respective systems, the challenges they face, and their successes and failures in gathering and maintaining the appropriate statistics. M. Antoine Stroh of TPG and Mr. Christopher A. Pangilinan of SFMTA were both open, candid and insightful regarding the current successes and failures in developing and monitoring their services. In contrast, the comments of Herr Bruno Mändi of VBZ were directed far more towards highlighting the perfection of the VBZ system and the excellence, even superiority, of VBZ's data-gathering process.

Team Urdacha thinks it might be a fun thing to double-click into the VBZ data set and see if the statistics show that reality is as Herr Mändi portrays it.

There were at least two places where Herr Mändi directly discussed the numbers.

The first instance was in response to an audience question regarding the quality of the data in the supplied data sets - particularly the difficulty of obtaining accurate data when vehicles are overloaded. Both M. Stroh and Mr Pangilinan were forthright and open regarding the difficulties of obtaining and maintaining good statistics. Herr Mändi, on the other hand, let it be known that the data-gathering aspects of VBZ had no issues worthy of discussion and (if we remember correctly) implied a sense of perfection with regard to the Zurich data-gathering capabilities.

Team Urdacha has prepared an applet that searches through the Zurich data set and finds anomalies.

The running app is available here:
- Do click on the question mark to open up the Read Me. This app is not a game.

The app source code is available here:

Using this app you will see that on any given day, the word "NULL" appears in 3% to 10% of the records in the Zurich data set. The number "-1" appears in 0.5% to 1.3% of the records in the seconds-after-midnight field. Curiously, whenever the -1 appears there is nevertheless valid time data in adjacent fields.
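For the technically curious, the scan is simple enough to sketch in a few lines of Python. The column names and sample rows below are invented for illustration; only the "NULL" and -1 markers come from the real file:

```python
import csv
import io

# Invented sample rows in the style of the Zurich file; the real column
# names and values differ, but the NULL and -1 markers are as observed.
SAMPLE = """\
serviceDate;secondsAfterMidnight;stopName
2012-10-01 00:00:00.000;36000;Bellevue
2012-10-01 00:00:00.000;-1;Paradeplatz
2012-10-01 00:00:00.000;NULL;Central
"""

def count_anomalies(text):
    rows = list(csv.reader(io.StringIO(text), delimiter=';'))
    data = rows[1:]                       # skip the header row
    nulls = sum(1 for r in data if 'NULL' in r)
    minus_ones = sum(1 for r in data if '-1' in r)
    return len(data), nulls, minus_ones

print(count_anomalies(SAMPLE))            # (3, 1, 1)
```

Run over a whole day's records, the two tallies divided by the row count give the percentages quoted above.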

When the words "NULL" or "-1" appear in a data file, this generally indicates a failure to collect or record data that normally should have been recorded. The instances of such failure in a fully operational system should be minuscule.

For example, excellence in the computer industry is frequently referred to with the term "five nines", meaning a success or availability rate of 99.999%.

Team Urdacha finds that in the Zurich data set the proportion of error-free lines is closer to 92%.
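To make the contrast concrete, some back-of-the-envelope arithmetic (the 92% figure is our own measurement from above):

```python
# Bad records per million at each quality level: the "five nines" ideal
# versus the roughly 92% error-free rate measured in the Zurich data.
for name, rate in (('five nines (99.999%)', 0.99999),
                   ('Zurich data set (~92%)', 0.92)):
    bad = round((1 - rate) * 1_000_000)
    print(f'{name}: roughly {bad:,} bad records per million')
```

That is ten bad records per million at five nines, versus about eighty thousand per million at 92%.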

[If there is one error in a line, should the other data points in that line be trusted? The normal and safe practice is to throw out the entire line.]

[Later: Upon going through the Zurich data yet again, we find that in the first file, from byte zero to byte 49,839,827, there are 76,793 occurrences of double semicolons (';;'). More empty data. This is so unprofessional, it made us laugh.]
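The double-semicolon count itself takes only a few lines. The byte range is the figure quoted above; the file path and everything else is an illustrative sketch:

```python
# Count ';;' (adjacent empty fields) in the first ~50 MB of the big file.
# The path argument is hypothetical; the byte range is as quoted above.
def count_double_semis(path, start=0, end=49839827):
    with open(path, 'rb') as f:      # ~50 MB fits comfortably in memory
        f.seek(start)
        data = f.read(end - start)
    return data.count(b';;')

# e.g. count_double_semis('schedule-vs-arrival.csv')
```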

The second incident where Herr Mändi spoke of VBZ in a measurable way was in response to an observation (disclosure: from a member of this team) that - even in this modern era - bunches and gaps in vehicle distribution are still an issue. While M. Stroh and Mr Pangilinan both agreed with this observation, Herr Mändi countered that bunches are not an issue in Zurich, and that if there are bunches they are intentional, to do with providing vehicles when and where they are most needed.

Team Urdacha has prepared an applet that searches through the data sets of all three cities and identifies bunches and gaps.

The running app is available here:

The app source code is available here:

The app is currently at an early stage. As set up, it only shows gaps or bunches in one direction; changing this requires editing the code. The basic algorithm is quite simple. The five most recent arrivals at every stop are tracked while the data is replayed in sequential time order. Any vehicle that arrives at a stop in half the average headway or less is deemed to be in a bunch. Any vehicle that takes over 50% longer than the average is deemed to be in a gap. In order to allow for schedule changes, any vehicle that takes over 200% of the average time is ignored. Also ignored are data sets with five or fewer items, as are two contiguous check-ins by the same vehicle at the same stop.
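For reference, the whole algorithm fits in a short Python sketch. The (time, stop, vehicle) event format and the variable names are our assumptions; the thresholds are exactly those described above:

```python
from collections import defaultdict, deque

def classify_arrivals(events):
    """Label each (time, stop, vehicle) arrival normal, bunch or gap."""
    history = defaultdict(lambda: deque(maxlen=5))  # last 5 arrival times per stop
    last_vehicle = {}                               # previous vehicle at each stop
    counts = defaultdict(int)
    labels = []
    for time, stop, vehicle in sorted(events):      # replay in time order
        if last_vehicle.get(stop) == vehicle:       # contiguous check-in: ignore
            continue
        last_vehicle[stop] = vehicle
        times = history[stop]
        label = 'normal'
        if len(times) == times.maxlen:              # need more than five arrivals
            headways = [b - a for a, b in zip(times, list(times)[1:])]
            avg = sum(headways) / len(headways)
            headway = time - times[-1]
            if headway > 2 * avg:
                label = 'ignored'                   # probable schedule change
            elif headway <= 0.5 * avg:
                label = 'bunch'
            elif headway > 1.5 * avg:
                label = 'gap'
        times.append(time)
        counts[label] += 1
        labels.append((time, stop, label))
    return labels, dict(counts)
```

Feeding it a regular ten-minute headway followed by an arrival two minutes after the previous vehicle, for instance, flags that arrival as a bunch.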

With this app, it is possible to compare and contrast the bunches and gaps that occur among vehicles in the three cities. We were quite surprised by the results. Given that we are total amateurs at transportation statistics, we anticipated each of the three cities would display quite different results - which is what usually happens when you don't know what you are doing. Not so. In all three data sets, bunches seem to occur about 10% of the time and gaps about 2% of the time. The shorter the average time between vehicles, the greater the bunching. In all of this, VBZ appears to be no better or worse at controlling bunches and gaps than TPG or SFMTA.

Actually, we could do a far deeper analysis of gapping and bunching - in particular, whether bunches are intentional responses to passenger load - except that Zurich supplied passenger load data for only about 20% of the events while the other two cities supplied 100%. Did Zurich not keep the good data, or the bad data?

Apart from the two issues arising from Saturday's talks, there are a number of other interesting aspects of the Zurich data set worth considering.

The first thing one notices is that one file is very big. The file titled 'schedule-vs-arrival.csv' is over 514 MB in size. Typically you can open CSV files with a spreadsheet or text editor, but this file was so big that none of our apps could open it. We ask ourselves: is it really open data if you can't open the data?
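The program boils down to a streaming reader that never holds more than one row in memory. A minimal sketch (the semicolon separator is what we observed; the encoding guess and usage line are illustrative):

```python
import csv

def stream_rows(path, delimiter=';'):
    # Yield one row at a time as a dict keyed by the header fields,
    # so a 514 MB file can be processed in constant memory.
    with open(path, newline='', encoding='latin-1') as f:
        reader = csv.reader(f, delimiter=delimiter)
        header = next(reader)
        for row in reader:
            yield dict(zip(header, row))

# e.g. sum(1 for row in stream_rows('schedule-vs-arrival.csv'))
```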

In the end, we had to write a program especially to access the data. We learned much while doing this, so thank you VBZ for the learning challenge.

With such a big file, we anticipated seeing much interesting data.

Here are some of the things we found:

The fields titled 'serviceDate' and 'date' both contain identical information: the service date. Here is a sample of the way the data is presented:

2012-10-01 00:00:00.000; 2012-10-01 00:00:00.000

Can you imagine this data [47 characters] repeated - byte for byte - two hundred thousand times? Then change the one to a two. Then repeat that data two hundred thousand times. Actually, you don't need to imagine this. Just look at the VBZ data set. We can show you how to do this.

Then there are the fields containing route information, titled routeNumber, routeNameShort and routeName. Here is a sample of the way the data is presented:

31,31,31

Again you may notice a certain amount of duplication of information. And again if you see that repetition once in the data set, guess what? You will also see it repeated many tens of thousands of times.

We have not yet written an app to confirm that all this data is 99.999% duplicated. But if we were to write such an app, we would also check whether routeId is just routeNumber in a different form, and the same with stopId, stopNumber and stopNameShort.
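Such an app would be short. A sketch, with the column names guessed from the file header and the sample rows invented for illustration:

```python
def duplication_rate(rows, col_a, col_b):
    # Fraction of rows in which the two columns hold an identical value.
    pairs = [(r[col_a], r[col_b]) for r in rows]
    same = sum(1 for a, b in pairs if a == b)
    return same / len(pairs) if pairs else 0.0

# Invented rows for illustration; the real check would stream the file.
rows = [
    {'routeNumber': '31', 'routeNameShort': '31'},
    {'routeNumber': '31', 'routeNameShort': '31'},
    {'routeNumber': '10', 'routeNameShort': '10E'},
]
print(duplication_rate(rows, 'routeNumber', 'routeNameShort'))  # ~0.67
```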

In other words, during the course of the past few weeks, we have come to feel that the Zurich data set is padded and inflated with far more 'stuff' than normal business needs require.

Yes, inside a proper relational database the data is always held in a more compact format than the ASCII format we were supplied. But twice the data, tables and relations is still twice what's needed.

The usual justification for such padding and duplication is for the data processors and managers to pontificate that such redundancy helps prevent errors and loss of data. But, as we have shown in the previous sections, the Zurich data set is rife with data anomalies.

Then again, perhaps the data up in the mainframe really is perfect, neat and efficient, and what we received was just an unfortunate incident. But then how did the data get out to the Challenge in such an inflated, corrupted and redundant state?

Let us for the minute assume that the data set was prepared by the greenest, least-educated, most underpaid VBZ staff member. Pause. If a smarter person had prepared the data, all this duplication would have been hidden and we would never have seen it. And the manager who released this data was not educated enough to know how to look at it.

Let's think about nicer things.

If anybody is interested, there are many more fun things that we can talk about regarding the Zurich data set.

For example, the latitudes and longitudes are presented in WGS84 format, whereas the other cities provided the normal latitudes and longitudes used by most mapmakers and math apps. If VBZ wants to use a bizarre geodesy format, why doesn't it use its own CH1903?

And what about the semicolons? The term CSV stands for Comma-Separated Values. In the supplied excerpt, commas were used, and strings with commas in them were encapsulated, as they should be, by quotes. The big file, 'schedule-vs-arrival.csv', uses semicolons as the separator. Not a big deal, but yet another curious, funky anomaly.

And what about the data itself? Both San Francisco and Geneva were able to supply complete passenger load data sets for all the routes they covered. Zurich was only able to present passenger load data for about twenty percent of the routes. In essence this makes it impossible to compare the VBZ passenger load data with the data sets from the other cities [not that the other cities actually made it easy either].

We could go on exploring the VBZ data set. The more we look at the Zurich data the more we find fun things to wonder about, but perhaps it's time just to sit back and consider what we have seen so far.

The first thing is that we have questions such as these:

What kind of organization creates and maintains a data set that is over twice the size it needs to be? Who replicates information so that it stays in the system but can be hidden in management reports, and yet can make the huge printouts look like vast and heroic efforts? Who allows large percentages of anomalies to creep into the data, so that extra coding is required in order to process it?

But we are just amateurs, and outsiders as well, so who are we to know why such things happen?

We do have a strong feeling however.

Herr Mändi: you should never, ever do this again. If you wish to maintain the status quo of VBZ, to make sure that your IT staff have ongoing and peaceful jobs, and to assure that your suppliers and consultants maintain a steady workflow, then this business of releasing your data to the public should be absolutely forbidden. Furthermore, that the VBZ data set has been made public in such a perilous state is highly unfortunate. The perpetrator should be identified as soon as possible: either this person is quite incompetent, or they may even be some sort of anonymous whistleblower. And if VBZ ever really wants to learn how to obfuscate data big time, we can show you how to do it in 3D.

If, on the other hand, you are not Herr Mändi and you find this sort of thing fun to explore - especially if you live in Zurich - then do feel free to contact Team Urdacha and let's continue this fascinating bus ride, er, investigation...

Theo Armour

PS Thank you to Sophie and everybody who helped put the Urban Data Challenge together. Our team didn't win anything, but we learned a lot. In particular: don't submit a heavyweight, 3D, real-time, analytical, complicated, trying-to-break-new-ground effort to a smartphone app contest.

