Fwd: [UrbanData] Comments on the VBZ Data Set


Theo Armour
Apr 10, 2013, 9:52:37 PM
to urdacha
FYI


---------- Forwarded message ----------
From: Sophie Lamparter <sophie.l...@swissnexsanfrancisco.org>
Date: Wed, Apr 10, 2013 at 6:20 PM
Subject: Re: [UrbanData] Comments on the VBZ Data Set
To: Theo Armour <t.ar...@gmail.com>
Cc: "joh...@swissnexsf.org" <joh...@swissnexsf.org>, "Mändli Bruno (VBZ)" <Bruno....@vbz.ch>


Hi Theo,

Thank you so much for your insights! I wanted to thank you anyway for all the incredible work you and your team did
on the data sets for your project, and also for helping other people in the group. You really had a great impact on creating an active, participatory Urban Data Challenge
community.

We discussed your project for a long time in the jury, but in the end it did not make the cut for an exhibition proposal
because it was not easily understandable/readable for a general audience, and also not very visual.

But I do think you should be in conversation with the public transportation departments of the three cities,
because you could really help give them insights into their own data sets.

I have CC'd Bruno Mändli on this email; he is now probably back in CH, so he can respond to your
suggestions directly.

Thanks a lot again & keep up the good work, and let me know if I can assist you in any way.
Sophie

swissnex San Francisco

SOPHIE LAMPARTER | Head of Public Programs | t: (415) 912 5901 x108 | swissnex San Francisco
Scouting the nextrends from Silicon Valley.


On Apr 10, 2013, at 3:36 PM, Theo Armour <t.ar...@gmail.com> wrote:

Hello Urban Data Challenge Participants

This is a very long email. You should delete it ASAP.

At last Saturday's prize-giving, representatives of the TPG, SFMTA and VBZ spoke at length regarding their respective systems, the challenges they faced and the successes and failures in gathering and maintaining the appropriate statistics. Mr Antoine Stroh of TPG and Mr Christopher Pangilinan of SFMTA were both open, candid and insightful regarding the current successes and failures in the development and monitoring of their services. In contrast, the comments of Mr Bruno Mändli of VBZ were directed far more towards highlighting the perfection of the VBZ system and the excellence, and even superiority, of the VBZ data gathering process.

Having had some issues with the VBZ dataset in particular, we would like to respond to Mr Mändli's comments by highlighting some of the imperfections of the VBZ dataset. We will first cover the specific instances where Mr Mändli directly discussed the numbers, with examples from the dataset. Then we will highlight some anomalies within the VBZ dataset that lead us to our main point of dissent: the VBZ dataset has many issues that should be addressed, not covered up.

1. Data Anomalies
The first instance was in response to an audience question regarding the quality of the data in the supplied data sets - particularly in reference to the difficulty of obtaining accurate data when vehicles are overloaded. Both Mr Stroh and Mr Pangilinan were forthright and open regarding the difficulties of obtaining and maintaining good statistics. Mr Mändli, on the other hand, let it be known that the data gathering aspects of VBZ have no issues worthy of discussion and (if we remember correctly) implied a sense of perfection with regard to the Zurich data gathering capabilities.

Team Urdacha has prepared an applet that searches through the Zurich data set and finds anomalies.

The running app is available here:
- This app is not a game and requires some help to understand it. A readme appears when you open the app.

The app source code is available here:

Using this app you will see that, on any given day, the word "NULL" appears in between 3% and 10% of the records in the Zurich data set. The number "-1" appears in 0.5% to 1.3% of the records relating to seconds after midnight. Curiously, whenever the -1 appears there is nevertheless valid time data in adjacent fields.

When the words "NULL" or "-1" appear in a data file, this generally indicates a failure to collect or record data that normally should have been recorded. The instances of such failures in a fully operational system should be minuscule.

For example, excellence in the computer industry is frequently referred to with the term "five nines", indicating a success rate of 99.999%. We have found that the Zurich data set is closer to 92% in terms of error-free lines.

One large error of this kind: in the first file, between bytes zero and 49839827, there are 76,793 occurrences of double semi-colons (';;'). This is a laughably large amount of empty data.

[BTW, if there is one error in a line, there can easily be errors in other data points on the same line. Therefore the normal, safe practice is to discard the entire line.]
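To make the counting and the discard-the-line practice concrete, here is a minimal Python sketch of the kind of scan our applet performs (the field values and separator are illustrative assumptions, not the actual VBZ schema):

```python
def scan_lines(lines, sep=';'):
    """Count anomaly markers and keep only fully clean lines; one bad
    field casts doubt on the whole record, so the entire line goes."""
    stats = {'NULL': 0, '-1': 0, 'empty': 0}
    clean = []
    for line in lines:
        fields = line.rstrip('\n').split(sep)
        stats['NULL'] += fields.count('NULL')
        stats['-1'] += fields.count('-1')
        stats['empty'] += line.count(sep + sep)      # ';;' means an empty field
        if 'NULL' not in fields and '-1' not in fields and sep + sep not in line:
            clean.append(fields)
    return stats, clean
```

Feeding the big file through a scan like this, line by line, is how the percentages above were obtained.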

2. Bunches and Gaps
The second instance where Mr Mändli spoke of VBZ in a way we could measure was in reply to an observation that - even in this modern era - bunches and gaps in vehicle headways are still an issue. While both Mr Stroh and Mr Pangilinan agreed with this observation, Mr Mändli countered that bunches are not an issue in Zurich, and that if there are bunches they are intentional, to do with providing vehicles when and where they are most needed.

Team Urdacha has prepared an applet that searches through the data sets of the three cities and identifies bunches and gaps.

The running app is available here:

The app source code is available here:

The app is currently at an early, simplistic stage. As set up, it only shows gaps or bunches in one direction; changing this requires editing the code. The basic algorithm, however, is quite simple. The five most recent arrivals at every vehicle stop are tracked while the data is replayed in sequential time order. Any vehicle that arrives at a stop in half the average time or less is deemed to be in a bunch. Any vehicle that takes over 50% longer than the average is deemed to be in a gap. In order to allow for schedule changes, any vehicle that takes over 200% of the average time is ignored. Also ignored are data sets with five or fewer items, and two contiguous check-ins by the same vehicle at the same stop.
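For the curious, the logic described above can be sketched in a few lines of Python (our actual app is in JavaScript; this is a simplified restatement, with hypothetical event tuples of stop, vehicle and arrival time):

```python
from collections import defaultdict, deque

def classify_arrivals(events, window=5):
    """Replay (stop, vehicle, arrival_time) events in time order and label
    each arrival as a 'bunch' or a 'gap' relative to the average of the
    last `window` headways seen at that stop."""
    recent = defaultdict(lambda: deque(maxlen=window))  # stop -> recent headways
    last = {}                                           # stop -> (vehicle, time)
    labels = []
    for stop, vehicle, t in sorted(events, key=lambda e: e[2]):
        if stop in last:
            prev_vehicle, prev_t = last[stop]
            if vehicle == prev_vehicle:
                # two contiguous check-ins by the same vehicle: ignore
                last[stop] = (vehicle, t)
                continue
            headway = t - prev_t
            # only classify once a full window of headways exists
            # (this also skips stops with five or fewer arrivals)
            if len(recent[stop]) == window:
                avg = sum(recent[stop]) / window
                if headway > 2 * avg:
                    pass                       # probable schedule change: ignore
                elif headway <= avg / 2:
                    labels.append((stop, vehicle, t, 'bunch'))
                elif headway > 1.5 * avg:
                    labels.append((stop, vehicle, t, 'gap'))
            recent[stop].append(headway)
        last[stop] = (vehicle, t)
    return labels
```

Replaying each city's data through this classifier is what produced the comparison figures below.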

With this app, it is possible to compare and contrast the bunches and gaps that occur among vehicles in the three cities. We were quite surprised by the results: given that we are total amateurs at transportation statistics, we anticipated that each of the three cities would display quite different results - which is what usually happens when you don't know what you are doing. Not so. In all three data sets, bunches seem to occur about 10% of the time and gaps about 2% of the time. The shorter the average time between vehicles, the greater the bunching. In all of this, VBZ appears to be no better or worse at controlling bunches and gaps than the TPG or the SFMTA.

Mr Mändli remarked that Zurich vehicle bunches are often intentional responses to passenger load issues. A more thorough analysis of gapping and bunching is possible when passenger load data is taken into account. VBZ, however, supplied passenger load data for only about 20% of the events, while the other two cities supplied 100%.

3. Further Data Set Issues
Apart from the two issues arising from the Saturday talks, there are a number of interesting aspects that can be considered relating to the Zurich data set.

The first thing one notices is that one file is very big. The file titled 'schedule-vs-arrival.csv' is over 514 MB in size. Typically you can open CSV files with a spreadsheet or text editor, but the VBZ file was so big that not one of our apps could open it. We ask ourselves: is it really open data if you can't open the data?

In the end, we had to write a program especially to access the data. We learned much while doing this, and would love to exchange ideas with other teams and see how they were able to deal with the data.
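For anyone facing the same problem: a file too big for a spreadsheet can still be processed one row at a time. A minimal Python sketch (the semi-colon delimiter matches the big VBZ file; the path is hypothetical):

```python
import csv

def stream_rows(path, delimiter=';'):
    """Yield one parsed row at a time, so files far larger than RAM
    (such as a 514 MB CSV) can be processed without loading them whole."""
    with open(path, newline='', encoding='utf-8') as f:
        for row in csv.reader(f, delimiter=delimiter):
            yield row
```

For example, `sum(1 for _ in stream_rows('schedule-vs-arrival.csv'))` counts the records without ever holding the file in memory.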

With such a big file, we anticipated seeing much interesting data. Here are some of the things we found:

The fields titled 'serviceDate' and 'date' both contain identical information: the date of service. Here is a sample of the way the data is presented:

2012-10-01 00:00:00.000; 2012-10-01 00:00:00.000

Can you imagine this data - 47 characters repeated byte for byte two hundred thousand times? Then change the one to a two and repeat that data another two hundred thousand times. Actually, you don't need to imagine this. Just look at the VBZ data set. We can show you how to do this.

Then there are the fields containing route information, titled routeNumber, routeNameShort and routeName. Here is a sample of the way the data is presented:

31,31,31

Again you may notice a certain amount of duplication of information. And if you see that repetition once in the data set, you will also see it repeated many tens of thousands of times.

We have not yet written an app to confirm that all this data is 99.999% duplicates. But if we were to write such an app then we would also check out whether routeId is just routeNumber in a different form and the same with stopId, stopNumber and stopNameShort.
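Were we to write such an app, its core would be a simple pairwise comparison of columns. A Python sketch (the column names are taken from the dataset; the function itself is hypothetical, not our applet):

```python
def duplicate_columns(rows, header):
    """Return pairs of column names whose values are identical on every
    row - candidates for redundant fields like 'serviceDate' vs 'date'
    or 'routeNumber' vs 'routeNameShort'."""
    n = len(header)
    pairs = []
    for i in range(n):
        for j in range(i + 1, n):
            if all(r[i] == r[j] for r in rows):
                pairs.append((header[i], header[j]))
    return pairs
```

Run over a streamed sample of the big file, this would settle the question of how much of the data is pure duplication.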

In other words, during the course of the past few weeks, we have gained the feeling that the Zurich data set is padded and inflated with far more 'stuff' than is needed for normal business needs.

Yes, inside a proper relational database the data is always held in a more compact format than the ASCII format we were supplied. But twice the data, tables and relations is still twice what's needed.

The usual justification for such padding and duplication is for the data processors and managers to pontificate that such redundancy helps prevent errors and loss of data. But, as we have shown in the previous sections, the Zurich data set is rife with data anomalies.

Another typical justification is that this sort of thing is due to all the legacy apps that still need to be maintained. But a data processing unit that cannot control the output and appearance of its legacy apps is typically entering a sort of processing death spiral.

Then again, perhaps the data in the mainframe really is perfect, neat and efficient, and what we received was just an unfortunate incident. But then how did the data get out to the Challenge in such an inflated, corrupted and redundant state?

Let us for a minute assume that the data set was prepared by the greenest, least-educated, most underpaid VBZ staff member. Pause. If a smarter person had prepared the data, all this duplication would have been hidden; we would never have seen it. The manager who released this data - how come this person was not educated enough to know how to look at the data?

4. Conclusions
If anybody is interested, there are many more fun things that we can talk about regarding the VBZ data set.

For example, the latitudes and longitudes are presented in WGS84 format - whereas the other cities provided the normal latitudes and longitudes used by most mapmakers and math apps. If VBZ wants to use an unusual geodesy format, why doesn't VBZ use their own CH1903?
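Should anyone wish to move between the Swiss grid and conventional latitude/longitude, swisstopo publishes approximation formulas. A Python sketch (accurate to roughly a metre per swisstopo's notes; treat the constants as our transcription, worth double-checking against the official document):

```python
def ch1903_to_wgs84(easting, northing):
    """Approximate conversion from Swiss CH1903 (LV03) coordinates to
    WGS84 latitude/longitude in degrees, using the swisstopo
    approximation formulas."""
    y = (easting - 600000) / 1e6     # origin of the Swiss grid, near Bern
    x = (northing - 200000) / 1e6
    lon = (2.6779094 + 4.728982 * y + 0.791484 * y * x
           + 0.1306 * y * x**2 - 0.0436 * y**3) * 100 / 36
    lat = (16.9023892 + 3.238272 * x - 0.270978 * y**2
           - 0.002528 * x**2 - 0.0447 * y**2 * x - 0.0140 * x**3) * 100 / 36
    return lat, lon
```

The grid origin (600000, 200000) should come out near Bern, at roughly 46.95° N, 7.44° E.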

And what about the semi-colons? The term CSV stands for Comma Separated Values. In the supplied excerpt, commas were used, and strings with commas in them were encapsulated, as they should be, in quotes. The big file, 'schedule-vs-arrival.csv', uses semi-colons as a separator. Not a big deal, but yet another curious anomaly.
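A practical note for other teams: Python's csv module can guess the separator, so mislabelled 'CSV' files need not be a surprise. A small sketch (the path and candidate set are assumptions):

```python
import csv

def detect_delimiter(path, candidates=";,\t"):
    """Guess the field separator of a 'CSV' file whose delimiter may not
    actually be a comma, as with the semi-colon-separated
    'schedule-vs-arrival.csv'."""
    with open(path, newline='', encoding='utf-8') as f:
        sample = f.read(4096)
    return csv.Sniffer().sniff(sample, delimiters=candidates).delimiter
```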

And what about the data itself? Both San Francisco and Geneva were able to supply complete passenger load data sets for all the routes they covered. Zurich was only able to present passenger load data for about twenty percent of the routes. In essence this created an obstacle to comparing the VBZ passenger load data with the data sets from the other cities.

The more we look at the Zurich data, the more we find imperfections and data quality issues. Which brings us to some disturbing questions:

  • What kind of organization creates and maintains a data set that is over twice the size that it needs to be?
  • Who replicates information so that it stays in the system but can be hidden in management reports and yet can make the huge printouts look like vast and heroic efforts?
  • Who allows large percentages of anomalies to creep into the data, so that extra coding is required to make it processable?

Although we are on the outside looking in, it still feels strange that we should run into such errors and then be told that this data is flawless. We recommend that VBZ look into who is collecting and compiling the data. The goal of these sorts of competitions is for both the participants and the hosts to gain fresh insights. We hope that through this analysis, VBZ gains some valuable feedback on the real state of their data. It is far from perfect, and should not be presented as a superior dataset to be emulated. By stating that it is, Mr Mändli is maintaining a status quo that, although it ensures jobs and steady workflow for staff and consultants, could be a disservice to the VBZ and to all those who could profit from a better understanding and use of the data to ensure smoother public transport.

Please feel free to contact Team Urdacha if you would like to read more or to discuss and explore this and other data sets. It is our goal to continue to better everyone's experience with open data challenges, and proclamations like Mr Mändli's should not go unchallenged. Let's continue this fascinating bus ride, er, investigation...

Theo Armour

PS Thank you to Sophie and everybody who helped put the Urban Data Challenge together. Our team didn't win anything, but we learned a lot. In particular: don't submit a heavyweight, 3D, real-time, analytical, complicated, trying-to-break-new-ground effort into a smartphone applet contest.


--
You received this message because you are subscribed to the Google Groups "Urban Data Challenge" group.
To unsubscribe from this group and stop receiving emails from it, send an email to urban-data-chall...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.
 
 


Theo Armour
Apr 15, 2013, 2:22:36 PM
to urdacha
FYI...

Scroll down to see lengthy response.


---------- Forwarded message ----------
From: Mändli Bruno (VBZ) <Bruno....@vbz.ch>
Date: Mon, Apr 15, 2013 at 12:53 AM
Subject: AW: [UrbanData] Comments on the VBZ Data Set
To: Theo Armour <th...@evereverland.net>


Hi Theo

 

Thank you for your answer.

 

May we have a phone call today?

 

Because of our different time zones (9 hours) this would be possible from 8.30 PM (CET) = 11.30 AM your time. (I'm not available between 07.30 and 11.30 PM your time for personal reasons.)

 

Best Regards

Bruno

 

 

 

verkehrsbetriebe zürich - www.vbz.ch

bruno mändli, leiter informatik

luggwegstrasse 65, postfach 8048 zürich

tel. direkt +41 44 434 44 40, fax +41 44 434 47 84

 mailto:bruno....@vbz.ch

ein unternehmen der stadt zürich

Please don't print this e-mail unless you really need to.

 

 

 

From: t.ar...@gmail.com [mailto:t.ar...@gmail.com] On behalf of Theo Armour
Sent: Saturday, 13 April 2013 02:17
To: Mändli Bruno (VBZ)
Cc: Sophie Lamparter; joh...@swissnexsf.org (joh...@swissnexSF.org); Conde Antonio (VBZ); Lutz Richard (VBZ)
Subject: Re: [UrbanData] Comments on the VBZ Data Set

 

Hi Bruno

 

Thank you very much for your very detailed and speedy response.

 

I had been planning to send an add-on saying that a detailed reply to my 'rant' would not be necessary - but this was delayed by other matters. Sorry about that.

 

Here's why: The over-arching element in all this is that VBZ actually did publish data in an open manner, and people used it and VBZ obtained a variety of feedback. And, hopefully, feedback of all types that VBZ might not have received without running through this process. This is cool and the way of the future.

 

Some thoughts: I have ridden on the systems of all three cities in the challenge - and all have their amazing aspects. Furthermore I was #3 architect in charge of designing 12 MTRC stations in Hong Kong and, in yet another life, I was the program manager charged with designing three releases of AutoCAD at Autodesk. So dealing with reality, data and complexity seems to come naturally to me.

 

Therefore my 'gut' feeling is that if we continue this conversation it will soon become so interesting that we will not be able to get anything else done.

 

;-)

 

So let us consider the discussion as not requiring any further effort.

 

Sorrows: I am sorry that you were not able to get the app to run. One issue that I should have noted in the 'readme' is that, in these preliminary efforts, we only work with and test against the latest version of the Google Chrome browser.

 

Also, our recent focus is on working with real-time data (such as the data from nextbus.com) and building 3D apps so engineers can carry out Exploratory Data Analysis. Our skills with historical 2D data therefore need improving.

 

In any case, all the source code is in JavaScript on GitHub so any coder can quickly see the logic of what we were working on.

 

 

Conclusions: Bruno - again, thank you for thinking so hard about all this. I agree with much of what you say - though not all of it. But the main thing is for you to keep the VBZ buses and the data flowing and running smoothly...

 

Warm regards,

 

Theo

 

   

 

 

 

On Fri, Apr 12, 2013 at 6:28 AM, Mändli Bruno (VBZ) <Bruno....@vbz.ch> wrote:

Hi Theo

 

Thank you very much for all the work you put into this topic.

I think there are many misunderstandings here. I will try to give answers.

I would really like to propose having a phone call first, before taking any further action. My mobile number is +41 79 430 91 66; I would call you if you gave me your number.

 

Regards, Bruno

 

 

 

My answers:

 

Data Anomalies

 

As I remember, I said that the data staging process is most important. So we focused on this process, to get a proper data baseline.

Collected operational data will never be perfect. We are not doing financial transactions. You cannot add or correct missing or implausible data during the staging process; you have to do the best with the data you get.

We use a BI system, containing several data marts, to obtain fast access for analysis purposes.

(On the other hand, we use a vendor-proprietary statistical application for detailed analysis of operational and quality questions.)

Quality does not mean adding or "correcting" data, but putting the right data sets together, marking implausible data and knowing the operational conditions. Missing operational data are absolutely not anomalies. Theo, if you run the kind of analyses you all did for the Urban Data Challenge, what is the difference in the confidence interval of a statistical analysis if the result is based on 80%, 90% or 100% of the total amount of data?

I also said that data staging and plausibility checks are most important and most CPU-consuming (maybe I said that after the plenum discussion), and this is really true. But it does not mean that you will get 100% correct data. It means that, after the staging process, you have the possibility of using proper data for analyses and reports.

 

On the other hand, I said that we are focusing on customer services such as service quality, passenger information quality and transfer information quality. (We also run transfer protection, even between several public transport providers.)

One thing that has concerned me from the beginning of this challenge is the influence of dispatch actions. We run active dispatch actions such as trip offset, reassignment of run or block, and many more.

It would not have been possible for you to take all these things into consideration within this challenge; you can't do such things within weeks. But these things affect our operational data, too.

This is an item we discussed in the jury team.

As I said on Saturday, we have never published any data before. We had an ongoing discussion about this situation, and my concerns were exactly about what has happened now. Just delivering raw data would be easy, but what would the result be in the end, regarding an app, if you were not aware of all the internals? With a real-time passenger information app, passengers could get wrong information.

And after reading your email, I am even more convinced not to publish raw data only. Additional data should be prepared to show things from a passenger's point of view (as we do with our information services). A simple example: if there is a general delay of 8 minutes on a route, and the trip interval is 8 minutes, the situation is perfect from a passenger's point of view - the delay is just a "technical delay". Passengers never care about our run, route or block numbers; they just expect a departure at a certain point and at a certain time. In such situations, almost everything would be perfect.
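This example can be made precise: under the simplifying assumption of a uniform delay and a fixed trip interval, the delay a passenger perceives at the stop is the delay modulo the headway. A hypothetical sketch (an illustration, not anything from VBZ's systems):

```python
def perceived_delay(delay_min, headway_min):
    """From a passenger's point of view, a delay that is an exact multiple
    of the trip interval is invisible: the next vehicle still appears on
    the advertised rhythm. Only the remainder relative to the headway is
    felt at the stop."""
    return delay_min % headway_min
```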

 

Unfortunately we can't run your app; just the "Test" works. And what I could see there, using "dataset 1a", are pull-out trips from the garage. Theo, we delivered all the data we had, even all the unproductive segment parts.

 

Also, all the data comes from real operations, and you cannot expect "five nines" quality from real operations. That would be fake, nothing else.

 

 

 

Bunches and Gaps

 

Unfortunately we could not run your application, so we cannot view your data.

I know the situation in Zürich, and I learned about the situation in SF, both just by riding public transport.

My proposal: just try the service in reality and you will see.

 

The problem in a normal situation is that if a vehicle gets delayed, for whatever reason, it tends to get more delayed because of the larger passenger exchange and other factors - especially during rush-hour periods.

So what I said was that we are in the comfortable situation of being able to adjust vehicle capacity, e.g. by using the big 4-axle trolleybuses.

Additionally, we use our CAD/AVL system, which supports dispatch actions to handle such situations if they occur. This is a comfortable situation, too.

Theo, what possibilities do you have to manage these situations, from a given baseline, except:

- adjust vehicle capacity

- adjust the trip interval (it is fixed in Zürich for a given time period of the day)

- do effective traffic light preemption (this works well in Zürich)

- have wide low-floor doors to enable fast passenger exchange, even for disabled persons

- give public vehicles separate lanes

- adjust the schedule (which may end in longer travel times)

By the end of this year, all Zürich buses will have low-floor doors, and today usually every second tramway trip is low-floor. On the passenger information displays, passengers can see whether the next tram will be low-floor or not.

 

We cannot really prevent other situations such as vehicle breakdowns, the blocking of a route (for whatever reason) or the influence of private traffic. Zürich does not have a metro; traffic is all on the surface, and there is a big interaction between private and public traffic. So I hope you compare, e.g., tram routes with tram routes. Our system also supports dispatch actions if such situations occur (including reinforcement trips).

 

And of course you are right: the smaller the planned trip intervals, the more they tend to bunch. So you also have to take this factor into consideration when you compare situations. And Theo, do you really compare route by route?

 

Regarding passenger counting data: you're right, Zürich has equipped 20% of its vehicles with a passenger counting system. So far there is no need to do more, because passenger load is calculated by statistical projection.

 

 

Further Data Set Issues

As I have said several times, we have never published data before. I agree that the data we delivered was probably not presented in the most understandable way. As you can see, we delivered all the data, even what we call "unproductive segments". Theo, we did not get any questions or further requests about our data.

 

We delivered our data "as is", coming from our BI system. You are right, there is redundancy. But all the things you mentioned are used in our system; there is 100% transparency.

 

As I mentioned before, operational data cannot be perfect, and there has to be a way to mark missing attributes.

 

 

Conclusion

 

I really think we should not discuss the representation of geo data. We use the WGS84 format; it's a common format for GPS receivers.

 

All the other things: it would have been great if you had contacted us earlier.

- Yes, only 20% of our vehicles are equipped with a passenger counting system. With statistical projection, it is not possible to break missing passenger data down to a certain trip and a certain stop.

- Within our BI system, the staging of data is done before the data is loaded into the different data marts.

- There is nothing to hide within our data marts. And of course we do not add or change data within our staging process (except to mark implausible data sets or to set attributes as missing).

- Missing operational data sets or attributes, e.g. missing GPS data, are not "anomalies". E.g.: every 2nd radio datagram coming from a vehicle does not have GPS data. Because we currently get a datagram every 6 to 12 seconds, this is no problem for us at all. (With every 2nd datagram we send a "logical distance in metres" instead of the GPS data.)

- I'm sorry about the comma and semicolon issue.

 

 

Theo, in the end we are focused, from a passenger's perspective, on service quality, on passenger information (which also includes transfer information, as I told you) and on transfer protection. We spent much more time on these things than on statistical data. But it would not be possible to reach today's quality of, for example, real-time passenger information if the data quality were as bad as you describe in your email. And of course we use our statistical data to improve our service, too.

 

Theo, another example: another Urban Data Challenge team implemented a reliability analysis, and as you could see, VBZ is not bad concerning reliability. But when we are running transfer protection with our system, and vehicles for this reason have to wait for other delayed vehicles (including trains at the train station), schedule reliability decreases. Imagine you are standing at a bus stop at 11 pm, with the next bus coming in one hour: service reliability increases if we do so, even if schedule reliability decreases. These things often look different when you change your focus.

 

With kind regards

 

Bruno

 

 

 

 

 

 

 

 


 

 

 

