ctbk.dev, Feb+March 2021 data quality issues


Ryan Williams

Apr 21, 2021, 4:15:56 PM
to citibike...@googlegroups.com
ctbk.dev dashboard
Hi all, I made ctbk.dev recently to answer questions I frequently have about Citibike system-wide numbers:
Screen Shot 2021-04-21 at 4.11.28 PM.png
(link to screenshot, in case it doesn't render here)

Some highlights:
Feb+March 2021 Data Quality Issues
The "user type" (annual "subscriber" vs. daily "customer") and "gender" fields seem to have gone all out of whack in the last 2mos of data.

Historically, "customers" were 10-30% of rides, but in the last 2mos suddenly jumped to ≈80% (link):
Screen Shot 2021-04-21 at 3.21.47 PM.png
The brighter bars on the bottom of each month's stack are "customers", the darker bars on top are "subscribers". Note that the clear majority of rides are "subscribers" in every past month through January 2021, but that suddenly flips for Feb+March.

Similarly, the gender breakdown changed wildly in the last 2mos' data (link):
Screen Shot 2021-04-21 at 3.23.56 PM.png
From bottom to top, each month's stack is "unspecified", "female", "male". Proportions were roughly 10%/30%/60% through Jan 2021, after which they became ≈80% "unspecified" (the relative M/F breakdown seems steady at around 70/30, and closer to even since the pandemic).

I've audited my pipeline a few times and believe that these apparent anomalies are present in the original source data at s3://tripdata. Here's a notebook pulling these numbers directly from there; relevant plots:
image.png
image.png
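For anyone who wants to reproduce this check without the notebook, here is a minimal sketch (not the notebook's actual code) that computes each user type's share of rides in one month's CSV. The column name `usertype` and the inline sample rows are assumptions for illustration; adjust the column name to match the header of the file you downloaded.

```python
import csv
import io
from collections import Counter


def user_type_shares(csv_file, column="usertype"):
    """Return {value: fraction of rows} for one column of a trip CSV.

    `column` is an assumed header name; check the actual file's header.
    """
    counts = Counter(row[column] for row in csv.DictReader(csv_file))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {value: n / total for value, n in counts.items()}


# Made-up sample standing in for one month's trip file (not real data).
sample = io.StringIO(
    "tripduration,usertype\n"
    "300,Subscriber\n"
    "450,Subscriber\n"
    "1200,Customer\n"
    "600,Subscriber\n"
)
shares = user_type_shares(sample)
print(shares)
```

Running the same function over each monthly file and comparing the "Customer" share month to month should show whether the jump is in the published CSVs themselves.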
If anyone has ideas about what could be going on here, or knows the best way to take it up with Citibike folks directly, I'm all ears!

Thanks,

-Ryan

Clif Kranish

May 4, 2021, 11:43:42 AM
to citibike...@googlegroups.com, Ryan Williams

I ran my own cleansing and visualization and I see the same explosive growth in percentage of rides by customers in February.

OTOH the percentage of rides by females has declined.

Citi Bike publishes monthly operating reports. The latest available is for February here:

https://d21xlh2maitm24.cloudfront.net/nyc/February-2021-Citi-Bike-Monthly-Report.pdf?mtime=20210317150548

Where it says (emphasis mine):

Ridership

There were 639,789 trips in February, with an average 23,695 trips per day. The combined distance traveled for all trips was 1,088,843.34 miles. The average trip lasted 13 minutes and 45 seconds and covered 1.7 miles. Annual members completed the majority of trips, recording 526,349 trips, compared to 119,911 trips by casual members. Ridership was generally higher on weekdays, but weekends were more popular among casual riders. February 24th was the highest day for ridership this month with 45,834 rides.

So unless we're both missing something, they are using different data for their analysis than what they provide us.
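To put the report's figures on the same footing as the CSV percentages, here is a quick arithmetic sketch using only the numbers quoted above. Note that the two category counts sum to slightly more than the stated monthly total, so the shares below use their sum as the denominator; that choice is an assumption, not something the report specifies.

```python
# Figures quoted from the February 2021 monthly report.
total_trips = 639_789
annual_member_trips = 526_349
casual_trips = 119_911

# The two categories do not add up to the stated total.
category_sum = annual_member_trips + casual_trips
discrepancy = category_sum - total_trips

# Shares computed against the category sum (an assumed denominator).
member_share = annual_member_trips / category_sum
casual_share = casual_trips / category_sum

print(f"category sum: {category_sum:,} (stated total: {total_trips:,})")
print(f"difference:   {discrepancy:,}")
print(f"member share: {member_share:.1%}")
print(f"casual share: {casual_share:.1%}")
```

By the report's own numbers, casual riders come to under a fifth of trips, which is roughly the reverse of the ≈80% "customer" share seen in the published trip data.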

--
You received this message because you are subscribed to the Google Groups "BikeNYC and CitiBikeNYC Hackers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to citibike-hacke...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/citibike-hackers/CAEE_Dw-y4g7302Fpw69625DEGL1O-XX6RMRBNY872cdCSZAzJw%40mail.gmail.com.

Sakib Ahmed

May 29, 2021, 5:33:40 PM
to BikeNYC and CitiBikeNYC Hackers
Hi,

I was looking at the recent data for a school project as well. It seems like the monthly reports and the data in the CSV files don't match. Did someone figure out why?

- Sakib

Clif Kranish

Jun 4, 2021, 4:46:46 PM
to citibike...@googlegroups.com

The May data is just as off. I've tried reaching out to Citi Bike through both Customer Service and Media Inquiries but got nowhere. I also tried contacting a manager through LinkedIn but got no reply. I guess no one from Citi Bike reads this group.

Ryan Williams

Jun 15, 2021, 7:17:09 PM
to citibike...@googlegroups.com
Before going into the weeds too much, one cool thing is that the May data shows a new all-time high! ctbk.dev

(rolling 12mo avg also all-time high)

Data Quality Issues
I'm seeing that the recent anomalous months (since February, now including May) were all updated when the May data landed (on June 11):



Notable changes:
  • "Gender" field is gone altogether (maybe unsurprising since it had been overwhelmingly "unknown" since February)
  • New field: "Rideable Type"
    • Possible values: "docked_bike", "electric_bike". Promising!
    • Currently just 122 "electric_bike" rides in April and 148 in May (all in NYC region). I'm guessing they are just phasing in something here.
  • New field: "member_casual"
    • Possible values: "member", "casual"
    • Seems like a drop-in replacement for "User Type" ("Subscriber", "Customer")
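If "member_casual" really is a drop-in replacement, old and new files can be read through one small normalizing function. This is a minimal sketch, not the actual ctbk.dev pipeline code; the mapping direction (member → Subscriber, casual → Customer), the `normalize_user_type` helper, and the "Unknown" fallback are assumptions based on the field values listed above.

```python
# Assumed mapping from the new "member_casual" values to the legacy
# "User Type" values described in the list above.
LEGACY_USER_TYPE = {"member": "Subscriber", "casual": "Customer"}


def normalize_user_type(row):
    """Return a legacy-style user type for a row from either schema.

    `row` is a dict keyed by column name, e.g. from csv.DictReader.
    Rows with neither column, or an unrecognized value, map to "Unknown".
    """
    if "member_casual" in row:
        return LEGACY_USER_TYPE.get(row["member_casual"], "Unknown")
    return row.get("User Type", "Unknown")


# Hypothetical rows in the new and old formats.
new_row = {"rideable_type": "docked_bike", "member_casual": "casual"}
old_row = {"User Type": "Subscriber"}
print(normalize_user_type(new_row), normalize_user_type(old_row))
```

Keeping the legacy labels as the common vocabulary means historical months need no rewriting; only the new files pass through the mapping.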
I've updated my pipeline to handle these as best I can (with a "🚧" next to the "Gender" and "Rideable Type" fields, since they are not very usable at present):



I also put some similar notes there about the data quality situation.

I think I can contact some Lyft folks to dig into this more, but there are other improvements I want to make (mostly to make the whole thing more static and also hopefully more responsive… gonna try moving from Heroku to static SQLite files 😂 and do some other basic perf optimizations).



Danielle Carrick

Jul 17, 2021, 7:12:54 PM
to BikeNYC and CitiBikeNYC Hackers
I believe February was when they made it possible to unlock a bike with a QR code. I wonder if the data quality issues had anything to do with that update. Perhaps subscribers who unlocked a bike with the QR code got counted as customers? That probably doesn't explain the entire shift, but it does seem like they rolled out a lot of updates around that time.


ckran

Aug 11, 2021, 8:06:10 AM
to BikeNYC and CitiBikeNYC Hackers

The Citi Bike system data page (finally) describes the new data format we've seen, although with no further explanation for the change. 

Conor Skelding

Aug 11, 2021, 10:31:49 AM
to citibike...@googlegroups.com
Hey, all,

I write features for the Sunday New York Post. 

(I'm on this list dating back to a data science bootcamp I did a few years ago, where my final project was something on Citi Bike rides starting/ending near transit hubs.)

Anyone have any idea why they might have removed the "gender" field from their reported system data?

My cell is 917 553 2992 if you want to talk.

And if anyone on this list finds a cool thing or does a cool project, I'd love to pitch it to my editors for a feature. They love transportation stories.

Thanks,
Conor Skelding


