Turnstile Data Missing Data and Glitches

Stephen Bauman

May 20, 2023, 10:57:54 AM
to mtadeveloperresources
1. Missing Data
The new subway data omits vital information that was present in the old format.

Here's the CSV header from last week's data:

C/A,UNIT,SCP,STATION,LINENAME,DIVISION,DATE,TIME,DESC,ENTRIES,EXITS

There are two export buttons on the NYS Open Data website.

The first downloads all the data from Feb 2022. Here's the csv header:

transit_timestamp,station_complex_id,station_complex,borough,routes,payment_method,ridership,transfers,latitude,longitude,Georeference

The second export button provides this csv header:

transit_timestamp,station_complex_id,station_complex,borough,routes,payment_method,ridership,transfers,latitude,longitude,Georeference

You will notice that the exit count, which was present in the previous data, is gone. This is a serious omission that prevents determining changes in travel patterns.

2. Time Stamp Glitch

Here are time stamp samples from the new format:

2023-04-02T00:00:00.000
and
03/20/2022 11:00:00 PM

A problem arises with the release of hourly data. Time zone information is important because the clocks change twice a year, between standard and daylight time. On 05 Nov 2023, when clocks fall back from 2 AM daylight time to 1 AM standard time, the hour from 1 AM to 2 AM occurs twice, so there will be two entries for 01:00:00.000 or 01:00:00 AM.

Neither timestamp version shows the offset between local time and UTC. Including that offset is standard practice when timestamps are displayed in local time.
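
The ambiguity can be reproduced with a short Python sketch. It assumes the timestamps are New York local time (the dataset does not say so) and uses the standard-library zoneinfo module, which needs a time zone database on the machine. It walks four consecutive UTC hours across the 5 Nov 2023 clock change and prints each one in the dataset's offset-free style next to an ISO 8601 form that carries the UTC offset.

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

# Assumption: the dataset's local time is America/New_York.
NY = ZoneInfo("America/New_York")

# Four consecutive UTC hours spanning the fall-back transition.
start_utc = datetime(2023, 11, 5, 4, 0, tzinfo=timezone.utc)

rows = []
for h in range(4):
    local = (start_utc + timedelta(hours=h)).astimezone(NY)
    # Left: offset-free label, as in the new format. Right: label with offset.
    rows.append((local.strftime("%Y-%m-%dT%H:%M:%S.000"), local.isoformat()))

for naive, with_offset in rows:
    print(naive, "|", with_offset)
```

Two different instants share the label 2023-11-05T01:00:00.000 in the left column; only the offset on the right (-04:00 versus -05:00) tells them apart.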

3. Extra Information - Station location.

Is it necessary to include two forms for each station's geographic coordinates on an hourly basis? People who need to use geographic data should be proficient in converting between x, y coordinates and a Point. 

Also, does this information need to be presented on an hourly basis? Do subway station locations move that quickly? 
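
One way to avoid repeating the coordinates every hour is a separate station lookup table, with the Point text generated only when needed. The sketch below assumes the Georeference column is WKT-style text of the form POINT (longitude latitude), which this thread does not confirm. The sample rows, the station IDs X1 and X2, and all values are made up for illustration.

```python
import csv
import io

# Made-up sample rows; IDs and values are illustrative, not real MTA data.
SAMPLE = """\
transit_timestamp,station_complex_id,ridership,latitude,longitude
2023-04-02T00:00:00.000,X1,10,40.75,-73.99
2023-04-02T01:00:00.000,X1,7,40.75,-73.99
2023-04-02T00:00:00.000,X2,3,40.68,-73.95
"""

def to_wkt_point(latitude, longitude):
    """Format coordinates as WKT text; WKT puts x (longitude) first."""
    return f"POINT ({longitude} {latitude})"

stations = {}  # station_complex_id -> (latitude, longitude), stored once
hourly = []    # (timestamp, station id, ridership), without coordinates

for row in csv.DictReader(io.StringIO(SAMPLE)):
    station_id = row["station_complex_id"]
    stations.setdefault(
        station_id, (float(row["latitude"]), float(row["longitude"]))
    )
    hourly.append((row["transit_timestamp"], station_id, int(row["ridership"])))

for station_id, (lat, lon) in stations.items():
    print(station_id, to_wkt_point(lat, lon))
```

The hourly rows keep only the station ID, and the coordinates can be joined back from the lookup table when a map is actually needed.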

4. Download Size.

The CSV of the weekly turnstile data required around 33 MB. The two new CSVs required 521.1 MB and 548.3 MB. The major reason for the increase is that the new data isn't available in weekly chunks, and there's no way to filter the data by time before downloading.

Stephen Bauman

Marcel Dejean

May 21, 2023, 8:24:05 AM
to mtadevelop...@googlegroups.com
>The csv of the weekly turnstile data required around 33 MB. The two new csv's required  521.1 MB and 548.3 MB. The major reason for this increased data size is the new data isn't available in weekly chunks. There's no way to filter the data by time before downloading.


Stephen Bauman

May 21, 2023, 11:29:33 AM
to mtadeveloperresources
Thank you very much for your interest.

Unfortunately, the command:

resulted in the following response:

bash: =1000000: command not found

Redirecting output to ‘wget-log.2’.
[steve@localhost Downloads]$ cat wget-log.2
--2023-05-21 11:07:49--  https://data.ny.gov/resource/wujg-7c2s.csv?=transit_timestamp%20between%20%222023-05-13T00:00:00%22%20and%20%222023-05-20T00:00:00%22
Resolving data.ny.gov (data.ny.gov)... 52.206.140.205, 52.206.140.199, 52.206.68.26
Connecting to data.ny.gov (data.ny.gov)|52.206.140.205|:443... connected.
HTTP request sent, awaiting response... 400 Bad Request
2023-05-21 11:07:49 ERROR 400: Bad Request.


I appreciate your advice. Please don't pursue this further solely on my behalf. I have used Socrata queries successfully on other requests from the NYS Open Data site. The lack of exit totals reduces my interest in examining this dataset.

Again, thanks for your response.

Preston Marshall

May 21, 2023, 4:15:06 PM
to mtadevelop...@googlegroups.com
Try escaping your $ with a \ so \$limit and \$where. Bash is interpreting those as variables.
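
The earlier log is consistent with this: the logged URL begins with ?= as if $where had expanded to an empty string, and "=1000000: command not found" is what bash reports when an unquoted & ends the command and $limit expands to nothing. As a hedged sketch, Python's shlex.quote can produce a command line that the shell passes through unchanged. The URL below is a guess at the original command, rebuilt from the log with $where and $limit put back; whether the server accepts that query is not verified here.

```python
import shlex

# Assumption: reconstructed from the wget log, with $where and $limit restored.
url = (
    "https://data.ny.gov/resource/wujg-7c2s.csv"
    "?$where=transit_timestamp%20between%20%222023-05-13T00:00:00%22"
    "%20and%20%222023-05-20T00:00:00%22"
    "&$limit=1000000"
)

# shlex.quote wraps the URL in single quotes, so the shell leaves
# $, & and ? alone instead of expanding or interpreting them.
command = "wget " + shlex.quote(url)
print(command)

# Round trip: a POSIX shell would hand wget exactly the original URL.
print(shlex.split(command)[1] == url)
```

Typing the single quotes by hand around the URL has the same effect as the backslash escapes suggested above.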

Stephen Bauman

May 21, 2023, 10:22:42 PM
to mtadeveloperresources
Thanks for the suggestion. I don't think it's a bash problem.

I've removed the limit. Here are some of the responses.


Resolving data.ny.gov (data.ny.gov)... 52.206.140.205, 52.206.140.199, 52.206.68.26
Connecting to data.ny.gov (data.ny.gov)|52.206.140.205|:443... connected.
HTTP request sent, awaiting response... 400 Bad Request
2023-05-21 21:34:49 ERROR 400: Bad Request.

It would appear the major filtering problem is on the NYS end. In particular, there seem to be problems parsing the query part of the wget request.

Here's what worked:

[steve@localhost Downloads]$ wget https://data.ny.gov/resource/wujg-7c2s.csv?transit_timestamp=2023-05-13T10:00:00.000
--2023-05-21 21:49:56--  https://data.ny.gov/resource/wujg-7c2s.csv?transit_timestamp=2023-05-13T10:00:00.000
Resolving data.ny.gov (data.ny.gov)... 52.206.140.199, 52.206.68.26, 52.206.140.205
Connecting to data.ny.gov (data.ny.gov)|52.206.140.199|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/csv]
Saving to: ‘wujg-7c2s.csv?transit_timestamp=2023-05-13T10:00:00.000’
wujg-7c2s.csv?transit_timestamp=2023-05-13T10:0     [ <=>                                                                                                 ]  53.85K  --.-KB/s    in 0.04s  
2023-05-21 21:49:57 (1.49 MB/s) - ‘wujg-7c2s.csv?transit_timestamp=2023-05-13T10:00:00.000’ saved [55143]

Here's what did not:

[steve@localhost Downloads]$ wget https://data.ny.gov/resource/wujg-7c2s.csv?transit_timestamp>2023-05-13T10:00:00.000
--2023-05-21 21:50:55--  https://data.ny.gov/resource/wujg-7c2s.csv?transit_timestamp
Resolving data.ny.gov (data.ny.gov)... 52.206.68.26, 52.206.140.199, 52.206.140.205
Connecting to data.ny.gov (data.ny.gov)|52.206.68.26|:443... connected.

HTTP request sent, awaiting response... 400 Bad Request
2023-05-21 21:50:56 ERROR 400: Bad Request.

[steve@localhost Downloads]$ wget https://data.ny.gov/resource/wujg-7c2s.csv?transit_timestamp\ >\ 2023-05-13T10:00:00.000
--2023-05-21 22:00:18--  https://data.ny.gov/resource/wujg-7c2s.csv?transit_timestamp%20
Resolving data.ny.gov (data.ny.gov)... 52.206.140.199, 52.206.68.26, 52.206.140.205
Connecting to data.ny.gov (data.ny.gov)|52.206.140.199|:443... connected.

HTTP request sent, awaiting response... 400 Bad Request
2023-05-21 22:00:19 ERROR 400: Bad Request.

[steve@localhost Downloads]$ wget https://data.ny.gov/resource/wujg-7c2s.csv?transit_timestamp%20>%202023-05-13T10:00:00.000
--2023-05-21 22:00:55--  https://data.ny.gov/resource/wujg-7c2s.csv?transit_timestamp%20
Resolving data.ny.gov (data.ny.gov)... 52.206.140.199, 52.206.68.26, 52.206.140.205
Connecting to data.ny.gov (data.ny.gov)|52.206.140.199|:443... connected.

HTTP request sent, awaiting response... 400 Bad Request
2023-05-21 22:00:55 ERROR 400: Bad Request.
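
In the three failing attempts above, the logged URL stops at transit_timestamp (or transit_timestamp%20), which is consistent with the shell treating > as output redirection before wget ever saw it. A minimal sketch of one way around that is to build the URL in Python so every special character is percent-encoded. It assumes the Socrata $where and $limit parameters and single-quoted timestamp literals behave as expected for this dataset, which is not verified here; build_query_url is a hypothetical helper name.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

BASE = "https://data.ny.gov/resource/wujg-7c2s.csv"

def build_query_url(start, end, limit=1000000):
    """Return a URL asking for rows with start <= transit_timestamp < end."""
    where = f"transit_timestamp >= '{start}' AND transit_timestamp < '{end}'"
    # urlencode percent-encodes $, >, <, quotes and spaces, so nothing
    # is left for a shell to expand or redirect.
    return BASE + "?" + urlencode({"$where": where, "$limit": limit})

url = build_query_url("2023-05-13T00:00:00", "2023-05-20T00:00:00")
print(url)

# Decode the query string again to confirm the filter survived encoding.
decoded = parse_qs(urlsplit(url).query)
print(decoded["$where"][0])
```

The printed URL contains no characters that bash treats specially, so it can be pasted after wget without quoting.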

The link that worked came from this web page: https://dev.socrata.com/foundry/data.ny.gov/wujg-7c2s

If the new dataset contained all the information in the dataset it replaced, I could manage downloading the entire file and filtering it on my side. It's just an inconvenience for those who are not interested in trip destinations.
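
Filtering on the client side can be done as a stream, so the full file never has to fit in memory. The sketch below is a rough illustration rather than a tested tool: it accepts both timestamp styles quoted earlier in the thread, the helper names parse_ts and filter_rows are hypothetical, and the two sample rows use made-up ridership values.

```python
import csv
import io
from datetime import datetime

# The two timestamp styles seen in the new exports.
FORMATS = ("%Y-%m-%dT%H:%M:%S.%f", "%m/%d/%Y %I:%M:%S %p")

def parse_ts(text):
    """Parse either timestamp style into a naive local datetime."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    raise ValueError(f"unrecognized timestamp: {text!r}")

def filter_rows(lines, start, end, column="transit_timestamp"):
    """Yield rows with start <= timestamp < end, one row at a time."""
    for row in csv.DictReader(lines):
        if start <= parse_ts(row[column]) < end:
            yield row

# Made-up sample; a real run would pass an open file object instead.
SAMPLE = io.StringIO(
    "transit_timestamp,ridership\n"
    "2023-04-02T00:00:00.000,1\n"
    "03/20/2022 11:00:00 PM,2\n"
)
kept = list(filter_rows(SAMPLE, datetime(2023, 4, 1), datetime(2023, 4, 8)))
print(kept)
```

Because the parsed values carry no UTC offset, rows from the repeated hour at the end of daylight time cannot be told apart by this filter.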

Thanks for your reply.

Steve