Change in data format

DJ Sharkey

Mar 18, 2017, 11:55:21 PM
to BikeNYC and CitiBikeNYC Hackers
Hi all,

I was wondering if anyone else has noticed a recent change in the trip data format. It looks like, starting with the October 2016 file, values in the CSV data files are no longer quoted. That actually makes it much easier to use Hive to run SQL over the data without doing any work to load it into a database (e.g. we can just dump the files in S3 and query them with AWS Athena!).

For example, here is a row from the October CSV ( https://s3.amazonaws.com/tripdata/201610-citibike-tripdata.zip ):
328,2016-10-01 00:00:07,2016-10-01 00:05:35,471,Grand St & Havemeyer St,40.71286844,-73.95698119,3077,Stagg St & Union Ave,40.70877084,-73.95095259,25254,Subscriber,1992,1

And for comparison, a row from the September 2016 data, in the old quoted format:

"975","9/1/2016 00:00:02","9/1/2016 00:16:18","312","Allen St & Stanton St","40.722055","-73.989111","313","Washington Ave & Park Ave","40.69610226","-73.96751037","22609","Subscriber","1985","1"
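
To make the difference concrete, here is a minimal Python sketch using the two rows above. It compares a naive split on commas (roughly what a delimiter-only reader does, as far as I understand it) with a quote-aware CSV parser. The helper names naive_split and csv_parse are just mine for illustration.

```python
import csv
import io

# The two sample rows from this post, copied verbatim.
new_row = ("328,2016-10-01 00:00:07,2016-10-01 00:05:35,471,"
           "Grand St & Havemeyer St,40.71286844,-73.95698119,3077,"
           "Stagg St & Union Ave,40.70877084,-73.95095259,25254,"
           "Subscriber,1992,1")
old_row = ('"975","9/1/2016 00:00:02","9/1/2016 00:16:18","312",'
           '"Allen St & Stanton St","40.722055","-73.989111","313",'
           '"Washington Ave & Park Ave","40.69610226","-73.96751037",'
           '"22609","Subscriber","1985","1"')


def naive_split(line):
    """Split on commas only; quote characters stay inside the values."""
    return line.split(",")


def csv_parse(line):
    """Parse one line with a quote-aware CSV reader."""
    return next(csv.reader(io.StringIO(line)))


# For the unquoted row both approaches give the same 15 fields.
print(naive_split(new_row) == csv_parse(new_row))

# For the quoted row the naive split leaves the quotes in every value,
# so the first field is not something that would parse as an integer.
print(naive_split(old_row)[0])
print(csv_parse(old_row)[0])
```

Note that the timestamp format differs between the two rows as well (9/1/2016 vs 2016-10-01), so removing the quotes alone would not make the old rows match the new ones.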

I know both are valid CSV, but the new format lets us use the more capable LazySimpleSerDe instead of OpenCSVSerde when creating a Hive table. One consequence of this switch is that columns in the Hive table can be declared with proper data types instead of all as strings (see http://stackoverflow.com/questions/28603201/hive-table-creation-using-opencsv-serde for info).
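
One possible workaround for the older files would be to rewrite them into the new unquoted layout before uploading. Below is a rough Python sketch of that idea, not something I have run over the full dataset. It assumes the old timestamps always look like the September row above (month/day/year with seconds), that the two timestamps are the second and third columns, and that the header line is handled separately. The name normalize_line is made up for this example.

```python
import csv
import io
from datetime import datetime

# Assumption: old files use month/day/year timestamps as in the sample row.
OLD_FMT = "%m/%d/%Y %H:%M:%S"
NEW_FMT = "%Y-%m-%d %H:%M:%S"
DATE_COLUMNS = (1, 2)  # start time and stop time in the sample rows

old_row = ('"975","9/1/2016 00:00:02","9/1/2016 00:16:18","312",'
           '"Allen St & Stanton St","40.722055","-73.989111","313",'
           '"Washington Ave & Park Ave","40.69610226","-73.96751037",'
           '"22609","Subscriber","1985","1"')


def normalize_line(line):
    """Convert one quoted old-format data row to the unquoted new layout."""
    fields = next(csv.reader(io.StringIO(line)))
    for i in DATE_COLUMNS:
        fields[i] = datetime.strptime(fields[i], OLD_FMT).strftime(NEW_FMT)
    # Without quotes, a comma inside a value would shift every later column,
    # so refuse to write such a row rather than corrupt it silently.
    if any("," in f for f in fields):
        raise ValueError("field contains a comma: %r" % (fields,))
    return ",".join(fields)


print(normalize_line(old_row))
```

The comma check matters because, once the quotes are gone, a station name containing a comma would be indistinguishable from two separate columns.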

Has anyone else noticed this change, or does anyone know if it was intentional on Citi Bike's part? Also, has anyone run into the quoting problem when working with the older data, and if so, how did you work around it?

Thanks!
DJ