Strange time zone behavior with readr

92 views
Skip to first unread message

ArjunaCap

unread,
May 22, 2016, 4:42:37 PM5/22/16
to manipulatr
Given the sample df:

df
<- data_frame(dates = lubridate::ymd_hms("2016-05-21 04:59:00") + hours(0:5), my_data = 11:16)


we inspect the default time zone properties:

> tz(df$dates)
[1] "UTC"


similarly,

tz
(df)
[1] "UTC"


though it's unclear what the tz of the entire df really means.  
anyway, let's set the tz of the dates variable:

> tz(df$dates) <- "America/New_York"


if we now save and reload the file with readr:
readr::write_csv(df, "tmp.csv")



We see that the times zones have been shifted: 

> readr::read_csv("tmp.csv")

Source: local data frame [6 x 2]


                dates my_data
               
(time)   (int)
1 2016-05-21 08:59:00      11
2 2016-05-21 09:59:00      12
3 2016-05-21 10:59:00      13
4 2016-05-21 11:59:00      14
5 2016-05-21 12:59:00      15



this is different behavior than the base functions. Ie,

> write.csv(df, "tmp.csv")


gives

> read.csv("tmp.csv")


  X               dates my_data
1 1 2016-05-21 04:59:00      11
2 2 2016-05-21 05:59:00      12
3 3 2016-05-21 06:59:00      13
4 4 2016-05-21 07:59:00      14
5 5 2016-05-21 08:59:00      15
6 6 2016-05-21 09:59:00      16


Why the difference in behavior, and is it desirable?  

Based on the help pages, I'm guessing it's maybe related to a diff in implementation of ISO8602, but it's unclear why or how.

jim holtman

unread,
May 22, 2016, 5:05:50 PM5/22/16
to ArjunaCap, manipulatr
You need to look at the files written to disk.  Here is the one written by 'readr':

dates,my_data
2016-05-21T08:59:00Z,11
2016-05-21T09:59:00Z,12
2016-05-21T10:59:00Z,13
2016-05-21T11:59:00Z,14
2016-05-21T12:59:00Z,15
2016-05-21T13:59:00Z,16

Notice it has the time zone (UTC) with the data and the time was offset 4 hour from your original data.

Here is what was written out by write.csv:

"","dates","my_data"
"1",2016-05-21 04:59:00,11
"2",2016-05-21 05:59:00,12
"3",2016-05-21 06:59:00,13
"4",2016-05-21 07:59:00,14
"5",2016-05-21 08:59:00,15
"6",2016-05-21 09:59:00,16

Notice that it does not have the time zone and has the original time time you had.  So the files are different, so that is why the results are different.



Jim Holtman
Data Munger Guru
 
What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.

--
You received this message because you are subscribed to the Google Groups "manipulatr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to manipulatr+...@googlegroups.com.
To post to this group, send email to manip...@googlegroups.com.
Visit this group at https://groups.google.com/group/manipulatr.
For more options, visit https://groups.google.com/d/optout.

Michael Cawthon

unread,
May 22, 2016, 9:26:17 PM5/22/16
to jim holtman, manipulatr

Thank you for the response.

I see and agree that readr::write_csv is in fact behaving differently than base::write.csv, and that, as you point out, this difference is observable in the file output.

and yet, my question remains: is this behavior desirable and intentional, esp given that write_csv sould seems intended as a faster but side effect free/ idempotent alternative to write.csv?

as another point of comparison, the treatment of dates under write_rds works like write.csv; ie, it coerces the actual date and preserves it.

-- 

Michael Cawthon
Chief Investment Officer
Green Street Energy LLC
mcaw...@greenstenergy.com
p: 479-442-1407

Hadley Wickham

unread,
May 23, 2016, 10:18:43 AM5/23/16
to Michael Cawthon, jim holtman, manipulatr
I'd take a lot, but your example isn't reproducible.
Hadley
http://hadley.nz

ArjunaCap

unread,
May 23, 2016, 11:00:59 AM5/23/16
to manipulatr, mcaw...@greenstenergy.com, jhol...@gmail.com
Let me try again:

from the orig post, here's a toy df:

df <- data_frame(dates = lubridate::ymd_hms("2016-05-21 04:59:00") + hours(0:5), my_data = 11:16)

dates are EST, so we set them

lubridate::tz(df$dates) <- "America/New_York"

the following three alternatives give inconsistent output:

way #1
#expected date/time behavior; note the hour
write
.csv(df, "tmp.csv", row.names = FALSE)
way1
<- read_csv("tmp.csv")

way #2

#unexpected date/time behavior; note the hour shift

write_csv
(df, "tmp.csv")
way2
<- read_csv("tmp.csv")


way #3

# expected date behavior

write_rds
(df, "tmp.rds")
way3
<- read_rds("tmp.rds")



I'd naively expect them to all be the same


But failing that, I'd at least expect way #2 and way #3 to be consistent, but they aren't, ie:


#this is false
identical
(way2, way3)

Brandon Hurr

unread,
May 23, 2016, 12:22:18 PM5/23/16
to ArjunaCap, manipulatr, jhol...@gmail.com
So essentially, when readr::write_csv() writes out the time it adds 4 hours when it takes the UTC and converts to Eastern time as it writes it out.

library(readr)
library(tibble)
library(lubridate)
library(dplyr)
df <- data_frame(dates = lubridate::ymd_hms("2016-05-21 04:59:00") + hours(0:5), my_data = 11:16)

#change timezone
tz(df$dates) <- "America/New_York"

#expected date/time behavior; note the hour
write.csv(df, "tmp.csv", row.names = FALSE)
way1 <- read_csv("tmp.csv")

#unexpected date/time behavior; note the hour shift
write_csv(df, "tmp.csv")
way2 <- read_csv("tmp.csv")

# expected date behavior
write_rds(df, "tmp.rds")
way3 <- read_rds("tmp.rds")

#this is false
identical(way2, way3)

#have a look
glimpse(way1)
glimpse(way2)
glimpse(way3)

#even better
data_frame(dfdates = df$dates, way1dates = way1$dates, way2dates=way2$dates, way3dates= way3$dates)

Broken down more:

When output_column goes through and formats for writing the format function adds 4 hours to the time by converting "America/New_York" to "UTC". During DST, the difference between the times is indeed 4 hours so the difference checks.

In base R it looks like it just makes it as.character() and ignores the time zone/locale information: https://github.com/wch/r-source/blob/e5b21d0397c607883ff25cca379687b86933d730/src/library/utils/R/write.table.R#L48-L62

#start over
df <- data_frame(dates = lubridate::ymd_hms("2016-05-21 04:59:00") + hours(0:5), my_data = 11:16)
#what's it look like as it's made?
format(df$dates, "%Y-%m-%dT%H:%M:%OSZ", tz = "UTC")

#let's make it EDT (as of typing)
tz(df$dates) <- "America/New_York"
#what's that look like? It added 4 hours again
format(df$dates, "%Y-%m-%dT%H:%M:%OSZ", tz = "UTC")

#let's make it UTC again
tz(df$dates) <- "UTC"
#back to normal
format(df$dates, "%Y-%m-%dT%H:%M:%OSZ", tz = "UTC")

I would say that when you change the Time Zone you are only doing that. The times are in that time zone. When readr::write_csv() formats them it puts them in UTC, which is proper ISO08601, and this adds 4 hours. When base::write.csv() writes it out, it seems to ignore the time zone completely and write out whatever it is as text. When this is read back in, R assumes it is UTC. So if your times are in "EDT" and you use write.csv() you are losing information and should be very careful if you're doing comparisons across time zones. 

At least I think that's what is going on... Hadley would be able to comment much better than I. 



# devtools::session_info()
Session info -----------------------------------------------------------------------------------------------
 setting  value                       
 version  R version 3.3.0 (2016-05-03)
 system   x86_64, darwin13.4.0        
 ui       AQUA                        
 language (EN)                        
 collate  en_US.UTF-8                 
 tz       America/Los_Angeles         
 date     2016-05-23                  

Packages ---------------------------------------------------------------------------------------------------
 package    * version    date       source                           
 assertthat   0.1        2013-12-06 CRAN (R 3.3.0)                   
 DBI          0.4-1      2016-05-08 CRAN (R 3.3.0)                   
 devtools     1.11.1     2016-04-21 CRAN (R 3.3.0)                   
 digest       0.6.9      2016-01-08 CRAN (R 3.3.0)                   
 dplyr      * 0.4.3.9001 2016-05-19 Github (hadley/dplyr@b8ad8a9)    
 lazyeval     0.1.10     2015-01-02 CRAN (R 3.3.0)                   
 lubridate  * 1.5.6.9000 2016-05-16 Github (hadley/lubridate@5b8c8fe)
 magrittr     1.5        2014-11-22 CRAN (R 3.3.0)                   
 memoise      1.0.0      2016-01-29 CRAN (R 3.3.0)                   
 R6           2.1.2      2016-01-26 CRAN (R 3.3.0)                   
 Rcpp         0.12.5     2016-05-14 CRAN (R 3.3.0)                   
 readr      * 0.2.2.9000 2016-05-10 Github (hadley/readr@870058a)    
 stringi      1.0-1      2015-10-22 CRAN (R 3.3.0)                   
 stringr      1.0.0      2015-04-30 CRAN (R 3.3.0)                   
 tibble     * 1.0-5      2016-05-19 Github (hadley/tibble@19235d2)   
 withr        1.0.1      2016-02-04 CRAN (R 3.3.0)                   

Hadley Wickham

unread,
May 24, 2016, 9:20:02 AM5/24/16
to Brandon Hurr, ArjunaCap, manipulatr, jim holtman
(Thanks to Brandon for making an example that I could actually copy
and paste in to R)

I think the problem is that the data frame method doesn't print the
associated time zone, so you can't see what time a column actually
represents. The readr behaviour looks correct to me:

library(readr)
library(tibble)
df <- data_frame(x = lubridate::ymd_hms("2016-01-01 01:00:00") + hours(0:1))
tz(df$x) <- "America/New_York"
df$x
#> [1] "2016-01-01 01:00:00 EST" "2016-01-01 02:00:00 EST"
write_csv(df, "tmp.csv")
read_csv("tmp.csv")$x
#> [1] "2016-01-01 06:00:00 UTC" "2016-01-01 07:00:00 UTC"

ArjunaCap

unread,
May 24, 2016, 4:24:36 PM5/24/16
to manipulatr, brando...@gmail.com, mcaw...@greenstenergy.com, jhol...@gmail.com
ok.  for completeness: does the data frame method account for the differential printing behavior?

#one copy/paste-able block:
library(readr) 
library(tibble) 
df <- data_frame(x = lubridate::ymd_hms("2016-01-01 01:00:00") + hours(0:1)) 
tz(df$x) <- "America/New_York" 
df$x 
#> [1] "2016-01-01 01:00:00 EST" "2016-01-01 02:00:00 EST" 
write_csv(df, "tmp.csv") 
read_csv("tmp.csv")$x 
#> [1] "2016-01-01 06:00:00 UTC" "2016-01-01 07:00:00 UTC" 
write_rds(df, "tmp.rds") 
read_rds("tmp.rds")$x 

#> [1] "2016-01-01 01:00:00 EST" "2016-01-01 02:00:00 EST"

Hadley Wickham

unread,
May 24, 2016, 4:33:18 PM5/24/16
to ArjunaCap, manipulatr, Brandon Hurr, jim holtman
That looks correct to me - those are the same times, albeit in
different time zones.
Hadley
Reply all
Reply to author
Forward
0 new messages