[R] download/retain text file structure with RCurl/getURL()

zack holden

Jan 19, 2009, 1:26:37 PM
to r-h...@r-project.org

Dear list,

I'm trying to download a text file directly from the internet using the RCurl package and its getURL() function. Duncan Temple Lang graciously helped me solve the first step in this problem using the following command:

#################
txtfile <- getURL('ftp://ftp.wcc.nrcs.usda.gov/data/snow/snow_course/table/history/idaho/13e19.txt',
ftp.use.epsv = FALSE)
#################

This brings the text file into R as a single long character string. I've spent many hours now trying to get this text file into R in a sensible form. I've tried every variant of the different options in the getURL help file, as well as different strsplit() commands to try to break this character string into sensible rows and columns, to no avail.

Can anyone suggest a solution for doing this? I suspect there is a getURL command I'm missing. Alternatively, do I really have to break this long character string into rows and columns that I can then assemble into a table?

I'd be grateful for any advice.

Thanks in advance,

Zack


______________________________________________
R-h...@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Gabor Grothendieck

Jan 19, 2009, 1:38:21 PM
to zack holden, r-h...@r-project.org
If you are having problems with the default download.file method you
can try method = "wget":

f <- "ftp://ftp.wcc.nrcs.usda.gov/data/snow/snow_course/table/history/idaho/13e19.txt"
download.file(f, basename(f), method = "wget")
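
Once the file is on disk, it no longer needs to be split by hand: readLines() returns one element per line, and read.fwf() accepts a file path directly. The following is only a sketch under assumptions: the temporary file and its three lines are made up to stand in for the downloaded file, and the widths and column names are illustrative, not the real layout of 13e19.txt.

```r
# Stand-in for the downloaded file (made-up contents, NOT the real 13e19.txt)
path <- tempfile(fileext = ".txt")
writeLines(c("header line", "61 1  E/ST", "63 1  K/31"), path)

# readLines() gives one character element per line of the file
lines <- readLines(path)
length(lines)

# read.fwf() can read the file directly; skip = 1 drops the header line.
# Widths and column names are assumptions chosen to match the toy lines above.
dat <- read.fwf(path, widths = c(2, 2, 6), skip = 1,
                col.names = c("year", "card", "date"),
                stringsAsFactors = FALSE, strip.white = TRUE)
dat
```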

David Winsemius

Jan 19, 2009, 2:52:13 PM
to zack holden, r-h...@r-project.org
It's a fixed width format, with irregular entries, perhaps something
along the lines of:

read.fwf(textConnection(txtfile), skip = 8,  # skips the header
         widths = <column widths vector>,
         col.names = <colnames>,
         nrows = 48)  # drops the trailing summary text

perhaps :

widths = c(2, -1, 1, -1, 4, -1, 3 .... the rest   # the negative entries drop the white-space
col.names = c("year", "card", "Jan.date", "Jan.dep" ..... the rest
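
To illustrate how the negative widths behave, here is a small sketch on two made-up lines (not taken from the real file): each negative entry tells read.fwf() to skip that many characters, so the single blanks between fields are dropped rather than read as columns.

```r
# Two made-up fixed-width lines; the layout is an assumption for illustration
lines <- c("61 1 E/ST  10",
           "63 1 K/31  15")

# 2 chars (year), skip 1, 1 char (card), skip 1, 4 chars (date), skip 1, 3 chars (depth)
with_gaps <- read.fwf(textConnection(lines),
                      widths = c(2, -1, 1, -1, 4, -1, 3),
                      col.names = c("year", "card", "date", "dep"),
                      stringsAsFactors = FALSE)
with_gaps
```

Seven width entries produce only four columns, because the three negative entries are skipped.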

Just the first few columns seem to come in acceptably, although the
lines with all NA's will need to be deleted:
> read.fwf(textConnection(txtfile), skip = 8, # skips the header
+ widths = c(2, -1, 1, -1 ,4, -1, 3), # the -col entries drop the white-space
+ col.names = c("year","card", "Jan.date", "Jan.dep"), nrows=48 )
year card Jan.date Jan.dep
1 61 1 E/ST NA
2 62 1 E/ST NA
3 63 1 K/31 15
4 64 1 K/30 12
5 NA NA <NA> NA
6 65 1 E/ST NA
7 66 1 1/07 17
8 67 1 E/ST NA
9 68 1 K/28 12
10 69 1 K/31 22
11 NA NA <NA> NA
12 70 1 K/30 16
13 71 1 K/29 28
14 72 1 K/28 32
15 73 1 1/02 16
snip
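
One possible way to delete the all-NA rows mentioned above is to count the non-missing values per row and keep only rows with at least one. This is a sketch on a toy data frame whose values merely imitate the output shown; the names all_na and clean are introduced here for illustration.

```r
# Toy data frame imitating the printed output (values are illustrative)
df <- data.frame(year     = c(61, 63, NA, 65),
                 card     = c(1, 1, NA, 1),
                 Jan.date = c("E/ST", "K/31", NA, "E/ST"),
                 Jan.dep  = c(NA, 15, NA, NA),
                 stringsAsFactors = FALSE)

# A row is "all NA" when it has zero non-missing entries
all_na <- rowSums(!is.na(df)) == 0

# Keep every row that has at least one value
clean <- df[!all_na, ]
clean
```

Rows that are only partly missing, such as a year with no January depth, are kept.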
--
David Winsemius

zack holden

Jan 21, 2009, 10:53:03 AM
to spe...@stat.berkeley.edu, r-h...@r-project.org


Dear list,
I'm posting the solution to my problem in case others find it useful. This code was sent to me by Phil Spector. With a bit of cleaning, the result can easily be converted to a usable format. Thanks to Gabor Grothendieck, David Winsemius and Martin Morgan for also sending possible solutions. Thank you all for taking the time to help; I would not have solved this on my own.

###############################################
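require(RCurl)
# Fetch the file as one long character string
txtfile = getURL('ftp://ftp.wcc.nrcs.usda.gov/data/snow/snow_course/table/history/idaho/13e19.txt',ftp.use.epsv = FALSE)
# Split the string into one element per line
txtvec = strsplit(txtfile,'\n')[[1]]
# Field widths: a 4-character leading field, then six groups of three fields (5, 4 and 6 characters)
widths = c(4,rep(c(5,4,6),6))
# Parse lines 9 to 65 as fixed-width records
res = read.fwf(textConnection(txtvec[9:65]),widths=widths,stringsAsFactors=FALSE)
# Columns holding the second and third field of each group
nums = c(3,4,6,7,9,10,12,13,15,16,18,19)
# Convert those columns to numeric; entries that are not numbers become NA
res[,nums] = sapply(res[,nums],as.numeric)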
################################################
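
The last line of the code above converts character columns to numbers, and as.numeric() warns whenever an entry is not a number. A small sketch of that step on made-up values follows; the column names date, depth and value are hypothetical, and lapply() with suppressWarnings() is just one possible variant, not part of the code Phil sent.

```r
# Toy columns as read.fwf() might return them with stringsAsFactors = FALSE
# (values and column names are made up for illustration)
res <- data.frame(date  = c("K/31", "E/ST"),
                  depth = c(" 15", "   "),
                  value = c("4.2", " M "),
                  stringsAsFactors = FALSE)

# Columns to convert; blanks and non-numeric codes become NA
nums <- c(2, 3)
res[nums] <- lapply(res[nums], function(x) suppressWarnings(as.numeric(x)))

str(res)
```

Suppressing the warnings is a judgment call: it keeps the output quiet, but it also hides any entry that was expected to be numeric and was not.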

Best,

Zack

----------------------------------------
> Date: Mon, 19 Jan 2009 11:08:48 -0800
> From: spe...@stat.berkeley.edu
> To: zack_...@hotmail.com
> Subject: Re: [R] download/retain text file structure with RCurl/getURL()
>
> Zack -
> Here's a start:
>
> txtfile = getURL('ftp://ftp.wcc.nrcs.usda.gov/data/snow/snow_course/table/history/idaho/13e19.txt',ftp.use.epsv = FALSE)
> txtvec = strsplit(txtfile,'\n')[[1]]
> widths = c(4,rep(c(5,4,6),6))
> res = read.fwf(textConnection(txtvec[9:65]),widths=widths,stringsAsFactors=FALSE)
> nums = c(3,4,6,7,9,10,12,13,15,16,18,19)
> res[,nums] = sapply(res[,nums],as.numeric)
>
> Hope this helps.
> - Phil Spector
> Statistical Computing Facility
> Department of Statistics
> UC Berkeley
> spe...@stat.berkeley.edu
