Loading csv data in a .txt file on the internet


Epee Sharkey

Feb 1, 2013, 1:45:53 PM
to project...@googlegroups.com
Hi,

As a brief preamble: I wanted to say thanks to johnmyleswhite and anyone else involved in putting this really useful tool together. I discovered it yesterday via the Coursera class on Data Analysis, which majors on the use of R. PT looks like it will help save me from myself when I am down in the "analysis trenches" over the coming weeks.

Now the substantive point : 

In trying to load earthquake data from an internet file using PT via a .url input file specification : 


The data loads fine but comes in as xxx lines of 1 variable rather than the xxx lines of 10 variables I was hoping for. I say "hoping for" because the file is actually a well-formed CSV file that just happens to have a .txt extension (for instance, if I download it and change the extension to .csv, it loads fine via the read.csv function).

I've looked into the R code (i.e. on GitHub) and can see that any file with a .txt extension is read using the .wsv (whitespace-separated values) reader, i.e. the read.csv function with the sep = ' ' parameter.

As a test, running the following commands in the console gives these results:

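> con2 <- url("http://earthquake.usgs.gov/earthquakes/catalogs/eqs7day-M1.txt", "r")
> equake <- read.csv(con2, header = TRUE, sep = " ")
> close(con2)
Result: same as with PT - xxx lines of 1 variable

> # reopen the connection, since the first read consumed it
> con2 <- url("http://earthquake.usgs.gov/earthquakes/catalogs/eqs7day-M1.txt", "r")
> equake <- read.csv(con2, header = TRUE, sep = ",")
> close(con2)
Result: the 'hoped for result' - xxx lines of 10 variables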
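
The same effect can be reproduced offline, without the USGS file. This is only a minimal sketch: the sample text, its column names and its values are made up for illustration and are not taken from the real feed.

```r
# Hypothetical three-line CSV sample; the values are invented for illustration.
sample_txt <- "Src,Eqid,Magnitude\nci,100,1.4\nnc,101,2.1"

# Space as separator: the lines contain no spaces, so each line is one field.
as_wsv <- read.csv(text = sample_txt, header = TRUE, sep = " ")

# Comma as separator: each line is split into its three fields.
as_csv <- read.csv(text = sample_txt, header = TRUE, sep = ",")

ncol(as_wsv)  # 1
ncol(as_csv)  # 3
```

So the separator alone explains the difference between the two results above; the file extension only matters because it decides which separator gets used.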

My actual question: is it possible to specify a separator for files in this case? Or is it a case of accepting the data as it is and then looking to the 'munging' steps to separate it out? Another thought strikes me: can the /data subfolder contain a script with the two lines of R code I showed above, i.e. con2 <- url("foourl/~/foofile", "r"); equake <- read.csv(con2, header = TRUE, sep = ","), and would that have the same effect?

And a supplementary question: is it possible to 'timestamp' data accessed in this way? The 'earthquake' data here changes from minute to minute, so it would be most useful to be able to say: here is an analysis based on data accessed from foofile@foourl on dd/mm/yy hh:mm:ss.

Any input gratefully received ! 

Eoin 
UK
R 2.15.2
Rstudio 0.97.248
PT : 0.4-2
Win 7 on an antique but functional Lenovo tower. 

John Myles White

Feb 2, 2013, 10:50:37 AM
to project...@googlegroups.com
Hi Epee,

Glad you're enjoying ProjectTemplate.

I think you're hitting up against one of the subjective qualities of ProjectTemplate: it gains ease of use by discouraging configuration and encouraging standardization of practices. If a file is a CSV file, it should advertise itself as such.

In this case, I think the best solution is to write a script that grabs the data you're analyzing and then store that data permanently inside of the "data" folder with timestamp information added to the filenames. This makes your analysis much more reproducible. I personally would not be happy with an analysis program that changed results every time I ran it because it downloaded a different data set. I think keeping a breadcrumb trail of previous data sets is a good thing.

That said, you are right that ProjectTemplate can use R files inside of the "data" folder. If you put the two lines of code you mentioned inside a file like "data/downloader.R", then the script will run and load your data for you.

Sorry for the delay in responding. My dissertation is due in a month and it's sort of unfortunate for me that there's a surge of interest in ProjectTemplate right when I have the least time possible to answer questions.

Best,

 -- John

Epee Sharkey

Feb 2, 2013, 11:06:31 AM
to project...@googlegroups.com
Many thanks John,

Indeed, I had just experimented (about 10 minutes ago) and found that I can make a file called getonlineequakedata.R in ~/data with the following lines:

tempcon<-url("http://earthquake.usgs.gov/earthquakes/catalogs/eqs7day-M1.txt","r")
equakeonline<-read.csv(tempcon,sep=",",header=TRUE)
close(tempcon)

to read the data as needed. 

I appreciate your point about keeping a breadcrumb trail of versions. I guess it is feasible to adapt my script so that it first downloads the file, saving it under a timestamped name, and then loads the timestamped copy.
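
Something along these lines might do it. This is a rough sketch only: fetch_snapshot() and its arguments are hypothetical names, not part of ProjectTemplate, and the file name pattern is just one possible choice.

```r
# Sketch: save a timestamped snapshot of a remote file and return its path.
# fetch_snapshot(), dest_dir and prefix are hypothetical names for illustration.
fetch_snapshot <- function(src_url, dest_dir = "data", prefix = "equake") {
  # Use '-' rather than ':' in the time part so the name is valid on Windows.
  stamp <- format(Sys.time(), "%Y-%m-%d_%H-%M-%S")
  dest <- file.path(dest_dir, paste0(prefix, "_", stamp, ".csv"))
  download.file(src_url, destfile = dest, quiet = TRUE)
  dest
}

# Usage (needs network access; URL as in the lines above):
# snapshot <- fetch_snapshot("http://earthquake.usgs.gov/earthquakes/catalogs/eqs7day-M1.txt")
# equakeonline <- read.csv(snapshot, header = TRUE, sep = ",")
```

Saving the copy with a .csv extension means the file then advertises itself as a CSV file. One thing to check is how the snapshots that pile up in the data folder interact with PT's automatic loading; I have not tested that.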

ProjectTemplate is being promoted on Prof Leek's Data Analysis Coursera MOOC (in the Week 2 lecture notes released last Sunday). I think in the region of 40,000 people signed up; even if only 10% are actually taking the course and only 10% of those check out PT, that is still a lot of traffic, I guess.

Best wishes for your dissertation. I hope your supervisor gives you some credit for putting this great tool out there as well as for your thesis.

Eoin 
