Question about tabular ingest process and Rdata files


Meghan Goodchild

Dec 10, 2019, 4:08:41 PM
to Dataverse Users Community
Would someone be able to explain some details about how the Rdata files are created as part of the tabular ingest process? 
1. Are these files created from the original file or the TAB file?
2. When are they created? As part of the tabular ingest process or are they created on the fly (i.e., when someone downloads it)?

Thanks for your help in understanding this process (which will help us with some work on the Dataverse-Archivematica integration project and the resulting METS files).

Best,
Meghan
Scholars Portal Dataverse

Philip Durbin

Dec 10, 2019, 5:06:38 PM
to dataverse...@googlegroups.com
Yes, I do believe that Rdata files are created from the tab-separated file rather than the original files. I say this because I'm seeing this in the code[1]...

RJobRequest sro = new RJobRequest(dataVariables, vls);
sro.setTabularDataFileName(tabFile.getAbsolutePath());
sro.setRequestType(SERVICE_REQUEST_CONVERT);
sro.setFormatRequested(FILE_TYPE_RDATA);
resultInfo = dfs.execute(sro);

... followed by some operations on that "tabFile".

I hope this is right. :)

I'm not sure if they are created on the fly or not.

I hope this helps,

Phil


--
You received this message because you are subscribed to the Google Groups "Dataverse Users Community" group.
To unsubscribe from this group and stop receiving emails from it, send an email to dataverse-commu...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/dataverse-community/3489b53e-3061-4ce8-8ec9-529a6c69991d%40googlegroups.com.

Philip Durbin

Dec 11, 2019, 6:18:24 AM
to dataverse...@googlegroups.com
Whoops! Correction! From a closer look at the same code, it looks like Stata and SPSS files are converted directly from those formats to RData. This is in `dfs.directConvert(origFile, origFormat)` at https://github.com/IQSS/dataverse/blob/v4.18.1/src/main/java/edu/harvard/iq/dataverse/dataaccess/DataConverter.java#L240

So, to summarize, as someone who didn't write the code but took a quick look, here's what I think:

- Stata and SPSS files are converted from their original formats to RData format.
- Excel and CSV files are converted first to TSV which is then converted to RData format.

I hope this helps! I'll try to remember to ask someone who knows more about this at standup this morning.

Back to your other question. I just uploaded a Stata file called "stata14-auto-withstrls.tab" (we use this for testing). On disk, the Stata file is called 16ef4a6bd9b-6d95589b48a2.orig, and a TSV file created by Dataverse is called 16ef4a6bd9b-6d95589b48a2. Then I clicked "Download" and then "RData Format", and a file appeared on disk called 16ef4a6bd9b-6d95589b48a2.RData.

From this I conclude that RData files are not created on ingest. A download will trigger their creation. I'm not sure if there are other ways.
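
For anyone who wants to repeat this check on their own installation, here is a minimal sketch in Java of the naming pattern observed above. It is not Dataverse code: the class name, method names, and the idea of passing in a storage directory are all assumptions made for illustration. Only the ".orig" and ".RData" suffixes and the example identifier come from the observation in this message.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Illustrative only. The ".orig" and ".RData" suffixes are the ones observed in this thread;
// the class, the method names, and the directory layout are hypothetical.
public class StoredFileInspector {

    // Saved original upload (for example the Stata file).
    static Path originalFile(Path dir, String storageId) {
        return dir.resolve(storageId + ".orig");
    }

    // Tab-separated file produced by ingest (no suffix in the observation above).
    static Path tabFile(Path dir, String storageId) {
        return dir.resolve(storageId);
    }

    // RData file that appeared only after the first RData download.
    static Path rdataFile(Path dir, String storageId) {
        return dir.resolve(storageId + ".RData");
    }

    static boolean rdataAlreadyGenerated(Path dir, String storageId) {
        return Files.isRegularFile(rdataFile(dir, storageId));
    }

    public static void main(String[] args) throws IOException {
        // Simulate the observation in a temporary directory rather than a real file store.
        Path dir = Files.createTempDirectory("dv-sketch");
        String id = "16ef4a6bd9b-6d95589b48a2";
        Files.createFile(originalFile(dir, id));
        Files.createFile(tabFile(dir, id));
        System.out.println("RData present after ingest? " + rdataAlreadyGenerated(dir, id));
        Files.createFile(rdataFile(dir, id)); // stands in for the first RData download
        System.out.println("RData present after first download? " + rdataAlreadyGenerated(dir, id));
    }
}
```

A check like this could tell a downstream tool whether an RData derivative exists yet for a given file, assuming the installation uses local file storage with this naming.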

Phil

Meghan Goodchild

Dec 11, 2019, 11:07:53 AM
to Dataverse Users Community
Thanks Phil. This information is helpful! If you have any updates from your standup meeting this morning, please let us know!

Thanks again,
Meghan



Leonid Andreev

Dec 11, 2019, 12:16:40 PM
to Dataverse Users Community
Hi Meghan, 
Confirming what Phil said - if the original ingested file was Stata (*.dta) or SPSS (*.sav or *.por), we use R package "foreign" to directly convert that saved original file to an .RData dataframe. 
For all the other supported formats, the dataframe is generated by R from the tab-delimited file and the variable metadata in the database. 
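
The two paths described above can be sketched roughly as follows. This is a simplified illustration in Java, not the Dataverse implementation: the class, enum, and method names are hypothetical, and choosing the path by file extension is an assumption made to keep the example short (the real code may well decide by format or MIME type instead).

```java
import java.util.Locale;
import java.util.Set;

// Illustrative only: not the Dataverse implementation. All names here are hypothetical.
public class RDataSourceChooser {

    enum Source { ORIGINAL_FILE, TAB_FILE_PLUS_METADATA }

    // Extensions named in this thread as being converted directly from the saved original.
    private static final Set<String> DIRECT_EXTENSIONS = Set.of("dta", "sav", "por");

    static Source chooseSource(String originalFileName) {
        int dot = originalFileName.lastIndexOf('.');
        String ext = dot < 0
                ? ""
                : originalFileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        // Stata and SPSS originals are read directly; everything else goes through
        // the tab-delimited file plus the variable metadata.
        return DIRECT_EXTENSIONS.contains(ext)
                ? Source.ORIGINAL_FILE
                : Source.TAB_FILE_PLUS_METADATA;
    }

    public static void main(String[] args) {
        String[] names = {"survey.dta", "survey.sav", "survey.por", "survey.csv", "survey.xlsx"};
        for (String name : names) {
            System.out.println(name + " -> " + chooseSource(name));
        }
    }
}
```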

The file is generated on demand, the first time the RData format download is requested (that is, not during ingest). It is then cached alongside the native/original format and the .tsv, so that we don't have to generate it again.
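
The generate-on-first-request behavior can be pictured with a small sketch. This is an in-memory stand-in written in Java for illustration only: the class and method names are hypothetical, the converter is a placeholder for the R conversion, and the real cache is a file stored next to the other formats rather than a map.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Illustrative only: an in-memory stand-in for "generate on first request, then reuse".
// All names are hypothetical; the real cached copy is a file on storage, not a map entry.
public class OnDemandRDataCache {

    private final Map<String, byte[]> cache = new HashMap<>();
    private final Function<String, byte[]> converter; // placeholder for the R conversion
    private int conversions = 0;

    OnDemandRDataCache(Function<String, byte[]> converter) {
        this.converter = converter;
    }

    byte[] download(String fileId) {
        byte[] cached = cache.get(fileId);
        if (cached == null) {
            // First request for this file: run the conversion and keep the result.
            cached = converter.apply(fileId);
            conversions++;
            cache.put(fileId, cached);
        }
        return cached;
    }

    int conversionCount() {
        return conversions;
    }

    public static void main(String[] args) {
        OnDemandRDataCache c = new OnDemandRDataCache(id -> ("rdata-for-" + id).getBytes());
        c.download("file-1"); // converts
        c.download("file-1"); // served from cache
        c.download("file-2"); // converts
        System.out.println("conversions run: " + c.conversionCount());
    }
}
```

In this sketch three downloads trigger only two conversions, because the second request for the same file reuses the stored result.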

best,
-Leonid

Meghan Goodchild

Dec 11, 2019, 4:13:57 PM
to Dataverse Users Community
Great, thanks Leonid for confirming. Very helpful!