Dataverse Ingest Queue workflow

Kaitlin Newson

Apr 12, 2022, 4:02:53 PM4/12/22
to Dataverse Users Community
Hi Dataverse community,

Our team is in the process of a data migration to our Dataverse installation, and we had some questions about the ingest process. Specifically, we're wondering how the ingest queue works. Are files ingested one at a time, or are multiple files processed at once? If more than one, is that value configured somewhere? Is there any other important information we should know about how this queue works?

We ask because we want to make sure we won't cause major issues in our production environment when we upload a large number of files that require ingest. I'm also happy to add relevant info on ingest to the docs.

Thanks!

leo...@g.harvard.edu

Apr 12, 2022, 7:42:32 PM4/12/22
to Dataverse Users Community
Hi Kaitlin, 
I'll try to answer this, but I want to mention right away that we have just started talking about reorganizing/reimplementing that queue. The current implementation has not been touched much in many years, and it may be time to build something better, primarily to provide a better way to monitor and manage the queue: to see what's on it, how long it should be expected to take, and so on. There's an open issue for this, and I'm hoping we'll get it done for the upcoming Dataverse v6.
The files are ingested one at a time. Ingest can be very expensive in terms of both memory and CPU cycles, so even with only one file being processed at a time, if you add a large number of large "ingestable" files (Stata, SPSS, ...) at once, your server can end up struggling with that queue for hours or days in ways that are noticeable to your users. How much data are we talking about, anyway?

If you want to prevent your installation from being overloaded like that, I would set the ingest size limits to something low. These cutoff limits can be set either for all ingestable files or for specific formats; if a file is larger than the limit, Dataverse skips putting it on the ingest queue and adds it to the dataset as is, as a raw Stata or SPSS file, etc. Later on you can decide which of these files you want to ingest; we have an API for ingesting individual existing datafiles. In other words, the ingest could be done later, without committing to putting all of these files on the queue at once.
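For illustration, setting a global limit and later re-ingesting one specific file could look roughly like the Python sketch below. The base URL, token, file id and the 10 MB cutoff are all placeholders, the /api/admin endpoints are normally only reachable from localhost, and the exact setting syntax (including per-format variants) is in the Installation Guide.

# Rough sketch only: placeholder URL, token, file id and size limit.
import requests

BASE = "http://localhost:8080"   # /api/admin is normally restricted to localhost
API_TOKEN = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"   # superuser API token

# Cap automatic ingest at ~10 MB; larger tabular files get stored as-is.
# (The :TabularIngestSizeLimit setting; it can also be set per format.)
requests.put(f"{BASE}/api/admin/settings/:TabularIngestSizeLimit", data="10485760")

# Later, queue a specific stored file for ingest with the reingest API.
file_id = 1234   # hypothetical database id of a file that was added as-is
requests.post(f"{BASE}/api/files/{file_id}/reingest",
              headers={"X-Dataverse-key": API_TOKEN})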
What is your planned data migration process? Is it going to be a scripted batch job, using our native APIs to create datasets, add files, etc.?
There are some peculiarities of how the ingest queue currently operates (especially when you need to purge something that's already on the queue; we recently realized that this does not work all that well in some situations). But I'll skip that for now. 

Best,
-Leo.

Victoria Lubitch

Apr 13, 2022, 10:29:38 AM4/13/22
to Dataverse Users Community

Hi Leo, I am working with Kaitlin on this migration project, so I will try to answer. Yes, our plan is a scripted job using the native APIs. It creates a dataset, adds files to it, and then publishes it. At the moment everything is done sequentially: the script waits for locks to clear before adding additional files and before publishing. We are ingesting SPSS files. Our biggest file so far is around 93 MB, and it can take almost an hour to ingest; the rest are not as big, but they still take time. Publishing can also be slow, since it does checksum validation and DOI validation. We are also experimenting with threading (parallelizing the create-dataset part of each job).
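In rough outline, the per-dataset part of the script does something like the sketch below. The collection alias, token, payloads and helper names are placeholders rather than our actual code; the endpoint paths follow the native API guide.

# Simplified sketch of the per-dataset loop; names and payloads are illustrative.
import time
import requests

BASE = "https://dataverse.example.edu"          # placeholder installation URL
HEADERS = {"X-Dataverse-key": "our-api-token"}  # placeholder API token

def wait_for_locks(dataset_id):
    # Poll /api/datasets/{id}/locks until the dataset has no locks left.
    while True:
        r = requests.get(f"{BASE}/api/datasets/{dataset_id}/locks", headers=HEADERS)
        if not r.json().get("data"):   # empty list => no Ingest (or other) locks
            return
        time.sleep(10)

def migrate_one(dataset_json, spss_path):
    # 1. Create the dataset in the target collection.
    r = requests.post(f"{BASE}/api/dataverses/our-collection/datasets",
                      json=dataset_json, headers=HEADERS)
    dataset_id = r.json()["data"]["id"]

    # 2. Add the SPSS file, then wait for the ingest lock to clear.
    with open(spss_path, "rb") as f:
        requests.post(f"{BASE}/api/datasets/{dataset_id}/add",
                      files={"file": f}, headers=HEADERS)
    wait_for_locks(dataset_id)

    # 3. Publish once the dataset is unlocked.
    requests.post(f"{BASE}/api/datasets/{dataset_id}/actions/:publish",
                  params={"type": "major"}, headers=HEADERS)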
Which do you think is better: to create all the datasets and then add one SPSS file to each dataset at once, without waiting for ingest to finish? In that case we would have to monitor the ingests so we know when these datasets can be published. Or is it safer to do it sequentially, as we do now, and perhaps add a thread pool so that several datasets are processed in parallel?

leo...@g.harvard.edu

Apr 13, 2022, 1:39:26 PM4/13/22
to Dataverse Users Community

Hi, 
The way your import process is organized now sounds very sensible to me, specifically the part about waiting for the locks to clear before proceeding to the next step. This way the queueing is managed outside Dataverse, which frankly is more manageable and transparent than adding all your ingestable files at once and then waiting for the JMS queue to process them.
Note that you can speed up the publishing step by disabling the checksum validation. You probably do want to verify all the checksums eventually, to make sure the files are all there and intact, but that is something that could be done later, if the goal is to get all the imports processed as quickly as possible.
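If I remember the setting name correctly, something like this should toggle it (a sketch, assuming the :FileValidationOnPublishEnabled setting and local access to the admin settings API):

# Sketch: turn checksum validation at publish time off for the migration,
# then back on afterwards. Assumes the :FileValidationOnPublishEnabled setting
# and that /api/admin is reachable (normally localhost only).
import requests

ADMIN = "http://localhost:8080/api/admin/settings"

requests.put(f"{ADMIN}/:FileValidationOnPublishEnabled", data="false")
# ... run the migration and publish the datasets ...
requests.put(f"{ADMIN}/:FileValidationOnPublishEnabled", data="true")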
Threading, i.e. importing multiple datasets and files in parallel, is definitely an option too. I'm assuming we are talking about splitting the source data into N batches and running N sequential jobs in parallel, letting Payara thread them as needed. I'm fairly positive that the ingest jobs would still be executed one at a time (since there's one central JMS queue), but everything else involved in creating datasets, storing files, etc. would be done in parallel, meaning N times the load on the server. It could be prudent to try it with 2 batches in parallel first and see whether it slows the server down for everybody else in a noticeable way.
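Something along these lines, as a rough sketch; it assumes a per-dataset helper like the migrate_one() in Victoria's message and an already-built list of work items:

# Rough sketch: split the work into N batches and run N sequential jobs in
# parallel; each job still creates, ingests and publishes one dataset at a time.
# migrate_one() and work_items are assumed from the sequential script above.
from concurrent.futures import ThreadPoolExecutor

def run_batch(batch):
    for dataset_json, spss_path in batch:
        migrate_one(dataset_json, spss_path)   # create -> add file -> wait -> publish

N = 2   # start with 2 batches and watch the server load before going higher
batches = [work_items[i::N] for i in range(N)]
with ThreadPoolExecutor(max_workers=N) as pool:
    list(pool.map(run_batch, batches))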

Victoria Lubitch

Apr 14, 2022, 10:06:46 AM4/14/22
to Dataverse Users Community
Thank you, Leo, for your answer. We will probably proceed as you suggested: a sequential process for each dataset, waiting on locks, but running several datasets in parallel.