How to request help for specific database queries in v4.6.1 - details of new datasets (and major dataset versions) published in previous x days

37 views
Skip to first unread message

MM - ADA

unread,
Aug 2, 2018, 2:15:33 AM8/2/18
to Dataverse Users Community
Hi - I am wondering how to request assistance to determine 2 specific but similar database queries in v4.6.1:

1. Determine the dataset details (name, doi, published date) for all new datasets (version number = 1, minor version number = 0) published in the previous x days.

2. Determine the dataset details (name, doi, published date) for new major versions of datasets (version number  > 1, minor version number = 0) published in the previous x days.


I know there is this document https://docs.google.com/document/d/1-Y_iUduSxdDNeK1yiGUxe7t-Md7Fy965jp4o4m1XEoE/edit but is it still being used?


Thanks!

julian...@g.harvard.edu

unread,
Aug 2, 2018, 9:02:43 AM8/2/18
to Dataverse Users Community
Hi Marina,

I think this is a good place to ask database query-related questions. This way others can correct me! :)

That database queries document is still being used. I don't think any of the queries in it have had to change because of database changes since 4.6.1, so it should still be a reliable reference. (The only database change I'm aware of that would be different between Dataverse 4.6.1 and the current version (4.9.1) is that for 4.6.1, parts of the persistent ID might be in the dataset table, while in later Dataverse versions, the PID was moved to the dvobject table. In my example query below, I don't specify which table those PID parts are coming from, so it should work no matter what table the PID is in for 4.6.1.

All of the info for both queries should be in these tables:
  • dataset (for DOI info in 4.6.1?)
  • datasetversion
  • datasetfield, datasetfieldtype and datasetfieldvalue (for the dataset title metadata field)
  • dvobject (for publication date)

For the first query, I'm assuming you also want to avoid harvested datasets. For both queries, I'm assuming that you want to avoid deaccessioned datasets:

select datasetfieldvalue.value as "title", publicationdate, protocol||':'||authority||'/'||identifier as "DOI"
from datasetfieldvalue
left outer join datasetfield on datasetfieldvalue.datasetfield_id = datasetfield.id
left outer join datasetfieldtype on datasetfield.datasetfieldtype_id = datasetfieldtype.id
left outer join datasetversion on datasetfield.datasetversion_id = datasetversion.id
join dataset on dataset.id = datasetversion.dataset_id
join dvobject on dvobject.id = dataset.id
where harvestingclient_id is null and versionstate != 'DEACCESSIONED' and datasetfield.template_id is null
and datasetfield.datasetfieldtype_id = 1

and datasetversion.createtime in (select min(datasetversion.createtime) as max from datasetversion group by datasetversion.dataset_id)
-- the min in (select min(datasetversion.createtime) is asking for the version of the dataset that was created first. Change that to max (select max(datasetversion.createtime) - for the last dataset version

and publicationdate >= 'yyyy-mm-dd'


Hope that helps!
Julian

julian...@g.harvard.edu

unread,
Aug 2, 2018, 9:10:15 AM8/2/18
to Dataverse Users Community
Hmmm, for your second query, to get only major versions, you might want to add

and minorversionnumber = 0

julian...@g.harvard.edu

unread,
Aug 2, 2018, 9:15:58 AM8/2/18
to Dataverse Users Community
For the publicationdate, instead of specifying a cutoff date, if you need or prefer to include the number of days (like 10 days), I'd use:

and publicationdate >= date_trunc('day', current_date - interval '10' day)

MM - ADA

unread,
Aug 6, 2018, 9:18:16 PM8/6/18
to Dataverse Users Community
Thanks for the assistance, Julian - it helped immensely.

I did tweak your queries a bit...

For new datasets (versionnumber=1, minorversionnumber=0, not harvested, state='RELEASED') this is what I used:

select to_char(publicationdate,'FMDD Month YYYY') as "publish date", datasetfieldvalue.value as "title", protocol||':'||authority||'/'||identifier as "DOI"
from datasetfieldvalue
left outer join datasetfield on datasetfieldvalue.datasetfield_id = datasetfield.id
left outer join datasetfieldtype on datasetfield.datasetfieldtype_id = datasetfieldtype.id
left outer join datasetversion on datasetfield.datasetversion_id = datasetversion.id
join dataset on dataset.id = datasetversion.dataset_id
join dvobject on dvobject.id = dataset.id
where harvestingclient_id is null and versionstate = 'RELEASED' and datasetfield.template_id is null
and datasetfield.datasetfieldtype_id = 1
and datasetversion.versionnumber = 1
and datasetversion.minorversionnumber = 0
and publicationdate >= date_trunc('day', current_date - interval '7' day)
order by publicationdate,title




For new major versions (versionnumber>1, minorversionnumber=0, not harvested, state='RELEASED') this is the query I used (using datasetversion.releasetime instead of publicationdate since publicationdate is the date the original v1.0 is published and is the same for all subsequent versions)

select datasetfieldvalue.value as "title",to_char(releasetime,'FMDD Month YYYY') as "publish date", protocol||':'||authority||'/'||identifier as "DOI", versionnumber, minorversionnumber
from datasetfieldvalue
left outer join datasetfield on datasetfieldvalue.datasetfield_id = datasetfield.id
left outer join datasetfieldtype on datasetfield.datasetfieldtype_id = datasetfieldtype.id
left outer join datasetversion on datasetfield.datasetversion_id = datasetversion.id
join dataset on dataset.id = datasetversion.dataset_id
join dvobject on dvobject.id = dataset.id
where harvestingclient_id is null and versionstate = 'RELEASED' and datasetfield.template_id is null
and datasetfield.datasetfieldtype_id = 1
and datasetversion.versionnumber = (select max(versionnumber) as maxversion from datasetversion where datasetversion.dataset_id=dataset.id and versionnumber > 1 group by datasetversion.dataset_id)
and datasetversion.minorversionnumber = 0
and releasetime >= date_trunc('day', current_date - interval '7' day)
order by title,versionnumber


The release time could be used in the first query as well...

Thanks again.

Marina
Reply all
Reply to author
Forward
0 new messages