SQL query to determine users that have filedownload access to specific datasets (identified by dataset id)?

93 views
Skip to first unread message

MM - ADA

unread,
Jan 29, 2019, 6:38:06 PM1/29/19
to Dataverse Users Community
Hi - I am trying to figure out the sql query to determine the users that have filedownload permission for specific datasets. (I chose ds4, ds7 and ds9 just to specify that it's not the first 3 datasets but specific datasets out of the many that ADA hosts).


The output desired is a crosstabs-type result (specifying the dataset_id to pick out ds4, ds7, ds9 in the query):

               ds4_title ds7_title ds9_title
user1 (user name)                      x            x
user8 (user name)                      x
user12 (user name)      x      x


Every user that has filedownload permission to any of the 3 will be included in the user column.

Would anybody have an idea as to how to get this result? There are so many tables involved!

Thanks!

M.

Gautier, Julian

unread,
Jan 30, 2019, 7:12:10 AM1/30/19
to dataverse...@googlegroups.com
Hi M,

Every (non-custom) role has the "file download" permission, so this query lists each dataset_id you specify along with the username that has any role on that dataset:

select

roleassignment.definitionpoint_id,

roleassignment.assigneeidentifier

from

roleassignment

where

definitionpoint_id in (‘datasetID1’, ‘datasetID2’)


The username in the roleassignment table is prepended with an @ sign.


If your installation has any custom roles that don't have the "file download" permission, you would have to find out those roles' role_ids in the roleassignment table and exclude them from the query. The dataverserole table has the role_ids and their names. That table's permissionbits column looks like it shows which permissions those roles have. Not sure if those numbers could be used somehow to figure out which users have certain permissions on datasets...


To replace the dataset_id with the title of the dataset, I combined that query with another query to get metadata values. This is where three other tables join:


select 
datasetfieldvalue.value,
roleassignment.assigneeidentifier
from 
datasetfieldvalue
left outer join datasetfield on datasetfield.iddatasetfieldvalue.datasetfield_id
left outer join datasetversion on datasetversion.iddatasetfield.datasetversion_id
join roleassignment on roleassignment.definitionpoint_id = datasetversion.dataset_id
where 
datasetfield.datasetfieldtype_id = 1
and datasetfield.template_id is null
and definitionpoint_id in (datasetID1’, ‘datasetID2)
and datasetversion.createtime in 
(
select max(datasetversion.createtime) as max 
from datasetversion 
group by datasetversion.dataset_id
)

This gets whatever dataset title is in the dataset title metadata field of each dataset's last version created.

The query doesn't organize the info the way you want, but I figured this would at least show how to get the info. Maybe postgresql's crosstab function could help with pivoting the table? (The database I work with doesn't have crosstab, and I can't figure out how to pivot without that function, so I just move the table into excel or google sheets and use pivot tables there, which also lets me count the number of roles each user has on each dataset.)

I hope this is helpful! I added the first query to our database queries cheat sheet

Julian

--
You received this message because you are subscribed to the Google Groups "Dataverse Users Community" group.
To unsubscribe from this group and stop receiving emails from it, send an email to dataverse-commu...@googlegroups.com.
To post to this group, send email to dataverse...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/dataverse-community/4d68f932-0db4-4a1f-8d70-0be95ba8cf97%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


--
Julian Gautier
Product Research Specialist, IQSS

Julian Gautier

unread,
Jan 30, 2019, 10:32:35 AM1/30/19
to Dataverse Users Community
Another caveat: those queries don't count roles that are inherited from parent dataverses. So for the dataset in the screenshot below, the query wouldn't return the "Contributor" role that @jggautier has, because it's inherited from the dataverse that it's in.


Screen Shot 2019-01-30 at 10.22.11 AM.png


Hmmm, so maybe there's a way to add certain roles that are assigned from the specified dataset's dataverse? I say "certain roles" because some roles that you can give people at the dataverse level, but not at the dataset level, like Dataverse/Dataset Creator, don't have the file download permission, so I think those roles would have to be excluded.
To unsubscribe from this group and stop receiving emails from it, send an email to dataverse-community+unsub...@googlegroups.com.
To post to this group, send email to dataverse-community@googlegroups.com.

MM - ADA

unread,
Jan 30, 2019, 7:33:06 PM1/30/19
to Dataverse Users Community
Thanks so much for taking the time to look into this, Julian.

For ADA's workflows, only data owners/archivists are given any kind of contributor/admin/etc-type role so those are of no interest since we know who those people are. 

So the users of interest to ADA are only those who have been granted permission to download after a 'Request Access' (99.99% of our dataset are 'restricted access'). So I believe that users are assigned a role_id of '2' per datafile in the roleassigment table when 'Grant' is issued and that's what I need to filter on.

Of course, querying roleassignment for role_id =2 only provides who has current filedownload access and not everyone who has had it in the past. (I suppose a 'hack' could be to query who has ever download any of the data files associated with the particular datasets.)


I did look into the crosstab() function and was able to install it (with the assistance of our devops) as it was not part of ADA's postgres version.

The way to install it separately is (using your version of postgres in the postgressqlxx-contrib filename - ADA's is 9.3.x so postgressql93-contrib):

sudo apt-get install postgresql93-contrib 
or 
sudo yum install postgresql93-contrib
or whatever package manager you use

Then install the module once per database to be able to access the crosstab() function. 
So within dvndb execute:

CREATE EXTENSION tablefunc;



Then, this is what I did to get what ADA needs (assuming only role_id=2 is required):

*****
select * from crosstab('select au.useridentifier||'' (''|| au.firstname ||'' ''||  au.lastname|| '' '' || au.email|| '')'' as username, dv2.id as dv2_id, count(*) from authenticateduser au 
inner join roleassignment ra on ''@''||au.useridentifier=ra.assigneeidentifier 
inner join dvobject dv1 on ra.definitionpoint_id = dv1.id 
inner join dvobject dv2 on dv1.owner_id = dv2.id 
where ra.role_id=2 and dv2.dtype=''Dataset'' and dv2.id in (44,63,459)
group by username, dv2_id order by username, dv2_id',
'select id from dvobject
where dtype=''Dataset'' and id in (44,63,459) group by id order by id')
AS (username text, DS44_fdp_count int, DS63_fdp_count int, DS459_fdp_count int);
*****

The second query returns the list of categories (not sure if there is a simpler way of specifying the categories of datasetid's) and the list order is expected to be the same as that specified in the first source query (or so I believe).

Since I know the dataset ids and the title, I am going to cheat by just naming them in the AS() clause rather than joining the datasetfieldvalue-related tables (I did look at that from the previous queries you helped me with).

Using the first source query will give all the counts so you can check what is to be expected in the crosstabs() output.

This query will count all filedownload permissions regardless of datasetversion, etc.

This doesn't take into account users in explicitgroups but ADA doesn't use them so I (thankfully) don't have to take the time to figure out yet another join.


The results of the crosstabs() call for one of our test dataverse installations is the following (with placeholder text):

Results

username                             | ds44_fdp_count | ds63_fdp_count | ds459_fdp_count 
-------------------------------------+----------------+----------------+-----------------
 u1_username (u1_name u1_email) |                |              3 |                
 u2_username (u2_name u2_email)      |                |                |               1
 u3_username (u3_name u3_email)      |                |              3 |                


The source query for the crosstab() returns:

username                             | dv2_id | count 
-------------------------------------+--------+-------
 u1_username (u1_name u1_email) |     63 |     3
 u2_username (u2_name u2_email)      |    459 |     1
 u2_username (u2_name u2_email)      |     63 |     3

so they both seem to be working properly but I will have to look at the data itself to ensure I am getting the correct answers.


I'll have to think more about the condition of role_id=2 being sufficient for the info we need.

Thanks again for assisting!

M.












To post to this group, send email to dataverse...@googlegroups.com.

Julian Gautier

unread,
Jan 31, 2019, 3:46:50 PM1/31/19
to Dataverse Users Community
Glad to help! Thanks for sharing more about what you're doing.

So it sounds like ADA's interested in knowing who's been granted access to files within certain datasets in the ADA Dataverse. You could look at the Restricted File Permissions page for each dataset, but your query lets you see who's been granted access across several datasets at once (in one view). Could you share why you need this info? There's a lot of work done for providing info about who's downloaded files, and not necessarily for who was only granted access to restricted files. But maybe we can find or make better solutions if we understand what the goals are.

About the queries, I agree that when people are granted access to restricted files, they are given role_id 2 (file downloader).

Lastly, you wrote:
The second query returns the list of categories (not sure if there is a simpler way of specifying the categories of datasetid's) and the list order is expected to be the same as that specified in the first source query (or so I believe).

I'm not sure what you mean by categories of datasetid's, but I should admit that I haven't been able to run or follow your queries that closely (I haven't had a good reason to ask (or beg/bribe) my developer colleagues to install the crosstab function, mostly because importing into spreadsheets works fine for me :) If we could help with that, we're all ears.

Thanks!
Julian

MM - ADA

unread,
Jan 31, 2019, 8:17:09 PM1/31/19
to Dataverse Users Community
Hi Julian.

In answer to your question about why we need this info, it's a reporting requirement in that some of ADA's data owners want to know who has access permissions to their datasets. They also want to know who downloaded files (and which ones) but also want to know who has permission to do so (even if they haven't downloaded anything yet). And they want this info in a report-style output rather than seeing it through the UI. I'm not sure what they do with that info, but they want it so ADA provides it (or aims to).

WRT my statement about the second query, it was poor wording by me. I was trying to say that the second query is supposed to return the values for categories to be used in the crosstab result and that there may be a simpler way to specify those categories other than with a query (since it's just going to return the Id's I've specified). Maybe that is no clearer than my previous description!

And, finally, I just thought I would outline how to install the extension if it was helpful for you and for anybody else that might read this post. crosstab() is a pretty handy function!

Thanks again.

M.

Gautier, Julian

unread,
Feb 1, 2019, 7:48:11 AM2/1/19
to dataverse...@googlegroups.com
Thanks for more of the background. Really interesting that some data owners need to know who has access. Hopefully producing these reports isn't too troublesome. Think this is worth a github issue? 

The actionlogrecord table might be helpful for seeing who was given access but doesn't have access now. The table records when someone was given a role, and I think each entry in the table remains even when the role is revoked (or the file/dataset is deleted). The table's "info" column should have something like:

[AuthenticatedUser identifier:@jg1234] has been given DataverseRole{id=2, alias=fileDownloader} on [123456 filename.txt]

So I just searched in actionlogrecord using

where info ilike '%has been given DataverseRole{id=2, alias=fileDownloader}%'

Joining other tables to the actionlogrecord table gets tricky because all the helpful info is concatenated in the info column.

Hope you have a great weekend!
Julian

--
You received this message because you are subscribed to the Google Groups "Dataverse Users Community" group.
To unsubscribe from this group and stop receiving emails from it, send an email to dataverse-commu...@googlegroups.com.

To post to this group, send email to dataverse...@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

Marina McGale

unread,
Feb 3, 2019, 6:56:36 PM2/3/19
to dataverse...@googlegroups.com

Hi Julian - ADA has government departments as data owners and many of ADA's requirements (including the request access details at request access instead of download, and reporting issues such as this one wrt permissions) are a result of their requirements although we were surprised more people wouldn't be interested in who has access permission to their data?


I am not sure if it needs to be a github issue but maybe adding a query to the record of queries in the google community is enough? I rewrote the query to simplify it, btw, as I just sort of hacked at the previous one by adding joins until I got the result I wanted. I'll post the simplified version when I have a moment.


I did think of looking at the actionlogrecord table over the weekend (I keep forgetting that table exists) and maybe that is a simpler solution although string matching on tens of (hundreds of?) thousands of rows may be an issue (or maybe not) if the authenticateduser details are required. Do you think that would be a big hit on database performance?


Thanks!


M.








From: dataverse...@googlegroups.com <dataverse...@googlegroups.com> on behalf of Gautier, Julian <julian...@g.harvard.edu>
Sent: Friday, 1 February 2019 11:48:09 PM
To: dataverse...@googlegroups.com
Subject: Re: [Dataverse-Users] SQL query to determine users that have filedownload access to specific datasets (identified by dataset id)?
 
You received this message because you are subscribed to a topic in the Google Groups "Dataverse Users Community" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/dataverse-community/tyr6INC4kk0/unsubscribe.
To unsubscribe from this group and all its topics, send an email to dataverse-commu...@googlegroups.com.

To post to this group, send email to dataverse...@googlegroups.com.

Julian Gautier

unread,
Feb 4, 2019, 11:16:28 AM2/4/19
to Dataverse Users Community
I did think of looking at the actionlogrecord table over the weekend (I keep forgetting that table exists) and maybe that is a simpler solution although string matching on tens of (hundreds of?) thousands of rows may be an issue (or maybe not) if the authenticateduser details are required. Do you think that would be a big hit on database performance?

That using the actionlogrecord would be more resource intensive makes sense to me. But I never had to consider database performance tbh! I wouldn't know how to estimate that. Filtering on whatever subactiontype is used first, so there are fewer rows to do string matching on, might help.

Sharing your simplified query would be awesome (when you get a chance). Thanks!

Julian

MM - ADA

unread,
Feb 5, 2019, 7:25:51 PM2/5/19
to Dataverse Users Community
Hi Julian - These are simplified versions of the crosstab query using 'where exists' instead of multiple joins:


#full count of filedownload permission
select * from crosstab('select au.useridentifier||'' (''|| au.firstname ||'' ''||  au.lastname|| '' '' || au.email|| '')'' as username, dvo.owner_id as dvo_dataset_id, count(*) from authenticateduser au, dvobject dvo
where exists (select * from roleassignment ra
        where ra.assigneeidentifier = ''@''||au.useridentifier
            and ra.role_id = 2 and ra.definitionpoint_id = dvo.id and dvo.owner_id in (8,23,98))
group by username, dvo_dataset_id',
'select id from dvobject
where dtype=''Dataset'' and id in (8,23,98) group by id order by id')
AS (username text, dataset_8_filedownload_count int, dataset_23_filedownload_count int, dataset_98_filedownload_count int);


#perm = 1 for count(*) > 1
select * from crosstab('select au.useridentifier||'' (''|| au.firstname ||'' ''||  au.lastname|| '' '' || au.email|| '')'' as username, dvo.owner_id as dvo_dataset_id, (case when count(*) > 0 then 1 end) from authenticateduser au, dvobject dvo
where exists (select * from roleassignment ra
        where ra.assigneeidentifier = ''@''||au.useridentifier
            and ra.role_id = 2 and ra.definitionpoint_id = dvo.id and dvo.owner_id in (8,23,98))
group by username, dvo_dataset_id',
'select id from dvobject
where dtype=''Dataset'' and id in (8,23,98) group by id order by id')
AS (username text, dataset_8_filedownload_perm int, dataset_370_filedownload_perm int, dataset_372_filedownload_perm int);


#count = 'x' for count(*) > 1
select * from crosstab('select au.useridentifier||'' (''|| au.firstname ||'' ''||  au.lastname|| '' '' || au.email|| '')'' as username, dvo.owner_id as dvo_dataset_id, (case when count(*) > 0 then ''x'' end) from authenticateduser au, dvobject dvo
where exists (select * from roleassignment ra
        where ra.assigneeidentifier = ''@''||au.useridentifier
            and ra.role_id = 2 and ra.definitionpoint_id = dvo.id and dvo.owner_id in (8,23,98))
group by username, dvo_dataset_id',
'select id from dvobject
where dtype=''Dataset'' and id in (8,23,98) group by id order by id')
AS (username text, dataset_8_filedownload_perm text, dataset_23_filedownload_count perm, dataset_98_filedownload_count perm);



I also checked the possibility of using the actionlogrecord to get information from the 'info' field and unfortunately v4.6.1 that we have to stay with for a while does not include the describe() method in the AbstractCommand/AssignRoleCommand class so that information is not being recorded in our version. So that's a no go solution for ADA but thank you for suggesting it.

Thanks!
Marina

Julian Gautier

unread,
Feb 6, 2019, 2:24:01 PM2/6/19
to dataverse...@googlegroups.com
Thanks Marina! I added those three queries to the list we've been collecting. I'd like to make the headings a little more user friendly (#perm = 1 for count(*) > 1) but can't run these to see what they're doing (I won't be getting crosstab). Would you have a minute to prosify the headings a little? I can give you edit access if you have a minute.

MM - ADA

unread,
Feb 6, 2019, 7:04:03 PM2/6/19
to Dataverse Users Community


On Thursday, February 7, 2019 at 6:24:01 AM UTC+11, Julian Gautier wrote:
Thanks Marina! I added those three queries to the list we've been collecting. I'd like to make the headings a little more user friendly (#perm = 1 for count(*) > 1) but can't run these to see what they're doing (I won't be getting crosstab). Would you have a minute to prosify the headings a little? I can give you edit access if you have a minute.


I changed the headings a bit and added some text to (hopefully) make it clear what they do. I have attached a file that shows query and results and that hopefully keeps the formatting of the results.

 
Dataverse postgres crosstab queries filedownload permissions per dataset.txt

Julian Gautier

unread,
Feb 6, 2019, 7:43:54 PM2/6/19
to Dataverse Users Community
With examples! This is great. Thank you!

MM - ADA

unread,
Feb 6, 2019, 7:54:23 PM2/6/19
to Dataverse Users Community
And of course I just realised that I uploaded the file with the inner joins queries instead of the simpler "where exists" queries.

Please feel free to remove that previous file. I am attaching the simplified queries now. 

It pays to pay attention!!!
Simplified Dataverse postgres crosstab queries filedownload permissions per dataset.txt

Julian Gautier

unread,
Feb 6, 2019, 8:41:52 PM2/6/19
to Dataverse Users Community
Updated!
Reply all
Reply to author
Forward
0 new messages