Pulling shared file permissions for 1.5 million files


Christopher Laird

Sep 14, 2022, 3:36:17 PM
to GAM for Google Workspace
I need to build a table which basically has file owner, file id, file name, shared user email, shared user access.

Owner- chris@test
fileid - 123
file name - example
shared user email - paul@test
shared user access - writer

Like that. But I need to generate this for each file for about 9,000 users, totaling around 1,500,000 files.

In PowerShell, I can generate a list of all the files, but actually getting the ACLs (gam user $owner print drivefileacls $fileid) is projected to take about 100 hours at the current rate. If I run the job with ForEach-Object -Parallel, it does 1,000 files in about 4 minutes.

Does anyone know a better way to do this? I could run a "gam user $user print filelist" for each of the 9000 users, which seems to pull all the data, but print filelist doesn't support "oneitemperrow" so the output files aren't in a format I can use


Christopher Laird

Sep 14, 2022, 3:40:14 PM
to GAM for Google Workspace
Sorry, it's really a database table, so each row would contain that data. So 1 file shared with 5 people would be 5 rows in the table. At 1,500,000 files, it will be a few million rows total.

Ross Scroggs

Sep 14, 2022, 3:51:47 PM
to google-ap...@googlegroups.com
Christopher,

Send me a Meet/Zoom invitation and we'll discuss your options.

Ross


Christopher Laird

Sep 14, 2022, 5:38:14 PM
to GAM for Google Workspace
I will email you next week. Today was my last day of work before a few days off. As a high-level overview of what I'm doing (with a ton of help from the folks at Lehigh University, who already have this working), I am using GAMadv to pull a list of every disabled user in the domain, then loop through every user to find every file they have shared, then loop through every shared file to find who it is shared with (within our domain), then present an option to each of those internal users to see if they'd like to take ownership of the file before the account (and thus the file) are deleted.

I have it all working, in concept, but our initial batch of 9,000 accounts is creating a unique issue because of how large it is. That got me thinking about how to (1) solve this in the short term and (2) find out whether a more efficient route is possible, since "print drivefileacls" and "print filelist fields id,title,permissions,owners.emailaddress" are so very similar to each other, minus the formatting.

David Walton

Sep 14, 2022, 6:39:35 PM
to google-ap...@googlegroups.com
Hi Chris,

I don't have a solution for you, but I'm interested in this, both in terms of how you technically accomplish this with GAM and how you plan to pull it off from an operational perspective. Will you be automatically transferring file ownership to one of the shared users? Will you be contacting users to ask them if they want ownership of the shared file? If so, how would you do that at scale?

We're likely in the same boat as you: finally deleting accounts for users who have been disabled for years, due to storage limitations. But we assumed that file loss would be inevitable.



--

David Walton

Information Security Analyst

Temple Rodgers

Sep 15, 2022, 3:43:05 AM
to GAM for Google Workspace
I did this some time ago, looking for files shared outside our organisation - I used a "method" based on slightly modified versions of the GAM Python scripts and saved data in Sheets ... I'll have to spend a little time to find the documents that I created and I'll be able to share them with you. I created one summary sheet per user, so that we could make users responsible for checking their files but this could easily be a single Sheet or csv.
Temple

Temple Rodgers

Sep 15, 2022, 4:22:34 AM
to GAM for Google Workspace
the Python scripts are here, of course.
I modified GetSharedWithAnyoneDriveACLS.py and the TeamDrive version too in order to retrieve the columns that I required.
The whole multi-step process is outlined at the beginning of the script, however I modified it slightly to save data to Sheets as I mentioned in my previous post.
You can choose which permission(s) you wish to select/ignore
I'll post a little more later.

Peter Smulders

Sep 15, 2022, 5:22:05 PM
to GAM for Google Workspace
Maybe this helps, if not for you then for others: I do something similar when cleaning out accounts. I use shell scripts, feeding a list of users through a loop, with the inner part of the loop being:

gam user "${LEAVER}" print filelist corpora user anyowner fields id,name,owners.emailAddress,mimetype,shortcutdetails.targetmimetype,shortcutdetails.targetid,permissions.id,permissions.emailaddress,permissions.role showpermissionslast

This outputs a large list of files that particular user either owns or has access to, with all permissions in CSV format on one line per file.

Next, I use a few routines to parse those lines, so that per file, I can account for shortcuts and deal with each permission separately like this:

[gam command as per above] | while read LINE; do
  # grab and store header line
  [[ -z ${ORIG_HEADERS} ]] && ORIG_HEADERS="$(echo $LINE | tr '.' '_')" && continue ## The tr is required because bash variable names cannot contain .s
  # parse next line according to headers
  DATA="${LINE}"
  HEADERS="${ORIG_HEADERS}"
  eval $(
    until [[ -z $HEADERS ]]; do
      echo "$(Chomp piece "$HEADERS")=\"$(Chomp piece "$DATA")\""
      HEADERS="$(Chomp remainder "$HEADERS")"
      DATA="$(Chomp remainder "$DATA")"
    done
  )
done
  
The loop inside the eval subprocess outputs a set of variable assignments, effectively turning every CSV header into a variable whose value is the matching field on the data line for this particular file. From then on, you can do all the logic you want using those variables. Very usefully, if there is more than one permission, gam outputs that number in a separate column, which is thus stored as the variable 'permissions'.

The Chomp function is this:

Chomp () {
    ACTION="${1}"; shift
    STRING="${1}"
    case "${ACTION}" in
        piece) echo "${STRING%%,*}" ;;
        remainder)
            if [[ ${STRING} =~ ',' ]]; then
                echo "${STRING#*,}"
            else
                echo ""
            fi
            ;;
        *) echo "" ;;
    esac
}
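
For anyone who wants to try the header-to-variable trick in isolation, here is a minimal sketch that runs against two hard-coded lines instead of gam output. The sample header and data row are made up for illustration. It assumes no field contains a comma, a double quote, or shell metacharacters such as $ or backticks, because eval will execute whatever ends up in the data.

```shell
#!/bin/bash
# Sketch only: sample data stands in for real gam output.

Chomp () {
    local action="${1}" string="${2}"
    case "${action}" in
        piece) echo "${string%%,*}" ;;            # text before the first comma
        remainder)
            if [[ ${string} == *,* ]]; then
                echo "${string#*,}"               # text after the first comma
            else
                echo ""
            fi
            ;;
        *) echo "" ;;
    esac
}

# Hypothetical header and data line, shaped like the CSV described above.
HEADERS="id,name,owners.0.emailAddress,permissions"
DATA="123,example,chris@test,2"

HEADERS="${HEADERS//./_}"   # variable names cannot contain dots

# Emit one assignment per column, then eval them all at once.
eval "$(
    until [[ -z ${HEADERS} ]]; do
        echo "$(Chomp piece "${HEADERS}")=\"$(Chomp piece "${DATA}")\""
        HEADERS="$(Chomp remainder "${HEADERS}")"
        DATA="$(Chomp remainder "${DATA}")"
    done
)"

echo "${id} ${name} ${owners_0_emailAddress} ${permissions}"
```

The changes to HEADERS and DATA inside the command substitution happen in a subshell, so the originals are untouched when the next data line comes around.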


Further in the loop, this is how the permissions are processed:

if [[ -n "${permissions}" ]]; then
  CTR=$((permissions - 1)) ## need to offset because the permissions are labeled from 0 up instead of from 1 up.
  until [[ ${CTR} -eq -1 ]]; do
    P_USER="permissions_${CTR}_emailAddress" ## name of the variable holding the user that has the permission
    P_ID="permissions_${CTR}_id"             ## name of the variable holding the permission id (specific access right)
    #######
    ####### Your logic goes here about what to do with what kind of permission/user combo
    ####### For instance skip/ignore ('continue') or save the data in some form by echo'ing it.
    #######
    CTR=$((CTR - 1))
  done
fi
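
To tie this back to the table asked for at the top of the thread (one row per file and shared user), here is a rough sketch of what the "your logic goes here" part could look like. The variable values are invented sample data that imitate what the parsing loop would have set for a single file. The permissions_N_role names assume that gam labels the role column the same way as the emailAddress and id columns.

```shell
#!/bin/bash
# Sketch only: these assignments imitate what the parsing loop would have set
# for one file; the values are invented sample data.
id="123"
name="example"
owners_0_emailAddress="chris@test"
permissions=2
permissions_0_emailAddress="chris@test"
permissions_0_role="owner"
permissions_1_emailAddress="paul@test"
permissions_1_role="writer"

ROWS=()
CTR=0
while [[ ${CTR} -lt ${permissions} ]]; do
    P_USER="permissions_${CTR}_emailAddress"   # holds a variable NAME, not a value
    P_ROLE="permissions_${CTR}_role"
    # ${!P_USER} is indirect expansion: the value of the variable named in P_USER.
    if [[ "${!P_ROLE}" != "owner" ]]; then
        # owner, file id, file name, shared user email, shared user access
        ROWS+=("${owners_0_emailAddress},${id},${name},${!P_USER},${!P_ROLE}")
    fi
    CTR=$((CTR + 1))
done

printf '%s\n' "${ROWS[@]}"
```

Skipping the owner's own permission is a design choice for this example; drop that test if the owner row should be part of the table.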

The performance profile of this method is one fairly slow gam call per user. I tend to parallelise these jobs BUT you need to build in some safety measures to avoid overloading your machine. (I have an elaborate set of such measures but that is too detailed for this answer).
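
One simple safety measure, sketched here under assumptions and not taken from the scripts discussed in this thread, is to cap the number of background jobs. MAX_JOBS, process_user and the user names are placeholders. It needs bash 4.3 or newer for 'wait -n'.

```shell
#!/bin/bash
# Sketch only: cap the number of concurrent background jobs.
# Requires bash 4.3 or newer for 'wait -n'.

MAX_JOBS=4                      # assumed limit; tune for your machine
OUTDIR="$(mktemp -d)"

process_user () {
    # Placeholder for the real per-user work (e.g. one gam call).
    sleep 0.1
    echo "done" > "${OUTDIR}/${1}.out"
}

for LEAVER in user1 user2 user3 user4 user5 user6 user7 user8 user9 user10; do
    # If the limit is reached, block until any one job finishes.
    while (( $(jobs -rp | wc -l) >= MAX_JOBS )); do
        wait -n
    done
    process_user "${LEAVER}" &
done
wait    # let the stragglers finish before moving on

# Count the result files to confirm every job ran.
FILES=("${OUTDIR}"/*.out)
FINISHED=${#FILES[@]}
echo "Finished ${FINISHED} jobs"
```

A fixed cap does not react to machine load the way a load-aware sleep does, but it is easy to reason about and keeps the process count bounded.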

Maybe it is doable to have gam itself do the parallelisation by batching the very first call for file details. You need some way to separate out the output per file, so that you get clean sets of CSV headers + data for all files for a user. gam provides this but I admit that I have personally never gotten my head around those command options. gam could run 20 or 25 sessions in parallel, saving its data to clean and neat sets of headers+data. Then it is merely a matter of having a script tearing through that output to see what needs to be done with what.

Final note: with the exception of gam itself and a single tr command, all of this logic is bash internals. That means very little process creation and piping data. In Bash scripts, that easily causes 100-fold speed increases.

hth -- ymmv - Peter

Maj Marshall Giguere

Sep 16, 2022, 2:55:07 PM
to google-ap...@googlegroups.com
Peter;

This is a bash usage that I would nominate for the "Best of Bash Bashing" prize in a heartbeat.  I love to see what I can wring out of bash without resorting to other scripting languages or external tools.  Great piece of scripting.  A couple of observations.  First, you can do away with "tr" altogether using bash's inline string substitution:

ORIG_HEADERS="$(echo ${LINE//./_})"

Second, a minor thought.  It would be nice to show the usage in situ for completeness of the explanation.

Again, a very nice piece of scripting.
👍


Maj Marshall E Giguere

NH Wing Director of IT

Civil Air Patrol, U.S. Air Force Auxiliary

GoCivilAirPatrol.com

nhwg.cap.gov

Volunteers serving America's communities, saving lives, and shaping futures.



Maj Marshall Giguere

Sep 16, 2022, 4:24:11 PM
to google-ap...@googlegroups.com
My bad, the sin of cut and paste strikes again, should have been without the subshell:

ORIG_HEADERS="${LINE//./_}"
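
A quick way to convince yourself that the two forms give the same result, using a made-up header line:

```shell
#!/bin/bash
# Sketch only: a made-up header line in the style of gam's CSV output.
LINE="id,name,owners.0.emailAddress,permissions.0.role"

WITH_TR="$(echo "${LINE}" | tr '.' '_')"   # forks a subshell plus tr
WITH_BASH="${LINE//./_}"                   # pure parameter expansion, no fork

echo "${WITH_BASH}"
[[ "${WITH_TR}" == "${WITH_BASH}" ]] && echo "identical"
```

Note that the pattern in ${LINE//./_} is a glob pattern, not a regex, so the dot matches only a literal dot.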


Maj Marshall E Giguere



Christopher Laird

Sep 21, 2022, 1:05:09 PM
to GAM for Google Workspace
Peter, as Marsh said, it looks like you've really got a fascinating and really impressive solution built for what you're doing. To echo his request, are you able to show your complete script? I'd like to tinker with what you have working and try to understand the specifics better.

Peter Smulders

Sep 29, 2022, 5:20:06 PM
to GAM for Google Workspace
Here is the script. I am Dutch, so my references sometimes are mixed language. I cleaned up a few particular references to my company domain. Note that I reference a number of my own functions and scripts. I do not include those, but have added comments to indicate their purpose. Comments added for this post are marked with ###.

#!/bin/bash

### The purpose of this script: when we have staff leaving, I change the credentials for that user and remove them from all groups. That will revoke 98% of their access to company data, since we try to share (only) with groups as much as we can. This script looks for the remainder of files lurking around in accounts, changes their ownership to an account that functions as a repository (this setup dates from a decade before Shared Drives) and strips out the access for this particular user.
### I dereference shortcuts and when I encounter folders, they are passed off to another script that parses trees to change ownership with a large degree of parallelisation.


. $HOME/bin/UTILS ### My personal bag of tricks. :)

#set -vx ### toggle verbose debug mode on or off. 
### syntax for really detailed debugging:
### $ MyScript 2>&1 | less

HOWTOUSE="Gebruik: $0 <loser>" ### String for Usage function.

[[ -z ${1} ]] && Usage "Geen gebruiker ingevuld." && exit 0
LOSER="${1}" ### User that loses access, hence LOSER
[[ "${LOSER}" =~ '@' ]] || LOSER="${LOSER}@YOURDOMAIN.COM" ### adding domain in case I am too lazy when invoking the script.

[[ -d ${HOME}/STOP ]] && Usage "Whoa there, killjoy!" ### Poor man's KILL signal handler. mkdir $HOME/STOP in another shell --> stops the script.

gam user "${LOSER}" print filelist corpora user anyowner fields id,name,owners.emailAddress,mimetype,shortcutdetails.targetmimetype,shortcutdetails.targetid,permissions.id,permissions.emailaddress,permissions.role showpermissionslast | \
            (
                [[ -d ${HOME}/STOP ]] && Usage "Whoa there killjoy!"

                while read LINE; do
                    # grab and store header line
                    [[ -z ${ORIG_HEADERS} ]] && ORIG_HEADERS="$(echo $LINE | tr '.' '_')" && continue
                    # parse next line according to headers
                    DATA="${LINE}"
                    HEADERS="${ORIG_HEADERS}"
                    eval $(
                        until [[ -z $HEADERS ]]; do
                            echo "$(Chomp piece "$HEADERS")=\"$(Chomp piece "$DATA")\""
                            HEADERS="$(Chomp remainder "$HEADERS")"
                            DATA="$(Chomp remainder "$DATA")"
                        done
                    )
                    ### The following assignments are there because a previous version of the script had a different way of parsing CSV output. Easier to do this than to rewrite all the rest of the code to use the new variables names.
                    ### There is a case to be made to do this, if only to adhere to the convention of UPPERCASE VARIABLE NAMES, if you are thus inclined.
                    NAME="${name}"
                    OWNER="${owners_0_emailAddress}"
                    ID="${id}"
                    MIMETYPE="${mimeType}"
                    S_MIMETYPE="${shortcutDetails_targetMimeType}"
                    S_ID="${shortcutDetails_targetId}"
                    echo "Dealing with [${NAME}]..." 1>&2
                    echo "Whole line: [${LINE}]"

                    # Ignore external domains, but not empty ones
                    ### Maybe there is a regex that matches "not empty AND NOT my domain" or "empty OR my domain" but this works with only two built-in tests, so will be hard to beat in performance.
                    if [[ -n "${OWNER}" ]]; then
                        [[ "${OWNER}" =~ "@YOURDOMAIN.COM" ]] || continue
                    fi
                    (
                        # in any case
                        if [[ -n "${OWNER}" ]]; then
                            [[ "${OWNER}" == repo...@YOURDOMAIN.COM ]] || GiveToRepo "${ID}" "${LOSER}" ### GiveToRepo is a script that tries a few ways to change the ownership of a particular asset.
                        fi

                        if [[ "${MIMETYPE}" =~ "shortcut" ]]; then
                            MIMETYPE="${S_MIMETYPE}"
                            ID="${S_ID}"
                            echo "After dereffing"
                            echo "Owner: ${OWNER}"
                            echo "ID: ${ID}"
                            # Indien nodig owner target omzetten
                            OWNER="$(FindFileOwner "${ID}")" ### FindFileOwner wraps several gam methods to determine the owner of a file.
                            [[ "${OWNER}" == repo...@YOURDOMAIN.COM ]] || GiveToRepo "${ID}" "${LOSER}"
                        fi


                        if [[ -z "${permissions}" ]]; then
                            StripACLs "${ID}" repo_user ${LOSER} ### This is in the script, but looking at it, I think this is old code and will never actually run anymore.
                        else    
                            CTR=$((permissions - 1))

                            until [[ ${CTR} -eq -1 ]]; do
                                P_USER="permissions_${CTR}_emailAddress"
                                P_ID="permissions_${CTR}_id"
                                if [[ "${!P_USER}" =~ "${LOSER}" ]]; then ### If you look carefully, you will see that P_USER is not set to a value, but to a variable name. The syntax ${!P_USER} references the value contained in that variable.
                                    gam user kdv del drivefileacl ${ID} id:${!P_ID} ### Note the ${!P_ID}, as above.
                                fi
                                CTR=$((CTR -1))
                            done
                        fi

                        if [[ "${MIMETYPE}" =~ "folder" ]]; then
                            DynamicWait 5 # om eventuele gestripte rechten de kans te geven door te sijpelen. ### DynamicWait is a function that adds a number to its argument based on current CPU load and then sleeps that amount of seconds. This avoids my shell boxes dying because I loop through 364 items and kick off background processes for each one. DynamicWait will insert longer and longer sleep times when the load increases, allowing it to 'cool down' again.
                            ### In this case, it might be that I stripped a permission from a top folder a millisecond ago. That might affect perms down the tree, so I give Google Drive 5 seconds to let that seep through before I chase after it to manually force an ownership change on each file in the tree.
                            FolderToKdv "${ID}" "${LOSER}"

                            DynamicWait
                        fi

                        echo "Done with [${NAME}]..." 1>&2
                    ) & ### This sneaky bit is why I need DynamicWait... :)
                    DynamicWait 0.1 ### The GNU sleep on Google Cloud Shell supports sub-second sleep times.
                done
                wait ### Very important: always when backgrounding stuff in a loop, wait for all of them to finish, or your script will return to a shell prompt and the output will still be coming in until the last background task is done. With this 'wait' the script will exit only when all is actually done.
            )
echo "Done."
exit
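
Since DynamicWait is not included in the post, here is a hypothetical stand-in for readers who want to run something similar. It is not the author's function. It assumes Linux, where /proc/loadavg starts with the 1-minute load average, and it simply adds that load to the base delay. ComputeDelay is a helper name invented for this sketch.

```shell
#!/bin/bash
# Hypothetical stand-in for DynamicWait; NOT the author's actual function.
# Assumes Linux, where /proc/loadavg starts with the 1-minute load average.

# Pure helper: base seconds plus the load average, computed with awk because
# bash arithmetic is integer-only. LC_ALL=C keeps the decimal point a dot.
ComputeDelay () {
    local base="${1:-0}" load="${2:-0}"
    LC_ALL=C awk -v b="${base}" -v l="${load}" 'BEGIN { printf "%.1f\n", b + l }'
}

DynamicWait () {
    local base="${1:-0}" load=0
    # Fall back to a load of 0 if /proc/loadavg is unavailable.
    [[ -r /proc/loadavg ]] && read -r load _ < /proc/loadavg
    sleep "$(ComputeDelay "${base}" "${load}")"
}

# Show the delay that would be used for a given base and load.
ComputeDelay 5 0.5
ComputeDelay 0.1 3.0
```

Keeping the arithmetic in its own function makes it easy to try other formulas, such as scaling the load by the number of CPU cores, without touching the sleep logic.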

Enjoy!

--peter

Eric Dannewitz

Sep 29, 2022, 5:50:14 PM
to google-ap...@googlegroups.com
Seems like you could write something in Python then load it into Pandas, and do a groupby to find just the instances you want. Gam is super fast if you call it as a library from Python.


Eric Dannewitz

Technology Assistant/Tech Acalanes Union High School District

District Office Technology Department

Google Cloud Certified Administrator & G Suite

Google Certified Educator Level 1 & 2

p: 925.280.3980 ext 4309

e: edann...@auhsdschools.org

 




Peter Smulders

Feb 7, 2023, 1:52:06 PM
to GAM for Google Workspace
I reckon all of my trickery can be done 100 times faster using just Python and piggybacking on GAMx as a library, but as it happens, I speak fluent Bash and I reckon I can't get to "Hello World" in Python. My point is that there are many other tools out there (Go comes to mind, as the language most related to Google's internals) and I am merely providing a resource for those stuck with the rather more primitive Bash and the like.

--peter

Peter Smulders

Feb 7, 2023, 2:22:56 PM
to GAM for Google Workspace
Hi Marsh,

I thought you might get a kick out of an improvement to this CSV output parsing routine I found. I haven't done actual measuring, but it feels an order of magnitude faster (which logically follows from the fact that it is a LOT fewer statements in the executed code).

To recap, the idea is to pipe CSV delimited output from a GAM command into a while loop, reading it LINE by LINE, starting with the HEADER:

gam [whatever whatever with CSV output] | while read LINE; do
  {do stuff}
done

with the code to save the HEADER line added, including your suggestion to replace the periods with underscores using Bash's own substitution:

gam [whatever whatever with CSV output] | while read LINE; do
  # grab and store header line
  [[ -z ${HEADER} ]] && HEADER="${LINE//./_}" && continue
  {do more stuff}
done

Now the goal here is to take each column name in the HEADER and use that as a variable name that gets assigned a value. I had this loop set up that would parse both HEADER and data LINE to build up a giant set of assignments that could then be eval'd. Even though it is all Bash internal memory wrangling, on data sets where the output would be a few dozen columns it would take a noticeable amount of time for every single line. Not this way, though:

gam [whatever whatever with CSV output] | while read LINE; do
  # grab and store header line
  [[ -z ${HEADER} ]] && HEADER="${LINE//./_}" && HEADER="${HEADER//,/ }" && continue
  # Parse and multi-assign the entire HEADER and data LINE in one fell swoop!
  IFS="," read -r ${HEADER} < <(echo $LINE) ## HEADER has already been modified to be space'd words; the temporary IFS parses the data line. Don't even need no quoting!
  {do more stuff}
done

Explanation of that magic line: in Bash, you can assign values to a group of variables all at once by using read. The -r guards against backslash escaping and might not even be necessary in this context. read expects its data on STDIN. We provide that with the redirect (<) and a process substitution, <(...), that just echoes the $LINE. The data $LINE has comma-separated values that get split into words by setting the value of IFS to exactly that: a comma.
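
Here is a self-contained version that can be run without gam, as a minimal sketch. The three sample lines are made up, and a here-string (<<<) replaces the echo so that spaces inside a field survive. Like the original, it assumes no field contains a comma; a quoted CSV field with an embedded comma would shift every column after it.

```shell
#!/bin/bash
# Sketch only: hard-coded lines stand in for gam's CSV output.
SAMPLE='id,name,owners.0.emailAddress,permissions
123,example file,chris@test,2
456,other,paul@test,1'

HEADER=""
SEEN=()
while read -r LINE; do
    # First line: turn "a.b,c" into "a_b c" so it can serve as a list of names.
    if [[ -z ${HEADER} ]]; then
        HEADER="${LINE//./_}"
        HEADER="${HEADER//,/ }"
        continue
    fi
    # ${HEADER} is deliberately unquoted so it splits into one name per column;
    # IFS="," applies only to this read and splits the data line on commas.
    IFS="," read -r ${HEADER} <<< "${LINE}"
    SEEN+=("${id}|${name}|${owners_0_emailAddress}|${permissions}")
done <<< "${SAMPLE}"

printf '%s\n' "${SEEN[@]}"
```

Feeding the loop with a here-string instead of a pipe also keeps it in the current shell, so variables set inside the loop are still available after it ends.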

Enjoy!

-peter

Maj Marshall Giguere

Feb 7, 2023, 6:35:17 PM
to google-ap...@googlegroups.com
Peter;

Ah, yes, bash "read", one of my go to tools, sometimes preferred over awk for some quick and dirty command line parsing.  I still like awk if I need to knock off something that has an FSM flavor to it.  Although I could probably do that with bash as well, but I've not tried it.

Over the last few years bash has just become more and more capable, although it is a bit obscure for some operations like arrays.  The ability to bind regexes has been very helpful.  It comes close to regex binding in perl, but not quite.  One of my complaints about python is that regexes are not baked into the syntax.

Thanks for the heads up.  Keep "bashing" away. :D

V/r

Maj Marshall E Giguere


