When sending an AIP to S3, is a copy of the AIP meant to stay in the server?


Jarrod Harvey

Jan 6, 2019, 8:49:06 PM
to archivematica
Hi all,

I am trying to learn the (relatively) new S3 integration functionality. I'm running the latest version of Archivematica on CentOS 7. I have a question relating to where AIPs go.

I currently have two AIPs in my environment, "Historical_Non_S3_Test0" (sent to local filesystem) and "S3_attempt2" (sent to S3). Both were successful, and S3_attempt2 is now in S3.

However, here are the results of a local filesystem search.

[jarrodharvey@archivematica ~]$ sudo find / -iname "*Historical_Non_S3_Test0*"
/var/archivematica/sharedDirectory/watchedDirectories/SIPCreation/completedTransfers/Historical_Non_S3_Test0-12ecad40-bb79-458e-a471-4e901d86e593
/var/archivematica/sharedDirectory/www/AIPsStore/2322/02dd/bde2/4c94/8ccb/7a16/a9f8/8ee7/Historical_Non_S3_Test0-232202dd-bde2-4c94-8ccb-7a16a9f88ee7.7z
[jarrodharvey@archivematica ~]$ sudo find / -iname "*S3_attempt2*"
/var/archivematica/storage_service/tmp6eyRzc/215e/160d/3951/4433/8f14/44eb/0f03/d0e3/ASA2019_S3_attempt2-215e160d-3951-4433-8f14-44eb0f03d0e3.7z
/var/archivematica/sharedDirectory/watchedDirectories/SIPCreation/completedTransfers/ASA2019_S3_attempt2-16c8d2b4-b9ab-4dab-ad89-6bb131405456

"Historical_Non_S3_Test0" is, as expected, saved exactly where I specified in the local filesystem.

However, the "S3_attempt2" AIP is saved in the local filesystem at /var/archivematica/storage_service/tmp6eyRzc as well as S3.

I'm guessing "tmp" in the file path means that it will only be in place temporarily. Or am I wrong - will the AIP remain in the local filesystem permanently?

- Jarrod


Jarrod Harvey

Jan 7, 2019, 11:59:36 PM
to archivematica
I just ran a fixity check. It looks like the AIP was downloaded from S3 to perform the check, which makes sense.

[jarrodharvey@archivematica ~]$ sudo find / -iname "*S3_attempt2*"
/var/archivematica/storage_service/tmp6eyRzc/215e/160d/3951/4433/8f14/44eb/0f03/d0e3/ASA2019_S3_attempt2-215e160d-3951-4433-8f14-44eb0f03d0e3.7z
/var/archivematica/storage_service/tmpdMMUvo/215e/160d/3951/4433/8f14/44eb/0f03/d0e3/ASA2019_S3_attempt2-215e160d-3951-4433-8f14-44eb0f03d0e3.7z
/var/archivematica/sharedDirectory/watchedDirectories/SIPCreation/completedTransfers/ASA2019_S3_attempt2-16c8d2b4-b9ab-4dab-ad89-6bb131405456
[jarrodharvey@archivematica ~]$ sudo find / -iname "*S3_attempt2*" -cmin -30
/var/archivematica/storage_service/tmpdMMUvo/215e/160d/3951/4433/8f14/44eb/0f03/d0e3/ASA2019_S3_attempt2-215e160d-3951-4433-8f14-44eb0f03d0e3.7z

Is it safe to delete any of these "tmp" folders? If not, when do they get cleaned up?

- Jarrod

aipodd...@gmail.com

Jan 11, 2019, 5:44:25 AM
to archivematica
Hi Jarrod,

What is this "(relatively) new S3 integration functionality" you speak of?

Regards

Dave Bonde

Jarrod Harvey

Jan 11, 2019, 5:55:09 AM
to archivematica
Hi Dave,

Amazon S3 has been available as an access protocol as of Storage Service 0.12.

https://www.archivematica.org/en/docs/storage-service-0.13/administrators/

Kind regards,
Jarrod

rspe...@artefactual.com

Jan 14, 2019, 9:07:20 AM
to archivematica
Hi Jarrod,

Thank you for opening up the discussion here. 

I have been following your work in the Australasia Preserves forum and see
you've come to the correct conclusion that the tmp files should be safe
to delete.


The library used to create the temporary directories is from the Python standard
library.

The function declaration in the storage_service describes a delete_after flag
which is set to True: 


But this won't work for S3 (and probably other storage protocols in the storage
service). It seems that S3 slips through the gaps (if I read the code
correctly):

1. Items in S3 storage need to be downloaded locally, creating a first temporary
   directory.
   
   
2. If the AIP is uncompressed, the temporary location isn't deleted because 
   this first temporary directory isn't set within the same scope as the fixity
   check, and so the delete flag is never triggered:
   
   
   There is quite a good safety mechanism here to ensure that local AIPs are not 
   deleted, but it could be more clever.
   
3. If the AIP is compressed, the first temporary directory is created as in 2.
   This is also never deleted because it is out of the scope of the calling 
   function.
   
   A second temporary directory is created because the LoC bagit tool that we 
   use won't work on compressed files (at least, that's what the code says):
   
   
   And this second temporary directory is the one that is deleted at the end of 
   the function:
   

In short, it's always the first temporary directory created when an AIP is 
downloaded from a remote location that is never deleted.
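
To make the scoping problem concrete, here is a minimal sketch of the pattern described above. It is not the Storage Service code: the function names, the placeholder file, and the use of `tempfile.mkdtemp` are assumptions for illustration only. The download step creates the first temporary directory and returns just the path to the AIP, so the fixity step has nothing to clean up except the second directory it creates itself.

```python
import os
import shutil
import tempfile


def fetch_remote_aip(staging_root):
    """Hypothetical stand-in for the S3 download: creates the FIRST temp directory."""
    download_dir = tempfile.mkdtemp(dir=staging_root)
    aip_path = os.path.join(download_dir, "example-aip.7z")
    with open(aip_path, "wb") as handle:
        handle.write(b"placeholder for downloaded AIP bytes")
    # Only the file path is returned; nothing downstream owns download_dir.
    return aip_path


def check_fixity(aip_path, staging_root, delete_after=True):
    """Hypothetical stand-in for the fixity check on a compressed AIP."""
    # SECOND temp directory: where the package would be extracted for validation.
    extract_dir = tempfile.mkdtemp(dir=staging_root)
    try:
        shutil.copy(aip_path, extract_dir)  # placeholder for extraction + bag validation
        return True
    finally:
        if delete_after:
            # Only the directory created in this scope is removed.
            shutil.rmtree(extract_dir)


def run_checks(count):
    """Run `count` checks and return how many temp directories are left over."""
    staging_root = tempfile.mkdtemp()
    try:
        for _ in range(count):
            aip_path = fetch_remote_aip(staging_root)
            check_fixity(aip_path, staging_root)
        return len(os.listdir(staging_root))
    finally:
        shutil.rmtree(staging_root)  # tidy up after the demonstration


print(run_checks(3))  # 3: one leftover download directory per check
```

One way to close the gap in a sketch like this would be for the download step to hand back the directory it created, so the caller can remove it once the check has finished.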
   
I will log a ticket in github.com/archivematica/issues because clearly this 
can be improved upon. 

If storage remains tight, then having two temporary AIP folders might need 
to be accounted for. If you have ideas for these improvements to the storage 
service then feel free to contribute to the conversation on the GitHub issue or
discuss a potential pull-request. 


All the best,
Ross

Jarrod Harvey

Jan 14, 2019, 6:39:26 PM
to archivematica
Hi Ross,

Thanks for this detailed explanation.

If I'm understanding you correctly, if there are 100 AIPs in S3 and I run "fixity scanall" then only the first AIP will be downloaded and will NOT get deleted. Every other AIP WILL get deleted after it's scanned. So the other 99 AIPs won't stick around permanently.

I only have one AIP in S3 at the moment, which I think is why I made the false assumption that all of them would be downloaded and remain on the server permanently. Whoops!

That clears things up. For our purposes having just one AIP download and stick around afterwards really isn't a big deal.

- Jarrod

rspe...@artefactual.com

Jan 15, 2019, 3:57:37 AM
to archivematica
Hi Jarrod,

I am afraid you had it correct the first time around. The illustration was just to provide some more background on what is happening and why.

For services like S3, two temporary directories are created for each fixity check and only one of them is removed. If you were to run the check again, even on the same AIP, then you'd be left with yet another temporary directory.


Hopefully the context is useful for anyone who might want to submit a pull request to fix it. Your script is a sensible solution to the problem in the meantime.
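
The script referred to here is not reproduced in this thread. As a rough idea of what such a stopgap could look like, the sketch below finds `tmp*` directories directly under a given root that have not been modified for a while and removes them only when explicitly asked to. The function names, the 60-minute threshold, and the dry-run default are all assumptions for illustration, not the actual script; the only detail taken from the thread is that the leftovers appear as `tmp...` directories under /var/archivematica/storage_service.

```python
import os
import shutil
import time


def stale_tmp_dirs(root, min_age_minutes=60, now=None):
    """Return tmp* directories directly under root not modified for min_age_minutes."""
    now = time.time() if now is None else now
    cutoff = now - min_age_minutes * 60
    stale = []
    for name in sorted(os.listdir(root)):
        path = os.path.join(root, name)
        # Skip symlinks and anything that is not a tmp-prefixed directory.
        if name.startswith("tmp") and os.path.isdir(path) and not os.path.islink(path):
            if os.path.getmtime(path) < cutoff:
                stale.append(path)
    return stale


def clean_tmp_dirs(root, min_age_minutes=60, dry_run=True):
    """Remove stale tmp* directories; with dry_run=True only report them."""
    targets = stale_tmp_dirs(root, min_age_minutes)
    if not dry_run:
        for path in targets:
            shutil.rmtree(path)
    return targets
```

Calling `clean_tmp_dirs("/var/archivematica/storage_service")` with the default `dry_run=True` only lists candidates, which makes it easy to review them before deleting anything. The age threshold is there so that a download or fixity check still in progress is left alone; it would need to be longer than the slowest transfer you expect.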

Ross

Jarrod Harvey

Jan 16, 2019, 11:39:09 PM
to archivematica
Hi Ross,

Thanks for the update. I understand now.

- Jarrod