SilverStripe 4.0: File Deletion / Security Discussion


Damian Mooyman

Oct 15, 2015, 7:00:59 PM
to SilverStripe Core Development

Hi SilverStripe developers!


As our work is progressing on the implementation of the RFC-1 assets abstraction, we have now changed nearly every part of the code base that deals with files to use this new backend.


We have made the decision at this stage not to deprecate the Image dataobject, although initially it was intended to be removed. There has been a lot of great discussion about this on the other discussion thread at https://groups.google.com/forum/#!topic/silverstripe-dev/cMRZ8HVe4Os.


To give everyone a quick recap on what HAS been implemented, this is a super short ELI5 summary of the current behaviour, and how it differs from the old 3.x asset system.


Storage:


All files are now stored using a tuple (set) of values, which the backend knows how to use to find a given physical file. The default backend uses flysystem for storage, so we map a given tuple to a relative path that flysystem can understand.


For instance, a file with the properties:

  • Filename: parent/file.jpg

  • Hash: 55b443b60176235ef09801153cca4e6da7494a0c

  • Variant: NULL


Would end up being stored in the assets directory under the physical filename


/sites/mysite.com/www/assets/parent/55b443b601/file.jpg


Since all files are stored under (and can be identified by) their hash, many versions of a single file can potentially be stored concurrently without overwriting one another, even with the same filename.


Alternatively, two different files with the SAME hash but different filenames could also be stored independently, should one need to be deleted without affecting references to the other.
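To make the tuple-to-path mapping concrete, here is a minimal Python sketch of the logic (the real backend is PHP on top of flysystem; the function name and the double-underscore variant naming convention are illustrative assumptions, not the actual implementation):

```python
import hashlib

def tuple_to_path(filename, content, variant=None):
    """Map an asset tuple (Filename, Hash, Variant) to a relative
    path that a flysystem-style backend can understand."""
    # sha1 of the file content identifies this version of the file
    sha = hashlib.sha1(content).hexdigest()
    directory, _, basename = filename.rpartition("/")
    # Only a truncated prefix of the hash is used as a directory, so
    # many versions of parent/file.jpg can coexist side by side.
    hash_dir = sha[:10]
    if variant:
        # assumption: variants are encoded into the file name
        name, _, ext = basename.rpartition(".")
        basename = "%s__%s.%s" % (name, variant, ext)
    parts = [p for p in (directory, hash_dir, basename) if p]
    return "/".join(parts)

print(tuple_to_path("parent/file.jpg", b"hello"))
```

With the hash from the example above, `parent/file.jpg` would resolve to `parent/55b443b601/file.jpg` under the assets directory.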


Referencing:


In the old API, all files had a 1-to-1 mapping with a File dataobject, and a synchronisation task enforced this: if a File dataobject was lost, the task would re-create it, and vice versa, would remove the dataobject if the underlying file was lost.


However, we no longer have this relationship. File dataobjects still exist, but each can only reference a single version (hash) of any one filename at a time.


The new DBFile field type has been introduced as a new method of linking to a file. In fact, the File dataobject itself uses this field type to refer to the current version of the Filename it’s linking to. Although you can’t have multiple File dataobjects with the same filename, you can now have as many DBFiles pointing to a single filename as you like.


The new problem:


Asset synchronisation no longer exists, since the 1-to-1 mapping is gone. We have also decided that relying on the File dataobject has a limited lifetime, and some time in the future we might choose to completely remove it.


Additionally, other DataObjects which use DBFile might reference files without a File dataobject even existing for that record. This means that we can’t rely on some features which, in the past, have solved very real problems. For instance, using file link tracking between pages and Files to detect broken links, or being able to find and remove assets by deleting the file in the asset admin section.


The problem is that we have no way of ensuring that the file can be safely assigned to one of the three states:

  • Publicly available (attached to a published dataobject or file where canView is true)

  • Disabled from public (attached to a draft / archived dataobject, or canView is false)

  • Removed from filesystem (all objects containing references to this file are deleted)


Unfortunately, in the absence of a good deletion / restriction policy, file deletion is currently disabled.


In fact, the current code for `File::updateFilesystem` never removes anything from the filesystem (https://github.com/silverstripe/silverstripe-framework/blob/f8f4ed03b6f4de2e69c70cc18e270646908d1469/filesystem/File.php#L534), and the core API for assets has no `delete` method (https://github.com/silverstripe/silverstripe-framework/blob/f8f4ed03b6f4de2e69c70cc18e270646908d1469/filesystem/storage/AssetStore.php).


What does the solution look like?


My feeling is that we need a new system that can determine when a file should be deleted or hidden, and how this can be achieved.


Some of the ideas we have had:

  • Some kind of reference counting system for filename / hash values. This could be per-dataobject or across all dataobjects. When a dataobject is deleted, it can remove the physical file if its reference is the last one to that file.

  • Never have more than one reference to a file: Each DataObject with a reference to a DBFile will refer to its own copy of that file (File dataobjects included). If that file is deleted, the asset is removed as well. This of course would mean that “overwrite on upload” would not be allowed anymore, since one dataobject would not be allowed to overwrite files belonging to another.

  • Having some kind of web-inaccessible “archive” folder of deleted files, where inactive files are stored once deleted from filesystem. In the case where a tuple references a file that no longer exists (e.g. when viewing a page in archive mode), it could be retro-actively restored from the archive on the fly. This could mean that deleting files is a safer operation, but there’s also the ability for administrators to ‘flush’ the archive as necessary. It also means that projects which rely on permanent archiving of all content (such as versioned File dataobjects) would be able to restore even the oldest of deleted records.

  • Potentially (in some form) reviving a new asset synchronisation feature, and use File dataobjects once again as a 1-to-1 mapping of all assets. This would require solving the “many versions with a single filename” issue.

  • To provide file security, links to assets could either be public or protected, where public files could be exposed to the web (chmod 755 and direct urls), but protected files would have to go via a gateway. Whenever a dataobject invokes ->getURL() on a file attached to it, it would generate a link to this gateway script, along with a temporary hash (with some lifetime on it) guaranteeing access to that file. For example, to embed it via an <img /> tag when viewing draft pages as a content editor. There would also need to be some other way of preventing web access to this file directly, which could mean storing it in a location outside of /assets (or a private subdirectory). There’s also the problem of knowing whether a file is protected or public, just based on the hash and filename identifier. Different versions of a single filename could have different permissions, or even the same version of a file with slightly different filenames (/assets/parent/55b443b601/secure.jpg and /assets/parent/55b443b601/public.jpg), so using a directory .htaccess could be problematic.

  • We could rely on setting chmod 700 for private files, although it would not work if the webserver was running as the same user as PHP.

  • Going to either extreme, by leaving all files public (and relying on urls not being guessed), or by making all files go through a secure gateway (with performance implications), is less ideal.
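The reference-counting idea in the first bullet can be sketched in a few lines (a Python sketch for illustration only; the class and method names are hypothetical, not a proposed SilverStripe API):

```python
from collections import Counter

class RefCountingStore:
    """Sketch of reference counting per (filename, hash) tuple: the
    physical file is only removed once the last reference is gone."""

    def __init__(self):
        self.refs = Counter()
        self.files = {}  # (filename, hash) -> file content

    def attach(self, filename, sha, content=None):
        key = (filename, sha)
        if content is not None:
            self.files[key] = content
        self.refs[key] += 1

    def detach(self, filename, sha):
        key = (filename, sha)
        self.refs[key] -= 1
        if self.refs[key] <= 0:
            # last reference gone: safe to remove the physical file
            del self.refs[key]
            self.files.pop(key, None)
            return True   # file deleted
        return False      # other dataobjects still reference it

store = RefCountingStore()
store.attach("parent/file.jpg", "55b443b601", b"...")
store.attach("parent/file.jpg", "55b443b601")  # second dataobject, same file
print(store.detach("parent/file.jpg", "55b443b601"))  # False: still referenced
print(store.detach("parent/file.jpg", "55b443b601"))  # True: file removed
```

The open question noted above remains: whether an older version of an unpublished or deleted dataobject should count as a reference for this purpose.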


For 4.0 we agree that some kind of security and asset removal solution is needed. We have addressed some of these problems in 3.x with the secure assets module (https://github.com/silverstripe-labs/silverstripe-secureassets), but even this does not automatically respect published or deleted objects, and file dataobjects must be secured independently of the pages or dataobjects which point to them.


Thanks everyone for the great ideas and feedback along the way. I’m really keen on hearing about your ideas on this issue. For the sake of tracking this issue, please refer to the open source ticket at https://github.com/silverstripe/silverstripe-framework/issues/4677.


Also if you wish to join the ongoing discussion about Images, please refer to the current thread at https://groups.google.com/forum/#!topic/silverstripe-dev/cMRZ8HVe4Os, as well as the image shortcode discussion at https://github.com/silverstripe/silverstripe-framework/issues/4337


Ingo Schommer

Oct 20, 2015, 5:58:28 PM
to SilverStripe Core Development

On Friday, October 16, 2015 at 12:00:59 PM UTC+13, Damian Mooyman wrote:
Thanks Damian - it's an important discussion because it affects how complex we make file handling in the core product. It also ties into a larger discussion of how we handle versioning and publication of relationships in 4.0 - the File record and DBFile field are just one specific aspect of this. We need to ensure that the concept of archiving files rather than deleting them is seamless and transparent to both developers and content authors. That's partly a UX challenge which can be discussed separately.

From a high level, do you think file versioning should be part of the framework baseline operation, or an optional feature which you can enable? For many developers, this will be an "ease of use" vs. "complexity" tradeoff. You get a lot more power with file publication, but also need to understand how it works. This guides decisions around which deletion approach we take. 

More comments below. 


What does the solution look like?


My feeling is that we need a new system that can determine both when a file should be deleted or hidden, and if so, how this can be achieved.


Some of the ideas we have had:

  • Some kind of reference counting system for filename / hash values. This could be per-dataobject or across all dataobjects. When a dataobject is deleted, it can remove the physical file if its own is the last reference to that file.

I think that's realistic to implement, because we control ORM operations on DataObjects, and already have a system in place for tracking file shortcode references. I can't think of a good use case where you'd want to retain the file on the filesystem if there's no DBFile or File reference to it any longer. Unless you're using low-level APIs, it wouldn't be visible to any visitors. An older version of an unpublished or deleted DataObject would still count as a "reference" though, right? You've covered that below through the "archive" folder, and we'll have this problem regardless of *how* a file is deleted anyway.
  
  • Never have more than one reference to a file: Each DataObject with a reference to a DBFile will refer to its own copy of that file (File dataobjects included). If that file is deleted, the asset is removed as well. This of course would mean that “overwrite on upload” would not be allowed anymore, since one dataobject would not be allowed to overwrite files belonging to another.

So we'd "scope" the file name or path to include an identifier to the referencing DataObject, correct? 
I assume "File" in this context refers to a single filesystem entry, as opposed to a File DataObject. Sounds like a viable solution, particularly since we'd still retain the ability to create multiple $has_one=File or $has_many=File relations on a DataObject level. The only way to create more than one reference would be using the new DBFile field directly on multiple DataObject classes. DBFile could become a popular alternative for $has_one=File because it simplifies DataObject versioning (data is self-contained in the database row).  

  • Having some kind of web-inaccessible “archive” folder of deleted files, where inactive files are stored once deleted from filesystem. In the case where a tuple references a file that no longer exists (e.g. when viewing a page in archive mode), it could be retro-actively restored from the archive on the fly. This could mean that deleting files is a safer operation, but there’s also the ability for administrators to ‘flush’ the archive as necessary. It also means that projects which rely on permanent archiving of all content (such as versioned File dataobjects) would be able to restore even the oldest of deleted records.

Assuming we want to tackle versioning and publication of File DataObjects at the same time anyway, we'll need a solution for access control regardless of the deletion issue. An archive folder necessitates access control as you've outlined, and hence adds considerable complexity. Should we require the archive folder for simple 3.x style deletion of files, even for projects which might not care about file versioning?

Ingo

Sam Minnée

Oct 21, 2015, 7:41:02 AM
to SilverStripe Core Development
What about adding support for:

1. Multiple asset persistence layers (APLs) connected to a system
2. Rules, such as "is public" and "is published to live", that are used to determine whether an asset is stored in a given APL
3. APL backends for both direct-public-access and canView()-mediated-access

Then public, published assets could be pushed to the direct-public-access APL, and the canView()-mediated one could be used for everything else.

As well as allowing access control without compromising performance in the simple case, this would let developers build systems where draft content was on separate systems from published content. This has come up as a desired feature in a few projects.

This basically means that your "archive" folder is a separate APL, which seems cleaner than some kind of in-APL mechanism.

The most complex part of this would probably be point #2, especially since it's unclear what the scope of rules would need to be. If you were using reference counting you would probably have per-APL reference counting rules: for a public asset system, references from File_versions or File wouldn't count, only File_Live. This would mean that assets would get garbage collected from the public APL more quickly than from the private APL.

In addition to reference counting it would probably be possible to run something like "SELECT DISTINCT FileHash, FileName FROM File_versions" and compare to the content of the APL. It would be slow, but as a daily/weekly overnight process it might be okay. This would mitigate against reference corruption. You could also run queries like these only for the assets that have 0 references reported, as a double-check before actually deleting the files.
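That overnight sweep reduces to a set difference. A minimal Python sketch, with plain sets standing in for the `SELECT DISTINCT FileHash, FileName FROM File_versions` result and the APL's directory listing (`find_orphans` is a hypothetical name):

```python
def find_orphans(apl_contents, referenced_tuples):
    """Return (FileName, FileHash) pairs that exist in the APL but are
    no longer referenced by any database row. A real job would feed
    this from the File_versions query and the asset backend listing,
    then double-check each candidate before actually deleting it."""
    return sorted(set(apl_contents) - set(referenced_tuples))

apl = {("a.jpg", "aaa"), ("b.jpg", "bbb"), ("c.jpg", "ccc")}
db = {("a.jpg", "aaa"), ("c.jpg", "ccc")}
print(find_orphans(apl, db))  # [('b.jpg', 'bbb')]
```

As Sam notes, running this only over assets already reporting zero references keeps the expensive comparison small while still guarding against reference corruption.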


Damian Mooyman

Oct 22, 2015, 5:12:06 PM
to SilverStripe Core Development

I've given this a bit of thought and looked through the code, and this is the solution I have in mind.


To provide a level of control over the visibility and access of assets in the backend store, I propose segmenting the store into a set of components with independent access controls applied.


Inside a typical vanilla installation, for instance, the assets folder would have the following directory structure:


  • /assets/ (and /assets/Uploads/): Top-level directory.

  • /assets/.htaccess: Allows direct access to files and subfolders. Requests which do not match existing files are routed to /framework/main.php.

  • /assets/secure/ (and /assets/secure/Uploads/): Files stored in the same directory structure as assets. Contains all files which require access control; these assets will be less performant, as each request will invoke a PHP process.

  • /assets/secure/.htaccess: Blocks all direct web requests. No urls containing /assets/secure will be accepted.

  • /assets/archive/ (and /assets/archive/Uploads/): Archive of all files which have (at some point in the past) been removed from stage / unversioned records. Pages which are restored from archive will also have their assets restored from this folder.

  • /assets/archive/.htaccess: Blocks all direct web requests. No urls containing /assets/archive will be accepted.


These segments will have some optionality: for instance, the ‘secure’ folder could be disabled, merging any files intended for that folder into the public ‘assets’ folder. However, it would be problematic to leave this feature out of the vanilla install altogether, as the manipulations API would need to have a ‘secure file’ concept baked into it. A second option would be to omit the archive folder, and simply remove these files.


These folders are specific to the default asset backend, as other backends would be able to provide their own mechanism for secure file management.


The following API methods would be added to the abstract asset store interface:

  • AssetStore::archive($filename, $hash) - Delete or archive the file

  • AssetStore::protected($filename, $hash) - Move the file to the secure folder (possibly restoring it from archive)

  • AssetStore::publish($filename, $hash) - Move the file to the public folder (including restoring).


‘Variant’ is omitted because it would be cumbersome to control the visibility of files on a variant-by-variant basis, and it is unlikely that individual sizes will need their own protection controls. Archiving, protecting, or publishing any filename and hash will apply to all variants of that file at the same time.
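A rough sketch of how the default backend might implement these whole-file operations, assuming the directory layout from the earlier examples (Python for brevity; the segment paths and function name are illustrative, not the proposed PHP API):

```python
import os
import shutil
import tempfile

# assumed segment roots, following the directory structure above
SEGMENTS = {"public": "assets", "protected": "assets/secure",
            "archive": "assets/archive"}

def move_all_variants(root, filename, sha, src, dst):
    """Move a (filename, hash) pair between segments. Because every
    variant lives under the same truncated-hash directory, moving that
    directory carries the original file and all variants at once."""
    directory, _, _ = filename.rpartition("/")
    hash_dir = sha[:10]
    src_dir = os.path.join(root, SEGMENTS[src], directory, hash_dir)
    dst_dir = os.path.join(root, SEGMENTS[dst], directory, hash_dir)
    os.makedirs(os.path.dirname(dst_dir), exist_ok=True)
    shutil.move(src_dir, dst_dir)

# demo: protect parent/file.jpg and its resized variant in one move
root = tempfile.mkdtemp()
h = "55b443b60176235ef09801153cca4e6da7494a0c"
os.makedirs(os.path.join(root, "assets", "parent", h[:10]))
for name in ("file.jpg", "file__resized.jpg"):
    open(os.path.join(root, "assets", "parent", h[:10], name), "w").close()
move_all_variants(root, "parent/file.jpg", h, "public", "protected")
```

After the call, both files sit under assets/secure/parent/55b443b601/, and nothing remains in the public segment.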



File access


In order to provide access to files, especially secure files contained within the HTML area of any page / dataobject, it is necessary to provide a ‘trigger’ for granting access. In the case of versioned pages, or pages with specialised canView rules applied, determining the visibility of linked assets can be complicated.


The simplest solution is to declare that, whenever code requests the “url” of an asset, the APL will internally whitelist that asset path (e.g. assets/Uploads/04BCE38/image.jpg), storing that whitelist in the session of the current user, before returning the url to be rendered. This could be controlled by a boolean flag on the AssetStore::getAsURL method that disables the whitelisting. E.g.



/**
 * Get the url for the file
 *
 * @param string $filename Filename (not including assets)
 * @param string $hash sha1 hash of the file content.
 *  If a variant is requested, this is the hash of the file before it was modified.
 * @param string|null $variant Optional variant string for this file
 * @param bool $whitelist Flag if this url should be temporarily granted permission for the current user.
 * @return string public url to this resource
 */
public function getAsURL($filename, $hash, $variant = null, $whitelist = true);




The actual urls, therefore, will be exactly the same regardless of whether they are protected or public: assets/Uploads/somefile.jpg will be redirected to framework/main.php and, if it matches the whitelist of any file in the current user’s session, can be served the contents of the assets/secure/Uploads/somefile.jpg file.



Director:

 rules:

   'assets/$AssetPath': 'SecureAssetController'


Example flow for viewing a “secured” image (assets/secure/image.jpg) with a relation to an unpublished page:

  1. Anonymous visitor

    1. Web request for page

    2. Page->canView() denies request and never calls getURL()

    3. Direct requests to assets/secure/image.jpg will be denied because no session whitelist entry exists

  2. CMS author

    1. Web request for page

    2. Page->canView() allows request

    3. Page template renders image tag, calling getURL() in the process

    4. getURL() adds image path to session whitelist

    5. Browser renders image tag and requests image (assets/image.jpg)

    6. Webserver doesn’t find file, routes request to asset proxy script

    7. Proxy script finds the file in assets/secure/image.jpg, checks session whitelist and serves the file content
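The whitelist logic in this flow could be sketched as follows (a Python sketch; plain dicts stand in for the session and the asset store, and all names are illustrative rather than proposed API):

```python
def get_as_url(session, path, whitelist=True):
    """Step 4: rendering the template asks for the url and, unless
    disabled, the path is whitelisted in the current user's session."""
    if whitelist:
        session.setdefault("asset_whitelist", set()).add(path)
    return path

def serve(session, path, public_files, protected_files):
    """Steps 6-7: the webserver found no public file, so the proxy
    script checks the session whitelist before serving the copy held
    under the secure segment. Returns file content or a 403 marker."""
    if path in public_files:
        return public_files[path]
    if path in session.get("asset_whitelist", set()) and path in protected_files:
        return protected_files[path]
    return "403 Forbidden"

protected = {"assets/image.jpg": b"jpeg-bytes"}

# anonymous visitor: canView() denied, so getURL() was never called
anon = {}
print(serve(anon, "assets/image.jpg", {}, protected))  # 403 Forbidden

# CMS author: the page render called get_as_url() first
author = {}
get_as_url(author, "assets/image.jpg")
print(serve(author, "assets/image.jpg", {}, protected))  # b'jpeg-bytes'
```

Note how the same url works for both users; only the session state differs.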


Caching


This would, of course, only work for assets that are guaranteed to have getAsURL called on them prior to view. For static cached sites, any asset included in a page would need to be left in the public folder. Partial caches on draft pages would work because the `$CurrentUser.ID` is included in the cache global_key, and asset whitelists are assigned to that user’s session only.


Controlling visibility


In order to ensure that files are assigned the correct visibility, the most critical element of this system is the assignment of files to the appropriate segment. The actions below would thus cause the following changes.


The most complicated challenge will be when modifying the published record, as the published version may have some overlap with files that also exist in the stage version.


Versioned dataobjects:


  • Remove record from live: Protect all assets attached to the draft record. Do some magic to archive any assets on the live record but not on the draft record.

  • Archive record: Move all assets to archive.

  • Restore from archive: Move assets from archive to the protected folder.

  • Publish: Move all assets to the public folder. Will have to do some magic code to check whether any assets on the old published record need to be archived.

  • Assign asset to stage record: Add the new asset to the protected folder (or as configured).

  • Replace asset on stage record with a new one, or move a versioned File dataobject to a new filename: Do some more magic here. If the existing file is attached to the published record, leave it in place; if it’s protected, it can be archived. Save the new (or moved) file in the new location in the protected folder (or as configured).


Unversioned dataobjects:


  • Save record: Nothing.

  • Assign asset to record: Add the new asset to the public folder (or as configured).

  • Replace asset with a new one, or move an unversioned File dataobject to a new filename: Move the existing asset to archive. Assign the new asset to the public folder (or as configured).

  • Delete record: Move assets to archive.


In order to save files into a DBFile, it will be necessary to use the `AssetUploadField`, which behaves the same way as the `UploadField`, but acts directly on the field itself rather than creating `File` dataobjects.


This formfield will have a setDefaultSecurity() method, which will allow greater control over where new assets are stored when they are uploaded. If not configured, then new assets will default to ‘secure’ for versioned records, or ‘public’ for unversioned records, as above.


Other notes:

  • In order to prevent the need to do asset reference counting, all assets can only belong to a single dataobject. Any files uploaded via the `AssetUploadField` will only ever have their “on duplicate” behaviour set to “rename”.

  • Maybe the “archive” folder should be disabled by default, but we should only do this if we can guarantee that files will only be archived when the parent dataobject is deleted. If a dataobject is deleted and immediately re-created, for instance, it could end up with broken asset links.

  • I have considered that calling AssetStore::getAsURL() on an archived file possibly should restore this file from archive, if available. This may not be necessary if we can avoid over-eager archiving of files.

  • We would need to create some kind of extension that is applied to dataobjects, in order to trigger all of the above actions. This extension would need to be able to search all DBFile references on the dataobject, as well as all shortcodes present in all HTML areas. I expect that this extension would require a high degree of customisability, as well as being able to be completely disabled.

  • public / protected / archived is not a value stored in the database (i.e. not an additional tuple). This will need to be determined by looking at the physical file location of any file stored by the asset store.
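The discovery step of such an extension (finding all file references on a dataobject, including shortcodes embedded in HTML fields) might look roughly like this (a Python sketch; the shortcode regex only approximates SilverStripe's [image id=...] syntax, and the record structure is invented for illustration):

```python
import re

# rough approximation of [image id="7" ...] / [file_link id=9] shortcodes
SHORTCODE = re.compile(r'\[(?:image|file_link)[^\]]*\bid="?(\d+)"?[^\]]*\]')

def find_file_references(record):
    """Collect the file ids a dataobject references, both from
    DBFile-style fields and from shortcodes in its HTML fields."""
    refs = set(record.get("dbfiles", []))
    for html in record.get("html_fields", []):
        refs.update(int(m) for m in SHORTCODE.findall(html))
    return refs

record = {
    "dbfiles": [12],
    "html_fields": ['<p>See [image id="7" width="100"] and [file_link id=9]</p>'],
}
print(sorted(find_file_references(record)))  # [7, 9, 12]
```

The extension would run a scan like this on each write/publish/delete, then trigger the appropriate archive/protect/publish operations on every referenced file.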

Damian Mooyman

Oct 26, 2015, 5:29:34 PM
to SilverStripe Core Development
Hi Dev list,

I hope that this feature hasn't been too daunting for users to discuss. :D Since there hasn't been much further involvement in this discussion, we'll look at developing this feature as we have proposed, and will let the community know when we have more information in a future update.

In the meantime, we are still welcoming feedback in case anyone has anything to add. We likely won't begin work on the archiving features for another few weeks.

Kind regards,

Damian Mooyman

Sam Minnée

Oct 26, 2015, 6:44:12 PM
to SilverStripe Core Development
Hey Damian,

A few things:

1. I think that assets & assets/secure should be separate Flysystem endpoints. In nice configurations you'd probably keep assets/secure entirely outside the webroot. You might even have them on separate servers.

2. I also think that public access to files should be whitelist-based, not blacklist-based. So the published version of a public file is explicitly pushed into the public-access space, rather than files being marked as ‘protected'.

Thanks,
Sam

Damian Mooyman

Oct 26, 2015, 6:53:49 PM
to SilverStripe Core Development
Totally agree on these points; in fact, both the public and secure folders should be completely independently configurable, and may not even reside on the same servers.

We'll need to make sure that assets are "secure by default" and only published when it's safe to do so.

Damian Mooyman

Dec 21, 2015, 5:03:31 PM
to SilverStripe Core Development
This feature is now implemented, and ready to be merged into core.

Please see https://github.com/silverstripe/silverstripe-framework/pull/4863 if you would like to review this. :)

There is a lot of documentation on the feature, so those wishing to understand the mechanism in more depth can start there.