Hi SilverStripe developers!
As our work is progressing on the implementation of the RFC-1 assets abstraction, we have now changed nearly every part of the code base that deals with files to use this new backend.
We have made the decision at this stage not to deprecate the Image dataobject, although initially it was intended to be removed. There has been a lot of great discussion about this on the other discussion thread at https://groups.google.com/forum/#!topic/silverstripe-dev/cMRZ8HVe4Os.
To give everyone a quick recap on what HAS been implemented, this is a super short ELI5 summary of the current behaviour, and how it differs from the old 3.x asset system.
Storage:
All files are now stored using a tuple (set) of values, which the backend knows how to use to find a given physical file. The default backend uses flysystem for storage, so we map a given tuple to a relative path that flysystem can understand.
For instance, a file with the properties:
Filename: parent/file.jpg
Hash: 55b443b60176235ef09801153cca4e6da7494a0c
Variant: NULL
Would end up being stored in the assets directory under the physical filename
/sites/mysite.com/www/assets/parent/55b443b601/file.jpg
Since all files are ordered by (and can be identified by) their hash, many versions of a single file can potentially be stored concurrently, without overwriting one another under the same filename.
Alternatively, two different files with the SAME hash but different filenames could also be stored independently, should one need to be deleted without affecting references to the other.
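As a rough illustration of the mapping described above, here is a minimal sketch. `tupleToPath` is a hypothetical helper, not part of the framework. The 10-character hash prefix is taken from the example path above. The `__variant` suffix is purely an assumption about how variants might be encoded.

```php
<?php
// Hypothetical helper, not part of the SilverStripe API: maps a
// (filename, hash, variant) tuple to a path relative to the assets root.
function tupleToPath($filename, $hash, $variant = null)
{
    $dir = dirname($filename);          // "parent", or "." for top-level files
    $name = basename($filename);        // "file.jpg"
    $shortHash = substr($hash, 0, 10);  // first 10 characters, as in the example path

    // Assumption: a variant is appended to the base name as "__variant".
    if ($variant !== null) {
        $ext = pathinfo($name, PATHINFO_EXTENSION);
        $base = pathinfo($name, PATHINFO_FILENAME);
        $name = $base . '__' . $variant . ($ext !== '' ? '.' . $ext : '');
    }

    $prefix = ($dir === '.' || $dir === '') ? '' : $dir . '/';
    return $prefix . $shortHash . '/' . $name;
}

// Prints "parent/55b443b601/file.jpg"
echo tupleToPath('parent/file.jpg', '55b443b60176235ef09801153cca4e6da7494a0c'), "\n";
```

Because the hash prefix is part of the path, two versions of parent/file.jpg with different content land in different directories and cannot overwrite each other.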
Referencing:
In the old API, all files had a 1-to-1 mapping with a File dataobject, and there existed a synchronisation task which enforced this. If a file dataobject was lost, the synchronisation task would re-create it, and vice versa, remove the dataobject if the underlying file was lost.
However, we no longer have this relationship. File dataobjects still exist, but each can only reference a single version (hash) of any one filename at a time.
Instead, the new DBFile field type has been introduced as a new method of linking to a file. In fact, the File dataobject itself uses this field type to refer to the current version of the Filename it’s linking to. Although you can’t have multiple File dataobjects with the same filename, you can now have as many DBFiles pointing to a single filename as you like.
The new problem:
Asset synchronisation no longer exists, since the 1-to-1 mapping is gone. We have also decided that relying on the File dataobject has a limited lifetime, and some time in the future we might choose to completely remove it.
Additionally, other DataObjects which use DBFile might reference files without a File dataobject even existing for that record. This means that we can’t rely on some features which, in the past, have solved very real problems. For instance, using file link tracking between pages and Files to detect broken links, or being able to find and remove assets by deleting the file in the asset admin section.
The problem is that we have no way of ensuring that the file can be safely assigned to one of the three states:
Publicly available (attached to a published dataobject or file where canView is true)
Disabled from public (attached to a draft / archived dataobject, or canView is false)
Removed from filesystem (all objects containing references to this file are deleted)
Unfortunately, in the absence of a good deletion / restriction policy, file deletion is currently disabled.
In fact, the current code for `File::updateFilesystem` (https://github.com/silverstripe/silverstripe-framework/blob/f8f4ed03b6f4de2e69c70cc18e270646908d1469/filesystem/File.php#L534) never removes anything from the filesystem, and the core API for assets (https://github.com/silverstripe/silverstripe-framework/blob/f8f4ed03b6f4de2e69c70cc18e270646908d1469/filesystem/storage/AssetStore.php) has no `delete` method.
What does the solution look like?
My feeling is that we need a new system that can determine both when a file should be deleted or hidden, and if so, how this can be achieved.
Some of the ideas we have had:
Some kind of reference counting system for filename / hash values. This could be per-dataobject or across all dataobjects. When a dataobject is deleted, it can remove the physical file if its own is the last reference to that file.
Never have more than one reference to a file: Each DataObject with a reference to a DBFile will refer to its own copy of that file (File dataobjects included). If that file is deleted, the asset is removed as well. This of course would mean that “overwrite on upload” would not be allowed anymore, since one dataobject would not be allowed to overwrite files belonging to another.
Having some kind of web-inaccessible “archive” folder of deleted files, where inactive files are stored once deleted from filesystem. In the case where a tuple references a file that no longer exists (e.g. when viewing a page in archive mode), it could be retro-actively restored from the archive on the fly. This could mean that deleting files is a safer operation, but there’s also the ability for administrators to ‘flush’ the archive as necessary. It also means that projects which rely on permanent archiving of all content (such as versioned File dataobjects) would be able to restore even the oldest of deleted records.
Potentially (in some form) reviving the asset synchronisation feature, and using File dataobjects once again as a 1-to-1 mapping of all assets. This would require solving the “many versions with a single filename” issue.
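To make the reference counting idea a little more concrete, here is a minimal in-memory sketch. `ReferenceCounterSketch` and its methods are hypothetical names. A real implementation would need to persist the counts, which is not shown here.

```php
<?php
// Hypothetical sketch: counts references per filename / hash pair.
class ReferenceCounterSketch
{
    private $counts = array();

    private function key($filename, $hash)
    {
        return $hash . ':' . $filename;
    }

    public function addReference($filename, $hash)
    {
        $key = $this->key($filename, $hash);
        $this->counts[$key] = isset($this->counts[$key]) ? $this->counts[$key] + 1 : 1;
    }

    // Returns true when the last reference was removed, i.e. when the
    // physical file could now be deleted (or archived).
    public function removeReference($filename, $hash)
    {
        $key = $this->key($filename, $hash);
        if (!isset($this->counts[$key])) {
            return false; // unknown file: nothing to release
        }
        if (--$this->counts[$key] > 0) {
            return false; // other dataobjects still point at this file
        }
        unset($this->counts[$key]);
        return true;
    }
}

$refs = new ReferenceCounterSketch();
$refs->addReference('parent/file.jpg', '55b443b601');
$refs->addReference('parent/file.jpg', '55b443b601');
var_dump($refs->removeReference('parent/file.jpg', '55b443b601')); // bool(false)
var_dump($refs->removeReference('parent/file.jpg', '55b443b601')); // bool(true)
```

Only the second removal reports that the file is safe to delete, since the first dataobject to go still leaves one reference behind.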
To provide file security, links to assets could either be public or protected, where public files could be exposed to the web (chmod 755 and direct urls), but protected files would have to go via a gateway. Whenever a dataobject invokes ->getURL() on a file attached to it, it would generate a link to this gateway script, along with a temporary hash (with some lifetime on it) guaranteeing access to that file. For example, to embed it via an <img /> tag when viewing draft pages as a content editor. There would also need to be some other way of preventing web access to this file directly, which could mean storing it in a location outside of /assets (or a private subdirectory). There’s also the problem of knowing whether a file is protected or public, just based on the hash and filename identifier. Different versions of a single filename could have different permissions, or even the same version of a file with slightly different filenames (/assets/parent/55b443b601/secure.jpg and /assets/parent/55b443b601/public.jpg), so using a directory .htaccess could be problematic.
We could rely on setting chmod 700 for private files, although it would not work if the webserver was running as the same user as PHP.
Both extremes are less than ideal: leaving all files public (and relying on urls not being guessed), or making all files go through a secure gateway (with performance implications).
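One possible way to build the "temporary hash with some lifetime" for gateway links is an HMAC over the asset path and an expiry time. This is only a sketch of that one technique, not something decided in this thread. `signAssetLink`, `verifyAssetLink` and the secret are hypothetical. `hash_hmac` and `hash_equals` are standard PHP functions (`hash_equals` needs PHP 5.6 or later).

```php
<?php
// Hypothetical helpers: neither function exists in SilverStripe.
function signAssetLink($path, $secret, $lifetime, $now)
{
    $expires = $now + $lifetime;
    // The token covers both the path and the expiry, so neither can be altered.
    $token = hash_hmac('sha256', $path . '|' . $expires, $secret);
    return array('path' => $path, 'expires' => $expires, 'token' => $token);
}

function verifyAssetLink(array $link, $secret, $now)
{
    if ($now > $link['expires']) {
        return false; // link has outlived its lifetime
    }
    $expected = hash_hmac('sha256', $link['path'] . '|' . $link['expires'], $secret);
    return hash_equals($expected, $link['token']); // constant-time comparison
}

$link = signAssetLink('assets/parent/55b443b601/secure.jpg', 'example-secret', 300, 1000);
var_dump(verifyAssetLink($link, 'example-secret', 1100)); // bool(true)
var_dump(verifyAssetLink($link, 'example-secret', 1301)); // bool(false)
```

A link signed this way needs no server-side state, but it can be shared with other users until it expires, which may or may not be acceptable for draft content.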
For 4.0 we agree that some kind of security and asset removal solution is needed. We have addressed some of these problems in 3.x with the secure assets module (https://github.com/silverstripe-labs/silverstripe-secureassets), but even this does not automatically respect published or deleted objects, and file dataobjects must be secured independently of the pages or dataobjects which point to them.
Thanks everyone for the great ideas and feedback along the way. I’m really keen on hearing about your ideas on this issue. For the sake of tracking this issue, please refer to the open source ticket at https://github.com/silverstripe/silverstripe-framework/issues/4677.
Also if you wish to join the ongoing discussion about Images, please refer to the current thread at https://groups.google.com/forum/#!topic/silverstripe-dev/cMRZ8HVe4Os, as well as the image shortcode discussion at https://github.com/silverstripe/silverstripe-framework/issues/4337
I've given this a bit of thought and looked through the code, and this is the solution I have in mind.
To provide a level of control over the visibility and access of assets in the backend store, I propose segmenting the store into a set of components with independent access controls applied.
Inside a typical vanilla installation, for instance, the assets folder would have the following directory structure:
Directory / File | Description |
/assets/, /assets/Uploads/ | Top level directory. |
/assets/.htaccess | Allows direct access to files and subfolders. Requests which do not match an existing file are routed to /framework/main.php. |
/assets/secure/, /assets/secure/Uploads/ | Files stored in the same directory structure as assets. Contains all files which require access control, although these assets will be less performant as each request will invoke a PHP process. |
/assets/secure/.htaccess | Blocks all direct web requests. No urls containing /assets/secure will be accepted. |
/assets/archive/, /assets/archive/Uploads/ | Archive of all files which have been removed (at some point in the past) from stage / unversioned records. Pages which are restored from archive will also have their assets restored from this folder. |
/assets/archive/.htaccess | Blocks all direct web requests. No urls containing /assets/archive will be accepted. |
These segments would be optional to some degree, meaning that (for instance) the ‘secure’ folder could be disabled, merging any files which were intended for that folder into the public ‘assets’ folder. However, it would be problematic to leave out this feature altogether in the vanilla install, as the manipulations API would need to have a ‘secure file’ concept baked into it. A second option would be to omit the archive, and simply delete these files instead.
These folders are specific to the default asset backend, as other backends would be able to provide their own mechanism for secure file management.
The abstract asset store interface would gain the following API methods:
AssetStore::archive($filename, $hash) - Delete or archive the file
AssetStore::protected($filename, $hash) - Move the file to the secure folder (possibly restoring it from archive)
AssetStore::publish($filename, $hash) - Move the file to the public folder (including restoring).
‘Variant’ is omitted because it would be cumbersome to control the visibility of files on a variant-by-variant basis, and it is unlikely that individual sizes will need their own protection controls. Archiving, protecting, or publishing any filename and hash will apply to all variants of the same file at the same time.
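The three proposed methods could behave roughly like the in-memory model below. This is a sketch only. `SegmentedStoreSketch` and `getSegment` are hypothetical names, and `protect` stands in for the proposed `AssetStore::protected`, because `protected` cannot be used as a method name before PHP 7.

```php
<?php
// Hypothetical in-memory model of the proposed segments. A real backend
// would move physical files between folders instead of updating an array.
class SegmentedStoreSketch
{
    // Maps "filename|hash" => segment name
    private $segments = array();

    private function move($filename, $hash, $segment)
    {
        $this->segments[$filename . '|' . $hash] = $segment;
    }

    public function archive($filename, $hash)
    {
        $this->move($filename, $hash, 'archive');
    }

    public function protect($filename, $hash)
    {
        $this->move($filename, $hash, 'secure');
    }

    public function publish($filename, $hash)
    {
        $this->move($filename, $hash, 'public');
    }

    // The variant is accepted but ignored: every variant shares the
    // segment of its filename / hash pair.
    public function getSegment($filename, $hash, $variant = null)
    {
        $key = $filename . '|' . $hash;
        return isset($this->segments[$key]) ? $this->segments[$key] : null;
    }
}

$store = new SegmentedStoreSketch();
$store->protect('Uploads/image.jpg', '04bce38');
echo $store->getSegment('Uploads/image.jpg', '04bce38', 'thumbnail'), "\n"; // prints "secure"
```

Keying the state on filename and hash alone is what makes a single call apply to every variant at once.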
File access
In order to provide access to files, especially secure files contained within the html area of any page / dataobject, it is necessary to provide a ‘trigger’ for granting access. In the case of versioned pages, or pages with specialised canView rules applied, determining the visibility of linked assets can be complicated.
The simplest solution is to declare that whenever code requests the “url” of an asset, the API will internally whitelist that asset path (e.g. assets/Uploads/04BCE38/image.jpg), storing the whitelist entry in the session of the current user, before returning the url to be rendered. This could be controlled by a boolean flag on the AssetStore::getAsURL method that disables this whitelisting. E.g.
    /**
     * Get the url for the file
     *
     * @param string $filename Filename (not including assets)
     * @param string $hash sha1 hash of the file content.
     *     If a variant is requested, this is the hash of the file before it was modified.
     * @param string|null $variant Optional variant string for this file
     * @param bool $whitelist Flag if this url should be temporarily granted permission for the current user.
     * @return string public url to this resource
     */
    public function getAsURL($filename, $hash, $variant = null, $whitelist = true);
The actual urls will therefore be exactly the same regardless of whether a file is protected or public. A request for assets/Uploads/somefile.jpg that does not match a public file will be redirected to framework/main.php, and if the path matches a whitelist entry in the current user’s session, the contents of assets/secure/Uploads/somefile.jpg can be sent in response.
    Director:
      rules:
        'assets/$AssetPath': 'SecureAssetController'
Example flow for viewing a “secured” image (assets/secure/image.jpg) with a relation to an unpublished page:
Anonymous visitor:
  1. Web request for page
  2. Page->canView() denies request and never calls getURL()
  3. Direct requests to assets/secure/image.jpg will be denied because no session whitelist entry exists
CMS author:
  1. Web request for page
  2. Page->canView() allows request
  3. Page template renders image tag, calling getURL() in the process
  4. getURL() adds image path to session whitelist
  5. Browser renders image tag and requests image (assets/image.jpg)
  6. Webserver doesn’t find file, routes request to asset proxy script
  7. Proxy script finds the file at assets/secure/image.jpg, checks session whitelist and serves the file content
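The two flows above can be sketched with a plain array standing in for the session. `getAsUrlSketch`, `proxyDecision` and the `AssetWhitelist` session key are hypothetical, and the 403 / 404 status codes are an assumption about what "denied" would mean in practice.

```php
<?php
// Hypothetical sketch; $session stands in for the current user's session data.
function getAsUrlSketch(array &$session, $path, $whitelist = true)
{
    if ($whitelist) {
        // Remember that this user was legitimately shown a link to the asset.
        $session['AssetWhitelist'][$path] = true;
    }
    return '/' . $path;
}

// What the proxy might do once the webserver fails to find the file in the
// public folder. $secureFiles lists the paths held in the secure folder.
function proxyDecision(array $session, array $secureFiles, $path)
{
    if (!in_array($path, $secureFiles, true)) {
        return 404; // no such protected file
    }
    return !empty($session['AssetWhitelist'][$path]) ? 200 : 403;
}

$secureFiles = array('assets/image.jpg');
$anonymous = array();
$author = array();

// Only the CMS author passes canView(), so only their session sees getURL().
getAsUrlSketch($author, 'assets/image.jpg');

var_dump(proxyDecision($anonymous, $secureFiles, 'assets/image.jpg')); // int(403)
var_dump(proxyDecision($author, $secureFiles, 'assets/image.jpg'));    // int(200)
```

Passing `false` as the third argument mirrors the proposed `$whitelist` flag: the url is returned but no access is granted.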
Caching
This would, of course, only work for assets that are guaranteed to have getAsURL called on them prior to view. For static cached sites, any asset included in a page would need to be left in the public folder. Partial caches on draft pages would work because the `$CurrentUser.ID` is included in the cache global_key, and asset whitelists are assigned to that user’s session only.
Controlling visibility
In order to ensure that files are assigned the correct visibility, the most critical element of this system is the assignment of any files to the appropriate segment. Thus the below actions would cause the following changes.
The most complicated challenge will be when modifying the published record, as the published version may have some overlap with files that also exist in the stage version.
Versioned dataobjects:
Remove record from live | Protect all assets attached to draft record. Do some magic to archive any assets on the live record but not on the draft record. |
Archive record | Move all assets to archive. |
Restore from archive | Move assets to protected folder from archive. |
Publish | Move all assets to public folder. Will have to do some magic code to check if any assets on the old published record need to be archived. |
Assign asset to stage record | Add new asset to protected folder (or configured) |
Replace asset on stage record with a new one. Move versioned File dataobject to new filename | Do some more magic here: If existing file is attached to the published record, leave it in place. If it’s protected, it can be archived. Save the new (or moved) file in the new location in the protected folder (or configured). |
Unversioned dataobjects:
Save record | nothing |
Assign asset to record | Add new asset to public folder (or configured) |
Replace asset with new one. Move unversioned File dataobject to new filename. | Move existing asset to archive. Assign new asset to public folder (or configured). |
Delete record | Move assets to archive. |
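One way to read the "magic" in the versioned rules above is as a single decision per asset: public if the live record uses it, otherwise protected if the stage record uses it, otherwise archived. The sketch below encodes that reading. `resolveSegments` is a hypothetical helper, and the rule itself is an interpretation of the table rather than an agreed design.

```php
<?php
// Hypothetical helper. Each argument is a list of asset identifiers
// (for example "filename|hash") referenced by that version of the record.
// $previousAssets holds assets referenced before the change was made.
function resolveSegments(array $liveAssets, array $stageAssets, array $previousAssets)
{
    $result = array();
    $all = array_unique(array_merge($liveAssets, $stageAssets, $previousAssets));
    foreach ($all as $asset) {
        if (in_array($asset, $liveAssets, true)) {
            $result[$asset] = 'public';    // still visible on the live record
        } elseif (in_array($asset, $stageAssets, true)) {
            $result[$asset] = 'protected'; // draft only
        } else {
            $result[$asset] = 'archive';   // no longer referenced
        }
    }
    return $result;
}

// Publishing a draft which replaced old.jpg with new.jpg:
print_r(resolveSegments(array('new.jpg'), array('new.jpg'), array('old.jpg')));
// new.jpg => public, old.jpg => archive

// Removing a record from live while the draft still uses a.jpg:
print_r(resolveSegments(array(), array('a.jpg'), array('a.jpg', 'b.jpg')));
// a.jpg => protected, b.jpg => archive
```

Unversioned records would only ever use the first and last branches, since they have no separate stage version.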
In order to save files into a DBFile, it will be necessary to use the `AssetUploadField`, which behaves the same way as the `UploadField`, but operates directly on the field itself rather than creating `File` dataobjects.
This formfield will have a setDefaultSecurity() method, which will allow greater control over where new assets are stored when they are uploaded. If not configured, then new assets will default to ‘secure’ for versioned records, or ‘public’ for unversioned records, as above.
Other notes:
In order to prevent the need to do asset reference counting, all assets can only belong to a single dataobject. Any files uploaded via the `AssetUploadField` will only ever have their “on duplicate” behaviour set to “rename”.
Maybe the “archive” folder should be disabled by default, but we should only do this if we can guarantee that files will only be archived when the parent dataobject is deleted. If a dataobject is deleted and immediately re-created, for instance, it could end up with broken asset links.
I have considered that calling AssetStore::getAsURL() on an archived file possibly should restore this file from archive, if available. This may not be necessary if we can avoid over-eager archiving of files.
We would need to create some kind of extension that is applied to dataobjects, in order to trigger all of the above actions. This extension would need to be able to search all DBFile references on the dataobject, as well as all shortcodes present in all HTML areas. I expect that this extension would require a high degree of customisability, as well as being able to be completely disabled.
The public / protected / archived state is not a value stored in the database (i.e. it is not an additional tuple value). It will need to be determined by looking at the physical location of the file within the asset store.
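For the default backend, determining the state from the physical location could look something like the following sketch. `detectVisibility` is a hypothetical helper, and the `secure` and `archive` folder names are taken from the proposed directory layout.

```php
<?php
// Hypothetical helper: infers the state of a file from where it is stored.
// $relativePath is the path below the assets root, e.g. "Uploads/04bce38/image.jpg".
function detectVisibility($assetsRoot, $relativePath)
{
    // The restricted segments are checked first, then the public folder.
    $locations = array(
        'protected' => $assetsRoot . '/secure/' . $relativePath,
        'archived'  => $assetsRoot . '/archive/' . $relativePath,
        'public'    => $assetsRoot . '/' . $relativePath,
    );
    foreach ($locations as $state => $path) {
        if (is_file($path)) {
            return $state;
        }
    }
    return null; // not stored in any segment
}
```

Checking the restricted folders first means that a file accidentally present in two segments is reported with the more restrictive state.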