[edx-ora2] Upload open assessment files to backends other than S3


Régis Behmo

Dec 2, 2014, 6:26:07 AM
to edx-...@googlegroups.com
This post is a follow-up on a pull request that was opened a couple of days ago on edx-ora2: https://github.com/edx/edx-ora2/pull/640

To summarize: we need to allow students to upload files for open assessment. However, we do not use Amazon S3 for static file storage. We would like to allow file uploads to a local file storage backend. There are many different ways to do this, and we would like to agree with the edx-ora2 developers on a way to move forward on this subject.

We propose to develop a solution specific to edx-ora2. Once it is stable, we should be able to extract the file upload code and move it to a dedicated (external) Django app. Then we should be able to use this app in parts of edx-platform.

We understand this pull request might not be a priority for the S3-based edX team. That's why we are planning on doing the development work ourselves.

Please pitch in!

--
Régis Behmo, software developer at FUN

William Daly

Dec 2, 2014, 9:32:54 AM
to edx-...@googlegroups.com
Hi Régis,

Thank you for raising this important issue.  I would love to see the ORA2 file upload generalized to support backends other than Amazon S3.

In the current design, clients are given an upload link; clients can then upload directly to S3.  This is really important for scalability.  It means that edx-platform application servers don't need to process file uploads.  Any solution that replaces Amazon S3 as a backend needs to preserve this basic structure.

Therefore, at a minimum, any file upload service needs to (a) generate upload URLs based on key values, (b) generate download URLs based on key values, (c) accept uploads to the upload URLs, and (d) serve files from download URLs.
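
To make those requirements concrete, here is a very rough sketch (with invented names; this is not actual edx-ora2 code) of the interface a backend would have to satisfy:

from abc import ABCMeta, abstractmethod


class BaseUploadBackend(object):
    """Hypothetical interface for a pluggable file upload backend."""
    __metaclass__ = ABCMeta

    @abstractmethod
    def get_upload_url(self, key, content_type):
        """(a) Return a URL the client can upload the file to."""

    @abstractmethod
    def get_download_url(self, key):
        """(b) Return a URL from which the stored file can be fetched."""

# (c) and (d) -- accepting uploads and serving downloads -- are handled by
# whatever sits behind the returned URLs: S3 itself for the boto backend, or
# a Django view for a local-storage backend.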

I like your suggestion of creating a Django app backend.  This would give installations the option to either install the app within edx-platform itself (for simpler deployment) or stand it up as a separate service (for better scalability).

An important question here is how edx-ora2 should interact with this new Django app.  ORA2 will need to do this in order to generate upload / download URLs.  I think the approach outlined in your PR is a step in the right direction: it should be possible to configure ORA2 to interact with different services by plugging in different implementations of the code that interacts with the file service (of which the boto-based implementation would be a concrete example).
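
As a sketch of the kind of pluggability I mean (the setting name and module paths below are invented purely for illustration), ORA2 could load the file-service client from configuration:

from importlib import import_module

from django.conf import settings


def load_file_upload_backend():
    # e.g. FILE_UPLOAD_SERVICE_BACKEND = "myapp.upload.S3Backend" (boto-based)
    #   or FILE_UPLOAD_SERVICE_BACKEND = "myapp.upload.FilesystemBackend"
    module_path, class_name = settings.FILE_UPLOAD_SERVICE_BACKEND.rsplit(".", 1)
    backend_class = getattr(import_module(module_path), class_name)
    return backend_class()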

I hope this gives you some sense of the design constraints around this project.  Moving away from the dependency on Amazon S3 would be a huge step in the right direction, and I'm really excited to see where this discussion leads!

-- Will

Piotr Mitros

Dec 2, 2014, 10:24:20 AM
to edx-...@googlegroups.com
For this problem, I've found pyfilesystem to make a nice back-end. We've added pyfs support to the platform for use in XBlocks, and it'd be easy for ORA2 to use the same as a back-end (a rough sketch follows the list below). I've retrofitted a couple of additional features into pyfilesystem: download URLs and temporary files. pyfilesystem does not support direct uploads from the client in the way Will mentions, but this would be a nice feature to retrofit. The nice things about pyfilesystem are: 
  1. Existing support for a large number of back-ends
  2. Easy to add more back-ends. I've written a couple myself, and it's pretty straightforward
  3. Well-understood API (standard Python file API)
  4. Actively developed
  5. Already in the platform. Configuration, etc. are already in place. One thing to maintain
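As a rough illustration (the path and the choice of OSFS here are just for demonstration), the calling code is the same whatever the back-end:

from fs.osfs import OSFS

store = OSFS("/edx/var/ora2-uploads", create=True)  # could equally be S3FS, MemoryFS, ...
with store.open("key-123.pdf", "wb") as f:          # standard Python file API
    f.write(b"%PDF-1.4 ...")                        # whatever the student uploaded
with store.open("key-123.pdf", "rb") as f:
    data = f.read()
store.close()
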
The rest of this e-mail is a complete sidetrack from the issue at hand:

Will: Regarding scalability, I agree about the long-term importance of this, especially if this is a service not just for ORA2. However, for ORA2 specifically, did you run the numbers? When I did the math, it seemed like we'd need an awful lot of ORA2 problems on an awful lot of very slow connections before this became an issue. My ballpark was:
  • 300 courses per year * 12 ORA uploads each * 10,000 active students per course * 1 minute per upload = 70 years of thread time. That's mostly IO bound. A single server with perhaps two dozen threads should be able to handle this. If we were conservative, we could have 2-3 servers. 
  • 300 courses per year * 12 ORA uploads each * 10,000 active students per course * 1 megabyte per upload = 36TB. That averages out to 1MBps. We'd, of course, need a bit of overhead for spikes, etc.
In other words, the total cost would be in the hundreds of dollars of server time+bandwidth+etc. per year, possibly reaching single-digit thousands, for a site at the scale of the major MOOC providers. That's assuming a fairly large number of ORA problems which expect file uploads. It is assuming reasonably small files (e.g. PNGs, JPEGs, PDFs, code snippets, and not videos). 
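
For reference, the same back-of-the-envelope arithmetic in Python (same assumptions as above):

uploads_per_year = 300 * 12 * 10000                    # courses * uploads * students
print(uploads_per_year * 1.0 / (60 * 24 * 365))        # ~68 years of thread time at 1 min/upload
print(uploads_per_year * 1e6 / 1e12)                   # 36 TB/year at 1 MB/upload
print(uploads_per_year * 1e6 / (365 * 24 * 3600))      # ~1.1 MB/s average bandwidth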

In the context of the ProfileXBlock and the RecommenderXBlock, I explicitly wanted the uploads to pass through the server. I would verify: 
  1. Maximum file size.
  2. Check magic numbers to confirm that the file type, file extension, and MIME type match.
  3. For ProfileXBlock, rescale the photos to a sensible size.
  4. Rename the files.
We considered this kind of validation pretty important in order to prevent things like malware, spam, etc., as well as to maintain the student experience (if a student on Google Fiber uploads a 2GB file, and a student on a satellite uplink needs to view it...). 
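
As a rough sketch of that kind of validation (using python-magic for the magic-number check and PIL for rescaling; the limits, paths, and names below are made up):

import os
import uuid

import magic                 # the python-magic package (libmagic bindings)
from PIL import Image

MAX_SIZE = 4 * 1024 * 1024   # 1. cap the file size (4 MB here, arbitrarily)
ALLOWED = {"image/png": ".png", "image/jpeg": ".jpg", "application/pdf": ".pdf"}


def validate_and_store(uploaded_file, dest_dir):
    data = uploaded_file.read()
    if len(data) > MAX_SIZE:                       # 1. maximum file size
        raise ValueError("file too large")

    mime = magic.from_buffer(data, mime=True)      # 2. magic number, not the
    if mime not in ALLOWED:                        #    client-supplied extension
        raise ValueError("unsupported file type: %s" % mime)

    path = os.path.join(dest_dir, uuid.uuid4().hex + ALLOWED[mime])  # 4. rename
    with open(path, "wb") as f:
        f.write(data)

    if mime.startswith("image/"):                  # 3. rescale photos (ProfileXBlock case)
        img = Image.open(path)
        img.thumbnail((1024, 1024))
        img.save(path)
    return path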

Piotr

Régis Behmo

Dec 3, 2014, 3:40:55 AM
to edx-...@googlegroups.com
William, Piotr, thanks for your feedback: this is an important issue for us, so we are glad to see that you care about it, too.

@William: our goal is to provide a drop-in replacement for ORA2's existing implementation, so we intend to keep the get_download_url and get_upload_url functions in fileupload.api. The change will have to be backward-compatible, too, so the URLs will keep the same format for the S3 backend.

@Piotr: I wasn't aware that pyfilesystem was a dependency of edx-platform. I agree with you that pyfs could be just the solution we need. However, correct me if I'm wrong, but it seems that django-pyfs does not allow configuring different filesystems for different uses? For instance, we might want to store openassessment files locally, but store university files (from the LMS) on S3. I see what you did on django-pyfs, though, and we will probably draw from it to glue the pieces together.
I'm not sure what the point of your scalability estimates is, but I agree with them :)

I will now work on a pull request to plug the file upload into pyfs.


Régis

John Eskew

Dec 3, 2014, 9:44:35 AM
to edx-...@googlegroups.com
Régis,

As a part of the work to allow course assets to be stored outside of edx-platform, we plan to implement a storage system (currently called the BlobStore) that will allow pluggable storage implementations to be written. The details are evolving but current thinking is documented here:


The main difference between what you're proposing and the BlobStore is that initially the BlobStore will be focused on content uploaded by course authors and downloaded by students. So direct student upload to an external service such as S3 (which is needed by ORA2) is not planned for the first iteration - but it could be added in a subsequent iteration. 

There are a couple of reasons why we're not directly using pyfs for the storage:
- We plan to support the legacy contentstore seamlessly, serving the files stored there without requiring an explicit course asset migration step. This requirement is more easily met by a storage implementation that has direct access to GridFS, where those files are currently stored.
- The proposed API is purposely simpler than the one offered by pyfs. The API will have top-level CRUD operations without exposing a file-like object to the API consumer. This simplification will leave the storage implementation details completely up to the pluggable implementation. De-duplication, directory structure, bucket configuration/sharding, filesystem snapshotting - all these details will be handled by each implementation.

Note that this design does not prevent a pluggable implementation behind the API from using pyfs if so desired.
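
To give a rough feel for the shape of such an API, here is an illustrative sketch (not the actual spec):

class BlobStorage(object):
    """Sketch of a pluggable storage implementation: whole-blob CRUD only."""

    def store_blob(self, locator, data, metadata=None):
        """Save the complete blob `data` (bytes) under `locator`."""
        raise NotImplementedError

    def get_blob(self, locator):
        """Return the complete blob as bytes."""
        raise NotImplementedError

    def delete_blob(self, locator):
        """Remove the blob identified by `locator`."""
        raise NotImplementedError

# How and where blobs actually land (directories, buckets, de-duplication,
# GridFS, pyfs, ...) is entirely the implementation's business.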

Although this work doesn't provide you with an immediate solution to your storage issue, I hope you find this information useful!

Best,
John

Piotr Mitros

Dec 3, 2014, 10:31:16 AM
to edx-...@googlegroups.com
A bit off-topic, but: 

On Wednesday, December 3, 2014 9:44:35 AM UTC-5, John Eskew wrote:
There are a couple of reasons why we're not directly using pyfs for the storage:
- We plan to support the legacy contentstore seamlessly, serving the files stored there without requiring an explicit course asset migration step. This requirement is more easily met by a storage implementation that has direct access to GridFS, where those files are currently stored.

FYI: Writing a gridfs backend (or any other) for pyfilesystem is minimal work. I think this is for an older pyfilesystem, but this is basically all that there is to it: 
 
- The proposed API is purposely simpler than the one offered by pyfs. The API will have top-level CRUD operations without exposing a file-like object to the API consumer. This simplification will leave the storage implementation details completely up to the pluggable implementation.

Can you explain this in more detail? Looking at the API: 
  1. store_blob means the whole blob has to be buffered in memory. You could use multiple calls to append_to_blob, but without explicit streaming/open/close, this would be very inefficient. 
  2. get_blob is similar. You either pass the whole thing as a large in-memory binary object, or you make multiple calls to get ranges of bytes. Since there is no explicit open/close, each of those is a new open/seek/close on the file system, or a new service request.  
It seems like streaming file-like objects, with open/read/close or open/write/close would have dramatically better performance. The back-end can open, read in block sizes optimized for the storage format, and then close, keeping any TCP connection open for exactly as long as necessary. Using, specifically, a Python file-like object has the advantage that it can be passed to other places. If you look at some of the XBlocks I've written, I can pass those objects to PIL or numpy.save() or similar. This is substantially more efficient than writing out to a temporary file (whether in-memory or on-file-system), and then restreaming, as would be needed with this kind of API. 

Of course, pyfilesystem has several backends where this type of streaming operation is impossible. In that case, it just makes an intermediate temporary file (which is what one would need to do with store_blob and get_blob in most cases, regardless). 
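
To illustrate the streaming pattern I mean (a toy example; the in-memory "upload" below stands in for a real request stream, and the paths are arbitrary):

import io
import shutil

from fs.osfs import OSFS

store = OSFS("/tmp/assets", create=True)        # any pyfs back-end would do

# Write: copy the incoming stream in fixed-size blocks, never holding the
# whole blob in memory.
upload = io.BytesIO(b"x" * (1024 * 1024))
with store.open("asset.bin", "wb") as dst:
    shutil.copyfileobj(upload, dst, 64 * 1024)

# Read: the open file-like object can be handed straight to PIL, numpy.save(),
# shutil, etc., with no intermediate temporary file.
with store.open("asset.bin", "rb") as src, open("/tmp/asset-copy.bin", "wb") as dst:
    shutil.copyfileobj(src, dst, 64 * 1024)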

De-duplication, directory structure, bucket configuration/sharding, filesystem snapshotting - all these details will be handled by each implementation.

With the exception of directory structure, how would any of these be difficult with pyfilesystem? 
 
With regards to directory structure: naming, name-spacing, and listing is actually somewhat important. If we just have a large set of assets stored as blobs with no metainformation, or metainformation out-of-line (e.g. in Mongo), it is difficult to do any kind of introspection or debugging. To give an example, if the datastore is shared by many systems (which appears to be the intention) and one of them fails to properly garbage-collect objects, or accidentally stores a few terabytes of stuff, the only way to find out which data is good and which is bad would be to go through every subsystem which might have BlobLocators, pull out every BlobLocator from all of those places, and then erase the ones not in the list. This is both brittle (if you miss that some back-end happens to put things in Amazon Glacier or export to XML files), and hard to do in a way which is robust without taking down the system (the set of blobs is constantly changing).

Piotr

Omar Al-Ithawi

Dec 4, 2014, 2:18:19 AM
to edx-...@googlegroups.com
I would suggest using an open-source S3-like service such as https://nimbus.io/. This would require far fewer code modifications and would be easier to adopt, especially since it provides reliability features comparable to S3. On the other hand, local file storage is a bit risky in terms of reliability and backup routines.  

Omar Al-Ithawi

Dec 4, 2014, 2:21:36 AM
to edx-...@googlegroups.com
This model is very popular, and it's not just nimbus.io: the openstack-storage and Eucalyptus projects share the same goal, i.e. providing a drop-in replacement for AWS S3 and other services.

Régis Behmo

Dec 4, 2014, 12:19:21 PM
to edx-...@googlegroups.com
John, I didn't know about the ongoing work on AssetMgr. For reference, here is a link to the current proposal:
which evolved from this initial spec to replace GridFS: https://openedx.atlassian.net/wiki/display/PLAT/GridFS+Replacement

It seems to me that this proposal for blob/metadata storage could definitely serve as a backend for file storage in edx-ora2. Open assessment files are associated with a course in edx-ora2 (right?), so it looks like the right fit. John, do you have a release date for this?

In any case, the AssetMgr would not handle upload URL expiry, right? So I propose to develop the URL generation mechanism for local file storage; once the AssetMgr is released, it shouldn't be too difficult to plug it into the same upload URL generator.

For that I will need some cache system with an expiry mechanism (expiry of upload URLs, just like S3). Should we store cached data in memcache? In the Django cache (via django-keyedcache)? To the best of my knowledge, neither of these is currently in use in edx-ora2 or edx-platform.
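To give an idea of what I have in mind, here is a rough sketch (the view name, key prefix and TTL are placeholders):

import uuid

from django.core.cache import cache
from django.core.urlresolvers import reverse

UPLOAD_URL_TTL = 3600  # seconds, mirroring the expiry S3 applies to signed URLs


def get_upload_url(key, content_type):
    token = uuid.uuid4().hex
    cache.set("ora2-upload/" + token, key, UPLOAD_URL_TTL)   # TTL-based expiry
    return reverse("ora2-filesystem-upload", kwargs={"token": token})


def resolve_upload_token(token):
    key = cache.get("ora2-upload/" + token)
    if key is None:
        raise ValueError("upload URL has expired")
    return key
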
What do you think?


Régis Behmo
Developer @ France Université Numérique
+33(0)1.55.55.83.42


De: "Omar Al-Ithawi" <oit...@qrf.org>
À: edx-...@googlegroups.com
Envoyé: Jeudi 4 Décembre 2014 08:21:36
Objet: [edx-code] Re: [edx-ora2] Upload open assessment files to backends other than S3

John Eskew

Dec 5, 2014, 9:07:42 AM
to edx-...@googlegroups.com
Piotr,

Thanks for your comments! You make good points about this design relative to large file sizes. Streaming bytes to file-like objects would more easily allow optimizations for large files or heavily-appended files for the reasons you state.

This system was originally designed to hold course assets, which are currently always saved in a single transaction and are on the order of 10s of MB. Also, I'd hoped to evolve the BlobStore into a web service instead of a Python module (if possible) to make it more useful across the edX platform and its various non-platform components - so I'd been avoiding returning Python constructs in the API. The API above was shaped by this plan - indeed, all these reasons have shaped the current design.

But, first, we'll implement a Python API. And I see the value in passing back a file-like Python object to enable streaming writes/reads, which gives the design flexibility in dealing with uploads/downloads of large files.

I'd still rather not use pyfilesystem directly - here's why: I'd prefer to encapsulate the usage of the file's metadata (content-type, original directory, hash value, etc.) into each storage implementation. Some storage impls will use the metadata fully, by storing the file under a directory structure based on its metadata - and possibly even save the metadata in a DB or KVS for querying, easy retrieval, and maximum introspection. Some storage impls won't use the metadata at all - or will only use one piece of metadata. The logic which decides how blobs are stored seems to belong down at a lower level, close to the actual storage. There's nothing that would prevent pyfs from being used down at the storage impl level - or boto or any other existing Python-based storage interface.

I'm definitely not proposing that directory structures be removed from all asset storage - they are important to some storage impls and I expect that most impls will use them. I'm just proposing that directory-creating/choosing and blob-location logic be pushed down out of platform and into the storage impls.

For asset cleanup, any asset deletion originating in the store itself will also need to be mirrored in the platform, since we'll be storing the course asset BlobLocators in the course's modulestore. I've always viewed those types of cleanups as originating from the platform itself - though I can imagine situations where storage-originated removals will happen (mistakes, storage failures).

John Eskew

Dec 5, 2014, 9:49:43 AM
to edx-...@googlegroups.com
We've not yet begun implementing the storage part of the AssetMgr. According to the current plan, the feature won't be available for at least four months. There's no release date currently set.

The first AssetMgr release won't support direct upload of files to external storage, so it won't hand out upload URLs and therefore won't manage their expiry. But do you need to handle their expiration within the platform at all? The URL generators I've seen for external storage services allow you to specify a time-to-live for the URL, after which it won't resolve. If you determine that you do need this management, the platform currently uses Django caching via django-cache-toolbox - you could see if that covers your functional needs.
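
For instance, with boto (which the S3-based implementation uses), the expiry is just a parameter of the signed URL, so nothing needs to be tracked in the platform. Roughly (the bucket, key and credentials below are placeholders):

from boto.s3.connection import S3Connection

conn = S3Connection("AKIA...", "secret-key")      # placeholder credentials
upload_url = conn.generate_url(
    expires_in=3600,                              # the URL stops resolving after an hour
    method="PUT",
    bucket="my-ora2-uploads",
    key="submissions/key-123",
    headers={"Content-Type": "application/pdf"},
)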

John

Régis Behmo

Jan 5, 2015, 4:00:56 AM
to edx-...@googlegroups.com
We have made some progress on this feature; the (open) pull request is here: https://github.com/edx/edx-ora2/pull/640
Please feel free to pitch in!

Christos Bellos

Oct 3, 2016, 7:08:01 AM
to General Open edX discussion
I am trying to use ORA2 and upload files to our own server. I think that we have performed all the recommended steps and used the updated code from #640, but with no success.

Is it possible to post the steps, something like a tutorial for dummies? That would be really appreciated, because I have followed several threads and tried several recommendations from users, with no success.
For example, I have followed these steps: https://github.com/edx/edx-documentation/blob/master/en_us/install_operations/source/configuration/ora2/ora2_uploads.rst#id6
and these: https://groups.google.com/forum/#!topic/edx-code/cji-sZq7Mgc

Régis Behmo

Oct 4, 2016, 4:50:21 AM
to General Open edX discussion
Christos,

Are you observing an error, on either the backend or the frontend? Please provide us with some more details.

Régis

Christos Bellos

Oct 4, 2016, 12:12:58 PM
to General Open edX discussion
I am observing the same error ("error on file upload") both in the LMS and in Studio, when trying to use the ORA2 component to upload either a PDF or an image file.

My problem is that I cannot change the DEFAULT_FILE_STORAGE to the filesystem. It seems that it keeps using Amazon S3, no matter what I have tried based on forum threads and recommendations.
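
For reference, this is roughly what I am trying to set, following the ora2_uploads documentation linked above (setting names copied from that doc as best I understand them, so please correct me if they are wrong):

# in the LMS/Studio settings (lms.env.json / cms.env.json equivalents)
ORA2_FILEUPLOAD_BACKEND = "filesystem"
ORA2_FILEUPLOAD_ROOT = "/edx/var/edxapp/ora2-uploads"   # where files should land
ORA2_FILEUPLOAD_CACHE_NAME = "default"                  # must match an entry in CACHES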

Christos Bellos

Oct 4, 2016, 12:25:04 PM
to General Open edX discussion
Also, I just used the 1-hour demo of Bitnami OpenEdX for AWS, and it has the same problem...

It says:
"Unable to upload file"
"Error on retrieving upload URL"

Don't you have the same error message?

Régis Behmo

Oct 6, 2016, 9:05:54 AM
to General Open edX discussion
Do you observe errors in the application logs?