Uploaded versions separation

380 views
Skip to first unread message

radovan...@gmail.com

unread,
Nov 2, 2017, 1:59:45 PM11/2/17
to Shrine
Hello, 

We have migrating from Carrierwave to Shrine, because we are trying to decouple from Rails. At migration process we have realized that one thing have Carrierwave better designed and it is issue for us to stop migration yet.

And reason why we stop it is: Versions.

For example:

Carrierwave versions are non-destructive. I can call original file object or call any version. Both are separated by DSL and data-structure. We can change any version or completely remove them and everything will work.

On the other hand Shrine versions are destructive.
Version plugin change data structure instead append/extend them. If someone adds versions plugin and start using version it completely change data structure (*_data).
Without Version plugin it creates clean and easy to maintain data. Attributes "id", "storage", "metadata" etc. for example
{"id"=>"asset/766a9cbbec3f2607cc447165850ccded.jpg", "storage"=>"cache", "metadata"=>{"filename"=>"f76d3177-da13-4c08-9def-f1e9a76b7211.jpg", "size"=>126528, "mime_type"=>"image/jpeg"}}

But after enable Version plugin, it is needed to update data again. Better solution should be create version on demand. So when data are not migrated response is:
NoMethodError: undefined method `[]' for #<ImageUploader::UploadedFile:0x00007fc281662718>
As you can see, it is ImageUploader class

After recreating Version data it changes original data structure to something like this:
{:original=>
  #<ImageUploader::UploadedFile:0x00007fc2831766f8
   @data={"id"=>"asset/16511/nature-forest-industry-rails.jpg", "storage"=>"store", "metadata"=>{"filename"=>"nature-forest-industry-rails.jpg", "size"=>5081543, "mime_type"=>"image/jpeg"}}>,
 :huge=>
  #<ImageUploader::UploadedFile:0x00007fc283176590
   @data={"id"=>"asset/16511/huge_nature-forest-industry-rails.jpg", "storage"=>"store", "metadata"=>{"filename"=>"huge_nature-forest-industry-rails.jpg", "size"=>983053, "mime_type"=>"image/jpeg"}}>,
 :large=>
  #<ImageUploader::UploadedFile:0x00007fc283176428
   @data={"id"=>"asset/16511/large_nature-forest-industry-rails.jpg", "storage"=>"store", "metadata"=>{"filename"=>"large_nature-forest-industry-rails.jpg", "size"=>458088, "mime_type"=>"image/jpeg"}}>,
 :medium=>
  #<ImageUploader::UploadedFile:0x00007fc2831762c0
   @data={"id"=>"asset/16511/medium_nature-forest-industry-rails.jpg", "storage"=>"store", "metadata"=>{"filename"=>"medium_nature-forest-industry-rails.jpg", "size"=>98342, "mime_type"=>"image/jpeg"}}>,
 :small=>
  #<ImageUploader::UploadedFile:0x00007fc283176158
   @data={"id"=>"asset/16511/small_nature-forest-industry-rails.jpg", "storage"=>"store", "metadata"=>{"filename"=>"small_nature-forest-industry-rails.jpg", "size"=>32982, "mime_type"=>"image/jpeg"}}>,
 :tiny=>
  #<ImageUploader::UploadedFile:0x00007fc283175ff0
   @data={"id"=>"asset/16511/tiny_nature-forest-industry-rails.jpg", "storage"=>"store", "metadata"=>{"filename"=>"tiny_nature-forest-industry-rails.jpg", "size"=>14988, "mime_type"=>"image/jpeg"}}>}
And this is Hash, it is different than original Uploader class.

After disabling Version plugin it doesn't work anymore:
Shrine::Error: {"original"=>{"id"=>"asset/16511/nature-forest-industry-rails.jpg", "storage"=>"store", "metadata"=>{"filename"=>"nature-forest-industry-rails.jpg", "size"=>5081543, "mime_type"=>"image/jpeg"}}, "huge"=>{"id"=>"asset/16511/huge_nature-forest-industry-rails.jpg", "storage"=>"store", "metadata"=>{"filename"=>"huge_nature-forest-industry-rails.jpg", "size"=>983053, "mime_type"=>"image/jpeg"}}, "large"=>{"id"=>"asset/16511/large_nature-forest-industry-rails.jpg", "storage"=>"store", "metadata"=>{"filename"=>"large_nature-forest-industry-rails.jpg", "size"=>458088, "mime_type"=>"image/jpeg"}}, "medium"=>{"id"=>"asset/16511/medium_nature-forest-industry-rails.jpg", "storage"=>"store", "metadata"=>{"filename"=>"medium_nature-forest-industry-rails.jpg", "size"=>98342, "mime_type"=>"image/jpeg"}}, "small"=>{"id"=>"asset/16511/small_nature-forest-industry-rails.jpg", "storage"=>"store", "metadata"=>{"filename"=>"small_nature-forest-industry-rails.jpg", "size"=>32982, "mime_type"=>"image/jpeg"}}, "tiny"=>{"id"=>"asset/16511/tiny_nature-forest-industry-rails.jpg", "storage"=>"store", "metadata"=>{"filename"=>"tiny_nature-forest-industry-rails.jpg", "size"=>14988, "mime_type"=>"image/jpeg"}}} isn't valid uploaded file data
from /Users/Rado/.gem/ruby/2.4.2/gems/shrine-2.8.0/lib/shrine.rb:733:in `initialize'
So again, next data updating.

I think versions should extends original data format, not change it.

For us is most convenient way to store original file data (*_data) and version data (*_versions_data) as two separate data structures. Versions for us are not critical, they can be generated anytime if they don't exists (on URL request), but original file is just original :)
We have media/document server for all document types (images, audios, videos, other documents) and every document is associated. Currently we have 16 000 original documents and 80 000 derivates from original document (different size, quality, format).

What i can do is write own version plugin. I want to open discussion how to provide it for our needs and be compatible in the future.

Janko Marohnić

unread,
Nov 6, 2017, 4:57:15 PM11/6/17
to Radovan Šmitala, Shrine
First of all, thank you for explaining and visualizing your concerns so well.

I think versions should extends original data format, not change it.

That's something I had in mind when I was designing the versions plugin, but I ended up going with the route of having a different format. I'm definitely open to discussing this, and I would like to know more about how would you propose to make versions more of an extension. I will also try to explain my reasons.

Firstly, note that unlike CarrierWave, Shrine doesn't return an uploader object, it returns a Shrine::UploadedFile object (for the `ImageUploader < Shrine` subclass it would be ImageUploader::UploadedFile). A Shrine::UploadedFile object represents a single uploaded file, and is defined by the data hash that's written in the *_data column. In case of multiple versions, the JSON value in the *_data column contains a data hash for each version file, and in that case Shrine attachment attribute returns a hash of Shrine::UploadedFile objects.

CarrierWave, on the other hand, uses the uploader class both for uploading and representing uploaded files. Unlike in Shrine, CarrierWave doesn't store locations of the uploaded versions, instead they are defined dynamically be the user in #store_dir. That allows CarrierWave to generate URLs for versions that don't even exist, because it has no knowledge whether versions are actually uploaded. So, if you were to call

  record.attachment.url(:thumb)

CarrierWave would return the generated URL, even if the thumbnail doesn't actually exist. In my opinion you would never want to do that, so I consider the fact the Shrine loudly fails here a good thing. Note that with Shrine you can use  #<attachment>_url and pass it a version name:

  record.attachment_url(:thumb)

It will have the following behaviour:
  • if the attachment consists of versions and the version :thumb exists, generate URL for that version
  • if the attachment consists of versions and the version :thumb doesn't exist, fail with an ArgumentError
  • if the attachment consists of a single file, generate URL for that file
  • if the attachment is blank, return nil
For me this is the method that's fairly agnostic to whether *_data column contains single files or versions. By default it only accept options that are forwarded to the underlying storage as URL options, but versions plugin extends it to accept an additional version name argument.

To provide rest of the comments inline:

On the other hand Shrine versions are destructive.
Version plugin change data structure instead append/extend them. If someone adds versions plugin and start using version it completely change data structure (*_data). 

Yes, in case you're not using versions, the *_data column will be a data hash of a single uploaded file, and a single Shrine::UploadedFile is returned. In case of versions the *_data column contains a hash of data hashes (because each version is an individual file with its own location and metadata), and from the Shrine instantiates a hash of Shrine::UploadedFile objects.

So, the structure needs to be different, because without versions the *_data column holds a single unnamed file, and with versions it holds multiple named files.

Without Version plugin it creates clean and easy to maintain data. Attributes "id", "storage", "metadata" etc. for example
 
{"id"=>"asset/766a9cbbec3f2607cc447165850ccded.jpg", "storage"=>"cache", "metadata"=>{"filename"=>"f76d3177-da13-4c08-9def-f1e9a76b7211.jpg", "size"=>126528,"mime_type"=>"image/jpeg"}} 

But after enable Version plugin, it is needed to update data again. Better solution should be create version on demand. So when data are not migrated response is:
 
NoMethodError: undefined method `[]' for #<ImageUploader::UploadedFile:0x00007fc281662718>
 
As you can see, it is ImageUploader class

Since we already have the #<attachment>_url method which has fallbacks, what would be the value of an object that would be returned both when there is single file and when there are versions?

After disabling Version plugin it doesn't work anymore:
...

So again, next data updating.

Yes, the base functionality doesn't know about the versions, so if you remove versions plugin it assumes again that the *_data column will contain an uploaded file data hash. Note that if you would like to stop using versions without changing the *_data column, you can remove processing and still keep versions plugin loaded.

What i can do is write own version plugin.

Yes, that's definitely an option. Though I would like to find out more about what advantages do you see in CarrierWave's API for handling versions. I get that it might look convenient at first sight, but I don't see what does it actually make easier. For me it's a needlessly complex layer over retrieving uploaded file objects directly, with much ambiguity in the behaviour. But, as I said, I'm open to be convinced otherwise.

Kind regards,
Janko


--
You received this message because you are subscribed to the Google Groups "Shrine" group.
To unsubscribe from this group and stop receiving emails from it, send an email to ruby-shrine+unsubscribe@googlegroups.com.
To post to this group, send email to ruby-...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/ruby-shrine/305e1498-32d0-441a-b6fc-b429f8913bd6%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

radovan...@gmail.com

unread,
Dec 10, 2017, 5:30:38 PM12/10/17
to Shrine
Hello,

sorry for my late reply.

I understand and agree your decisions why did you design it as it is now. It is good enough, it is simple and with similar pattern as simple upload have.

I think versions are derivates of original content. For example different size of image, different format and quality for audio, first 3 lines as introduction from long text, simpler SVG icon for lower resolution etc.

And there are two ways how we can make versions:

1. Manually where designer make multiple image sizes and uploaded it, where audio composer makes MP3 derivate for older audio players, where copywriter takes 3 lines and put them into introduction, video files needs subtitles embedded into video. But they need to keep original file somewhere because one day screen resolutions are increasing so we need bigger images of same original image, new audio devices needs higher bitrate, intro text get bored and need to be fancier. And everything goes over and over.

2. Or we can do it automatically by change an algorithms and just "re-save" our original files. Then after resaving we have new images with better resolution, or maybe we just need them compress with better technology to lowering their size.

And that is issue. Original file is designed as just another version where user don't apply any filter or destructing algorithm to lowering size.
Original file don't have any weight in currently designed Versions plugin. It is just another version.

I think most scenarios are:
A) People doesn't care and they don't need versions. So they are able to modify original file if they want to.
B) People cares and they want to keep original file and just deliver derivates optimised to specific devices. And if needs are changing, they change versions again
C) People cares and uses some cloud services like Cloudinary, Imgix or build owns

We have currently about 16 000 images in our app, they are mostly unique becase we are comparing them with hash. So we are in situation B. We likes Shrine simplicity and of course multiple stores :)

I know Carrierwave has some parts what are not so good. Like creating URL on demand, Carrierwave can't tell that specific version doesn't exists, because it doesn't know. That is true. But this could be "fixed" by simple proxy routing. If for that URL doesn't exist any file, fallbacking it to certain controller action and will be created.(Only if file names are generated with some pattern. Also not applicable for every edge case).
On the other hand cause Carrierwave separating original file and versions it allows more "magic" for every specific version independent.

I think Versions plugin should make it same or very similar. Make a new space where versions will live and not mixing original with "copies".

Data schema for versions should be extensible. Maybe like this:
{
  "id": "asset/766a9cbbec3f2607cc447165850ccded.jpg",
  "storage": "file",
  "metadata": {
    "filename": "f76d3177-da13-4c08-9def-f1e9a76b7211.jpg",
    "size": 126528,
    "mime_type": "image/jpeg"
  },
  "versions": {
    "large": {
      "id": "asset/766a9cbbec3f2607cc447165850ccded_large.jpg",
      "storage": "file",
      "metadata": {
        "filename": "f76d3177-da13-4c08-9def-f1e9a76b7211_large.jpg",
        "size": 56528,
        "mime_type": "image/jpeg"
      }
    },
    "another": {}
  }
}

It is extensible cause versions data can be simple added, changed or removed without breaking any other part by changing a format.
Or they could be just extracted and saved in different column if app uses SQL. Maybe "_versions_data"?

I think It is not just about Versions plugin. This extensible format opens doors to any other Shrine plugin which want to save data.

Hope this my "2 cents" explain why we are still using Carrierwave instead Shrine, And i hope it is just about time until we will migrate :)

Kind regards,
Rado
To unsubscribe from this group and stop receiving emails from it, send an email to ruby-shrine...@googlegroups.com.

hiren....@chai-monsters.com

unread,
Dec 11, 2017, 3:24:58 PM12/11/17
to Shrine
Janko,

I'd like to add my thoughts on this topic for future consideration.

I did not like Carrierwave's handling of versions as it was autogenerated, can't add metadata to versions, and there was no correlation with actual file stored. It's what caused me to look for another library. In this case, I agree with Janko's design choice in Shrine. Rado, your suggestion of having to create a proxy routing, in my opinion, is more work which the library creates from a poor design choice and that should not happen. I believe it should only add versions and files that actually exist. This is my preference.

As far as the data structure goes, I think Rado has a point of having a unified data structure with and without versions in Shrine. It makes sense to make it easy to fallback to original if versions plugin is removed or recreated with different version names. However, I prefer having the original file data is wrapped in a key as opposed to being in root like his example (doing this has always paid off for me in future changes). So it would be:
{
 
"original": {

   
"id": "asset/766a9cbbec3f2607cc447165850ccded.jpg",
   
"storage": "file",
   
"metadata": {
     
"filename": "f76d3177-da13-4c08-9def-f1e9a76b7211.jpg",
     
"size": 126528,
     
"mime_type": "image/jpeg"
   
}
 
},
 
"versions": {
   
"large": {
     
"id": "asset/766a9cbbec3f2607cc447165850ccded_large.jpg",
     
"storage": "file",
     
"metadata": {
       
"filename": "f76d3177-da13-4c08-9def-f1e9a76b7211_large.jpg",
       
"size": 56528,
       
"mime_type": "image/jpeg"
     
}
   
},
   
"another": {}
 
}
}


or


{
 
"original": {

   
"id": "asset/766a9cbbec3f2607cc447165850ccded.jpg",
   
"storage": "file",
   
"metadata": {
     
"filename": "f76d3177-da13-4c08-9def-f1e9a76b7211.jpg",
     
"size": 126528,
     
"mime_type": "image/jpeg"
   
}
 
},

 
"large": {
   
"id": "asset/766a9cbbec3f2607cc447165850ccded_large.jpg",
   
"storage": "file",
   
"metadata": {
     
"filename": "f76d3177-da13-4c08-9def-f1e9a76b7211_large.jpg",
     
"size": 56528,
     
"mime_type": "image/jpeg"
   
}
 
},
 
"another": {}
}


That's all.

Hiren.

Janko Marohnić

unread,
Jan 7, 2018, 7:27:04 PM1/7/18
to Hiren Mistry, Shrine
I thought about this some more, and I definitely see what you're saying. The original file is something that should actually always be uploaded, and processed versions are a derivative of that original file. When I think about it more, I think the original file should always be stored, the user shouldn't have an option not to do that, because you can always uncover a bug in processing configuration later on and if you don't have the original file you painted yourself in the corner since you cannot reprocess now. This separate will also make it easier to enable upload the original file to a different storage than processed versions (this requirement popped up a few times).

I think I'm more inclined to Radovan's proposal, as in that case the addition or removal of versions doesn't affect the position of the original file data; I like that only the "versions" field is added or removed. That way you could remove the `versions` plugin if you want to stop using them, and you don't need to update the columns. With this format the base plugin will correctly retrieve the original file regardless of whether versions are used or not.

I would still like that `record.attachment` returns a Shrine::UploadedFile object that doesn't have any notion of versions. We could then have an attribute on the model that returns the versions, e.g. `record.attachment_versions` which would then return the hash of Shrine::UploadedFile objects.

Moreover, even if people are generating only a single processed file (e.g. a transcoded video), I would like that it's not possible to save it on top-level. In other words, I would like that the user always needs to come up with a name for it, e.g. "transcoded", and then fetch it with `movie.video_versions[:transcoded]`. This way they can always choose to add more processed files later on should they need it, e.g. they might realize they also want to add a video sceenshot now.

This new style would probably require a new API, as here we won't be transforming the original file, we will be generating new versions. That's good, because I still want users to be able to use the current style using the `process` block. Btw, this would probably also make the `processing` plugin obsolete, which would be good because it really doesn't do anything when you think about it :P.

Do you think either of you could make a PR for this? If not, I could probably find time eventually to make this change (lately I haven't been in the mood for any larger changes, but I expect that'll change back at some point).

To unsubscribe from this group and stop receiving emails from it, send an email to ruby-shrine+unsubscribe@googlegroups.com.

To post to this group, send email to ruby-...@googlegroups.com.

Hiren Mistry

unread,
Jan 8, 2018, 1:20:32 PM1/8/18
to Shrine
Janko,

I think we're all on the same page with slight differences in the hash data structure implementation... we can work out the details in the PR.

Yes I'd love to work on this PR... however I won't be able to work on it till February. I'm quite backed up with stuff and need to pace things. With that said, if some one wishes to work on it, feel free. Before I start, I'll start a new thread discussing the details of the new API and changes so that we can agree on the implementation.

In the mean time, will there be accommodation for method #1 in Radovan's email where the versions of an original file are uploaded to storage by a user (because it was created by designer) instead of being created in some background job?

I'll keep you posted.

Regards,
Hiren.


Janko Marohnić

unread,
Jan 9, 2018, 6:08:35 AM1/9/18
to Hiren Mistry, Shrine
I would like the brainstorm here what the new versions API could look like. I think that Shrine can be often a bit too implicit (some other users thought that too), so I would like the new API for generating versions to have to be called manually. So, instead of defining processing in the uploader, you would be able to call it like this:

  photo.image_attacher.versions = {

    # ...

  }


I would also like users to have the option of being more granular, that we also have Attacher#add_version and Attacher#remove_version:

  photo.image_attacher.add_version(:small, small_version)

  photo.image_attacher.remove_version(:small)


The Attacher#versions= and Attacher#add_version methods would both accept Shrine::UploadedFile objects, which would need to be created beforehand. I think It's ok that the user has to explicitly upload them:

  processed_versions = {

    small: resize_to_limit(io.download),

    # ...

  }


  versions_for_assignment = ImageUploader.new(:store).upload(processed_versions)


Thinking about it more, it would probably be better to have a Versions class and define all methods on it:

  photo.image_attacher.versions # returns a Shrine::Versions object

  photo.image_attacher.versions = { ... }

  photo.image_attacher.versions.add(:small, small_version)

  photo.image_attacher.versions.remove(:small)


We would probably have an #<attachment>_versions method which makes it easier to access:

  photo.image_versions

  photo.image_versions = { ... }

  photo.image_versions.add(:small, small_version)

  photo.image_versions.remove(:small)


We probably wouldn't need #<attachment>_versions= actually, we could probably name it Versions#replace instead:

  photo.image_versions.replace({...})


The Versions class would have the #[] operator just like a Hash:

  photo.image_versions[:small] #=> Shrine::UploadedFile


But it wouldn't subclass Hash, because there are some downsides of that, and because I would also like it to be able to support Arrays. The ability to support an Array of versions came up a few times, e.g. when wanting to save encoded video or split a PDF into pages.

  photo.image_versions.update([...])

  photo.image_versions[0] #=> Shrine::UploadedFile


The versions should be infinitely nestable, because the video encoding example needs this feature. For traversing through nesting we use subscript operator with multiple arguments:

  movie.video_versions[:hls, :640_360, :segments, 0]


or maybe it's enough to just have subscript operator be able to return hashes and arrays as well, and then we can traverse normally:

  movie.video_versions[:hls][:640_360][:segments][0]


The Versions class should have #to_h and #to_a conversion methods.

These are some of the ideas that came to mind. What I would like is that users have complete control of managing versions. E.g. they might want to generate some versions in the foreground, and others in the background, and maybe others later in another background job. They might also want to store the original file in a different location than versions, that's why users should have the ability to choose to which storage they want to upload versions. I might be missing some more use cases that came up in the past.

Kind regards,
Janko

To unsubscribe from this group and stop receiving emails from it, send an email to ruby-shrine+unsubscribe@googlegroups.com.

To post to this group, send email to ruby-...@googlegroups.com.

Janko Marohnić

unread,
Jan 9, 2018, 6:17:38 AM1/9/18
to Hiren Mistry, Shrine
In the mean time, will there be accommodation for method #1 in Radovan's email where the versions of an original file are uploaded to storage by a user (because it was created by designer) instead of being created in some background job?

 The API discussed above would trivially support this use case, but with the current API it is still possible, see this discussion for more details on a similar use case.

Kind regards,
Janko

radovan...@gmail.com

unread,
Feb 1, 2018, 9:23:54 AM2/1/18
to Shrine
Hello, sorry for my late reply.

On Tuesday, January 9, 2018 at 12:08:35 PM UTC+1, Janko Marohnić wrote:
I would like the brainstorm here what the new versions API could look like. I think that Shrine can be often a bit too implicit (some other users thought that too), so I would like the new API for generating versions to have to be called manually. So, instead of defining processing in the uploader, you would be able to call it like this:

  photo.image_attacher.versions = {

    # ...

  }


I would also like users to have the option of being more granular, that we also have Attacher#add_version and Attacher#remove_version:

  photo.image_attacher.add_version(:small, small_version)

  photo.image_attacher.remove_version(:small)


The Attacher#versions= and Attacher#add_version methods would both accept Shrine::UploadedFile objects, which would need to be created beforehand. I think It's ok that the user has to explicitly upload them:

  processed_versions = {

    small: resize_to_limit(io.download),

    # ...

  }


  versions_for_assignment = ImageUploader.new(:store).upload(processed_versions)


 
I like the idea about separate processing every version. Because there are use cases when small thumbnail should be returned to user synchronously because user need to see reposnse, but other larger versions should be created in background job.
 
Thinking about it more, it would probably be better to have a Versions class and define all methods on it:

  photo.image_attacher.versions # returns a Shrine::Versions object

  photo.image_attacher.versions = { ... }

  photo.image_attacher.versions.add(:small, small_version)

  photo.image_attacher.versions.remove(:small)


"Shrine::Versions" class is great idea. Easy extendable.
 

We would probably have an #<attachment>_versions method which makes it easier to access:

  photo.image_versions

  photo.image_versions = { ... }

  photo.image_versions.add(:small, small_version)

  photo.image_versions.remove(:small)


Idea about suffix ```_versions``` or calling method ```versions``` from _attacher has same value for me. It is hard predict which way is better or will be more used by developers. For me calling ```.versions``` is less magic than create new base model/object method "_versions". But you are right that "_versions" makes code simple to read.

replace/add/remove is consistent.
 

We probably wouldn't need #<attachment>_versions= actually, we could probably name it Versions#replace instead:

  photo.image_versions.replace({...})


The Versions class would have the #[] operator just like a Hash:

  photo.image_versions[:small] #=> Shrine::UploadedFile


But it wouldn't subclass Hash, because there are some downsides of that, and because I would also like it to be able to support Arrays. The ability to support an Array of versions came up a few times, e.g. when wanting to save encoded video or split a PDF into pages.

  photo.image_versions.update([...])

  photo.image_versions[0] #=> Shrine::UploadedFile


The versions should be infinitely nestable, because the video encoding example needs this feature. For traversing through nesting we use subscript operator with multiple arguments:

  movie.video_versions[:hls, :640_360, :segments, 0]


or maybe it's enough to just have subscript operator be able to return hashes and arrays as well, and then we can traverse normally:

  movie.video_versions[:hls][:640_360][:segments][0]


How do you want to nested versions in version? If there will be available nested feature how do you want to make interface when version will be requested by

```image_attacher.versions``` will returns Shrine::Versions. For example i have video version named "full_hd" and that version should have subversions defined by filter applied: "color", "grayscale", "sepia", "inverse_color", and every subversion will have sub-subversion defined by format type "h264", "h265", "webm".

There should be 3 different ways how to get version what should be needed.
"With brackets":
movie.video_versions[:full_hd][:color][:h264]
Should return exception when doesn't exists some subversion?

Simulate "dig" behavior
movie.video_versions.dig(:full_hd, :color, :h264)
When doenst exists it returns nil.

Calling methods:
movie.video_attacher.versions[:full_hd].versions[:color].versions[:h264]
because every "versions" is another Shrine::Versions object.
There could be used safe operator &

But nesting versions of some version is very rare use case. But i think someone should help.


The Versions class should have #to_h and #to_a conversion methods.

I agree,  to_a is very helpful.


These are some of the ideas that came to mind. What I would like is that users have complete control of managing versions. E.g. they might want to generate some versions in the foreground, and others in the background, and maybe others later in another background job. They might also want to store the original file in a different location than versions, that's why users should have the ability to choose to which storage they want to upload versions. I might be missing some more use cases that came up in the past.

Kind regards,
Janko

One issue is still quite open. Meta about versions will be stored inside "_data" or could be extracted outside?
Just one case. I have 5 versions. Original file will be saved and metadata about that file persisted in database. Versions will be created async in background. When version is processed, meta about version are appended. It is ok for most cases. But every-time what is version processed and meta persisted is running callback what will change "updated_at" column for Rails.
Is there any way how to persist meta about version in another table?

Anyway by reading this blogpost https://evilmartians.com/chronicles/rails-5-2-active-storage-and-beyond i have a feeling if word "versions" has right meaning. In ActiveStorage is used name "variant". Different meanings of words version and variant: https://english.stackexchange.com/questions/192032/what-is-the-make-or-break-difference-between-version-and-variant?rq=1 
What do you think?

Regards.
Radovan. 


Janko Marohnić

unread,
Apr 2, 2018, 6:07:39 AM4/2/18
to Radovan Šmitala, Shrine
Sorry for the late reply.

There should be 3 different ways how to get version what should be needed.

I agree with this.

Though it now raises the question, is it good that the object returned by `movie.video_versions` and `movie.video_attacher.versions` is both a value object and an object which you can use to add/remove/update versions. That was one of the things I really minded in CarrierWave and Paperclip design, that `movie.video` returns a CarrierWave::Uploader::Base/Paperclip::Attachment object that is used both for retrieving the uploaded file data and for doing all other operations, which makes the interface bloated and unclear.

I'm now thinking that the following design would be better:
  • `movie.video_versions` returns a hash/array of Shrine::UploadedFile objects, just like the current `movie.video` returns when versions are used. Users could then fetch the specific version by using the native Hash#dig/Array#dig.
  • `movie.video_versions` just delegates to `movie.video_attacher.versions`
  • Instead of having `Shrine::Attacher#versions` return a `Shrine::Versions` proxy object (like ActiveRecord's association proxy objects), we define all the version management methods on the Shrine::Attacher object directly, i.e. `movie.video_attacher.add_version(...)`, `movie.video_attacher.remove_version(...)`, `movie.video_attacher.update_versions(...)` etc. (like Sequel's association methods). I think this would make for a simpler and more consistent API.
What do you think about this API?

One issue is still quite open. Meta about versions will be stored inside "_data" or could be extracted outside?
Just one case. I have 5 versions. Original file will be saved and metadata about that file persisted in database. Versions will be created async in background. When version is processed, meta about version are appended. It is ok for most cases. But every-time what is version processed and meta persisted is running callback what will change "updated_at" column for Rails.
Is there any way how to persist meta about version in another table?

In #254 Jonathan and I were discussing that it would be better to store versions in a separate "<attachment>_versions_data" column than to chuck everything into the same "<attachment>_data" column. That would make versions truly an addition to the regular file upload flow, instead of potentially breaking something off when reading the "<attachment>_data" column.

Saving versions in a separate table could work too, though it would be more complex because it would require ORM-specific logic. In general I would like to enable users of Shrine to be able to store attachments in a separate table like in ActiveStorage. Note that it is already possible, it just needs to be manually configured (ActiveStorage's design can be built on top of Shrine's existing design). But for versions we would need to add special functionality to allow storing each version as a separate record, so we would first need to discuss if there are any significant advantages to that. If we would store all versions in the same DB column (either in the same table or a separate table), then there should be no need for special ORM-specific logic to enable that feature.

What do you think that for now we use the "<attachment>_versions_data" column (which can also be moved to a separate "attachment" table)?

Anyway by reading this blogpost https://evilmartians.com/chronicles/rails-5-2-active-storage-and-beyond i have a feeling if word "versions" has right meaning. In ActiveStorage is used name "variant". Different meanings of words version and variant: https://english.stackexchange.com/questions/192032/what-is-the-make-or-break-difference-between-version-and-variant?rq=1 
What do you think?

I completely agree, Jonathan raised the same point in #254. I would say that using different naming is even necessary, because I would like both the existing `versions` plugin and the new plugin to be able to coexist at the same time (and later we would deprecate and remove the old plugin), which would mean the new plugin would need a different name. "Variants" is an option, but Jonathan suggested a name which I find even better – "derivatives". I think this is more descriptive and more generic, because e.g. I wouldn't call a movie screenshot a "variant" of the original movie, because the movie is a video and a screenshot is an image. What do you think about that name?

Kind regards,
Janko

To unsubscribe from this group and stop receiving emails from it, send an email to ruby-shrine+unsubscribe@googlegroups.com.

To post to this group, send email to ruby-...@googlegroups.com.
Message has been deleted
Message has been deleted

Janko Marohnić

unread,
Apr 4, 2018, 4:45:48 AM4/4/18
to Radovan Šmitala, Shrine
i dont know why my reply message was automatically deleted.

For some reason it ended up in the moderation queue for as suspected spam, I approved it now. It happens occasionally, always with posts that are fine, and there doesn't seem to be a way to disable it :/

I like separated proxy object. It allows editable/extendable interface. It also allows to very exciting things with derivatives.

Can you elaborate on this a bit more? Command-Query Separation is a well-known design pattern that should lead to simpler code. I like that retrieving the Shrine attachment currently returns a Shrine::UploadedFile value object (instantiated only using data from the data column), I don't see a good reason to change that with derivates. I never saw the advantage of ActiveRecord's association proxies over Sequel's separated approach, and I used both gems quite a bit.

Exists have 2 options:

Thanks for a great overview, you definitely left me something to ponder on. Note that if we decide to move derivates into a separate table, we should first also move attachments into a separate table, so that we can then make derivates a natural extension of that. I will definitely keep both options in mind, though at the moment I'm leaning more towards option 2.

I am ok with derivative. It is mostly used with image processing. I prefer short variation "derivates", singular "derivate". It is same root from latin word derivatus, later derivat.

As you probably already noticed, I like "derivate". With "derivative" my concern was that it's too long, and "derivate" solves that.

Do you want to continue in Github Issue?

I would prefer to keep the discussions on the Google group, so that people subscribed to the Shrine repository don't get too many notifications. Then when the pull request is opened we can move the discussion there.

It will take me some time before I can start working on this, so anyone feel free to go for it if you want.

Kind regards,
Janko

On Mon, Apr 2, 2018 at 3:52 PM, <radovan...@gmail.com> wrote:
Hello, 

i dont know why my reply message was automatically deleted. So i repost it again but shorter :/


On Monday, April 2, 2018 at 12:07:39 PM UTC+2, Janko Marohnić wrote:
Sorry for the late reply.

There should be 3 different ways how to get version what should be needed.

I agree with this.

Though it now raises the question, is it good that the object returned by `movie.video_versions` and `movie.video_attacher.versions` is both a value object and an object which you can use to add/remove/update versions. That was one of the things I really minded in CarrierWave and Paperclip design, that `movie.video` returns a CarrierWave::Uploader::Base/Paperclip::Attachment object that is used both for retrieving the uploaded file data and for doing all other operations, which makes the interface bloated and unclear.

I'm now thinking that the following design would be better:
  • `movie.video_versions` returns a hash/array of Shrine::UploadedFile objects, just like the current `movie.video` returns when versions are used. Users could then fetch the specific version by using the native Hash#dig/Array#dig.
  • `movie.video_versions` just delegates to `movie.video_attacher.versions`
  • Instead of having `Shrine::Attacher#versions` return a `Shrine::Versions` proxy object (like ActiveRecord's association proxy objects), we define all the version management methods on the Shrine::Attacher object directly, i.e. `movie.video_attacher.add_version(...)`, `movie.video_attacher.remove_version(...)`, `movie.video_attacher.update_versions(...)` etc. (like Sequel's association methods). I think this would make for a simpler and more consistent API.
What do you think about this API?

I like separated proxy object. It allows editable/extendable interface. It also allows to very exciting things with derivatives.
 

One issue is still quite open. Meta about versions will be stored inside "_data" or could be extracted outside?
Just one case. I have 5 versions. Original file will be saved and metadata about that file persisted in database. Versions will be created async in background. When version is processed, meta about version are appended. It is ok for most cases. But every-time what is version processed and meta persisted is running callback what will change "updated_at" column for Rails.
Is there any way how to persist meta about version in another table?

In #254 Jonathan and I were discussing that it would be better to store versions in a separate "<attachment>_versions_data" column than to chuck everything into the same "<attachment>_data" column. That would make versions truly an addition to the regular file upload flow, instead of potentially breaking something off when reading the "<attachment>_data" column.

Saving versions in a separate table could work too, though it would be more complex because it would require ORM-specific logic. In general I would like to enable users of Shrine to be able to store attachments in a separate table like in ActiveStorage. Note that it is already possible, it just needs to be manually configured (ActiveStorage's design can be built on top of Shrine's existing design). But for versions we would need to add special functionality to allow storing each version as a separate record, so we would first need to discuss if there are any significant advantages to that. If we would store all versions in the same DB column (either in the same table or a separate table), then there should be no need for special ORM-specific logic to enable that feature.

What do you think that for now we use the "<attachment>_versions_data" column (which can also be moved to a separate "attachment" table)?

Exists have 2 options:

1. Separate column "<attachment>_versions_data"
  - isolated from original file
  - easy to add queue as process and create different version (linear or concurrent) together
  - complicated to handle multiple queue processes when every process is every version (need to lock column and merge metadata)
  - thin ORM integration
  - more logic on Shrine side

2. Separate table "<model>_<attachment>_versions"
  - with two minimal columns "version" and "data" where "version" is name of version
  - easy multiprocess handling, no need merging, every "row" is isolated version
  - "thick" ORM integration
  - less logic on Shrine, more on ORM (Shrine in basic needs to create version, store it and returns "metadata" which is delegated to ORM
  - better auditing and regenerating what is needed (logging and audit)
  - just another "model" associated with original file


Anyway by reading this blogpost https://evilmartians.com/chronicles/rails-5-2-active-storage-and-beyond i have a feeling if word "versions" has right meaning. In ActiveStorage is used name "variant". Different meanings of words version and variant: https://english.stackexchange.com/questions/192032/what-is-the-make-or-break-difference-between-version-and-variant?rq=1 
What do you think?

I completely agree, Jonathan raised the same point in #254. I would say that using different naming is even necessary, because I would like both the existing `versions` plugin and the new plugin to be able to coexist at the same time (and later we would deprecate and remove the old plugin), which would mean the new plugin would need a different name. "Variants" is an option, but Jonathan suggested a name which I find even better – "derivatives". I think this is more descriptive and more generic, because e.g. I wouldn't call a movie screenshot a "variant" of the original movie, because the movie is a video and a screenshot is an image. What do you think about that name?

I am ok with derivative. It is mostly used with image processing. I prefer short variation "derivates", singular "derivate". It is same root from latin word derivatus, later derivat.

Do you want to continue in Github Issue?

Regards!
 

Kind regards,
Janko

--
You received this message because you are subscribed to the Google Groups "Shrine" group.
To unsubscribe from this group and stop receiving emails from it, send an email to ruby-shrine+unsubscribe@googlegroups.com.
To post to this group, send email to ruby-...@googlegroups.com.

jroc...@sciencehistory.org

unread,
Sep 5, 2018, 3:48:24 PM9/5/18
to Shrine
Hello, I'm going to revive this thread! 

I work in library/museum/archives digital collections/preservation.  I am interested in using shrine (possibly for in a toolkit I might share with others in this sector), but have some different needs for derivatives. 

I started discussing some of them here: https://github.com/shrinerb/shrine/issues/254

Where janko directed me to this thread. I think my needs may overlap/be compatible with what's being discussed here, that there is some chance I can develop something to contribute to shrine core as a newer versions/derivatives plugin, which would be great. So I'd like to discuss that.  I'm going to use the term "derivatives" instead of "versions", to be clear I'm talking about a new thing, and because "derivatives" is what we use in my field to talk about these things, derived from the original. 

* I want derivatives to be able to have a separate shrine store than the originals. 

* I would _like_ derivatives to be in a separate column than the original, so adding derivatives on does not change the APIs for the original at all, a goal discussed in this thread which I share. 

* However, I do not think I have a need/desire for a whole separate table for derivatives, with one line each, as proposed here. While it should be noted that that is what ActiveStorage does, I think simply a separate column (still having a hash) is sufficient, possibly preferable, and should make it a lot easier to achieve goals with relatively less new code -- and has fewer performance challenges with n+1 queries etc. 

* I need derivatives to have differently configured plugins than the original. For instance, you might want checksums or other metadata on the original that you simply don't need on the derivatives. (Not sure if current 'versions' easily supports that or not). 

* _Ideally_ I'd like it to be possible for _different_ derivatives to be stored in _different_ stores, beyond just one store for all derivatives (that is different than original).  This is sort of an extra bonus feature maybe, but to the extent it might effect initial architectural choices, it might be worth mentioning. 

In https://github.com/shrinerb/shrine/issues/254, janko suggested the way to do the first requirement (different store for derivatives than original), is to use a _separate attacher class_. (Are those classes called 'attachers'? Or 'attachments'? as in `include ImageUploader::Attachment.new(:image_versions)`)  With a separate column, which it turns out is just fine or maybe even desirable for our requirements. 

Which makes sense to me, and seems do-able with little additional code. But if we were to turn that into an actual plugin avoiding boilerplate... what would it look like?  Maybe a plugin that automatically creates that extra attachment class and uploader for you?   With a default column name, but you can of course pass in an option for a different one?  Maybe it exposes something like the versions `on_process` block, but that block will actually be attached to this "extra" attacher -- which maybe just uses the existing `versions` plugin?   With all the code janko suggested as explicit somehow automatic in the plugin at correct hook points?

Maybe that's the default, but based on options you can actually have an explicit and generated in your app derivatives attacher, so you can completely customize the plugins in it (metadata plugins iff desired, etc)?

One trick I don't totally understand is making sure the derivatives (now in this separate column/attacher) get cleared out when the original changes or is removed -- where to hook into to catch that. 

Does this basic approach seem reasonable? Seem like something that might make sense in shrine core as an improved version of `versions`?  Any tips on implementation, or more specific ideas for what public facing API should like?

I believe I will have some time and would be interested into trying to develop this for shrine, if it seems like it could make sense, and would appreciate any guidance to make that more possible. 

Thanks!


Hiren Mistry

unread,
Sep 6, 2018, 1:44:05 AM9/6/18
to Shrine
Hi @jrochkind

It's good to hear from you again! I started working on this plugin but didn't get very far before I got interrupted. If you'd like, we can work on it together.

I think we're headed in the same direction that meets your needs. And if not, it's ok to branch off and create your own plugin. That's the beauty of plugins.

Plugin will be called derivates, a couple characters shorter than derivatives. :)

After gathering all the needs above and in Github issues, here's the summary of what I'm implementing:
  • Support a new data structure that is non-destructive to original (it will be a hash added to original under the key derivates)
  • Support an infinite nested version structure
  • User have full control of version generation
  • Always require an original file
    • Can the original be replaced like if I didn’t want to store RAW but same res JPG version instead? (TBD)
  • Versions can use different storages - all the same or each one separate? (TBD)
  • Can original be promoted to ICE storage after some length of inactivity (TBD)
  • Be thread safe and avoid DB race conditions when updating file_data
  • If original is changed, destroy all versions including case when change is made while in processing versions
  • Consider version_updated_at timestamp and not update record timestamp when updating versions
  • Store versions in orig_metadata or separate ver_metadata or separate table? => First or second
  • Can save additional metadata in for each version in _metadata
  • Can versions have different metadata than original (by @jrochkind)
  • Support versions created in 2 ways:
    • through code programmatically
    • user uploads versions because auto-generation doesn’t work for this use case
  • Is the migration path easy for existing Shrine users?
  • Have temp download links that expire and/or obfuscate file storage urls (TBD)
Attachers vs Attachments
The thing included in the model is an attachment and it holds the file data. The attachment has an attacher that's responsible for adding the file data to the record. See the docs and guides for more explanations.

As to your other questions, I need a refresher again as I was in the middle of digging through the code before getting interrupted. :)

Hiren.

Rochkind, Jonathan

unread,
Sep 6, 2018, 11:08:54 AM9/6/18
to Hiren Mistry, Shrine

Thanks, do you have existing code anywhere I can look at?  It might be helpful to me.  If I get anywhere, I’ll try to put my code somewhere you can see, so you can advise or collaborate in any other way? maybe just a shrine fork in my GH? I definitely could use some help!

 

  • Can the original be replaced like if I didn’t want to store RAW but same res JPG version instead?

 

I think that is a valid use case, but I’d suggest it’s NOT the role of the versions/derivatives plugin.

 

I BELIEVE that can easily be done already without versions/derivatives plugin, you can for instance use the `processing` plugin without versions, to substitute an original, yes?

 

Similarly, I think “Have temp download links that expire and/or obfuscate file storage urls”  is probably out of scope – you might want to do that with originals OR derivatives, you can already do it to some extent (say if you are using S3, and used signed urls, which you can already do in shrine). We should pay attention to architecture to make sure that it’s no harder to do this for derivatives than for originals, but I think it’s not part of a derivatives plugin.  If you want an end-point in your app to obfuscate the back-end storage location, probably by proxying bytes, I think that may be out of scope for shrine as a whole, but is definitely out of scope for derivatives plugin.

 

One thing I mentioned that might be worth including on the list as a stretch goal, is the ability to have _different_ derivatives in different stores. Only because that may or may not effect the architecture – even if it’s not there right away, we should see a path to adding it backwards compatibly.

 

Another thing I forgot to mention but which is important – it should be possible to create derivatives in background jobs, with a separate background job for each derivative. And relatedly, to create some derivatives in a delayed fashion long after the others were created (sort of a very long-term async).  This is tricky because of potential race conditions, and is confusing to me how to address because the solution(s) probably need to be ORM-specific, and it’s not clear to me how to handle that in a shrine plugin.

 

Really, gets confusing ot me when I think about, I think I need to just play with shrine code some more.

 

Jonathan

 

__________________________________

Jonathan Rochkind
Software Developer/Technical Lead
Othmer Library of Chemical History
t. +1.215.873.8239

Science History Institute 
Chemistry • Engineering • Life Sciences 
315 Chestnut Street • Philadelphia, PA 19106 • U.S.A.
sciencehistory.org

The Chemical Heritage Foundation is now the Science History Institute. For more information visit sciencehistory.org.

--
You received this message because you are subscribed to a topic in the Google Groups "Shrine" group.
To unsubscribe from this topic, visit
https://groups.google.com/d/topic/ruby-shrine/G-WooM709YY/unsubscribe.
To unsubscribe from this group and all its topics, send an email to
ruby-shrine...@googlegroups.com.


To post to this group, send email to

Hiren Mistry

unread,
Sep 6, 2018, 4:09:39 PM9/6/18
to Shrine
You're correct about derivatives plugin responsibility. I wasn't clear when writing it. I have them there because whatever we implement in derivatives plugin, it mustn't affect those use cases and make it hard or convoluted to do. For example, the RAW original replaced with a JPG version, what if 1 month after initial creation, a background job created an archive version of the original to store in S3 glacier, and upon replacing the original with a JPG the plugin deletes all versions causing it to recreate all versions unnecessarily. This is one implementation that indirectly affects versions plugin and it's behaviour. We will have to draw the line somewhere but in initial planning, it's good practice to think about such things.

Your stretch goal (i.e. different derivatives in different stores) is not a stretch goal but a requirement for the new versions plugin. We have seen several requests for it and there are strong use cases where people want to move some of the versions out of storage to a CDN.

I'll email you separately to discuss collaboration to keep the noise out of the relevant thread topic of the new versions plugin features/behavior.

Hiren.

.
To unsubscribe from this group and all its topics, send an email to
ruby-shrine+unsubscribe@googlegroups.com.


To post to this group, send email to


To view this discussion on the web visit

Janko Marohnić

unread,
Sep 8, 2018, 5:10:00 PM9/8/18
to Hiren Mistry, ruby-shrine
Thank you Jonathan for reviving the thread and Hiren for sharing your investigations so far. For collaborating on the plugin you can open a branch on shrinerb/shrine directly if you would like, in this case I'll add Jonathan as the collaborator.

I definitely want for the derivates plugin to be as flexible as possible, and support each of the requirements mentioned here. Considering that Shrine saves the storage for each uploaded file (unlike other file attachment libraries), uploading derivates to different storages shouldn't be too difficult to implement.

However, I do not think I have a need/desire for a whole separate table for derivatives, with one line each, as proposed here. While it should be noted that that is what ActiveStorage does, I think simply a separate column (still having a hash) is sufficient, possibly preferable, and should make it a lot easier to achieve goals with relatively less new code -- and has fewer performance challenges with n+1 queries etc. 

Yeah, for me just having a separate column is the right way to go. Not having to think about N+1 queries as you said is one thing. The other thing is that having a record per derivate (like ActiveStorage has) is a big performance hit, because then query performance diminishes with more attachments and more derivates per attachment; this is unnecessary performance penalty IMO, especially if you have a lot of derivates such as when you want to store videos in HLS format.

Hiren, how does a separate column sit with you?

I need derivatives to have differently configured plugins than the original. For instance, you might want checksums or other metadata on the original that you simply don't need on the derivatives. (Not sure if current 'versions' easily supports that or not). 

This might be a tough one, I would really not go there if we can avoid it. The current versions plugin supports it, because it sends the current version in the `context`, so you could add a conditional in the `add_metadata` block:

class MyUploader < Shrine

  plugin :add_metadata

  plugin :signature


  add_metadata do |io, context|

    calculate_signature(io, :md5) if context[:version].nil? || context[:version] == :original

  end

end


One trick I don't totally understand is making sure the derivatives (now in this separate column/attacher) get cleared out when the original changes or is removed -- where to hook into to catch that. 

The derivates plugin could override Attacher#assign, which is what gets called when someone sets the new attachment (e.g. `photo.image = file`). Whenever a new value gets assigned, the derivates column could get cleared.

Does this basic approach seem reasonable? Seem like something that might make sense in shrine core as an improved version of `versions`?  Any tips on implementation, or more specific ideas for what public facing API should like?

In #254 I suggested the basic API to look like this:

photo.image.download do |original|

  pipeline = ImageProcessing::MiniMagick.source(original)


  derivates = {

    large:  pipeline.resize_to_limit!(800, 800),

    medium: pipeline.resize_to_limit!(500, 500),

    small:  pipeline.resize_to_limit!(300, 300),

    # ...

  }


  photo.image_attacher.add_derivates(derivates)

  # or

  derivates.each do |name, file|

    photo.image_attacher.add_derivate(name, file)

  end

end


I would like that for starters we only have this explicit API, and postpone any discussion about potential implicit hooks that trigger version creation (like we currently have with the processing plugin), as existing implicit APIs already hurt Shrine's flexibility and simplicity so I would like to avoid repeating the same mistake with derivates plugin.

The explicit API above needs to exist because we want enough flexibility so that users are able to add or remove versions whenever. Note that this code wouldn't be defined in the uploader, but rather called explicitly on the record itself, so it's called from the controller (and can be put in a service object).

---------------

Hiren, now to address your comments:

Can the original be replaced like if I didn’t want to store RAW but same res JPG version instead? (TBD)

We should make this possible during promotion. This is not related to the derivates plugin directly, but since it would mean deprecating versions and processing plugins, it should ideally be easy to do this without the processing plugin.

Be thread safe and avoid DB race conditions when updating file_data

I've recently found out about the SELECT ... FOR UPDATE locking mechanism that SQL databases offer, which is supported both by ActiveRecord (see pessimistic locking) and Sequel (see Locking for Update and #lock!). I think that could be an elegant solution for DB race conditions, here is how it looks in Sequel:

record.db.transaction do

  record.lock! # SELECT * WHERE id = '123' FOR UPDATE LIMIT 1;

  record.update(attachment_data: "...")

end # lock is released


Thread safety could be achieved with a simple mutex.

Is the migration path easy for existing Shrine users?

Thanks for noting that, we should definitely think about the migration path when we release it. 

Have temp download links that expire and/or obfuscate file storage urls (TBD)

Short of the "temporary" part, the download_endpoint plugin already does this.

.
To unsubscribe from this group and all its topics, send an email to
ruby-shrine...@googlegroups.com.


To post to this group, send email to


To view this discussion on the web visit

You received this message because you are subscribed to the Google Groups "Shrine" group.
To unsubscribe from this group and stop receiving emails from it, send an email to ruby-shrine...@googlegroups.com.

To post to this group, send email to ruby-...@googlegroups.com.

jona...@dnil.net

unread,
Sep 12, 2018, 5:59:31 PM9/12/18
to Shrine
Ah, I refound this one, thanks!  I think I am getting closer to understanding how to implement this, after spending a couple days experimenting with shrine. 

In your conception, does "photo.image_attacher.add_derivates" make an immediate save to the db through the ORM?  I am thinking probably yes?  


The derivates plugin could override Attacher#assign, which is what gets called when someone sets the new attachment (e.g. `photo.image = file`). Whenever a new value gets assigned, the derivates column could get cleared.

I'm not sure about this. Not sure if the "derivatives column gets cleared" means in-memory, or already persisted to the ORM. 

The trick is dealing with the actual files persisted to storage.  Once the new thing is for real saved (even to cache), we need to delete the derivatives from storage. But if someone does an assign, but ends up not saving the model (succesfully, or even trying), then we need the derivative files in storage to still be there, since the model still points to them (and might need them!). 



I've recently found out about the SELECT ... FOR UPDATE locking mechanism that SQL databases offer, which is supported both by ActiveRecord (see pessimistic locking) and Sequel (see Locking for Update and #lock!). I think that could be an elegant solution for DB race conditions, here is how it looks in Sequel:

record.db.transaction do

  record.lock! # SELECT * WHERE id = '123' FOR UPDATE LIMIT 1;

  record.update(attachment_data: "...")

end # lock is released


Thread safety could be achieved with a simple mutex.

I'm not sure what kinds of thread-safety you are thinking about here, concurrency always twists my brain around. 

But don't forget that concurrency-safety using the database is not limited to "thread safety" -- there can be different _processes_, even running on different machines (which of course is a typical way to deploy a Rails app, and probably any web app under MRI). 

Of course the DB-level pessimistic lock is intended to handle that. If it handles it succesfully (and I am scared of db pessimistic locks, I feel like they have unpredictable and sometimes disastrous performance characteristics) -- is any additional kind of thread safety requiring a mutex required? I don't _think_ so, but am not sure. 

 

Is the migration path easy for existing Shrine users?

Thanks for noting that, we should definitely think about the migration path when we release it. 

Yeah, that's gonna be painful I think, especially on a running app with no downtime. But one step at a time, we gotta have something to migrate to before worrying about migration maybe. 

Jonathan

Hiren Mistry

unread,
Sep 12, 2018, 7:41:21 PM9/12/18
to Shrine
See my thoughts below

For Jonathan

I saw your blog, haven't read it yet but that was quick! 



In your conception, does "photo.image_attacher.add_derivates" make an immediate save to the db through the ORM?  I am thinking probably yes?  


It'll be when the record is saved i.e. "photo.save" needs to be called - be consistent with model change behavior. It'll have to go through the process of uploading the derivatives, I'm thinking directly to :store but maybe there's value in sending it through :cache???

There has been an ask for a use case where some derivatives are not auto-generated through scripts but uploaded by some user. In this case, it may make sense to have derivatives go through :cache and get some validation. We can address this later.

 

The derivates plugin could override Attacher#assign, which is what gets called when someone sets the new attachment (e.g. `photo.image = file`). Whenever a new value gets assigned, the derivates column could get cleared.

I'm not sure about this. Not sure if the "derivatives column gets cleared" means in-memory, or already persisted to the ORM. 

The trick is dealing with the actual files persisted to storage.  Once the new thing is for real saved (even to cache), we need to delete the derivatives from storage. But if someone does an assign, but ends up not saving the model (succesfully, or even trying), then we need the derivative files in storage to still be there, since the model still points to them (and might need them!). 

I think clearing derivatives should only happen when user intent is clear i.e. they replaced original AND actually persisting it to the db by saving the record. Let's see how we can make the code do that unless others have different opinion we can discuss it then.

 



I've recently found out about the SELECT ... FOR UPDATE locking mechanism that SQL databases offer, which is supported both by ActiveRecord (see pessimistic locking) and Sequel (see Locking for Update and #lock!). I think that could be an elegant solution for DB race conditions, here is how it looks in Sequel:

record.db.transaction do

  record.lock! # SELECT * WHERE id = '123' FOR UPDATE LIMIT 1;

  record.update(attachment_data: "...")

end # lock is released


Thread safety could be achieved with a simple mutex.

I'm not sure what kinds of thread-safety you are thinking about here, concurrency always twists my brain around. 

But don't forget that concurrency-safety using the database is not limited to "thread safety" -- there can be different _processes_, even running on different machines (which of course is a typical way to deploy a Rails app, and probably any web app under MRI). 

Of course the DB-level pessimistic lock is intended to handle that. If it handles it succesfully (and I am scared of db pessimistic locks, I feel like they have unpredictable and sometimes disastrous performance characteristics) -- is any additional kind of thread safety requiring a mutex required? I don't _think_ so, but am not sure. 
 

We can evaluate this later once we have some code. I try to keep an open mind and not let fear drive engineering decisions.

 
For Janko

Yeah, for me just having a separate column is the right way to go. Not having to think about N+1 queries as you said is one thing. The other thing is that having a record per derivate (like ActiveStorage has) is a big performance hit, because then query performance diminishes with more attachments and more derivates per attachment; this is unnecessary performance penalty IMO, especially if you have a lot of derivates such as when you want to store videos in HLS format.
Hiren, how does a separate column sit with you?

I'm keeping an open mind and not let my pre-biases prematurely decide between using the original column vs a separate column. I think both ways have pros and cons. The separate column avoids cluttering the original column with versions which can be nice for select queries but downside is requires a migration and you have twice as many columns as files stored in a model and keeping each pair sync'd. The single original column is opposite. 

Being a universal library, we should make the best decision for the users and their use cases - 80-20 rule. One thought is we can also look at the other file attachment libraries (not just the ruby ones) and see how whether people liked single vs separate columns.

Hiren.

Jonathan Rochkind

unread,
Sep 13, 2018, 9:37:01 AM9/13/18
to Shrine
The race conditions are the problem here. I don't think we can avoid thinking about them until 'later', I think they will have some fundamental implications on some architectural choices. 

I agree it's important that derivatives can be added "whenever".  Including:

* before promotion (so it's available immediately after an ORM save, like currently using recache plugin)
* during promotion (whether promotion is foreground or background with backgrounding plugin)
* in a separate background job, perhaps kicked off in promotion
* in a separate background job or just ruby code wherever at any time after the asset was promoted -- including re-generating/adding/deleting derivatives

That different derivatives can be added at different times creates a challenging to handle race condition, when we are storing all derivatives in a single hash. 

To be clear so we're all on the same page, I'll outline the race condition (let's see if I can paste an html table into gmail!). I say "thread" but it could be different processes as well. 

Derivatives already exists with { one: something, two: something }

THREAD ATHREAD B
Fetches existing model with derivatives and keys :one, :twoFetches existing model with derivatives and keys :one, :two
Adds key :a to hash in memory
Persists to db, db now has hash with :one, :two, :a -- and actual derivative :a is persisted in storage, ready for use
Adds key :b to hash in memory
Persists to db, db now has hash with keys :one, :two, :b. And asset for :b is persisted in storage, ready for use. 

:a is lost from the hash saved in db column. But the actual asset associated with :a is still persisted in storage, it's been orphaned


So not only did the model "lose" derivative `:a:`, overwritten when thread B did a "save" -- but the actual asset for derivative `:a` is still persisted in storage, orphaned and not referred to by any models -- which is even more disastrous. 

The current "versions" code has less opportunity for this race condition, by assuming that versions are only created as part of the "promotion" phase of setting a a new original. Although I'm not completely sure there are no such race conditions in current code -- especially if you follow advice for regenerating/adding/removing versions. 

It's painful to try to create automated tests around this race condition, making this more challenging. (Also just painful to think about it ha). The current code doesn't actually have any tests around race conditions, although it does have some code that tries to minimize them in 'backgrounding' plugin, it isn't really testing (and as recently discussed doesn't completely eliminate the race condition). 

If derivatives are being kept in the SAME column as original, it makes the race condition even more confusing to think about, as it can be effected by changes to the original, not just changes to derivatives, as they are all in the same column -- and possibly vice versa, an intended to change to derivatives might "overwrite" someone elses change to an original, which is even more disastrous.  It's super confusing to me to think about, but this is one reason trying to put derivatives in the same column as the original makes me nervous. 

There's probably no way to handle this without being ORM-specific.  It may be a LOT more feasible to handle (still in ORM-specific ways) if the `update_derivatives` method makes an _immediately persisted_ change, rather than in-memory waiting for a model-save.  Which is what the method name "update" implies to me anyway. 

Having update_ make an immediate change also eases the issue of "when do you actually persist derivative to storage".  If you have separate events in time for "set this new derivative in memory", and "actually persist it in model to db", you probably need to use ORM hooks to deal with persisting the derivative to storage, to make sure it doesn't get persisted for an unsaved model. (And/or use cache storage that gets promoted to store; possibly both)

And a separate thing, if you have a use case for user-uploaded derivatives, and want them to be validated etc -- you probably want an actual Uploader/Attacher for the derivatives column, so all that existing logic can be applied to it. Which janko I think was hoping to avoid. I think it may make a lot of sense to use it -- but I also think it probably makes sense to avoid the use case for "user-uploaded derivatives", if we can. 

On the one hand, all this stuff is both really confusing, and I think may end up tightly effected by original architectural choices. So it makes sense to think of it before writing any code. On the other hand -- it's just so confusing, and there are so many factors! We probably just have to start writing code and evaluating it, cause it's easier to talk about with code in front of us than purely abstractly -- even knowing we might end up throwing out a lot of the first draft. 

I'm going to try to get started writing a first draft soon -- if not today then early next week. I am currently spending the bulk of my day job time on this, for as long as I can get away with it. 

I am trying to keep all the requirements/desires in mind -- but it gets really complicated, I may have to temporarily "forget" about some of them. This is actually some really tricky stuff, I think. 

Jonathan


--
You received this message because you are subscribed to a topic in the Google Groups "Shrine" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/ruby-shrine/G-WooM709YY/unsubscribe.
To unsubscribe from this group and all its topics, send an email to ruby-shrine+unsubscribe@googlegroups.com.

To post to this group, send email to ruby-...@googlegroups.com.

Jonathan Rochkind

unread,
Sep 13, 2018, 9:52:50 AM9/13/18
to Shrine
PS: It's almost making me re-consider my opposition to having a separate table for derivatives -- a seperate table is more "normalized" for rdbms, letting you lean on basic rdbms semantics harder, and avoiding the race condition of having different data elements in the same column that might be updated separate. 

Although that will bring it's own set of challenges, both in terms of shrine architecture (adding other referenced objects into it is something shrine doesn't currently do, and is difficult to do in an ORM-agnostic way), and in terms of ultimate performance characteristics. 

I might try to spend some time looking at how ActiveStorage handles all these. 

Here are some lower-level implementation questions/requirements, assuming we want to allow adding/removing derivatives flexibly at any time, meaning possibly overlapping/concurrently (which I think we definitely do, and is actually to me a big part of the motivation for writing a new thing -- along with allowing storage-location flexibility)

* How do we make sure a derivative is persisted to storage if and only if it's referenced in a model -- if the change for whatever reason isn't ultimately saved to model, have to make sure the derivative isn't left persisted in storage?  Likewise for _removing_ an existing derivative, including on model destroy where all existing derivatives need to be removed from storage. 

* If an original is changed, how do we make sure the derivatives for the old original are not only removed from references in the model, but the actual derivative files are removed from storage if and ONLY if the original change is actually persisted to the db. 

* Race condition issues with all of this. Different processes/threads may be concurrently adding/removing derivatives AND changing/removing originals from a given model instance. How do you make sure nobody overwrites anyone else (some kind of locking), AND that no derivatives end up "orphaned", persisted to storage but not referenced in the persisted model. 

It's worth pointing out that the current versions implementation only creates versions _after_ a model has been persisted to the db with a given original (which may be in cache) persisted. Because they happen in promotion/finalize phase, which is normally triggered in "after_commit", and will generally only work with a persisted model in any event.  


Janko Marohnić

unread,
Sep 17, 2018, 7:16:43 PM9/17/18
to jona...@dnil.net, ruby-shrine
Finally got around to follow up on this. Most of the points were already addressed in parallel discussions, so I'll try to address some the remaining things.

I'm not sure about this. Not sure if the "derivatives column gets cleared" means in-memory, or already persisted to the ORM. 
The trick is dealing with the actual files persisted to storage.  Once the new thing is for real saved (even to cache), we need to delete the derivatives from storage. But if someone does an assign, but ends up not saving the model (succesfully, or even trying), then we need the derivative files in storage to still be there, since the model still points to them (and might need them!). 

I'll try to explain the behaviour I had in mind in code.

  photo.image #=> #<ShrineUploadedFile: @data={"id" => "111", "storage" => "store"} > 

  photo.image_derivates #=> { ... hash of derivates ... }


  # assigning a new image clears the derivates column in-memory

  photo.image = new_image

  photo.image #=> #<ShrineUploadedFile: @data={"id" => "222", "storage" => "cache"} > 

  photo.image_derivates #=> nil


  # that change still isn't persisted

  Photo.find(photo.id).image_derivates #=> { ... hash of derivates ... }


  # saving the record deletes the old derivates from storage

  photo.save

  photo.image #=> #<ShrineUploadedFile: @data={"id" => "333", "storage" => "store"} > 


I think clearing derivatives should only happen when user intent is clear i.e. they replaced original AND actually persisting it to the db by saving the record. Let's see how we can make the code do that unless others have different opinion we can discuss it then.

That could also work. 

Of course the DB-level pessimistic lock is intended to handle that. If it handles it succesfully (and I am scared of db pessimistic locks, I feel like they have unpredictable and sometimes disastrous performance characteristics) -- is any additional kind of thread safety requiring a mutex required? I don't _think_ so, but am not sure. 

This reminded me that it would also be nice to support concurrent derivate processing/uploading:

  pipeline = ImageProcessing::Vips.source(photo.image)


  derivate_definitions = {

    large: [800, 800],

    medium: [500, 500],

    small: [300, 300],

  }


  derivate_definitions.each do |name, (width, height)|

    thread_pool.post do # this block executes asynchronously

      derivate = pipeline.resize_to_limit!(width, height)


      photo.image_attacher.add_derivate(name, derivate) # adds the derivate in-memory

    end

  end


  photo.save


In that case there would need to be some thread safety with a mutex, because hashes don't seem to be thread-safe even in MRI (let alone JRuby); see this Roda commit for a little more info.

The separate column avoids cluttering the original column with versions which can be nice for select queries but downside is requires a migration and you have twice as many columns as files stored in a model and keeping each pair sync'd. The single original column is opposite. 

Yeah, that's true. We'll see how a separate column would work for now, but I'll try to keep the alternative in mind.

One thought is we can also look at the other file attachment libraries (not just the ruby ones) and see how whether people liked single vs separate columns.

I was surprised by how difficult it is to find full-featured file attachment libraries in other languages. The only one I know about that has a similar scope is arc written in Elixir, which mentions transformation but doesn't mention where they are stored.

Looks like the initial derivates implementation is coming up soon, looking forward! :)

To unsubscribe from this group and all its topics, send an email to ruby-shrine...@googlegroups.com.

To post to this group, send email to ruby-...@googlegroups.com.


--
You received this message because you are subscribed to the Google Groups "Shrine" group.
To unsubscribe from this group and stop receiving emails from it, send an email to ruby-shrine...@googlegroups.com.
To post to this group, send email to ruby-...@googlegroups.com.

Rochkind, Jonathan

unread,
Sep 18, 2018, 2:13:20 PM9/18/18
to ruby-shrine

So, with that code, where the change to derivatives aren’t persisted until save… if they are still going to be race-condition safe… I guess you would have to somehow trigger that race condition safe update code _on_ save, rather than on set?  It all gets to be a bit overwhelming for me honestly, my code isn’t going great.

 

__________________________________

Jonathan Rochkind
Software Developer/Technical Lead
Othmer Library of Chemical History
t. +1.215.873.8239

Science History Institute 
Chemistry • Engineering • Life Sciences 
315 Chestnut Street • Philadelphia, PA 19106 • U.S.A.
sciencehistory.org

The Chemical Heritage Foundation is now the Science History Institute. For more information visit sciencehistory.org.

From: <ruby-...@googlegroups.com> on behalf of Janko Marohnić <janko.m...@gmail.com>
Date: Monday, September 17, 2018 at 7:16 PM
To: "jona...@dnil.net" <jona...@dnil.net>
Cc: ruby-shrine <ruby-...@googlegroups.com>
Subject: Re: Uploaded versions separation

 

Reply all
Reply to author
Forward
0 new messages