Strategy for good deduplication


sain...@gmail.com

May 27, 2014, 9:47:25 AM
to sea...@googlegroups.com
Hello,

I would like to use Seafile to store and share pictures with friends and family.
A typical use case is an event where everybody takes pictures and, after the event, everybody wants to share pictures with everybody else.

People (for instance A, B and C) usually follow this workflow:
1) Create an Images/event directory where they put all their pictures, so we end up with:
  - A/Images/event
  - B/Images/event
  - C/Images/event
2) A decides to create a shared directory eventAshared to share the event pictures with B and C, so we end up with:
  - A/Images/event
  - A/eventAshared
  - B/Images/event
  - B/eventAshared
  - C/Images/event
  - C/eventAshared
3) A, B and C copy a selection of their Images/event to eventAshared
4) A, B and C copy all files from eventAshared into their Images/event
5) A deletes eventAshared

Given the previous workflow, what would be a good library/sub-library/directory structure to minimize upload time and optimize deduplication?
With the current Seafile it seems that everything is automatically deduplicated between libraries and between users, so maybe we don't have to do anything and it will work automagically?
What would happen once deduplication between libraries is removed?

Thanks !

P.S.: I know that having one single shared directory that we keep forever, where everybody stores and exchanges files, would be best, but that doesn't seem to be the workflow people use.

Michele Innocenti

May 27, 2014, 12:33:37 PM
to sea...@googlegroups.com


On Tuesday, May 27, 2014 15:47:25 UTC+2, sain...@gmail.com wrote:
> Hello,
>
> Given the previous workflow, what would be a good library/sub-library/directory structure to minimize upload time and optimize deduplication?
> With the current Seafile it seems that everything is automatically deduplicated between libraries and between users, so maybe we don't have to do anything and it will work automagically?
> What would happen once deduplication between libraries is removed?

Deduplication is at block level.
You don't have to do anything and you can't remove deduplication.
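
To illustrate what block-level deduplication means in practice, here is a minimal Python sketch. It is not Seafile's actual implementation: it assumes fixed-size blocks and SHA-1 block IDs, and `BlockStore`, `BLOCK_SIZE`, and the method names are hypothetical, chosen only for this example.

```python
import hashlib

# Tiny block size for illustration only; real systems use much larger
# (and often variable-size) blocks.
BLOCK_SIZE = 4


class BlockStore:
    """Content-addressed store: each distinct block is kept once, keyed by its hash."""

    def __init__(self):
        self.blocks = {}  # block ID -> block bytes

    def put_file(self, data):
        """Split data into blocks, store the unseen ones, return the block ID list."""
        ids = []
        for start in range(0, len(data), BLOCK_SIZE):
            block = data[start:start + BLOCK_SIZE]
            block_id = hashlib.sha1(block).hexdigest()
            self.blocks.setdefault(block_id, block)  # no-op if already stored
            ids.append(block_id)
        return ids

    def get_file(self, ids):
        """Rebuild a file from its block ID list."""
        return b"".join(self.blocks[block_id] for block_id in ids)

    def stored_bytes(self):
        return sum(len(block) for block in self.blocks.values())


store = BlockStore()
photo = b"AAAABBBBCCCC"
ids_a = store.put_file(photo)  # user A uploads the photo
ids_b = store.put_file(photo)  # user B uploads the same photo: no new blocks
print(store.stored_bytes())    # 12, not 24
```

In a store like this, a second upload of identical content only adds a new list of block IDs, which is why no special directory layout is needed as long as all libraries share one store.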

sain...@gmail.com

May 27, 2014, 12:40:25 PM
to sea...@googlegroups.com

On Tuesday, May 27, 2014 6:33:37 PM UTC+2, Michele Innocenti wrote:

>> Given the previous workflow, what would be a good library/sub-library/directory structure to minimize upload time and optimize deduplication?
>> With the current Seafile it seems that everything is automatically deduplicated between libraries and between users, so maybe we don't have to do anything and it will work automagically?
>> What would happen once deduplication between libraries is removed?
>
> Deduplication is at block level.
> You don't have to do anything and you can't remove deduplication.

Hmm, perhaps I was not clear. Block deduplication between libraries is planned to be removed soon:
https://groups.google.com/d/msg/seafile/Y5xfuBGjnRc/1OiZ8hj6lRcJ

So for now block deduplication works between libraries and users without doing anything. But what do we do after block deduplication between libraries is removed? Especially for my use case above?

Thanks

Michele Innocenti

May 27, 2014, 1:24:37 PM
to sea...@googlegroups.com


On Tuesday, May 27, 2014 18:40:25 UTC+2, sain...@gmail.com wrote:

> Hmm, perhaps I was not clear. Block deduplication between libraries is planned to be removed soon:
> https://groups.google.com/d/msg/seafile/Y5xfuBGjnRc/1OiZ8hj6lRcJ
>
> So for now block deduplication works between libraries and users without doing anything. But what do we do after block deduplication between libraries is removed? Especially for my use case above?

I had not seen that, and if so it is bad news for me.
If so, I think it will be the update program that deals with the duplication.

https://groups.google.com/forum/#!topic/seafile/vw0lDoF9VS8

Saint Germain

May 27, 2014, 2:43:01 PM
to sea...@googlegroups.com
On Tue, 27 May 2014 10:24:37 -0700 (PDT), Michele Innocenti
<inno....@gmail.com> wrote:

> > Hmm, perhaps I was not clear. Block deduplication between libraries
> > is planned to be removed soon:
> > https://groups.google.com/d/msg/seafile/Y5xfuBGjnRc/1OiZ8hj6lRcJ
> >
> > So for now block deduplication works between libraries and
> > users without doing anything. But what do we do after block
> > deduplication between libraries is removed? Especially for my use
> > case above?
>
> I had not seen that, and if so it is bad news for me.
> If so, I think it will be the update program that deals with the
> duplication.
> https://groups.google.com/forum/#!topic/seafile/vw0lDoF9VS8

What update program are you talking about? I don't understand.
My concern was not the update process to Seafile
server 3.1.

I wanted to know if there is a preferred structure to organize our
libraries/sub-libraries/directories in order to optimize the upload
time and deduplication.

However thanks for trying to help me.

Michele Innocenti

May 28, 2014, 4:26:33 AM
to sea...@googlegroups.com


On Tuesday, May 27, 2014 20:43:01 UTC+2, Saint Germain wrote:

> I wanted to know if there is a preferred structure to organize our
> libraries/sub-libraries/directories in order to optimize the upload
> time and deduplication.


If there is no deduplication between different libraries, no structure is better or worse than another.
But you can:
1 - continue to use your current workflow (without deduplication between users)
2 - use a shared library (as you said)
3 - use a single Seafile user for all of your users (A, B, C)

I see no other possibility, but a developer can answer better than I can.

Saint Germain

May 28, 2014, 6:15:06 AM
to sea...@googlegroups.com
On 28 May 2014 10:26, Michele Innocenti <inno....@gmail.com> wrote:
>> I wanted to know if there is a preferred structure to organize our
>> libraries/sub-libraries/directories in order to optimize the upload
>> time and deduplication.
>>
>
> If there is no deduplication between different libraries, there is no
> system better or worse for it.
> But you can:
> 1 - continue to use your way (without deduplication between users)
> 2 - use a shared library (as you said)
> 3 - Use a single seafile user for all of your users (a, b, c)
>
> I see no other possibility but a developer can answer better than me.
>

I am new to Seafile so I don't understand the whole data model yet.
In particular, I don't really understand the concept of sub-libraries
and how they are different from normal libraries.

If Seafile removes deduplication between libraries in the future, do we
still keep deduplication between a library and its sub-libraries? If
yes, perhaps a sub-library can be attached to different libraries and
different users, so that we still have deduplication between libraries
and users.

I would be really interested if a developer can enlighten me on this
library deduplication problem.

Thanks,

Daniel Pan

May 28, 2014, 7:41:42 AM
to sea...@googlegroups.com
This is not deduplication between different libraries. A sub-library is a virtual library created from a directory of the original library. It shares the storage and deduplicates with the original library.
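
One way to picture the difference is the following Python sketch. It is a hypothetical model, not Seafile's data model or API: it assumes each library owns a private block store, treats every file as a single block for brevity, and the names `Library` and `SubLibrary` are invented for the example.

```python
import hashlib


class Library:
    """Hypothetical model: each library owns a private block store."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block ID -> bytes, private to this library
        self.files = {}   # path -> block ID

    def add_file(self, path, data):
        block_id = hashlib.sha1(data).hexdigest()  # one block per file, for brevity
        self.blocks.setdefault(block_id, data)
        self.files[path] = block_id

    def stored_bytes(self):
        return sum(len(block) for block in self.blocks.values())


class SubLibrary:
    """A view onto one directory of a parent library; it owns no storage."""

    def __init__(self, parent, directory):
        self.parent = parent
        self.prefix = directory.rstrip("/") + "/"

    def add_file(self, relative_path, data):
        # Writes go straight into the parent's store, so they deduplicate there.
        self.parent.add_file(self.prefix + relative_path, data)

    def list_files(self):
        return sorted(
            path[len(self.prefix):]
            for path in self.parent.files
            if path.startswith(self.prefix)
        )


lib_a = Library("A")
lib_a.add_file("Images/event/img1.jpg", b"pixels-1")
shared = SubLibrary(lib_a, "Images/eventAshared")
shared.add_file("img1.jpg", b"pixels-1")  # same content: stored once in A

lib_b = Library("B")
lib_b.add_file("Images/event/img1.jpg", b"pixels-1")  # separate store: stored again

print(lib_a.stored_bytes(), lib_b.stored_bytes())  # 8 8
```

Under these assumptions, the copy placed in the sub-library costs nothing extra, while the copy in B's library is stored a second time.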

sain...@gmail.com

May 28, 2014, 7:56:04 AM
to sea...@googlegroups.com
On Wednesday, May 28, 2014 1:41:42 PM UTC+2, Daniel Pan wrote:
> This is not deduplication between different libraries. A sub-library is a virtual library created from a directory of the original library. It shares the storage and deduplicates with the original library.

OK, so I should always use a sub-library when sharing files from a library.
With the workflow I described, the solution looks like:
  - A/Images/event
  - A/Images/eventAshared
  - B/Images/event
  - B/eventAshared
  - C/Images/event
  - C/eventAshared
And we have only a 3x duplication (A/Images/event + A/Images/eventAshared, B/Images/event, and C/Images/event).

Is there any way I can further reduce the duplication factor, given the workflow I described in my first post?

Thanks !

Michele Innocenti

May 28, 2014, 10:16:27 AM
to sea...@googlegroups.com

I think that you have a 3x duplication if you *copy* the images from the shared library of user A (or sub-library) into the other users' libraries.
There is no duplication if you access those files through the shared library.
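
The arithmetic behind the "3x" figure can be sketched in Python as follows. This is an illustrative count, not a measurement: it assumes deduplication is either global or strictly per library and counts pictures rather than bytes. The function name `stored_copies` and the picture IDs are hypothetical.

```python
def stored_copies(libraries, dedup_across_libraries):
    """Count how many picture copies the server keeps.

    libraries maps a library name to the set of picture IDs it contains.
    Duplicates inside one library (including its sub-libraries) are
    assumed to be deduplicated in both modes.
    """
    if dedup_across_libraries:
        # One copy per distinct picture, whichever libraries hold it.
        return len(set().union(*libraries.values()))
    # One copy per distinct picture per library.
    return sum(len(pictures) for pictures in libraries.values())


shared = {"p1", "p2", "p3"}

# Everybody copies the shared pictures into their own library.
copied = {"A": shared, "B": shared, "C": shared}
print(stored_copies(copied, dedup_across_libraries=True))   # 3
print(stored_copies(copied, dedup_across_libraries=False))  # 9, i.e. 3x

# B and C only access A's shared library and copy nothing.
access_only = {"A": shared}
print(stored_copies(access_only, dedup_across_libraries=False))  # 3
```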

Saint Germain

May 29, 2014, 10:49:56 AM
to sea...@googlegroups.com
Of course, but in the workflow I described, Seafile is used not only for
sharing but also for storing the files. The shared folder is only
temporary.
Currently it works perfectly in this use case (no duplication,
optimized upload), but with the loss of deduplication between libraries,
it will be quite cumbersome.