Hi Salvatore, Quick update on our progress so far. we have been looking into the latency issue: > + /* make clone with modified cow destructors for db dict */
> Date: Mon, 30 Apr 2012 23:03:34 +0200
> Subject: Re: Redis for Windows is now stronger
> To: redis-db@googlegroups.com
> Claudio thank you a lot!
> This is an excellent explanation, a few remarks and questions follow.
> 1) I think that the idea of making the read-only efficient copy is
> great, very cool!
> 2) I also like the iterator trick, this really is a good abstraction
> to split complexity into different pieces.
> 3) If I understand correctly, the read-only copy is performed in the
> main thread blocking it, but this only matters for big objects.
> 4) Based on what I read you'll have little troubles rebasing your
> changes to 2.6, that's good.
> so "3" is perfectly reasonable for an usage of Redis like an object
> store (many small hashes) and many other users.
> I'm more concerned about this:
> + /* make clone with modified cow destructors for db dict */
> + server.cowSaveDbExt[db->id].dictArray = copyReadonly_dictobj(..
> This seems to block copying the whole dictionary. With many keys this
> can be a problem causing a big amount of latency.
> I've a few ideas for an alternative implementation of user space COW,
> no free lunch... but I'll try to think a bit more about this ideas in
> the next days and will try to submit it to you, in the hope they can
> help or at least that can be excluded as non-promising.
> Btw there are tree main ways I can see:
> 1) To relax requirements, for instance save with just key-consistency,
> no point-in-time. This means that transactions will not be guaranteed
> in the dump, but otherwise may make sense.
> 2) To use a separate thread that takes a duplicated copy of the whole
> dataset, possibly optimized in some way to take less space for the not
> actively used keys. So when you have to persist this other thread
> saves the RDB, and the first serves queries accumulating the changes.
> It's like doing it already with a master and a slave, but optimized
> and automatic.
> 3) What I think it's the gooood thing: segmented AOF with off-line
> rewriting. WE ARE INTERESTED IN THIS.
> What is "3"?
> It's easy, imagine if we segment AOF in aof.1, aof.2, aof.3, aof.N, so
> that every file is at max K Megabytes.
> That's trivial... still an append-only business, we need just to
> switch file when the limit is reached.
> Then we can analyze all the AOF files but the currently active one,
> rewriting them offline.
> For instance if we find "DEL foo" after "INCR foo", "INCR foo", we
> know in the rewritten version we can remove all those three commands.
> Of course we can make sure that DELs are issued in many interesting cases.
> For instance: SREM makes a set empty? Emit a DEL instead.
> Trace all the keys deleted by scripts and emit DELs, and so forth.
> It's still not clear if this can really work well. Imagine for
> instance a Redis instance used only to store a single never deleted
> sorted key (very real use case).
> But still... would be very good for many use cases.
> Cheers,
> Salvatore
> On Mon, Apr 30, 2012 at 9:27 PM, Claudio Caldato <ccald...@hotmail.com> wrote:
> > I'll jump on this discussion with some details that may help to understand
> > how the code works.
> > There is an overview of how the code work in the win32_cow.c file (included
> > below).
> > The code defers deletion of objects during the save - if the ref count is
> > going to be 0, the reference is added to a list. After save is done, it
> > decrements the ref count of all objects on the list.
> > Some objects are modified in place rather than deleted. Typically collection
> > objects such as sets, lists, hash tables. For these objects the code makes a
> > read-only version of them before they are modified. The read-only version is
> > a more efficient storage as it doesn't need to support modifications. The
> > save code will use the read-only version if it exists -otherwise it uses the
> > normal collection.
> > The code uses special iterators for the saving code.
> > - If the collection being iterated is not modified, it acts as a normal
> > iterator.
> > - If a read-only version copy, it iterates on the read-only copy.
> > - If a read-only copy is created while the save is in the middle of
> > iterating on the original collector, the iterator is modified to switch to
> > the read-only copy. Locks are used to ensure this is done safely.
> > As a result, if no updates are done during the save, there is very little
> > extra overhead (no copy).
> > If some updates are done, there is some copying required, but only for the
> > collection being modified.
> > If the same collection is modified n times during a save, only the first
> > modification results in a copy.
> > The special read-only encoding of the collection is used to reduce the cost
> > of allocating, copying and then freeing the copy. A collection with
> > thousands of entries would normally require thousands of allocate and frees.
> > With the read-only encoding, it requires 1. It also uses less memory.
> > This is the best we came up with to replicate a COW-like behavior without
> > changing drastically Redis code and hence make very difficult any future
> > integration.
> > Salvatore, Dusan, All: Suggestions and other ideas on how we can make it
> > better are welcome.
> > Claudio /************************************************************************
> > * This module implements copy on write to support
> > * saving on a background thread in Windows.
> > *
> > * Collection objects (dictionaries, lists, sets, zsets)
> > * are copied to a read-only form if a command to modify the
> > * collection is started. This is triggered via lookupKeyWrite().
> > *
> > * Objects which are modified in place - ziplist, zipset, etc.
> > * are copied before being modified.
> > * Strings are normally copied before being modified.
> > *
> > * In addition deletion of objects is deferred until the save is completed.
> > * This is done by modifying the dictionary delete function, and also
> > * by modifying the decrRefCount function.
> > *
> > * To allow conversion of collections while the save is iterating on them
> > * special iterators are used. These iterators can be migrated
> > * from their normal mode to iterating over a read-only collection.
> > * Locking is used so that iterator can be used from 2 threads.
> > * For migration to work properly, only one save at a time may run.
> > * (this restriction was already imposed in the Redis code)
> > *
> > ************************************************************************/
> > Sent from my Windows 8 PC
> > From: Du�an D. Majki�
> > Sent: Monday, April 30, 2012 8:02:15 AM
> > To: redis-db@googlegroups.com
> > Subject: Re: Redis for Windows is now stronger
> >> Not sure how a Redis value, like a list, can be copied using memcpy(),
> >> I'm probably missing something.
> >> Maybe the win32 port uses different data structures that can be copied
> >> in this way?
> > There is a call to cowEnsureWriteCopy() on every key change/delete.
> > That function, in critical section does the COW. Here it is:
> > https://github.com/MSOpenTech/redis/blob/bksavecow/src/win32_cow.c#L467
> > Complex types are converted to simple array like here:
> > https://github.com/MSOpenTech/redis/blob/bksavecow/src/win32_cow.c#L235
> > It looks like only keys and values pointers are copied without ref
> > count increase,
> > and the original key is deferred from deletion.
> > --
> > You received this message because you are subscribed to the Google Groups
> > "Redis DB" group.
> > To post to this group, send email to redis-db@googlegroups.com.
> > To unsubscribe from this group, send email to
> > redis-db+unsubscribe@googlegroups.com.
> > For more options, visit this group at
> > http://groups.google.com/group/redis-db?hl=en.
> > --
> > You received this message because you are subscribed to the Google Groups
> > "Redis DB" group.
> > To post to this group, send email to redis-db@googlegroups.com.
> > To unsubscribe from this group, send email to
> > redis-db+unsubscribe@googlegroups.com.
> > For more options, visit this group at
> > http://groups.google.com/group/redis-db?hl=en.
> --
> Salvatore 'antirez' Sanfilippo
> open source developer - VMware
> http://invece.org
> "We are what we repeatedly do. Excellence, therefore, is not an act,
> but a habit." -- Aristotele
> --
> You received this message because