Memcache setup and stats


Fede Naum

Mar 16, 2017, 2:52:12 AM
to rez-config
Hi 

Can you share your memcached server setup and stats, just to see if the numbers we are seeing in our instance are OK?

We recently updated from 1.4.4 to 1.4.35 (which has dynamic slab eviction and support for the CAS (check-and-set) operation, which I'm planning to use).
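For anyone unfamiliar with CAS: it's a read-modify-write guard - `gets` returns a token alongside the value, and `cas` only stores if that token is still current. Below is a minimal in-memory sketch of the semantics; the `MiniCache` class is a stand-in for illustration, not a real client API.

```python
# Illustrative sketch of the check-and-set (CAS) retry loop that memcached's
# "gets"/"cas" commands enable. MiniCache is a stand-in, not a real client.
class MiniCache:
    def __init__(self):
        self._data = {}   # key -> (value, cas_token)
        self._tick = 0

    def gets(self, key):
        """Return (value, cas_token), like memcached's 'gets' command."""
        return self._data.get(key, (None, None))

    def cas(self, key, value, token):
        """Store only if the key's token is unchanged since 'gets'."""
        _, current = self._data.get(key, (None, None))
        if current != token:
            return False  # someone else wrote in between
        self._tick += 1
        self._data[key] = (value, self._tick)
        return True

    def set(self, key, value):
        self._tick += 1
        self._data[key] = (value, self._tick)


def append_entry(cache, key, entry):
    """Retry until our read-modify-write wins the race."""
    while True:
        value, token = cache.gets(key)
        if value is None:
            cache.set(key, [entry])
            return
        if cache.cas(key, value + [entry], token):
            return


cache = MiniCache()
cache.set("resolves", ["a"])
append_entry(cache, "resolves", "b")
print(cache.gets("resolves")[0])  # ['a', 'b']
```

The retry loop is the important part: with a real server, a concurrent writer between `gets` and `cas` makes `cas` fail, and you simply re-read and try again.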

Our server has 2 cores (2600 MHz) and 8 GB RAM, with 7 GB allocated to memcached.
The cache gets full in about a week and starts evicting some entries.
We are caching developers' resolves.

These are the  stats after 14 days of use:


Get stats:
  Hits: 1747297K [100.0%]
  Misses: 783858 [0.0%]
  Rate: 1440.4 requests/sec

Set stats:
  Total: 526512
  Rate: 0.4 requests/sec

Delete stats:
  Hits: 18452 [80.8%]
  Misses: 4398 [19.2%]
  Rate: 0.0 requests/sec


Eviction & reclaimed stats:
  Items evicted: 128095
  Rate: 0.1 evictions/sec
  Reclaimed: 0
  Rate: 0.0 reclaimed/sec
  Expired unfetched: 0
  Evicted unfetched: 45438
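For what it's worth, the get stats above imply a hit ratio of roughly 99.96%, assuming the "K" suffix on the hits figure means thousands (an assumption about the tool's formatting):

```python
# Hit ratio implied by the get stats above, assuming "1747297K" means
# 1,747,297 thousand hits.
hits = 1_747_297 * 1000
misses = 783_858
hit_ratio = hits / (hits + misses)
print(f"{hit_ratio:.2%}")  # 99.96%
```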

Do you have similar numbers? What I'm mainly interested in is the get rate and the eviction rate.
Are you caching developers' resolves? I mean, resolves other than those that come from the REZ_RELEASE_PATH?
How many GB of memory do you have on your memcached server?

Thanks
Fede

Allan Johns

Mar 16, 2017, 5:17:24 PM
to rez-c...@googlegroups.com
Hey Fede,

How'd you get those stats? I've only ever taken notice of two stats - the hit rate, and occasionally the gets/sets per second (gets specifically). Evictions shouldn't matter - memcached is designed to keep entries forever and let old/invalidated entries drop out of the cache.


We have:

- a rez-dedicated memcached instance
- 256 MB
- 99% hit rate
- yes, we cache developer resolves
- get rate swings from 0-1500 gets/sec or so, but usually it'll swing 100-700 or so
- set rates are low, lucky to get over 1/sec

I ran the numbers once and I think even with a paltry 256 MB you can store 10,000-ish good-sized resolves. You do also need room for the package definitions though.
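The estimate works out if a good-sized cached resolve averages somewhere around 25 KB - an assumed figure for illustration, not a measurement:

```python
# Back-of-envelope check of "256 MB stores ~10,000 resolves", assuming an
# average cached-resolve size of ~25 KB (an assumed figure, not measured).
cache_bytes = 256 * 1024 * 1024
avg_resolve_bytes = 25 * 1024
print(cache_bytes // avg_resolve_bytes)  # 10485
```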

We're going to move to dual memcached instances soon but that's more for availability than performance - dropping from 2 to 1 instance isn't a big deal (since the remaining cache will pick up the slack pretty quickly), but 1 to zero is (as you'd know, when you don't have a cache all those tiny package file reads hurt a lot, not to mention you lose cached resolves).

Hth
A




--
You received this message because you are subscribed to the Google Groups "rez-config" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rez-config+unsubscribe@googlegroups.com.
To post to this group, send email to rez-c...@googlegroups.com.
Visit this group at https://groups.google.com/group/rez-config.
For more options, visit https://groups.google.com/d/optout.

Fede Naum

Mar 16, 2017, 10:55:38 PM
to rez-config
Hi Allan, 

We have installed phpMemcachedAdmin 1.2.2.
It is pretty useful for seeing all the stats at once, and from the same interface we can monitor the other instances that we have in other locations (i.e. Sydney / Vancouver).

It is good to know that you get rates as high as 1500 gets/sec as well. I was a bit concerned, as that number looked quite big (but I just asked the wranglers and we had 580,000 jobs completed in the last 24 hours, where ~90% of those use rez, so those numbers might be just right).

Our set rate is 0.4/sec.

I think our resolves are bigger, since they end up stored in slabs of 100K to 600K (I'll double-check this and make sure we are not storing extra data that is not needed).
This is our current slab utilization:

>memcached-tool localhost:11211 display

  #  Item_Size  Max_age   Pages   Count   Full?  Evicted Evict_Time OOM
  1      96B       961s       1       1     yes        0        0    0
  3     152B    774156s       1    6898     yes     1987   773608    0
  4     192B    550796s      30  163829     yes   131636   550791    0
  5     240B    612071s       1    4369     yes     3254   610077    0
  6     304B   1207036s       3   10347     yes      776  1206805    0
  7     384B   1288181s       1    1201     yes        0        0    0
  8     480B   1304973s       1     560     yes        0        0    0
  9     600B   1284988s       1    1042     yes        0        0    0
 10     752B    329037s       3    4182     yes      807   328999    0
 11     944B   1286152s       2    2011     yes        0        0    0
 12     1.2K    781863s       8    7079     yes      867   778023    0
 13     1.4K    339971s      12    8496     yes     3331   339897    0
 14     1.8K    675363s      17    9588     yes     2753   675424    0
 15     2.3K    602837s      27   12177     yes     3678   602737    0
 16     2.8K    339080s      15    5415     yes     2059   338979    0
 17     3.5K    795239s       6    1728     yes      297   794865    0
 18     4.4K   1288323s       4     843     yes        0        0    0
 19     5.5K    848941s       1     184     yes       13   843636    0
 20     6.9K   1194310s       1      32     yes        0        0    0
 21     8.7K   1218002s       1       6     yes        0        0    0
 22    10.8K   1288216s       1      54     yes        0        0    0
 23    13.6K    527816s       1      75     yes       74   588851    0
 24    16.9K    782172s       3     180     yes       48   778253    0
 25    21.2K    159470s       1      48     yes      109   157140    0
 26    26.5K    126427s       2      76     yes      238    86384    0
 27    33.1K    614963s       3      90     yes       77   612160    0
 28    41.4K    331150s       3      72     yes      129   326978    0
 29    51.7K    848968s       5      95     yes       70   845365    0
 30    64.7K    709405s      38     570     yes      595   708414    0
 31    80.9K    331959s      21     252     yes      408   332110    0
 32   101.1K    766833s      15     150     yes      122   764572    0
 33   126.3K    627715s      20     160     yes      168   620703    0
 34   157.9K    331867s      35     210     yes      341   331336    0
 35   197.4K    262477s      35     175     yes      282   262515    0
 36   246.8K    683712s      50     200     yes      191   679488    0
 37   308.5K    582970s     132     396     yes      358   583415    0
 38   385.6K    697045s      68     136     yes      129   693750    0
 39   482.0K    690068s     864    1727     yes     1884   690006    0
 40   602.5K    770073s    2345    2345     yes     2185   770019    0



I think we get this number of resolves because our live "beta" environments get invalidated every time we release a new package (we release ~50 new rez packages a day, which affects ~20 beta environments -> ~1000 resolves a day).
So, just the resolves in the last slab: 2345 resolves of ~600K -> ~1.35 GB.
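The largest slab class in the dump above backs this up:

```python
# The largest slab class in the memcached-tool dump: 2345 items of ~602.5 KB.
items = 2345
item_kb = 602.5
total_gb = items * item_kb / (1024 * 1024)
print(f"{total_gb:.2f} GB")  # 1.35 GB
```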

Our real problem is the eviction of "production" resolves. In our case, ideally, those resolves should last at least a cycle of 2 weeks (or more) in the cache.
It looks like our "beta", "prerelease" and "developer" resolves are filling the cache, and the "production" resolves get evicted.

So I was thinking of something along the lines of setting a TTL of ~4 weeks for every key, and every time we do a GET and hit the cache, we update the TTL of that key. That way the keys that get used often tend to stay in the cache, and other transient keys (for beta or dev environments) get evicted first.
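The refresh-on-hit part of this idea corresponds to memcached's `touch` command (available since 1.4.8). Here is an in-process sketch of the intended behavior - a toy model, not a real client:

```python
import time

# Sketch of the proposed scheme: a TTL that is refreshed on every cache hit,
# so frequently-used keys stay and transient ones expire first. A real
# memcached client would issue the "touch" command to do the refresh.
class RefreshingCache:
    def __init__(self, ttl, clock=time.time):
        self._ttl = ttl
        self._clock = clock
        self._data = {}  # key -> (value, expiry)

    def set(self, key, value):
        self._data[key] = (value, self._clock() + self._ttl)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expiry = entry
        if self._clock() >= expiry:
            del self._data[key]  # expired
            return None
        # hit: push the expiry forward again (the "touch")
        self._data[key] = (value, self._clock() + self._ttl)
        return value


# Simulated clock so the example is deterministic.
now = [0.0]
cache = RefreshingCache(ttl=10, clock=lambda: now[0])
cache.set("prod-resolve", "payload")
now[0] = 8
assert cache.get("prod-resolve") == "payload"   # hit at t=8 refreshes TTL
now[0] = 16
assert cache.get("prod-resolve") == "payload"   # still alive thanks to refresh
now[0] = 27
assert cache.get("prod-resolve") is None        # finally expired
```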

What do you think? Would that impact something else, or should it be fine?

Fede


Allan Johns

Mar 16, 2017, 11:07:57 PM
to rez-c...@googlegroups.com
I don't think updating TTLs like that will change the current behavior much - memcached is an LRU cache anyway, so entries that are hit more often will already stay in the cache longer.
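For reference, LRU behavior in miniature - a sketch, not memcached's implementation (note that memcached's real LRU also operates per slab class, not globally):

```python
from collections import OrderedDict

# Minimal LRU cache illustrating the point above: entries that are read
# often move to the "recently used" end and are the last to be evicted.
class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self._data = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def set(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # evict least recently used


cache = LRUCache(capacity=2)
cache.set("prod", 1)
cache.set("beta", 2)
cache.get("prod")           # "prod" becomes the most recently used
cache.set("dev", 3)         # evicts "beta", not "prod"
print(sorted(cache._data))  # ['dev', 'prod']
```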

Could you instead point rez at a separate memcached instance specifically for production, when appropriate? I think that's gonna get you much better results.

A



Fede Naum

Mar 19, 2017, 8:01:54 PM
to rez-config
I'll try that... I have a spare machine with memcached installed and 2 GB RAM; that should be enough for production presets.
I'll report back when I have some results to compare.
F

Fede Naum

Aug 1, 2017, 9:47:18 AM
to rez-config
I just realized that I never reported back.
For the record, and for anybody interested:
I did end up separating the production cache from the developers' cache.
So we have one memcached instance with 8 GB for caching production resolves, and another instance with 2 GB for caching developers' resolves.
That seems to have alleviated the problem we had with the early evictions and cache misses.

Blazej Floch

Aug 15, 2017, 6:29:07 PM
to rez-config
I have a related question:

We just introduced memcached, but twice (within a week) we noticed a stale hit when using a tool that resolves.
It basically did not respect a newer release of a package, for a particular user.

Interestingly, from the console the same solve would not return the invalid cached result.
When it happened, only a hard flush would help; otherwise these users were stuck on the old version.

Allan, can you give a brief overview of how rez caches, when it flushes, and what the context of the cache is, so I can get a better idea of why this happened and what's wrong with our setup?

Thanks,
Blazej 

Fede Naum

Aug 16, 2017, 7:02:13 AM
to rez-config
Hi Blazej, 

I don't think I understand the question.
Are you saying that a new version of a tool was released, and the rez resolve did not pick up that new version? And that this only happened for a particular user?

When you make the distinction "from the console", what is the other mechanism through which you are invoking rez? From Python through the API?

When you refer to a hard flush, are you using the clear_caches function, or are you going to the memcached server and doing a flush_all?


If it helps, a brief explanation of how this works is inside the code, in the resolver module - have a look here.

Fede  


Blazej Floch

Aug 16, 2017, 3:27:27 PM
to rez-config
Yes you pretty much nailed it. I'll try to clarify:

We have a studio wide memcached server.

A new version of a tool was released.

A single user did not receive the update, although she restarted the process (GUI) that triggers the rez resolve.

In the same session doing a solve from the console did resolve correctly.

I basically restarted the memcached server to fix it (that's as hard a flush as it can get); this pretty much verified that the issue was related to memcached.

I was expecting rez to flush the cache on a new release (version, package, variant).
Then again I wonder what the context is that is being stored for a resolve. Is it caching per user at all?

I would like to emphasize that I personally do not have a reproducible case in front of me yet; I was just wondering if there is something obvious that might have gone wrong.

Thanks for the hint, I'll look into the code.

Cheers,
Blazej

Fede Naum

Aug 17, 2017, 9:35:11 AM
to rez-config
Hi Blazej,  

See my answers inline.

On Thu, Aug 17, 2017 at 5:27 AM, Blazej Floch <blazej...@gmail.com> wrote:
> Yes you pretty much nailed it. I'll try to clarify:
>
> We have a studio wide memcached server.
>
> A new version of a tool was released.
>
> A single user did not receive the update, although she restarted the process (GUI) that triggers the rez resolve.
>
> In the same session, doing a solve from the console did resolve correctly.
>
> I basically restarted the memcached server to fix it (that's as hard a flush as it can get); this pretty much verified that the issue was related to memcached.
>
> I was expecting rez to flush the cache on a new release (version, package, variant).

Note: I think the correct term here would be to invalidate the cache, not to flush it.

Is the request through the GUI (the one that triggers the resolve) locked to a timestamp, vs. the console one being non-timestamped?

Could your issue be around these lines? See the comment in the code:

        This behaviour exists specifically so that resolves that use a
        timestamp but set that to the current time, can be reused by other
        resolves if nothing has changed. Older resolves however, can only be
        reused if the timestamp matches exactly (but this might happen a lot -
        consider a workflow where a work area is tied down to a particular
        timestamp in order to 'lock' it from any further software releases).
> Then again I wonder what the context is that is being stored for a resolve. Is it caching per user at all?

No, the cache entry is not per user. What gets stored is the resolve (list of packages + any implicit packages) + the resolve graph + the timestamp (if a timestamped request) + the uid of the repository (the filesystem where you have your packages, normally the REZ_RELEASE_PATH) - see the code here.

If it helps somehow: when I did an analysis at some point of why some resolves were missing the cache (not exactly the problem that you have), I found that different machines add different implicit packages, i.e. a CentOS-6.2 machine and a CentOS-6.6 machine will produce different resolves.

The other thing that caught us out one time was the uid of the filesystem. At some point we wanted to run a performance test comparing the rez repository mounted on a solid-state disk versus a spinning disk, so there were 2 groups of machines (and users) mounting the same directory (REZ_RELEASE_PATH), but with that directory hosted on 2 different filers. The call to get the uid was then returning 2 different ids, and hence, even though the resolve was cached by one machine, another user asking for the same set of packages would miss the cache, as the key was different.
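A toy illustration of the filer issue - the key shape below is simplified for the example (rez's real key has more fields), and the uids are made up:

```python
# Why two groups of users missed each other's cache entries: the resolve
# cache key includes each repository's (path, uid) pair, so the same mount
# path served by two different filers yields two different keys.
# (Simplified key shape; see rez's resolver module for the real one.)
def resolve_key(request, repos):
    return ("resolve", tuple(request), tuple(repos))


request = ["maya-2016", "~platform==linux"]
filer_a = [("filesystem", "/net/packages", 5311951666)]
filer_b = [("filesystem", "/net/packages", 4523540188)]  # same path, other filer

key_a = resolve_key(request, filer_a)
key_b = resolve_key(request, filer_b)
print(key_a == key_b)  # False
```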


> I would like to emphasize that I personally do not have a reproducible case in front of me yet; I was just wondering if there is something obvious that might have gone wrong.

Next time you have the issue, you can debug it a bit using the following configs:
# Print debugging info related to use of memcached during a resolve
debug_resolve_memcache = False

# Debug memcache usage. As well as printing debugging info to stdout, it also
# sends human-readable strings as memcached keys (that you can read by running
# "memcached -vv" as the server)
debug_memcache = False
You can also set these via the environment variables REZ_DEBUG_RESOLVE_MEMCACHE=True and REZ_DEBUG_MEMCACHE=True.
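Spelled out as shell commands (the package name in the comment is a placeholder):

```shell
# Same settings as above, via environment variables:
export REZ_DEBUG_RESOLVE_MEMCACHE=True
export REZ_DEBUG_MEMCACHE=True
# then run the problem resolve, e.g.:
#   rez-env mypackage
# and the memcache debug info is printed to stdout during the resolve.
echo "$REZ_DEBUG_RESOLVE_MEMCACHE $REZ_DEBUG_MEMCACHE"
```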

Hope this helps

Fede


Blazej Floch

Aug 18, 2017, 10:50:54 AM
to rez-config
Excellent points! Highly appreciated.

I'll look into it.

Allan Johns

Aug 21, 2017, 3:54:47 PM
to rez-c...@googlegroups.com
I haven't quite gotten to the bottom of it yet but there do exist cases where cache entries are not correctly invalidated. There are a couple of things to take note of:

1. Not sure if this describes your case, but Paul at Luma has discovered a flaw in the caching logic. If you have a resolve that picked up a local package, and a newer, non-local version of that package is released, then that cached resolve is not invalidated (I *think* - I need to investigate properly to be confident). This happens because, to test whether a resolve is invalidated, rez checks for any new releases of the packages in the resolve - BUT it does this only by checking within the repo of each resolved package. So in this case it will check for newer releases in the local packages path only, and will miss the central release.

2. In order to make cache invalidation checks cheap, rez relies on the fact that, when a new package is installed, the topmost unversioned package directory is touched. That way, only one file stat per package family is needed to test whether any newer packages have been released. So if for any reason this dir mod time doesn't get changed (eg, a package is manually altered), then this can cause stale cache entries not to be invalidated.
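The check in (2) can be sketched like this - a simplified illustration, not rez's actual code:

```python
import os
import tempfile

# Sketch of the cheap invalidation check described above: one stat() of the
# topmost package family directory tells you whether anything newer was
# released since the resolve was cached.
def family_mtime(repo_path, family):
    return os.stat(os.path.join(repo_path, family)).st_mtime


repo = tempfile.mkdtemp()
os.mkdir(os.path.join(repo, "maya"))          # the family dir, e.g. <repo>/maya
cached_at = family_mtime(repo, "maya")        # recorded when the resolve is cached

# A new release touches the family dir; simulate by bumping the mtime.
path = os.path.join(repo, "maya")
os.utime(path, (cached_at + 10, cached_at + 10))

stale = family_mtime(repo, "maya") > cached_at
print(stale)  # True
```

This also shows why a package altered without touching the family dir (or a dir the release user can't touch) leaves a stale entry: the mtime never moves, so the check passes.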

I'm planning on fixing (1), this should be fairly straightforward to do. Would be good to see if this fixes the existing issue (which unfortunately is hard to repro, as I think you've found out).

Cheers
A


Blazej Floch

Aug 28, 2017, 3:18:11 PM
to rez-config
Thanks for the additional hints. I'm waiting for it to happen again before I can take action :)

I did not know about (2). Since we release as a dedicated user to protect against changes, maybe some of the topmost folders have insufficient permissions to be updated. I'll take a look.

Blazej Floch

Oct 11, 2017, 7:22:39 PM
to rez-config
I haven't finished the investigation, but we identified the root cause of the mismatching cache entries, thanks to one of our TDs.

I just want to throw it out there, in case someone can spot something unexpected right away.

I've replaced the literal names for simplicity.

Our rezconfig has:

packages_path = [
    "${TBE_REZ_LOCAL_INTERNAL_DIR}",
    "${TBE_REZ_LOCAL_VIRTUAL_DIR}",
    "${TBE_REZ_LOCAL_THIRDPARTY_DIR}",

    "${TBE_REZ_LOCATION_DIR}",
    "${TBE_REZ_RELEASE_INTERNAL_DIR}",
    "${TBE_REZ_RELEASE_VIRTUAL_DIR}",
    "${TBE_REZ_RELEASE_THIRDPARTY_DIR}",
]

The first three folders are the local build folders, meaning they contain the username.

For most TDs, including me, the folders exist, and a memcached resolve looks something like this (reformatted):

7:39:38 DEBUG    Retrieving memcache key:
   "('resolve', ('packagename', '~platform==linux', '~arch==x86_64', '~os==CentOS-7.2'),

   (('filesystem', '/somepath/b.floch/internal', 5805584676),
    ('filesystem', '/somepath/b.floch/virtual', 5787412802),
    ('filesystem', '/somepath/b.floch/thirdparty', 5797940304),

    ('filesystem', '/otherpath/location', 5311951666),
    ('filesystem', '/otherpath/release/internal', 4523540188),
    ('filesystem', '/otherpath/release/virtual',4543421042),
    ('filesystem', '/otherpath/release/thirdparty', 4523540189)),
    'cebb675c437da423bfcd0e21bc2fa2c4ce9c92f', '', False, True)"
7:39:38 DEBUG    memcache get (resolve) took 0.195131063461


However, most of the floor users don't actually have any folder at the local path.

The cached resolve:

17:29:25 DEBUG    Retrieving memcache key: "('resolve', ('vp_tnj3_katana', '~platform==linux', '~arch==x86_64', '~os==CentOS-7.2'),
    (('filesystem', '/somepath/some.floorusername/internal'),
     ('filesystem', '/somepath/some.floorusername/virtual'),
     ('filesystem', '/somepath/some.floorusername/thirdparty'),

    ('filesystem', '/otherpath/location', 5311951666),
    ('filesystem', '/otherpath/release/internal', 4523540188),
    ('filesystem', '/otherpath/release/virtual',4543421042),
    ('filesystem', '/otherpath/release/thirdparty', 4523540189)),
     'cebb675c437da423bfcd0e21bbc2fa2c4ce9c92f', '', False, True)"
17:29:25 DEBUG    memcache get (resolve) took 0.183612108231


Note how the non-existent local folders seem to not have the number (which I assume is a uid).

So it seems to be very close to Fede's issue with the SSD benchmark: there he got different uids for different groups, but in our floor users' case there is no uid at all, and therefore they are stuck with the first cache entry ever created (since the last flush, I would think). Is this possible? Is this perhaps a bug?

What makes me wonder is that the updates actually happened in the release folder, which the floor users do have access to. Is this related to John's (1) issue?

Appreciate your opinions.

Blazej Floch

Oct 11, 2017, 7:34:14 PM
to rez-config
Sorry I meant "Allan's (1)" issue.

Blazej Floch

Oct 13, 2017, 3:56:47 PM
to rez-config
I tried stepping through a case where the cache would not resolve properly.

Here are my findings:

In this particular case the problem was here:

https://github.com/nerdvegas/rez/blob/master/src/rez/resolver.py#L236

The request is something like:

rez-env A

The cache for this particular user was for A-1.0, while A-2.0 had meanwhile been released.

But here is my question:
The data which is retrieved from the cache already defines a version (1.0).

When I step into get_variant_state_handle, variant_states.get(variant) ends up being None (I assume that's just a local cache for optimization purposes):

https://github.com/nerdvegas/rez/blob/master/src/rez/resolver.py#L236

... I will eventually end up in parent() of the FileSystemVariant resource, which according to:

https://github.com/nerdvegas/rez/blob/master/src/rezplugins/package_repository/filesystem.py#L247

... reuses that version (presumably based on the resource from the cache). Am I right to assume that this enforces the cached version, and that there is no way A-2.0 will be solved?

Hence the new_state ends up being equal to the old_state, and there is no way that the cache will be invalidated? Or is the state_handle global to the package family?
Regardless, it does not pick up the timestamp of the latest package, just the one retrieved from the cache.

Let me know if I missed something. In case this is the issue, what would be the right approach to fix it?

I am not sure how this relates to my previous message, in which the missing uid seemed to be the trigger. I would assume that it would suffer from the problem regardless of whether the uid is set or not.
The real difference in my mind is that the request itself has changed, but I have a hard time imagining that this is a bug that went unnoticed.

I still have the feeling we are experiencing (1). I will look in this direction.

I could live with not having caches when local search paths exist. This would slow down TDs, but the floor would be fine.
So I am actually considering having two configs: one for the floor, without local paths and with memcache enabled, and one for TDs, with local paths but without the cache.

I hope there is a better way.

Allan Johns

Oct 16, 2017, 8:33:40 PM
to rez-c...@googlegroups.com
Hey Blazej,

I haven't been able to look at this yet, but it's definitely on the list.

The one thing I would say is that you should definitely have a REZ_PACKAGES_PATH setup in production that does NOT include users' local packages paths, because that's going to get you the highest reuse of cache entries. Including local paths means that every user's paths differ, so there will be no reuse of caches across users at all! This would reduce your cache hit ratio quite substantially, I think.

At Method, our system for managing use of rez in production never includes local package paths unless that's explicitly asked for via a flag (++local).
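As a sketch, a production-facing rezconfig along those lines would drop the local entries entirely (the paths below are hypothetical, modeled on the release dirs earlier in the thread):

```python
# Hypothetical production rezconfig: release repositories only, no per-user
# local paths, so all users share the same resolve cache keys. Local paths
# would be added back only when explicitly requested (e.g. a "++local" flag).
packages_path = [
    "/rez/release/internal",
    "/rez/release/virtual",
    "/rez/release/thirdparty",
]
```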

Hth
A

Blazej Floch

Oct 17, 2017, 6:23:27 PM
to rez-config
Makes sense. I'll try to incorporate this in our system.