Memcache setup and stats


Fede Naum

Mar 16, 2017, 2:52:12 AM
to rez-config
Hi 

Can you share your memcached server setup and stats, just to see if the numbers we are seeing in our instance are OK?

We recently updated from 1.4.4 to 1.4.35 (which has dynamic slab eviction and support for the CAS (check-and-set) operation, which I'm planning to use).
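For anyone unfamiliar with CAS: it's a read-modify-write guard - `gets` returns a token alongside the value, and `cas` only stores if that token is still current. Below is a minimal in-memory sketch of the semantics; the `MiniCache` class is a stand-in for illustration, not a real client API.

```python
# Illustrative sketch of the check-and-set (CAS) retry loop that memcached's
# "gets"/"cas" commands enable. MiniCache is a stand-in, not a real client.
class MiniCache:
    def __init__(self):
        self._data = {}   # key -> (value, cas_token)
        self._tick = 0

    def gets(self, key):
        """Return (value, cas_token), like memcached's 'gets' command."""
        return self._data.get(key, (None, None))

    def cas(self, key, value, token):
        """Store only if the key's token is unchanged since 'gets'."""
        _, current = self._data.get(key, (None, None))
        if current != token:
            return False  # someone else wrote in between
        self._tick += 1
        self._data[key] = (value, self._tick)
        return True

    def set(self, key, value):
        self._tick += 1
        self._data[key] = (value, self._tick)


def append_entry(cache, key, entry):
    """Retry until our read-modify-write wins the race."""
    while True:
        value, token = cache.gets(key)
        if value is None:
            cache.set(key, [entry])
            return
        if cache.cas(key, value + [entry], token):
            return


cache = MiniCache()
cache.set("resolves", ["a"])
append_entry(cache, "resolves", "b")
print(cache.gets("resolves")[0])  # ['a', 'b']
```

The retry loop is the important part: with a real server, a concurrent writer between `gets` and `cas` makes `cas` fail, and you simply re-read and try again.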

Our server has 2 cores (2600 MHz) and 8 GB RAM, with 7 GB allocated to memcached.
The cache gets full in about a week and starts evicting some entries.
We are caching developers' resolves.

These are the  stats after 14 days of use:


Get stats:
  Hits: 1747297K [100.0%]
  Misses: 783858 [0.0%]
  Rate: 1440.4 requests/sec

Set stats:
  Total: 526512
  Rate: 0.4 requests/sec

Delete stats:
  Hits: 18452 [80.8%]
  Misses: 4398 [19.2%]
  Rate: 0.0 requests/sec


Eviction & reclaimed stats:
  Items evicted: 128095
  Rate: 0.1 evictions/sec
  Reclaimed: 0
  Rate: 0.0 reclaimed/sec
  Expired unfetched: 0
  Evicted unfetched: 45438
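For what it's worth, the get stats above imply a hit ratio of roughly 99.96%, assuming the "K" suffix on the hits figure means thousands (an assumption about the tool's formatting):

```python
# Hit ratio implied by the get stats above, assuming "1747297K" means
# 1,747,297 thousand hits.
hits = 1_747_297 * 1000
misses = 783_858
hit_ratio = hits / (hits + misses)
print(f"{hit_ratio:.2%}")  # 99.96%
```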

Do you have similar numbers? What I'm mainly interested in is the get rate and the eviction rate.
Are you caching developers' resolves? I mean, resolves other than those that come from the REZ_RELEASE_PATH?
How many GB of memory do you have on your memcached server?

Thanks
Fede

Allan Johns

Mar 16, 2017, 5:17:24 PM
to rez-c...@googlegroups.com
Hey Fede,

How'd you get those stats? I've only ever taken notice of two stats - the hit rate, and occasionally the gets/sets per second (gets specifically). Evictions shouldn't matter - memcached is designed to keep entries forever and let old/invalidated entries drop out of the cache.


We have:

- a rez-dedicated memcached instance
- 256 MB
- 99% hit rate
- yes, we cache developer resolves
- get rate swings from 0-1500 gets/sec or so, but usually it'll swing 100-700 or so
- set rates are low, lucky to get over 1/sec

I ran the numbers once and I think even with a paltry 256 MB you can store 10,000-ish good-sized resolves. You do also need room for the package definitions though.
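The estimate works out if a good-sized cached resolve averages somewhere around 25 KB - an assumed figure for illustration, not a measurement:

```python
# Back-of-envelope check of "256 MB stores ~10,000 resolves", assuming an
# average cached-resolve size of ~25 KB (an assumed figure, not measured).
cache_bytes = 256 * 1024 * 1024
avg_resolve_bytes = 25 * 1024
print(cache_bytes // avg_resolve_bytes)  # 10485
```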

We're going to move to dual memcached instances soon but that's more for availability than performance - dropping from 2 to 1 instance isn't a big deal (since the remaining cache will pick up the slack pretty quickly), but 1 to zero is (as you'd know, when you don't have a cache all those tiny package file reads hurt a lot, not to mention you lose cached resolves).

Hth
A




--
You received this message because you are subscribed to the Google Groups "rez-config" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rez-config+unsubscribe@googlegroups.com.
To post to this group, send email to rez-c...@googlegroups.com.
Visit this group at https://groups.google.com/group/rez-config.
For more options, visit https://groups.google.com/d/optout.

Fede Naum

Mar 16, 2017, 10:55:38 PM
to rez-config
Hi Allan, 

We have installed phpMemcachedAdmin 1.2.2.
It is pretty useful for seeing all the stats at once, and from the same interface we can monitor the other instances that we have in other locations (i.e. Sydney / Vancouver).

It is good to know that you get rates as high as 1500 gets/sec as well. I was a bit concerned, as that number looked quite big (but I just asked the wranglers and we had 580,000 jobs completed in the last 24 hours, where ~90% of those use rez, so those numbers might be just right).

Our set rate is 0.4/sec.

I think our resolves are bigger, since they end up stored in slabs of 100K to 600K (I'll double-check this and make sure we are not storing extra data that is not needed).
This is our current slab utilization:

>memcached-tool localhost:11211 display

  #  Item_Size  Max_age   Pages   Count   Full?  Evicted Evict_Time OOM
  1      96B       961s       1       1     yes        0        0    0
  3     152B    774156s       1    6898     yes     1987   773608    0
  4     192B    550796s      30  163829     yes   131636   550791    0
  5     240B    612071s       1    4369     yes     3254   610077    0
  6     304B   1207036s       3   10347     yes      776  1206805    0
  7     384B   1288181s       1    1201     yes        0        0    0
  8     480B   1304973s       1     560     yes        0        0    0
  9     600B   1284988s       1    1042     yes        0        0    0
 10     752B    329037s       3    4182     yes      807   328999    0
 11     944B   1286152s       2    2011     yes        0        0    0
 12     1.2K    781863s       8    7079     yes      867   778023    0
 13     1.4K    339971s      12    8496     yes     3331   339897    0
 14     1.8K    675363s      17    9588     yes     2753   675424    0
 15     2.3K    602837s      27   12177     yes     3678   602737    0
 16     2.8K    339080s      15    5415     yes     2059   338979    0
 17     3.5K    795239s       6    1728     yes      297   794865    0
 18     4.4K   1288323s       4     843     yes        0        0    0
 19     5.5K    848941s       1     184     yes       13   843636    0
 20     6.9K   1194310s       1      32     yes        0        0    0
 21     8.7K   1218002s       1       6     yes        0        0    0
 22    10.8K   1288216s       1      54     yes        0        0    0
 23    13.6K    527816s       1      75     yes       74   588851    0
 24    16.9K    782172s       3     180     yes       48   778253    0
 25    21.2K    159470s       1      48     yes      109   157140    0
 26    26.5K    126427s       2      76     yes      238    86384    0
 27    33.1K    614963s       3      90     yes       77   612160    0
 28    41.4K    331150s       3      72     yes      129   326978    0
 29    51.7K    848968s       5      95     yes       70   845365    0
 30    64.7K    709405s      38     570     yes      595   708414    0
 31    80.9K    331959s      21     252     yes      408   332110    0
 32   101.1K    766833s      15     150     yes      122   764572    0
 33   126.3K    627715s      20     160     yes      168   620703    0
 34   157.9K    331867s      35     210     yes      341   331336    0
 35   197.4K    262477s      35     175     yes      282   262515    0
 36   246.8K    683712s      50     200     yes      191   679488    0
 37   308.5K    582970s     132     396     yes      358   583415    0
 38   385.6K    697045s      68     136     yes      129   693750    0
 39   482.0K    690068s     864    1727     yes     1884   690006    0
 40   602.5K    770073s    2345    2345     yes     2185   770019    0



I think we get this number of resolves because our live "beta" environments get invalidated every time we release a new package (we release ~50 new rez packages a day, which affects ~20 beta environments -> ~1000 resolves a day).
So, just the resolves in the last slab: 2345 resolves of ~600K -> ~1.35 GB.
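The largest slab class in the dump above backs this up:

```python
# The largest slab class in the memcached-tool dump: 2345 items of ~602.5 KB.
items = 2345
item_kb = 602.5
total_gb = items * item_kb / (1024 * 1024)
print(f"{total_gb:.2f} GB")  # 1.35 GB
```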

Our real problem is the eviction of "production" resolves. In our case, ideally, those resolves should last at least a cycle of 2 weeks (or more) in the cache.
It looks like our "beta", "prerelease" and "developer" resolves are filling the cache, and the "production" resolves get evicted.

So I was thinking of something along the lines of setting a TTL of ~4 weeks for every key, and every time we do a GET and hit the cache, we update the TTL of that key. That way the keys that get used often tend to stay in the cache, and other transient keys (for beta or dev environments) get evicted first.
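The refresh-on-hit part of this idea corresponds to memcached's `touch` command (available since 1.4.8). Here is an in-process sketch of the intended behavior - a toy model, not a real client:

```python
import time

# Sketch of the proposed scheme: a TTL that is refreshed on every cache hit,
# so frequently-used keys stay and transient ones expire first. A real
# memcached client would issue the "touch" command to do the refresh.
class RefreshingCache:
    def __init__(self, ttl, clock=time.time):
        self._ttl = ttl
        self._clock = clock
        self._data = {}  # key -> (value, expiry)

    def set(self, key, value):
        self._data[key] = (value, self._clock() + self._ttl)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expiry = entry
        if self._clock() >= expiry:
            del self._data[key]  # expired
            return None
        # hit: push the expiry forward again (the "touch")
        self._data[key] = (value, self._clock() + self._ttl)
        return value


# Simulated clock so the example is deterministic.
now = [0.0]
cache = RefreshingCache(ttl=10, clock=lambda: now[0])
cache.set("prod-resolve", "payload")
now[0] = 8
assert cache.get("prod-resolve") == "payload"   # hit at t=8 refreshes TTL
now[0] = 16
assert cache.get("prod-resolve") == "payload"   # still alive thanks to refresh
now[0] = 27
assert cache.get("prod-resolve") is None        # finally expired
```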

What do you think? Would that impact something else, or should it be fine?

Fede


Allan Johns

Mar 16, 2017, 11:07:57 PM
to rez-c...@googlegroups.com
I don't think updating TTLs like that will change the current behavior much - memcached is an LRU cache anyway, so entries that are hit more often will already stay in the cache longer.
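For reference, LRU behavior in miniature - a sketch, not memcached's implementation (note that memcached's real LRU also operates per slab class, not globally):

```python
from collections import OrderedDict

# Minimal LRU cache illustrating the point above: entries that are read
# often move to the "recently used" end and are the last to be evicted.
class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self._data = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def set(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # evict least recently used


cache = LRUCache(capacity=2)
cache.set("prod", 1)
cache.set("beta", 2)
cache.get("prod")           # "prod" becomes the most recently used
cache.set("dev", 3)         # evicts "beta", not "prod"
print(sorted(cache._data))  # ['dev', 'prod']
```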

Could you instead point rez at a separate memcached instance specifically for production, when appropriate? I think that's gonna get you much better results.

A



Fede Naum

Mar 19, 2017, 8:01:54 PM
to rez-config
I'll try that... I have a spare machine with memcached installed and 2 GB RAM; that should be enough for production presets.
I'll report back when I have some results to compare.
F

Fede Naum

Aug 1, 2017, 9:47:18 AM
to rez-config
I just realized that I never reported back.
For the record, and for anybody interested:
I did end up separating the production cache from the developers' cache.
So we have one memcached instance with 8 GB for caching production resolves, and another instance with 2 GB for caching developers' resolves.
That seems to have alleviated the problem we had with the early evictions and cache misses.

Blazej Floch

Aug 15, 2017, 6:29:07 PM
to rez-config
I have a related question:

We just introduced memcached, but twice (within a week) we noticed a stale hit when using a tool that resolves.
It basically did not respect a newer release of a package, for a particular user.

Interestingly, from the console the same solve would not return the invalid cached result.
When it happened, only a hard flush would help; otherwise these users were stuck on the old version.

Allan, can you give a brief overview of how rez caches, when it flushes, and what the context of the cache is, so I can get a better idea of why this happened and what's wrong with our setup?

Thanks,
Blazej 

Fede Naum

Aug 16, 2017, 7:02:13 AM
to rez-config
Hi Blazej, 

I don't think I understand the question.
Are you saying that a new version of a tool was released, and the rez resolve did not pick up that new version? And that this only happened for a particular user?

When you make the distinction "from the console", what is the other mechanism through which you are invoking rez? From Python through the API?

When you refer to a hard flush, are you using the clear_caches function, or are you going to the memcached server and doing a flush_all?


If it helps, a brief explanation of how this works is inside the code, in the resolver module - have a look here.

Fede  


Blazej Floch

Aug 16, 2017, 3:27:27 PM
to rez-config
Yes you pretty much nailed it. I'll try to clarify:

We have a studio wide memcached server.

A new version of a tool was released.

A single user did not receive the update, although she restarted the process (GUI) that triggers the rez resolve.

In the same session doing a solve from the console did resolve correctly.

I basically restarted the memcached server to fix it (that's as hard a flush as it can get); this pretty much verified that the issue was related to memcached.

I was expecting rez to flush the cache on a new release (version, package, variant).
Then again I wonder what the context is that is being stored for a resolve. Is it caching per user at all?

I would like to emphasize that I personally do not have a reproducible case in front of me yet; I was just wondering if there is something obvious that might have gone wrong.

Thanks for the hint, I'll look into the code.

Cheers,
Blazej

Fede Naum

Aug 17, 2017, 9:35:11 AM
to rez-config
Hi Blazej,  

See my answers inline.

On Thu, Aug 17, 2017 at 5:27 AM, Blazej Floch <blazej...@gmail.com> wrote:
> Yes you pretty much nailed it. I'll try to clarify:
>
> We have a studio wide memcached server.
>
> A new version of a tool was released.
>
> A single user did not receive the update, although she restarted the process (GUI) that triggers the rez resolve.
>
> In the same session, doing a solve from the console did resolve correctly.
>
> I basically restarted the memcached server to fix it (that's as hard a flush as it can get); this pretty much verified that the issue was related to memcached.
>
> I was expecting rez to flush the cache on a new release (version, package, variant).

Note: I think the correct term here would be to invalidate the cache, not to flush it.

Is the request through the GUI (the one that triggers the resolve) locked to a timestamp, vs. the console one being non-timestamped?

Could your issue be around these lines? See the comment in the code:

        This behaviour exists specifically so that resolves that use a
        timestamp but set that to the current time, can be reused by other
        resolves if nothing has changed. Older resolves however, can only be
        reused if the timestamp matches exactly (but this might happen a lot -
        consider a workflow where a work area is tied down to a particular
        timestamp in order to 'lock' it from any further software releases).
> Then again I wonder what the context is that is being stored for a resolve. Is it caching per user at all?

No, the cache entry is not per user. What gets stored is the resolve (list of packages + any implicit packages) + the resolve graph + the timestamp (if a timestamped request) + the uid of the repository (the filesystem where you have your packages, normally the REZ_RELEASE_PATH) - see the code here.

If it helps somehow: when I did an analysis at some point of why some resolves were missing the cache (not exactly the problem that you have), I found that different machines add different implicit packages, i.e. a CentOS-6.2 machine and a CentOS-6.6 machine will produce different resolves.

The other thing that caught us out one time was the uid of the filesystem. At some point we wanted to run a performance test comparing the rez repository mounted on a solid-state disk versus a spinning disk, so there were 2 groups of machines (and users) mounting the same directory (REZ_RELEASE_PATH), but with that directory hosted on 2 different filers. The call to get the uid was then returning 2 different ids, and hence, even though the resolve was cached by one machine, another user asking for the same set of packages would miss the cache, as the key was different.
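A toy illustration of the filer issue - the key shape below is simplified for the example (rez's real key has more fields), and the uids are made up:

```python
# Why two groups of users missed each other's cache entries: the resolve
# cache key includes each repository's (path, uid) pair, so the same mount
# path served by two different filers yields two different keys.
# (Simplified key shape; see rez's resolver module for the real one.)
def resolve_key(request, repos):
    return ("resolve", tuple(request), tuple(repos))


request = ["maya-2016", "~platform==linux"]
filer_a = [("filesystem", "/net/packages", 5311951666)]
filer_b = [("filesystem", "/net/packages", 4523540188)]  # same path, other filer

key_a = resolve_key(request, filer_a)
key_b = resolve_key(request, filer_b)
print(key_a == key_b)  # False
```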


> I would like to emphasize that I personally do not have a reproducible case in front of me yet; I was just wondering if there is something obvious that might have gone wrong.

Next time you have the issue, you can debug it a bit using the following configs:
# Print debugging info related to use of memcached during a resolve
debug_resolve_memcache = False

# Debug memcache usage. As well as printing debugging info to stdout, it also
# sends human-readable strings as memcached keys (that you can read by running
# "memcached -vv" as the server)
debug_memcache = False
You can also set these via the environment variables REZ_DEBUG_RESOLVE_MEMCACHE=True and REZ_DEBUG_MEMCACHE=True.
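Spelled out as shell commands (the package name in the comment is a placeholder):

```shell
# Same settings as above, via environment variables:
export REZ_DEBUG_RESOLVE_MEMCACHE=True
export REZ_DEBUG_MEMCACHE=True
# then run the problem resolve, e.g.:
#   rez-env mypackage
# and the memcache debug info is printed to stdout during the resolve.
echo "$REZ_DEBUG_RESOLVE_MEMCACHE $REZ_DEBUG_MEMCACHE"
```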

Hope this helps

Fede


Blazej Floch

Aug 18, 2017, 10:50:54 AM
to rez-config
Excellent points! Highly appreciated.

I'll look into it.

Allan Johns

Aug 21, 2017, 3:54:47 PM
to rez-c...@googlegroups.com
I haven't quite gotten to the bottom of it yet but there do exist cases where cache entries are not correctly invalidated. There are a couple of things to take note of:

1. Not sure if this describes your case, but Paul at Luma has discovered a flaw in the caching logic. If you have a resolve that picked up a local package, and a newer, non-local version of that package is released, then that cached resolve is not invalidated (I *think* - I need to investigate properly to be confident). This happens because, to test whether a resolve is invalidated, rez checks for any new releases of the packages in the resolve - BUT it does this only by checking within the repo of each resolved package. So in this case it will check for newer releases in the local packages path only, and will miss the central release.

2. In order to make cache invalidation checks cheap, rez relies on the fact that, when a new package is installed, the topmost unversioned package directory is touched. That way, only one file stat per package family is needed to test whether any newer packages have been released. So if for any reason this dir mod time doesn't get changed (eg, a package is manually altered), then this can cause stale cache entries not to be invalidated.
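The check in (2) can be sketched like this - a simplified illustration, not rez's actual code:

```python
import os
import tempfile

# Sketch of the cheap invalidation check described above: one stat() of the
# topmost package family directory tells you whether anything newer was
# released since the resolve was cached.
def family_mtime(repo_path, family):
    return os.stat(os.path.join(repo_path, family)).st_mtime


repo = tempfile.mkdtemp()
os.mkdir(os.path.join(repo, "maya"))          # the family dir, e.g. <repo>/maya
cached_at = family_mtime(repo, "maya")        # recorded when the resolve is cached

# A new release touches the family dir; simulate by bumping the mtime.
path = os.path.join(repo, "maya")
os.utime(path, (cached_at + 10, cached_at + 10))

stale = family_mtime(repo, "maya") > cached_at
print(stale)  # True
```

This also shows why a package altered without touching the family dir (or a dir the release user can't touch) leaves a stale entry: the mtime never moves, so the check passes.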

I'm planning on fixing (1), this should be fairly straightforward to do. Would be good to see if this fixes the existing issue (which unfortunately is hard to repro, as I think you've found out).

Cheers
A


Blazej Floch

Aug 28, 2017, 3:18:11 PM
to rez-config
Thanks for the additional hints. I'm waiting for it to happen again before I can take action :)

I did not know about (2). Since we release as a dedicated user to protect against changes, maybe some of the topmost folders have insufficient permissions to be updated. I'll take a look.

Blazej Floch

Oct 11, 2017, 7:22:39 PM
to rez-config
I haven't finished the investigation, but we identified the root cause of the mismatching cache entries, thanks to one of our TDs.

I just want to throw it out there, in case someone can spot something unexpected right away.

I've replaced the literal names for simplicity.

Our rezconfig has:

packages_path = [
    "${TBE_REZ_LOCAL_INTERNAL_DIR}",
    "${TBE_REZ_LOCAL_VIRTUAL_DIR}",
    "${TBE_REZ_LOCAL_THIRDPARTY_DIR}",

    "${TBE_REZ_LOCATION_DIR}",
    "${TBE_REZ_RELEASE_INTERNAL_DIR}",
    "${TBE_REZ_RELEASE_VIRTUAL_DIR}",
    "${TBE_REZ_RELEASE_THIRDPARTY_DIR}",
]

The first three folders are the local build folders, meaning they contain the username.

For most TDs, including me, the folders exist, and a memcached resolve looks something like this (reformatted):

7:39:38 DEBUG    Retrieving memcache key:
   "('resolve', ('packagename', '~platform==linux', '~arch==x86_64', '~os==CentOS-7.2'),

   (('filesystem', '/somepath/b.floch/internal', 5805584676),
    ('filesystem', '/somepath/b.floch/virtual', 5787412802),
    ('filesystem', '/somepath/b.floch/thirdparty', 5797940304),

    ('filesystem', '/otherpath/location', 5311951666),
    ('filesystem', '/otherpath/release/internal', 4523540188),
    ('filesystem', '/otherpath/release/virtual',4543421042),
    ('filesystem', '/otherpath/release/thirdparty', 4523540189)),
    'cebb675c437da423bfcd0e21bc2fa2c4ce9c92f', '', False, True)"
7:39:38 DEBUG    memcache get (resolve) took 0.195131063461


However, most of the floor users don't actually have any folder at the local path.

The cached resolve:

17:29:25 DEBUG    Retrieving memcache key: "('resolve', ('vp_tnj3_katana', '~platform==linux', '~arch==x86_64', '~os==CentOS-7.2'),
    (('filesystem', '/somepath/some.floorusername/internal'),
     ('filesystem', '/somepath/some.floorusername/virtual'),
     ('filesystem', '/somepath/some.floorusername/thirdparty'),

    ('filesystem', '/otherpath/location', 5311951666),
    ('filesystem', '/otherpath/release/internal', 4523540188),
    ('filesystem', '/otherpath/release/virtual',4543421042),
    ('filesystem', '/otherpath/release/thirdparty', 4523540189)),
     'cebb675c437da423bfcd0e21bbc2fa2c4ce9c92f', '', False, True)"
17:29:25 DEBUG    memcache get (resolve) took 0.183612108231


Note how the non-existent local folders seem to not have the number (which I assume is a uid).

So it seems to be very close to Fede's issue with the SSD benchmark: there he got different uids for different groups, but in our floor users' case there is no uid at all, and therefore they are stuck with the first cache entry ever created (since the last flush, I would think). Is this possible? Is this perhaps a bug?

What makes me wonder is that the updates actually happened in the release folder, which the floor users do have access to. Is this related to John's (1) issue?

Appreciate your opinions.

Blazej Floch

Oct 11, 2017, 7:34:14 PM
to rez-config
Sorry I meant "Allan's (1)" issue.

Blazej Floch

Oct 13, 2017, 3:56:47 PM
to rez-config
I tried stepping through a case where the cache would not resolve properly.

Here are my findings:

In this particular case the problem was here:

https://github.com/nerdvegas/rez/blob/master/src/rez/resolver.py#L236

The request is something like:

rez-env A

The cache for this particular user was for A-1.0, while A-2.0 had meanwhile been released.

But here is my question:
The data which is retrieved from the cache already defines a version (1.0).

When I step into get_variant_state_handle, variant_states.get(variant) ends up being None (I assume that's just a local cache for optimization purposes):

https://github.com/nerdvegas/rez/blob/master/src/rez/resolver.py#L236

... I will eventually end up in parent() of the FileSystemVariant resource, which according to:

https://github.com/nerdvegas/rez/blob/master/src/rezplugins/package_repository/filesystem.py#L247

... reuses that version (presumably based on the resource from the cache). Am I right to assume that this enforces the cached version, and that there is no way A-2.0 will be solved?

Hence the new_state ends up being equal to the old_state, and there is no way that the cache will be invalidated? Or is the state_handle global to the package family?
Regardless, it does not pick up the timestamp of the latest package, just the one retrieved from the cache.

Let me know if I missed something. In case this is the issue, what would be the right approach to fix it?

I am not sure how this relates to my previous message, in which the missing uid seemed to be the trigger. I would assume that it would suffer from the problem regardless of whether the uid is set or not.
The real difference in my mind is that the request itself has changed, but I have a hard time imagining that this is a bug that went unnoticed.

I still have the feeling we are experiencing (1). I will look in this direction.

I could live with not having caches when local search paths exist. This would slow down TDs, but the floor would be fine.
So I am actually considering having two configs: one for the floor, without local paths and with memcache enabled, and one for TDs, with local paths but without the cache.

I hope there is a better way.

Allan Johns

Oct 16, 2017, 8:33:40 PM
to rez-c...@googlegroups.com
Hey Blazej,

I haven't been able to look at this yet, but it's definitely on the list.

The one thing I would say is that you should definitely have a REZ_PACKAGES_PATH setup in production that does NOT include users' local packages paths, because that's going to get you the highest reuse of cache entries. Including local paths means that every user's paths differ, so there will be no reuse of caches across users at all! This would reduce your cache hit ratio quite substantially, I think.

At Method, our system for managing use of rez in production never includes local package paths unless that's explicitly asked for via a flag (++local).
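As a sketch, a production-facing rezconfig along those lines would drop the local entries entirely (the paths below are hypothetical, modeled on the release dirs earlier in the thread):

```python
# Hypothetical production rezconfig: release repositories only, no per-user
# local paths, so all users share the same resolve cache keys. Local paths
# would be added back only when explicitly requested (e.g. a "++local" flag).
packages_path = [
    "/rez/release/internal",
    "/rez/release/virtual",
    "/rez/release/thirdparty",
]
```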

Hth
A

Blazej Floch

Oct 17, 2017, 6:23:27 PM
to rez-config
Makes sense. I'll try to incorporate this in our system.