Processes: 156 total, 3 running, 2 stuck, 151 sleeping, 587 threads 22:45:01
Load Avg: 1.40, 1.59, 1.51 CPU usage: 55.90% user, 10.0% sys, 34.9% idle
SharedLibs: 1332K resident, 0B data, 0B linkedit.
MemRegions: 20623 total, 2230M resident, 87M private, 1000M shared.
PhysMem: 4017M used (722M wired), 60M unused.
VM: 373G vsize, 1026M framework vsize, 0(0) swapins, 0(0) swapouts.
Networks: packets: 46431/104M in, 43605/85M out.
Disks: 109601/3416M read, 96533/2053M written.
PID COMMAND %CPU TIME #TH #WQ #PORT #MREG MEM RPRVT PURG
1005 spindump_age 0.0 00:00.01 2 1 46 46 1000K 440K 0B
990 python2.7 0.0 00:00.10 1 0 16 102 8260K 8028K 0B
964 bash 0.0 00:00.02 1 0 19 29 608K 468K 0B
963 login 0.0 00:00.03 2 0 30 42 836K 508K 0B
959 com.apple.We 0.0 00:01.82 8 1 178 293 21M 19M 32K
950 rdm 0.0 00:15.20 7 0 171 1612 313M 302M 0B
942 redis-server 97.3 28:26.38 3/1 0 18 1120 1010M 1081M 0B
Python 2.7.7 (default, Jun 2 2014, 18:55:26)
[GCC 4.2.1 Compatible Apple LLVM 5.1 (clang-503.0.40)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import redis
>>> r = redis.StrictRedis(db=3)
>>> keys = r.keys('user_following:*')
>>> len(keys)
12306
>>> from datetime import datetime
>>> print datetime.now()
2014-08-06 22:16:28.820763
>>> r.zunionstore('most_followed', keys)
3668747L
>>> print datetime.now()
2014-08-06 23:22:36.795859
Based on "mem_fragmentation_ratio:0.35", only 35% of the memory that Redis knows its using is actually resident in memory...
--
Thank you again Josiah for your detailed answer, I'll have to crunch more on that. I'm currently running the same command again, for 30 minutes already, and below you can see the 'top' output. I cleaned up the Redis data beforehand and restarted the machine. Sorry, I don't know how to read all of this, but it seems to me it's not swapping; it's just that the redis-server process stays at around 99%.
I understand the "standard-ish" approaches you're explaining below, apart from the 3rd one on Lua. Are you saying you can get better performance by running some Lua script than by just running the plain ZUNIONSTORE command?
But actually I'm very interested in what you've done to improve performance by 1000x, because in fact I'm also working with Twitter data.
I will have to do some more reading on that, as I said, and I will try doing this in Python just to compare performance.
Hi Josiah, just got back from holidays, that's why I was silent the past week. I'll resume investigating this zunionstore issue and I could still use your help, so I'll explain what I'm trying to accomplish. I'm actually planning to write a blog post about it; I'll let you know before publishing it.

Here's the problem. Given a well-known group of Twitter users, let's call it the "target-group", I want to find out which are the most popular accounts for this specific target-group, i.e. the top N accounts most followed by the target-group members. This means that I don't know who these popular accounts are, so I cannot simply pull their follower counts. And even if I did know who they were, I'd still have to select from their followers only the ones who are in my target-group, so I can order them by popularity (number of followers from within my target-group).
E.g. 1: If my target-group were the entire Twitter user base, the result would be exactly this: http://twittercounter.com/pages/100 (for N=100).
E.g. 2: In my specific example, the chosen target-group is all the HNers (the 12306 public followers of @newsyc20 - out of the total 13343, minus the 1037 protected ones). Attached you can see the resulting top 300 (for N=300), which I found by running the one-hour zunionstore. So, as you can see, I need to get the complete sets of "friend ids" for each of my target-group members and perform a zunionstore on all these sets. Hope I managed to explain it better now.
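In redis-py terms, pulling the top N out of the union looks roughly like this (just a sketch; it assumes 'most_followed' already holds the result of the zunionstore from the session above):

import redis

r = redis.StrictRedis(db=3)

# highest scores first: each score is the number of target-group members
# following that account, so this is the top 300 most-followed accounts
top_300 = r.zrevrange('most_followed', 0, 299, withscores=True)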
A friend of mine tried to perform the same zunionstore-like operation in Python, as you suggested, and that took 30 seconds, just as you estimated. But in Redis the zunionstore takes from 30 minutes to an hour, depending on the processor. And we still have no idea why...
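The client-side version was presumably something along these lines (a sketch, assuming the per-user friend ids are stored as plain sets; if they are sorted sets, zrange(key, 0, -1) would replace smembers):

from collections import Counter

import redis

r = redis.StrictRedis(db=3)
counts = Counter()

# pull each member's friend ids and count how often every account appears
for key in r.keys('user_following:*'):
    counts.update(r.smembers(key))

top_300 = counts.most_common(300)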
We could give you a Redis backup with all this data so you can see for yourself how the zunionstore is performing. Maybe it's related to the Redis config; I don't know how to fine-tune that yet, I just went with the default configuration.
Yep, well, this is a huge performance improvement in Redis (or a performance flaw in previous versions :)
I guess you generally don't do this kind of big zunionstore, and I don't plan to do it on a regular basis either; it's just for the initial data-load phase of my project. Glad it's sorted out. I understand that you can do it differently without zunionstore, but I plan to use this command heavily in my system, on much smaller data sets indeed, so I had to understand what was wrong.
As I said, I'm writing a blog post about this exercise and I'm going to let you know when it's done.
Besides this zunionstore issue, there was a challenge in retrieving all those friend ids for the 13K HNers, given the Twitter API rate limits. Briefly, there were 15K calls to the friends/ids method, and at 1 call/min/token you need about 10 days to retrieve all the data. What I did was extend tweepy so I could provide several tokens to the same API object, and those tokens are used in a round-robin fashion, transparently. Using about four dozen tokens, I reduced the overall retrieval time to 5 hours (2 hours of working time, 3 hours of sleep). Anyway, I'll wrap this up and publish a post about it.

Thank you again Josiah for your support.

Cheers, Alex
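P.S. The round-robin token idea, in rough form (a hypothetical sketch, not the actual tweepy extension; the RoundRobinAPI name and its internals are just illustrative):

import itertools

import tweepy


class RoundRobinAPI(object):
    """Builds one authorized tweepy.API per token pair and hands each
    call to the next API in the pool, so the per-token rate limit
    window is spread across all the tokens."""

    def __init__(self, consumer_key, consumer_secret, token_pairs):
        apis = []
        for access_token, access_secret in token_pairs:
            auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
            auth.set_access_token(access_token, access_secret)
            apis.append(tweepy.API(auth))
        self._pool = itertools.cycle(apis)

    def friends_ids(self, *args, **kwargs):
        # each call goes out under a different token
        return next(self._pool).friends_ids(*args, **kwargs)

# usage, with made-up credential names:
# api = RoundRobinAPI(CONSUMER_KEY, CONSUMER_SECRET, [(token1, secret1), (token2, secret2)])
# ids = api.friends_ids(user_id=some_follower_id)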