ZUNIONSTORE taking hours to execute?

Alexandru Stanciu

Aug 5, 2014, 3:27:54 PM
to redi...@googlegroups.com
Hi guys,

I have about 12K simple SETs of integers, each containing anywhere from 1 to 50K members (about 700 on average), roughly 9M members in total. I would like to aggregate them and get a count for each distinct member, so I'm using ZUNIONSTORE (with SUM and the default weight of 1). It didn't seem that heavy, but the ZUNIONSTORE over all 12K sets has been running for about an hour already and I have no idea if it's going to stop soon.

It's a 4GB RAM machine, and here's some Redis info: 

redis_version:2.8.9
used_cpu_sys:2.36 
used_cpu_user:14.54
used_memory_human:886.58M

Please help, I feel like I'm missing something here.
Is this normal? Is there any way I could estimate how long it will take to execute? Or should I take a totally different approach, maybe work with sorted sets instead of simple sets?

Thanks a lot

Alexandru Stanciu

Aug 5, 2014, 3:43:07 PM
to redi...@googlegroups.com
Well yeah, it stopped; it took more than an hour.
I have no idea why, so I would really appreciate someone explaining this.

Here's the updated Redis info

# Server
redis_version:2.8.9
redis_git_sha1:00000000
redis_git_dirty:0
redis_build_id:9ccc8119ea98f6e1
redis_mode:standalone
os:Darwin 13.3.0 x86_64
arch_bits:64
multiplexing_api:kqueue
gcc_version:4.2.1
process_id:15506
run_id:5f41fb312b69ab07d2bfa0f2a84e9b60885c4e73
tcp_port:6379
uptime_in_seconds:7372
uptime_in_days:0
hz:10
lru_clock:14758329
config_file:

# Clients
connected_clients:3
client_longest_output_list:0
client_biggest_input_buf:0
blocked_clients:0

# Memory
used_memory:1285861536
used_memory_human:1.20G
used_memory_rss:448233472
used_memory_peak:1391152896
used_memory_peak_human:1.30G
used_memory_lua:33792
mem_fragmentation_ratio:0.35
mem_allocator:libc

# Persistence
loading:0
rdb_changes_since_last_save:0
rdb_bgsave_in_progress:0
rdb_last_save_time:1407266837
rdb_last_bgsave_status:ok
rdb_last_bgsave_time_sec:17
rdb_current_bgsave_time_sec:-1
aof_enabled:0
aof_rewrite_in_progress:0
aof_rewrite_scheduled:0
aof_last_rewrite_time_sec:-1
aof_current_rewrite_time_sec:-1
aof_last_bgrewrite_status:ok
aof_last_write_status:ok

# Stats
total_connections_received:7
total_commands_processed:12427
instantaneous_ops_per_sec:0
rejected_connections:0
sync_full:0
sync_partial_ok:0
sync_partial_err:0
expired_keys:0
evicted_keys:0
keyspace_hits:12353
keyspace_misses:0
pubsub_channels:0
pubsub_patterns:0
latest_fork_usec:29076

# Replication
role:master
connected_slaves:0
master_repl_offset:0
repl_backlog_active:0
repl_backlog_size:1048576
repl_backlog_first_byte_offset:0
repl_backlog_histlen:0

# CPU
used_cpu_sys:18.59
used_cpu_user:3835.04

used_cpu_sys_children:2.73
used_cpu_user_children:9.16

# Keyspace
db0:keys=12339,expires=0,avg_ttl=0

Josiah Carlson

Aug 5, 2014, 11:56:18 PM
to redi...@googlegroups.com
It's a 4 gig memory machine, but what do the first few lines of 'top' say? Specifically the lines that look like:
top - 20:13:55 up 29 days, 10:41, 20 users,  load average: 0.20, 0.26, 0.24
Tasks: 282 total,   2 running, 280 sleeping,   0 stopped,   0 zombie
Cpu(s): 12.8%us,  0.4%sy,  0.0%ni, 85.8%id,  0.1%wa,  0.0%hi,  0.9%si,  0.0%st
Mem:   8118860k total,  7629100k used,   489760k free,   301432k buffers
Swap:  8328188k total,  1655812k used,  6672376k free,  1978564k cached

If you provide that data, it will give us a better idea of whether or not Redis is swapping. Based on "mem_fragmentation_ratio:0.35", only 35% of the memory that Redis knows it's using is actually resident in memory, which says that either 1) Redis' accounting is not very good, or 2) 65%+ of the memory used by Redis has been swapped out to disk by the OS. If Redis is swapped out to disk (remember, Redis requires its full dataset to be in-memory), that would offer a very good reason why your command took over an hour to execute.

Why would that explain it? It has to do with the algorithm Redis uses to perform the union. Redis does the rough equivalent of the following Python when using SUM with input SETs and the default weight of 1...

from collections import defaultdict

import redis

conn = redis.StrictRedis()

# 'keys' is the list of input SET keys and 'destination_key' is the output ZSET.
temp_result = defaultdict(float)
for key in keys:
    for member in conn.smembers(key):
        temp_result[member] += 1

conn.delete(destination_key)
conn.zadd(destination_key, **temp_result)  # redis-py 2.x keyword form: member=score

... which is to say that Redis iterates over the input SETs/ZSETs one at a time, adding each item to a temporary counting hash, and when it is done it converts that hash into a ZSET. This isn't slow by itself, but the thing to remember is that for small SETs/ZSETs (512 entries at most, by default), the data is represented in a compact format, so it is quick to read. Larger SETs/ZSETs represent the data internally as a hash table, or hash table + skiplist for the ZSET. With the non-compact representation, each element read is a random memory read. With data mostly swapped out to disk, that random memory read becomes a random disk read of the page the memory was located in. If you're lucky, that page miss will load other items into memory that get used before the page is swapped out again.
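
(Side note: OBJECT ENCODING will tell you which internal representation any given key is using. A quick, rough check from redis-py could look like the following; adjust the key pattern to your own input SETs:)

import redis
from collections import Counter

conn = redis.StrictRedis()

# Tally the internal encodings of the input SETs: 'intset' is the compact
# integer encoding, 'hashtable' is the non-compact one described above.
encodings = Counter(conn.object('encoding', key)
                    for key in conn.keys('user_following:*'))  # substitute your pattern
print encodings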

Estimating your total number of items examined at 700 * 12000 = 8,400,000, that's roughly 2200 items examined each second for 3800 seconds. That's high for the expected performance if you are using a spinning disk and every random memory read results in a random disk read. It's more or less in line with what you would expect if you are using an SSD and every random memory read results in a random disk read (Redis won't do parallel disk reads, which is where SSDs really shine). It's also more or less in line with what you would expect if some of your data was packed concisely (1-2 reads to get up to 512 elements) and some of it was not.


As for how you should be doing it... don't use ZSETs as input in this case; that just adds to the amount of memory that needs to be read to compute the result. Really, though, your first step should be to make sure Redis has enough memory to do what it needs to do. If Redis is swapping, remove other stuff, or move Redis somewhere it can use enough memory. From there, you should be able to execute the exact same command you were executing and it should complete in under a minute; my best guess, assuming a modern-ish processor, is actually closer to 15-30 seconds. Back in 2011 I was intersecting sets of 5 million and 7.5 million elements in 7 seconds with Redis[1]; it's not unreasonable to think that unioning a total of roughly 8.4 million elements would take under 30 seconds.

Don't want to spend ?? seconds waiting for Redis to complete? You have 3 standard-ish* options:
1. Pull your data key by key and compute on the client side (like the example I wrote above in Python)
2. Perform a series of partial ZUNIONSTORE operations (see the Python sketch below), either sequentially:
ZUNIONSTORE result 100 <the first 100 keys>
ZUNIONSTORE result 101 result <the next 100 keys>
... or using a sort of tree:
ZUNIONSTORE result1 100 <the first 100 keys>
ZUNIONSTORE result2 100 <the next 100 keys>
...
ZUNIONSTORE result120 100 <the last 100 keys>
ZUNIONSTORE result 120 result1 result2 result3 ...
3. Perform a Lua variant of #2's sequential approach, which avoids the delete + recreation of the result ZSET (I can explain this further)

Each has its drawbacks and benefits, and none of them offers a point-in-time union. If that is a requirement, you have at least 2 additional options, but one requires another Redis server (to actually perform the query), and the other requires that you write a client that pretends to be a Redis slave asking for a snapshot of the data, which you then parse and aggregate on the client side.
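
To make #2 concrete, here is a rough redis-py sketch of the sequential variant (untested; the batch size of 100 is arbitrary, and the key pattern is a placeholder for your input SETs):

import redis

conn = redis.StrictRedis()

keys = sorted(conn.keys('user_following:*'))  # adjust the pattern to your input SETs
conn.delete('result')

for i in range(0, len(keys), 100):
    batch = keys[i:i + 100]
    if i == 0:
        conn.zunionstore('result', batch)
    else:
        # include the running 'result' ZSET itself (weight 1, SUM) so the
        # counts from previous batches carry forward
        conn.zunionstore('result', ['result'] + batch)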


I hope this helps. Let me know if you have any other questions or require clarification on anything.

* By "standard-ish" I mean that it doesn't require a complicated development effort

 - Josiah




Alexandru Stanciu

Aug 6, 2014, 4:49:52 PM
to redi...@googlegroups.com
Thank you again Josiah for your detailed answer, I'll have to crunch more on that.

I'm currently running the same command again; it has been going for 30 minutes already, and below you can see the 'top' output. I cleaned up the Redis data and restarted the machine beforehand.
Sorry, I don't know how to read all this, but it seems to me it's not swapping, just that the redis-server process stays at around 99%.

Processes: 156 total, 3 running, 2 stuck, 151 sleeping, 587 threads      22:45:01
Load Avg: 1.40, 1.59, 1.51  CPU usage: 55.90% user, 10.0% sys, 34.9% idle
SharedLibs: 1332K resident, 0B data, 0B linkedit.
MemRegions: 20623 total, 2230M resident, 87M private, 1000M shared.
PhysMem: 4017M used (722M wired), 60M unused.
VM: 373G vsize, 1026M framework vsize, 0(0) swapins, 0(0) swapouts.
Networks: packets: 46431/104M in, 43605/85M out.
Disks: 109601/3416M read, 96533/2053M written.

PID   COMMAND      %CPU      TIME     #TH  #WQ  #PORT #MREG MEM    RPRVT  PURG
1005  spindump_age 0.0       00:00.01 2    1    46    46    1000K  440K   0B
990   python2.7    0.0       00:00.10 1    0    16    102   8260K  8028K  0B
964   bash         0.0       00:00.02 1    0    19    29    608K   468K   0B
963   login        0.0       00:00.03 2    0    30    42    836K   508K   0B
959   com.apple.We 0.0       00:01.82 8    1    178   293   21M    19M    32K
950   rdm          0.0       00:15.20 7    0    171   1612  313M   302M   0B
942   redis-server 97.3      28:26.38 3/1  0    18    1120  1010M  1081M  0B


I understand the "standard-ish" approaches you're explaining below, apart from the 3rd one on Lua. Are you saying you can get better performance by running some Lua script than by just running the plain ZUNIONSTORE command?

But actually I'm very interested in what you've done improving performance by 1000x, because in fact I'm also working with twitter data.

I will have to do some more reading on that as I said, and will try doing this in python just to compare performance.

Cheers, Alex

Alexandru Stanciu

Aug 6, 2014, 5:25:00 PM
to redi...@googlegroups.com
So yeah, a little more than 1 hour of run time...

Python 2.7.7 (default, Jun  2 2014, 18:55:26)
[GCC 4.2.1 Compatible Apple LLVM 5.1 (clang-503.0.40)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import redis
>>> r = redis.StrictRedis(db=3)
>>> keys = r.keys('user_following:*')
>>> len(keys)
12306
>>> from datetime import datetime
>>> print datetime.now()
2014-08-06 22:16:28.820763
>>> r.zunionstore('most_followed', keys)
3668747L
>>> print datetime.now()
2014-08-06 23:22:36.795859

Kiril Kartunov

Aug 7, 2014, 3:27:03 AM
to redi...@googlegroups.com

Based on "mem_fragmentation_ratio:0.35", only 35% of the memory that Redis knows it's using is actually resident in memory...
 
Just curious here, and wanted to ask... Reading this detailed answer, I decided to check the value of "mem_fragmentation_ratio" on my own Redis instance and got 2.40. If this is a percentage that represents the disk/memory split, why is it more than 1?
Thanks

Josiah Carlson

Aug 7, 2014, 3:07:40 PM
to redi...@googlegroups.com
If you look at your Redis INFO output, there will be 3 lines, which I've extracted from the info output on one of my dev servers:
used_memory:156293968
used_memory_rss:160935936
mem_fragmentation_ratio:1.03


"used_memory" is the amount of memory that Redis knows to have dynamically allocated
"used_memory_rss" is the amount of memory that Redis knows to be resident in ram (this is calculated in *nix by examining /proc)
"mem_fragmentation_ratio" is: used_memory_rss / used_memory

When mem_fragmentation_ratio is lower than 1, there is only one way that can happen: the OS has swapped portions of Redis' dynamically allocated memory to disk. This usually means that you need more memory, or fewer other things putting pressure on the OS to swap Redis to disk.

When mem_fragmentation_ratio is significantly higher than 1, usually that's because of one of several different scenarios involving memory allocations/deallocations (sometimes sudden, sometimes over time) that has caused the memory that Redis is using to become fragmented. This can be addressed trivially by restarting Redis. It can also be sort-of addressed by forcing Redis to delete and re-allocate all of your keys (this is theoretical, I've never done it before, but knowing enough about Redis and jemalloc, this should work). A script for reallocating a group of keys (including TTLs) is as follows...

local total = 0
for _, key in ipairs(KEYS) do
    -- PTTL returns -2 if the key does not exist, -1 if it exists with no expiration
    local ttl = tonumber(redis.call('pttl', key))
    if ttl >= -1 then
        -- DUMP + DEL + RESTORE forces a fresh allocation for the key's data
        local data = redis.call('dump', key)
        total = total + #data
        redis.call('del', key)
        redis.call('restore', key, 0, data)
        if ttl > 0 then
            redis.call('pexpire', key, ttl)
        end
    end
end
return total

If you run this over all of your keys (maybe a few hundred at a time), you should reduce/eliminate your fragmentation considerably. There are some exceptions to this, but theoretically it should work :)
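
For reference, one rough way to drive that script from redis-py might be the following (untested sketch; the batch size is arbitrary, and SCAN is used so that fetching the key list doesn't block the server):

import redis

REALLOC_LUA = """
local total = 0
for _, key in ipairs(KEYS) do
    local ttl = tonumber(redis.call('pttl', key))
    if ttl >= -1 then
        local data = redis.call('dump', key)
        total = total + #data
        redis.call('del', key)
        redis.call('restore', key, 0, data)
        if ttl > 0 then
            redis.call('pexpire', key, ttl)
        end
    end
end
return total
"""

conn = redis.StrictRedis()
realloc = conn.register_script(REALLOC_LUA)

keys = list(conn.scan_iter(count=1000))  # requires Redis 2.8+ and a redis-py with SCAN support
total = 0
for i in range(0, len(keys), 200):       # a few hundred keys per call, as suggested above
    total += realloc(keys=keys[i:i + 200])
print 'freshly re-dumped %d bytes' % total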

 - Josiah




Josiah Carlson

Aug 7, 2014, 4:31:51 PM
to redi...@googlegroups.com
On Wed, Aug 6, 2014 at 1:49 PM, Alexandru Stanciu <alexandr...@gmail.com> wrote:
Thank you again Josiah for your detailed answer, I'll have to crunch more on that.

I'm currently running the same command again; it has been going for 30 minutes already, and below you can see the 'top' output. I cleaned up the Redis data and restarted the machine beforehand.
Sorry, I don't know how to read all this, but it seems to me it's not swapping, just that the redis-server process stays at around 99%.

[replaced version with better formatting, where I also discovered you were using OS X and not Linux]

Processes: 156 total, 3 running, 2 stuck, 151 sleeping, 587 threads      22:45:01
Load Avg: 1.40, 1.59, 1.51  CPU usage: 55.90% user, 10.0% sys, 34.9% idle
SharedLibs: 1332K resident, 0B data, 0B linkedit.
MemRegions: 20623 total, 2230M resident, 87M private, 1000M shared.
PhysMem: 4017M used (722M wired), 60M unused.
VM: 373G vsize, 1026M framework vsize, 0(0) swapins, 0(0) swapouts.
Networks: packets: 46431/104M in, 43605/85M out.
Disks: 109601/3416M read, 96533/2053M written.

PID   COMMAND      %CPU      TIME     #TH  #WQ  #PORT #MREG MEM    RPRVT  PURG
1005  spindump_age 0.0       00:00.01 2    1    46    46    1000K  440K   0B
990   python2.7    0.0       00:00.10 1    0    16    102   8260K  8028K  0B
964   bash         0.0       00:00.02 1    0    19    29    608K   468K   0B
963   login        0.0       00:00.03 2    0    30    42    836K   508K   0B
959   com.apple.We 0.0       00:01.82 8    1    178   293   21M    19M    32K
950   rdm          0.0       00:15.20 7    0    171   1612  313M   302M   0B
942   redis-server 97.3      28:26.38 3/1  0    18    1120  1010M  1081M  0B

Looking at your top output, I would agree that it doesn't seem as though it is swapping; swapin/swapout are both 0.

I would love to see the output of the following command run in your Python console:
sum(map(r.scard, keys))

That will tell us the total number of items examined by Redis during the unionstore operation.

I understand the "standard-ish" approaches you're explaining below, apart from the 3rd one on Lua. Are you saying you can have better performance by running some Lua script then just the plain ZUNIONSTORE command?

No. I'm saying that you can get similar (but slower) performance using Lua instead, but you can control how much work you perform in any given step, meaning that while it may take longer to execute, you won't be blocking all other commands from executing.
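
To sketch what I mean (untested, and the 100-keys-per-call batch size is arbitrary): a small Lua script can ZINCRBY every member of a batch of source SETs into the destination ZSET, so the destination is never deleted and rebuilt, and other clients only ever wait for one batch at a time:

import redis

conn = redis.StrictRedis()

CHUNKED_UNION_LUA = """
-- KEYS[1] is the destination ZSET, KEYS[2..] are source SETs for this batch
for i = 2, #KEYS do
    for _, member in ipairs(redis.call('smembers', KEYS[i])) do
        redis.call('zincrby', KEYS[1], 1, member)
    end
end
return #KEYS - 1
"""
chunked_union = conn.register_script(CHUNKED_UNION_LUA)

keys = sorted(conn.keys('user_following:*'))  # key pattern from your console output
conn.delete('most_followed')                  # start the count from scratch
for i in range(0, len(keys), 100):            # 100 source SETs per EVAL call
    chunked_union(keys=['most_followed'] + keys[i:i + 100])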

That said, assuming your 700 average elements/set is correct, this is taking *way* too long to execute. With 12k sets of 700 items each, I'd estimate less than 30 seconds of overall execution time should be required. This *could* be a corner case with the number of sets you are dealing with, or your estimate of average size could be off (this is where that sum(map(...)) call can be useful; it gives us a better estimate).

But actually I'm very interested in what you've done improving performance by 1000x, because in fact I'm also working with twitter data.

Remember that in that example, we were looking at the size of intersections between 2 sets, rather than the result of a counting union of many sets.

I will have to do some more reading on that as I said, and will try doing this in python just to compare performance.

Core set operations in Python are faster than in Redis, and this kind of counting should be comparable between the two (if Redis was running in 30 seconds for this and not over an hour, and Python already had the data in memory).

 - Josiah

Alexandru Stanciu

Aug 8, 2014, 3:48:39 AM
to redi...@googlegroups.com
Yes, sorry about the bad formatting; here's the total count of elements, 9.2M.

>>> len(keys) 
12306
>>> sum(map(r.scard, keys))
9267750 


I have to mention that before the union I left out a significant portion of my data set: the top 30 keys with the most members (>50K each), a total of 6.7M elements.
Otherwise I guess it would have taken much more than 1 hour. Today I will try performing the union in Python and see how long it takes.

Anyways, I'm setting up a GitHub repo for this exercise (https://github.com/ducu/twitter-most-followed). It's a work in progress and there's no description yet, but basically it's about finding the most-followed Twitter accounts for a specified group. In this exercise the group is the Hacker News (@newsyc20) followers, so that's what the 12.3K "user_following:*" keys are.

Well hopefully I'll get to understand what's happening with my Redis instance.. Thank you Josiah for your help

Josiah Carlson

Aug 18, 2014, 1:41:40 AM
to redi...@googlegroups.com
Looking at your code, I can't help but think that you are approaching this from the wrong direction, unless you have other reasons for having follower lists.

In particular, if you want to find users with the most followers, you can pull the follower count directly from the user information about that user. That way you don't need to crawl the full follower list just to answer the question "who has the most followers?"; the count is part of the information contained in the user info object about the person. Specifically, https://dev.twitter.com/docs/api/1.1/get/users/lookup lets you pull the follower count for up to 100 users at the same time*. It's what let us prioritize crawling the follower lists on Twitter (we never bothered with the 'following' list, though that could have led to some interesting pagerank-like computation), which in turn led us to even more big users on Twitter.

So... what else are you planning on doing with the data?

 - Josiah

* In the past, we used to get errors when making API calls for certain users. I'm pretty sure that they were private users, but it would error out an entire 100 user request, and I didn't think it made any sense to do individual calls on every user in that case. When an error occurred on a 100 user list, I'd chop the list in half and re-run on the halves individually. After 14 calls, I'd have all of the information for the 100 users, and I'd know which user was private. If there were 2 private users, that would be more API calls, but I don't think I ever saw more than 2-3 private users per 100 users, and about 90-95% of API requests returned without requiring the recursive splitting.
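
A rough sketch of that splitting strategy with tweepy's lookup_users wrapper (illustrative only, not the code we actually ran):

import tweepy

def lookup_with_splitting(api, user_ids):
    # Try the whole batch; on an error, split it in half and recurse until
    # the single bad (private/suspended) user is isolated and skipped.
    try:
        return api.lookup_users(user_ids=user_ids)
    except tweepy.TweepError:
        if len(user_ids) == 1:
            return []
        mid = len(user_ids) // 2
        return (lookup_with_splitting(api, user_ids[:mid]) +
                lookup_with_splitting(api, user_ids[mid:]))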

Alexandru Stanciu

Aug 18, 2014, 7:41:15 AM
to redi...@googlegroups.com
Hi Josiah, just got back from holidays, that's why I was silent the past week.
I'll resume investigating this zunionstore issue and I could still use your help, so I'll explain what I'm trying to accomplish. I'm actually planning to write a blog post about it, I'll let you know before publishing it.

Here's the problem.
Given a well-known group of Twitter users, let's call it the "target-group", I want to find out which are the most popular accounts for this specific target-group, i.e. the top N accounts most followed by the target-group members. This means that I don't know who these popular accounts are, so I cannot pull their follower counts. And even if I knew who they were, I'd have to select from their followers only the ones that are in my target-group, so I can order them by popularity (# of followers from my target-group).

E.g. 1: If my target-group were the entire Twitter user base, the result would be exactly this http://twittercounter.com/pages/100 (for N=100).
E.g. 2: In my specific example, the chosen target-group is all the HNers (the 12306 public followers of @newsyc20 - out of the total 13343, minus the 1037 protected ones). Attached you can see the resulting top 300 (for N=300) that I got by running the 1-hour zunionstore.

So as you can see, I need to get the complete sets of "friend ids" for each of my target-group members, and perform a zunionstore on all these sets. Hope I managed to explain it better now.

A friend of mine tried to perform the same zunionstore-like operation in python as you suggested, and that took 30 sec, just as you estimated. But in redis, the zunionstore takes from 30 min to 1 hour depending on the processor. And we still have no idea why...

We could give you a redis backup with all this data so you can see for yourself how the zunionstore is performing. Maybe it's related to redis config, I don't know how to fine tune that yet, I just went for the default configuration.

Please let me know if the problem is clear now. Cheers, Alex
top-300-most-followed-accounts-by-hners.csv

Alexandru Stanciu

Aug 20, 2014, 10:48:12 AM
to redi...@googlegroups.com
Maybe it has something to do with this - https://github.com/antirez/redis/pull/1786

I'll install the latest Redis release (was using 2.8.9) and run the zunionstore again, dunno what else to try.. 
If you have any tips please let me know, thanks!

Josiah Carlson

Aug 20, 2014, 1:14:13 PM
to redi...@googlegroups.com
On Mon, Aug 18, 2014 at 4:41 AM, Alexandru Stanciu <alexandr...@gmail.com> wrote:
Hi Josiah, just got back from holidays, that's why I was silent the past week.
I'll resume investigating this zunionstore issue and I could still use your help, so I'll explain what I'm trying to accomplish. I'm actually planning to write a blog post about it, I'll let you know before publishing it.

Here's the problem.
Given a well known group of twitter users, let's call it the "target-group", I want to find out which are the most popular accounts for this specific target-group; i.e. top N most-followed-accounts-by-the-target-group-members. This means that I don't know who these popular accounts are, so I cannot pull the follower count. And even if I would know who they are, I'd have to select from their followers only the ones which are in my target-group so I can order them by popularity (# of followers from my target-group).

I can think of several ways of doing this that don't involve the use of ZUNIONSTORE. With one of them, you can get the exact answer you want at any time, and incremental updates are fast. The straightforward way of implementing the solution uses 2x as much memory as your current method, but you can probably cut that down to 10% more memory than what you are currently using, and still get a real-time toplist.

Another version can do similar real-time incremental additions, but can fail if you ever re-spider a user's following list, and is only approximate. But this one can actually reduce memory use overall if some optimizations are performed.
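
To give a flavor of the first option (this is only a sketch, with key names borrowed from your console output): bump a running toplist ZSET as you store each member's friend list, instead of unioning everything afterwards.

import redis

conn = redis.StrictRedis()

def record_friends(member_id, friend_ids):
    # Store the raw friend SET (that's where the 2x memory goes) and bump the
    # running toplist in one round trip.
    pipe = conn.pipeline(transaction=False)
    pipe.delete('user_following:%s' % member_id)
    if friend_ids:
        pipe.sadd('user_following:%s' % member_id, *friend_ids)
    for fid in friend_ids:
        pipe.zincrby('most_followed', fid, 1)  # redis-py 2.x order: (name, member, amount)
    pipe.execute()
    # On a re-spider you would first diff against the previously stored SET and
    # only adjust counts for members that were added or removed.

# The current top 300 at any time, no ZUNIONSTORE required:
top = conn.zrevrange('most_followed', 0, 299, withscores=True)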

E.g. 1: If my target-group would be the entire twitter user base, the result would be exactly this http://twittercounter.com/pages/100 (for N=100).
E.g. 2: In my specific example, the chosen target-group is all the HNers (the 12306 public followers of @newsyc20 - out of the total 13343, minus the 1037 protected ones). And attached you can see the resulting top 300 (for N=300) that I found out by performing the 1 hour running zunionstore.

So as you can see, I need to get the complete sets of "friend ids" for each of my target-group members, and perform a zunionstore on all these sets. Hope I managed to explain it better now.

Now I see the purpose of what you are looking to do: finding influencers in subgroups.

A friend of mine tried to perform the same zunionstore-like operation in python as you suggested, and that took 30 sec, just as you estimated. But in redis, the zunionstore takes from 30 min to 1 hour depending on the processor. And we still have no idea why...

Your followup with the reference to the github pull request might be the source of the issue, and though it has been merged, it isn't yet in a release. If you pull down the github repository and check out the 2.8 branch, it will have the change, but the next official release with the change will be 2.8.14, which isn't out yet.
 
We could give you a redis backup with all this data so you can see for yourself how the zunionstore is performing. Maybe it's related to redis config, I don't know how to fine tune that yet, I just went for the default configuration.

It is unlikely to be configuration related. I could run it on my machine, but I doubt I would experience all that much difference in execution time.

 - Josiah

Alexandru Stanciu

Aug 21, 2014, 3:08:24 AM
to redi...@googlegroups.com
It worked!!! Down to 2 min ;)

# Server
redis_version:2.9.57 (3.0.0 beta 8)

>>> import redis
>>> r = redis.StrictRedis()
>>> keys = r.keys('user_friends:*')
>>> len(keys)
12340
>>> from datetime import datetime
>>> print datetime.now()
2014-08-21 08:59:52.154846
>>> r.zunionstore('most_followed', keys)
7790619L
>>> print datetime.now()
2014-08-21 09:01:57.874463

Josiah Carlson

Aug 21, 2014, 3:12:23 AM
to redi...@googlegroups.com
Awesome :D

Congratulations on partly discovering a solution to your mystery!

 - Josiah

Alexandru Stanciu

Aug 21, 2014, 4:43:37 AM
to redi...@googlegroups.com
Yep, well this is a huge performance improvement in Redis (or a performance flaw in previous versions :)

I guess you generally don't do this kind of big zunionstore, and I don't plan to do it on a regular basis either; it's just for the initial data-load phase of my project. Glad it's sorted out. I understand that you can do it differently without zunionstore, but I plan to use this command heavily in my system, on much smaller data sets indeed, so I had to understand what was wrong.

As I said, I'm writing a blog post about this exercise and I'm gonna let you know when done.
Besides this zunionstore issue, there was a challenge to retrieve all those friend ids for the 13K HNers, considering Twitter API rate limits. Briefly, there were 15K calls to the friends/ids method, and at 1 call/min/token you need about 10 days to retrieve all the data. What I did was to extend tweepy so I could provide several tokens to the same API object, and these tokens are used in a round robin fashion transparently. Using about four dozen tokens I reduced the overall retrieval time to 5 hours (2 hours work time, 3 hours sleep). Anyways, I'll wrap this up and publish a post about it.
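
(For reference, the round-robin part boils down to roughly this; the tweepy calls are the standard OAuth flow plus the friends/ids wrapper, and the token list, cursoring, and error handling are simplified away:)

import itertools
import tweepy

CONSUMER_KEY, CONSUMER_SECRET = '...', '...'      # your app credentials
tokens = [('access-token-1', 'access-secret-1'),
          ('access-token-2', 'access-secret-2')]  # one pair per authorized account
target_group_ids = []                             # the ~12K HNer user ids go here

apis = []
for access_token, access_secret in tokens:
    auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
    auth.set_access_token(access_token, access_secret)
    apis.append(tweepy.API(auth))
api_cycle = itertools.cycle(apis)  # round-robin over the authorized clients

for user_id in target_group_ids:
    api = next(api_cycle)          # each friends/ids call goes out on the next token
    friend_ids = api.friends_ids(user_id=user_id)  # first 5000 ids; cursor for more
    # ... store friend_ids in Redis as before ...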

Thank you again Josiah for your support.
Cheers, Alex





Josiah Carlson

Aug 22, 2014, 12:56:00 AM
to redi...@googlegroups.com
On Thu, Aug 21, 2014 at 1:43 AM, Alexandru Stanciu <alexandr...@gmail.com> wrote:
Yep, well this is a huge performance improvement in redis (or performance flaw for previous versions :)

:D
 
I guess generally you don't do this kind of big zunionstores, I don't plan to do it on a regular basis either, it's just for the initial data load phase of my project. Glad it's sorted out. I understand that you can do it differently w/o zunionstore, but I plan to use this command heavily in my system, for much smaller data sets indeed, so I had to understand what was wrong.

I get it :) Generally when I'm approaching a problem, I ask if there is anything that I can do in advance that could help me answer the query I want to perform later. If so, I perform as much precomputation as possible. If you have a fixed target, this gets easier.

As I said, I'm writing a blog post about this exercise and I'm gonna let you know when done.

I'm looking forward to reading it :)
 
Besides this zunionstore issue, there was a challenge to retrieve all those friend ids for the 13K HNers, considering Twitter API rate limits. Briefly, there were 15K calls to the friends/ids method, and at 1 call/min/token you need about 10 days to retrieve all the data. What I did was to extend tweepy so I could provide several tokens to the same API object, and these tokens are used in a round robin fashion transparently. Using about four dozen tokens I reduced the overall retrieval time to 5 hours (2 hours work time, 3 hours sleep). Anyways, I'll wrap this up and publish a post about it.

Thank you again Josiah for your support.
Cheers, Alex

You are welcome. I'm a big fan of fun problems :)

Alexandru Stanciu

Sep 1, 2014, 9:00:45 AM
to redi...@googlegroups.com
Hi Josiah, I finally published the story about this experiment

Here's the detailed technical description on the github repo, with credits to you as well

Cheers, Alex

Alexandru Stanciu

Sep 2, 2014, 4:01:37 AM
to redi...@googlegroups.com
Btw I'd appreciate some upvotes on these two posts. Thanks a lot!
news.ycombinator.com/item?id=8252323 and https://news.ycombinator.com/item?id=8252252

Josiah Carlson

Sep 3, 2014, 12:55:16 PM
to redi...@googlegroups.com
I enjoyed the post, thank you for keeping us updated :)

 - Josiah
