How are Zipf alpha values calculated?

Andre Costa

Sep 17, 2023, 5:37:32 AM
to cache-trace
Hello, I have been using these real traces as a baseline for comparison with my synthetic workloads.
The goal is to understand how well the synthetic workloads imitate reality, so that I can evaluate how much we can trust the results obtained with them.

My question is: how did you calculate the Zipf alpha values for the traces? I know the access distribution may change over time, but according to my analysis some of the Zipf alpha values look wrong.
For example, over the first 465 million requests of cluster1 I measure a Zipf alpha of 0.22, whereas the value in the statistics is 2.6, which is a huge difference. Even if the skewness of the workload changed later on, it seems unlikely that it changed that drastically.

I might be missing the point entirely, please excuse me if that's the case.

Thank you for your help!

peter.waynechina

Oct 4, 2023, 5:56:18 AM
to cache-trace
Sorry for the late reply. 
The calculation was performed using linear regression on log(freq) vs. log(rank). We have noted that this is not the best or most correct approach from a statistical point of view. However, it was also used in other works, so we chose it in order to be able to compare the results.
 

I have run the analysis again on the 400 million requests from cluster1, and I observe an alpha value of more than 2. Could you double-check your result, or describe your method?
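
For anyone who wants to reproduce this, the log-log regression described above can be sketched roughly as follows. This is only a minimal illustration, not the script used to produce the published statistics. The function names `zipf_alpha_loglog` and `zipf_alpha_from_trace` are hypothetical, and the sketch assumes natural logarithms, ranks starting at 1, and an unweighted least-squares fit over all observed objects.

```python
import math
from collections import Counter


def zipf_alpha_loglog(frequencies):
    """Estimate alpha as the negated least-squares slope of log(freq) vs log(rank).

    Assumption: unweighted fit over every object with a non-zero count.
    """
    # Rank objects from most to least frequent (rank 1 = most popular).
    freqs = sorted((f for f in frequencies if f > 0), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two objects with non-zero frequency")

    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]

    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    # For a Zipf-like workload freq ~ C / rank**alpha, so the slope is -alpha.
    return -sxy / sxx


def zipf_alpha_from_trace(keys):
    """Count requests per key in a trace (any iterable of keys), then fit."""
    return zipf_alpha_loglog(Counter(keys).values())
```

Because every rank carries the same weight, the many low-frequency objects in the tail dominate the fit, so differences in how the tail is handled can change the estimate noticeably.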

peter.waynechina

Oct 4, 2023, 6:00:54 AM
to cache-trace
BTW, cluster1 does not follow a Zipf distribution very well; see the attached plot.


cluster1.png

peter.waynechina

Oct 4, 2023, 6:03:29 AM
to cache-trace
As mentioned in the paper, for a small number of clusters there is a small cache in the application that absorbs requests to hot keys.
Cluster1 might have such a cache, which would cause the most popular objects to receive fewer requests.

Andre Costa

Mar 14, 2024, 2:35:51 PM
to peter.waynechina, cache-trace
I'm sorry for the (very) late reply; your response got lost in my inbox.

Thank you very much for the response :)!

The method I am using for calculating the alpha value is the one illustrated in the first response to this question: https://stats.stackexchange.com/questions/6780/how-to-calculate-zipfs-law-coefficient-from-a-set-of-top-frequencies
What I have observed is that when I analyze a truncated version of the original trace, the alpha value decreases: my measurements show that the shorter the trace, the lower the alpha value. This means that my original claim, that the Zipf alpha values presented on the GitHub page are wrong, is not valid, because I was changing the distribution by truncating the trace.

I will continue to analyze this and update this thread if I find something new.

Again, thank you for your help (and sorry for the delay).


--
You received this message because you are subscribed to the Google Groups "cache-trace" group.
To unsubscribe from this group and stop receiving emails from it, send an email to cache-trace...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/cache-trace/9c919e9c-7c1c-4dbc-8151-55880c0fad6bn%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

peter.waynechina

Mar 14, 2024, 3:33:33 PM
to cache-trace
Interesting. I had never thought that trace length would affect skewness (a property of the workload, independent of trace collection). I think one reason might be that when the trace is short, there are too many one-hit wonders, which causes the curve to deviate from a Zipf distribution.
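
One way to check this hypothesis would be to compute, for several prefixes of the same trace, the number of distinct keys, the fraction of one-hit wonders, and the fitted alpha side by side. The sketch below is only an illustration under assumptions: `prefix_report` and `loglog_alpha` are hypothetical names, the trace is assumed to fit in memory as a list of keys, and alpha is fitted with the same unweighted log(freq) vs. log(rank) regression mentioned earlier in the thread.

```python
import math
from collections import Counter


def loglog_alpha(counts):
    """Negated least-squares slope of log(freq) vs log(rank); needs >= 2 keys."""
    freqs = sorted(counts, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx


def prefix_report(trace, prefix_lengths):
    """Summarize the first n requests of `trace` for each n in `prefix_lengths`."""
    rows = []
    for n in prefix_lengths:
        counts = Counter(trace[:n])
        if len(counts) < 2:
            continue  # a fit needs at least two distinct keys
        # One-hit wonders: keys requested exactly once within this prefix.
        one_hit = sum(1 for c in counts.values() if c == 1)
        rows.append({
            "requests": min(n, len(trace)),
            "distinct_keys": len(counts),
            "one_hit_fraction": one_hit / len(counts),
            "alpha": loglog_alpha(counts.values()),
        })
    return rows


# Toy usage; a real trace would be far longer.
for row in prefix_report(["a", "a", "b", "a", "c", "a", "b"], [3, 7]):
    print(row)
```

If the one-hit fraction rises as the prefix gets shorter while alpha falls, that would be consistent with the explanation above, although it would not prove it.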

Andre Costa

Mar 14, 2024, 3:57:40 PM
to peter.waynechina, cache-trace
Yes! That is my intuition as well. When the workload is truncated, the number of distinct keys is significantly lower, which is consistent with what you said.
