Re: StreamSummary accuracy

148 views
Skip to first unread message

Matt Abrams

unread,
Apr 13, 2013, 9:06:20 PM4/13/13
to stream-...@googlegroups.com
Travis -

There are certain guaranteed properties of the space-saving algorithm implemented in the StreamSummary class but the error rate is dependent ont he distribution of your data.  The paper, http://www.cs.ucsb.edu/research/tech_reports/reports/2005-23.pdf, goes into some detail about the various properties of the algorithm.  

Error enters the StreamSummary as elements thrash in and out of the data structure.  The most frequent items tend to rise to the top of the list and suffer less thrashing (depending ont he distribution of data).  The summary object does track the estimated error on each counter node so you may try processing your data with various capacity settings to get a feel for how the distribution of your data represents itself in the counter.  

As a simple rule of thumb I generally set the capacity to 1.5-2x the number of elements I'd like to track with the understanding that that bottom of the list will contain errors.

Matt



On Fri, Apr 12, 2013 at 11:18 AM, <tra...@massrelevance.com> wrote:
How does the capacity parameter affect the resulting accuracy of the StreamSummary output?

Is there a recommended capacity given a certain size of desired output via the topk method?

thank you

--
You received this message because you are subscribed to the Google Groups "stream-lib-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email to stream-lib-us...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.
 
 

naz.m...@gmail.com

unread,
Apr 15, 2015, 9:33:56 AM4/15/15
to stream-...@googlegroups.com, tra...@massrelevance.com
Hi All,

Q: What does mean "error" field in Counter object ?
Q: How it possible to calculate some counter accuracy based on it value & error value ?
For example:
* StreamSummary capacity is 10
* first item (counter object) returned by topK method looks like in the next way:
item: gDIM, count: 10, error: 9

What is accuracy for item  "gDIM" ?

Thanks in advance & sorry for the simple question.

On Tuesday, April 16, 2013 at 3:10:56 AM UTC+3, tra...@massrelevance.com wrote:
Excellent, Matt. Thank you for the detail.
I recently changed my code to set capacity equal to the actual top k I was looking to retrieve and did indeed see errors.

thanks again
Travis

Ian Barfield

unread,
Apr 15, 2015, 12:35:53 PM4/15/15
to stream-...@googlegroups.com, tra...@massrelevance.com
the error field tracks the maximum possible over-estimation that may be occurring for that key. Translating that into probabilistic error bounds might be possible in those specific cases, but I would defer to the paper. It isn't something I've looked into recently myself.

For more options, visit https://groups.google.com/d/optout.

Reply all
Reply to author
Forward
0 new messages