Sum(), Count(), Random() ?

7 views
Skip to first unread message

yosun

unread,
Apr 14, 2011, 12:23:21 AM4/14/11
to FluidDB Discuss
Is there a way to pull a random entry from a particular userbase/tag
in FluidDB?

Is there a way to perform functions like MySQL Sum(), Count(), and
Random() in FluidDB?

Nick Radcliffe

unread,
Apr 14, 2011, 2:29:18 AM4/14/11
to fluiddb...@googlegroups.com
On 14 Apr 2011, at 05:23, yosun wrote:

> Is there a way to pull a random entry from a particular userbase/tag
> in FluidDB?

If you are asking what I think you are asking, the answer is currently
"no".
I'm assuming your question is, can you ask the system to choose a
random object tagged with some tag and return its value --- something
conceptually like

SELECT * FROM FLUIDINFO WHERE yosun/rating > 6 LIMIT 1

At the moment, you can request all the things that have a tag with a
query like

has yosun/rating

or all the things that have a rating over 6 with a query like

yosun/rating > 6

but you can't just ask for one. (This kind of functionality has been
requested before and, I imagine, is on a request list somewhere.)


> Is there a way to perform functions like MySQL Sum(), Count(), and
> Random() in FluidDB?

No, there are no such functions in Fluidinfo at the moment.

However, my library fdb.py (https://github.com/njr0/fdb.py) does offer
a count function.

For example,

$ fdb count -q 'fluiddb/about matches "yosun"'
0 objects matched
Total: 0 objects

$ fdb count -q 'fluiddb/about matches "njr0"'
7 objects matched
Total: 7 objects

It does this by asking for all the IDs and then counting them, but
it's a start.

It would also be pretty easy for me to add sum and even a random
selection/limit function that would not be perfect, but would simulate
what you want quite well. I might try these over the next week or so.

I guess what I would think about would be something like this:

fdb -q 'has terrycojones/rating' sum /yosun/rating

(to sum the yosun/ratings of the things terrycojones has rated).
There are some subtleties around type --- specifically, the same tag
can have different types on different objects. Probably most people
would agree that integer and real values should be summed in the
natural way (and combined). Strings and booleans would probably
create more disagreement. Personally, I would probably include
booleans (with true = 1 and false = 0) and exclude strings (even if
the string represents a number, like "11"). Obviously, there could
be options to control this. So I think my first implementation would
do

7 + 3.4 + true + false + "Melody Gardot" + "11" + "true"

= 7 + 3.4 + 1 + 0 + + +

= 11.4.

(But I'm pretty sure Jamu (@jkakar), a non-believer in the numeric
equivalents of booleans, would make it 10.0, and could believe that
Xavier (@fxn), a perl golfer, would make it 22.4 on the basis that
there's a perfectly good way to interpret "11" as a number.)

If I were feeling really enthusiastic, I might add a flag, alternative
agg functions or a configuration parameter to control that behaviour.

The random selection could be done a couple of different ways. The
special case of just wanting one could be an 'any' aggregation:

fdb -q 'has terrycojones/rating' any /yosun/rating

but a more general mechanism would be to add an ability to limit the
number of results. So I might use -L or something to give a syntax
such as

fdb -L 10 -q 'has terrycojones/rating' /about

to limit the number of results to 10.

Regards

Nick

yosun

unread,
Apr 14, 2011, 2:34:59 AM4/14/11
to FluidDB Discuss
Does your count function work well for a data set of millions? -
Potential bit of network latency just for a count?

Nick Radcliffe

unread,
Apr 14, 2011, 2:42:57 AM4/14/11
to fluiddb...@googlegroups.com

On 14 Apr 2011, at 07:34, yosun wrote:

> Does your count function work well for a data set of millions? -

I guess it depends what you mean by "well". Obviously, as you say,
it's slow, and could run into the million-records-returned limit.

I would say it's painfully slow but usable; your milage may vary.
Obviously, I am hoping a count function, at least, will get added to
the API at some point.

Small results can be fairly quick (in human terms; not computing
terms), e.g. half a second:

$ time fdb count -q 'has terrycojones/rating'


7 objects matched
Total: 7 objects

real 0m0.479s
user 0m0.096s
sys 0m0.063s

$ time fdb count -q 'fluiddb/about matches "book"'
6313 objects matched
Total: 6313 objects

real 0m3.333s
user 0m0.198s
sys 0m0.078s


As the result set gets bigger, so does the run time.

$ time fdb count -q 'fluiddb/about matches "http"'
208434 objects matched
Total: 208434 objects

real 0m31.760s
user 0m3.563s
sys 0m0.507s

It looks to me as if it's mostly down to the amount of data being
sent. 800k takes about 2 minutes.

> Potential bit of network latency just for a count?

Absolutely. If I knew a better way to do it, I'd use it; but right
now, I think this is the best that can be done.

Regards

Nick

Jamu Kakar

unread,
Apr 14, 2011, 2:59:03 AM4/14/11
to fluiddb...@googlegroups.com, Nick Radcliffe
Hi Nick,

On Thu, Apr 14, 2011 at 8:29 AM, Nick Radcliffe <n...@radcliffe0.com> wrote:
> It would also be pretty easy for me to add sum and even a random
> selection/limit function that would not be perfect, but would simulate what
> you want quite well.   I might try these over the next week or so.

Adding the equivalent of LIMIT on the client side would be nice, but
you need to be careful with OFFSET, since it isn't guaranteed to work
reliably, for at least these reasons:

- Data sent to clients is not ordered in a stable way. It may appear
to be consistently ordered, but this is just by chance. There is no
explicit server-side sorting.

- Data could change, for instance new values could be added, as you're
requesting page after page of results. Ideally, a builtin paging
would maintain consistent results for a short period of time.

Anyway, builtin paging is definitely on the feature list and is an
(obviously) important feature we'd like to have available in
Fluidinfo.

> I guess what I would think about would be something like this:
>
>    fdb -q 'has terrycojones/rating' sum /yosun/rating
>
> (to sum the yosun/ratings of the things terrycojones has rated).   There are
> some subtleties around type --- specifically, the same tag can have
> different types on different objects.   Probably most people would agree
> that integer and real values should be summed in the natural way (and
> combined).   Strings and booleans would probably create more disagreement.
> Personally, I would probably include booleans (with true = 1 and false = 0)
> and exclude strings (even if the string represents a number, like "11").
> Obviously, there could be options to control this.   So I think my first
> implementation would do
>
>   7 + 3.4 + true + false + "Melody Gardot" + "11" + "true"
>
> =  7 + 3.4 + 1    + 0     +                 +      +
>
> =  11.4.
>
> (But I'm pretty sure Jamu (@jkakar), a non-believer in the numeric
> equivalents of booleans, would make it 10.0, and could believe that Xavier
> (@fxn), a perl golfer, would make it 22.4 on the basis that there's a
> perfectly good way to interpret "11" as a number.)

Actually, in this case, I would expect True to be 1 and False to be 0,
like in Python:

>>> sum([True,False,False])
1
>>> sum([True,False,False,True])
2

As Esteve likes to point out, I'm "eventually consistent". :)

Thanks,
J.

Nick Radcliffe

unread,
Apr 14, 2011, 3:06:37 AM4/14/11
to Jamu Kakar, fluiddb...@googlegroups.com

On 14 Apr 2011, at 07:59, Jamu Kakar wrote:

> Hi Nick,
>
> On Thu, Apr 14, 2011 at 8:29 AM, Nick Radcliffe <n...@radcliffe0.com>
> wrote:
>> It would also be pretty easy for me to add sum and even a random
>> selection/limit function that would not be perfect, but would
>> simulate what
>> you want quite well. I might try these over the next week or so.
>
> Adding the equivalent of LIMIT on the client side would be nice, but
> you need to be careful with OFFSET, since it isn't guaranteed to work
> reliably, for at least these reasons:

Yes, excellent point.

> [...]

Cool :-) I perhaps over-interpreted your tweets / our email
exchange. Apologies if you feel I was trying to put (incorrect)
words in your mouth.

Maybe I have Xavi wrong too and All Right-Thinking People(TM) would
expect the answer to be 11.4 :-)

> As Esteve likes to point out, I'm "eventually consistent". :)

There are worse things.

Nick

yosun

unread,
Apr 14, 2011, 3:13:47 AM4/14/11
to FluidDB Discuss
Being able to retrieve table analytics - such as count would be
useful, too.
Reply all
Reply to author
Forward
0 new messages