Querying events by the occurrence of another event

96 views
Skip to first unread message

Marcos Passos

unread,
May 8, 2021, 8:51:45 AM5/8/21
to Druid User
Hi everyone!

We're trying to write a query that correlates multiple events using the data sketches but without success so far.

Generally speaking, the question we're trying to answer is: how many events X occurs daily/monthly where event Y also occurred (X times in the last 7 days) for the same user?

One of the several use cases is to answer questions like: how many orders (orderPlaced event) come from users who view a given blog post (contentViewed event) in the last 7 days? Or how many users upgraded their accounts (accountUpgraded event) up to 15 days after signing up (signUp event)?

We've read everything available out there, but no success yet. Does anyone know if Druid supports queries like that?

Bests,
Marcos

vijay narayanan

unread,
May 8, 2021, 9:21:34 AM5/8/21
to druid...@googlegroups.com
Assuming you have the following structure

__time,userid,event_type

The following query should do it

select  THETA_SKETCH_ESTIMATE(THETA_SKETCH_INTERSECT(theta_blog,theta_order)) from (select ds_theta(userid) filter (where event_type='blogviewed') theta_blog,select ds_theta(userid) filter (where event_type='order_placed') theta_order from source)

you can make the filter more complex by adding a blog name etc. The above will give intersect for the full data source. You could slice it by time extract for week etc. 

vijay

--
You received this message because you are subscribed to the Google Groups "Druid User" group.
To unsubscribe from this group and stop receiving emails from it, send an email to druid-user+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/druid-user/1687cab2-b4b6-452e-bf5e-1ce6f0f9bfb9n%40googlegroups.com.

Marcos Passos

unread,
May 8, 2021, 9:45:26 AM5/8/21
to Druid User
Hi Vijay,

Thanks for the reply.

I tried the suggested query, but it counts distinct users matching both filters, meaning how many users caused both events.

We are looking to answer how many of the "blogViewed" events are associated with those users. Or, in other words, we're trying to count events and not users.

 Is there a way to do it?

vijay narayanan

unread,
May 8, 2021, 2:17:27 PM5/8/21
to druid...@googlegroups.com
the tuple sketch should do it. The below query is on the wikipedia dataset and does two things

1) distinctcount of distinct users by country who contributed to en.wikipedia and es.wikipedia. 
2) mean of the "count" metric by each country (mean = total count/distinct user count)

so if you multiply intersect_mean*intersect_est in the below query you will get count of events of users contributing to both wikipedia channels
This is available only in the native query

{
    "dataSource": "wikipedia1",
    "queryType": "groupBy",
    "intervals": [
        "2016-06-27T00:00:00.000Z/2016-06-27T23:00:00.000Z"
    ],
    "granularity": "all",

    "aggregations": [
  {  "type": "filtered",
     "aggregator":
        {
            "type": "arrayOfDoublesSketch",
            "name": "sketch_en",
            "fieldName": "user",
            "metricColumns" : ["count"],
            "nominalEntries": 65536
        },
        "filter": {
       "type": "selector",
       "dimension": "channel",
       "value": "#en.wikipedia",
       "extractionFn": null
     }
     },
     {  "type": "filtered",
        "aggregator":
           {
               "type": "arrayOfDoublesSketch",
               "name": "sketch_es",
               "fieldName": "user",
               "metricColumns" : ["count"],
               "nominalEntries": 65536
           },
           "filter": {
          "type": "selector",
          "dimension": "channel",
          "value": "#es.wikipedia",
          "extractionFn": null
        }
        }
]
        ,
        "postAggregations": [
          {"type"  : "arrayOfDoublesSketchToMeans",
            "name": "intersect_mean",
            "field":
            {

  "type"  : "arrayOfDoublesSketchSetOp",
  "name": "intersect",
  "operation": "INTERSECT",
  "nominalEntries":65536,
  "fields"  : [{"type": "fieldAccess",
            "fieldName": "sketch_en"},
            {"type": "fieldAccess",
                    "fieldName": "sketch_es"}
            ]
            }
          },
          {"type"  : "arrayOfDoublesSketchToEstimate",
            "name": "intersect_est",
            "field":
            {

  "type"  : "arrayOfDoublesSketchSetOp",
  "name": "intersect",
  "operation": "INTERSECT",
  "nominalEntries":65536,
  "fields"  : [{"type": "fieldAccess",
            "fieldName": "sketch_en"},
            {"type": "fieldAccess",
                    "fieldName": "sketch_es"}
            ]
            }
          }

],
    "dimensions": [
        {
            "type": "default",
            "dimension": "countryName",
            "outputName": "countryName"
        }
    ]
}

Marcos Passos

unread,
May 8, 2021, 2:43:51 PM5/8/21
to Druid User
Hi Vijay,

That looks okay for roughly estimating things like contributors but seems too rough to be useful for counting events like orders. Isn't there any way to count with better accuracy (at least the same as offered by HLL or Theta)?

Bests,
Marcos

vijay narayanan

unread,
May 8, 2021, 3:01:28 PM5/8/21
to druid...@googlegroups.com
not as far as I know. The basic issue with your requirement is that you need to intersect the users and count the events the users participated in.  Essentially you need to estimate metrics along with count. The tuple sketch is the only one that can do this (even here you get count of all events the users participated in, not events of one type alone.

vijay

Luiz Augusto Ferraz

unread,
May 8, 2021, 3:24:27 PM5/8/21
to Druid User
Hi everybody!

I work with Marcos (OP). Thanks for the replies.

This is a good approach for estimating how many events came from a subset of the users based on the average amount of events per user. The problem here is that we are trying to find what affects that average.

From what I understood of your example using Wikipedia, you are doing this:
Calculate how many edits a user does in the Spanish channel. As a mental experience, let's say its 10
Calculate how many users edit both entries in the English and the Spanish channel. For our experience, let's say there are 5000 such users
Then multiply them together to estimate how many edits were done by those users. In our experience, it would be 50k

But that kind of defeats the goal. Translating our goal to the Wikipedia example would be to determine whether users who edit entries in both the Spanish and English channels are more or less active than those who only contribute to the Spanish channel.

By the way, it seems that you can compute the mean with a sum of the "count" column and a count of the entries. What advantage does the "arrayOfDoublesSketchToMeans" from the Tuple Sketch have in this case? Does it aim for something different than an arithmetic average?

If JOINs did not require loading the entire right side of the operation in memory, the SQL query would be something like this:

SELECT SUM(es.count) FROM "#es.wikipedia" es
INNER JOIN (SELECT userId FROM "#en.wikipedia") en ON es.userId = en.userId;

Or in our case:

SELECT COUNT(*) FROM "events" order
INNER JOIN (SELECT userId FROM "events" WHERE type = 'contentViewed') content
ON order.userId = content.userId
WHERE order.type = 'orderPlaced';

Is there any other data model or pre-processing task that would make such queries feasible?

Thanks,
Luiz

vijay narayanan

unread,
May 8, 2021, 10:29:51 PM5/8/21
to druid...@googlegroups.com
the tuple sketch on the wikipedia data set gives the result of below. 

select count(*) from (select "user" from wikipedia1 where user in (select distinct "user" from wikipedia1 where channel='#en.wikipedia')
and channel='#es.wikipedia')
+
select count(*) from (select "user" from wikipedia1 where user in (select distinct "user" from wikipedia1 where channel='#es.wikipedia')
and channel='#en.wikipedia')

vijay narayanan

unread,
May 9, 2021, 12:33:54 AM5/9/21
to druid...@googlegroups.com
one approach you could take is create a an event lookup.....this will keep only the latest user, time of blog_viewed_event.

Then a query like
select count(user) from events where lkp(user,'event_lookup')!='' and type='order_placed'

you will need a jdbc or a kafka lookup. I think this should work but not sure how large the lookup will get. However since the lookup keeps only one event for a user it may be small enough. 

I think it is either the lookup or join. Can't think of any other way.
Reply all
Reply to author
Forward
0 new messages