Perform conj and distinct on a dataset


Punit Naik

May 3, 2017, 2:26:18 AM
to Onyx
Hi Guys

I was wondering if I can perform a "distinct" operation on my dataset like in Spark: rdd.distinct

I have done a group-by on a particular field, and my key/value pairs look like this:

key1 -> {:id 1 :string "punit naik"}
key1 -> {:id 2 :string "punit naij"}
key2 -> {:id 1 :string "punit naik"}
key2 -> {:id 2 :string "punit naij"}

I want to group by the keys, collect (conj) all the records, and then apply a distinct to the collected list of strings, so that I end up with only one list, ["punit naik" "punit naij"], in my final dataset.

I looked at the aggregation docs but did not find anything on performing a "distinct".

Is this achievable in Onyx currently? If yes, how can I do this? 

Mike Drogalis

May 3, 2017, 12:28:51 PM
to Punit Naik, Onyx
Implement a new aggregate (similar to conj) that does distinct and conj in one step: https://github.com/onyx-platform/onyx/blob/0.10.x/src/onyx/windowing/aggregation.cljc#L22-L26
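
A minimal sketch of such an aggregate, assuming the 0.10.x aggregation map shape used by the built-in conj in the linked file (an :aggregation/init function of the window, an :aggregation/create-state-update function of the window and segment, and an :aggregation/apply-state-update function of the window, state, and entry). The names distinct-conj and simulate are made up for illustration, and simulate is only a local stand-in for what a window does, not part of Onyx:

```clojure
;; Hypothetical aggregation, modelled on Onyx's built-in conj.
;; Assumes the 0.10.x function signatures; check them against your version.

(defn distinct-conj-init
  "Initial window state: an empty vector."
  [window]
  [])

(defn distinct-conj-create-state-update
  "Turns a segment into a state-update entry. Only the :string field
  from the example data is kept."
  [window segment]
  (:string segment))

(defn distinct-conj-apply-state-update
  "Appends the entry unless the state already contains it (linear scan)."
  [window state entry]
  (if (some #(= entry %) state)
    state
    (conj state entry)))

(def distinct-conj
  {:aggregation/init distinct-conj-init
   :aggregation/create-state-update distinct-conj-create-state-update
   :aggregation/apply-state-update distinct-conj-apply-state-update})

(defn simulate
  "Folds segments through an aggregation map by hand. Not Onyx itself,
  just a way to try the functions at the REPL."
  [aggregation segments]
  (let [init         (:aggregation/init aggregation)
        create       (:aggregation/create-state-update aggregation)
        apply-update (:aggregation/apply-state-update aggregation)
        window       {}]
    (reduce (fn [state segment]
              (apply-update window state (create window segment)))
            (init window)
            segments)))

(simulate distinct-conj
          [{:id 1 :string "punit naik"}
           {:id 2 :string "punit naij"}
           {:id 1 :string "punit naik"}
           {:id 2 :string "punit naij"}])
;; => ["punit naik" "punit naij"]
```

The vector state preserves arrival order, at the cost of scanning the state once per segment.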

--
You received this message because you are subscribed to the Google Groups "Onyx" group.
To unsubscribe from this group and stop receiving emails from it, send an email to onyx-user+unsubscribe@googlegroups.com.
To post to this group, send email to onyx...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/onyx-user/accd4880-e57b-4742-9397-57c181466e02%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Lucas Bradstreet

May 3, 2017, 12:42:01 PM
to Mike Drogalis, Punit Naik, Onyx
As the aggregation will be performed segment by segment, I would recommend holding the values in a set to avoid an expensive distinct call for each new segment.
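
A rough sketch of the set-based state, assuming the same 0.10.x aggregation keys as the built-in conj. The name distinct-set and the use of the :string field are assumptions taken from the example data, not anything defined by Onyx:

```clojure
;; Hypothetical set-backed aggregation. Because the state is a set,
;; conj ignores values that are already present, so no distinct call
;; is needed per segment.

(defn distinct-set-init
  "Initial window state: an empty set."
  [window]
  #{})

(defn distinct-set-create-state-update
  "Keeps only the :string field of the segment."
  [window segment]
  (:string segment))

(defn distinct-set-apply-state-update
  "Adds the entry to the set; duplicates are a no-op."
  [window state entry]
  (conj state entry))

(def distinct-set
  {:aggregation/init distinct-set-init
   :aggregation/create-state-update distinct-set-create-state-update
   :aggregation/apply-state-update distinct-set-apply-state-update})

;; Applying the updates by hand, outside Onyx, with a nil window:
(reduce (fn [state segment]
          (distinct-set-apply-state-update
           nil state (distinct-set-create-state-update nil segment)))
        (distinct-set-init nil)
        [{:id 1 :string "punit naik"}
         {:id 2 :string "punit naij"}
         {:id 1 :string "punit naik"}])
;; => #{"punit naik" "punit naij"}
```

A set does not keep insertion order, so call vec (or sort) on the state when emitting it if a list is needed downstream.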

Punit Naik

May 3, 2017, 1:15:47 PM
to Lucas Bradstreet, Mike Drogalis, Onyx
Hi Michael, Lucas

I really appreciate your suggestions and I will try to implement them. I couldn't find a "how to contribute" guide on your website, so I will send a PR as is, if that is fine.

--
Thank You

Regards

Punit Naik




Mike Drogalis

May 3, 2017, 1:17:49 PM
to Punit Naik, Lucas Bradstreet, Onyx
These don't need to be added to Onyx core. You can simply define and reference them in your own program.
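
For illustration, referencing a user-defined aggregation from a window entry might look like the following. This assumes, as with the built-in :onyx.windowing.aggregation/conj, that :window/aggregation takes a namespace-qualified keyword naming a var that holds the aggregation map. The namespace my-app.aggregations, the var distinct-set, and the ids :collect-strings and :my-task are all hypothetical:

```clojure
;; Hypothetical window entry for a job map. Only the :window/aggregation
;; keyword differs from using the built-in conj; it points at a var in
;; your own program rather than in Onyx core.
(def windows
  [{:window/id          :collect-strings
    :window/task        :my-task
    :window/type        :global
    :window/aggregation :my-app.aggregations/distinct-set}])
```

The namespace that defines the aggregation has to be loaded on the peers that run the task, just like any other function referenced by keyword.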

Punit Naik

May 3, 2017, 1:25:17 PM
to Mike Drogalis, Onyx, Lucas Bradstreet

Okay sure.
