dplyr.spark 0.3.0 available


Antonio Piccolboni

Aug 18, 2015, 5:30:24 PM
to rha...@googlegroups.com
This release delivers a fairly complete and usable dplyr backend, with anti-joins and some set operations as the only omissions. The latest version can start and stop a thrift server automatically and can cache tables with the cache function, taking advantage of one of Spark's most distinctive features (a quick sketch of caching follows the example below). More important than any single feature, though, is the nearly full coverage of dplyr, including window functions, exotic joins and whatnot, the growing set of tests supporting the quality of this package, and the large number of bugs squashed. Particularly eye-popping in my opinion is the support for window functions, which makes it possible to write programs like


library(dplyr)        # the verbs used below: tbl, group_by, select, filter
library(dplyr.spark)  # the Spark backend this post announces

my_db = src_SparkSQL()                       # connect to the thrift server
flights = tbl(my_db, "flights")              # remote table: nothing is computed yet
daily = group_by(flights, year, month, day)  # one group per day
bestworst =
  daily %>%
  select(flight, arr_delay) %>%
  filter(arr_delay == min(arr_delay) ||      # min/max become window functions
         arr_delay == max(arr_delay))        # over each year/month/day group
bestworst                                    # printing forces the query to run

Source: Spark at:localhost:10000
From: <derived table> [?? x 5]
Filter: arr_delay == min(arr_delay) || arr_delay == max(arr_delay) 
Grouped by: year, month, day 

   year month day flight arr_delay
1  2013     1   5     15       -45
2  2013     1   5   1109       308
3  2013     6  20   1183       130
4  2013     6  20   1543       -51
5  2013     1   6    665       -61
6  2013     1   6   4304       175
7  2013     6  21   3310       -50
8  2013     6  21     40       367
9  2013     1   7    675       -54
10 2013     1   7    377       368
..  ...   ... ...    ...       ...


which identifies the best and worst flight for each day by arrival delay from a well-known aviation dataset. Behind the disarmingly simple code are both the dplyr-provided R-to-SQL translation, which throws in nested queries, window functions, OVER clauses and all sorts of advanced SQL features to make things happen, and the Spark SQL execution engine, which offers state-of-the-art parallel distributed execution. dplyr.spark only provides some glue between the two and resolves a number of inconsistencies, but the result is more than the sum of its components.
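As for the caching mentioned at the top, here is a minimal sketch reusing the connection from the example. That cache accepts a tbl is my reading of this announcement, not a documented signature, so defer to the package docs:

# Sketch of table caching, assuming the src from the example above.
# NOTE: the exact signature of `cache` is an assumption; see the docs.
my_db = src_SparkSQL()            # starts a thrift server when needed
flights = tbl(my_db, "flights")
cache(flights)                    # pin the table in Spark's in-memory cache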

Installation instructions are in the README file. Enjoy, and please provide feedback.
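For the impatient, a typical devtools-based install looks like the sketch below; the repository path and subdir here are my assumptions, so the README is authoritative:

# Assumption: a GitHub install via devtools; the actual repository
# path and options are in the README, this is just the usual shape.
install.packages("devtools")
devtools::install_github("RevolutionAnalytics/dplyr-spark", subdir = "pkg")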


PS: In case you wonder what the SQL for bestworst is, this is the monster:

SELECT `year`, `month`, `day`, `flight`, `arr_delay`
FROM (SELECT `year`, `month`, `day`, `flight`, `arr_delay`, min(`arr_delay`) OVER (PARTITION BY `year`, `month`, `day` ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS `_w2`, max(`arr_delay`) OVER (PARTITION BY `year`, `month`, `day` ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS `_w3`
FROM (SELECT `year` AS `year`, `month` AS `month`, `day` AS `day`, `flight` AS `flight`, `arr_delay` AS `arr_delay`
FROM `flights`) AS `_w1`) AS `_w4`
WHERE `arr_delay` = `_w2` OR `arr_delay` = `_w3`
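
You don't have to take my word for it: dplyr can print the translated SQL for any remote table. A quick check, assuming the objects from the example above (explain() is standard dplyr; the exact output format varies by version):

# explain() prints the SQL dplyr generates for a remote table,
# followed by the backend's query plan.
explain(bestworst)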

