Data frame from HBase scan

Los

Nov 21, 2013, 5:01:06 PM
to rha...@googlegroups.com
Is there an example of how to create a data frame from an HBase scan using rhbase?  Specifically, how would I create a data frame for which 'row.names' corresponds to all the row keys and names refers to all the columns?

Thanks,

Carlos

David Champagne

Nov 21, 2013, 8:21:20 PM
to rha...@googlegroups.com
Take a look at the function hb.get.data.frame: https://github.com/RevolutionAnalytics/rhbase/blob/master/pkg/R/hbase.r

Los

Dec 20, 2013, 4:13:48 PM
to rha...@googlegroups.com
Hi David,

Does this look like an appropriate way to convert an HBase scan into a data frame?


Sorry if it looks weird.  I'm pretty new to R.

Thanks,

Carlos

David Champagne

Dec 28, 2013, 2:51:48 PM
to rha...@googlegroups.com
It should work.  

David Champagne

Jan 22, 2014, 4:19:11 PM
to rha...@googlegroups.com
I took a closer look at your code and would recommend a change so that it supports scans on both a specific set of columns and an entire column family (see below).

# Create a data frame from an HBase scan.
hb.scan.data.frame <- function( tablename, startrow, end=NULL, colspec, sz=hb.defaults("sz"), usz=hb.defaults("usz"),
                                hbc=hb.defaults("hbc") )
{
  scn <- hb.scan( tablename, startrow, end, colspec, sz, usz, hbc )
  # Each element of f is one row: [[1]] row key, [[2]] column names, [[3]] values.
  f <- scn$get()
  # Get the vector of columns from the first row.
  cols <- f[[ 1 ]][[ 2 ]]
  # Collect the values of one column (by its index in cols) across all rows.
  get_column_index_values <- function( column_index )
  {
    # Value of column_name in a single row, or NA if the row does not have it.
    get_value <- function( row, column_name )
    {
      indices <- which( row[[ 2 ]] == column_name )
      index <- ifelse( length( indices ) == 1, indices[[ 1 ]], 0 )
      ifelse( index == 0, NA, row[[ 3 ]][[ index ]] )
    }
    column_name <- cols[[ column_index ]]
    unlist( lapply( f, get_value, column_name ) )
  }
  df <- as.data.frame( lapply( seq_along( cols ), get_column_index_values ) )
  rownames( df ) <- unlist( lapply( f, "[[", 1 ) )
  colnames( df ) <- cols
  df
}

# Examples
# Scan a specific set of columns.
mydf1 <- hb.scan.data.frame( "mytable2", startrow = 1, colspec = c( "f:COL1", "f:COL2" ) )

# Scan an entire column family.
mydf2 <- hb.scan.data.frame( "mytable2", startrow = 1, colspec = "f:" )
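
To try the conversion logic without an HBase connection, here is a minimal sketch that runs on a mocked scan result. It assumes each scanned row is a list of the form (row key, column names, values), which is inferred from how the function above indexes `row[[ 1 ]]`, `row[[ 2 ]]` and `row[[ 3 ]]`. The names `mock_rows` and `rows_to_data_frame` are hypothetical and not part of rhbase.

```r
# Hypothetical stand-in for scn$get(): each row is list(row key, column names, values).
# This layout is an assumption inferred from the indexing in hb.scan.data.frame.
mock_rows <- list(
  list("1", c("f:COL1", "f:COL2"), list("a", "b")),
  list("2", c("f:COL1"), list("c"))
)

# Return the value stored under column_name in one row, or NA if the row lacks it.
get_value <- function(row, column_name) {
  idx <- which(row[[2]] == column_name)
  if (length(idx) == 1) row[[3]][[idx]] else NA
}

# Build a data frame with one row per scanned row and one column per entry in cols.
rows_to_data_frame <- function(rows, cols) {
  columns <- lapply(cols, function(col) {
    unlist(lapply(rows, get_value, column_name = col))
  })
  df <- as.data.frame(columns, stringsAsFactors = FALSE)
  rownames(df) <- unlist(lapply(rows, `[[`, 1))
  colnames(df) <- cols
  df
}

df <- rows_to_data_frame(mock_rows, c("f:COL1", "f:COL2"))
print(df)
```

In this sketch the second row has no `f:COL2`, so that cell comes back as `NA` while the row keys become the row names.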


Los

Jan 22, 2014, 6:07:48 PM
to rha...@googlegroups.com
That sounds like a good idea. However, it assumes that the first row contains every possible column. That may be acceptable when the caller requests all columns within a family. When the caller requests specific columns, though, any column missing from the first row will also be missing from the data frame. What do you think of the following definition of "cols"?

cols <- if ( length( colspec ) == 1 ) f[[ 1 ]][[ 2 ]] else colspec

This way, we can retrieve data frames that look like this:

> hb.scan.data.frame( "mytable2", startrow=1, colspec=c( 'a', 'b', 'c' ) )
   a b  c
1 NA 2  4
2  1 3 NA
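
One detail worth checking in a definition like "cols" is the choice between `ifelse()` and a plain `if`/`else`. `ifelse()` is vectorized over its test, so a length-one test yields a length-one result, whereas `if`/`else` returns the chosen branch unchanged. A minimal sketch with made-up values (`first_row_cols` is a hypothetical stand-in for `f[[ 1 ]][[ 2 ]]`):

```r
colspec <- c("a", "b", "c")      # columns requested by the caller
first_row_cols <- c("b", "c")    # hypothetical columns found in the first row

# ifelse() shapes its result after the test, which has length one here,
# so only the first element of the selected branch is returned.
cols_ifelse <- ifelse(length(colspec) == 1, first_row_cols, colspec)

# A plain if/else returns the whole selected branch.
cols_if <- if (length(colspec) == 1) first_row_cols else colspec

print(cols_ifelse)
print(cols_if)
```

With these inputs, `cols_ifelse` holds a single column name while `cols_if` holds all three, which is what the data frame above needs.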

What do you think?

Thanks,

Carlos

David Champagne

Jan 23, 2014, 2:01:27 PM
to rha...@googlegroups.com
ok by me

Los

Feb 8, 2014, 7:00:15 PM
to rha...@googlegroups.com
I implemented the change you recommended and also created a version of the function that scans using a filter string:


-Carlos

Antonio Piccolboni

May 21, 2014, 1:00:40 PM
to rha...@googlegroups.com
Hi Carlos,
Sorry for the delay. A filter scan is a good addition, but for clarity we need to do one feature at a time, partly because you already had a positive code review on the first one, and partly so that we can test them separately and revert them individually if necessary. That's why we use version control. I would recommend opening one pull request for the data frame conversion, with the code that David reviewed in January, and a separate one for the filter scan. Also, we only accept bug fixes in master, so you need to target dev with your pull requests. I can change the target with a little trickery, but then it looks like I made the contribution, not you, and that's not fair. Thanks


Antonio

Los

May 22, 2014, 1:32:34 AM
to rha...@googlegroups.com
Hi Antonio,

Thanks for the feedback. I closed my original pull request and created two more, one for creating a data frame from a simple scan ( https://github.com/RevolutionAnalytics/rhbase/pull/7 ) and the other for creating a data frame from a scan with a filter expression ( https://github.com/RevolutionAnalytics/rhbase/pull/8 ). Both target the dev branch. Of course, I am open to any other feedback.

Thanks!

-Carlos

Antonio Piccolboni

May 22, 2014, 1:33:33 AM
to RHadoop Google Group
I am reviewing #7 right now. So far so good. Thanks


Antonio

