Data frame from HBase scan

Los

Nov 21, 2013, 5:01:06 PM

to rha...@googlegroups.com

Is there an example of how to create a data frame from an HBase scan using rhbase? Specifically, how would I create a data frame whose 'row.names' are the row keys and whose 'names' are the columns?

Thanks,

Carlos

David Champagne

Nov 21, 2013, 8:21:20 PM

to rha...@googlegroups.com

Take a look at the function hb.get.data.frame: https://github.com/RevolutionAnalytics/rhbase/blob/master/pkg/R/hbase.r

Los

Dec 20, 2013, 4:13:48 PM

to rha...@googlegroups.com

Hi David,

Does this look like an appropriate way to convert an HBase scan into a data frame?

https://github.com/l0s/rhbase/commit/f813dbf872d3e8ce5651158672cf5bcf707632dd#diff-c57cb30464b9ab079b68eb0736bd94aa

Sorry if it looks weird. I'm pretty new to R.

Thanks,

Carlos

David Champagne

Dec 28, 2013, 2:51:48 PM

to rha...@googlegroups.com

It should work.

David Champagne

Jan 22, 2014, 4:19:11 PM

to rha...@googlegroups.com

I took a closer look at your code and would recommend a change so that it supports both scans on a specific set of columns and scans on an entire column family (see below).

# create a data frame from a scan
hb.scan.data.frame <- function( tablename, startrow, end=NULL, colspec, sz=hb.defaults("sz"), usz=hb.defaults("usz"),
                                hbc=hb.defaults("hbc") )
{
  scn <- hb.scan( tablename, startrow, end, colspec, sz, usz, hbc )
  f <- scn$get()

  get_column_index_values <- function( column_index )
  {
    get_value <- function( row, column_name )
    {
      indices <- which( row[[ 2 ]] == column_name )
      index <- ifelse( length( indices ) == 1, indices[[ 1 ]], 0 )
      ifelse( index == 0, NA, row[[ 3 ]][[ index ]] )
    }
    column_name <- cols[[ column_index ]]
    unlist( lapply( f, get_value, column_name ) )
  }

  # get the vector of columns from the first row
  cols <- f[[1]][[2]]
  df <- as.data.frame( lapply( 1:length(cols), get_column_index_values ) )
  rownames( df ) <- unlist( lapply( f, "[[", 1 ) )
  colnames( df ) <- cols
  df
}

# examples
mydf1 <- hb.scan.data.frame( "mytable2", startrow = 1, colspec = c("f:COL1","f:COL2") )
mydf2 <- hb.scan.data.frame( "mytable2", startrow = 1, colspec = 'f:' )
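To try the conversion logic without an HBase connection, here is a minimal sketch that runs the same steps on hand-built rows. It assumes, as the function above does, that each scanned row is a list of (row key, vector of column names, list of values). The names `mock_rows` and `rows_to_data_frame` are made up for illustration and are not part of rhbase.

```r
# Hypothetical stand-in for the result of scn$get(). The row layout
# list( row key, column names, values ) is an assumption inferred from
# how the function above indexes row[[ 1 ]] through row[[ 3 ]].
mock_rows <- list(
  list( "1", c( "f:COL1", "f:COL2" ), list( "a", "b" ) ),
  list( "2", c( "f:COL1", "f:COL2" ), list( "c", "d" ) )
)

# Illustrative helper (not an rhbase function): build a data frame with
# row keys as row names and column names as names.
rows_to_data_frame <- function( rows, cols = rows[[ 1 ]][[ 2 ]] )
{
  # Value of one column in one row, or NA when the row lacks that column
  get_value <- function( row, column_name )
  {
    index <- match( column_name, row[[ 2 ]] )
    if ( is.na( index ) ) NA else row[[ 3 ]][[ index ]]
  }
  # One vector per requested column, with one entry per row
  columns <- lapply( cols, function( column_name )
    unlist( lapply( rows, get_value, column_name ) ) )
  df <- as.data.frame( columns, stringsAsFactors = FALSE )
  rownames( df ) <- unlist( lapply( rows, "[[", 1 ) )
  colnames( df ) <- cols
  df
}

df <- rows_to_data_frame( mock_rows )
print( df )
```

Passing `cols` explicitly lets the caller fix the set of columns up front instead of relying on the first row.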

Los

Jan 22, 2014, 6:07:48 PM

to rha...@googlegroups.com

That sounds like a good idea. However, it assumes that the first row contains all the possible columns. That might be acceptable when the caller requests all columns within a family. But when the caller requests specific columns and the first row does not contain all of them, the missing columns will not appear in the data frame. What do you think of the following definition of "cols"?

cols <- if ( length( colspec ) == 1 ) f[[ 1 ]][[ 2 ]] else colspec

This way, we can retrieve data frames that look like this:

> hb.scan.data.frame( "mytable2", startrow=1, colspec=c( 'a', 'b', 'c' ) )
   a b  c
1 NA 2  4
2  1 3 NA

What do you think?

Thanks,

Carlos
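One detail worth checking when choosing between two vectors in R: `ifelse` returns a result shaped like its test, so a length-one test yields a single element even when the chosen branch is a longer vector, whereas `if`/`else` returns the chosen branch whole. The sketch below shows the difference and then rebuilds the example data frame from mocked rows. The row layout (key, column names, values), the sample values, and every variable name here are assumptions for illustration, not rhbase API.

```r
colspec <- c( "a", "b", "c" )
first_row_cols <- c( "b", "c" )

# ifelse() is vectorized over its test: a scalar test gives a scalar result.
with_ifelse <- ifelse( length( colspec ) == 1, first_row_cols, colspec )

# if/else returns the selected branch unchanged.
with_if <- if ( length( colspec ) == 1 ) first_row_cols else colspec

print( with_ifelse )  # a single element
print( with_if )      # all three requested columns

# Mocked scan rows: the first row lacks column "a", the second lacks "c".
rows <- list(
  list( "1", c( "b", "c" ), list( 2, 4 ) ),
  list( "2", c( "a", "b" ), list( 1, 3 ) )
)

# Value of one column in one row, or NA when the row lacks that column
get_value <- function( row, column_name )
{
  index <- match( column_name, row[[ 2 ]] )
  if ( is.na( index ) ) NA else row[[ 3 ]][[ index ]]
}

columns <- lapply( with_if, function( column_name )
  unlist( lapply( rows, get_value, column_name ) ) )
names( columns ) <- with_if
df <- as.data.frame( columns )
rownames( df ) <- unlist( lapply( rows, "[[", 1 ) )
print( df )

# Possible alternative for whole-family scans: take the union of the
# columns seen in any row instead of trusting the first row.
all_cols <- unique( unlist( lapply( rows, "[[", 2 ) ) )
```

The union comes at the cost of one extra pass over the rows, and the resulting column order follows first appearance rather than the order in `colspec`.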

David Champagne

Jan 23, 2014, 2:01:27 PM

to rha...@googlegroups.com

OK by me.

Los

Feb 8, 2014, 7:00:15 PM

to rha...@googlegroups.com

I implemented the change you recommended and also created a version of the function that scans using a filter string:

https://github.com/RevolutionAnalytics/rhbase/pull/4/files

-Carlos

Antonio Piccolboni

May 21, 2014, 1:00:40 PM

to rha...@googlegroups.com

Hi Carlos,

Sorry for the delay. A filter scan is a good addition, but for clarity we need to handle one feature at a time: the first feature already had a positive code review, and separate changes can be tested separately and reverted individually if necessary. That's why we use version control. I would recommend opening one pull request for the data frame conversion, with the code that David reviewed in January, and a separate one for the filter scan. Also, we only accept bug fixes in master, so you need to target dev with your pull requests. I can change the target with a little trickery, but then it looks like I made the contribution, not you, and that's not fair. Thanks

Antonio

Los

May 22, 2014, 1:32:34 AM

to rha...@googlegroups.com

Hi Antonio,

Thanks for the feedback. I closed my original pull request and created two more, one for creating a data frame from a simple scan ( https://github.com/RevolutionAnalytics/rhbase/pull/7 ) and the other for creating a data frame from a scan with a filter expression ( https://github.com/RevolutionAnalytics/rhbase/pull/8 ). Both target the dev branch. Of course, I am open to any other feedback.

Thanks!

-Carlos

Antonio Piccolboni

May 22, 2014, 1:33:33 AM

to RHadoop Google Group

I am reviewing #7 right now. So far so good. Thanks.

Antonio
