Rattle with Hive (Hadoop) ?

dokondr

Apr 3, 2014, 6:23:09 AM
to rattle...@googlegroups.com
Hello,
I need to analyze lots of data in Hive, which is a Hadoop database that supports basic SQL (https://cwiki.apache.org/confluence/display/Hive).
Questions:
1) Is there any way for Rattle to use a JDBC connection to load data?
2) Is there any way for Rattle to get input data from an R data structure loaded into R before starting Rattle?
3) Are there any other ways to load data from Hive into Rattle?

Thanks!

malcolm stanley

Apr 3, 2014, 10:46:44 AM
to rattle...@googlegroups.com
Could you not use the Hive ODBC connector to establish a DSN to the database?
http://doc.mapr.com/display/MapR/Hive+ODBC+Connector
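
A minimal sketch of what the ODBC route might look like from the R side, assuming the RODBC package is installed and a DSN for Hive has already been configured with the connector. The helper name `load_hive_sample`, the DSN name `HiveDSN`, and the table name `web_logs` are hypothetical and not taken from this thread.

```r
# Hypothetical helper: pull a bounded number of rows from Hive over ODBC.
# Assumes the RODBC package is installed and the DSN already exists.
load_hive_sample <- function(dsn, table, n = 10000) {
  if (!requireNamespace("RODBC", quietly = TRUE)) {
    stop("The RODBC package is required for this sketch.")
  }
  channel <- RODBC::odbcConnect(dsn)
  # odbcConnect() returns -1 (with a warning) when the connection fails.
  if (!inherits(channel, "RODBC")) {
    stop("Could not open a connection to DSN: ", dsn)
  }
  on.exit(RODBC::odbcClose(channel))

  # LIMIT keeps the result small enough to fit in memory.
  query <- sprintf("SELECT * FROM %s LIMIT %d", table, as.integer(n))
  result <- RODBC::sqlQuery(channel, query, stringsAsFactors = FALSE)
  # On failure sqlQuery() returns error messages instead of a data frame.
  if (!is.data.frame(result)) {
    stop(paste(result, collapse = "\n"))
  }
  result
}

# Usage (not run here; needs a working DSN):
# hive_sample <- load_hive_sample("HiveDSN", "web_logs", n = 5000)
```

The returned data frame lives in the R session like any other, so it can be inspected or trimmed before any modelling.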

Scott....@added-value.com

Apr 3, 2014, 6:32:40 PM
to rattle...@googlegroups.com

I guess I have a question about all these questions about getting data in from external sources.

I tend to use Rattle for really basic stuff. I load an SPSS file into a data frame outside of Rattle, then start Rattle and select the data frame as my data source.

For database sources that Rattle supports, is it doing something special with those data sources? For example, pulling special subsets of the data for training and testing, etc. If so… that would be really helpful for situations where the data won’t fit into memory, and I would love to know what features are available for which data sources.

If not, should we just have a blanket response that says… get your data (or a sample of it) into a data frame and select that as the data source? We seem to have an inquiry like this often… but I never know enough about the particular data source they are asking about, or Rattle’s data-source-specific features, to give any sort of definitive suggestions.

Onward and upward,

Scott

--
You received this message because you are subscribed to the Google Groups "rattle-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rattle-users...@googlegroups.com.
To post to this group, send email to rattle...@googlegroups.com.
Visit this group at http://groups.google.com/group/rattle-users.
For more options, visit https://groups.google.com/d/optout.

Graham Williams

Apr 3, 2014, 8:44:34 PM
to rattle-users
As you note, Scott, Rattle can load any data frame defined in the current R session by choosing the R Dataset option on the Data tab. This allows us to use the full suite of data reading functions in R to load data from any source, and then have it available in Rattle. Whether the data is loaded from CSV or ODBC within Rattle or from the R console, Rattle treats it exactly the same once it has been loaded as a data.frame. The Log tab in Rattle shows the actual commands that Rattle runs; for a CSV file, we will see there that it is simply calling read.csv().
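
As a rough illustration of the R Dataset route, the sketch below puts a data frame in the global environment and then starts Rattle. The variable name `hive_sample` is hypothetical, and the built-in `iris` dataset merely stands in for data read from Hive or any other source.

```r
# Any data frame in the global environment is offered by Rattle's
# "R Dataset" option. iris is only a stand-in for externally loaded data.
hive_sample <- iris

# Launch the GUI only in an interactive session with rattle installed.
if (interactive() && requireNamespace("rattle", quietly = TRUE)) {
  library(rattle)
  rattle()  # then choose "R Dataset" on the Data tab and pick hive_sample
}
```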

Graham Williams
http://togaware.com

Scott....@added-value.com

Apr 3, 2014, 8:48:50 PM
to rattle...@googlegroups.com
Thanks for the clarification.

dokondr

Apr 4, 2014, 2:57:34 AM
to rattle...@googlegroups.com
Good news that Rattle can load data from an R data frame. What about data that does not fit into a data frame all at once? Can Rattle process data larger than the available physical memory from an external source such as a file or an ODBC connection? In other words, can Rattle do incremental calculations?

Scott....@added-value.com

Apr 4, 2014, 10:24:04 AM
to rattle...@googlegroups.com
I'm pretty sure what Graham said in his reply amounts to no.

I would pull a sample of your data into memory and run Rattle. If you have luck with a modeling method, check the Rattle Log tab to see which R package it's using. Then decide whether it's possible to refactor that analysis to work on the full data (maybe packages already exist that do so). Or at the very least, set up code to run on multiple samples.

Very manual...

Let us know what you end up doing. I'm very interested. I have analyses I'm debating refactoring for distributed computation but haven't had the stomach to dig in and try it yet.
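
One way the "run on multiple samples" idea might be set up in plain R is sketched below. The helper `fit_on_samples` is hypothetical, and the built-in `mtcars` data with a base-R linear model stands in for whatever model the Rattle log points to. It samples from a data frame already in memory; with data too large for that, each sample would instead be drawn at the source (for example, by a separate query).

```r
# Hypothetical helper: fit the same model on several random samples.
fit_on_samples <- function(data, fit_fn, n_samples = 5, sample_size = 100) {
  lapply(seq_len(n_samples), function(i) {
    # Draw row indices without replacement, capped at the data size.
    rows <- sample(nrow(data), min(sample_size, nrow(data)))
    fit_fn(data[rows, , drop = FALSE])
  })
}

# Example: three samples of 20 rows each, one linear model per sample.
set.seed(42)
fits <- fit_on_samples(mtcars,
                       function(d) lm(mpg ~ wt, data = d),
                       n_samples = 3, sample_size = 20)

# One column of coefficients per sample, for comparing stability.
coefs <- sapply(fits, coef)
```

Comparing the columns of `coefs` gives a rough sense of how much the fitted model changes from sample to sample.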
