Is Cassandra right for me?

Is Cassandra right for me? Marcelo Elias Del Valle 9/17/12 3:28 PM
Hello,

     I am new to Cassandra and I am unsure whether it is the right technology for the architecture I am defining. Also, I saw a presentation which said that if I don't have rows with more than a hundred rows in Cassandra, either I am doing something wrong or I shouldn't be using Cassandra. Therefore, it might be the case that I am doing something wrong. If you could help me answer these questions by giving any feedback, it would be highly appreciated.
     Here is what I need and what I am thinking of using Cassandra for:
  • I need to support a high volume of writes per second. I might have a billion writes per hour
  • I need to write non-structured data that will be processed later by Hadoop processes to generate structured data from it. Later, I index the structured data using SOLR or SOLANDRA, so the data can be queried by my end user application. Is Cassandra recommended for that, or should I be thinking of writing directly to HDFS files, for instance? What's the main advantage I get from storing data in a NoSQL service like Cassandra, compared to storing files in HDFS?
  • Usually I will write JSON data associated with an ID, and my Hadoop processes will process this data to write data to a database. I have two questions here:
    • If I don't need to perform complicated queries in Cassandra, should I store the JSON-like data just as a column value? I am afraid of doing something wrong here, as I would need just to store the JSON file and 5 or 6 more fields to query the files later.
    • Does it make sense to you to use Hadoop to process data from Cassandra and store the results in a database, like HBase? Once I have structured data, is there any reason I should use Cassandra instead of HBase?
     I am sorry if the questions are too basic. I have been watching a lot of videos and reading a lot of documentation about Cassandra, but honestly, the more I read, the more questions I have.

Thanks in advance.

Best regards,
--
Marcelo Elias Del Valle
http://mvalle.com - @mvallebr
Re: Is Cassandra right for me? aaron morton 9/18/12 4:08 AM
Also, I saw a presentation which said that if I don't have rows with more than a hundred rows in Cassandra, either I am doing something wrong or I shouldn't be using Cassandra.
I do not agree with that statement. (I read that as rows with more than a hundred _columns_.)

  • I need to support a high volume of writes per second. I might have a billion writes per hour
That's about 280K/sec. Netflix did a benchmark that shows 1.1M/sec: http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html
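
For anyone who wants to check the arithmetic, here is a quick back-of-the-envelope sketch. The only inputs are the two figures quoted above (a billion writes per hour, and 1.1M writes/sec from the Netflix benchmark); the variable names are just illustrative, and the comparison says nothing about what hardware a given cluster would need.

```python
# Back-of-the-envelope write rate for "a billion writes per hour".
WRITES_PER_HOUR = 1_000_000_000
SECONDS_PER_HOUR = 60 * 60

writes_per_second = WRITES_PER_HOUR / SECONDS_PER_HOUR
print(f"{writes_per_second:,.0f} writes/sec")  # 277,778 writes/sec

# Compare against the 1.1M writes/sec figure quoted from the Netflix benchmark.
BENCHMARK_WRITES_PER_SECOND = 1_100_000
headroom = BENCHMARK_WRITES_PER_SECOND / writes_per_second
print(f"benchmark is about {headroom:.1f}x the required rate")  # about 4.0x
```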

  • I need to write non-structured data that will be processed later by Hadoop processes to generate structured data from it. Later, I index the structured data using SOLR or SOLANDRA, so the data can be queried by my end user application. Is Cassandra recommended for that, or should I be thinking of writing directly to HDFS files, for instance? What's the main advantage I get from storing data in a NoSQL service like Cassandra, compared to storing files in HDFS?
You can query your data using Hadoop easily enough. You may want to take a look at DSE from http://datastax.com/; it makes using Hadoop and Solr with Cassandra easier.

  • If I don't need to perform complicated queries in Cassandra, should I store the JSON-like data just as a column value? I am afraid of doing something wrong here, as I would need just to store the JSON file and 5 or 6 more fields to query the files later.
Store the data in the way that best supports the read queries you want to make. If you always read all the fields, or it's a canonical record of events, storing it as JSON may be best. If you often read only a few fields, and they may be updated, storing each field as a column value may be best.
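
To make the trade-off concrete, here is a minimal sketch of the two layouts. It assumes nothing about any Cassandra client: plain Python dicts stand in for a row, and the record fields (`user`, `type`, `ts`) and helper names are made up for illustration.

```python
import json

# A hypothetical event record; the field names are illustrative only.
event = {"id": "evt-1", "user": "alice", "type": "click", "ts": 1347900000}

# Layout A: the whole record stored as one JSON column value.
row_as_blob = {"key": event["id"], "columns": {"json": json.dumps(event)}}

# Layout B: one column per field.
row_as_columns = {
    "key": event["id"],
    "columns": {name: str(value) for name, value in event.items() if name != "id"},
}


def read_field_from_blob(row, field):
    # Reading one field still means fetching and parsing the whole value.
    return json.loads(row["columns"]["json"])[field]


def read_field_from_columns(row, field):
    # Reading one field touches only that column.
    return row["columns"][field]


def update_field_in_blob(row, field, value):
    # Updating one field is a read-modify-write of the whole value.
    record = json.loads(row["columns"]["json"])
    record[field] = value
    row["columns"]["json"] = json.dumps(record)


def update_field_in_columns(row, field, value):
    # Updating one field overwrites a single column.
    row["columns"][field] = str(value)


print(read_field_from_blob(row_as_blob, "user"))
print(read_field_from_columns(row_as_columns, "user"))
```

Layout A is simplest when the record is always read and written as a unit; Layout B pays off when individual fields are read or updated on their own.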

  • Does it make sense to you to use hadoop to process data from Cassandra and store the results in a database, like HBase? Once I have structured data, is there any reason I should use Cassandra instead of HBase?
It depends on how many moving parts you are comfortable with. The same goes for the questions about HDFS etc. Start with the smallest amount of infrastructure.

Hope that helps. 

-----------------
Aaron Morton
Freelance Developer
@aaronmorton
Re: Is Cassandra right for me? Hiller, Dean 9/18/12 6:04 AM
I wanted to clarify where that statement on wide rows comes from.

Realize some people make the claim that if you don't have 1000s of columns in "some" rows in Cassandra you are doing something wrong.  This is not true, BUT it comes from the fact that people are setting up indexes.  This is what leads to the very wide row effect.  playOrm is one such library using wide rows like this, BUT it is NOT necessary for all applications.

You can easily use map/reduce on a Cassandra cluster.  If you make a mistake and don't get your model right the first time, you can also map/reduce your dataset into a new model.  This wide row effect is used for indexing 80% of the time.  I draw on playOrm examples a lot, but one table may be partitioned by time so that each month of data is in a partition; you can then have indexes on each partition, allowing you to do quick queries into partitions.
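
As a rough illustration of the idea (not playOrm's actual implementation), the sketch below keeps one sorted index per monthly partition, which is the in-memory analogue of one wide index row per partition. The function names, the `amount` field, and the `YYYY-MM` partition id are all assumptions made for the example.

```python
from bisect import bisect_right, insort
from collections import defaultdict
from datetime import datetime

# partition id -> sorted list of (indexed_value, row_key).
# Each list plays the role of one wide index row.
index_rows = defaultdict(list)


def partition_for(ts: datetime) -> str:
    # One partition per calendar month, e.g. "2012-09".
    return ts.strftime("%Y-%m")


def index_row(row_key: str, amount: int, ts: datetime) -> None:
    # Insert into the month's index while keeping it sorted by value.
    insort(index_rows[partition_for(ts)], (amount, row_key))


def keys_with_amount_above(partition: str, threshold: int) -> list:
    # Only the requested partition's index is consulted.
    entries = index_rows.get(partition, [])
    # The value list is rebuilt here for clarity; a real index would
    # seek directly to the first qualifying column.
    values = [value for value, _ in entries]
    start = bisect_right(values, threshold)
    return [row_key for _, row_key in entries[start:]]


index_row("t1", 5, datetime(2012, 9, 1))
index_row("t2", 50, datetime(2012, 9, 2))
index_row("t3", 10, datetime(2012, 9, 3))
index_row("t4", 20, datetime(2012, 8, 1))
print(keys_with_amount_above("2012-09", 10))  # ['t2']
```

The point of the partitioning is that a query names its partition up front, so the index it scans stays bounded no matter how large the table grows.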

Later,
Dean

Re: Is Cassandra right for me? Marcelo Elias Del Valle 9/18/12 6:51 AM
Aaron,

    Thank you very much for the answers! Helped me a lot!
    I would just like a bit more clarification on the points below, if you don't mind:

  • You can query your data using Hadoop easily enough. You may want to take a look at DSE from http://datastax.com/; it makes using Hadoop and Solr with Cassandra easier.
Actually, if I use the community edition for now, wouldn't I be able to use Hadoop against data stored in CFS? We are considering the enterprise edition here, but the best scenario would be using it only when really needed. Would writes on HDFS be as quick as in Cassandra?

  • It depends on how many moving parts you are comfortable with. The same goes for the questions about HDFS etc. Start with the smallest amount of infrastructure.
Sorry, I didn't really understand this part. I am not sure what you wanted to say, but the question was about using NoSQL instead of a relational database in this case. If learning NoSQL is not a problem, would I have advantages in using Cassandra instead of HBase? If everything in my model fits into a relational database, if my data is structured, would it still be a good idea to use Cassandra? Why?


Thanks,
Marcelo.

Re: Is Cassandra right for me? Marcelo Elias Del Valle 9/18/12 6:53 AM

I will have just 6 columns in my CF, but I will have about a billion writes per hour. In this case, I think Cassandra applies then, by what you are saying.
This answer helped a lot too, thanks! 

Re: Is Cassandra right for me? Hiller, Dean 9/18/12 7:02 AM
Until Aaron replies, here are my thoughts on the relational piece…

           If everything in my model fits into a relational database, if my data is structured, would it still be a good idea to use Cassandra? Why?

The playOrm project explores exactly this issue. A query on 1,000,000 rows in a single partition only took 60ms AND you can do joins with its S-SQL language.  The answer is a resounding YES, you can put relational data in Cassandra.  The writes are way faster than in a DBMS, and joins and SQL can be just as fast and in many cases FASTER on NoSQL IF you partition your data properly.  An S-SQL statement in playOrm looks like this:

PARTITIONS t(:partitionId) SELECT t FROM Trades as t where t.numShares > 10

You can have as many partitions as you want and a single partition can have millions of rows though I would not exceed 10 million probably.

Later,
Dean

Re: Is Cassandra right for me? Marcelo Elias Del Valle 9/18/12 9:51 AM
You're talking about this project, right?  https://github.com/deanhiller/playorm 
I will take a look. However, I don't think using Cassandra's model itself (with CFs / key-values) would be a problem; I just need to know where the advantage lies. From your answer, my guess is that it lies in better performance and more control.

I also saw that if I plan to use DataStax Enterprise to get real-time analytics, my data would need to be stored in Cassandra's usual format. It would be harder for me to use playOrm if I am planning to use advanced DataStax features, like real-time Solr indexing of data in Cassandra without copying columns, wouldn't it? I don't know much about this Solr feature yet, but my understanding today is that it wouldn't be aware of the tables I create with playOrm, just of the column families this framework uses to store the data, right?




Re: Is Cassandra right for me? Hiller, Dean 9/18/12 1:31 PM
Cassandra is fully aware of all tables created with playOrm, and you can still use DataStax Enterprise features to get real-time analytics.  playOrm is a layer on top of Cassandra and, as with any layer, it makes a developer more productive at a slight cost in performance, just like Hibernate on top of JDBC.  In some cases, though, we find that because someone uses S-SQL instead of reading in full rows themselves, it has actually sped up their application. This is kind of unusual when putting a layer on top of the interface to Cassandra.

Also, playOrm is working on an ad-hoc query tool to view all indexes created by playOrm as well as query into all rows in partitions, so you can inspect your data ad hoc much more easily.  CQL can also be used as a complement to S-SQL (playOrm's SQL with partitions) in that you could analyze a full table, but CQL doesn't do joins, requires you to use the equality operator, and has other limitations.  S-SQL is limited to viewing into partitions, which is okay for many OLTP applications.  For analytics, one usually needs to break out of the partitions and look at the more global data set, i.e. map/reduce and CQL help there.

Later,
Dean

Re: Is Cassandra right for me? Hiller, Dean 9/18/12 1:32 PM
Oh, and yes, that is the correct link.

Dean

Re: Is Cassandra right for me? aaron morton 9/20/12 5:59 PM
Actually, if I use community edition for now, I wouldn't be able to use hadoop against data stored in CFS? 
AFAIK DSC is a packaged deployment of Apache Cassandra. You should be able to use Hadoop against it, in the same way you can use Hadoop against Apache Cassandra.

You "can do" anything with computers if you have enough time and patience. DSE reduces the amount of time and patience needed to run Hadoop over Cassandra. Specifically it helps by providing a HDFS and Hive Meta Store that run on Cassandra. This reduces the number of moving parts you need to provision. 

Would writes on HDFS be as quick as in Cassandra?
Yes and no.
HDFS uses a big block size, so while it may absorb writes quickly you may not be able to read them immediately.
Remember you may need a HDFS layer for intermediate results. 
 
would I have advantages in using Cassandra instead of HBase?
Cassandra provides no single point of failure, great scalability, tuneable consistency, a flexible data model and very easy single package deployment. My HBase knowledge is limited, but I would check those points and go with whatever you feel comfortable with. 
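
For readers unfamiliar with "tuneable consistency", one common rule of thumb (stated here as general background, not something taken from this thread) is that a read is guaranteed to overlap the latest successful write when the number of replicas written plus the number of replicas read exceeds the replication factor. A minimal sketch of that arithmetic, with illustrative function names:

```python
def quorum(replication_factor: int) -> int:
    # A majority of the replicas.
    return replication_factor // 2 + 1


def read_overlaps_write(replication_factor: int,
                        write_replicas: int,
                        read_replicas: int) -> bool:
    # True when every read set must share at least one replica
    # with every write set (the R + W > N rule of thumb).
    return write_replicas + read_replicas > replication_factor


rf = 3
print(quorum(rf))                                       # 2
print(read_overlaps_write(rf, quorum(rf), quorum(rf)))  # True
print(read_overlaps_write(rf, 1, 1))                    # False
```

Writing and reading at a single replica is faster but gives up that overlap, which is the trade-off the "tuneable" part refers to.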

If everything in my model fits into a relational database, if my data is structured, would it still be a good idea to use Cassandra? Why?
It's reasonable to use Cassandra for structured data. After a few iterations of development you may find that the current structure is not the best for a non-RDBMS. For example, it's often easier to work with larger entities that violate Normal Form requirements.

There are lots of advantages to using Cassandra, just as there are benefits to using an RDBMS rather than custom flat files. If you feel your project will benefit from those advantages, and/or you are technically curious, I would recommend trying Cassandra.

Choose a small part of your product and create a Proof of Concept; it should only take a week or so. Make as many mistakes as you can as fast as you can, and have fun.

Hope that helps. 

-----------------
Aaron Morton
Freelance Developer
@aaronmorton

Re: Is Cassandra right for me? Marcelo Elias Del Valle 9/21/12 10:27 AM


2012/9/20 aaron morton <aa...@thelastpickle.com>

Actually, if I use community edition for now, I wouldn't be able to use hadoop against data stored in CFS? 
AFAIK DSC is a packaged deployment of Apache Cassandra. You should be able to use Hadoop against it, in the same way you can use Hadoop against Apache Cassandra.

You "can do" anything with computers if you have enough time and patience. DSE reduces the amount of time and patience needed to run Hadoop over Cassandra. Specifically it helps by providing a HDFS and Hive Meta Store that run on Cassandra. This reduces the number of moving parts you need to provision. 

Can I use Brisk with Apache Cassandra, without changing Brisk's or Cassandra's code? To the best of my knowledge, DSE uses Brisk, so I am afraid of writing Hadoop processes now and having to change them when I buy DSE support.

I am not an expert in the Apache 2.0 license, but my understanding is that DataStax modified Apache Cassandra and included those modifications in the version they sell. While I am interested in buying their support, I want to keep compatibility with the mainstream open source version, just in case I want to stop paying for their support at any time.

 
Re: Is Cassandra right for me? Michael Kjellman 9/21/12 10:38 AM
Brisk is no longer actively developed by the original author or Datastax. It was left up for the community.

https://github.com/steeve/brisk

Has a fork that is supposedly compatible with 1.0 API

You're more than welcome to fork that and make it work with 1.1 :)

DSE != (Cassandra + Brisk)



Re: Is Cassandra right for me? Marcelo Elias Del Valle 9/21/12 11:24 AM
Thanks a lot! Things are much more clear now.
