What do you think about it? Which problems could we have with performance, scalability, and availability?
When we launch, we will contribute to the Redis community by publishing many tutorials and wiki pages on our platform for free. If you are interested, please join our community: http://www.guchex.com
> What do you think about it? Which problems could we have with performance,
> scalability, and availability?
If your product / service is free as in zero cost to use, you will
have all sorts of problems. I'd rethink the business model before
worrying about the technical aspects. How will you earn revenue to
support the efforts?
> We have used the following databases together to deal with our problems:
So it's already implemented, and you're asking our advice after it's done?
> - Users - DynamoDB
It doesn't matter where you store your user database, as long as it's
in a database. You'd also be fine with PostgreSQL, MySQL, or any other
database that can store data on a disk somewhere for subsequent
reading, atomic writes/updates, etc.
> - Content and Search - ElasticSearch (lucene)
Unless Amazon messed this up severely, this will probably work fine.
> What do you think about it? Which problems could we have with performance,
> scalability, and availability?
Redis won't offer you much for machine learning. If you're looking to
gather statistics, calculate co-visitation, ..., Redis would work
fine.
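As a rough illustration of the co-visitation counting mentioned above, here is a minimal sketch. Nested Python dictionaries stand in for one Redis sorted set per item, and the comments name the Redis commands (ZINCRBY, ZREVRANGE) that would play the same role against a real server. The item names and the `covisit:<item>` key layout are assumptions made up for this example.

```python
from collections import defaultdict
from itertools import permutations

# item -> {other_item: count}; a stand-in for one Redis sorted set per item.
covisits = defaultdict(lambda: defaultdict(int))


def record_session(items_viewed):
    """Increment the pair count for every two distinct items seen together."""
    for a, b in permutations(set(items_viewed), 2):
        covisits[a][b] += 1  # with Redis: ZINCRBY covisit:<a> 1 <b>


def also_viewed(item, limit=3):
    """Top co-visited items, highest count first (ZREVRANGE in Redis)."""
    ranked = sorted(covisits[item].items(), key=lambda kv: (-kv[1], kv[0]))
    return [other for other, _ in ranked[:limit]]


# Hypothetical viewing sessions.
record_session(["redis-intro", "sorted-sets", "pipelining"])
record_session(["redis-intro", "sorted-sets"])
record_session(["redis-intro", "storm-basics"])

print(also_viewed("redis-intro"))
```

Each session costs one increment per ordered pair of items, which is the kind of simple counting workload that suits Redis.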
But if you're looking to perform "complicated machine learning
processing", then Redis is not the tool for you. Most machine learning
techniques rely on large matrix multiplication and/or linear
optimization, neither of which can be done efficiently with Redis.
With Redis, you are reading/writing data with a round-trip to a remote
server, which means reading/writing 100k-1M items/second (or 25k-250k
from a single client) against a single server. You won't get any
optimized algorithms for free, which means that you will not be doing
anything "complicated" with any volume of real data.
You are better off using one of the available libraries in your
language of choice, or implementing the algorithms yourself, which will
let you read/write 1B+ items/second (main memory is so much faster than a
network round trip), use optimized algorithms (improving the big-O
runtime), and possibly reuse pre-existing, known-good implementations
of these algorithms.
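To put those rates in perspective, here is a back-of-the-envelope sketch. The 10,000 x 10,000 matrix size is an assumption chosen only for illustration; the item rates are the ones quoted above. It estimates a single pass over the matrix, which is a lower bound, since matrix multiplication touches each entry many times.

```python
# Time to touch every entry of a dense matrix once at the quoted item rates.
# The matrix size is an illustrative assumption, not a figure from the thread.
ROWS = COLS = 10_000
ITEMS = ROWS * COLS  # 100 million entries

RATES_PER_SECOND = {
    "Redis, single client (low)": 25_000,
    "Redis, single client (high)": 250_000,
    "Redis, single server (high)": 1_000_000,
    "main memory, in-process": 1_000_000_000,
}


def seconds_to_touch(items: int, rate: int) -> float:
    """Seconds needed to read or write `items` values at `rate` items/second."""
    return items / rate


estimates = {
    name: seconds_to_touch(ITEMS, rate) for name, rate in RATES_PER_SECOND.items()
}

for name, secs in estimates.items():
    print(f"{name:30s} {secs:10.1f} s")
```

Under these assumptions, one pass takes 100 seconds at 1M items/second over the network and about a tenth of a second in main memory.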
<josiah.carl...@gmail.com> wrote:
> But if you're looking to perform "complicated machine learning
> processing", then Redis is not the tool for you. Most machine learning
> techniques rely on large matrix multiplication and/or linear
> optimization, neither of which can be done efficiently with Redis.
"Out-of-core" linear algebra is expensive, which is why I brought up
the revenue issue. Pretty much the only game in town for packaged
large-scale number-crunching that doesn't involve writing a lot of
code and doesn't cost an arm and a leg is Mahout. Just about
everything else is either proprietary or falls on its hiney once it
goes beyond the capacity of a single machine's RAM and CPU.
We decided to start by reducing the complexity of the process for
now. It seems that our recommendation system and n-clustering
algorithms can get by using just Redis set commands, but we expect
to have a lot of scalability problems with this approach, and we
already have some engineers working with Mahout.
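For readers wondering what a recommendation built from set commands alone might look like, here is a minimal sketch. Plain Python sets stand in for Redis sets, and the comments name the Redis commands (SINTER, SUNION, SDIFF) with the same meaning. The user names, item names, and the choice of Jaccard similarity are assumptions for this example, not a description of the poster's system.

```python
# Plain Python sets stand in for one Redis set of liked items per user.
likes = {
    "alice": {"redis", "python", "storm"},
    "bob": {"redis", "python", "hadoop"},
    "carol": {"cooking", "travel"},
}


def jaccard(a: set, b: set) -> float:
    """Overlap of two sets: size of intersection over size of union."""
    union = a | b  # SUNION
    return len(a & b) / len(union) if union else 0.0  # SINTER


def recommend(user: str, likes: dict) -> list:
    """Rank items liked by similar users that `user` has not liked yet."""
    mine = likes[user]
    scores = {}
    for other, theirs in likes.items():
        if other == user:
            continue
        sim = jaccard(mine, theirs)
        if sim == 0.0:
            continue
        for item in theirs - mine:  # SDIFF
            scores[item] = scores.get(item, 0.0) + sim
    return sorted(scores, key=lambda item: (-scores[item], item))


print(recommend("alice", likes))
```

Note that this compares the user against every other user, so the work grows with the number of users, and against a real server each comparison would also cost at least one round trip. That is one plausible source of the scalability problems mentioned above.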
Our service is exactly a mix of all the features of the social networks
(+tumblr) that you mentioned, but we will only accept content that
meets our guidelines, which cover anything related to knowledge. Of
course, programming will be the most popular topic =)
Thanks,
Vinicius Melo
On Sun, Jul 29, 2012 at 11:28 PM, M. Edward (Ed) Borasky
<zn...@znmeb.net> wrote:
> On Sun, Jul 29, 2012 at 3:57 PM, Josiah Carlson
> <josiah.carl...@gmail.com> wrote:
> Data is the new coal - abundant, dirty and difficult to mine.
> --
> You received this message because you are subscribed to the Google Groups "Redis DB" group.
> To post to this group, send email to redis-db@googlegroups.com.
> To unsubscribe from this group, send email to redis-db+unsubscribe@googlegroups.com.
> For more options, visit this group at http://groups.google.com/group/redis-db?hl=en.
If you're looking for real-time data processing, take a look at
Storm <http://engineering.twitter.com/2011/08/storm-is-coming-more-details-a...>
(open-sourced by Twitter). It lets you run online map-reduce and other
data processing algorithms. It also works well with Redis, since your
computations can use Redis to store the maps on each of the Storm nodes.
For batch-processing style apps, Mahout on top of Hadoop is the way to
go at the moment.
On Sun, Jul 29, 2012 at 7:40 PM, Vinicius Melo <vinicius...@gmail.com>wrote:
> We decided to start by reducing the complexity of the process for
> now. It seems that our recommendation system and n-clustering
> algorithms can get by using just Redis set commands, but we expect
> to have a lot of scalability problems with this approach, and we
> already have some engineers working with Mahout.
> Redis won't offer you much for machine learning. If you're looking to
> gather statistics, calculate co-visitation, ..., Redis would work
> fine.
> But if you're looking to perform "complicated machine learning
> processing", then Redis is not the tool for you. Most machine learning
For probabilistic models, Redis actually works rather well. Sorted sets and
hashes can model frequency tables and sparse feature vectors pretty
efficiently.
Even for more complex stuff, storing the end result in Redis for quick
querying should work fine.
Real-time counting of events is also rather fast in Redis, provided
you have the right strategy to scale Redis beyond your RAM limitations.
A few examples of stuff I've done with Redis in the past 2 years:
1. Wikipedia-based Bayesian classification of texts. Redis was used both to
reduce Wikipedia to (word -> {count(class1), count(class2), ...})
vectors, and to query them in real time.
2. Query-log-based "adult filtering" for queries - both training and
querying.
3. Trending topics detection from RSS news feeds and other sources.
4. Adaptive A/B testing.
And more...
I agree that it has its limitations, but it's pretty damn powerful for a
lot of common uses.
Have you seen the new Redis-based map-reduce framework I linked here a
couple of days ago?
On Mon, Jul 30, 2012 at 12:50 AM, Dvir Volk <dvir...@gmail.com> wrote:
> a few examples of stuff I've done with redis in the past 2 years:
> 1. wikipedia based bayesian classification of texts. Redis was used both to
> reduce wikipedia to (word -> {count(class1), count(class2), ...})
> vectors, and to query them in real time.
> 2. query log based "adult filtering" for queries - both training and
> querying.
> 3. trending topics detection from rss news feeds and other sources.
> 4. adaptive A/B testing.
> and more...
I've done all of those in the past except your #4, though I
don't consider any of them to be "complicated machine learning". I
suppose it's all a matter of opinion.
> I agree that it has its limitations, but it's pretty damn powerful for a lot
> of common uses.
I agree completely. Though from what the OP was saying, they want it
for more than just the basics. Redis will work great for the basics,
and if you are okay with it running slow (in comparison to an in-core
library), you may even be happy about using it for more complicated
scenarios (beyond caching a result matrix from some
clustering/decomposition). But the fact that they are already looking
into Mahout means that they're already looking towards solving bigger
problems.
> have you seen the new redis based map reduce framework I linked here a
> couple of days ago?
> clustering/decomposition). But that they are already looking into
> Mahout means that they're already looking towards solving bigger
> problems.
They are probably looking for unsupervised clustering algorithms. Again,
Redis can perfectly well store the results of map-reduce jobs for real-time querying.
About it being slow compared to in-core stuff: scaling out is always a
trade-off. The advantage of being able to distribute my crunching jobs
while still using a simple single Redis (even if sharded) in the middle
wins back a lot of speed.
The way I usually work is to process small batches that reside in memory (for
example, breaking down N Wikipedia documents), and dump the aggregate to Redis in a
single pipelined query. This works rather well while keeping things simple
and distributed.
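The batch-then-pipeline pattern described above might look roughly like this. The `FakeRedis` and `FakePipeline` classes are hypothetical in-memory stand-ins written so the sketch runs without a server; they mimic only the `pipeline()`, `hincrby(name, key, amount)`, and `execute()` calls that a redis-py client offers. The `word:<word>` key layout and the sample documents are also assumptions.

```python
from collections import Counter, defaultdict


class FakePipeline:
    """Queues HINCRBY calls and applies them all on execute()."""

    def __init__(self, client):
        self._client = client
        self._queued = []

    def hincrby(self, name, key, amount=1):
        self._queued.append((name, key, amount))
        return self

    def execute(self):
        self._client.round_trips += 1  # the whole batch travels together
        results = []
        for name, key, amount in self._queued:
            self._client.hashes[name][key] += amount
            results.append(self._client.hashes[name][key])
        self._queued = []
        return results


class FakeRedis:
    """Hypothetical stand-in for a Redis client; counts round trips."""

    def __init__(self):
        self.hashes = defaultdict(lambda: defaultdict(int))
        self.round_trips = 0

    def pipeline(self):
        return FakePipeline(self)


def flush_batch(client, documents, label):
    """Aggregate word counts in memory, then send them in one pipeline."""
    counts = Counter()
    for doc in documents:
        counts.update(doc.lower().split())
    pipe = client.pipeline()
    for word, n in counts.items():
        pipe.hincrby(f"word:{word}", label, n)
    pipe.execute()
    return len(counts)


client = FakeRedis()
distinct = flush_batch(client, ["Redis is fast", "Redis is simple"], "databases")
print(distinct, client.round_trips, client.hashes["word:redis"]["databases"])
```

Aggregating first means each distinct word is sent once per batch instead of once per occurrence, and the pipeline collapses the whole batch into a single round trip.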
BTW, I'm wondering what can be done to exploit the new bitmap features for
learning algorithms. For example, IDF counts can be easily modeled on them
if the document set is either small or not very sparse.
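One way to read the bitmap idea, sketched under assumptions: keep one bitmap per term, set bit i when document i contains the term, and take the population count as the document frequency. Below, a Python integer stands in for each Redis bitmap, with SETBIT and BITCOUNT named in comments as the corresponding commands. The sample documents and the plain log(N / df) form of IDF are choices made for this example.

```python
import math

# One Python int per term stands in for one Redis bitmap per term:
# bit i is set when document i contains the term.
bitmaps = {}


def index_document(doc_id: int, text: str) -> None:
    """Mark `doc_id` in the bitmap of every distinct term in `text`."""
    for term in set(text.lower().split()):
        bitmaps[term] = bitmaps.get(term, 0) | (1 << doc_id)  # SETBIT <term> <doc_id> 1


def document_frequency(term: str) -> int:
    """Number of documents containing `term` (BITCOUNT in Redis)."""
    return bin(bitmaps.get(term, 0)).count("1")


def idf(term: str, total_docs: int) -> float:
    """Inverse document frequency, log(N / df); 0.0 for unseen terms."""
    df = document_frequency(term)
    return math.log(total_docs / df) if df else 0.0


# Hypothetical document set.
docs = [
    "redis is fast",
    "redis has bitmaps",
    "hadoop is batch oriented",
    "storm is realtime",
]
for doc_id, text in enumerate(docs):
    index_document(doc_id, text)

print(document_frequency("is"), idf("redis", len(docs)))
```

A bitmap needs one bit per document id up to the highest id in use, whether or not the bit is set, which is consistent with the caveat above about the document set being small or not very sparse.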
> > have you seen the new redis based map reduce framework I linked here a
> > couple of days ago?
> I did, though I've not had a reason to use it.
Me neither, but it's on my radar as I'm exploring ways to provide a more
rigid framework for log crunching and learning, and I'm really not a Java
fan, so I have a natural bias against Hadoop and friends. Have you tried
Disco, BTW?