Re: Machine learning processing on Redis - Performance

M. Edward (Ed) Borasky

Jul 29, 2012, 4:58:19 PM
to redi...@googlegroups.com
On Sat, Jul 28, 2012 at 4:39 PM, Vinicius Melo <vinic...@gmail.com> wrote:
> Hello Developers and Team from Redis,
>
> We are building a product that will be a free social platform intended for
> knowledge exchange.
>
> We have used the following databases together to deal with our problems:
>
> - Users - DynamoDB
> - Content and Search - ElasticSearch (lucene)
> - Complicated machine learning processing, and custom algorithms - Redis
>
> What do you think about it? Which problems could we have with performance,
> scalability and availability?

If your product / service is free as in zero cost to use, you will
have all sorts of problems. I'd rethink the business model before
worrying about the technical aspects. How will you earn revenue to
support the efforts?

--
Twitter: http://twitter.com/znmeb Computational Journalism Studio
http://j.mp/CompJournStudio

Data is the new coal - abundant, dirty and difficult to mine.

Josiah Carlson

Jul 29, 2012, 6:57:09 PM
to redi...@googlegroups.com
On Sat, Jul 28, 2012 at 4:39 PM, Vinicius Melo <vinic...@gmail.com> wrote:
> Hello Developers and Team from Redis,
>
> We are building a product that will be a free social platform intended for
> knowledge exchange.

Like Quora? Stack Exchange? Facebook questions? Yahoo answers? Reddit AMA? ...?

> We have used the following databases together to deal with our problems:

So it's already implemented, and you're asking our advice after it's done?

> - Users - DynamoDB

It doesn't matter where you store your user database, as long as it's
in a database. You'd also be fine with PostgreSQL, MySQL, or any other
database that can store data on a disk somewhere for subsequent
reading, atomic writes/updates, etc.

> - Content and Search - ElasticSearch (lucene)

Unless Amazon messed this up severely, this will probably work fine.

> - Complicated machine learning processing, and custom algorithms - Redis
>
> What do you think about it? Which problems could we have with performance,
> scalability and availability?

Redis won't offer you much for machine learning. If you're looking to
gather statistics, calculate co-visitation, ..., Redis would work
fine.

But if you're looking to perform "complicated machine learning
processing", then Redis is not the tool for you. Most machine learning
techniques rely on large matrix multiplication and/or linear
optimization, neither of which can be done efficiently with Redis.
With Redis, you are reading/writing data with a round-trip to a remote
server, which means reading/writing 100k-1M items/second (or 25k-250k
from a single client) against a single server. You won't get any
optimized algorithms for free, which means that you will not be doing
anything "complicated" with any volume of real data.

You are better off using one of the available libraries in your
language of choice, or implementing them yourself, which will let you
read/write 1B+ items/second (main memory is so much faster than a
network roundtrip), use optimized algorithms (improving the big-O
runtime), and could let you use pre-existing known-good implementations
of these algorithms.
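
To put the rates quoted above side by side, here is a rough back-of-the-envelope sketch. The per-second figures are the ones from this message; the workload (a single pass over a dense 10,000 x 10,000 matrix) and the function names are assumptions chosen only for illustration.

```python
# Back-of-the-envelope comparison using the rates quoted in the message.
# The workload size below is an assumption, not a figure from the thread.

def seconds_to_touch(items: int, items_per_second: float) -> float:
    """Time needed to read or write `items` values at a given rate."""
    return items / items_per_second


# Hypothetical workload: one pass over a dense 10,000 x 10,000 matrix.
ITEMS = 10_000 * 10_000  # 100 million values

RATES = {
    "Redis, single client (low)": 25_000,
    "Redis, single client (high)": 250_000,
    "Redis, whole server (high)": 1_000_000,
    "in-process memory": 1_000_000_000,
}


def report() -> dict:
    """Seconds per full pass over the workload, for each quoted rate."""
    return {name: seconds_to_touch(ITEMS, rate) for name, rate in RATES.items()}


if __name__ == "__main__":
    for name, secs in report().items():
        print(f"{name:30s} {secs:>10.1f} s")
```

Under these assumptions a single pass takes on the order of minutes to an hour through one Redis client, versus a fraction of a second in process memory, and an iterative algorithm repeats that pass many times.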

Regards,
- Josiah

M. Edward (Ed) Borasky

Jul 29, 2012, 10:28:28 PM
to redi...@googlegroups.com
On Sun, Jul 29, 2012 at 3:57 PM, Josiah Carlson
<josiah....@gmail.com> wrote:


> But if you're looking to perform "complicated machine learning
> processing", then Redis is not the tool for you. Most machine learning
> techniques rely on large matrix multiplication and/or linear
> optimization, neither of which can be done efficiently with Redis.
> With Redis, you are reading/writing data with a round-trip to a remote
> server, which means reading/writing 100k-1M items/second (or 25k-250k
> from a single client) against a single server. You won't get any
> optimized algorithms for free, which means that you will not be doing
> anything "complicated" with any volume of real data.
>
> You are better off using one of the available libraries in your
> language of choice, or implementing them yourself, which will let you
> read/write 1B+ items/second (main memory is so much faster than a
> network roundtrip), use optimized algorithms (improving the big-O
> runtime), and could let you use pre-existing known-good implementations
> of these algorithms.

"Out-of-core" linear algebra is expensive, which is why I brought up
the revenue issue. Pretty much the only game in town for packaged
large-scale number-crunching that doesn't involve writing a lot of
code and doesn't cost an arm and a leg is Mahout. Just about
everything else is either proprietary or falls on its hiney once it
goes beyond the capacity of a single machine's RAM and CPU.

Vinicius Melo

Jul 29, 2012, 10:40:21 PM
to redi...@googlegroups.com
We decided to start by reducing the complexity of the process for
now. It seems that our recommendation system and n-clustering
algorithms can get by using just Redis set commands, but we expect
to have a lot of scalability problems with this approach. We
already have some engineers working with Mahout.

Our service is exactly a mix of the features of all the social networks
you mentioned (+Tumblr), but we will only accept content that
meets our guidelines, which cover anything related to knowledge. Of
course, programming will be the most popular topic =)

Thanks,
Vinicius Melo

Krishna Gade

Jul 30, 2012, 12:07:24 AM
to redi...@googlegroups.com
If you're looking for real-time data processing, take a look at Storm (open-sourced by Twitter). It lets you run online map-reduce and other data processing algorithms. It also works well with Redis, as your computations can use Redis to store the maps on each of the Storm nodes.

For batch-processing-style apps, Mahout on top of Hadoop is the way to go at the moment.

--
krishna

Dvir Volk

Jul 30, 2012, 3:50:01 AM
to redi...@googlegroups.com
 
> Redis won't offer you much for machine learning. If you're looking to
> gather statistics, calculate co-visitation, ..., Redis would work
> fine.
>
> But if you're looking to perform "complicated machine learning
> processing", then Redis is not the tool for you. Most machine learning

For probabilistic models, Redis actually works rather well. Sorted sets and hashes can model frequency tables and sparse feature vectors pretty efficiently.
Even for more complex stuff, storing the end result in Redis for quick querying should also work fine.
Real-time counting of events is also rather fast in Redis, provided you have the right strategy to scale Redis beyond your RAM limitations.
A few examples of stuff I've done with Redis in the past 2 years:

1. Wikipedia-based Bayesian classification of texts. Redis was used both to reduce Wikipedia to (word -> {count(class1), count(class2), ...}) vectors, and to query them in real time.

2. Query-log-based "adult filtering" for queries - both training and querying.

3. Trending topics detection from RSS news feeds and other sources.

4. Adaptive A/B testing.

and more...

I agree that it has its limitations, but it's pretty damn powerful for a lot of common uses.
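
The (word -> {count(class1), count(class2), ...}) layout from item 1 maps naturally onto one Redis hash per word. What follows is a minimal sketch under stated assumptions, not the code used in those projects: `TinyHashStore` is a hypothetical in-memory stand-in that borrows only the `hincrby` and `hgetall` method names from redis-py so the example runs without a server, and the key names, add-one smoothing, equal class priors, and toy training data are all assumptions.

```python
import math
from collections import defaultdict


class TinyHashStore:
    """Hypothetical in-memory stand-in for the two hash commands used here.

    Method names follow redis-py (hincrby, hgetall). A real client would
    return bytes or strings from hgetall, hence the int() calls below.
    """

    def __init__(self):
        self._hashes = defaultdict(dict)

    def hincrby(self, name, key, amount=1):
        bucket = self._hashes[name]
        bucket[key] = bucket.get(key, 0) + amount
        return bucket[key]

    def hgetall(self, name):
        return dict(self._hashes.get(name, {}))


def train(store, text, label):
    """Fold one labeled document into the per-word class counts."""
    for word in text.lower().split():
        store.hincrby("word:" + word, label)  # count(word, class)
        store.hincrby("totals", label)        # words seen per class
        store.hincrby("vocab", word)          # distinct words, for smoothing


def classify(store, text, labels):
    """Naive Bayes with add-one smoothing and equal class priors (assumed).

    Assumes the store has already been trained on at least one document.
    """
    totals = store.hgetall("totals")
    vocab_size = len(store.hgetall("vocab"))
    scores = {label: 0.0 for label in labels}
    for word in text.lower().split():
        counts = store.hgetall("word:" + word)  # one lookup per word
        for label in labels:
            numerator = int(counts.get(label, 0)) + 1
            denominator = int(totals.get(label, 0)) + vocab_size
            scores[label] += math.log(numerator / denominator)
    return max(scores, key=scores.get)


if __name__ == "__main__":
    store = TinyHashStore()
    train(store, "redis stores keys in memory", "db")
    train(store, "goal scored in the final match", "sports")
    print(classify(store, "redis memory", ["db", "sports"]))
```

Each query costs one hash lookup per word, which is why this kind of model can be served in real time even though training touches the whole corpus.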

Have you seen the new Redis-based map-reduce framework I linked here a couple of days ago?



Josiah Carlson

Jul 30, 2012, 4:27:11 AM
to redi...@googlegroups.com
On Mon, Jul 30, 2012 at 12:50 AM, Dvir Volk <dvi...@gmail.com> wrote:
>> Redis won't offer you much for machine learning. If you're looking to
>> gather statistics, calculate co-visitation, ..., Redis would work
>> fine.
>>
>> But if you're looking to perform "complicated machine learning
>> processing", then Redis is not the tool for you. Most machine learning
>
>
> For probabilistic models redis actually works rather well. sorted sets and
> hashes can model frequency tables and sparse feature vectors pretty
> efficiently.
> Even for more complex stuff, storing the end result in redis for quick
> querying should also work fine.
> Also, real time counting of events is also rather fast in redis, provided
> you have the right strategy to scale redis beyond your RAM limitations.
> a few examples of stuff I've done with redis in the past 2 years:
>
> 1. wikipedia based bayesian classification of texts. Redis was used both to
> reduce wikipedia to (word -> {count(class1), count(class2), ...})
> vectors, and to query them in real time.
>
> 2. query log based "adult filtering" for queries - both training and
> querying.
>
> 3. trending topics detection from rss news feeds and other sources.
>
> 4. adaptive A/B testing.
>
> and more...

I've done all of those except your listed #4 in the past. Though I
don't consider any of them to be "complicated machine learning". I
suppose it's all a matter of opinion.

> I agree that it has its limitations, but it's pretty damn powerful for a lot
> of common uses.

I agree completely. Though from what the op was saying, they want it
for more than just the basics. Redis will work great for the basics,
and if you are okay with it running slow (in comparison to an in-core
library), you may even be happy about using it for more complicated
scenarios (beyond caching a result matrix from some
clustering/decomposition). But that they are already looking into
Mahout means that they're already looking towards solving bigger
problems.

> have you seen the new redis based map reduce framework I linked here a
> couple of days ago?

I did, though I've not had a reason to use it.

Regards,
- Josiah

Dvir Volk

Jul 30, 2012, 4:50:33 AM
to redi...@googlegroups.com
> clustering/decomposition). But that they are already looking into
> Mahout means that they're already looking towards solving bigger
> problems.


They are probably looking for unsupervised clustering algorithms. Again, Redis is perfectly able to store the results of map-reduce jobs for real-time querying.

About it being slow compared to in-core stuff: scaling out is always a trade-off. The advantage of being able to distribute my crunching jobs while still using a simple single Redis (even if sharded) in the middle wins back a lot of speed.

The way I usually work is to process small batches that reside in memory (for example, breaking down N Wikipedia documents) and dump the aggregate to Redis in a single pipelined query. This works rather well while keeping things simple and distributed.
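
The batch-then-flush pattern described above can be sketched roughly as follows. This is an illustration under assumptions, not the actual code: `TinyStore` and `TinyPipeline` are hypothetical in-memory stand-ins that mimic the `pipeline()`, `hincrby()`, and `execute()` method names from redis-py and count simulated round trips, and the key naming is made up for the example.

```python
from collections import Counter, defaultdict


class TinyStore:
    """Hypothetical in-memory stand-in that counts simulated round trips."""

    def __init__(self):
        self.hashes = defaultdict(Counter)
        self.round_trips = 0

    def pipeline(self):
        return TinyPipeline(self)


class TinyPipeline:
    """Queues commands locally, then applies them in one simulated round trip."""

    def __init__(self, store):
        self._store = store
        self._queued = []

    def hincrby(self, name, key, amount=1):
        self._queued.append((name, key, amount))
        return self

    def execute(self):
        self._store.round_trips += 1
        results = []
        for name, key, amount in self._queued:
            self._store.hashes[name][key] += amount
            results.append(self._store.hashes[name][key])
        self._queued = []
        return results


def process_batch(store, documents, label):
    """Aggregate word counts for a batch in local memory, then flush once."""
    local = Counter()
    for doc in documents:
        local.update(doc.lower().split())  # all counting happens in-process

    pipe = store.pipeline()
    for word, count in local.items():
        pipe.hincrby("word:" + word, label, count)  # one increment per distinct word
    pipe.execute()  # single flush for the whole batch
    return len(local)


if __name__ == "__main__":
    store = TinyStore()
    process_batch(store, ["a b a", "b c"], "x")
    print(store.round_trips, dict(store.hashes["word:a"]))
```

The point of the design is that the number of network round trips grows with the number of batches rather than with the number of words, while each worker stays stateless between batches.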

BTW, I'm wondering what can be done to exploit the new bitmap features for learning algorithms. For example, IDF counts can easily be modeled on them if the document set is either small or not very sparse.
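
One way to read the bitmap idea: keep one bitmap per term, set bit d when document d contains the term, and read the document frequency as the number of set bits. In Redis that would correspond to SETBIT and BITCOUNT on a per-term key. The sketch below only simulates this with Python integers so it runs without a server; the class name, the dense 0-based document ids, and the plain log(N/df) formula are assumptions.

```python
import math


class TermBitmaps:
    """One bitmap per term; bit d is set when document d contains the term.

    Python ints stand in for Redis bitmaps here. Document ids are assumed
    to be small, dense, 0-based integers, which is the "small or not very
    sparse" case: the bitmap length grows with the largest document id.
    """

    def __init__(self):
        self._bits = {}
        self.num_docs = 0

    def add_document(self, doc_id, text):
        for term in set(text.lower().split()):
            # Equivalent in spirit to: SETBIT <term key> <doc_id> 1
            self._bits[term] = self._bits.get(term, 0) | (1 << doc_id)
        self.num_docs = max(self.num_docs, doc_id + 1)

    def document_frequency(self, term):
        # Equivalent in spirit to: BITCOUNT <term key>
        return bin(self._bits.get(term, 0)).count("1")

    def idf(self, term):
        """Plain log(N / df); returns 0.0 for unseen terms (a sketch convention)."""
        df = self.document_frequency(term)
        return math.log(self.num_docs / df) if df else 0.0


if __name__ == "__main__":
    index = TermBitmaps()
    index.add_document(0, "redis is fast")
    index.add_document(1, "redis bitmaps")
    index.add_document(2, "hadoop is java")
    print(index.document_frequency("redis"), index.idf("java"))
```

Because setting a bit is idempotent, re-processing the same document does not inflate the counts, which a plain counter would not guarantee.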
 
>> have you seen the new redis based map reduce framework I linked here a
>> couple of days ago?
>
> I did, though I've not had a reason to use it.

Me neither, but it's on my radar as I'm exploring ways to provide a more rigid framework for log crunching and learning, and I'm really not a Java fan, so I have a natural bias against Hadoop and friends. Have you tried Disco, BTW?

André Lage

Jan 21, 2016, 4:08:52 PM
to Redis DB, GSD-UFAL
Hello,

Do you guys hold the same opinions nowadays? 

For example, Krishna, now that Redis Cluster supports partitioning, would you still recommend Storm?

The problem we are addressing is processing huge dense matrices in real time. The amount of data is greater than the local RAM (an out-of-core, or external-memory, algorithm). Moreover, operations on a matrix cell (i,j) require its 8 direct neighbors (i-1,j; i-1,j-1; i,j-1; ...). Would you advise any specific Redis data types (or structures), or any specific way of storing the data?

Thank you in advance,


André Lage.
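
To make the access pattern in the question concrete, here is a small sketch of the 8-neighbor dependency and of how it interacts with storing the matrix in fixed-size blocks. It is purely illustrative and not a recommendation from the thread: the blocked layout, the `tile:<row>:<col>` key naming, and the tile size of 256 are all assumptions.

```python
def neighbors(i, j, rows, cols):
    """The up-to-8 direct neighbors of cell (i, j), clipped at the matrix edge."""
    return [
        (i + di, j + dj)
        for di in (-1, 0, 1)
        for dj in (-1, 0, 1)
        if (di, dj) != (0, 0) and 0 <= i + di < rows and 0 <= j + dj < cols
    ]


def tile_key(i, j, tile=256):
    """Hypothetical key naming: one stored value per tile x tile block."""
    return f"tile:{i // tile}:{j // tile}"


def tiles_needed(i, j, rows, cols, tile=256):
    """Keys that must be available to update cell (i, j) from its neighbors."""
    cells = [(i, j)] + neighbors(i, j, rows, cols)
    return sorted({tile_key(a, b, tile) for a, b in cells})


if __name__ == "__main__":
    # An interior cell needs only its own block...
    print(tiles_needed(10, 10, 1000, 1000))
    # ...while a cell on a block corner needs four blocks.
    print(tiles_needed(256, 256, 1000, 1000))
```

Under this assumed layout, most cells touch a single block and only cells on block borders pull in adjacent blocks, which is the part any storage scheme for this workload has to handle.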