I started using Cascading a few weeks ago and have built a data analytics
engine on top of it. Most of the code I have written so far has run in a
pseudo-distributed environment.
Now that I need to run the application on real data, I have to run the
jobs on a cluster. With the same data, the jobs take almost the same time
in pseudo-distributed mode and on a 6-node cluster.
As a very simple example, I have a validation job that checks individual
fields of each tuple against regular expression patterns. When I run the
job in the distributed environment, it does not scale well. Even with a
large number of mappers, the individual mappers run much slower there than
in pseudo-distributed mode.
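
To make the example concrete, the per-field check is roughly equivalent to the plain-Java sketch below. This is only an illustration under assumptions: it does not use the Cascading API, and the class name, field names, and patterns are hypothetical stand-ins. In this sketch the patterns are compiled once rather than once per record.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

public class FieldValidator {
    // Hypothetical field names and patterns, for illustration only.
    private final Map<String, Pattern> rules = new LinkedHashMap<>();

    public FieldValidator() {
        // Compile each pattern once, up front, not once per record.
        rules.put("id", Pattern.compile("\\d+"));
        rules.put("email", Pattern.compile("[^@\\s]+@[^@\\s]+"));
    }

    // Returns true only if every configured field is present
    // and its whole value matches the field's pattern.
    public boolean isValid(Map<String, String> record) {
        for (Map.Entry<String, Pattern> rule : rules.entrySet()) {
            String value = record.get(rule.getKey());
            if (value == null || !rule.getValue().matcher(value).matches()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        FieldValidator validator = new FieldValidator();
        Map<String, String> record = new HashMap<>();
        record.put("id", "42");
        record.put("email", "user@example.com");
        System.out.println(validator.isValid(record));
    }
}
```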
Is there anything missing in my configuration? When I run similar jobs in
pure MapReduce, they scale well.