DistributedCache


Curt Holden

Dec 12, 2013, 4:10:41 PM
to pangoo...@googlegroups.com
Is it possible to use the Hadoop DistributedCache with Pangool?  It looks like I need access to the JobConf for this.

Alexei Perelighin

Dec 13, 2013, 3:35:33 AM
to pangool-user
This is the way I did it.

Use the -files option to specify the files to be placed in the distributed cache.
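For example (the jar, class, and file names here are only illustrative; -files is parsed by Hadoop's GenericOptionsParser, so the driver has to run through ToolRunner or parse the generic options itself):

    hadoop jar my-job.jar com.example.MyDriver -files /local/path/lookup.txt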

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;

// Inside a Pangool TupleMapper / TupleReducer, the TupleMRContext
// gives you access to the underlying Hadoop context:
Configuration conf = context.getHadoopContext().getConfiguration();
Path[] paths = DistributedCache.getLocalCacheFiles(conf);
for (Path path : paths) {
    /* PATH SELECTION LOGIC: pick the cached file you need,
       e.g. by matching path.getName() */
}

But Pangool can also work around this, since it already uses the distributed cache to ship serialized objects. All you need to do is read your file into a Serializable object (such as a String, an ArrayList&lt;String&gt;, or a HashMap&lt;String, String&gt;) and pass it as an argument to the constructor of your Reducer or Mapper class, thus avoiding any hassle with direct access to the DistributedCache.
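A minimal sketch of that pattern (the class, field, and tuple field names are made up for illustration; the reduce() signature follows Pangool's TupleReducer API, so check it against your Pangool version):

    import java.io.IOException;
    import java.util.HashMap;

    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;

    import com.datasalt.pangool.io.ITuple;
    import com.datasalt.pangool.tuplemr.TupleMRException;
    import com.datasalt.pangool.tuplemr.TupleReducer;

    public class LookupReducer extends TupleReducer<Text, NullWritable> {

        // Built on the client side and serialized together with this
        // instance, so it reaches the cluster without any DistributedCache calls.
        private final HashMap<String, String> lookup;

        public LookupReducer(HashMap<String, String> lookup) {
            this.lookup = lookup;
        }

        @Override
        public void reduce(ITuple group, Iterable<ITuple> tuples,
                           TupleMRContext context, Collector collector)
                throws IOException, InterruptedException, TupleMRException {
            // The map is simply available as instance state here.
            String extra = lookup.get(group.get("id").toString());
            // ... use "extra" when emitting output tuples ...
        }
    }

On the client side you build the HashMap from your file and call new LookupReducer(map) when wiring the job.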

Thanks,
Alexei


Pere Ferrera

Dec 13, 2013, 3:54:00 AM
to pangoo...@googlegroups.com, alex...@googlemail.com
Hello,

As Alexei said, you can use the DistributedCache. If you want to configure it programmatically, you can use the Configuration object before launching the Job, and then do as Alexei suggests inside your Mappers / Reducers (getHadoopContext() is the key to accessing all Hadoop-based functionality, such as Counters).
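For instance, a rough sketch of the programmatic route (the cache file URI and the job wiring are placeholders; TupleMRBuilder and createJob() are Pangool's standard job-building API):

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.mapreduce.Job;

    import com.datasalt.pangool.tuplemr.TupleMRBuilder;

    public class Driver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Register an HDFS file in the distributed cache before building the job.
            DistributedCache.addCacheFile(new URI("/shared/lookup.txt"), conf);

            // Pangool picks up the prepared Configuration here.
            TupleMRBuilder builder = new TupleMRBuilder(conf, "my-job");
            // ... addIntermediateSchema(...), addInput(...), setTupleReducer(...),
            //     setOutput(...) as usual ...
            Job job = builder.createJob();
            job.waitForCompletion(true);
        }
    }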

For small files, using state objects in Mappers / Reducers is more convenient, as Alexei pointed out.

Curt Holden

Dec 13, 2013, 4:28:58 PM
to pangoo...@googlegroups.com, alex...@googlemail.com

As Alexei suggested, I switched my code to use a serializable data structure that I pass to the reducer. This works well since I am only working with a few tens of MB of data.

Thanks for the good advice,

Curt-
