titan hadoop cassandra data locality


Apoorva Gaurav

Apr 24, 2015, 9:35:02 AM
to aureliu...@googlegroups.com
Hello All,

I'm trying some basic Hadoop jobs over Titan running on Cassandra. So far I've written Gremlin queries rather than MR jobs, and I intend to keep doing so to make things more accessible to the team. The output formats I've used are GraphSONOutputFormat, ScriptOutputFormat and NoOpOutputFormat; with TitanCassandraOutputFormat I was getting issues similar to the ones described here (https://groups.google.com/d/topic/gremlin-users/SjMjlINdTIc/discussion) and here (https://groups.google.com/d/topic/aureliusgraphs/PzE6IXP0z5s/discussion). The properties file looks like:

titan.hadoop.input.format=com.thinkaurelius.titan.hadoop.formats.cassandra.TitanCassandraInputFormat
titan.hadoop.input.conf.storage.backend=cassandra
titan.hadoop.input.conf.storage.hostname=cassandra-host-1,cassandra-host-2,cassandra-host-3
titan.hadoop.input.conf.storage.port=9160
titan.hadoop.input.conf.storage.cassandra.keyspace=graphkeyspace
cassandra.input.partitioner.class=org.apache.cassandra.dht.Murmur3Partitioner


titan.hadoop.sideeffect.format=org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
titan.hadoop.output.format=com.thinkaurelius.titan.hadoop.formats.noop.NoOpOutputFormat
#titan.hadoop.output.format=com.thinkaurelius.titan.hadoop.formats.graphson.GraphSONOutputFormat
#titan.hadoop.output.format=com.thinkaurelius.titan.hadoop.formats.script.ScriptOutputFormat
#titan.hadoop.output.conf.script-file=examples/ScriptOutput.groovy

hadoop.tmp.dir=/data/tmp

What I've observed is that it pulls data from the Cassandra nodes to the server running Gremlin, rather than taking the code to the data (one of the core philosophies behind Hadoop). Is that true? If yes, is it possible to send the code to the data? And if sending the code is not possible, how can the jobs run over multiple nodes?

--
Thanks & Regards,
Apoorva

Daniel Kuppitz

Apr 26, 2015, 4:46:40 PM
to aureliu...@googlegroups.com
You can have Cassandra and Hadoop co-located (each node in the cluster serves as both a Cassandra data node and a Hadoop data node plus task tracker) to reduce network latency. However, you'll also increase the load on your nodes (CPU + memory), which may have a significant impact on the performance of your OLTP queries while the Hadoop jobs are running.

Cheers,
Daniel


--
You received this message because you are subscribed to the Google Groups "Aurelius" group.
To unsubscribe from this group and stop receiving emails from it, send an email to aureliusgraph...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/aureliusgraphs/9de98fed-4020-4f80-b717-1d4802cc5cda%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Apoorva Gaurav

Apr 27, 2015, 9:09:59 AM
to aureliu...@googlegroups.com
Thanks Daniel,

Impacting OLTP performance is not something I can afford.

I couldn't find any sample .properties file for using HDFS. Changing the output location, e.g. titan.hadoop.output.location=hdfs://127.0.0.1:9000/tmp/foutput, seems to work locally. Is this sufficient, or are other changes required? Even with this I suspect that processing is done on the single node on which Gremlin is running, isn't it? How do I run it in a distributed manner?

Daniel Kuppitz

Apr 27, 2015, 9:51:35 AM
to aureliu...@googlegroups.com
If you have an n-node Hadoop cluster, then your OLAP jobs will be processed on n nodes. It's not necessary to specify the full HDFS URL for your output location (if Hadoop is configured properly). Here's a sample shell session using the default configuration files: https://gist.github.com/dkuppitz/8f5580038dc011ed26f0

Note that the default output location is just "jobs".
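In other words, a minimal output section along these lines should do (the overwrite flag here is taken from my 0.5.x defaults, so treat it as an assumption for your version):

```
# A relative location resolves against your HDFS home directory.
titan.hadoop.output.location=jobs
titan.hadoop.output.location.overwrite=true
```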

Cheers,
Daniel



Apoorva Gaurav

Apr 27, 2015, 10:36:53 AM
to aureliu...@googlegroups.com
A noob question: without giving the full path, how does it find the Hadoop cluster's location? There has to be some configuration somewhere; I must be missing something, as I couldn't find it in the default configuration files.





--
Thanks & Regards,
Apoorva

Daniel Kuppitz

Apr 27, 2015, 10:40:54 AM
to aureliu...@googlegroups.com
The default location depends on the current user (it's the user's home directory):

daniel@cube /usr/local/titan-0.5.4-hadoop1 $ hadoop fs -ls
Found 2 items
drwxr-xr-x   - daniel supergroup          0 2015-04-27 15:43 /user/daniel/jobs
drwxr-xr-x   - daniel supergroup          0 2015-04-27 15:43 /user/daniel/titanlib
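The cluster location itself comes from Hadoop's own client configuration on the classpath; a minimal core-site.xml sketch (the namenode host and port here are placeholders, and on Hadoop 1 the property is fs.default.name rather than fs.defaultFS):

```xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode-host:9000</value>
  </property>
</configuration>
```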

Cheers,
Daniel


Apoorva Gaurav

Apr 28, 2015, 9:38:16 AM
to aureliu...@googlegroups.com
Exporting HADOOP_CONF as described in https://github.com/thinkaurelius/faunus/issues/129 and LD_LIBRARY_PATH as described in http://balanceandbreath.blogspot.ca/2013/01/utilnativecodeloader-unable-to-load.html seems to have done the trick. Trying to run some meaningful jobs now; I'll bug you again if I get stuck :-) Thanks a lot for your support.
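For anyone else hitting this, the environment setup was roughly the following; the exact variable names and paths are assumptions from my install, so adjust them to yours:

```shell
# Point the Gremlin/Hadoop client at the cluster configuration and the
# native Hadoop libraries. Paths below are assumptions -- adjust them.
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_CONF_DIR="$HADOOP_HOME/etc/hadoop"        # core-site.xml, hdfs-site.xml
export CLASSPATH="$CLASSPATH:$HADOOP_CONF_DIR"          # lets gremlin see fs.defaultFS
export LD_LIBRARY_PATH="$HADOOP_HOME/lib/native:$LD_LIBRARY_PATH"  # fixes the NativeCodeLoader warning
```

After this, `hdfs` in the Gremlin console should report a DistributedFileSystem rather than a LocalFileSystem.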

On Tue, Apr 28, 2015 at 6:54 AM, Apoorva Gaurav <apoorva...@myntra.com> wrote:
Just checked: gremlin thinks hdfs is the local file system

[sanjay.yadav@nmc-lp5 titan-0.5.4-hadoop2]$ hadoop fs -ls /
Found 1 items
drwxrwxrwt   - hdfs hadoop          0 2015-04-28 06:18 /tmp
[sanjay.yadav@nmc-lp5 titan-0.5.4-hadoop2]$ bin/gremlin.sh 

         \,,,/
         (o o)
-----oOOo-(_)-oOOo-----
gremlin> hdfs
==>org.apache.hadoop.fs.LocalFileSystem@7c25984b

On Tue, Apr 28, 2015 at 6:36 AM, Apoorva Gaurav <apoorva...@myntra.com> wrote:
Thanks Daniel.

I set up a single-node HDFS cluster (the namenode and datanode run on machines different from the Gremlin box) and installed hadoop-client on the box where Gremlin is running. The local user has access to HDFS and can read and write, but the Titan Hadoop job still runs locally.

=================================================================
[sanjay.yadav@nmc-lp5 titan-0.5.4-hadoop2]$ hadoop fs -ls /
Found 1 items
drwxrwxrwt   - hdfs hadoop          0 2015-04-27 19:40 /tmp
[sanjay.yadav@nmc-lp5 titan-0.5.4-hadoop2]$ echo "hello world" > /tmp/hw.txt
[sanjay.yadav@nmc-lp5 titan-0.5.4-hadoop2]$ hadoop fs -copyFromLocal /tmp/hw.txt /tmp/hw_hdfs.txt
[sanjay.yadav@nmc-lp5 titan-0.5.4-hadoop2]$ hadoop fs -ls /
Found 1 items
drwxrwxrwt   - hdfs hadoop          0 2015-04-28 06:18 /tmp
[sanjay.yadav@nmc-lp5 titan-0.5.4-hadoop2]$ hadoop fs -ls /tmp/
Found 1 items
-rw-r--r--   3 sanjay.yadav hadoop         12 2015-04-28 06:18 /tmp/hw_hdfs.txt
[sanjay.yadav@nmc-lp5 titan-0.5.4-hadoop2]$ hadoop fs -copyToLocal /tmp/hw_hdfs.txt /tmp/hw_local.txt
15/04/28 06:18:56 WARN hdfs.DFSClient: DFSInputStream has been closed already
[sanjay.yadav@nmc-lp5 titan-0.5.4-hadoop2]$ cat /tmp/hw
hw_local.txt  hw.txt        
[sanjay.yadav@nmc-lp5 titan-0.5.4-hadoop2]$ cat /tmp/hw_local.txt 
hello world 



=================================================================
[sanjay.yadav@nmc-lp5 titan-0.5.4-hadoop2]$ bin/gremlin.sh 

         \,,,/
         (o o)
-----oOOo-(_)-oOOo-----
06:25:06 WARN  org.apache.hadoop.util.NativeCodeLoader  - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
gremlin> g = HadoopFactory.open("conf/hadoop/lgpgels.hadoop.properties")
==>titangraph[hadoop:titancassandrainputformat->graphsonoutputformat]
gremlin> g.V.has("entity_type","USER").groupCount('{it.entity_id.substring(it.entity_id.indexOf("@"))}')
06:25:12 INFO  com.thinkaurelius.titan.hadoop.compat.h2.Hadoop2Compiler  - Added mapper VerticesMap.Map via ChainMapper with output (class org.apache.hadoop.io.NullWritable,class com.thinkaurelius.titan.hadoop.FaunusVertex); current state is MAPPER
06:25:12 INFO  com.thinkaurelius.titan.hadoop.compat.h2.Hadoop2Compiler  - Added mapper VerticesMap.Map > PropertyFilterMap.Map via ChainMapper with output (class org.apache.hadoop.io.NullWritable,class com.thinkaurelius.titan.hadoop.FaunusVertex); current state is MAPPER
06:25:12 INFO  com.thinkaurelius.titan.hadoop.compat.h2.Hadoop2Compiler  - Configuring 1 MapReduce job(s)...
06:25:12 INFO  com.thinkaurelius.titan.hadoop.compat.h2.Hadoop2Compiler  - Configuring [Job 1/1: VerticesMap.Map > PropertyFilterMap.Map > GroupCountMapReduce.Map > GroupCountMapReduce.Reduce]
06:25:13 INFO  com.thinkaurelius.titan.hadoop.compat.h2.Hadoop2Compiler  - Configured 1 MapReduce job(s)
06:25:13 INFO  com.thinkaurelius.titan.hadoop.compat.h2.Hadoop2Compiler  - Preparing to execute 1 MapReduce job(s)...
06:25:13 INFO  com.thinkaurelius.titan.hadoop.compat.h2.Hadoop2Compiler  - Executing [Job 1/1: VerticesMap.Map > PropertyFilterMap.Map > GroupCountMapReduce.Map > GroupCountMapReduce.Reduce]
06:25:13 WARN  org.apache.hadoop.mapreduce.JobSubmitter  - Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
06:25:16 INFO  org.apache.hadoop.mapreduce.JobSubmitter  - number of splits:1025
06:25:16 INFO  org.apache.hadoop.mapreduce.JobSubmitter  - Submitting tokens for job: job_local938677550_0001
06:25:16 WARN  org.apache.hadoop.conf.Configuration  - file:/tmp/hadoop-sanjay.yadav/mapred/staging/sanjay.yadav938677550/.staging/job_local938677550_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval;  Ignoring.
06:25:16 WARN  org.apache.hadoop.conf.Configuration  - file:/tmp/hadoop-sanjay.yadav/mapred/staging/sanjay.yadav938677550/.staging/job_local938677550_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts;  Ignoring.
06:25:30 WARN  org.apache.hadoop.conf.Configuration  - file:/tmp/hadoop-sanjay.yadav/mapred/local/localRunner/sanjay.yadav/job_local938677550_0001/job_local938677550_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval;  Ignoring.
06:25:30 WARN  org.apache.hadoop.conf.Configuration  - file:/tmp/hadoop-sanjay.yadav/mapred/local/localRunner/sanjay.yadav/job_local938677550_0001/job_local938677550_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts;  Ignoring.
06:25:30 INFO  org.apache.hadoop.mapreduce.Job  - The url to track the job: http://localhost:8080/
06:25:30 INFO  org.apache.hadoop.mapreduce.Job  - Running job: job_local938677550_0001
06:25:30 INFO  org.apache.hadoop.mapred.LocalJobRunner  - OutputCommitter set in config null
06:25:30 INFO  org.apache.hadoop.mapred.LocalJobRunner  - OutputCommitter is org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
06:25:31 INFO  org.apache.hadoop.mapred.LocalJobRunner  - Starting task: attempt_local938677550_0001_m_000000_0
06:25:31 INFO  org.apache.hadoop.mapred.LocalJobRunner  - Waiting for map tasks
06:25:31 INFO  org.apache.hadoop.mapred.Task  -  Using ResourceCalculatorProcessTree : [ ]
06:25:31 INFO  org.apache.hadoop.mapred.MapTask  - Processing split: ColumnFamilySplit((-3539230589804882585, '-3398543054510187228] @[192.168.69.64, 192.168.69.66, 192.168.69.65])
06:25:31 INFO  org.apache.hadoop.mapred.MapTask  - Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
06:25:31 INFO  org.apache.hadoop.mapred.MapTask  - (EQUATOR) 0 kvi 26214396(104857584)
06:25:31 INFO  org.apache.hadoop.mapred.MapTask  - mapreduce.task.io.sort.mb: 100
06:25:31 INFO  org.apache.hadoop.mapred.MapTask  - soft limit at 83886080
06:25:31 INFO  org.apache.hadoop.mapred.MapTask  - bufstart = 0; bufvoid = 104857600
06:25:31 INFO  org.apache.hadoop.mapred.MapTask  - kvstart = 26214396; length = 6553600
06:25:31 INFO  org.apache.hadoop.mapreduce.Job  - Job job_local938677550_0001 running in uber mode : false
06:25:31 INFO  org.apache.hadoop.mapreduce.Job  -  map 0% reduce 0%
06:25:37 INFO  org.apache.hadoop.mapred.LocalJobRunner  - map > map



=================================================================
The properties file is similar to the default:
=================================================================
titan.hadoop.input.format=com.thinkaurelius.titan.hadoop.formats.cassandra.TitanCassandraInputFormat
titan.hadoop.input.conf.storage.backend=cassandrathrift
titan.hadoop.input.conf.storage.hostname=lp1,lp3
titan.hadoop.input.conf.storage.port=9160
titan.hadoop.input.conf.storage.cassandra.keyspace=lgpgelsgraph
cassandra.input.partitioner.class=org.apache.cassandra.dht.Murmur3Partitioner

mapred.max.split.size=5242880
mapred.job.reuse.jvm.num.tasks=-1

titan.hadoop.sideeffect.format=org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
titan.hadoop.output.format=com.thinkaurelius.titan.hadoop.formats.graphson.GraphSONOutputFormat





--
Thanks & Regards,
Apoorva






Matthias Broecheler

May 1, 2015, 3:54:43 PM
to aureliu...@googlegroups.com
It will pull the data to the nodes that are part of your Hadoop cluster, which is coordinated by the central Gremlin instance. It may be that the connection isn't configured correctly and that your one Gremlin instance is using local Hadoop to execute the job, in which case it would indeed end up pulling all the data.

If you co-locate the C* instances with the Hadoop instances, it should read the data locally (or mostly locally).

--