Reading Hadoop sequence file


Kamil Kantar

Mar 16, 2015, 5:24:24 PM
to cdk...@cloudera.org
Hi all,

I want to read a Hadoop sequence file, but in most cases I am only able to read *just* the first key/value pair out of the file. I am reading it like so:

    { openHdfsFile {} }
    { readSequenceFile {} }

    { logInfo { format : "Record: {}", args : ["@{}"] } }
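For context, these commands sit inside a complete morphline file roughly like the following (the id and the importCommands package pattern are illustrative placeholders, not my exact config):

```
morphlines : [
  {
    id : readSeqFile                       # placeholder id
    importCommands : ["org.kitesdk.**"]
    commands : [
      # open the HDFS file whose path is carried in the incoming record
      { openHdfsFile {} }
      # emit one record per key/value pair found in the sequence file
      { readSequenceFile {} }
      # log each resulting record
      { logInfo { format : "Record: {}", args : ["@{}"] } }
    ]
  }
]
```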

Some of my sequence files are read entirely; sometimes I get two key/value pairs, and sometimes (most often) I get just one pair.
My keys look like this:

abcd1234_xyz_sometext_abc_log_20150224110007_iba_MyDomain_MyDomain.log.1
abcd1234_xyz_sometext_abc_log_20150224110007_iba_MyDomain_MyDomain.log.2
abcd1234_xyz_sometext_abc_log_20150224110007_iba_MyDomain_MyDomain.log.3

and the values are usually large entries including Java exceptions / stack traces. Is it possible that a key/value pair is not emitted if the value is too big?

Thank you!


Wolfgang Hoschek

Mar 16, 2015, 6:20:28 PM
to Kamil Kantar, cdk...@cloudera.org
To automatically print diagnostic information, such as which command failed and when, and the content of records as they pass through the morphline commands, consider enabling the TRACE log level, for example by adding the following line to your log4j.properties file:

log4j.logger.org.kitesdk.morphline=TRACE
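If nothing extra shows up, also check that the root logger actually writes to an appender; a minimal log4j.properties sketch (the console appender shown here is just an illustration, adapt it to your setup):

```
log4j.rootLogger=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d %p %c: %m%n
# trace every morphline command invocation and record
log4j.logger.org.kitesdk.morphline=TRACE
```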

--
You received this message because you are subscribed to the Google Groups "CDK Development" group.

Kamil Kantar

Mar 16, 2015, 6:26:23 PM
to cdk...@cloudera.org
Hi Wolfgang,

yes, I have TRACE enabled (I am using Flume to pass events to MorphlineSolrSink with the TRACE log level). However, there is nothing notable in the log. The only issue there is:

2015-03-06 11:21:22,669 DEBUG org.apache.hadoop.util.Shell: Failed to detect a valid hadoop home directory
java.io.IOException: HADOOP_HOME or hadoop.home.dir are not set.

But I guess this does not have anything to do with properly reading the sequence files, right?
No command is failing; rather, it seems to me that reader.next() (in the readSequenceFile source) has nothing left to read even though there are more key/value pairs in the file (I can see them by issuing hadoop fs -text /path/to/sequence/file).

Thank you
-kamil

Kamil Kantar

Mar 17, 2015, 6:36:22 AM
to cdk...@cloudera.org, kamil....@gmail.com
Hi Wolfgang,

it seems that Morphlines has problems reading sequence files with record-level compression, and also files without any compression. The only type I am able to read fully is files with block-level compression; with those I can read all key/value pairs (I am using the default zlib compression codec)...

-kamil