Morphline failed to process record


angel

Mar 11, 2015, 11:44:36 AM
to cdk...@cloudera.org
I am trying to index some syslog data with the Flume morphline sink, and I have run into this problem:

11 Mar 2015 15:58:39,086 WARN  [SinkRunner-PollingRunner-DefaultSinkProcessor] (org.apache.flume.sink.solr.morphline.MorphlineHandlerImpl.process:130)  - Morphline /etc/flume-ng/solragent/conf/morphline.conf@morphline1 failed to process record: {Facility=[10], Severity=[6], _attachment_body=[[B@344042d9], host=[topbdii03], priority=[86], producer=[syslog], timestamp=[1426085909000]}

My morphline.conf is below. I have tried three different grok message expressions, but I still get the same error.

morphlines : [
  {
    # identification name for morphline.conf
    id : morphline1

    # Import all morphline commands in these java packages and their
    # subpackages. Other commands that may be present on the classpath are
    # not visible to this morphline.
    #importCommands : ["com.cloudera.**", "org.apache.solr.**"]
    importCommands : ["org.kitesdk.**", "org.apache.solr.**"]

    # Commands that modify the stream file so we can index directly to solr
    commands : [
      {
        # Parse input attachment and emit a record for each input line
        readLine {
          charset : UTF-8
        }
      }

      {
        grok {
          # Consume the output record of the previous command and pipe another
          # record downstream.
          #
          # A grok-dictionary is a config file that contains prefabricated
          # regular expressions that can be referred to by name. grok patterns
          # specify such a regex name, plus an optional output field name.
          # The syntax is %{REGEX_NAME:OUTPUT_FIELD_NAME}
          # The input line is expected in the "message" input field.
          # dictionaryFiles : [src/test/resources/grok-dictionaries]
          dictionaryFiles : [ "/usr/share/doc/search-1.0.0+cdh5.2.0+0/examples/solr-nrt/grok-dictionaries" ]
          expressions : {
          # message : """%{SYSLOGTIMESTAMP:timestamp} %{SYSLOGHOST:hostname} %{DATA:program}(?:\[%{POSINT:pid}\])?: %{GREEDYDATA:msg}"""
           message : """ %{POSINT:facility} %{POSINT:severity} %{DATA:body_attachment} %{SYSLOGHOST:host} %{POSINT:priority} %{DATA:producer} %{SYSLOGTIMESTAMP:timestamp}  %{GREEDYDATA:msg}"""
          # message : """<%{POSINT:syslog_pri}>%{SYSLOGTIMESTAMP:timestamp} %{SYSLOGHOST:host} %{DATA:syslog_program}(?:\[%{POSINT:syslog_pid}\])?: %{GREEDYDATA:syslog_message}"""
          }
        }
      }

      # Consume the output record of the previous command, convert
      # the timestamp, and pipe another record downstream.
      #
      # convert timestamp field to native Solr timestamp format
      # e.g. 2012-09-06T07:14:34Z to 2012-09-06T07:14:34.000Z
      {
        convertTimestamp {
          field : timestamp
          inputFormats : ["yyyy-MM-dd'T'HH:mm:ss.SSS'Z'", "MMM d HH:mm:ss"]
          inputTimezone : Europe/Zurich
          outputFormat : "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"
          outputTimezone : CET
        }
      }

      # Consume the output record of the previous command, transform it
      # and pipe the record downstream.
      #
      # This command deletes record fields that are unknown to Solr
      # schema.xml. Recall that Solr throws an exception on any attempt to
      # load a document that contains a field that isn't specified in
      # schema.xml.
      {
        sanitizeUnknownSolrFields {
          # Location from which to fetch Solr schema
          solrLocator : {
            collection : SystemLogs                                     # Name of solr collection
            zkHost : "myserver:2181/solr"                # ZooKeeper ensemble
          }
        }
      }

      # log the record at INFO level to SLF4J
      { logInfo { format : "output record: {}", args : ["@{}"] } }

      # load the record into a Solr server or MapReduce Reducer
      {
        loadSolr {
          solrLocator : {
            collection : SystemLogs                                     # Name of solr collection
            zkHost : "myserver:2181/solr"                # ZooKeeper ensemble
          }
        }
      }
    ]
  }
]                                                                

The problem is that I cannot see why I am getting this error; the grok expression lists the fields in the proper sequence. I set
log4j.logger.com.cloudera.cdk.morphline=TRACE in log4j.properties, but I did not get any additional log output. Any ideas?

Wolfgang Hoschek

Mar 11, 2015, 12:09:42 PM
to angel, cdk...@cloudera.org
Try 

log4j.logger.org.kitesdk.morphline=TRACE


angel

Mar 11, 2015, 12:51:44 PM
to cdk...@cloudera.org, angell...@gmail.com
Thank you, I got the new log. This is the more detailed error:

11 Mar 2015 17:19:42,764 DEBUG [SinkRunner-PollingRunner-DefaultSinkProcessor] (org.kitesdk.morphline.stdlib.GrokBuilder$Grok.doMatch:202)  - grok failed because it found too few matches for values: [kernel: netlog: /usr/sbin/slapd[8360] TCP 127.0.0.1:2170 <-> 127.0.0.1:43490 (uid=55)] for grok command: {
    # /etc/flume-ng/solragent/conf/morphline.conf: 32

    #  Consume the output record of the previous command and pipe another
    #  record downstream.
    #
    #  A grok-dictionary is a config file that contains prefabricated
    #  regular expressions that can be referred to by name. grok patterns
    #  specify such a regex name, plus an optional output field name.
    #  The syntax is %{REGEX_NAME:OUTPUT_FIELD_NAME}
    #  The input line is expected in the "message" input field.
    #  dictionaryFiles : [src/test/resources/grok-dictionaries]
    "dictionaryFiles" : [
        # /etc/flume-ng/solragent/conf/morphline.conf: 32
        "/usr/share/doc/search-1.0.0+cdh5.2.0+0/examples/solr-nrt/grok-dictionaries"
    ],
    # /etc/flume-ng/solragent/conf/morphline.conf: 33
    "expressions" : {
        # /etc/flume-ng/solragent/conf/morphline.conf: 35

Wolfgang Hoschek

Mar 11, 2015, 1:08:26 PM
to angel, cdk...@cloudera.org
Well, there you see the reason.

angel

Mar 12, 2015, 3:52:20 AM
to cdk...@cloudera.org, angell...@gmail.com
What I do not understand: is the problem the grok dictionary path on line 32, or the sequence of fields in the message expression?

Wolfgang Hoschek

Mar 12, 2015, 4:31:17 AM
to angel, cdk...@cloudera.org
The message string 

kernel: netlog: /usr/sbin/slapd[8360] TCP 127.0.0.1:2170 <-> 127.0.0.1:43490 (uid=55)

can't possibly match the grok expression

%{POSINT:facility} %{POSINT:severity} %{DATA:body_attachment} %{SYSLOGHOST:host} %{POSINT:priority} %{DATA:producer} %{SYSLOGTIMESTAMP:timestamp}  %{GREEDYDATA:msg}

For example, the message string doesn't start with a positive integer, and so on.

You need to make sure that your input data matches the grok expression.
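A quick way to see the mismatch is to test the first token outside the morphline. A minimal sketch in Python (assumption: grok actually compiles its dictionary entries to Java regex, but this simple POSINT stand-in behaves the same under Python's `re`):

```python
import re

# Simplified stand-in for the grok POSINT pattern ("positive integer").
POSINT = r"[1-9][0-9]*"

line = ("kernel: netlog: /usr/sbin/slapd[8360] TCP "
        "127.0.0.1:2170 <-> 127.0.0.1:43490 (uid=55)")

# The expression requires the record to begin with %{POSINT:facility},
# but this line begins with "kernel:", so matching fails right away and
# grok reports "found too few matches".
print(re.match(POSINT, line))  # -> None
```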

angel

Apr 24, 2015, 7:02:15 AM
to cdk...@cloudera.org, angell...@gmail.com
I have the following problem:

grok failed because it found too few matches for values: [kernel: netlog: p:5367 s:29323 pp:1 u:0 g:0 eu:0 eg:0 t:NULL /usr/bin/python UDP ip1:port1 <!> ip2:port2] for grok command: {
grok failed because it found too few matches for values: [kernel: netlog: p:3486 s:3480 pp:1 u:28 g:28 eu:28 eg:28 t:NULL /usr/sbin/nscd UDP [ip1]:port1 -> [ip2]:port2] for grok command:

I have constructed the grok expression like this:

 "message" : "kernel: netlog: p:%(POSINT:pid) s:%(POSINT:sid) pp:%(POSINT:ppid) u:%(POSINT:uid) g:%(POSINT:gid) eu:%(POSINT:euid) eg:%(POSINT:egid) t:%(DATA:tty) %(DATA:path) %(DATA:protocol) %{IP:srcip}:%{INT:srcport} %{DATA:direction} %{IP:dstip}:%{INT:dstport}"

It seems I am still missing something, though. Where am I wrong?

Wolfgang Hoschek

Apr 24, 2015, 9:04:34 AM
to angel, cdk...@cloudera.org
Use curly braces, like p:%{POSINT:pid} instead of p:%(POSINT:pid)
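The braces matter because grok only expands tokens written as %{NAME:field}; with %(...) the characters are treated as literal regex text that the input never contains. Roughly, each %{POSINT:pid} token expands to a named capture group. A sketch of that expansion using Python's `re` for illustration (the POSINT regex here is a simplified stand-in for the dictionary entry):

```python
import re

# Hand-expanded equivalent of "p:%{POSINT:pid} s:%{POSINT:sid}":
pattern = re.compile(r"p:(?P<pid>[1-9][0-9]*) s:(?P<sid>[1-9][0-9]*)")

m = pattern.search("kernel: netlog: p:5367 s:29323 pp:1 u:0 g:0")
print(m.group("pid"), m.group("sid"))  # -> 5367 29323
```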

angel

Apr 24, 2015, 9:24:30 AM
to cdk...@cloudera.org, angell...@gmail.com
And how do I support both ip:port something ip:port and [ip]:port something [ip]:port?
I think that is the problem; I have corrected the curly brackets.

Wolfgang Hoschek

Apr 24, 2015, 9:37:45 AM
to angel, cdk...@cloudera.org
You can use the tryRules command with a separate grok command for each alternative, or use regex quantifiers inside the same regex - http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html
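For the two endpoint shapes, the alternation approach might look like the sketch below, written with Python's `re` for illustration. In an actual grok expression you would alternate around the %{IP:...} token itself, e.g. (\[%{IP:srcip}\]|%{IP:srcip}):%{INT:srcport}; Python forbids reusing a group name, hence the two names here:

```python
import re

# Simplified IPv4 stand-in for the grok IP pattern.
IPV4 = r"(?:\d{1,3}\.){3}\d{1,3}"

# Accept both "1.2.3.4:80" and "[1.2.3.4]:80" via alternation.
endpoint = re.compile(rf"(?:\[(?P<ip_br>{IPV4})\]|(?P<ip>{IPV4})):(?P<port>\d+)")

for s in ["127.0.0.1:2170", "[127.0.0.1]:43490"]:
    m = endpoint.fullmatch(s)
    ip = m.group("ip_br") or m.group("ip")  # whichever branch matched
    print(ip, m.group("port"))
# -> 127.0.0.1 2170
# -> 127.0.0.1 43490
```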

angel

Apr 24, 2015, 10:43:14 AM
to cdk...@cloudera.org, angell...@gmail.com
        "message" : "kernel: netlog: p:%{POSINT:pid} s:%{POSINT:sid} pp:%{POSINT:ppid} u:%{POSINT:uid} g:%{POSINT:gid} eu:%{POSINT:euid} eg:%{POSINT:egid} t:%{DATA:tty} %{DATA:path} %{DATA:protocol} (?:\\[%{IP:srcip}\\]/%{IP:srcip}):%{INT:srcport} %{DATA:direction} (?:\\[%{IP:dstip}\\]/%{IP:dstip}):%{INT:dstport}"

I wrote something like the above, but it does not work. Any ideas?

angel

Apr 27, 2015, 3:15:59 AM
to cdk...@cloudera.org, angell...@gmail.com
Are there any escape characters I should take care of?

grok failed because it found too few matches for values: [kernel: netlog: p:11505 s:11503 pp:1 u:28 g:28 eu:28 eg:28 t:NULL /usr/sbin/nscd UDP [ip1]:52436 <!> [ip2::5]:53] for grok command: {

kernel: netlog: p:%{POSINT:pid} s:%{POSINT:sid} pp:%{POSINT:ppid} u:%{POSINT:uid} g:%{POSINT:gid} eu:%{POSINT:euid} eg:%{POSINT:egid} t:%{DATA:tty} %{DATA:path} %{DATA:protocol} (?:\\[%{IP:srcip}\\]/%{IP:srcip}):%{INT:srcport} %{DATA:direction} (?:\\[%{IP:dstip}\\]/%{IP:dstip}):%{INT:dstport}

Maybe I should use INT instead of POSINT?

angel

Apr 27, 2015, 3:58:18 AM
to cdk...@cloudera.org, angell...@gmail.com
 "message" : "kernel: netlog: p:%{POSINT:pid} s:%{POSINT:sid} pp:%{POSINT:ppid} u:%{POSINT:uid} g:%{POSINT:gid} eu:%{POSINT:euid} eg:%{POSINT:egid} t:%{DATA:tty} %{DATA:path} %{DATA:protocol} (\\[%{IP:srcip}\\]|%{IP:srcip}):%{INT:srcport} %{DATA:direction} (\\[%{IP:dstip}\\]|%{IP:dstip}):%{INT:dstport}"

Even this version of the expression does not work.

angel

Apr 27, 2015, 10:04:19 AM
to cdk...@cloudera.org, angell...@gmail.com
Replying again to myself:
 "message" : "kernel: netlog: p:%{POSINT:pid} s:%{POSINT:sid} pp:%{POSINT:ppid} u:%{POSINT:uid} g:%{POSINT:gid} eu:%{POSINT:euid} eg:%{POSINT:egid} t:%{DATA:tty} %{DATA:path} %{DATA:protocol} %{GREEDYDATA:msg}"

And it worked, so the remaining fields do have some problem. The only other issue is that when the direction field comes through as <!>, it gets imported as a space in Cloudera Search!


 "message" : "kernel: netlog: p:%{POSINT:pid} s:%{POSINT:sid} pp:%{POSINT:ppid} u:%{POSINT:uid} g:%{POSINT:gid} eu:%{POSINT:euid} eg:%{POSINT:egid} t:%{DATA:tty} %{DATA:path} %{DATA:protocol} (\\[%{IP:srcip}\\]|%{IP:srcip}):%{INT:srcport} %{DATA:direction} (\\[%{IP:dstip}\\]|%{IP:dstip}):%{INT:dstport}"

I have tried regexes and they do not work. I have both IPv4 and IPv6 addresses. Maybe the grok IP pattern does not support IPv6? That would be the only remaining issue with the command.
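One way to check the IPv6 hypothesis outside the morphline is to try both address forms against the patterns directly. A sketch in Python (both regexes are simplified stand-ins for grok dictionary entries, and 2001:db8::5 is just an illustrative documentation address; if the shipped dictionary's IP pattern only covers IPv4, an IPv6 alternative has to be added the same way):

```python
import re

# Dotted-quad IPv4 stand-in, as used earlier in the thread.
IPV4 = r"(?:\d{1,3}\.){3}\d{1,3}"
# Deliberately loose IPv6 stand-in: hex groups and colons, "::" allowed.
IPV6 = r"[0-9A-Fa-f:]*:[0-9A-Fa-f:]+"

# An IPv4-only pattern cannot match an IPv6 address...
print(bool(re.fullmatch(IPV4, "2001:db8::5")))                  # -> False
# ...but alternating the two forms covers both address families.
print(bool(re.fullmatch(f"(?:{IPV6}|{IPV4})", "2001:db8::5")))  # -> True
print(bool(re.fullmatch(f"(?:{IPV6}|{IPV4})", "127.0.0.1")))    # -> True
```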

som

Apr 25, 2016, 4:22:59 AM
to CDK Development, angell...@gmail.com
Hi,

Could you please help me get past this error?

grok failed because it found too few matches for values: [How Startups Can Utilize Big Data in 2016 @TechCoHQ @NateMVickery #BigData #startup https://t.co/8yTDFmoH3k https://t.co/gxLQqD6xxr Wed Apr 20 16:14:14 IST 2016]

"message" : "%{DATA:msg}%{SYSLOGTIMESTAMP:timestamp}"

Thanks in advance,
som

Wolfgang Hoschek

Apr 25, 2016, 12:31:44 PM
to som, CDK Development, angell...@gmail.com
The trailing string does not conform to SYSLOGTIMESTAMP format: Wed Apr 20 16:14:14 IST 2016 
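The difference is visible by matching the two shapes directly. A sketch in Python (both regexes are simplified stand-ins for the grok dictionary entries; the trailer looks like java.util.Date.toString() output, "EEE MMM dd HH:mm:ss zzz yyyy"):

```python
import re

# SYSLOGTIMESTAMP covers e.g. "Apr 20 16:14:14": month, day, time only.
SYSLOGTIMESTAMP = r"[A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2}"
# The tweet trailer adds day-of-week, a zone abbreviation, and a year.
DATE_TOSTRING = rf"[A-Z][a-z]{{2}} {SYSLOGTIMESTAMP} [A-Z]{{2,5}} \d{{4}}"

trailer = "Wed Apr 20 16:14:14 IST 2016"
print(bool(re.fullmatch(SYSLOGTIMESTAMP, trailer)))  # -> False
print(bool(re.fullmatch(DATE_TOSTRING, trailer)))    # -> True
```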

Wolfgang

som

Apr 26, 2016, 2:47:48 AM
to CDK Development, sriram...@gmail.com, angell...@gmail.com
Thanks for the reply.

Could you please tell me which format to use? I have used DATESTAMP_OTHER, but the problem still exists.

Thanks in advance