I'm looking for an approach to the problem of analyzing application
log files.
I need to analyse Java log files from applications (i.e., not logs of
web servers). These logs contain Java exceptions, thread dumps, and
free-form log4j messages issued by log statements inserted by
programmers during development. Right now, these man-made log entries
do not have any specific format.
What I'm looking for is a tool and/or strategy that supports in lexing/
parsing, tagging, and analysing the log entries. Because there is only
little defined syntax and grammar - and because you might not know
what you are looking for - the task requires the quick issuing of
queries against the log data base. Some sort of visualization would be
nice, too.
Pointers to existing tools and approaches as well as appropriate tools/
algorithms to develop the required system would be welcome.
Ulrich
I once did a project for our Ruby Best Practices blog. The code is
over there at github:
https://github.com/rklemme/muppet-laboratories
Explanations can be found in the blog. This is the first posting of
the series:
http://blog.rubybestpractices.com/posts/rklemme/005_Enter_the_Muppet_Laboratories.html
This works different from what you want: log files are read and
written out to small log files according to particular criteria. But
you could reuse the parsing part (including detection of multi line
log statements) and write what you found into a relational database.
If you have it in the DB you can query for at least timestamp, log
level, message content and probably also thread id and class. If you
want to do custom tagging you could do that once the data is in the
database.
Since we do not know what goal your analysis has and how many
different questions to want to ask the data it's not entirely clear
whether that would be the optimal approach for your problem. One
variant to the above would be to provide the parsing process a number
of regular expressions with a label attached and label all log entries
during insertion into the database. But since modern relational
databases usually also support full text indexing and regular
expression matches that might also be solved with a view. If your
data volume is large you need to additionally make sure this remains
efficient.
Kind regards
robert
The colleague has developped the Viewer of CSV file with the library
JFreeChart. The csv files are time series ( date are for example in
format YYYY/MM/DD:HH:mm:ss )
I have developped my own parser that translates native logs => csv files.
In java i have used the java regexp patterns.
In a file, we have to find the beginning and the end of an
enregistrement ( it can be a multi-lines enregistrement). I can
exclude/include enregistrements with java regexp patterns.
We have to match the pattern of the date ( regexp and java dateFormat
pattern).
For every enregistrement, we can extract usefull values by pattern
matching ( I use a two passes matching to simplify the patterns) the
values can be bound to a filter ( http URL for example)
All this is embedded in swing components.
I can parse acces logs ( Apache, tomcat, weblogic), log4J logs, Verbse
GC of JVM ( IBM JVM, Open JDK 7 ..), java Threads dumps, hibernate sql
logs, Tuxedo logs and more generally all implicit or explicit dated
enregistrements.
That are the main ways ...
I take me a long time, an still in developpement ... but we have not
found any other tool.
It helps if you have a logging strategy that mandates a consistent logging
format, specific information in particular positions or marked by particular
markup, logging levels and other such so that your analysis tool isn't faced
with a completely open-ended input. What you describe requires a general
text-analysis approach, as you indicate that you can make no guarantees about
the format. Based on that, your best tool is "less" or equivalent text-file
reader.
What is a tool supposed to do, read your mind?
It's really hard to extract information from a garbage can where people just
randomly dumped whatever they individually felt like dumping without regard
for operational needs. You can't build a skyscraper on a bad foundation, and
you can't build a good log analysis off a crappy log.
Fix the logging system, then the analysis problem will be tractable.
--
Lew
Honi soit qui mal y pense.
http://upload.wikimedia.org/wikipedia/commons/c/cf/Friz.jpg
I would use Perl, and begin by recognizing some of the more important
formats, such as thread dumps. I agree with the desirability of
introducing some organized formatting into the log messages, but an
ad-hoc Perl program can often get useful data out of a disorganized log.
Patricia
You could then search by Type, ( WARNING, SEVERE, etc... )
You could search a range of times
It could handle multiple log files into one run
could Sync on an event and stop analyzing on another trigger
Graphs to count trends, events, exceptions
Used Reg-Ex a heck of a lot of work.. Sorted all the transactions in
the logs, so you could also display by package name, really helped
me solve a lot of problems when I was working .. took me nearly two
years to complete everything to where it is today..
I never found a package that even came close to it.. which is why I
wrote it
--
--------------------------------- --- -- -
Posted with NewsLeecher v5.0 Beta 1
Web @ http://www.newsleecher.com/?usenet
------------------- ----- ---- -- -
"open sourcing"
> interest. I parsed the logs into a db, and assigned id's to the
"DB, "IDs" (no greengrocer's apostrophe, and "id" is a different word from
"ID", although the meaning you imputed by the substitution is poetic and
interesting)
> various fields.
>
> You could then search by Type, ( WARNING, SEVERE, etc... )
"type"
> You could search a range of times
> It could handle multiple log files into one run
> could Sync on an event and stop analyzing on another trigger
"synch"
> Graphs to count trends, events, exceptions
"graphs"
> Used Reg-Ex a heck of a lot of work.. Sorted all the transactions in
"used" "regex"
> the logs, so you could also display by package name, really helped
> me solve a lot of problems when I was working .. took me nearly two
> years to complete everything to where it is today..
Double-dot, or two consecutive periods, is not legitimate punctuation in lieu
of a comma or full stop.
> I never found a package that even came close to it.. which is why I
> wrote it
You have made an important and useful point. Covering for a bad log format is
a freeform-text parsing problem, inherently difficult and heuristic and
probably never perfect. I wonder if your effort would have been better spent
converting to a log format that is parser-friendly, as the OP should do.
Others are more simple :
- acces logs that are Common Log Format ( CLF) or CLF extended compliant
( Apache, Tomcat, IIS, WebLogic, Websphere ...)
- Log4J
Who's Lee?
--
Lew
The first-person singular pronoun in English is spelled "I", not "i". It's
only one letter long, so it should be possible to spell it correctly. This is
one of the first lessons in an EFL course, so it should come as no surprise.
You're Lee now, Lee.
I would argue around the same lines.
I've been faced a while ago with a situation where some orthogonal
organisational unit wanted to exploit my logs. I told them to GTFO.
My logs are my logs. I put in it what I consider necessary. I often
improve them as I step through the code. I might change the message, fix
the level, &c. I don't want to have them set in stone. Neither do I
generally have enough confidence in them to allow them to be used for
analysis.
"The solution, then, is simple", I told them, "spec out the exact
messages and arguments you want, and the exact situations you want them
logged in, and I'll add them for you. But leave me my precious debugging
logs."
Let me emphasize: IMHO debugging logs and logs for analysis are two
different things and should be kept strictly separated -- possibly
logged to a different target respectively.
--
DF.
An escaped convict once said to me:
"Alcatraz is the place to be"
Did you mean to say "Bruce"?
;-)
robert
--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/
Only that Perl is so awful - YMMV of course. But for these kinds of
tasks (more correctly: for *any* task) I very much prefer to use Ruby
because of its cleaner OO and cleaner syntax. In these cases where the
basic format is fixed I place general parsing code in a library (a
single file really) and then I can write ad hoc scripts which do
arbitrary processing of the data. That's very productive.
Kind regards
That last is rather a brilliant idea, to use different targets. Heretofore
I've espoused that logs are primarily an operations tool, not a debugging
tool, although in service of the former they inevitably and inherently must
support the former. The problem I've always seen is that logging statements
are left up to the programmer, and not specified for the project.
Wait. I thought Lee was a Yank, not an Aussie? :)
Just call me Lew Lee. Then you can sing, "Lew Lee. Lew, lay, thou little
tiny child. Bye-bye, Lew Lee. Lew, lay!"
I do the same, but use gawk rather than Perl: I have the same objections
to Perl as you, while gawk is pretty straight forward if you understand
regexes and can write C.
So far, using gawk to extract the information I've needed from Linux
system logs has been rather straight forward. Besides, I generally find
gawk to be more concise and readable than Perl, for this type of job,
anyway.
--
martin@ | Martin Gregorie
gregorie. | Essex, UK
org |
I'd call it (what I described): audit logging. I don't know if the
meaning of that term normally extends beyond databases, but I don't see
why it shouldn't.
Aye!
Which reminds me of a guy who called himself "LL Cool J". ;-)
Associative memory...
Wow, I feel lucky. I was so afraid I'd be put in Coventry for that pun.
The solution was pretty simple, and it's dynamic. I don't propose to get
into a logging framework war, but in this case we saw that JUL wouldn't
cut it, but log4j would do the trick. We had to do some arcane app
server-related stuff for JMX and log4j.xml, also integrate exception
handling with various "global" handlers that could also log, and wrap
log4j calls with a plethora of methods that would result in messages
formatted to our liking, but after that the heavy lifting was and is
done: it's now up to the clients - *not* to the developers - to request
what gets logged and in what manner.
Developers of course are clients themselves.
Again, not to get into a logging framework war, but for these purposes
log4j brings a lot to the table. It's common to need logging on specific
Java packages to be at a certain level, for output of that specific
logging to go to a specific target (like its own file) and have its own
storage policy, and for that logging to not be (or be, as the case
demands) to be additive to parent logging. Being able to do this is a
minimum for supporting different clients.
We also added, as part of our log4j method wrappers, an extra field for
all log messages that characterizes a "functional category". This allows
decorating all messages with information as to the identity of a
functional subsystem, and is helpful to post-processing tools like Splunk.
This system has been in production now for about 4 months, and
operational support staff and other clients are very pleased with it.
It's not perfect, because not all the log statements exist in the code
to support every informational requirement (known or unknown), but the
framework is not a problem.
One sidenote: despite doing everything I describe above, you can still
end up with logs that are difficult to interpret, and more log
statements aren't necessarily the answer. This typically happens when
your code itself is a spaghetti tangle. Sometimes to fix a logging
problem you really do need to refactor your logged code.
AHS
I tend to use at least two logging streams: debugging and operational. I
leave debugging statements in production code: its normally off (of
course) but can be turned on if needed. Operational debugging includes
informational and error messages to be used by sysadmins which are always
enabled and should be fairly infrequent as well as performance
measurement messages. The latter can be configured on or off. As others
have said, the messages need to be designed with both log stream
selection and ease of parsing for later analysis in mind.
In a C application for a *NIX OS its easiest to send all these messages
to the system logger and let it deal with creating separate logs for the
various message streams: its then trivial to use 'tail' to present the
operational stream to sysadmins.
If the application is written in a language that doesn't provide easy
access to the system logger or is run on an OS that doesn't have one, I'd
include a custom logging process as part of the application.
There is one caveat to leaving debug logging in production code; it may
affect performance. Even with output disabled the string arguments are
still constructed, unless they are constant. If logging is located in a
tight loop, or critical section, it might become significant.
In one particular application I profiled I found this to be eating up
80% of the CPU time. Admittedly, this was a rather special case, but
it's still something which ought to be borne in mind if you leave debug
logging code in production software.
--
Nigel Wade
> On 23/05/11 23:25, Martin Gregorie wrote:
>> On Mon, 23 May 2011 15:02:23 -0400, Lew wrote:
>>
>>> On 05/23/2011 01:16 PM, Daniele Futtorovic wrote:
>>>> Let me emphasize: IMHO debugging logs and logs for analysis are two
>>>> different things and should be kept strictly separated -- possibly
>>>> logged to a different target respectively.
>>>
>>> That last is rather a brilliant idea, to use different targets.
>>> Heretofore I've espoused that logs are primarily an operations tool,
>>> not a debugging tool, although in service of the former they
>>> inevitably and inherently must support the former. The problem I've
>>> always seen is that logging statements are left up to the programmer,
>>> and not specified for the project.
>>>
>> I tend to use at least two logging streams: debugging and operational.
>> I leave debugging statements in production code: its normally off (of
>> course) but can be turned on if needed.
>
> There is one caveat to leaving debug logging in production code; it may
> affect performance. Even with output disabled the string arguments are
> still constructed, unless they are constant. If logging is located in a
> tight loop, or critical section, it might become significant.
>
Fair point. However, my usual debugging statement takes the form:
if (debug > 0)
debugger.trace(result + " = method(" + arg + ")");
Ugly, I know, but quite efficient, since when debugging is off even the
cost of the method call cost is. I use an integer to control debugging
rather than a boolean so I can control its volume: "java Application -dd"
would be expected to provide more detailed debugging output than "java
Application -d"
> In one particular application I profiled I found this to be eating up
> 80% of the CPU time. Admittedly, this was a rather special case, but
> it's still something which ought to be borne in mind if you leave debug
> logging code in production software.
>
Agreed.
On occasion I've used a circular buffer to accumulate tracing information
which gets only dumped if an exception occurs. This is great if you're
chasing rarely occurring problems in a high volume, long running process
since it obviates searching through megabytes of tracing info to find
20-30 lines of relevant tracing. You'd expect this too to be quite
expensive to run, so I'd use the same mechanism outlined earlier to
ensure that tracing set off means that circular buffer is never filled
rather than merely suppressing the buffer dump operation.
That said, I used to administer an OS (ICL's George 3) that continuously
traced its internal operations to a fine-grained and a coarse-grained
circular buffer which could be dumped after a crash and still managed to
run fast enough to be a very usable OS.
--
martin@ | Martin Gregorie
gregorie. | Essex, UK-
org |
What's wrong with
if ( logger.isDebugEnabled() ) ...
or
if ( logger.getLevel().isGreaterOrEqual( Logger.DEBUG )) ...
?
grep.
For particularly thorny bits of analysis, sed.
There are those who advocate awk for this sort of thing, but frankly that
seems like overkill.
tom
--
shouting drunkenly about 6502 assembler at parties
> Martin Gregorie wrote:
>> Fair point. However, my usual debugging statement takes the form:
>>
>> if (debug> 0)
>> debugger.trace(result + " = method(" + arg + ")");
>>
>> Ugly, I know, but quite efficient, since when debugging is off even the
>> cost of the method call cost is. I use an integer to control debugging
>> rather than a boolean so I can control its volume: "java Application
>> -dd" would be expected to provide more detailed debugging output than
>> "java Application -d"
>>
>>
> What's wrong with
>
> if ( logger.isDebugEnabled() ) ...
>
> or
>
> if ( logger.getLevel().isGreaterOrEqual( Logger.DEBUG )) ...
> ?
>
Not a lot, though that does involve method call overheads that you may
not want in a tight loop.
I initially created my simple ReportError logging class for Java 1.3,
before the Logger class was introduced, and, as it does what I need, have
seen no need to replace it. It was written to provide similar Java
functionality to a set of C functions that I'd been using for many years.
Its trace() methods are roughly equivalent to Logger.log(Level, String)
and Logger.log(Level, String, Object[]) with all output sent to stderr.
--
martin@ | Martin Gregorie
gregorie. | Essex, UK
org |
> There are those who advocate awk for this sort of thing, but frankly
> that seems like overkill.
>
Depends on your background, I suppose.
I was using FIND 2, NCC Filetab and (shudder) RPG3 long before I met awk,
so I was familiar with the sort of scriptable file processing utilities
that have a built-in record reading loop and a means of triggering
actions that match particular record types. As a result I felt right at
home with awk, which is about the easiest to use of all the ones I've
tried.
Besides, I very often find I need to calculate maxima, minima, averages,
etc as part of the log analysis and find that very easy to do with awk.
>> What's wrong with
>>
>> if ( logger.isDebugEnabled() ) ...
>>
>> or
>>
>> if ( logger.getLevel().isGreaterOrEqual( Logger.DEBUG )) ...
>> ?
> Not a lot, though that does involve method call overheads that you may
> not want in a tight loop.
Seems like it's a simple equality test that would be HotSpotted away.
> I initially created my simple ReportError logging class for Java 1.3,
> before the Logger class was introduced, and, as it does what I need, have
> seen no need to replace it. It was written to provide similar Java
> functionality to a set of C functions that I'd been using for many years.
>
> Its trace() methods are roughly equivalent to Logger.log(Level, String)
> and Logger.log(Level, String, Object[]) with all output sent to stderr.
--
Use a print-eff'ing logger and you'll have next to no overhead. If
evaluating the arguments is too heavy, use the ugly
if( logger.isXXXEnabled() ){}
jump.
> Martin Gregorie wrote:
> > Lew wrote:
> >> Martin Gregorie wrote:
> >>> Fair point. However, my usual debugging statement takes the form:
> >>>
> >>> if (debug> 0)
> >>> debugger.trace(result + " = method(" + arg + ")");
> >>>
> >>> Ugly, I know, but quite efficient, since when debugging is off even the
> >>> cost of the method call cost is. I use an integer to control debugging
> >>> rather than a boolean so I can control its volume: "java Application
> >>> -dd" would be expected to provide more detailed debugging output than
> >>> "java Application -d"
>
> >> What's wrong with
> >>
> >> if ( logger.isDebugEnabled() ) ...
> >>
> >> or
> >>
> >> if ( logger.getLevel().isGreaterOrEqual( Logger.DEBUG )) ...
> >> ?
>
> > Not a lot, though that does involve method call overheads that you may
> > not want in a tight loop.
>
> Seems like it's a simple equality test that would be HotSpotted away.
My understanding of the log4j and java.util.Logger classes is that each
logger exists within a hierarchy of classes, and that to determine
whether or not a specific logger call will produce output, the library
must test each level, starting at the specific class and traversing
upwards in the hierarchy.
So using a static variable as below to enable and disable logging
output can save some execution time:
static final boolean debug = false;
if( debug ) {
logger.debug(...);
}
Whether or not that is significant, depends upon the situation, of
course.
--
Jim Gibson
All this theory indicates what to measure in any real situation, but to echo
your "of course", you won't know what to conclude prior to that measurement.
Now why you'd *ever* want to put a log statement inside a tight,
performance-critical loop in the first place is a whole 'nother question.
In my specific example above I resolved the "problem" by using my own
class which did pretty much exactly as Lew suggested in his follow-up
post. When I subsequently profiled the code the execution time, and CPU
time, required by the remaining method invocations was negligible.
>
> Now why you'd *ever* want to put a log statement inside a tight,
> performance-critical loop in the first place is a whole 'nother question.
>
In my case the loop was reading messages from a network and generating
responses. The logging code was to log the message received, and the
response generated, during development. At this stage it wasn't deemed
to be performance critical, but was essential to verify correct receipt
of the network message.
It wasn't until much later that I was told that the response had to be
returned in less than 40uS (a fact that wasn't in the original
requirements). So I profiled the code to find out where optimisation
ought to be targeted. I still wanted to keep the logging in the code for
debugging purposes, since it was still under active development, but be
able to disable it (and its overhead) for testing.
--
Nigel Wade
Thanks for your answer. Your approach sounds reasonable.
But what exactly do you mean with "wrap log4j calls"?
Couln´t you just use different log4j appenders, one for each client?
The
appenders would then (i) decide whether to log a given message for a
particular
client and (ii) format it accordingly.
Ulrich
Using log4j as an example, if you look at the API for PatternLayout,
you've got a bunch of conversion specifiers, including %m for the
application-supplied message. The wrapping that we did, including those
functional categories I mentioned, is all about formatting the
application-supplied message. This included appropriate formatting of
exception stack traces and so forth.
We certainly also had different log4j conversion specifier combinations
in use for different appenders; the "wrapping" was purely for formatting
the application message, because that is opaque to log4j.
AHS
Yes, sometimes it's useful to maintain state while analyzing logs, and
that's more convenient with awk (or any of a thousand other scripting
languages) than it is with sed.
A couple of months back I tracked down some tricky locking issues
(multiple multi-threaded processes, with multiple components written
by multiple developers in multiple languages) by analyzing lock files
with an awk script; it made it easy to keep track of contention,
reentrant locking, number of waiters, etc on a per-lock basis.
Back in '91 I demonstrated a bug in AIX 3.1 libc's implementation of
free by using awk to turn a malloc/free log into a series of cut-down
C programs that duplicated the allocation access pattern, until I had
a minimal test program that demonstrated the problem.
Come to think of it, just a few weeks ago I was using a little awk
script to churn through the output from my various
sentiment-determination UIMA annotators[1] and compute precision,
recall, and F1 values.
Obviously, I could do any of those tasks in Perl, or Python, or Java,
or COBOL, or what have you. I only use awk because I've been using it
for more than 20 years now and so it has the advantage of familiarity.
[1] ObCLJP: The annotators are written in Java.
--
Michael Wojcik
Micro Focus
Rhetoric & Writing, Michigan State University
It's also used widely in security contexts.
Most of the products I work on have multiple logging targets, for
problem determination (which includes debugging), system history,
auditing, and other operational uses. Sometimes these use different
mechanisms, because they have different requirements - for example,
security audit logs need to be protected, insofar as that's possible,
from tampering by unauthorized users.
> Besides, I very often find I need to calculate maxima, minima, averages,
> etc as part of the log analysis and find that very easy to do with awk.
perl can do everything awk can do, and more, and do it just as concisely.
Assembler can do anything perl or awk can do, and do it much more efficiently,
Lawrence.
Just out of curiosity, Lawrence, what's the percentage for you to continually
try to trigger these Language Wars? What do you get out of it, Lawrence?
Especially since you know as well as everyone here that perl and awk are toy
languages compared to the One True Language, right, Lawrence?
Come on, Lawrence, the gauntlet has been thrown, Lawrence. Let's see what
you've got, Lawrence. The One True Language vs. perl and awk combined in the
tag-team Match of the Century(r) (tm) (k). Are you man enough to take it,
Lawrence?
I advise you to just give up now, Lawrence, and face the fact that the One
True Language is so far and away superior to awk, perl, and yes, I'm sorry,
Lawrence, but even Python - despite your passeltongue fluency, The One True
Language rules, Lawrence, and your favorite languages drool, Lawrence..
So, Lawrence, man up and respond, Lawrence, as the world awaits your wisdom.
Lawrence.