
Re: Can we safely run sp_sysmon for a 24 hour period without causing performance impact to PROD?


David Wein

Apr 10, 2008, 12:30:15 PM

Derek Asirvadem wrote:
>> On 2008-04-10 17:27:05 +1000, David Wein <david...@sybase.com> said:
>>
>> My understanding, based on a conversation with a hardcore performance
>> engineer is that the performance hit comes from a spike in L2 misses
>> when the monitor counters are enabled via sp_sysmon.
>
> Very important information, thank you. That would mean the overhead is
> caused by frequent enable/disable of the counters within a short
> duration, repeatedly; not by sp_sysmon itself; and not by sp_sysmon
> running once for a long duration.

I may have chosen my words poorly. I meant to state that the L2 misses
occur while the counters are enabled - but not necessarily through the
act of enabling or disabling them. Frequent short sysmon runs will
obviously have greater overhead because of the cost of running the
sproc, populating sysmonitors, etc. But the 5th minute of a 10 minute
sysmon shouldn't present greater overhead than the 5th hour of a 10 hour
sysmon. Keep in mind that this is hearsay and I haven't looked at the
monitoring personally, but I can grasp the concept. The monitor
counters tend to have poor locality of reference with respect to the
code that is incrementing them. For the generalized case, it takes a
couple of global variable references to get the pointer to the correct
monitor counter, again out of context from data structures being used in
the regular code, and this causes cachelines to go flying around. This
starts getting pretty low level pretty fast, but another important point
is that when we talk about L2 misses, your overhead is going to vary
significantly from machine to machine. Hypothesizing here...you may see
less of an impact on machines with larger L2 and L3 caches.

Interestingly enough, lots of the MDA counters have much greater
locality of reference than the traditional monitor counters (although
Peppler is correct that in some cases they are one and the same). This
is because many of the MDA counters are embedded in the data structures
that the code flow is already dealing with, so accessing the counter is
just an offset from some address already in a register.

I could go on about the possibilities of cache invalidation due to
engines sharing MDA counters vs. monitor counters but I have rambled
long enough without providing anything concrete...I'll post my comments
though just to highlight some of the variables that need to be
considered if you go down this low.

>
> It would be interesting to examine (for dishonesty and exceeded scope)
> the other representations the MDA camp engage in.
>
> Is it too much to ask, to get obsessive responders (not you Dave, but
> the MDA camp who prickle at the slightest mention of an alternative to
> MDA and treat the newsgroup like their personal blogsite) to keep within
> the context of the thread ?

Actually I am a big fan of MDA tables as well. I don't dispute your
overall thesis that it is better to solve issues at a high level, but
MDA tables can answer a lot of questions that were previously
unanswerable. To me the key to all of this is something that anybody
who has been playing this systems game for any period of time knows: there is
no one size fits all approach to monitoring and troubleshooting. You
have to match the tool to the problem. Sometimes it is sysmon,
sometimes it is MDA, sometimes it is something else like oprofile,
sybmon, monitor server, vtune, iostat, etc. In all fairness, I think
everybody on these forums knows this.

Like it or not, you are going to be seeing more of the MDA tables. As
you know they are now installed by default with installmaster. Some of
the newer tables are for general information reporting and not truly
monitoring / diagnostic, i.e. they don't represent performance counters
nor do they need to be configured. When I built the workload manager
for Cluster Edition I chose to expose the state of that subsystem via a
set of MDA tables. It was the cleanest and most flexible way for me to
expose details from an internal data structure. These tables are then
used by the sp_cluster stored procedure and the Sybase Central plug-in,
and because they are SQL tables it is easy to write your own front end,
such as the monitor app in the cluster demo. The runtime overhead of
such tables is nil.

-Dave

Mark A. Parsons

Apr 14, 2008, 9:16:35 PM
Besides the overhead that David Wein has mentioned, there is a (relatively) small performance hit at the end of a
sp_sysmon session, when counters are downloaded from dataserver memory into the sysmonitors table and the sysmonitors
table is then processed to generate the sp_sysmon output report.

NOTE: Try enabling 'set statistics io on' just prior to running a brief sp_sysmon invocation (eg, 'sp_sysmon
"00:05:00"') and see what kind of logical IO (aka cache hit) numbers we're talking about.

Whether or not you find sp_sysmon causing a performance hit on your server will really depend on your dataserver ...
number of engines, typical load by all other processes, amount of free resources on the hardware, etc.

So, other than the above-mentioned items there's not that much of a performance hit on the dataserver.

NOTE: I don't consider a 1-5% load on a single dataserver engine to be 'too much' when compared to the information it
provides me for a) P&T work and b) historical research.

NOTE: I especially don't consider sp_sysmon a big load when I'm running with 4/8/16/or-more dataserver engines; overall
light processing load means plenty of resources for sp_sysmon; overall heavy processing load means I probably need the
sp_sysmon output to assist with P&T troubleshooting and analysis.

-----------------------

Most DBAs I know will set up a process to run sp_sysmon 24x7, usually on a 10-minute, 15-minute, or 30-minute basis.

While a longer timeframe provides for fewer output files to look at, the longer timeframes tend to smooth out the peaks
and valleys that can point at intermittent application problems and/or dataserver tuning opportunities.

With the shorter timeframes (eg, 10/15 minutes) you still have the information necessary to make dataserver
configuration changes (OK, so you have to look at more files, so write a shell script to parse the files for you), but
where the shorter timeframe runs come in real handy is for point-in-time P&T troubleshooting and analysis.
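The parsing script mentioned above can be as simple as an awk one-liner per metric. A minimal sketch follows; the directory, file names, and the exact "Engine Busy Utilization" line layout are assumptions here (sp_sysmon report layout varies by ASE version, so adjust the pattern to your output):

```shell
# Demo input: two fake sp_sysmon report fragments (real reports are much
# larger; only the line we search for matters for this sketch).
mkdir -p /tmp/sysmon_demo
printf '  Engine Busy Utilization        85.2 %%\n' > /tmp/sysmon_demo/sysmon_0310.out
printf '  Engine Busy Utilization        12.4 %%\n' > /tmp/sysmon_demo/sysmon_0320.out

# Emit one summary line per file: file name plus the busy percentage,
# giving a quick time series across the 10/15-minute report files.
for f in /tmp/sysmon_demo/*.out; do
    busy=$(awk '/Engine Busy Utilization/ {print $(NF-1)}' "$f")
    echo "$(basename "$f") $busy"
done
```

Pointing the same loop at a day's worth of report files gives you the peaks and valleys at a glance, without opening each file.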

I routinely use the results from the shorter timeframes to help piece together a more complete picture of dataserver
activity during periods of heavy/problematic activity. (In most cases this type of activity is usually centered around
smallish spikes in time, hence the need for short-lived sp_sysmon sessions which can help correlate dataserver
peaks/valleys with the desired activity period.)

I say 'help piece together' because sp_sysmon is just one tool I use for P&T work. Other tools may include
monserver/histserver, some dbcc commands, home-grown queries against system tables (eg, sysprocesses, syslogshold,
systransactions), MDA tables, Sybase auditing, and OS monitoring tools.

Someone sent an email about a batch job that ran long last night from 03:15 to 04:20? Take a gander at the sp_sysmon
output files for the timeframe in question.

Someone just called about a recent performance-related issue? Grab a copy of the most recent sp_sysmon output file.

Having a performance-related issue with a dataserver *right now*? The next set of sp_sysmon output is just a few
minutes away (as opposed to, possibly, several hours away for a long-running sp_sysmon session).

Someone ran 'sp_sysmon/clear' during the day and screwed up your multi-hour session? Darn, there goes several hours of
monitoring data; but with the shorter timeframes you stand to lose a smaller amount of monitoring data. [Obviously (?)
some of this can be addressed by making sure everyone always runs sp_sysmon/noclear.]

-----------------------

If you don't use sp_sysmon to assist in query- and/or process-related P&T work, no problem, go ahead and run the
multi-hour sessions.

If you plan to use sp_sysmon (in conjunction with other monitoring tools) to assist with query- and/or process-related
P&T work, consider running sp_sysmon at shorter intervals.


Susan Mills wrote:
> Can we safely run sp_sysmon for a 24 hour period without
> causing a performance impact on PROD?
>
> What is the best strategy to capture a single day of
> processing?
>
> Thank you.

Derek Asirvadem

Apr 21, 2008, 12:20:16 AM
Susan

In answering your question

> What is the best strategy to capture a single day of processing?

It may well appear that the gurus have quite different responses. For
purposes of understanding, it needs to be appreciated that they have
different approaches, and a set of tools which are relevant/consistent
with their approaches.

Mark has written a fine article re a strategy of using sp_sysmon during the day

> On 2008-04-15 11:16:35 +1000, "Mark A. Parsons"
> <iron_horse@no_spamola.compuserve.com>

which provides a good idea of his approach/set of tools, and why that
method of running sp_sysmon is relevant. In effect he is using
sp_sysmon for a certain form of diagnosis/what-happened.

I have the same needs as Mark or any other DBA, I meet them in a
different way. It also needs to be identified that I set up servers
and the relevant infrastructure for production DBAs, but that plus
assistance when I am on-site and our "walk-in" service, is second to my
primary function of providing HP databases and P&T assignments.
That latter is fixed price/fixed result agreed before the fact, so I
walk in; install scripts that collect what I need; and start work. I
have a scripted structured approach, and it does not serve a FP
assignment to spend time gathering info or deciding what is worth
(capturing and) analysing. The different approaches are not "right" or
"wrong", they need to be understood. If the DBAs want more than what
I have left for them, they are free to do anything they want.

To explain my approach (i.e. the relevance of my suggestion to run
sp_sysmon once for the business day):
• sp_sysmon is my fundamental monitoring vis-a-vis server config,
changes, impact; next iteration
• Bradmark Surveillance (used to be DBGeneral) is my visual
snapshot/what-is-going-on-now/what-just-happened monitoring tool, there
are others from good-looking rubbish to ordinary-looking good stuff.
But again this is monitoring, not diagnosis.
• I do not use sp_sysmon for diagnosis
• sysaudits is very thorough and accurate (requires a bit of initial
set up, thresholds, cycling; banks and finance need it anyway so it is
usually already there and sometimes done properly) (same as MDA, you do
not want to capture all SQL Text unless what is running on the server
is completely unknown; but that really is a different problem). Many
basic spike- and process-related questions can be answered here.
• for after-the fact and analytical diagnosis on production servers, to
go hand-in-hand with sysaudits, and often the first resort, I use a
cron script that runs every 30 sec to capture sysprocesses,
systransactions, syslocks (grouped!). These scripts can easily be
changed to filter out the good guys and get more info on the bad guys,
or to produce different files when (eg) there are more than 10 blocking
locks on a certain table.

Therefore I do not end up with many files to correlate, and no parsing.
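The 30-second capture loop described above might be sketched as follows. This is a hypothetical outline, not the actual script: the server name, login, and capture.sql file are placeholders, and the isql call is commented out so the sketch runs without a dataserver:

```shell
#!/bin/sh
# Hypothetical 30-second capture loop for sysprocesses / systransactions /
# syslocks snapshots.  SERVER, the login, and capture.sql are placeholders.
SERVER=PROD
OUTDIR=/tmp/capture_demo
mkdir -p "$OUTDIR"

snapshot() {
    stamp=$(date +%Y%m%d_%H%M%S)
    # Real use (capture.sql would select from master..sysprocesses,
    # master..systransactions, and master..syslocks grouped by table):
    # isql -S "$SERVER" -U sa -i capture.sql -o "$OUTDIR/cap_$stamp.txt"
    echo "placeholder snapshot at $stamp" > "$OUTDIR/cap_$stamp.txt"
}

# Real use: while true; do snapshot; sleep 30; done   (or drive it from cron)
snapshot
```

The filtering described above (ignore the good guys, expand detail on the bad guys, switch output files when blocking-lock counts cross a threshold) would live in capture.sql and in a few lines of post-processing around the snapshot function.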

For query-related analytics I use a top-down approach, not a bottom-up
one, therefore I do not have to use MDA or sp_sysmon to figure out what
a query/process is doing. I am in the HP end of things, so I do not
advocate MDA tables or the bottom-up approach that goes with them.
However if you are using them for diagnosis, you really do not need
sp_sysmon as well. I often have to use MDA at cust sites to assist
them, but not of my own accord, or as part of a fixed price P&T job. I
parse SQL text and predict its resource use and impact, as opposed to
reacting to resource use problems and trying to find out what SQL is
killing it.

Testing (as opposed to production) is a different animal. Running
sp_sysmon for a 5 min duration while a (known, test) query is being
executed is normal.

Likewise o/s monitoring, which is another entire chapter, and this is a
Sybase ASE newsgroup.
--
Cheers
Derek
Senior Sybase DBA / Information Architect
Copyright © 2008 Software Gems Pty Ltd
Quality Standards = Zero Maintenance + Zero Surprises
Performance Standards = Predictability + Scaleability

Derek Asirvadem

Apr 21, 2008, 2:01:30 AM
Thanks for your response.

We have left Susan's questions behind, and we are having a discussion
about MDA, maybe we should open a new thread, or post this on the MDA
newsgroup.

> On 2008-04-11 02:30:15 +1000, David Wein <david...@sybase.com> said:

> I may have chosen my words poorly. I meant to state that the L2 misses
> occur while the counters are enabled - but not necessarily through the
> act of enabling or disabling them. Frequent short sysmon runs will
> obviously have greater overhead because of the cost of running the
> sproc, populating sysmonitors, etc. But the 5th minute of a 10 minute
> sysmon shouldn't present greater overhead than the 5th hour of a 10
> hour sysmon. Keep in mind that this is hearsay and I haven't looked at
> the monitoring personally, but I can grasp the concept. The monitor
> counters tend to have poor locality of reference with respect to the
> code that is incrementing them. For the generalized case, it takes a
> couple of global variable references to get the pointer to the correct
> monitor counter, again out of context from data structures being used
> in the regular code, and this causes cachelines to go flying around.
> This starts getting pretty low level pretty fast,

We are getting pretty low level here. Certainly I am finding that some
of the trees in one of the gullies have very big leaves, but in the
main, I do not think it changes the landscape. In any case, the
overhead IS for multiple executions of sp_sysmon and for short
durations (and that is not reasonably called monitoring).

> but another important point is that when we talk about L2 misses, your
> overhead is going to vary significantly from machine to machine.
> Hypothesizing here...you may see less of an impact on machines with
> larger L2 and L3 caches.

I agree, and I do not think that is a hypothesis; architected features
like that are exactly why we pay more for a true mid-range server than
for a Piece of Cr@p pretending to be a server. The list of overheads
realised on one of those, that is absent on a true server, is
practically endless. Think truly integrated I/O subsystem; Fireplane;
1 MB on-chip L2 cache.

Explains why it doesn't show up on my radar.

A more accurate statement would be ALL code (not just sp_sysmon) has
more overhead on a shoebox (whether Linux or Windoze), than on a real
server. If we are going to draw comparisons ... we cannot use the
worst case of sp_sysmon run frequently against the best case of MDA.

> Interestingly enough, lots of the MDA counters have much greater
> locality of reference than the traditional monitor counters (although
> Peppler is correct that in some cases they are one and the same). This
> is because many of the MDA counters are embedded in the data structures
> that the code flow is already dealing with, so accessing the counter is
> just an offset from some address already in a register.

Of course, newer code.

> Actually I am a big fan of MDA tables as well. I don't dispute your
> overall thesis that it is better to solve issues at a high level,

Yeah, the top-down approach is about two principles.

1 Law of diminishing returns. With accurate understanding of the
considerations, you can identify (and thus fix) 80% of the problems
with 20% of the effort and avoid death and destruction. The bottom-up
approach is squarely at the wrong end of that spectrum. And distinctly
after the fact, post-mortem style, examination of the dead body.

2 Use the right tool for the job (as you identify yourself), but the
top-down tools are temporal, they are not large codelines or masses of
data on the server which is the fact of life for the bottom-up.

> but MDA tables can answer a lot of questions that were previously
> unanswerable. To me the key to all of this is something that anybody
> who has been playing this systems game for any period of time knows: there
> is no one size fits all approach to monitoring and troubleshooting.
> You have to match the tool to the problem. Sometimes it is sysmon,
> sometimes it is MDA, sometimes it is something else like oprofile,
> sybmon, monitor server, vtune, iostat, etc. In all fairness, I think
> everybody on these forums knows this.

That is the most sane thing anyone from the MDA crowd has said. I
agree 100%. Ignorance of this principle commonly causes many problems
in both Db design and app code.

It is the context that is different. When you do not know what is
going on in your server, no argument, MDA is a substantial help, in
fact ANYTHING is a substantial help. You need a trailer to tow your
toolkit to the job, a month to set it up and debug it, sometimes
another machine for capturing and storing the carcase in order to
perform the post-mortem. In any case it works on the symptom and not
the cause (people writing bad code; bad code being allowed on the
server) which is a never-ending job.

When you DO know what is going on in your server, all that is
unnecessary, your toolkit fits into the pocket of your brief case, you
do the analysis, reconfiguration, and report in one week. Fixing the
cause (which takes education) means the symptom stops.

The two problems I have (both in this thread and others), are
• that some people further the notion that MDA is a cure for
everything. It is not; it is a diagnostic tool only (in the sense of
your post, where it is valid). Often a microscope is the wrong tool
for the job; you are better off with the naked eye at some distance.
This has placed the MDA issue in the realm of religion (proponents are
easily offended by perceived attacks and take it very personally; they
defend themselves as if their deepest beliefs are being attacked; other
people are quickly categorised as "for" or "against" them, bush-style).
When we advise MDA as a cure-all we are not serving them.
• Ok, it provides monitoring as well, but there are less onerous
monitoring options, with less overhead.
• there is an awful lot of activity related to MDA, but no
reasonable result. Have a look at my post of 08 Feb 08 responding to
Jeff Tallman, on the "Getting 15.x up to speed" topic on the p+t
newsgroup.
• the notion that I am "for" or "against" MDA. I am for using the
right tool for the job; for addressing the cause not the symptom; for
results; for providing accurate expectations, and I am NOT for spending
heaps of resources (system and labour) on an activity that does
not produce tangible results.

I have nothing against people who like MDA; use them for specific
purposes; get what they need; and who do not recommend them as the cure
for every disease. MDA is completely irrelevant to Susan at this stage.

> Like it or not, you are going to be seeing more of the MDA tables.

No problem there, I am getting used to it. I have to assist at cust
sites and debug joins and persistence issues often enough to keep
abreast.

> As you know they are now installed by default with installmaster. Some
> of the newer tables are for general information reporting and not truly
> monitoring / diagnostic, i.e. they don't represent performance counters
> nor do they need to be configured. When I built the workload manager
> for Cluster Edition I chose to expose the state of that subsystem via a
> set of MDA tables. It was the cleanest and most flexible way for me to
> expose details from an internal data structure. These tables are then
> used by the sp_cluster stored procedure and the Sybase Central plug-in,
> and because they are SQL tables it is easy to write your own front end,
> such as the monitor app in the cluster demo. The runtime overhead of
> such tables is nil.

Look, that is brilliant. Why did you choose that, rather than a
virtual system table (ala sysprocesses) which would provide all the
above benefits ?

----------

Future/MDA

Since MDA is no longer in question, Sybase must implement & support
it properly and formalise it:
• GET IT OFF master
• put it in its own db (which is demanded anyway), perhaps systemdb
• identify the persistence requirements, and have all possible required
tables (empty), provided in scripts, with the warning to implement them
in a separate persistent user Db
• give them their own tempdb and named cache
• implement all MDA table I/O on a nominal pseudo device (eg
MDA_device). This has two purposes: eliminate the event of these stats
changing/colouring other stats; and identify all mda stats in one
place. I realise this is really pseudo I/O not real I/O since most of
the tables are virtual.
• provide sample SQL in the vein of pubs, in order to eliminate the
common join and persistence errors that I spend half my life fixing

Future/sp_sysmon

• Any monitoring/counter information provided in MDA that is not in
sp_sysmon should be provided in sp_sysmon (I am NOT suggesting that the
two be made the same)
• remove the "hints", they confuse the beginners, and the experienced
ones do not need them.

David Wein

Apr 21, 2008, 10:44:57 AM
Derek Asirvadem wrote:

>> As you know they are now installed by default with installmaster.
>> Some of the newer tables are for general information reporting and not
>> truly monitoring / diagnostic, i.e. they don't represent performance
>> counters nor do they need to be configured. When I built the workload
>> manager for Cluster Edition I chose to expose the state of that
>> subsystem via a set of MDA tables. It was the cleanest and most
>> flexible way for me to expose details from an internal data
>> structure. These tables are then used by the sp_cluster stored
>> procedure and the Sybase Central plug-in, and because they are SQL
>> tables it is easy to write your own front end, such as the monitor app
>> in the cluster demo. The runtime overhead of such tables is nil.
>
> Look, that is brilliant. Why did you choose that, rather than a virtual
> system table (ala sysprocesses) which would provide all the above
> benefits ?
>

You're right - we are now way off topic :-)

Fake tables have some disadvantages compared to MDA tables, both from
the Sybase side and from the user side.

From the Sybase side: The schema for a system table (including a fake
table) is baked into the C code. There is a C structure that needs to
look just like the row for the table. Changes to system tables require
a database upgrade and support for downgrade is difficult.

For MDA tables, it is also true that there is a C struct that needs to
match the row of the MDA table, but the DDL for the table itself isn't
baked into the ASE version, and changes to the schema don't have upgrade
/ downgrade implications.

From the user side - and this is important - there is a fundamental
difference in the way these tables handle search arguments. Consider:

select * from sysprocesses where spid = @@spid

In this case ASE will materialize a worktable containing the
sysprocesses rows for all processes in the server - potentially
thousands. Once all the rows are generated the filter will be applied
and all but the matching rows will be discarded.

Now consider

select * from monProcess where SPID = @@spid

In this case we will materialize only the single row that matches the
sarg. This handy feature is due to the RPC mechanism used to implement
MDA tables and can result in significantly greater efficiency vs.
traditional fake tables. This only applies to equi-sargs on columns
listed in monTableParameters.

-Dave

Derek Asirvadem

Apr 22, 2008, 12:43:07 AM
> On 2008-04-22 00:44:57 +1000, David Wein <david...@sybase.com> said:

> Fake tables have some disadvantages compared to MDA tables, both from

> the Sybase side and from the user side ...

I do not have a problem with the upgrade demands of sys tables.

Downgrades are not worthy of reasonable consideration or of the same
consideration as upgrades. (The demanded [prior] full o/s and Db
backup is the position to revert to, unless one has the luxury of
wasting time and fiddling. Downgrades in production are only the realm
of newbies who have not tested their upgrades.)

In any case, the engineering consideration here should NOT be ease of
implementation/maintenance/access of sys tables vs MDA tables; it
should be the consideration of whether the data is "system/catalogue"
that ASE uses/relies upon or "monitoring/diagnostic". Additionally,
although the state/content is transient, the existence of your cluster
support tables would be real rather than fake. The long term
implications must be pondered carefully.

> In this case ASE will materialize a worktable containing the
> sysprocesses rows for all processes in the server - potentially

> thousands ...
>
> vs
>
> ... In this case we will materialize only the single row that matches
> the sarg ...

That may be a fair example of an instance where MDA tables are more
efficient. However, that is not a reasonable indication of the overall
"efficiency" of MDA tables over sys tables; for each example where MDA
is "faster" there are ten examples where it is "slower". The overall
consideration is that use of MDA creates a massive or gigantic
(depending on what you are collecting) amount of data, which needs to
be collected AND made persistent (unless you want to suffer failed
joins which render them useless). Which means a large separate Db and a
collection mechanism. The fact that the hair on the elephant is more
efficient than the hair on the greyhound, is not relevant in the
consideration of the whole animal, its capability, its intended use,
and its requirements for care and feeding.

Plus the fact that DBAs starting out with MDA create more rather than
less data; due to innocence or an attempt to keep the joins simple,
they end up with a fair amount of duplication. I find
the key to managing the elephant is to implement a two-stage
collection: the first is demanded for persistence purposes (and here
you should collect some non-pipe tables that apparently do not require
persistence, but actually do, in order to improve join efficiency
later); the second has to filter, de-dupe and then store the data in a
permanent collection database.
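The second, filter/de-dupe stage can be sketched at the file level like this (the paths and one-row-per-line layout are assumptions for the sake of a runnable demo; in practice this stage would be SQL against the first-stage tables):

```shell
# Hypothetical stage-two filter: de-duplicate raw first-stage capture rows
# before loading them into the permanent collection database.
mkdir -p /tmp/mda_raw /tmp/mda_clean
printf 'spid=12 cmd=SELECT\nspid=12 cmd=SELECT\nspid=40 cmd=INSERT\n' \
    > /tmp/mda_raw/batch1.txt

# sort -u collapses exact duplicates; a real filter would also drop the
# "good guys" and keep only the rows of interest.
sort -u /tmp/mda_raw/batch1.txt > /tmp/mda_clean/batch1.txt
wc -l < /tmp/mda_clean/batch1.txt
```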

Then there is the understanding that although materialising
sysprocesses may sound onerous, thousands of rows or not, it is
actually done directly from in-memory structs (unless one has a
shoebox for a server).

> This only applies to equi-sargs on columns listed in monTableParameters.

And what about the rest ?

----------

The fundamental problem I have is duplication, which I abhor, and here
we are talking about duplication of every execution, two orders of
magnitude of duplication. Forget about MDA, or MDA vs sp_sysmon, for a
moment and consider this. Why do we:
1 have an app [I use the term loosely] that throws (.) a bunch of
[known or unknown] SQL at the server
2 have the server execute it, with issues to varying degrees
3 capture and store [without regard to persistence and maturity
problems] the (a) detailed SQL and related data, plus (b) the various
low level detail conditions of the server operation during (2)
4 correlate (a) and (b)
5 analyse the result ???

Following which, once proof and evidence has been tabulated:
6 educate the app developer
7 correct (1)

Consider:
• (a) is greater than (.) in its initial state; plus you only need the
single instance in order to evaluate it; (a) captures every repetition
(one errant code segment vs the thousands of executions of that code
segment)
• (6) and (7) are unavoidable, and really it is the final stated or
unstated goal of the process

Why not just capture and analyse (.) and thus entirely eliminate (3)
and (4), and have the smallest content to analyse in (5) ? Only a
small amount of education is required to predict what that code segment
will do to the server, and one can correct it there. It is not just
pro-active, it is predictive, and issues can be avoided/eliminated
before they are allowed to occur, rather than allowing them to occur,
causing issues, and subsequently having to prove the issue, their cause
and inevitably educate the developer as to why the code segment is
errant.

After death and taxes, there are two unavoidable things in life:
education and responsibility. App developers can delay them at huge
expense to the organisation [via (3) and (4)] and denial etc, but they
cannot avoid them.
