We are doing the "typical" TS/MP requester-server programming
involving PATHSEND clients and TS/MP servers. Recently we have been
facing a problem when using SERVERCLASS_SEND_ and reading messages via
$RECEIVE.
Some quick facts:
1. Using SERVERCLASS_SEND_ in Nowait mode.
2. Using 2 CPU S76000 hardware.
3. Using G06.31.01 OS.
4. Server's MAXLINKS=2, LINKDEPTH=1.
5. Server's binary is in OSS.
6. TS/MP client is also in OSS.
7. Server has been compiled with OPTIMIZE 0.
8. Server opens $RECEIVE in Waited mode with Receive-depth 1.
Problem statement: When we run several instances of our TS/MP client
and several instances of the server, we encounter message drops at
times.
Observations:
1. SERVERCLASS_SEND_INFO_ does not report any error. Anyway, it's
nowaited I/O.
2. Even AWAITIOX does not return any error.
3. The READUPDATEX call returns with a "finite and pre-expected message
length" BUT the message buffer is all NULLs.
4. We MEASUREd the server process, and the "process" entity reveals that
the message system has delivered <some-length> bytes to the server
process, but where have those bytes gone?
5. LINKMON statistics reveal that the message has been "SENT" from the
LINKMON. So where might it have vanished?
Has anyone ever encountered this kind of problem with TS/MP or Guardian
processes? The problem is that we can't catch it in debug. It appears
only during a free run. The problem can never be reproduced on Integrity.
As I was running out of ideas, I contacted GMCSC, but the support
specialist I came across is more busy closing the case, blaming my
program's inability to handle non-blocking I/O. I failed to make him
understand that the same code runs perfectly on Integrity, so why
does it not work here? Well, he is not in a position to
understand that the situation does not occur every time. But it occurs
every now and then.
I know there are many GMCSC folks who are listening here and I would
certainly like to bring one point here for all of them - Please, for
God's sake, please don't create a small "toy" program as a model of
user's application and prove that the problem does not occur with that
"mini-joker" and so, it's something wrong with our apps code! This is
unacceptable! I have noticed it a couple of times now. Whenever I report
a case, the GMCSC support person will create a small program or a
small table or a small query or a small replica of our client/server
application and come back to me "loud" saying, if this small thing is
working perfectly then surely your big guy has some issue. So, please
let us go ahead and close the case :-). I don't know what to call this
"support". Are you guys really helping? Ask yourself.
While I am sincerely thinking about whether the problem is with TS/MP
or Message System or File System, this gentleman is after me asking
when can he close the case.
Anupam
Hi Anupam,
you did not supply any information about transaction rates, number of
messages and so on. I know there are some limitations in LINKMON and
those limits have been increased on Integrity.
The difference between S and Integrity might be caused by the speed
difference between the two systems, so maybe you have the same problem
on Integrity but because of the higher speed it does not show up.
You should check if all the timers are set properly. Does the
requester do multithreaded processing? In this case, check transaction
handling of the requester.
Did you take a look at the Pathmon-logging? If there are any problems,
the Pathmon should write a message.
Another thing you should look at are the statistics and the detailed
status of the serverclass.
The MAXLINKS 2 should be ok if there are no external linkmons.
LINKDEPTH is ok, too, as the server works waited.
If you have defined any dynamic processes, check for the createdelay.
In which case the GMCSC person is probably correct in that you are not
handling non-blocking i/o correctly. My experience has been that if
you are unable to reproduce a problem in debug and the problem does
not appear on a different class of machine, you have a timing
problem. I seriously doubt that messages are being "lost" by the file
system, more than likely the application is missing the message or
failing to check for an error.
Since your server may only handle a single message at a time, is it
possible that your requester is modifying the buffer prior to the
completion of the serverclass_send_ call? You might want to review
the rules for nowaited Pathsend usage on page 5-23 of the TS/MP
Pathsend Programming manual to ensure conformance.
This is also the first thought that occurred to me when reading the original post. I don't know how skilled Anupam is in dealing with nowait I/O, so I hesitate to say definitely that the application is coded incorrectly, but it is the first thing I would look into.
Anupam: Forgive me if you do understand this, but let me mention it to be sure. When you use nowait, you must make sure that the program does not modify the memory holding the message you sent until an AWAITIOX call tells you that the reply for that request has been received.
I don't know the inner workings of Pathsend, but for ordinary filesystem WRITE or WRITEREAD calls, when certain options are set, the filesystem is free to copy the outgoing message from the user-space buffer *at any time*, not just while control is within the WRITE call. I believe that for ordinary filesystem WRITE and WRITEREAD, the default is to copy the data from the user buffer into a system buffer during the WRITE call, and you must actively specify the option to avoid that copy (you would do so in order to improve performance by eliminating that copy). So in the default case with WRITE and WRITEREAD, the program can modify the buffer immediately without harm.
I believe the default for Pathsend is that the copy to a system buffer is not done, and so the program must be careful not to modify the message buffer until it knows the operation is complete. There is a specific warning to that effect in the Pathsend manual. I don't know whether there is an option to make Pathsend copy to a system buffer the way it does by default for WRITE and WRITEREAD.
If I understand your description, the message the server receives sometimes is all zeros, but is the correct length. Does the requester sometimes clear the message buffer before using it to build a new message? If it does not do that, then perhaps the simple explanation is not what is causing your problem. But if the requester does sometimes clear the message buffer, that at least makes it plausible that it is doing so before Pathsend has sent the message. Look at your code carefully to be sure a message buffer is not touched between the SERVERCLASS_SEND_ and the time that AWAITIOX says that operation has completed. If your program ever has more than one message outstanding, make sure the code correctly identifies which message buffer is the one that AWAITIOX reports has completed.
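To make the last two points concrete, here is a minimal sketch in plain C of one way a requester might keep track of its outstanding buffers. It does not call SERVERCLASS_SEND_ or AWAITIOX at all; the slot pool and the names slot_acquire and slot_complete are hypothetical, and it assumes each send is given a unique tag that the completion call hands back.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SLOT_COUNT 4
#define SLOT_SIZE  256

/* One outstanding nowait request. All names here are hypothetical. */
typedef struct {
    char buffer[SLOT_SIZE]; /* message memory handed to the send call */
    long tag;               /* tag passed on the send, echoed at completion */
    int  in_flight;         /* 1 from send until completion is reported */
} send_slot;

static send_slot slots[SLOT_COUNT];

/* Claim a free slot and copy the message into it. Returns NULL when the
   message is too large or every slot is still in flight; the caller must
   then wait for a completion instead of recycling a busy buffer. */
send_slot *slot_acquire(const void *msg, size_t len, long tag)
{
    assert(msg != NULL);
    if (len > SLOT_SIZE)
        return NULL;
    for (int i = 0; i < SLOT_COUNT; i++) {
        if (!slots[i].in_flight) {
            memcpy(slots[i].buffer, msg, len);
            slots[i].tag = tag;
            slots[i].in_flight = 1;
            /* The real program would issue its nowait send on
               slots[i].buffer here, passing the same tag. */
            return &slots[i];
        }
    }
    return NULL;
}

/* Called with the tag that the completion call reported. Only now may the
   slot's buffer be cleared or reused. Returns 0 on success, -1 if no
   in-flight slot carries that tag. */
int slot_complete(long tag)
{
    for (int i = 0; i < SLOT_COUNT; i++) {
        if (slots[i].in_flight && slots[i].tag == tag) {
            slots[i].in_flight = 0;
            return 0;
        }
    }
    return -1;
}
```

The point of the design is that nothing outside these two functions decides when a buffer is free: a slot is released only by the tag reported at completion, never by position or by "it has probably gone by now".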
As I said, I don't know whether there is an option to make SERVERCLASS_SEND_ copy the message to a system buffer before returning, but if there is such an option, enabling it might be a useful test. If incidents of the bad messages being delivered to the server stop occurring when such an option is enabled, that would be some evidence that the requester is clearing the message before Pathsend was able to send it. It would not be definite proof that the requester is making that error, since changing the Pathsend processing via such an option might make the Pathsend code avoid some bug it has itself. But if the bad messages still get to the server, then miscoding of the nowait logic could be ruled out. That said, I am not aware of such an option to make Pathsend copy the message to a system buffer during the SERVERCLASS_SEND_ call, so I doubt you can make this test.
On the matter of the support specialists creating small programs: It is a useful thing to do, since sometimes a problem is such that it can be duplicated in a small test program. If that can be done, it makes reporting the problem to development much easier, and it makes it much easier for development to study and understand the problem. But you are right that the support specialist should not take failure to duplicate the problem in a small program as proof that the problem is the fault of the coding in the application, at least not without studying the problem further.
> the filesystem is free to copy the outgoing message from the user-space buffer *at any time*
To expand on Wolfgang's and Keith's excellent (as usual)
contributions, download the file "PathSend Data Flow.doc" from my
online file folders at box.net (URL http://www.box.net/shared/hbo72enpj0).
This one-pager adds some detail to how the PathSend message transport
works (especially how the "at any time" can be happening).
Messages do not "get lost." From the problem description it really
appears as if the application may be "nulling out" the buffer after
the data has been pulled in, but before actual send completion... Hard
to tell for sure, of course, but years of Tandem support experience
tells me that the application does something wrong (this is not meant
to be arrogant, just stating a fact).
Good luck finding the problem cause! Please let us know what it was,
it may provide valuable info to all of us!
Cheers,
Henry Norman
MicroTech Consulting
http://groups.google.com/group/microtech_software
Henry,
Thanks for your input.
I have one doubt which I want to clarify based on your document. You
showed two paths, viz. the Control-Data Path and the Actual-Data-Buffer
Path. According to your document, once the link is established between a
TS/MP client and a TS/MP server process, the data buffer moves directly
from client to server and from server to client. Is that right?
But while debugging the TS/MP client, after the SERVERCLASS_SEND_, I
see only the LINKMON, e.g. $ZL00 or $ZL01, being opened by the client.
From the standpoint of the standard NonStop message system and Guardian
requester-server architecture, this situation is a little different,
isn't it?
I am still wondering how the client could send a "data-buffer" to a
server "process" without even opening it. Is it that
LINKMON opens a shared memory segment with the client, which the client
uses to transfer data from its own buffer to the shared
location, and then triggers the LINKMON to transfer that data to the
final destination? I don't know the internals of LINKMON, but just
because it pools so many connections and so many nowaited sends and
replies, I am led to believe it must be using a shared memory
arrangement inside.
Anupam
That is a reasonable guess, but as far as I understand it, not quite right.
As I understand it, the LINKMON takes the information about where the user buffer is located, number of bytes, and probably additional information, and sends that to the server process. On the server side, some component of the OS takes that source address and length, and the address and length of the buffer in the server process given in the READUPDATE of $RECEIVE, and constructs the appropriate Servernet commands to cause Servernet to move the data directly from the client's buffer to the server's buffer. No extra copies need be done. I assume the reply works similarly in the other direction.
If the client and server happen to be in the same CPU, I don't know how that changes the picture. I imagine Servernet is not involved and the OS moves the data directly from the client's buffer to the server's buffer, but I don't remember ever seeing a description of that case.
I might have some details wrong, but I'm pretty sure it works approximately as I just described. If I'm wrong and someone can give a more accurate description, I would be happy to hear it.
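As a toy model only (it says nothing about how the message system or Servernet is really implemented), the following plain C sketch shows why a transfer that records just the source address and length, and moves the bytes later, would hand the server a message of the correct length full of zeros if the sender clears its buffer in between. The type pending_transfer and the functions start_send, deliver and all_zero are hypothetical names invented for the illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* What the sender hands over at send time: only where the data lives
   and how long it is. Hypothetical type for illustration. */
typedef struct {
    const char *source;
    size_t      length;
} pending_transfer;

/* "Send": records the address and length; no bytes move yet. */
pending_transfer start_send(const char *buffer, size_t length)
{
    pending_transfer t = { buffer, length };
    return t;
}

/* "Delivery": happens later, when the receiver posts its read. The bytes
   are taken from the sender's buffer as it looks at that moment. Returns
   the number of bytes delivered. */
size_t deliver(const pending_transfer *t, char *dest, size_t dest_size)
{
    assert(t != NULL && dest != NULL);
    size_t n = t->length < dest_size ? t->length : dest_size;
    memcpy(dest, t->source, n);
    return n;
}

/* Returns 1 if the first n bytes are all zero, otherwise 0. */
int all_zero(const char *p, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (p[i] != 0)
            return 0;
    return 1;
}
```

If a caller does start_send on a filled buffer, then memset on that buffer, then deliver, the receiver gets the full length but only zero bytes, which matches the symptom reported at the start of the thread.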
Have you been able to figure out what is causing the server to see some messages to be full of zeros?
I am sorry for being late in getting back to you on this. Yes, the
reason was buffer alteration before the actual I/O happened. It was
precisely a coding mistake, for which I certainly would not blame my
developer, as it was a highly complicated C++ program. When C++ objects
get created and destroyed and pointers are involved, this does happen
at times. We kept many buffer slots and were confident that nowait
operation would not be an issue, but we didn't notice that, in one place,
we were playing with the address instead of doing a simple "memcpy". As
the fix, we used memcpy and the problem vanished. Thanks to everyone who
helped me get out of this situation.
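The actual code is not shown in this thread, so the following is only a guess at the general shape of such a mistake, written in plain C rather than C++. It contrasts a slot that stores the caller's address with a slot that owns a copy of the bytes; alias_slot, copy_slot and their setter functions are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MSG_SIZE 32

/* Hypothetical slot layouts; the real program's classes are not shown
   in the thread. */
typedef struct {
    const char *data;      /* BUG pattern: only the address is stored */
    size_t      length;
} alias_slot;

typedef struct {
    char   data[MSG_SIZE]; /* FIX pattern: the slot owns its own bytes */
    size_t length;
} copy_slot;

/* Stores the caller's address. The slot stays valid only while the
   caller leaves that memory untouched. */
void alias_slot_set(alias_slot *s, const char *msg, size_t len)
{
    assert(s != NULL && msg != NULL);
    s->data = msg;
    s->length = len;
}

/* Copies the bytes into the slot. Returns 0 on success, -1 if the
   message does not fit. */
int copy_slot_set(copy_slot *s, const char *msg, size_t len)
{
    assert(s != NULL && msg != NULL);
    if (len > MSG_SIZE)
        return -1;
    memcpy(s->data, msg, len);
    s->length = len;
    return 0;
}
```

With the alias version, clearing or destroying the caller's working buffer changes what the slot "contains" even though the slot itself was never touched; with the copy version the slot keeps the message until the slot itself is released.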
By the way, I have something to share here. You may recall that I
shared one of my frustrating experiences with HP GMCSC in this context. I
didn't like the attitude of GMCSC because the specialist was in a kind
of "blame game" with me. The idea was to get rid of a problem, not to
blame HP and show off my code. Anyway, still so far so good.
After two days from my initial posting, I got a mail from one senior
GMCSC person saying "Additionally I would very much appreciate if you
could remove the negative comments towards HP and the GMCSC from the
UseNet Tandem group. The is neither beneficial nor helping the
situation."
I am surprised! Do I need to get HP's permission before writing
something (which is true) in a blog? Who would understand my
frustration, that I only posted in comp.sys.tandem when I was not
getting meaningful support from the so-called "official support
centre" of HP?
>By the way, I have something to share here. You may recall that I
>shared one of my frustrating experiences with HP GMCSC in this context. I
>didn't like the attitude of GMCSC because the specialist was in a kind
>of "blame game" with me. The idea was to get rid of a problem, not to
>blame HP and show off my code. Anyway, still so far so good.
>After two days from my initial posting, I got a mail from one senior
>GMCSC person saying "Additionally I would very much appreciate if you
>could remove the negative comments towards HP and the GMCSC from the
>UseNet Tandem group. The is neither beneficial nor helping the
>situation."
I think my response to that would have been:
a) to post that entire email here, including the sender's name; and
b) to send a tartly-worded email to the sender *and* the sender's superior,
informing them that I had done so, and that what I choose to post or not post
is not affected by whether or not they like it -- and also reminding them that
if their customers are not satisfied with the level of support we receive, the
solution to that problem is to provide better support, not attempt to muzzle
the customers. (In other words, a polite but firm way of saying "f--- off".)
>
>I am surprised! Do I need to take HP's permission before writing
>something(which is true) in a blog?
Of course not! Not in the U.S., anyway, where "I speak the truth" is a pretty
solid defense against a charge of libel or slander.
>Who would understand my
>frustration, that I only posted in comp.sys.tandem when I was not
>getting meaningful support from the so-called "official support
>centre" of HP?
Quite understandable IMHO. And useful to the rest of the Tandem community as
well, to learn of support problems. If HP doesn't like reading public
complaints about the quality of their technical support, the solution is
simple: provide better support. Don't hold your breath waiting for that to
happen, though. I made the decision about five years ago never to buy another
HP consumer product again, after having had my fill of bad experiences with
their wretched support.
Please take a moment and re-read your own posting:
"Please, for God's sake, please don't create a small "toy" program as
a model of user's application and prove that the problem does not occur
with that "mini-joker" and so, it's something wrong with our apps code!"
It took me exactly 1 minute of considering your problem and another
minute to locate the page in a manual that supported the resolution so
that I could point you to it. I was not party to the conversation you
had on the phone, but from the tone of your own posting, and the fact
that it was truly a simple coding error that a "toy" program showed
was not an OS problem, I hope you apologized to the GMCSC for taking
up their time blaming the File System or TS/MP. My guess is that over
the years they have had a lot of silly application bugs blamed on
operating system code that has essentially remained untouched for 15
years and, as with the GMCSC support person in your case, they
correctly identified the problem but were unable to convince you.
Would you please comment on the following points?
1. Did a customer report a case to GMCSC? - True, he did.
2. Was a GMCSC support person assigned to investigate? - Yes, it was
all done immediately.
3. Did the support person read and understand the problem statement? -
As far as I know, yes he/she did.
4. Did the support person try replicating the problem on his system
and succeed? - No. Fine....read on...
5. Was the problem continuously reproducible on the customer's system? -
Yes, every day and every minute.
6. Was the GMCSC support person offered a logon ID, password and
IP address to access the customer's machine directly? - Yes. Right
from day one, all the access was open for the support.
7. Was the GMCSC support person offered the chance to debug/analyze the
problem right on the customer's system? - Yes, right from day one.
8. Did the GMCSC support person take any interest in really touching and
feeling the problem on the customer's system? - No, never.
9. Did the customer offer to walk through the code of the application
for the support person so that he could analyze it quickly?
- Yes, every time, but there was no interest from the support person.
So?! What does that mean?! Everything has a process, right? Without
even reproducing the problem, how dare the GMCSC conclude that
they have doctored it well? I don't have to agree with any speculations,
right? See the confession right from the GMCSC support person's own
pen -
"Thanks for your e-mail. At this time I believe that the GMCSC has
performed our due diligence to try to assist you and try to recreate
this problem. At this time we are unable to replicate the issue. If
further testing by yourself uncovers further information/errors then
we will be happy to assist."
What was that? They themselves are agreeing that they could not assist
much due to their own limitations.
Bottom line was, the support person was more busy on some "Blame Game"
with me blaming my application and not blaming the systems software.
Whereas, I was least interested into that. My aim was to solve a
problem. Hope you now understand my situation.
Difference between COMP.SYS.TANDEM and GMCSC:
People in this group go out of their way to understand someone's
problem, whereas the GMCSC knows process more than products. Beyond
their product and the product's internals, they don't even want to look
any further.
Looking at my problem, one man, Mr. Norman, who was at an airport
waiting for his next flight, took time to compose a long mail for me,
especially to make me understand the internals of nowait operation
and what must have gone wrong in this case. He spent time on me
to bring me in line with the actual problem/bug and helped me to
diagnose it. Didn't he have anything better to do at the Taipei airport
than composing a mail for me? Am I his family member, or do I pay
for his service? Which one?
You better learn first what is called a "Helping Attitude" and then
pen down something here.
>
> You better learn first what is called a "Helping Attitude" and then
> pen down something here.
Excuse me, but I have helped many people both on and off this board
with a great deal less experience than yourself. You sir are the
person who one would be hard-pressed to ever find assisting someone
else.
You have a complete misunderstanding as to what support the GMCSC
provides. They are not there to log on to your system and debug your
code no matter how freely you offer them access. They are not there
to "walk through your code" with you! They are there to consult on and
correct problems with HP hardware and software.
Your expectation that HP will help you debug your incorrectly written
code is way off base. If you need assistance writing code you should
retain the services of an experienced consultant to help you along the
learning curve, or you can rely on the good services of Mr. Norman, but
expecting free software programming support because you lack training
is not the fault of the GMCSC.
You read too much between the lines! After posting the last message, I
was very sure you would pick up this point to say I need programming
assistance! Very smart. Again, write your points against my points
below:
1. If GMCSC was not supposed to touch my system, then why did the support
specialist access our server and code his own program? As far as I
know and understand, GMCSC does not have any clause which says they
are not supposed to log on to a customer's server. Still, leave it....
2. The support person spent time writing his own code on our
system (which is far from real application code) to prove that there
is no issue with the system software. But then, why didn't he write
code that would simulate my problem? I guess you would agree that, due
to me or due to HP, there was some issue. The issue is not fake, right?
While I was part of NonStop development, I used to try to think in line
with what the customer might have done. I used to write stupid code at
times to get near to them and their application. In order to prove that
I had coded something wrongly, he should have coded something wrongly
and proved that I must have made exactly this mistake to reach this
situation. Isn't that realistic?
Why should I believe what GMCSC smells, feels, thinks and
all?
Expectations from GMCSC and expectations from comp.sys.tandem are not
the same, right? No doubt, you were the first person who suspected it
correctly, but did I ever ask you how you were so sure? I can't!!
Whatever you guys are doing is all free help, and whatever we get, we
should be thankful for. But the same is not true of GMCSC. The
customers are spending money on them. Did I pay money to you for your
kind help?
"But then, why didn't he write code that would simulate my problem?"
Why do you expect the support person, who correctly identified your
problem, to then write a program to simulate your poor programming
technique? Why should he spend his valuable time trying to create a
timing problem in a test program to prove you modified the buffer?
Why didn't you try this? Your problem was obvious to a number of us
on this board but it appears that you did not bother to test our
theory until many days later.
> No doubt, you were the first person who suspected it
> correct, but did I ask you ever, how are you so sure? I can't!!
Sure you can ask, but you did not. The answer is that I have been
writing code for these machines for close to 30 years. Am I wrong on
occasion? Absolutely, but your problem was simply obvious and backed
up by documentation.
> Whatever you guys are doing are all free help and whatever we get, we
> should be thankful. But the same is not true with GMCSC. The
> customer's are spending money on them. Did I pay money to you for your
> kind help?
You do not pay us for our help and no payment was asked for. However,
unlike those of us who post on comp.sys.tandem, the folks at the GMCSC
are not free to spend time working on problems for which they are not
paid - and correcting your lack of proper programming technique is not
something they are paid to do. I know your problem was the most
important one on your plate, but you were using up their finite cycles
asking the GMCSC to help train you, while forcing customers with real
HP-caused issues to go waiting.
The net result is that your code had a flaw - just as the GMCSC said
it did. A number of us saw the probability of the same flaw but it
apparently took Henry writing you a more detailed explanation before
you would accept the truth. What bothers me is that you continue to
fault the people who gave you the correct answer right at the start,
but whom you chose to ignore.
I don't have to agree with your myths. I fixed my problem within two
hours of your first posting in this thread. Where did I ignore anyone?
Next, Mr. Norman helped me to understand in more detail why I did
"that" to fix my issue. I have other work too besides browsing
comp.sys.tandem, and that's why it slipped my mind to update this group
about my fix, which I made close to a week back. Then Keith asked me
about the issue again, and that reminded me to update the group here.
Now have you got the full answer?
Do you really want to know what the guy from the GMCSC did with me
in this case? At first, he didn't help me in the expected way; on top
of that, while I was analyzing/fixing my program and testing for
correctness, he was buzzing me a thousand times asking whether he could
close the case. Now, is that the way GMCSC is supposed to behave? With
just two lines, "As there is no problem with the TS/MP, can I close the
case", he kept on sending me mails, and how was I supposed to react? If
you have nothing to do, please keep quiet. I myself will fix my problem
and will let you know. Simple! Why should I give the green signal to
them before confirming something?
Now listen to another story -
BEGINTRANSACTION and EXEC SQL BEGIN WORK;
We have been using these things for years and, in your case, for the
last 30 years. Fine!....BEGINTRANSACTION and BEGIN WORK should work
exactly the same, right? They do...so far so good. Now see my situation
here - one audited SQL/MP table and one Enscribe queue file which is
also audited. Great combination!
I coded something like this -
EXEC SQL BEGIN WORK;
EXEC SQL INSERT INTO <Audited SQL/MP Table>; /* sqlcode was zero */
READUPDATELOCKX("Queue File");
/* Record got successfully de-queued. Now do something something... */
EXEC SQL COMMIT WORK;
/* sqlcode is ZERO, means success */
Now I checked the inserted record in SQLCI and found the record. I then
went to check the queue file, and the record had been en-queued
back!!!!! How? I checked it a thousand times before filing a case with
GMCSC but, as usual, it was non-reproducible. I had to close the case
forcibly.
The next morning, the same error started occurring again! Imagine my
frustration. This issue still exists in my system and I can show
it to anyone.
One mistake I made in this case: I took the responsibility for the
replication in a small program instead of asking GMCSC to do that. As I
myself could not reproduce it, I was forced to say yes to closing the
case. But I can show this problem to anyone anytime.
At first I thought maybe this was occurring due to a nowaited open of
the queue file with pending I/O remaining, but in the simulation I used
waited I/O and it still works. GMCSC says, I am surprised, the code of
BEGIN WORK simply calls BEGINTRANSACTION, so how is this
happening? What should I answer? Only God knows!
There are differences between begin/commit work and
begin/endtransaction. Of course begin work will internally use
begintransaction and commit work will use endtransaction, but at least
commit work will do some additional things like closing cursors.
Anyway, that example isn't helpful at all to show anything. What
sense does a READUPDATELOCKX make before ending the transaction? It
locks the record, and commit work unlocks the record if the file is
audited! If the file is not audited, the lock remains in effect until
you explicitly unlock it.
Concerning the buffer thing you should read the Hotstuff Messages for
NonStop systems, those can be requested using the service portal. In
the last few months several of those messages were related to
modifying buffers before I/O completion.
I don't understand your blaming of GMCSC. If you want professional
support for non-HP products, ask and pay for Professional Service. The
purpose of GMCSC is to help you (and me) with problems with HP hardware
and software that does not behave as described.
If I have a problem I usually verify twice that it is really an HP
problem, and then I call GMCSC. And if those people tell me that I
might be mistaken and tell me what might be wrong, I tell them to
put the case in HOLD-CUSTOMER and check my software to see if they are
right. And in most cases they are right!
I know some of the GMCSC people personally and you can be assured that
they know what they are talking about! But they are not gods, so they
can fail.