I have apache httpd running with a shared lib using NSS/NSPR.
The lib talks to an SSL enabled server using PR_WRITE. Occasionally
PR_WRITE returns error '-8023'.
OS is CentOS 5.5 64bit.
NSS: @(#)NSS 3.12.5.0 Aug 3 2010 17:15:02
NSPR: @(#)NSPR 4.8.2 2010-08-03 17:13:30
I've tried 'ltrace' to trace the NSS/NSPR library calls but without
success ... 'ltrace' does not show anything.
AFAIH understood the source '-8023' could be
'SEC_ERROR_PKCS11_DEVICE_ERROR' or 'CKR_DEVICE_ERROR'.
There's no other PKCS11 module involved but the internal one.
How can I track down this issue?
TIA,
Bernhard
Install the debugging symbols and source code for nss and nspr
yum install nss-debuginfo nspr-debuginfo
Then run the process under gdb or attach gdb to a running process and
set a break point on PR_SetError, you will probably need to issue this
gdb command (or put it in a .gdbinit file):
set breakpoint pending on
You should break in the debugger with full source code when an error is
set, you can then look around to see why.
--
John Dennis <jde...@redhat.com>
Looking to carve out IT costs?
www.redhat.com/carveoutcosts/
Check the console and/or error output if you find something that could
lead us.
-hb-
unfortunately the server is running remotely ... inaccessible for debugging.
Do you know why 'ltrace' does not work when using '-p'?
Thanks,
Bernhard
On 01/11/2011 06:08 PM, John Dennis wrote:
> On 01/11/2011 11:36 AM, Bernhard Thalmayr wrote:
>> Hi experts,
>>
>> I have apache httpd running with a shared lib using NSS/NSPR.
>>
>> The lib talks to an SSL enabled server using PR_WRITE. Occasionally
>> PR_WRITE returns error '-8023'.
>>
>> OS is CentOS 5.5 64bit.
>> NSS: @(#)NSS 3.12.5.0 Aug 3 2010 17:15:02
>> NSPR: @(#)NSPR 4.8.2 2010-08-03 17:13:30
>>
>> I've tried 'ltrace' to trace the NSS/NSPR library calls but without
>> success ... 'ltrace' does not show anything.
>>
>> AFAIH understood the source '-8023' could be
>> 'SEC_ERROR_PKCS11_DEVICE_ERROR' or 'CKR_DEVICE_ERROR'.
>>
>> There's no other PKCS11 module involved but the internal one.
>>
>> How can I track down this issue?
>
Thanks for the pointer Honza ....
>
>
> Check the console and/or error output if you find something that could
a private lib within Apache httpd is using NSS/NSRP ... nothing in any log.
The error is caught and printed in a custom log (using PR_GetError).
-Bernhard
> lead us.
>
> -hb-
>
The best way to debug this is to find out where NSS's internal PKCS
#11 module sets the error code CKR_DEVICE_ERROR. Unfortunately there
are a lot of possible places (all the files in the
mozilla/security/nss/lib/softoken directory):
http://mxr.mozilla.org/security/ident?i=CKR_DEVICE_ERROR
The best way to track this down is to run your Apache httpd in a
debugger with debug symbols for NSS, or run your Apache httpd with NSS
shared libraries that you build from source code and insert printf
statements.
Wan-Teh
thanks for your reply.
Will it be helpfull to use the 'PKCS #11 Module Logger' before starting
with 'printfs'?
I tried that and get at least some output in the specified log.
-Bernhard
bob
331569088[1bd1610]: C_DigestUpdate
331569088[1bd1610]: hSession = 0x88
331569088[1bd1610]: pPart = 0x6e580a4
331569088[1bd1610]: ulPartLen = 70
331569088[1bd1610]: rv = CKR_OK
331569088[1bd1610]: C_GetMechanismInfo
331569088[1bd1610]: slotID = 0x1
331569088[1bd1610]: type = 0x132
331569088[1bd1610]: pInfo = 0x7fffcd592e70
331569088[1bd1610]: rv = CKR_OK
331569088[1bd1610]: C_UnwrapKey
331569088[1bd1610]: hSession = 0x6
331569088[1bd1610]: pMechanism = 0x7fffcd592ea0
331569088[1bd1610]: hUnwrappingKey = 0x8
331569088[1bd1610]: pWrappedKey = 0x36ac9bc
331569088[1bd1610]: ulWrappedKeyLen = 48
331569088[1bd1610]: pTemplate = 0x7fffcd592cf0
331569088[1bd1610]: ulAttributeCount = 6
331569088[1bd1610]: phKey = 0x36bfd58
331569088[1bd1610]: CKA_SIGN = CK_TRUE [1]
331569088[1bd1610]: CKA_VERIFY = CK_TRUE [1]
331569088[1bd1610]: CKA_CLASS = CKO_SECRET_KEY [8]
331569088[1bd1610]: CKA_KEY_TYPE = 0x10 [8]
331569088[1bd1610]: CKA_DERIVE = CK_TRUE [1]
331569088[1bd1610]: CKA_VALUE_LEN = 3000000000000000 [8]
331569088[1bd1610]: mechanism = CKM_DES3_ECB
331569088[1bd1610]: *phKey = 0x54
331569088[1bd1610]: rv = CKR_OK
331569088[1bd1610]: C_DeriveKey
331569088[1bd1610]: slotID = 0x1
331569088[1bd1610]: flags = 0x4
331569088[1bd1610]: pApplication = 0331569088331569088[1bd1610]:
Notify = 0x13231f333331569088[1bd1610]: phSession =
0x7fffc331569088[1bd1610]: phKey = 0x36c1618
331569088[1bd1610]: CKA_CLASS = CKO_SECRET_KEY [8]
331569088[1bd1610]: CKA_KEY_TYPE = 0x12 [8]
331569088[1bd1610]: *phSession = 0x132317f8
331569088[1bd1610]: rv = CKR_DEVICE_ERROR
331569088[1bd1610]: mechanism = CKM_TLS_KEY_AND_MAC_DERIVE
331569088[1bd1610]: C_DigestInit
331569088[1bd1610]: hSession = 0x1
331569088[1bd1610]: pMechanism = 0x7fffcd593290
331569088[1bd1610]: mechanism = CKM_MD5
331569088[1bd1610]: rv = CKR_DEVICE_ERROR
331569088[1bd1610]: C_OpenSession
331569088[1bd1610]: slotID = 0x1
331569088[1bd1610]: flags = 0x4
331569088[1bd1610]: pApplication = 0x4bdff70
331569088[1bd1610]: Notify = 0x13231f30
331569088[1bd1610]: phSession = 0x7fffcd593228
331569088[1bd1610]: *phSession = 0x132317f8
331569088[1bd1610]: rv = CKR_DEVICE_ERROR
Could someone please help me understand the trace?
I've also gathered an SSLTRACE log if needed.
TIA,
Bernhard
> 331569088[1bd1610]: flags = 0x4
> 331569088[1bd1610]: pApplication = 0331569088331569088[1bd1610]:
> Notify = 0x13231f333331569088[1bd1610]: phSession =
> 0x7fffc331569088[1bd1610]: phKey = 0x36c1618
> 331569088[1bd1610]: CKA_CLASS = CKO_SECRET_KEY [8]
Was that a copy and paste error?
Or does the file really look like that?
Should look like this...
> 331569088[1bd1610]: pApplication = 0331569088
> 331569088[1bd1610]: Notify = 0x13231f333
> 331569088[1bd1610]: phSession = 0x7fffc
> 331569088[1bd1610]: phKey = 0x36c1618
No (at least I don't think so)
> Or does the file really look like that?
Yes
$grep 0331569088331569088 nspr.log2
331569088[1bd1610]: pApplication = 0331569088331569088[1bd1610]:
Notify = 0x13231f333331569088[1bd1610]: phSession =
0x7fffc331569088[1bd1610]: phKey = 0x36c1618
Am'I wright that 'C_DeriveKey' is actually 'NSC_DeriveKey' in
http://mxr.mozilla.org/security/source/security/nss/lib/softoken/pkcs11c.c ?
Yes. C_DeriveKey is a function pointer. It points to the
NSC_DeriveKey function.
Wan-Teh
This says you called C_DigestUpdate with a session handle of 0x88 and to
digest 70 bytes of data at memory location 0x63580a4. The function
returned CKR_OK.
The rest are pretty similar, I'll add additional notes as appropriate....
> 331569088[1bd1610]: C_GetMechanismInfo
> 331569088[1bd1610]: slotID = 0x1
> 331569088[1bd1610]: type = 0x132
> 331569088[1bd1610]: pInfo = 0x7fffcd592e70
> 331569088[1bd1610]: rv = CKR_OK
From this, it appears you are using the internal crypto slot rather than
the internal key/certificate store slot. This would be expected if you
are doing SSL, which is what appears to be happening here.
> 331569088[1bd1610]: C_UnwrapKey
> 331569088[1bd1610]: hSession = 0x6
> 331569088[1bd1610]: pMechanism = 0x7fffcd592ea0
> 331569088[1bd1610]: hUnwrappingKey = 0x8
> 331569088[1bd1610]: pWrappedKey = 0x36ac9bc
> 331569088[1bd1610]: ulWrappedKeyLen = 48
> 331569088[1bd1610]: pTemplate = 0x7fffcd592cf0
> 331569088[1bd1610]: ulAttributeCount = 6
> 331569088[1bd1610]: phKey = 0x36bfd58
> 331569088[1bd1610]: CKA_SIGN = CK_TRUE [1]
> 331569088[1bd1610]: CKA_VERIFY = CK_TRUE [1]
> 331569088[1bd1610]: CKA_CLASS = CKO_SECRET_KEY [8]
> 331569088[1bd1610]: CKA_KEY_TYPE = 0x10 [8]
> 331569088[1bd1610]: CKA_DERIVE = CK_TRUE [1]
> 331569088[1bd1610]: CKA_VALUE_LEN = 3000000000000000 [8]
> 331569088[1bd1610]: mechanism = CKM_DES3_ECB
> 331569088[1bd1610]: *phKey = 0x54
The extra data here is dumping out interesting values inside the
parameters passed to C_UnwrapKey. The list of CKA_XXXX = XXXXX are
dumping out the template sent in pTemplate. CKA_VALUE_LEN looks wrong to
me, I don't know if tehre is some byteswap issue here.
mechansim CKM_DES3_ECB means you are using DES to do unwrap, so this
appears to be a restart operation, rather than a full handshake.
> 331569088[1bd1610]: rv = CKR_OK
> 331569088[1bd1610]: C_DeriveKey
> 331569088[1bd1610]: slotID = 0x1
> 331569088[1bd1610]: flags = 0x4
> 331569088[1bd1610]: pApplication = 0331569088
> 331569088[1bd1610]: Notify = 0x13231f333
> 331569088[1bd1610]: phSession = 0x7fffc
> 331569088[1bd1610]: phKey = 0x36c1618
> 331569088[1bd1610]: CKA_CLASS = CKO_SECRET_KEY [8]
> 331569088[1bd1610]: CKA_KEY_TYPE = 0x12 [8]
> 331569088[1bd1610]: *phSession = 0x132317f8
> 331569088[1bd1610]: rv = CKR_DEVICE_ERROR
> 331569088[1bd1610]: mechanism = CKM_TLS_KEY_AND_MAC_DERIVE
Here you are deriving the SSL session keys. A session handle of
0x132317f8 is pretty large, but not unexpected in a server that has been
running a while.
CKR_DEVICE_ERROR can happen in a CKM_TLS_KEY_AND_MAC_DERIVE in a couple
of pathological cases:
1) if the underlying object structure is hosed.
2) if the key object looks bogus (a token Object that isn't known or
stored, for instance).
This object should be a token object. Unfortunately it looks like the
logging program did not out put hKey (or it did, but we ran into some
overwrite do to threading... the missing new lines after the
pApplication, Nofiy, and phSession) seems to indicate that. (In fact
Notify should be pNotify, and that looks like part of a C_OpenSession()
call, not a C_DeriveKey()... which makes some of the conclusions we are
trying to draw questionable.
> 331569088[1bd1610]: C_DigestInit
> 331569088[1bd1610]: hSession = 0x1
> 331569088[1bd1610]: pMechanism = 0x7fffcd593290
> 331569088[1bd1610]: mechanism = CKM_MD5
> 331569088[1bd1610]: rv = CKR_DEVICE_ERROR
I can see no place where NSC_Digest can even return CKR_DEVICE_ERROR
> 331569088[1bd1610]: C_OpenSession
> 331569088[1bd1610]: slotID = 0x1
> 331569088[1bd1610]: flags = 0x4
> 331569088[1bd1610]: pApplication = 0x4bdff70
> 331569088[1bd1610]: Notify = 0x13231f30
> 331569088[1bd1610]: phSession = 0x7fffcd593228
> 331569088[1bd1610]: *phSession = 0x132317f8
> 331569088[1bd1610]: rv = CKR_DEVICE_ERROR
This looks like a normal C_OpenSession call...).
> OS is CentOS 5.5 64bit.
Arg, I just remembered. CentOS 5.5 will likely be following RHEL 5.5 in
having an NSS 3.11 softoken. I've just checked and NSC_DigestInit still
cannot return CKR_DEVICE_ERROR in NSS 3.11 -- unless you are running in
FIPS mode. This is possible, but seem extremely unlikely since you are
using slotid=1 (FIPS slot ID is 3). In FIPS mode, any call can fail with
CKR_DEVICE_ERROR if the token is in 'fatalError' mode. That happens if
any of the following fail:
1) the Random number generator.
2) the Seeding the random number generator.
3) generating a key pair fails with CKR_GENERAL_ERROR
4) power on self check fails at init time.
It can be reset by reinitializing the token.
If this is what is happening, you should be getting NSS messages in your
syslog. I don't think FIPS is your issue here looking at the logs, however.
bob
Please see my comments in line, some stuff deleted for brevity.
On 01/12/2011 11:38 PM, Robert Relyea wrote:
--- snip ---
>> 331569088[1bd1610]: C_UnwrapKey 331569088[1bd1610]: hSession =
>> 0x6 331569088[1bd1610]: pMechanism = 0x7fffcd592ea0
>> 331569088[1bd1610]: hUnwrappingKey = 0x8 331569088[1bd1610]:
>> pWrappedKey = 0x36ac9bc 331569088[1bd1610]: ulWrappedKeyLen = 48
>> 331569088[1bd1610]: pTemplate = 0x7fffcd592cf0
>> 331569088[1bd1610]: ulAttributeCount = 6 331569088[1bd1610]:
>> phKey = 0x36bfd58 331569088[1bd1610]: CKA_SIGN = CK_TRUE [1]
>> 331569088[1bd1610]: CKA_VERIFY = CK_TRUE [1]
>> 331569088[1bd1610]: CKA_CLASS = CKO_SECRET_KEY [8]
>> 331569088[1bd1610]: CKA_KEY_TYPE = 0x10 [8] 331569088[1bd1610]:
>> CKA_DERIVE = CK_TRUE [1] 331569088[1bd1610]: CKA_VALUE_LEN =
>> 3000000000000000 [8] 331569088[1bd1610]: mechanism =
>> CKM_DES3_ECB 331569088[1bd1610]: *phKey = 0x54
> The extra data here is dumping out interesting values inside the
> parameters passed to C_UnwrapKey. The list of CKA_XXXX = XXXXX are
> dumping out the template sent in pTemplate. CKA_VALUE_LEN looks wrong
> to me, I don't know if tehre is some byteswap issue here.
I've checked nearly every C_UnwrapKey function trace in the log ... each
of them is showing 'CKA_VALUE_LEN = 3000000000000000 [8]'.
Every C_DeriveKey function trace shows 'CKA_VALUE_LEN = 1000000000000000
[8]'
>
> mechansim CKM_DES3_ECB means you are using DES to do unwrap, so this
> appears to be a restart operation, rather than a full handshake.
Looking at the SSLTrace file this seems to be the case. I don't know if
I can correlate the SSLTRACE file with the PKCS#11 log file, but it
seems the issue only appears in restart operations. Full handshakes
don't show the issue.
from SSLTRACE log
=================
10891: SSL[107778448]: sending client-hello
SSL: Lookup1: sid=0x36a9080
10891: SSL[107778448]: client, found session-id: [Len: 16]
20 4d 2d a6 3a 3c 67 1a 60 c8 e5 18 0d ea d4 e2 M-.:<g.`.......
10891: SSL3[107778448]: send client_hello handshake
10891: SSL3[107778448]: start handshake hashes
10891: SSL[107778448]: SecureSend: returning -1 count, error -8023
10891: SSL[107778448]: closing, rv=0 errno=-8023
10891: SSL[107778448]: sending client-hello
SSL: Lookup1: sid=0x36a9080
10891: SSL[107778448]: client, found session-id: [Len: 16]
20 4d 2d a6 3a 3c 67 1a 60 c8 e5 18 0d ea d4 e2 M-.:<g.`.......
10891: SSL3[107778448]: send client_hello handshake
10891: SSL3[107778448]: start handshake hashes
10891: SSL[107778448]: SecureSend: returning -1 count, error -8023
10891: SSL[107778448]: closing, rv=0 errno=-5966
This looks also interesting: PR_NO_ACCESS_RIGHTS_ERROR ??
It might be helpfull if SSLTRACE and PKCS#11 could log a timestamp to
help in correlation. Or is there already a way to correlate it?
>> 331569088[1bd1610]: rv = CKR_OK 331569088[1bd1610]: C_DeriveKey
>> 331569088[1bd1610]: slotID = 0x1 331569088[1bd1610]: flags =
>> 0x4 331569088[1bd1610]: pApplication = 0331569088
>> 331569088[1bd1610]: Notify = 0x13231f333 331569088[1bd1610]:
>> phSession = 0x7fffc 331569088[1bd1610]: phKey = 0x36c1618
>> 331569088[1bd1610]: CKA_CLASS = CKO_SECRET_KEY [8]
>> 331569088[1bd1610]: CKA_KEY_TYPE = 0x12 [8] 331569088[1bd1610]:
>> *phSession = 0x132317f8 331569088[1bd1610]: rv =
>> CKR_DEVICE_ERROR 331569088[1bd1610]: mechanism =
>> CKM_TLS_KEY_AND_MAC_DERIVE
> Here you are deriving the SSL session keys. A session handle of
> 0x132317f8 is pretty large, but not unexpected in a server that has
> been running a while.
Well the server is only running 22 minutes until the issue arises ...
might this be 'a while'?
During this time only 28 'send client_hello handshake' messages appear
in the SSLTRACE log. 19 of them show 'closing, rv=0 errno=0', 1
'closing, rv=0 errno=-5966', the others 'closing, rv=0 errno=-8023'.
So no much connection at all.
>
> CKR_DEVICE_ERROR can happen in a CKM_TLS_KEY_AND_MAC_DERIVE in a
> couple of pathological cases:
>
> 1) if the underlying object structure is hosed. 2) if the key object
> looks bogus (a token Object that isn't known or stored, for
> instance).
>
> This object should be a token object. Unfortunately it looks like
> the logging program did not out put hKey (or it did, but we ran into
> some overwrite do to threading... the missing new lines after the
> pApplication, Nofiy, and phSession) seems to indicate that. (In fact
> Notify should be pNotify, and that looks like part of a
> C_OpenSession() call, not a C_DeriveKey()... which makes some of the
> conclusions we are trying to draw questionable.
>
>> 331569088[1bd1610]: C_DigestInit 331569088[1bd1610]: hSession =
>> 0x1 331569088[1bd1610]: pMechanism = 0x7fffcd593290
>> 331569088[1bd1610]: mechanism = CKM_MD5 331569088[1bd1610]:
>> rv = CKR_DEVICE_ERROR
> I can see no place where NSC_Digest can even return CKR_DEVICE_ERROR
So I can not trust the log?
>> 331569088[1bd1610]: C_OpenSession 331569088[1bd1610]: slotID =
>> 0x1 331569088[1bd1610]: flags = 0x4 331569088[1bd1610]:
>> pApplication = 0x4bdff70 331569088[1bd1610]: Notify = 0x13231f30
>> 331569088[1bd1610]: phSession = 0x7fffcd593228
>> 331569088[1bd1610]: *phSession = 0x132317f8 331569088[1bd1610]:
>> rv = CKR_DEVICE_ERROR
> This looks like a normal C_OpenSession call...).
>
>
>> OS is CentOS 5.5 64bit.
>
> Arg, I just remembered. CentOS 5.5 will likely be following RHEL 5.5
> in having an NSS 3.11 softoken.
This is only the OS ... the library is not using NSS/NSPR from the OS.
It is using a bundled one.
strings libsoftokn3.so | grep -i header
$Header: NSS 3.12.4.5 (debug) Jan 10 2011 17:14:10
I've just checked and NSC_DigestInit
> still cannot return CKR_DEVICE_ERROR in NSS 3.11 -- unless you are
> running in FIPS mode. This is possible, but seem extremely unlikely
> since you are using slotid=1 (FIPS slot ID is 3). In FIPS mode, any
> call can fail with CKR_DEVICE_ERROR if the token is in 'fatalError'
> mode. That happens if any of the following fail: 1) the Random number
> generator. 2) the Seeding the random number generator. 3) generating
> a key pair fails with CKR_GENERAL_ERROR 4) power on self check fails
> at init time. It can be reset by reinitializing the token. If this is
> what is happening, you should be getting NSS messages in your syslog.
> I don't think FIPS is your issue here looking at the logs, however.
I don't think it's running in FIPS mode .. this has to be configured
AFAIK and noone did it.
What could be a next step to narrow it down? Give 3.12.9 a try?
TIA,
Bernhard
today I a built a debug version of NSS 3.12.8 (as I haven't found 3.12.9
yet)
The issue is still there, but occours much later then with 3.12.5.
Server (with lib using NSS) ran about 1.5 hours before the issue
occoured. During this time 911 SSL connections have been done. The last
6 failed (and then I stopped).
It seems SSLTRACE log has been truncated ... is there a limit? (It shows
much less connections as the lib's log does).
PKCS#11 logger now seems to be ok, but shows 'CKR_DEVICE_ERRORS' for
other functions as for 3.12.5.
822564800[1581690]: C_OpenSession
822564800[1581690]: slotID = 0x1
822564800[1581690]: flags = 0x4
822564800[1581690]: pApplication = 0x1884150
822564800[1581690]: Notify = 0x30468315
822564800[1581690]: phSession = 0x7fffb77fd208
822564800[1581690]: *phSession = 0x1884150
822564800[1581690]: rv = CKR_DEVICE_ERROR
822564800[1581690]: C_DigestInit
822564800[1581690]: hSession = 0x1
822564800[1581690]: pMechanism = 0x7fffb77fd260
822564800[1581690]: mechanism = CKM_MD5
822564800[1581690]: rv = CKR_DEVICE_ERROR
822564800[1581690]: C_OpenSession
822564800[1581690]: slotID = 0x1
822564800[1581690]: flags = 0x4
822564800[1581690]: pApplication = 0x1884150
822564800[1581690]: Notify = 0x30468315
822564800[1581690]: phSession = 0x7fffb77fd208
822564800[1581690]: *phSession = 0x1884150
822564800[1581690]: rv = CKR_DEVICE_ERROR
822564800[1581690]: C_DigestInit
822564800[1581690]: hSession = 0x1
822564800[1581690]: pMechanism = 0x7fffb77fd260
822564800[1581690]: mechanism = CKM_SHA_1
822564800[1581690]: rv = CKR_DEVICE_ERROR
...
..
Once the 'CKR_DEVICE_ERROR' occoured it seems to be unrecoverable.
Could this be a file-system problem?
TIA,
Bernhard
What is the actual client software you are running? I sounds like
someone is overwriting some buffers and causing very bizarre behavior.
That sounds like exactly the behavior of the FIPS token if it hits the
unrecoverable error state, but clearly you are using Slot 1, not Slot 3
(the FIPS slot). None of this matches anything I would expect to see
with NSS talking to softoken.;(.
>
> TIA,
> Bernhard
I looked into the CKA_VALUE_LEN issue. I found that our PKCS #11
logger prints CKA_VALUE_LEN as a hex dump. So
3000000000000000 [8]
means a sequence of 8 bytes:
0x30 0x00 0x00 0x00 0x00 0x00 0x00 0x00
On a little-endian system (such as x86/x64), this is 48 decimal, which
is the length of the SSL/TLS master secret.
I filed an NSS bug https://bugzilla.mozilla.org/show_bug.cgi?id=625491
to improve the logging of CKA_VALUE_LEN, with a patch.
Wan-Teh
You can add 'timestamp' to the NSPR_LOG_MODULES environment variable. See
http://www.mozilla.org/projects/nspr/reference/html/prlog.html#25306
This should add timestamps to PKCS #11 logging. I don't know if
SSLTRACE is based on NSPR logging...
Wan-Teh
On 01/13/2011 10:40 PM, Robert Relyea wrote:
--snip--
> What is the actual client software you are running?
the 'client' is the OpenSSO web-agent (a lib) used by Apache httpd.
It merley does the following ...
PR_SetConcurrency(4);
while (len > 0) {
PRInt32 bytesToWrite;
PRInt32 bytesWritten;
if (len > static_cast<std::size_t>(MAX_READ_WRITE_LEN)) {
bytesToWrite = MAX_READ_WRITE_LEN;
} else {
bytesToWrite = static_cast<PRInt32>(len);
}
bytesWritten = PR_Write(socket, data, bytesToWrite);
if (bytesWritten < 0) {
status = AM_NSPR_ERROR;
break;
}
data += bytesWritten;
len -= static_cast<std::size_t>(bytesWritten);
}
the socket referenced is created using the following function
PRFileDesc *createSocket(const PRNetAddr& address, bool useSSL,
const std::string &certDBPasswd,
const std::string &certNickName,
bool alwaysTrustServerCert);
char *growBuffer(char *oldBuffer, std::size_t oldBufferLen,
std::size_t newBufferLen);
am_status_t read(char *buffer, std::size_t& bufferLen);
static bool initialized;
PRFileDesc *socket;
char *certdbpasswd;
char *certnickname;
};
I sounds like
> someone is overwriting some buffers and causing very bizarre behavior.
Hmm, why could this 'buffer overwriting' only occour on this
configuration .. it does not happen on any other?!?
-- snip --
Is the output of 'PKCS#11' reliable?
AFAI have understood the code 'NSC_OpenSession' does not return
'CKR_DEVICE_ERROR' ... but I'm not a C-programmer.
Unfortunately 'printfs' are not passed through by Apache httpd ... I'll
try to use PR_LOG, but the question is will this be helpfull...
>>
>> TIA,
>> Bernhard
>
>
Just to be sure: we're talking of this code here, right?
Maybe Bob or Nelson can spot a problem when skimming over that code...?
I have another question/thought, however: what version of Apache httpd
and what MPM are you using? Is it possible that the
Connection::initialized boolean might not be shared among the httpd
processes, resulting in multiple (concurrent) NSS initializations?
Also, are "CertDir" and "dbPrefix" set in your configuration? If not,
Connection::initialize() would call NSS_NoDB_Init, which isn't intended
for SSL operations, from what I understand
(http://www.mozilla.org/projects/security/pki/nss/ref/ssl/sslfnc.html#1234224).
Kaspar
On 01/16/2011 12:16 PM, Kaspar Brand wrote:
> On 14.01.2011 10:24, Bernhard Thalmayr wrote:
>> the 'client' is the OpenSSO web-agent (a lib) used by Apache httpd.
>
> Just to be sure: we're talking of this code here, right?
yes
>
> http://sources.forgerock.org/browse/openam/trunk/opensso/products/webagents/am/source/connection.cpp?r=HEAD&content=true
>
> Maybe Bob or Nelson can spot a problem when skimming over that code...?
> I have another question/thought, however: what version of Apache httpd
Apache httpd 2.2.17
> and what MPM are you using?
Worker MPM is used , but is configured so start multiple processes (default)
Is it possible that the
> Connection::initialized boolean might not be shared among the httpd
> processes, resulting in multiple (concurrent) NSS initializations?
Would this be a problem? The Agent codes has always worked this way for
multi-process servers.
>
> Also, are "CertDir" and "dbPrefix" set in your configuration?
They are not set.
If not,
> Connection::initialize() would call NSS_NoDB_Init, which isn't intended
> for SSL operations, from what I understand
> (http://www.mozilla.org/projects/security/pki/nss/ref/ssl/sslfnc.html#1234224).
Hmm if this is true then the Agent code should exit and should not call
'NSS_NoDB_Init'. However how could 911 handshakes be made if this holds
true?
Thanks,
Bernhard
>
> Kaspar
Not sure. But when you say "has always worked this way for multi-process
servers", does this mean it worked with previous Apache httpd versions?
Did you also try with the prefork MPM (which is the default for the
CentOS-bundled version of httpd)?
>> Also, are "CertDir" and "dbPrefix" set in your configuration?
>
> They are not set.
>
> If not,
>> Connection::initialize() would call NSS_NoDB_Init, which isn't intended
>> for SSL operations, from what I understand
>> (http://www.mozilla.org/projects/security/pki/nss/ref/ssl/sslfnc.html#1234224).
>
>
> Hmm if this is true then the Agent code should exit and should not call
> 'NSS_NoDB_Init'. However how could 911 handshakes be made if this holds
> true?
Well, unless I'm completely misreading the following code,
Connection::initialize will proceed with NSS_NoDB_Init if certDir is not
set:
> if (certDir.length() != 0) {
> Log::log(Log::ALL_MODULES, Log::LOG_DEBUG, "Connection::initialize() "
> "calling NSS_Initialize() with directory = \"%s\" and "
> "prefix = \"%s\"", certDir.c_str(), dbPrefix.c_str());
> nssMethodName = "NSS_Initialize";
> secStatus = NSS_Initialize(certDir.c_str(), dbPrefix.c_str(),
> dbPrefix.c_str(), "secmod.db",
> NSS_INIT_READONLY|NSS_INIT_FORCEOPEN);
> } else {
> Log::log(Log::ALL_MODULES, Log::LOG_DEBUG, "Connection::initialize() "
> "CertDir and dbPrefix EMPTY -- Calling NSS_NoDB_Init");
> nssMethodName = "NSS_NoDB_Init";
> secStatus = NSS_NoDB_Init(NULL);
> }
Kaspar
I had been employed at Sun/iPlanet for last 10 years and I've never run
into this issue ....
> Did you also try with the prefork MPM (which is the default for the
> CentOS-bundled version of httpd)?
Not yet...
I meant it might be a bug in Agent code to call 'NSS_NoDBInit' ...
however this code has been there for some years already.
If this is a bug ... how can SSL handshake work at all? It does not fail
over time, but it fails somehwere in time.
If the error once occured NSS can not recover from it.
I'm not 100% sure but it seems the error does only occur on SSL restarts
.. not on full handshakes.
-Bernhard
>
> Kaspar
One explanation I can think of is that it would only break with more
recent versions of NSS, due to stricter application behavior being
enforced? (https://bugzilla.mozilla.org/show_bug.cgi?id=331096 comes to
mind, e.g. - but that's just a [very] wild guess).
I'll defer to the real experts (Bob, Nelson, Wan-Teh) for authoritative
answers on this, however.
> I'm not 100% sure but it seems the error does only occur on SSL restarts
> .. not on full handshakes.
Perhaps it can do full handshakes even if it has been init'ed with
NSS_NoDBInit, but no session resumptions? (Again, just guessing.)
What you could also try is creating an empty DB (certutil -N -d somedir)
and configure that for the agent. Would be interesting to see if it
makes a difference.
Kaspar
Not really ... I've already created a debug build for NSS 3.12.8 (as
this seems to fix some PKCS#11 logger issues).
The issue does occour later, but it occurs.
(https://bugzilla.mozilla.org/show_bug.cgi?id=331096 comes to
> mind, e.g. - but that's just a [very] wild guess).
>
> I'll defer to the real experts (Bob, Nelson, Wan-Teh) for authoritative
> answers on this, however.
>
>> I'm not 100% sure but it seems the error does only occur on SSL restarts
>> .. not on full handshakes.
>
> Perhaps it can do full handshakes even if it has been init'ed with
> NSS_NoDBInit, but no session resumptions? (Again, just guessing.)
Well, the error does not occour on every restart ... that's the problem.
It's not reproducable ... the agent just sends requests to a secure
server and on some point in time NSS can not recover.
>
> What you could also try is creating an empty DB (certutil -N -d somedir)
> and configure that for the agent. Would be interesting to see if it
> makes a difference.
I'll give this a try.
Thanks,
Bernhard
>
> Kaspar