Hi all,
I am performing some stress testing with swtpm in CUSE mode, and I
found a malfunction when stressing swtpm too hard. When swtpm prepares the response buffer it
uses a wrong offset. My findings point to a race condition caused by timing.
Please, let me know your thoughts.
My setup:
- VM with 2 cores, 2 GB memory, SUSE Linux Enterprise SP3
- swtpm runs in a Docker container based on Alpine Linux
- a client written in C using libtss2 (https://github.com/tpm2-software/tpm2-tss). For convenience I attached the client source code (tpm_stress.zip) to this e-mail.
Test:
Loop
- Create a SAPI context with the device TCTI (/dev/tpm0)
- Get capabilities
- Read the target PCR
- Extend a digest into the target PCR
- Read the target PCR again to verify the previous extend operation
- Release the TCTI/SAPI context
While debugging swtpm I found that the root cause is two 'read' threads executing the function ptm_read_result, both of which modify the ptm_read_offset variable.
I instrumented the code to help debugging as follows; each log line contains:
- the timestamp
- the thread id
- the line number and file name (do not rely on the line numbers, because my instrumentation shifted them relative to the base code)
Notice that the issue shows up because of thread id 11:
">> 2022-08-05 12:32:51.023492 - [11] [ 626:cuse_tpm.c]in ptm_read_result ptm_read_offset 19 len 19"
This execution increments the read offset while the other read thread (thread id 12) is about to use ptm_read_offset.
---------
2022-08-05 12:32:51.023186 - [35] [ 974:cuse_tpm.c]ptm_req_len 53, ptm_res_len 0, ptm_res_tot 4096
2022-08-05 12:32:51.023207 - [35] [ 979:cuse_tpm.c]write in [53][800200000035000001820000000800000009400000090000000000000000010004000102030405060708090A0B0C0D0E0F10111213]
2022-08-05 12:32:51.023217 - [35] [1003:cuse_tpm.c]
>>2022-08-05 12:32:51.023346 - [11] [ 596:cuse_tpm.c]ptm_read_offset 0, ptm_res_len 19
>>2022-08-05 12:32:51.023366 - [11] [ 601:cuse_tpm.c]read out [19][80020000001300000000000000000000010000]
2022-08-05 12:32:51.023445 - [28] [ 974:cuse_tpm.c]ptm_req_len 20, ptm_res_len 0, ptm_res_tot 4096
2022-08-05 12:32:51.023464 - [28] [ 979:cuse_tpm.c]write in [20][8001000000140000017E00000001000403000100]
>> 2022-08-05 12:32:51.023492 - [11] [ 626:cuse_tpm.c]in ptm_read_result ptm_read_offset 19 len 19
2022-08-05 12:32:51.023473 - [28] [1003:cuse_tpm.c]
-------
The failure rate is very irregular: sometimes it fails at iteration 100, 500, 10000 and so on.
I changed a small piece of code and it solved the issue I was facing. Now the loop executes successfully; my last run was 500,000,000 (five hundred million) iterations.
What I changed:
in cuse_tpm.c:
From:
    fuse_reply_buf(req, (const char *)&ptm_response[ptm_read_offset], len);
    ptm_read_offset += len;
To:
    size_t local_offset = ptm_read_offset;
    ptm_read_offset += len;
    fuse_reply_buf(req, (const char *)&ptm_response[local_offset], len);
Basically, I increment ptm_read_offset before calling fuse_reply_buf. This prevents the situation where the first reading thread is pre-empted just after calling fuse_reply_buf and, when resumed some time later, executes the read-offset increment after the 'writing' thread has already zeroed the read offset.
Of course, there are other ways to address this issue, e.g. critical regions (locking), but I did not go deeper.
Thanks,
Marcos Antonio de Moraes