seeing more than one cpu id on one output stream with MAP_TO_RECORDED_OUTPUT

prasun3

Sep 25, 2023, 7:27:58 AM
to DynamoRIO Users
In some of my tests, I am seeing more than one cpu id on one output stream with MAP_TO_RECORDED_OUTPUT.

The value returned by stream->get_output_cpuid matches the initial CPU_ID marker values, but at some point on some streams, we see other CPU_ID values as shown below.

Example:

Output #5 processing CPU #7                   (from get_stream_cpuid)
Output #5 saw CPU_ID marker: 7
Output #5 saw CPU_ID marker: 7
Output #5 saw CPU_ID marker: 7
...
Output #5 saw CPU_ID marker: 7
Output #5 saw CPU_ID marker: 7
Output #5 saw CPU_ID marker: 5
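
The check that produced the log above can be sketched as a small model (Python here purely for illustration; the real tool walks the drmemtrace C++ stream API, e.g. `stream->get_output_cpuid()`, and the data structures below are hypothetical):

```python
# Simplified model of the per-output-stream check described above.
# For MAP_TO_RECORDED_OUTPUT, every CPU_ID marker on an output stream
# should match that stream's recorded cpu. The real code uses the
# drmemtrace C++ scheduler API; this dict-of-lists layout is hypothetical.

def check_stream_cpuids(streams):
    """streams: dict mapping output index -> list of CPU_ID marker values.
    Returns a list of (output, expected_cpu, unexpected_cpu) anomalies."""
    anomalies = []
    for output, markers in streams.items():
        if not markers:
            continue
        # The first marker matches get_output_cpuid in the observed runs.
        expected = markers[0]
        for cpu in markers:
            if cpu != expected:
                anomalies.append((output, expected, cpu))
    return anomalies

# Mirrors the log above: output #5 expects CPU 7 but sees one marker for CPU 5.
print(check_stream_cpuids({5: [7, 7, 7, 7, 5], 4: [5, 5, 5]}))
# → [(5, 7, 5)]
```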

Kai Luk

Sep 25, 2023, 11:30:15 AM
to DynamoRIO Users
How many CPU cores were there? It could be the result of the thread being moved to a different CPU core.

Kai

Kai Luk

Sep 25, 2023, 1:25:53 PM
to DynamoRIO Users
For the mapping, MAP_TO_RECORDED_OUTPUT, "This requires an output stream count equal to the number of cores occupied by the input stream set. "


Does the output stream count equal the number of cores occupied by the input stream set?

Kai

Prasun

Sep 26, 2023, 12:13:10 AM
to Kai Luk, DynamoRIO Users
Yes, the output stream count equals the number of cores occupied by the input stream set. With fewer output streams it would error out in read_traced_schedule ("replay_as_traced_istream cpu count != output count"); with more, the extra streams would simply be unused.

Thread migration should not affect things here, since each output stream is expected to map to a particular recorded core - so whichever threads ran on core 7 will show up in stream #5.

One interesting thing I noted in this run is that the exact opposite problem occurs on output #4. It is expected to process core 5, but it sees one marker from core 7 (while on output #5, we expect to see core 7 but see one marker from core 5). Maybe Derek has some idea?

Output #4 processing CPU #5
Output #4 saw CPU_ID marker: 5
Output #4 saw CPU_ID marker: 5
Output #4 saw CPU_ID marker: 5
...
Output #4 saw CPU_ID marker: 5
Output #4 saw CPU_ID marker: 5
Output #4 saw CPU_ID marker: 7


Derek Bruening

Sep 26, 2023, 11:39:01 AM
to Prasun, Kai Luk, DynamoRIO Users
Agreed that this is not expected behavior. First, let's figure out whether the problem lies in cpu_schedule.zip (generated by raw2trace) or in the scheduler. Run invariant_checker on one of these cases: does it complain that the cpu_schedule.zip doesn't match? You can also manually look inside the cpu_schedule.zip for the problematic CPU (use "unzip -p" on the target cpu and pipe it to "od -t d8 -A x -w32") and sleuth around, but invariant_checker is supposed to provide an automated sanity check.

Prasun

Sep 27, 2023, 2:23:05 AM
to Derek Bruening, Kai Luk, DynamoRIO Users
The invariant_checker shows this error: "Trace invariant failure in T-1 at ref # 0 (0 instrs since timestamp 0): Serial schedule entry count does not match trace"

The CPU schedule looks pretty short - is this causing the issue?

$ unzip -p cpu_schedule.bin.zip 7 | od -t d8 -A x -w32
000000               239417    13340107545587152                    7                    0
000020               239420    13340107547564540                    7               194593

$ unzip -p cpu_schedule.bin.zip 5 | od -t d8 -A x -w32
000000               239420    13340107546174734                    5                    0
000020               239421    13340107546184469                    5                    0
000040               239420    13340107547555480                    5                    1
000060               239417    13340107547564608                    5               180781
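The 32-byte rows above can also be decoded programmatically. Here is a hedged Python sketch: each row appears to be four little-endian int64s, and the field names below (thread id, timestamp, cpu, start-instruction ordinal) are my guesses from the dump columns, not taken from the drmemtrace headers.

```python
import struct

# Decode the 32-byte records shown by the "od -t d8 -A x -w32" dumps above.
# Each record is read as four little-endian int64s; the field interpretation
# is an assumption inferred from the dump, not from the drmemtrace source.

def read_schedule(data: bytes):
    recs = []
    for off in range(0, len(data) - len(data) % 32, 32):
        tid, ts, cpu, start = struct.unpack_from("<4q", data, off)
        recs.append({"tid": tid, "timestamp": ts, "cpu": cpu, "start": start})
    return recs

# First record of the cpu-7 component above, packed for demonstration;
# in practice one could feed this the bytes from "unzip -p cpu_schedule.bin.zip 7".
blob = struct.pack("<4q", 239417, 13340107545587152, 7, 0)
print(read_schedule(blob))
```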

Derek Bruening

Sep 27, 2023, 11:59:37 AM
to Prasun, Kai Luk, DynamoRIO Users
So the serial schedule is also incorrect (it's a fatal error, so the checker didn't make it to the cpu schedule checks; you could remove the abort to keep going). What I would do is get into the details of what is different: is the count off by 1 or by hundreds; where does it diverge; why is raw2trace getting it wrong? It's fairly straightforward: it just looks at all the timestamp,cpu pairs and collapses same-thread adjacent ones. Is this the latest DR, with the additional timestamps added by PR #6290? Does this issue not happen prior to that change?
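The collapsing step Derek describes might be modeled like this (a simplified Python sketch of the idea, not the actual raw2trace code; the tuple layout is hypothetical):

```python
# Simplified model of the collapse described above: order all
# (timestamp, tid, cpu) triples, then merge adjacent entries where the
# same thread stays on the same cpu. Not the actual raw2trace code.

def collapse(entries):
    """entries: list of (timestamp, tid, cpu). Returns the reduced schedule."""
    ordered = sorted(entries)  # by timestamp, then tid, then cpu
    out = []
    for ts, tid, cpu in ordered:
        if out and out[-1][1] == tid and out[-1][2] == cpu:
            continue  # same thread still on the same cpu: collapse
        out.append((ts, tid, cpu))
    return out

# Thread 1 runs twice in a row on cpu 0, then thread 2 preempts it,
# then thread 1 returns: the two adjacent thread-1 entries merge.
print(collapse([(100, 1, 0), (110, 1, 0), (120, 2, 0), (130, 1, 0)]))
# → [(100, 1, 0), (120, 2, 0), (130, 1, 0)]
```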

Prasun

Oct 3, 2023, 2:41:05 AM
to Derek Bruening, Kai Luk, DynamoRIO Users
This is what I am seeing (snipped, since there are a lot of warnings after I removed the abort, or rather the return statement after the check):

Serial schedule: read 2658 records from the file and observed 3039 transition in the trace

Trace invariant failure in T-1 at ref # 0 (0 instrs since timestamp 0): Serial schedule entry count does not match trace
Saw T93824993851234 on 0 @93824993851372 93824997716400 vs file T57266 on 0 @13340785386080432 0
Trace invariant failure in T57266 at ref # 0 (0 instrs since timestamp 0): Serial schedule entry does not match trace
Saw T93824993846210 on 93824993846290 @93824993846252 93824993846332 vs file T57268 on 7 @13340785391018579 0
Trace invariant failure in T57268 at ref # 0 (0 instrs since timestamp 0): Serial schedule entry does not match trace
Saw T93824993846348 on 93824993846468 @93824993846452 93824993846484 vs file T57269 on 5 @13340785391522391 0
Trace invariant failure in T57269 at ref # 0 (0 instrs since timestamp 0): Serial schedule entry does not match trace
Saw T93824993846500 on 93824993846532 @93824993846516 93824993846548 vs file T57270 on 3 @13340785391595193 0
Trace invariant failure in T57270 at ref # 0 (0 instrs since timestamp 0): Serial schedule entry does not match trace
Saw T93824992605428 on 0 @93824992605444 0 vs file T57271 on 0 @13340785391598500 0
Trace invariant failure in T57271 at ref # 0 (0 instrs since timestamp 0): Serial schedule entry does not match trace
Saw T93824994479664 on 93824994479681 @93824994479672 93824994479688 vs file T57272 on 0 @13340785391608559 0


Prior to that change, I don't see the problem I reported, but I do see invariant_checker failures there too (again, this is snipped):

Serial schedule: read 1519 records from the file and observed 1522 transition in the trace

Trace invariant failure in T-1 at ref # 0 (0 instrs since timestamp 0): Serial schedule entry count does not match trace
Saw T93824993841786 on 0 @93824993841942 93824997646776 vs file T67777 on 0 @13340787428857605 0
Trace invariant failure in T67777 at ref # 0 (0 instrs since timestamp 0): Serial schedule entry does not match trace
Saw T93824993836618 on 93824993836698 @93824993836660 93824993836740 vs file T67779 on 7 @13340787433849554 0
Trace invariant failure in T67779 at ref # 0 (0 instrs since timestamp 0): Serial schedule entry does not match trace
Saw T93824993836756 on 93824993836876 @93824993836860 93824993836892 vs file T67780 on 1 @13340787434358291 0
Trace invariant failure in T67780 at ref # 0 (0 instrs since timestamp 0): Serial schedule entry does not match trace
Saw T93824993836908 on 93824993836940 @93824993836924 93824993836956 vs file T67781 on 5 @13340787434430479 0
Trace invariant failure in T67781 at ref # 0 (0 instrs since timestamp 0): Serial schedule entry does not match trace
Saw T93824992605204 on 0 @93824992605220 0 vs file T67782 on 2 @13340787434433857 0
Trace invariant failure in T67782 at ref # 0 (0 instrs since timestamp 0): Serial schedule entry does not match trace
Saw T0 on 93824994467744 @93824994467736 93824994467753 vs file T67783 on 4 @13340787434446804 0

Derek Bruening

Oct 3, 2023, 10:56:15 AM
to Prasun, Kai Luk, DynamoRIO Users
As you can see, it has what looks like a timestamp as the thread id: "Saw T93824993851234.. vs file T57266". Is that really a thread id in the trace file? Where did that come from?

Prasun

Oct 4, 2023, 5:32:18 AM
to Derek Bruening, Kai Luk, DynamoRIO Users
Sorry, those lines should not have been printed. I had modified the invariant checker to keep going, but the change was incorrect and those lines printed uninitialized data.

I checked further and found that the invariant checker produces a different order of records, and maybe that's why the reduced list size differs. For example, here the serial schedule file has 2690 records, but the size seen in the invariant checker is only 2684.

        Serial schedule: read 2690 records from the file and observed 3043 transition in the trace. Serial redux size = 2684

I added some code to dump the schedule in invariant_checker, and it shows a difference in the order of the records.

For example

21,22c21,22
< 0001200               125859    13340877902486153                    1                    1
< 0001240               125856    13340877902486153                    0                    1
---
> 0001200               125856    13340877902486153                    0                    1
> 0001240               125859    13340877902486153                    1                    1

Then I noticed that the compare function used in raw2trace was different from the one used in invariant_checker. I updated raw2trace to use the same function. This fixed the invariant failure but did not change the original scheduling issue.
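The effect of such a comparator mismatch can be illustrated with a toy example (hypothetical comparators in Python, not the actual raw2trace/invariant_checker code): when two records tie on timestamp, as in the diff above where both rows share timestamp 13340877902486153, different tie-breakers yield different orders, and an order-sensitive collapse then produces different reduced counts.

```python
# Two ways of sorting (timestamp, tid) records that differ only in how
# they break timestamp ties. Hypothetical comparators for illustration.

entries = [(5, "B"), (5, "A"), (6, "A")]

by_ts_only = sorted(entries, key=lambda e: e[0])          # stable: ties keep input order
by_ts_then_tid = sorted(entries, key=lambda e: (e[0], e[1]))  # ties broken by tid

def collapse(seq):
    """Merge adjacent entries from the same thread (order-sensitive)."""
    out = []
    for e in seq:
        if not out or out[-1][1] != e[1]:
            out.append(e)
    return out

# The same records collapse to different sizes depending on tie-breaking:
print(len(collapse(by_ts_only)), len(collapse(by_ts_then_tid)))
# → 2 3
```

This mirrors how the serial schedule file and the invariant checker could arrive at different reduced list sizes (2690 vs 2684) from the same underlying records.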

Derek Bruening

Oct 4, 2023, 11:28:56 AM
to Prasun, Kai Luk, DynamoRIO Users
Could you send a PR on the ordering fix?

So the invariant checker is happy with the serial schedule now, and also with the per-cpu schedule? Then, unless both it and raw2trace handled a cpuid marker incorrectly, the schedule in the file does *not* contain the problem of two cpuids on one output. Unzipping each output component in the cpu_schedule.zip should allow a manual verification too. That means the problem is inside the scheduler somewhere...

Prasun

Oct 5, 2023, 4:26:10 AM
to Derek Bruening, Kai Luk, DynamoRIO Users
I have sent a PR.

I don't see any scheduler-related messages from the invariant checker.

The cpu schedule looks good. I had checked it earlier too, and I also checked another trace where I am seeing this error.

$ unzip -p cpu_schedule.bin.zip 6 | od -td8 -w32
0000000               232316    13340965089564189                    6                    0
0000040               232320    13340965090823509                    6               730891
0000100               232315    13340965106405521                    6               785137
0000140               232314    13340965106989293                    6               790175
0000200

$ unzip -p cpu_schedule.bin.zip 4 | od -td8 -w32
0000000               232320    13340965090802548                    4                    1
0000040               232316    13340965090823495                    4               715959

Prasun

Oct 10, 2023, 5:43:18 AM
to Derek Bruening, Kai Luk, DynamoRIO Users
The PR for the invariant_checker is merged, and I have filed a bug (#6354) to track the scheduler issue. It didn't look like there was a test to check for this, so I have sent a PR with a test (which is failing right now).

Derek Bruening

Oct 10, 2023, 10:39:51 AM
to Prasun, Kai Luk, DynamoRIO Users
The existing tests are in scheduler_unit_tests: test_replay_as_traced, test_replay_as_traced_from_file, etc.  They do check the exact schedule, but are small.
