I'm working on developing a transport protocol layered directly atop IP. While running an experiment to see how routers on the internet react to an undefined protocol number, I observed some unusual behavior.
Simple enough, right? Apparently not: what I'm observing is that although the packets are being received, there is a pattern wherein the first 4 bytes of every other packet's payload are zeroed out. Below is a sample:
My current hypothesis is that the problem lies with one of the routers both hosts sit behind, so I'm hoping someone here with knowledge of their firmware implementation can shed some light on this. (The fact that it's the first 4 bytes suggests to me it may have to do with NAT, since that offset corresponds to the source/destination ports of TCP/UDP.)
It turns out that one of the routers (a Billion 7300GRA) was to blame; its handling of packets with a protocol ID of 0xFD was responsible for the observed behavior, and after simply switching to a different ID (only 0x8F tested) the problem ceased.
Everything works great during normal operation, but when sending a large number of messages (over 100 messages/second) the WebSocket connection starts sending corrupted payloads. Messages vary in size from 4 kB to over 100 kB.
From message 38765 onward things go wrong. That message is fine up to character 49139 and malformed after that. The next messages the connection sends are malformed as well and have a different frame type.
Some months ago I got TCP zero-copy working in my application, thanks to Hein and some contributors on the forum. Now I have tried the same with UDP zero-copy to gain speed, and it seems that the payloads randomly get corrupted. It happens very infrequently, around once every 2 million packets. My application is very sensitive to packet loss, and even one bad packet every couple of million creates a problem. When using non-zero-copy I have no payload corruption issues at all.
1- I have a dedicated Ethernet (double-shielded) connection between the computer where the application receives data and the STM32H7 acquisition board. I see no packet loss at all, just payload corruption.
3- I fill the data into the payload buffers in a SAI ISR and send the packets from a task. I use a task notification scheme and make sure that the SAI ISR never overwrites a payload that is being sent by the task.
Hi, the corruption happens in the payload, not in the UDP header. The packet itself seems to be fine, as it is properly received by my application in LabVIEW, where I gather and process the packets. I notice the corruption both in LabVIEW and when debugging the STM32H7, since I have a packet counter in the first 4 bytes of the payload and can stop exactly when I see the sequence is broken. The packet counter (as well as the rest of the payload) contains random data. I could capture a packet and show you the payload, but I do not think you would be able to extract any valuable info from it.
The packet counter was corrupted both before and after the payload was sent. I will repeat the test this evening to confirm that the data is equally corrupted both before and after sending the packet, but I think the problem happens somewhere else.
A common technique to catch memory corruption is to place a variable next to the one getting corrupted and put a data breakpoint on it. Is it possible for you to do that? Also, can you share a code snippet to help us understand what memory is getting corrupted?
I will do the test this afternoon, but I doubt it comes from the application, as it does not touch the Ethernet buffer array from FreeRTOS+TCP except in an ISR for zero-copy transfers. When doing non-zero-copy I use my own application buffer array, which never gets corrupted, and it is written in the exact same way from the ISR.
Good point, Hartmut, I will check how I set up the memory protection areas. I thought that even when doing non-zero-copy my local buffer would eventually be copied into the Ethernet buffer structure, so if the issue were cache coherency in a cacheable area I should see problems there as well. I will have a look today and give feedback.
When using zero-copy, the SAI will write to a different region of RAM. Can you tell which memory banks are involved? The ETH peripheral has complete access to that memory, but does the SAI also have full access? Does that work reliably?
2- The SAI DMA has its own very small buffers in a region it can access. I manually copy those small buffers into parts of the payload packets in each SAI ISR. For this reason it does not really matter, from the SAI's point of view, in which memory region the Ethernet buffers are.
As I have no issue at all when using my own buffers without zero-copy, I suspect the corruption comes from something happening in the Ethernet buffer (ucNetworkPackets in the ethernet_data section starting at 0x24040000), not when I send the payloads but somewhere else. As mentioned, everything typically works well; it is just a matter of one corrupted payload every 2 million packets sent (which in my application creates a big issue, though).
I will modify the code a little and do parallel buffering, both into the Ethernet buffer and into my own buffer, in the SAI ISRs, and compare the two once I detect a payload corruption. I will let you know the results.
BaseBuffAddress is an array of pointers to the payload areas pre-allocated at startup in the ucNetworkPackets buffer via the FreeRTOS_GetUDPPayloadBuffer function. The first position of each payload is a packet counter. As you can see, the first area (array position 0) is corrupted.
[attached screenshot, 114 KB: debugger view of the BaseBuffAddress payload areas]
As mentioned, I am not a hardcore Cortex-M SW developer, so this issue is really complex for me to debug. I am wondering if I get some heap/stack corruption from time to time, and I am trying to check whether and how memory watchpoints work in STM32CubeMX.
That is interesting. Since we know that this buffer is getting corrupted, could we stop copying to the secondary buffer and put data breakpoints on the locations we know get corrupted? That way we can probably catch the corruption right when it happens.
Thanks a lot for the offer, Gaurav. The thing is that I have developed a custom board with external ADCs that communicate through SAI, so getting the code to run on a NUCLEO board will be difficult due to the missing ICs.
I do have news for all of you, however. I think I might be getting closer to the root cause of the issue, at least for the Ethernet buffer (for the secondary one I really need to check what I am doing wrong).
This is amazing - it seems like we are really close. As you mentioned, the buffer you obtained using FreeRTOS_GetUDPPayloadBuffer is getting overwritten. Can you examine the values of pxBuffer and pxNewBuffer in the function pxDuplicateNetworkBufferWithDescriptor and see whether they are the same (meaning the buffer somehow got allocated twice) or whether the areas overlap (indicating some issue with the allocator)?
This turns up very frequently in Action Center. Various utilities indicate no problem with any drive attached to the computer. A Google search finds similar issues involving a loop, but I am not experiencing a (true) loop, just this false notification.
Summary:
Operation: Detect and Repair
Operation result: 0x0
Last Successful Step: Entire operation completes.
Total Detected Corruption: 1
  CBS Manifest Corruption: 0
  CBS Metadata Corruption: 0
  CSI Manifest Corruption: 0
  CSI Metadata Corruption: 0
  CSI Payload Corruption: 1
Total Repaired Corruption: 1
  CBS Manifest Repaired: 0
  CSI Manifest Repaired: 0
  CSI Payload Repaired: 1
CSI Store Metadata refreshed: True
This is no false report. The file is indeed corrupted, and DISM fixed it. I reported this issue to Microsoft some time ago, and they could see the corruption inside Microsoft too. But my contact told me the team dropped the investigation because they had no idea what causes the corruption.
I believe the DISM repair DID solve the problem. I ran it, it fixed 'something', and I exited WMIC, but the notification remained, so I assumed it was not fixed. Since then I have rebooted and have not (yet) seen the notification. I believe I can assume the problem is solved.
We regularly check running operating systems for integrity. One important component on Windows is the side-by-side store (mostly known as WinSxS). Running quite a few Windows Server 2016 servers, we saw an increasing number of systems with WinSxS corruption, with the result "component store can be repaired" when running dism /online /cleanup-image /scanhealth
When receiving this error (The component store is repairable / Der Komponentenspeicher kann repariert werden), you might be able to repair the side-by-side store directly using dism (dism /online /cleanup-image /restorehealth), which tries to fetch the needed files from the Windows Update servers. Yet chances are high that it will fail, and one of the following errors might be shown:
So, we have now extracted the MSU archive, but not yet the manifests we need. To get them, we need to extract the actual update package (the selected CAB archive in the screenshot above) within the files we already extracted.
And, to reiterate: is it possible to view a historical log of application errors? If not, I would really like to have this feature added. Otherwise, we have to log on the device end to diagnose transmit failures. Something like a basic server log of rejected messages would be very useful.
If you are publishing packets to Losant with a QoS level of 1 (we support QoS levels 0 and 1), the broker replies with an ACK when it accepts a packet. For any corruption that would cause a packet to be rejected out of hand (e.g., bad topics, an unparsable packet), the Losant broker would not respond with an ACK, and the missing ACK should be enough to know to resend.