## Environment
- BeeGFS version: 8.2.2
- OS (servers): Ubuntu 22.04, kernel 5.15.0-164-generic
- OS (clients): Ubuntu 22.04, kernel 5.15.0-25-generic
- Hardware: 5 meta/storage servers (32 cores, 192GB RAM, NVMe, ConnectX-7 100GbE)
- Clients: 2 (one RDMA via ConnectX-6 Dx, one TCP-only 10GbE)
- Meta storage: ext4 on NVMe
## Description
When a client opens files from 128 concurrent threads in the same directory, the metadata server occasionally returns ENOENT for files that definitely exist. The files are immediately accessible on retry.
## Reproduction
1. A directory containing 84-1400 files (e.g., JPEG images)
2. A multi-threaded application opens files listed in a CSV, using 128 threads
3. Approximately 1-2 out of every 9,000 open() calls return ENOENT
4. The failing file varies randomly between runs — it is never the same file twice
5. Immediately after the failure, the same file can be opened successfully
Example output (128-thread application reading 9,428 files across multiple directories):

    Warning: Failed to open file for reading!
    File: /mnt/bee/data/ins_inf/ecu/edualexi/e7d6f1d67ea62cc29eb669e91dd1baf94b5a9f4a.jpg
    Files: 9428 Templates: 9427

The file exists and is world-readable:

    -rwxrwxrwx+ 1 ben ben 131113 Mar 6 2023 /mnt/bee/data/.../file.jpg
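
The reproduction steps above can be approximated with a small stand-alone script. This is a minimal sketch, not the application that produced the output above: the function names and the CSV layout (one path in the first column of each row) are assumptions made for illustration. Python threads issue fewer `open()` calls per second than a native application, so several passes may be needed, and it is not confirmed that this script triggers the failure.

```python
import csv
import errno
import os
from concurrent.futures import ThreadPoolExecutor


def try_open(path):
    """Open and close one file; return the errno on failure, or 0 on success."""
    try:
        fd = os.open(path, os.O_RDONLY)
    except OSError as exc:
        return exc.errno
    os.close(fd)
    return 0


def find_transient_enoent(paths, threads=128):
    """Open every path concurrently and return the paths that failed with
    ENOENT but opened successfully on an immediate second attempt."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        results = list(pool.map(try_open, paths))
    failed = [p for p, err in zip(paths, results) if err == errno.ENOENT]
    return [p for p in failed if try_open(p) == 0]


def read_paths(csv_file):
    """Assumed CSV layout: one file path in the first column of each row."""
    with open(csv_file, newline="") as handle:
        return [row[0] for row in csv.reader(handle) if row]
```

Calling `find_transient_enoent(read_paths("files.csv"))`, where `files.csv` is a hypothetical file list, returns only the paths that hit the transient failure. Paths that are genuinely missing fail the second attempt as well and are excluded.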
## Key findings
- **No client-side communication errors**: `dmesg` on the client shows zero BeeGFS messages during the failure. The meta server sends a clean ENOENT response (not a connection timeout or retry).
- **Happens across multiple meta targets**: Failures occur on directories owned by different meta nodes (m:1, m:3, etc.), ruling out a single faulty server.
- **Happens with both RDMA and TCP clients**: Reproduced on a 100GbE RDMA client and a 10GbE TCP-only client.
- **Not related to caching**: Setting `tuneENOENTCacheValidityMS=0`, `tuneFileSubentryCacheValidityMS=0`, and `tuneDirSubentryCacheValidityMS=0` on the client does not fix it.
- **Not related to server capacity**: Meta servers are 95-99% CPU idle during reproduction. Raising `connMaxInternodeNum` (64→256) and `tuneNumStreamListeners` (8→32), setting `tuneNumWorkers` to 128, and enabling `tuneUsePerUserMsgQueues=true` did not fix it.
- **Workaround**: An LD_PRELOAD shim that retries open() on ENOENT after a 2ms delay succeeds on the first retry every time, indicating the ENOENT condition is transient and clears within roughly 2ms.
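
For applications that cannot use an LD_PRELOAD shim, the same retry logic can be applied at the application level. The following is a minimal sketch of that idea, not the shim itself: the function name `open_with_enoent_retry` and its parameters are hypothetical, and the defaults (one retry, 2 ms delay) are taken from the workaround described above.

```python
import errno
import os
import time


def open_with_enoent_retry(path, flags=os.O_RDONLY, retries=1, delay_s=0.002):
    """Call os.open, retrying after a short sleep if it fails with ENOENT.

    Any other error, or ENOENT on the final attempt, is raised unchanged.
    """
    for attempt in range(retries + 1):
        try:
            return os.open(path, flags)
        except OSError as exc:
            # Only ENOENT is treated as possibly transient.
            if exc.errno != errno.ENOENT or attempt == retries:
                raise
            time.sleep(delay_s)
```

The cost of this approach is that opening a file that is genuinely missing takes `retries * delay_s` longer before the error is reported, so it hides the symptom rather than fixing the server-side behavior.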
## Expected behavior
The meta server should never return ENOENT for a file that exists, regardless of concurrent lookup load on the same directory.
## Suspected cause
Possibly a race condition in the meta server's internal directory entry lookup when many requests access the same directory concurrently. The per-directory locking or hash walk may briefly return "not found" while another operation is in progress.