Hi all,
I've been running some benchmarks using PMEM-backed ext4-DAX files as heap memory, and I've encountered a non-deterministic error that I am unsure how to fix.
The benchmarks use a multi-threaded, append-only PMEM allocator that works as follows: at thread creation, a PMEM file (the "heap file") is created on an ext4-DAX filesystem, given a large fixed size (~500 MB) with fallocate, and then mapped in its entirety as shared read/write with MAP_SYNC. Each allocation by a thread simply increments a pointer into that thread's heap file mapping and returns the previous value; memory is never released until the application exits. The benchmark itself runs various YCSB workloads on a concurrent hashmap; the error was observed with a 4-thread workload. The system is x86-64 (Cascade Lake), running Ubuntu 18.04 with kernel 4.15. The PMEM is first-generation Optane.
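For reference, the allocator described above can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual code: the names pmem_heap, pmem_heap_open and pmem_heap_alloc, the 16-byte alignment, and the use_map_sync switch (so the sketch can also be tried on a non-DAX filesystem, where MAP_SYNC is rejected) are all assumptions made for the example.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Fallbacks for older libc headers; these are the asm-generic values
 * used on x86-64. */
#ifndef MAP_SHARED_VALIDATE
#define MAP_SHARED_VALIDATE 0x03
#endif
#ifndef MAP_SYNC
#define MAP_SYNC 0x80000
#endif

/* Hypothetical per-thread heap: one file mapping plus a bump offset. */
struct pmem_heap {
    uint8_t *base;  /* start of the mapping */
    size_t size;    /* size of the heap file and of the mapping */
    size_t used;    /* bytes handed out so far */
};

/* Create and preallocate the heap file, then map all of it.
 * Returns 0 on success, -1 on failure with errno set. */
int pmem_heap_open(struct pmem_heap *h, const char *path, size_t size,
                   int use_map_sync)
{
    int fd = open(path, O_CREAT | O_RDWR, 0600);
    if (fd < 0)
        return -1;

    /* Mode 0 reserves blocks and extends the file size. */
    if (fallocate(fd, 0, 0, (off_t)size) != 0) {
        int saved = errno;
        close(fd);
        errno = saved;
        return -1;
    }

    /* MAP_SYNC is only accepted together with MAP_SHARED_VALIDATE. */
    int flags = use_map_sync ? (MAP_SHARED_VALIDATE | MAP_SYNC) : MAP_SHARED;
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, flags, fd, 0);
    int saved = errno;
    close(fd);  /* the mapping stays valid after the descriptor is closed */
    if (p == MAP_FAILED) {
        errno = saved;
        return -1;
    }

    h->base = p;
    h->size = size;
    h->used = 0;
    return 0;
}

/* Bump allocation: return the current position and advance it.
 * Nothing is ever freed. Returns NULL when the heap is exhausted. */
void *pmem_heap_alloc(struct pmem_heap *h, size_t n)
{
    if (n > SIZE_MAX - 15)
        return NULL;
    n = (n + 15) & ~(size_t)15;  /* assumed 16-byte alignment */
    if (n > h->size - h->used)
        return NULL;
    void *p = h->base + h->used;
    h->used += n;
    return p;
}
```

In this sketch each thread would own one struct pmem_heap, so pmem_heap_alloc needs no locking. With use_map_sync set to 1, the mmap call is expected to succeed only on a DAX-capable mount.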
The error is as follows: on about 20% of runs, a SIGBUS is delivered when a thread faults in the page at the beginning of the second extent in its heap file (checked using debugfs), which is uninitialized at that time. I have verified that the faulting address lies within a valid mapping according to /proc/<pid>/smaps, and that the access is aligned. There is enough space on the device for the files (they are usually the only files on the ~700 GB device while the benchmarks are running). Using ftrace, I was able to determine that these faults reach dax_iomap_pte_fault and return from it with (VM_FAULT_MAJOR | VM_FAULT_NEEDDSYNC) set for write faults and VM_FAULT_NOPAGE for read faults, but I was not able to isolate the cause beyond that.
I'd appreciate pointers to explanations or potential solutions (or kernel patches, if this is an issue that has been fixed since 4.15), or any ideas for how to isolate the cause further. I plan to update this thread by the end of the week, once I have tested whether the error reproduces with different workload configurations (single thread, smaller working set size, etc.).
Thanks,
George