Delta Standalone Bugs

Tai Henrichs

Feb 21, 2024, 12:29:21 PM
to Delta Lake Users and Developers
Hello,

I originally posted about these issues in the Delta-questions Slack channel; however, after more than a week I had received no responses, so I am also asking here.

I believe I have found two bugs in Delta Standalone version 3.0.0 (I assume they remain in 3.0.1, since the changelog for Delta 3.0.1 says Standalone is unchanged). I could not find an existing GitHub issue for either bug, but please let me know if I missed one.

The first bug is more involved, and I'm still unsure of its cause. I wanted to describe both bugs here before submitting issues for them.

The first bug appears to be a race condition.
If a minimal reproducible example would help, I can try to make one, though it may take some time.

For now, I thought I would get the discussion started by describing the issue.
Suppose you have a multi-threaded Java application, call it JA, running on a JDK 8-compatible JVM.

JA spawns a thread T1. Within T1, a Snapshot is instantiated; call it snapshot. A DeltaScan is then created via snapshot.scan() and passed into the constructor of a class ScanContainer, which saves the reference in a field named scan. ScanContainer has a getter, getIterator(), that checks whether a field iter is null; if it is, the getter initializes iter with the result of scan.getFiles() and then returns iter.
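
To make this concrete, here is a minimal sketch of ScanContainer as just described (DeltaScan, CloseableIterator, and AddFile are the standard Delta Standalone types; everything else is my own simplified naming):

import io.delta.standalone.DeltaScan;
import io.delta.standalone.actions.AddFile;
import io.delta.standalone.data.CloseableIterator;

class ScanContainer {
    private final DeltaScan scan;
    private CloseableIterator<AddFile> iter;

    ScanContainer(DeltaScan scan) {
        this.scan = scan;
    }

    // Lazily initializes the iterator on first use, then reuses it.
    CloseableIterator<AddFile> getIterator() {
        if (iter == null) {
            iter = scan.getFiles();
        }
        return iter;
    }
}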

Now, here's where the bug manifests. Suppose our ScanContainer instance is named scanContainer, and we call scanContainer.iterate(). The body of iterate() has the following form:

void iterate() {
    while (true) {
        if (!getIterator().hasNext()) {
            // block A: do some stuff
            return;
        }

        if (someCondition) {
            return;
        }

        // block B: do different stuff
    }
}

In "normal" execution, on the first loop-iteration, this code does what I expect, which is to enter block B to "do different stuff."

However, if I attach a thread-suspending breakpoint in IntelliJ, specifically at the line "if (!getIterator().hasNext())", the code enters block A instead. The same thing occurs if I change the breakpoint to "suspend all threads."

The issue does not necessarily occur just because a debugger is attached to the JVM: with breakpoints set at other locations, but none at the line above, the bug has not triggered. Other locations may yet elicit it, but none of the ones I have tried has.

Thus, I believe there is a race condition somewhere within Delta Standalone. Having explored the source code myself, I have not found where the issue might lie.
To give more context, the relevant Delta table is on HDFS, which runs on a different set of machines than the JVM. Using IntelliJ's debugger, I confirmed that the LogStore selected via DelegatingLogStore.scala is the HDFS one.

Also, this is not a resource-cleanup issue on my end. I have confirmed that the issue manifests even when the JVM constructs only a single Snapshot for its entire lifetime, so there cannot be any stale iterators from DeltaScan::getFiles() left uncleaned: no older iterators ever existed.
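
(For reference, getFiles() returns a CloseableIterator, so by "cleaned up" I mean the usual try-with-resources pattern, roughly as in the sketch below; it assumes the imports from the ScanContainer example plus io.delta.standalone.Snapshot and java.io.IOException.)

// CloseableIterator extends java.io.Closeable, so try-with-resources
// closes it deterministically, even if processing throws.
void listFiles(Snapshot snapshot) throws IOException {
    try (CloseableIterator<AddFile> files = snapshot.scan().getFiles()) {
        while (files.hasNext()) {
            AddFile file = files.next();
            // ... process file ...
        }
    }
}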

Finally, I have tried, without success, to reproduce the bug without breakpoints. Specifically, I inserted a Thread.sleep() call immediately before "if (!getIterator().hasNext())" and tried a variety of sleep durations, none of which triggered the issue.

I suspect this is related to breakpoints not triggering until every thread reaches a safepoint, which may affect timings and thread interleaving.
I initially suspected there might be a timeout somewhere (a Hadoop configuration, some internal Delta Standalone timeout, or the like), but these findings ruled out that hypothesis. In one test I called Thread.sleep(300000), and while Thread.sleep is imprecise, the wall-clock time I measured on my phone (program start to finish, and also program start to reaching either block A or block B) was substantially longer than in the runs where I observed the bug.

While searching for the source of the bug, I looked for places where Delta Standalone uses multiple threads. The only one I have seen is SnapshotImpl.scala's loadInMemory method, but I have not carefully read every source file of the project.

This brings me to the second bug. Within loadInMemory, the thread pool is created before a try-finally block and shut down within the finally block. However, if the thread executing loadInMemory is killed by calling stop(), the thread pool won't always be released. This doesn't pose a problem for me personally, but it seems suboptimal. A try-with-resources wouldn't entirely address the problem, as stop() could still be called while the finally block is executing, but it does seem closer to best-effort cleanup; see the sketch below.
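
To illustrate in Java (the actual method in SnapshotImpl.scala is Scala, and this adapter, its name, and the pool size are all hypothetical): on JDK 8, ExecutorService does not implement AutoCloseable, so a try-with-resources needs a small wrapper, something like:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical adapter so an ExecutorService can participate in
// try-with-resources on JDK 8, where it does not implement AutoCloseable.
final class ClosablePool implements AutoCloseable {
    final ExecutorService pool = Executors.newFixedThreadPool(4); // size is a placeholder

    @Override
    public void close() {
        pool.shutdownNow(); // best-effort: stop() can still fire mid-close
    }
}

class LoadInMemoryLike {
    void run() {
        try (ClosablePool p = new ClosablePool()) {
            // submit the parallel work to p.pool here
        } // pool shut down when the block exits, even on exception
    }
}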

I also understand that stop() has been deprecated, but an analogous problem arises if the containing thread is interrupted after the thread pool is created but before the try block is entered. The containing thread may wish to respond to the interrupt by cleaning up its resources, yet as things stand there is no proper mechanism by which it could do so.

This is because the thread pool is inaccessible outside the method in which it is created.
Based on comments in the method, the pool is created per method invocation to avoid some other problems. Regarding a per-snapshot pool, the comment says the following: "If we instead create this on a per-Snapshot instance then we couldn't close the pool and might leak threads." It's not clear to me why this is true. Couldn't Snapshot's interface be made to extend AutoCloseable, with the close method implemented in SnapshotImpl shutting down the thread pool?
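
Concretely, I am imagining something like the following (a hypothetical sketch of that suggestion, not the current API):

// Hypothetical: if Snapshot extended AutoCloseable, callers could scope
// a per-snapshot thread pool with try-with-resources.
interface Snapshot extends AutoCloseable {
    // ... the existing methods, such as scan() ...

    @Override
    void close(); // SnapshotImpl would shut down its thread pool here
}

class Caller {
    // snapshot would come from DeltaLog.snapshot() as usual.
    void use(Snapshot snapshot) {
        try (Snapshot s = snapshot) {
            // ... s.scan(), iterate files, etc. ...
        } // pool shut down deterministically; no leaked threads
    }
}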

Anyway, sorry for the lengthy message. I wanted to be reasonably thorough.

Thank you,
Tai Henrichs

Tai Henrichs

Feb 21, 2024, 9:33:32 PM
to Delta Lake Users and Developers
Hello,

I think the race-condition issue was on my end, sort of: it was caused by one of IntelliJ's own threads, which do not appear in the thread view that IntelliJ provides.

However, my point regarding the leaky thread pool stands.

Best,
Tai Henrichs
