MSOM 1RR Decision on SIG-2026-0377

14-May-2026

Re: SIG-2026-0377, "Online Learning with Survival Data"

Decision: Major Revision

Dear Author (addressed generically to preserve anonymity):

Thank you for submitting your manuscript to the M&SOM Healthcare Operations Management SIG. The paper was reviewed by an expert Associate Editor and two knowledgeable reviewers, and I am grateful to all three for the time, effort, and care they put into their reports. As you know, this paper was submitted under the SIG-Day + 1RR track. A Major Revision decision keeps the paper at M&SOM with the same review team.

The paper develops "survival bandits" that use the Cox proportional hazards model for adaptive experimentation with time-to-event outcomes. The main theory compares a Survival ETC algorithm with a Dichotomized ETC alternative. This is an important problem: time-to-event outcomes arise naturally in healthcare operations, and dichotomizing them can discard useful information. Both reviewers note the paper's clarity, and the AE finds that the combination of algorithm design, asymptotic analysis, and a healthcare case study gives the paper a plausible path forward. I agree.

Both reviewers recommend Major Revision, though R2 describes the recommendation as falling between Major Revision and rejection. The AE also recommends Major Revision. Having read the paper and all three reports, I concur with the AE's assessment that the paper has a solid core contribution and merits another round, provided the authors address the concerns below.

Below, I first summarize the main concerns from the review team, before adding a few observations from my own reading.

First, the ETC analysis needs to be made technically secure. R1 raises two concerns that go directly to the main results: (a) the analysis treats the exploration fraction m_n as non-random, even though the stopping rule is data-dependent, and (b) the regret decomposition in equations (3) and (4) omits a (1 - m_n) factor in the post-commitment term, which requires justification under the asymptotic equivalence claim. The AE endorses both points. These are the top priority. If the technical argument is sound, a clear explanation will strengthen the paper; if it requires correction, the main results may need adjustment.

Second, the paper needs clearer positioning. R2 questions whether the setup fits naturally into a bandit paradigm and whether ETC is the right central object. The AE does not view this as fatal, but agrees that the framing could be clearer. The authors should either lean more explicitly into ETC as the practically relevant test-and-roll baseline, or develop the Thompson Sampling side more fully so the broader "survival bandit" framing feels balanced.

R2 also questions whether comparing asymptotic upper bounds (Theorem 2 / Corollary 1) can establish that one method dominates the other without tightness or a matching lower bound. I find the AE's reading fair: the paper frames its contribution as an asymptotic expected-regret characterization in a large-n regime. The revision should state that positioning earlier and explain directly why the large-n regime is appropriate for the applications the paper targets.

Third, the practical motivation and case study need more work. R2 argues that practitioners gravitate toward dichotomization not only for simplicity but because they optimize against specific KPIs such as 30-day readmission rates, and questions whether the retrospective cervical cancer case study is sufficiently compelling. The AE agrees that the paper needs a clearer account of when preserving full time-to-event information matters operationally. The revision should identify the class of applications where survival-based learning would materially improve decisions, connect the "fast" and "slow" event regimes more clearly to practice, and strengthen or supplement the case study accordingly.

I want to add three observations from my own reading.

First, the case study defines regret relative to the arm preferred by a Cox PH model fit on the full dataset. Since the survival algorithms also use Cox PH internally, the evaluator and the learner share the same model class. This matters because the paper itself notes that the empirical survival distributions may violate proportional hazards or other theoretical assumptions. A model-agnostic benchmark, such as empirical restricted mean survival time under each arm's Kaplan-Meier curve, would remove this structural advantage.
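For illustration, a minimal sketch of such a benchmark (my own illustration, not code from the paper; the function name, the arm-level arrays of follow-up times and event indicators, and the evaluation horizon are placeholders):

    import numpy as np

    def km_rmst(times, events, horizon):
        """Restricted mean survival time up to `horizon` from a Kaplan-Meier curve.

        times  : observed follow-up times (event or censoring), one per participant
        events : 1 if the event was observed, 0 if the observation was censored
        """
        times = np.asarray(times, dtype=float)
        events = np.asarray(events, dtype=int)
        order = np.argsort(times)
        times, events = times[order], events[order]

        # Product-limit (Kaplan-Meier) estimate at each distinct event time
        surv = 1.0
        curve_t, curve_s = [0.0], [1.0]
        for t in np.unique(times[events == 1]):
            at_risk = np.sum(times >= t)
            d = np.sum((times == t) & (events == 1))
            surv *= 1.0 - d / at_risk
            curve_t.append(t)
            curve_s.append(surv)

        # Integrate the survival step function from 0 to `horizon`
        # (the last value is carried forward to the horizon)
        curve_t.append(horizon)
        rmst = 0.0
        for i in range(len(curve_s)):
            left = min(curve_t[i], horizon)
            right = min(curve_t[i + 1], horizon)
            rmst += curve_s[i] * max(0.0, right - left)
        return rmst

    # Usage sketch: compare arms by their empirical RMST rather than by a Cox fit
    # rmst_gap = km_rmst(times_arm1, events_arm1, horizon) - km_rmst(times_arm0, events_arm0, horizon)

Evaluating regret against an RMST gap of this kind would not presuppose proportional hazards and would therefore not favor the survival algorithms by construction.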

Second, the Thompson Sampling algorithm uses the Cox estimator and its estimated precision under adaptive allocation, but the theoretical justification for the Cox normal approximation is developed in the paper's fixed-allocation ETC setting. The paper itself notes that adaptive-allocation asymptotics require allocation proportions to converge to positive constants. The authors should therefore clarify the status of the TS normal approximation and present the TS claims as empirical unless additional theory is provided.
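To make the point concrete, the kind of normal-approximation TS step I have in mind looks roughly as follows (purely illustrative; the names beta_hat and nu_hat and the sign convention are my placeholders, not the paper's notation):

    import numpy as np

    rng = np.random.default_rng(0)

    def ts_next_arm(beta_hat, nu_hat):
        """One allocation step of Thompson Sampling under a normal approximation.

        beta_hat : current Cox log hazard-ratio estimate (arm 1 vs. arm 0)
        nu_hat   : estimated precision (inverse variance) of beta_hat
        """
        # Draw a plausible log hazard ratio from the approximate posterior
        beta_draw = rng.normal(beta_hat, 1.0 / np.sqrt(nu_hat))
        # Placeholder convention: a negative sampled log hazard ratio favors arm 1
        return 1 if beta_draw < 0 else 0

The question is whether the normal approximation feeding beta_draw is justified once the allocation itself is adaptive; absent additional theory, the TS results should be presented as empirical.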

Third, Theorem 3's "no cost of uncertainty" claim for Survival ETC appears to rest on the invariance of the optimal target precision to the baseline event-rate parameter. Appendix E states this argument, but the derivation is terse. Because this result is one of the paper's distinctive contributions, the algebra establishing the invariance should be shown explicitly.

Both reviewers also offer useful minor suggestions, including a formal definition of regret, UCB results in the case study, notational improvements, and connections to hypothesis testing and additional literature. The authors should address these as space permits.

Thank you again for submitting your manuscript. Please consult the appended reports for full details as you prepare the revision.

Sincerely,

Tinglong Dai
MSOM Journal 1RR Department Editor - Healthcare Operations
d...@jhu.edu


=======================================
REVIEWER 1 REPORT
=======================================

Attached Referee Report:

The paper analyzes a multi-armed bandit model with time-to-event outcomes. The majority of previous work on multi-armed bandits with delayed outcomes assumes that the outcomes are independent of the delay. In this paper, the outcome is the delay itself; that is, outcome and delay are perfectly correlated. A common approach in practice to tackle such problems is to dichotomize the outcome by selecting a threshold: if the event happens before the threshold, the observation is labeled as a one, and otherwise as a zero. This way, the problem simplifies to a multi-armed bandit with binary outcomes and a fixed delay, which can be solved with standard tools.
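For concreteness, a minimal sketch of the dichotomization step (my own illustration, not code from the paper; the function name and the threshold tau are placeholders):

    import numpy as np

    def dichotomize(event_times, tau):
        """Collapse time-to-event outcomes into binary outcomes at threshold tau."""
        event_times = np.asarray(event_times, dtype=float)
        # Label 1 if the event occurred by tau, 0 otherwise. Each observation
        # becomes available exactly tau time units after enrollment, so the
        # problem reduces to a binary bandit with a fixed delay of tau.
        return (event_times <= tau).astype(int)
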

The contribution of the paper lies in analyzing the benefits of methods designed specifically to handle time-to-event outcomes relative to the dichotomized approach. To do so, the authors theoretically compare the so-called survival explore-then-commit (Survival ETC) algorithm, which is designed specifically for multi-armed bandits with time-to-event outcomes, to the Dichotomized ETC algorithm. The findings reveal two relevant regimes: "slow event regimes" (few participants experience an event during the exploration phase) and "fast event regimes" (almost all participants in the exploration phase experience an event). In both regimes, Survival ETC is shown to have significantly lower asymptotic cumulative regret than Dichotomized ETC and to be more robust to misspecification. In addition, a case study based on cervical cancer screening shows, under more relaxed assumptions, that the results continue to hold for the ETC algorithms and for the analogous Thompson Sampling algorithms.

The contribution of the paper is clear and its relevance clearly justified. The paper also tells a clear story, and all of its pieces (algorithms, theoretical analysis, numerical case study) serve their purpose in supporting the thesis that "survival algorithms" are superior to "dichotomized algorithms". In addition, the paper is well-written and polished.

I have two concerns about the correctness of the analysis, stated below, that are critical for publication. Otherwise, I have a positive view of the paper, and the remainder of my comments are labeled as minor.

Major comments:

1. The ETC algorithm stops exploring after a certain precision nu_n is achieved. Given that the outcomes and the allocations are random, the path of the estimated precision nu_hat_n and the proportion of exploration m_n are also random. The analysis incorrectly treats m_n as non-random. A proper argument is needed even if the bound is expected to remain valid. In addition, m_n is never formally defined.

2. In equations (3) and (4), the second term multiplies n by the probability of committing to the wrong arm. In reality, it should multiply n(1 - m_n) by that probability, as sketched below. Classical ETC analyses can drop the (1 - m_n) factor because they work with inequalities and big-O notation; using your asymptotic equivalence operator (~) requires justification for removing that factor.
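To be concrete, under an even split of the exploration phase across the two arms, the decomposition I have in mind reads roughly (Delta denotes the suboptimality gap and a_hat the committed arm; this is my shorthand, not the paper's notation):

    E[Regret(n)] ~ n * m_n * (Delta / 2) + n * (1 - m_n) * Delta * P(a_hat is suboptimal),

whereas equations (3) and (4) use n rather than n(1 - m_n) in the second term.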

Minor comments:

1. ETC seems to be a poor choice for bandits with delays because the last patients in the exploration phase are neither observed before the commitment decision nor used for exploitation. You properly justify at the beginning of Section 3 why you use ETC for the theoretical analysis. Can you comment on why, specifically for bandits with delays, ETC is not a bad choice?

2. Survival UCB is used in Figure 1. I believe results using UCB in the case study would also add relevant information.

3. The slow and fast event regimes are defined based on s and alpha, which are artificial parameters for the asymptotic analysis. Can you comment on how fast and slow regimes would be identified in practice? Is it possible to place the cervical screening case study in the fast or the slow regime?

4. Can you artificially scale the recruitment rate (or some other parameter of the model) in the case study to illustrate the differences between the two regimes (fast and slow) in practice?

5. The dichotomized algorithm assumes that the outcome is observed after tau. However, if the event occurs before tau, the outcome can be observed earlier. Would using such information earlier significantly improve the performance of the dichotomized algorithms? Is this something that can easily be tested numerically or theoretically?

6. Section 4.1 claims that expected regret was defined in Section 2.1. I did not find a formal, mathematical definition of regret anywhere in the paper.

7. Page 17: tau*/m_n^{dich*} = k/(k+1) should use ~ rather than =, and the inline fraction should be typeset horizontally.

8. Notational suggestions:
a. Sigma is usually used for a variance, but here it is used for the inverse of a variance. nu is not defined in the main paper. Instead of defining Sigma, could you just define nu and use it consistently throughout as the measure of variation?
b. Why is s assumed to be negative rather than positive? Defining beta_n = beta_0 * n^{-s} instead would make the exponents in the regret bounds easier to interpret and to compare with standard regret bounds.
c. Equation (6) redefines d, which was already defined in Section 3.2.

Recommendation: Major Revision


=======================================
REVIEWER 2 REPORT
=======================================

Q1. Summary of the paper's contributions

In this paper, the authors present a method to dynamically assign participants to the arms of a study in which the outcome of interest is time to effect. The key uncertainty comes from the different time-to-effect distributions of the arms and their associated hazard ratio. The goal is to find the arm that either maximizes or minimizes time to effect, depending on the decision maker's needs. The authors present and analyze an Explore-then-Commit (ETC) algorithm for this problem; they also present a Thompson Sampling (TS) algorithm for which there is no theoretical analysis. These methods are then validated empirically in a computational study with retrospective cervical cancer screening data.

Q2. Paper's strengths

The paper is very well written and presented. It was quite easy to understand the main results and implications the authors were trying to communicate. In addition, analyzing time-to-event outcomes is a critical problem in the operations literature as it relates to theory.

Q3. Path to publication / rejection rationale

I would not recommend the paper to the SIG in its current form, and I am between a major revision and a rejection: I can see a path forward, but it would require a very significant rewrite of the paper and new theory. My main contentions are with the generalizability/applicability of the methods and with the depth of the analysis and methods.

Applicability of methods:

One issue, which the authors acknowledge, is that while time-to-event outcomes are important, in most settings decision makers in practice do not consider them and use dichotomization instead. Where I struggle is that I don't necessarily buy the argument laid out by the authors in the introduction for the statistical cost of dichotomization. Before I circle back to this, I do want to acknowledge that the authors also try to make this case with theoretical and computational results. They present theoretical results about the potential regret gap between a dichotomized and a survival algorithm; however, these involve comparisons of asymptotic upper bounds, which is a bit problematic (I'll touch on the reason for this when I discuss the depth of analysis). Even considering the computational results, it is hard to say that the survival methods are significantly better than the dichotomized methods outside of ETC-style algorithms; the gap is quite small for TS-style algorithms.

That said, the main issue is really more high level. As acknowledged by the authors, dichotomization is not used only because it is simple but also (partially as a function of its simplicity) because "Practitioners may also gravitate toward dichotomization when optimizing against specific Key Performance Indicators (KPIs), such as 30-day readmission rates, or due to concerns regarding the proportional hazards assumption". For this reason, most applications in practice (as it exists now) would not really benefit from the survival aspect of the algorithm. This is further exacerbated by the case study data and scenario, which are fairly contrived and based on an aggregation of retrospective data sets. This would not necessarily be too major an issue if there were a clear use case for the methods; however, the paper would be quite a bit sharper with a more direct case study in which it is plausible that survival algorithms could be adopted by healthcare professionals in the future, even though they are not adopted now. This is, I think, a pretty big concern, since as it stands it is not really clear what organizations or decision makers would use this methodology in practice, or what sorts of problems would benefit from the new approach. Some discussion of how KPIs could be augmented to consider survival aspects, or of whether the proposed methods could be used to promote such a shift, would also help make the case for applicability. Otherwise, while I do find the theoretical setting interesting, it is really not clear who would implement this method in practice.

Additionally, a new case study would require additional computational experiments in that environment to show the effectiveness of the approach (especially in the TS case).

Depth of Analysis and Methods:

I have two main critiques of the analysis and methods: the first concerns the main algorithm presented in the paper, the Survival ETC algorithm, and its framing; the second concerns the theoretical regret analysis.

Let me start with my critique of Survival ETC. The central issue is that, as posed, the problem does not fit neatly into a bandit paradigm. If the outcome is dichotomized, as the authors note, it is pretty straightforward to define a multi-armed bandit style framework and model. However, once the focus shifts to time-to-event (TTE) outcomes, the information asymmetry between the "good" and "bad" arms makes it hard to see how a bandit policy (here I am thinking of a myopic policy given the current information filtration of the system) would realistically work in this setting. The fact that ETC is the preferred approach is also a kind of tacit admission by the authors that they agree with this conclusion. For the record, ETC is technically a dynamic policy and is very common in the operations, controls, and RL literature, where it usually inherits from two-stage control-style approaches. But in this case, I think framing it as a bandit approach is not really in the spirit of what is being proposed. The algorithm boils down to running a randomized trial and then, once the trial is completed, using the fitted model to make decisions. This is much more similar to standard approaches in the operations literature where we collect data, fit a model, and then optimize that model. If more emphasis were put on the TS algorithm, I think this would be a different story. That said, I think there is a compelling narrative here on experimental design with survival models, which is essentially what the results of the theory section are after, but that would depend on whether the authors are thinking about this more as a bandit paper or as an experimental design paper. If the bandit direction is preferable, then I would recommend a deeper treatment of the TS algorithm and its regret analysis.

The second main concern is with the theoretical regret analysis. First, let me make clear that the analysis is mathematically correct; the issue is that it is not clear to me whether these results are meaningful for the problem described. The big challenge is that everything hinges on asymptotic normality (i.e., weak convergence) for establishing approximate upper bounds on the regret of various algorithms. In the bandits and RL literature, the standard approach is to use finite-time concentration to establish these rates, not asymptotic results. The main reason is that convergence in distribution, i.e., weak convergence, does not by itself mean we have actually learned the parameters of a system. The standard is to consider convergence in probability, that is, that with high probability we have estimated the system parameters. Finite-time bounds give us a sense of how fast this convergence in probability occurs and therefore tell us whether we are learning about the system quickly enough to minimize regret. To get around this shortcoming, the authors cast their analysis in the "large n regime"; while I agree that, practically speaking, n is generally "large enough" for asymptotic hypothesis tests, that is a very different consideration from rigorously analyzing regret. What is really needed, I think, is to use finite-time concentration bounds to determine how large "large enough" is, and then to work from that to establish the regret results. The resulting rates will almost certainly be looser, since finite-time bounds are generally more conservative than asymptotic ones, but they will be more faithful to the analysis the authors are after.
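To illustrate the distinction in generic notation (not the paper's): asymptotic normality is a statement of the form

    sqrt(n) * (theta_hat_n - theta)  =>  N(0, sigma^2)   as n -> infinity,

which by itself controls nothing at any fixed n, whereas a finite-time concentration bound such as (e.g., for a sub-Gaussian sample mean)

    P( |theta_hat_n - theta| > eps ) <= 2 * exp(-c * n * eps^2)   for every n >= 1

holds at each sample size and can be plugged directly into a high-probability regret bound.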

The next issue I have with the theoretical analysis relates to Theorem 2 and Corollary 1. Putting aside the asymptotic normality concern, these results rely on comparing the regret upper bounds of two different algorithms to make a claim about the relative effectiveness of each method. The problem is that it is not clear how loose these upper bounds are. For example, while in this analysis it worked out that adding survival models is more efficient than using dichotomization, it could be the case that the upper bound for the survival methods is simply tighter than the bound for dichotomization. The results of Theorem 2 really only tell us the order of regret growth (big O), but to do the kind of relative comparison in Corollary 1 we need some sense of a lower bound on the order of regret growth and the related constants. This means obtaining either a big-Theta characterization of the regret of each approach, or at least a matching big-Omega lower bound so that the comparison can carry through. Without this, it is hard to claim that one method is better than the other in terms of theoretical regret rates.

Minor comments:

1. The abstract should be reformatted to be in the MSOM style with distinct subsections.

2. Since the paper straddles experimental design and bandit algorithms I think it would be appropriate to include some additional literature on bandit applications in trials such as power constraints, causal inference, and statistical estimation bandits.

3. There is some discussion of how threshold selection for dichotomization could be a problem since it would not distinguish between certain treatments. However, since the decision makers are the ones choosing the threshold, would this really be too much of an issue in a scenario where, for instance, the thresholds are KPI driven? If my KPI is based on seeing patients within 3 weeks, but one condition gets people seen within 2 days and the other within 2 weeks, is estimating the difference really that crucial, given that both are within my KPI goal? I think a more nuanced conversation here, together with the sharper example I asked for in my main comments, would help clarify this.

4. In Section 3.1, I was a bit confused by the notation, since capital Sigma is usually reserved for a covariance matrix and not for a precision. This is a very small nitpick, but it could be cleaner to use a different letter, such as nu, here.

5. In the context of Assumption 2, another way of framing this problem is through the lens of Bayesian hypothesis testing where Assumption 2 is the conjugate prior. I would add some discussion potentially relating to that field as well.

Recommendation: Major Revision


=======================================
ASSOCIATE EDITOR REPORT
=======================================

Q1. Summary of the paper's contributions

This paper studies adaptive experimentation with time-to-event outcomes and proposes "survival bandits" that use the Cox proportional hazards model, with the main theory centered on a two-armed Survival ETC algorithm benchmarked against a Dichotomized ETC alternative.

Q2. Paper's strengths

I find the topic important and promising. The paper is clearly written, the core question is meaningful, and the combination of algorithm design, asymptotic analysis, and a healthcare case study gives the paper real potential. Both reviewers also acknowledge the paper's clarity and the importance of the problem.

Q3. Path to publication / rejection rationale

Both reviewers recommend Major Revision, although the tone of their comments is somewhat mixed. My recommendation is also Major Revision. In my view, the paper has a solid core result and merits another round of review, but a revision would need to address several substantive concerns raised by the reviewers, including the rigor of the theoretical development, the framing of the contribution, and the practical interpretation of the results. I summarize the main concerns below.

Reviewer 1 raises two concrete points that I view as important. First, the analysis appears to treat the exploration fraction m_n as non-random, even though the stopping rule is data dependent and the realized exploration length should therefore be random. Second, Reviewer 1 notes that the regret decomposition in equations (3) and (4) appears to omit a (1-m_n) factor in the post-commitment term; while such a simplification may be harmless in some big-O arguments, it requires justification under an asymptotic equivalence claim. These comments go directly to the derivation of the main ETC results and should be addressed carefully in a revision.

Reviewer 2 is right to push on whether the asymptotic framework is sufficiently well motivated and whether the paper has fully clarified what is being claimed. However, I also think Reviewer 2 somewhat mischaracterizes the current analysis. The paper does not present itself as comparing only asymptotic upper bounds; rather, it explicitly frames Section 4.1 as analyzing the expected regret of the ETC algorithms in a large-n regime and decomposes expected regret into exploration and incorrect-commitment components. The paper also explicitly situates this approach relative to more recent asymptotic work, rather than the classical finite-time concentration literature. That said, the authors should make this positioning sharper, since the current presentation leaves room for confusion. In revision, I would encourage the authors to state more clearly and early that their contribution is an asymptotic expected-regret characterization in a large-n regime, and to explain more directly why that regime is appropriate here.

Reviewer 2 raises two more issues that I think are worth addressing. The first is about the paper's framing, i.e., a bandit paper versus a sequential experimental design paper. Reviewer 2 questions whether ETC is the right central object and whether the setup fits naturally into a bandit paradigm. I do not think this is fatal, because the paper does define a sequential adaptive problem and extends multiple bandit-style algorithms. Moreover, the paper gives a plausible reason for focusing on ETC (analytically tractable and operationally common). Still, I think the authors could sharpen this point. In the current version, the paper would benefit from either leaning more explicitly into ETC as the practically relevant test-and-roll baseline, or developing the TS side more fully so that the broader "survival bandit" framing feels more balanced.

The other issue is that the paper needs a sharper story about when decision makers would truly prefer survival-based learning to KPI-driven dichotomization. The current paper does acknowledge that practitioners may prefer threshold-based metrics or may worry about proportional hazards assumptions, but the discussion should go further. In the revision, the authors could better articulate the class of applications in which preserving full time-to-event information is operationally important, and it should connect the "fast" and "slow" event regimes more clearly to practice, as Reviewer 1 suggests.

Overall, I believe the paper has enough novelty and substance to merit a major revision rather than rejection. If the above main issues are addressed convincingly, I believe the paper could become a worthwhile contribution.

Recommendation: Major Revision