I think the answer to your question is that you use $\mu_i$ instead of the decoded $\hat{y}_i$ for all slices that haven't been transmitted yet.
As you decode each slice $i$, you always have $\mu_i$ and $\sigma_i$ before $y_i$, since $\mu$ and $\sigma$ only depend on the previous slices and the hyperprior. If you have the bits for $y_i$, then you use $\mu_i$ and $\sigma_i$ to decode them and get $\hat{y}_i$ as a result (strictly speaking, you need to calculate the LRP too).
If you don't have the bits yet (e.g., for progressive decoding you might have only received the bits for slices 1 and 2 so far), then you can't decode $y_i$, so you use $\mu_i$ instead. Conceptually, $\mu_i$ is the model's best guess for $y_i$ before actually decoding the compressed representation, so it's a good choice as a stand-in.
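In rough pseudo-PyTorch, the per-slice loop looks something like this. This is just a sketch, not the actual code: `mean_scale_net`, `range_decode`, and `lrp_net` are placeholder names for the conditional entropy-parameter network, the entropy decoder, and the LRP transform.

```python
import torch

def decode_slices(hyper_feat, slice_bits, mean_scale_net, range_decode, lrp_net, num_slices):
    """Decode the slices we have bits for; substitute mu_i for the rest."""
    decoded = []  # hat{y}_i for received slices, mu-based stand-ins for the others
    for i in range(num_slices):
        # mu_i / sigma_i depend only on the hyperprior and the slices already in
        # `decoded`, so they are available before we touch slice i's bits.
        mu_i, sigma_i = mean_scale_net(i, hyper_feat, decoded)
        if i < len(slice_bits):
            # Bits available: entropy-decode with (mu_i, sigma_i), then add the LRP term.
            y_hat_i = range_decode(slice_bits[i], mu_i, sigma_i)
            y_hat_i = y_hat_i + lrp_net(i, hyper_feat, decoded, y_hat_i)
        else:
            # No bits yet (progressive decoding): fall back to the model's best guess.
            # (Strictly, you would also run the LRP on mu_i; see the last sketch below.)
            y_hat_i = mu_i
        decoded.append(y_hat_i)
    return torch.cat(decoded, dim=1)  # assuming slices are concatenated along channels
```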
Note that you could also use zeros or sample from the Gaussian $\mathcal{N}(\mu_i, \sigma_i)$. I don't have the results in front of me, but I think I found that those approaches led to worse reconstructions, at least subjectively.
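For reference, the three fill strategies would look something like this (again just an illustrative helper, not code from the model):

```python
import torch

def fill_missing_slice(mu_i, sigma_i, mode="mean"):
    """Possible stand-ins for an untransmitted slice, given mu_i / sigma_i from the entropy model."""
    if mode == "mean":    # what I ended up using
        return mu_i
    if mode == "zeros":   # alternative: all zeros
        return torch.zeros_like(mu_i)
    if mode == "sample":  # alternative: draw from N(mu_i, sigma_i), with sigma_i as the scale
        return mu_i + sigma_i * torch.randn_like(sigma_i)
    raise ValueError(mode)
```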
For additional slices (say you have the bits for slices 1 and 2 but you need values for slice 4), I just propagated the results using $\mu_i$ as a stand-in. In the example, that means that $\mu_4$ is a function of $[\hat{y}_1, \hat{y}_2, \mu_3]$. Again, strictly, we replace the decoded $y_3$ with $\mu_3$ and then calculate $\mu_3 + \mathrm{LRP}_3(\mu_3, \hat{y}_1, \hat{y}_2, \mu')$.
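In the same sketch notation, the propagation step would be roughly the following (hypothetical helper names as before; I'm assuming `hyper_feat` plays the role of the $\mu'$ argument to the LRP):

```python
def stand_in_for_slice(i, hyper_feat, available, mean_scale_net, lrp_net):
    """Stand-in for slice i when its bits are missing, given the values in `available`."""
    mu_i, _sigma_i = mean_scale_net(i, hyper_feat, available)
    # Substitute mu_i for the decoded slice but still run the LRP on it, i.e.
    # mu_3 + LRP_3(mu_3, y_hat_1, y_hat_2, mu') in the example above.
    return mu_i + lrp_net(i, hyper_feat, available, mu_i)

# Example from above: bits for slices 1 and 2 only, but we need mu_4.
# available = [y_hat_1, y_hat_2]
# available.append(stand_in_for_slice(3, hyper_feat, available, mean_scale_net, lrp_net))
# mu_4, sigma_4 = mean_scale_net(4, hyper_feat, available)  # function of [y_hat_1, y_hat_2, stand-in for 3]
```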
Let me know if that's clear or not and if you have other questions.