Yes, it does seem pretty hard to backprop through it. The slicing and the updates on subsets of tensors are also difficult to implement.
This question might be a little high-level, but do you think the input-output function of CARFAC, including the inner hair cell, could be approximated by an RNN, perhaps one structurally engineered to fit CARFAC? I would only need to approximate the functionality for a well-defined subspace of soundscapes, for which I could generate a myriad of examples. I'd mainly like to replicate the suppression and masking mechanisms.
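To make the question more concrete, here is a rough NumPy sketch of what I mean by a "structurally engineered" RNN. It is not CARFAC and makes no claim to match its equations. It only mirrors the overall shape as I understand it: a cascade of resonators, a rectifying stand-in for the inner hair cell, and a slow gain state that is spread to neighbouring channels and fed back as damping. All names and parameter values (`init_params`, `damp_gain`, `spread`, and so on) are hypothetical placeholders that would be learned from generated examples.

```python
import numpy as np

def init_params(n_channels=8, sample_rate=22050.0):
    """Hypothetical parameters; in a trained model these would be learned."""
    # Pole frequencies run from high to low along the cascade.
    freqs = np.geomspace(0.3 * sample_rate, 0.01 * sample_rate, n_channels)
    return {
        "theta": 2.0 * np.pi * freqs / sample_rate,  # pole angle per channel
        "r_max": np.full(n_channels, 0.98),          # undamped pole radius
        "damp_gain": np.full(n_channels, 0.1),       # damping per unit of gain state
        "agc_rate": 0.01,                            # smoothing rate of the gain state
        "spread": 0.25,                              # coupling to neighbouring channels
    }

def init_state(n_channels):
    # z: two-dimensional resonator state per channel, g: slow gain state.
    return np.zeros((n_channels, 2)), np.zeros(n_channels)

def step(params, state, x):
    """One recurrent update for a single input sample x."""
    z, g = state
    n = z.shape[0]
    # A larger gain state pulls the poles inward (more damping).
    r = params["r_max"] * (1.0 - params["damp_gain"] * g)
    a = r * np.cos(params["theta"])
    c = r * np.sin(params["theta"])
    z_new = np.empty_like(z)
    y = np.empty(n)
    inp = x
    for k in range(n):
        # Rotate-and-shrink the state, then inject the input.
        z0 = a[k] * z[k, 0] - c[k] * z[k, 1] + (1.0 - r[k]) * inp
        z1 = c[k] * z[k, 0] + a[k] * z[k, 1]
        z_new[k] = (z0, z1)
        y[k] = z0
        inp = z0  # cascade: each channel feeds the next one
    ihc = np.maximum(y, 0.0)  # crude half-wave rectifier as an IHC stand-in
    drive = np.tanh(ihc)      # bounded drive in [0, 1)
    s = params["spread"]
    # Share the drive with neighbours so one channel can damp another.
    spread_drive = np.convolve(drive, [s, 1.0 - 2.0 * s, s], mode="same")
    g_new = g + params["agc_rate"] * (spread_drive - g)
    return (z_new, g_new), ihc

def run(params, x):
    """Run the cell over a 1-D signal; returns (n_samples, n_channels) outputs."""
    state = init_state(params["theta"].shape[0])
    outputs = []
    for sample in x:
        state, ihc = step(params, state, sample)
        outputs.append(ihc)
    return np.array(outputs), state
```

For example, `run(init_params(), np.sin(2 * np.pi * 1000 * np.arange(2000) / 22050.0))` returns one rectified output per channel per sample. Since every operation in `step` is a plain differentiable array operation with a fixed-size state, the same structure should be straightforward to port to an autodiff framework, which is the part I am hoping avoids the slicing problems.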
P.S. I just read your book. It's a huge help in my visual-to-auditory sensory substitution research, thank you!