Hi Alice,
Yes, information-theoretic measures on continuous-valued data can get a little strange. The (differential) entropies can go negative, but mutual information (and conditional mutual information) should in theory still be non-negative.
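As a quick reminder of why the entropies themselves can go negative: differential entropy has a closed form for a Gaussian, h = 0.5 * ln(2*pi*e*sigma^2) nats, which drops below zero once sigma is small enough. A throwaway Python check (my own illustration, nothing to do with JIDT's code):

```python
import math

def gaussian_diff_entropy(sigma):
    # closed-form differential entropy of N(mu, sigma^2), in nats:
    # h = 0.5 * ln(2 * pi * e * sigma^2)
    return 0.5 * math.log(2 * math.pi * math.e * sigma ** 2)

print(gaussian_diff_entropy(1.0))  # ~1.419 nats
print(gaussian_diff_entropy(0.1))  # ~-0.884 nats: negative!
```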
Now, what you have to remember is that we are dealing with estimators, which aren't numerically perfect, and several effects can make the estimates dip below zero. The most common way we see this is via the bias correction on the KSG estimator: you get a negative value whenever the uncorrected estimate falls below the expected bias that gets subtracted.
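To make that concrete, here is a toy pure-Python version of the KSG mutual information estimator (Kraskov et al.'s algorithm 1); this is my own minimal sketch for illustration, not JIDT's implementation. For two independent variables the true MI is zero, so the bias-corrected estimates scatter around zero and individual runs can land slightly below it:

```python
import math
import random

GAMMA = 0.5772156649015329  # Euler-Mascheroni constant

def psi(n):
    # digamma at a positive integer: psi(n) = -gamma + sum_{i=1}^{n-1} 1/i
    return -GAMMA + sum(1.0 / i for i in range(1, n))

def ksg_mi(xs, ys, k=4):
    # Kraskov et al. algorithm 1: I = psi(k) + psi(N) - <psi(nx+1) + psi(ny+1)>
    n = len(xs)
    mi = psi(k) + psi(n)
    for i in range(n):
        # distance to the k-th nearest neighbour in the joint space (max norm)
        dists = sorted(max(abs(xs[i] - xs[j]), abs(ys[i] - ys[j]))
                       for j in range(n) if j != i)
        eps = dists[k - 1]
        # counts of strictly-closer neighbours in each marginal space
        nx = sum(1 for j in range(n) if j != i and abs(xs[i] - xs[j]) < eps)
        ny = sum(1 for j in range(n) if j != i and abs(ys[i] - ys[j]) < eps)
        mi -= (psi(nx + 1) + psi(ny + 1)) / n
    return mi

random.seed(0)
xs = [random.gauss(0.0, 1.0) for _ in range(200)]
ys = [random.gauss(0.0, 1.0) for _ in range(200)]
print(ksg_mi(xs, ys))  # true MI is 0; the estimate is near 0 and may dip below
```

The psi(k) + psi(N) - <psi(nx+1) + psi(ny+1)> combination is exactly the built-in bias correction at work, which is why the scatter is centred on zero rather than sitting above it.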
But it can happen with the kernel estimator too. You could have a situation where the TE should be zero, but the estimates of the marginal (conditional) entropies were slightly below their true values, whilst the estimate of the joint (conditional) entropy was slightly above its true value. I can construct something very pathological to demonstrate this; see the attached file (which runs in Matlab, not Python; sorry, but that's easier for me, and I'm sure you can translate it if you want to run it!). I've constructed the example so that the marginals are constant for almost all points, yet the joint space never has any points in it apart from the search point itself. The joint probability is therefore generally lower than the product of the marginals, and the result is rather negative (much more negative than the small fluctuations you are seeing). You can turn on debugging if you want to see more info for each search point when it runs.
So, to answer your questions: yes, this is a normal numerical issue, and there isn't really anything you can do about it; just treat those values as consistent with zero.
Now, I notice that you're not setting the kernel width in your example, which suggests you perhaps haven't thought too hard about this parameter? The kernel width is a crucial parameter for this estimator, and it makes a very large difference to the values produced. The effect is so large that you're really better off going with an estimator that is less parameter-dependent; indeed, the kernel estimator is really only left in JIDT for historical comparative purposes.
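If you want a feel for just how sensitive it is, here's a throwaway sketch (my own naive box-kernel entropy estimator in plain Python, not JIDT's code): the same Gaussian sample, whose true differential entropy is about 1.42 nats, gives wildly different estimates as the width varies:

```python
import math
import random

def box_kernel_entropy(xs, width):
    # naive box-kernel (Parzen window) differential entropy estimate:
    # p_hat(x) = (# points within width/2 of x) / (N * width)
    n = len(xs)
    total = 0.0
    for x in xs:
        count = sum(1 for y in xs if abs(x - y) <= width / 2)
        total -= math.log(count / (n * width))
    return total / n

random.seed(42)
xs = [random.gauss(0.0, 1.0) for _ in range(500)]
for width in (0.02, 0.5, 10.0):
    print(width, box_kernel_entropy(xs, width))
```

The narrow width badly underestimates and the very wide width badly overestimates, with well over a nat between them, so any downstream TE value shifts just as dramatically with this one parameter.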
May I suggest you switch your analysis over to the KSG estimator, which is considered best of breed these days? The only key parameter there is the number of nearest neighbours, to which the results are much more stable, and you can basically leave it at the default of 4 anyway (as suggested in the JIDT paper). Another advantage of this estimator is that it is bias-corrected, but this does mean that you will see negative values with it too (for the reasons described above).