A little help with CMI Estimators?

Kayson Fakhar

Mar 25, 2021, 2:50:15 PM
to IDTxl
Hi there,
I have a couple of conceptual questions. My friend is using JidtGaussianCMI and I'm using OpenCLKraskovCMI on the same dataset. The results are quite different: he gets a sparse network back, while I get all the connections, but the delays are very much off. For example, the actual lag we modelled in the network is 1, while the discovered lags are around 8 to 15 (the max lag is 15). They are also very noisy over runs; for the same connection from node A to node B I get 10 different values (from 3 to 15) after running the same pipeline 10 times.

Currently, I'm running the same analysis on a dataset in which the connections are stronger, to see if the variability is due to weak coupling among nodes, but I thought I'd ask for some expert opinions in the meantime. So here are the questions:

1. Is there any rule of thumb or convention for choosing CMI estimators? I went for OpenCLKraskovCMI because I thought it would be faster since it runs on the GPU, and I expected the results to be the same as JidtGaussianCMI based on this paper.

2. Do you have any idea why the discovered lags are this large? I realized that if the actual lag is larger than 4-5, the discovered lags start to make sense (mostly around 3-8, but still not accurate). If the actual lag is large, say 10, the discovered lags drift away again, so I was wondering whether this has something to do with the simulated network or whether, in general, we shouldn't expect to recover the real lags that often.

3. How much variability should I expect? I mean, is it normal to get different results every time, or should the results be the same given the same dataset?

Thanks again, and I hope the development team is having fun fixing the OpenCL GPU issue :D

p.wol...@gmail.com

Mar 30, 2021, 2:28:21 PM
to IDTxl
Hi Kayson,
you are definitely keeping us busy hunting that OpenCL bug :D I will let you know when I have an update here. So far, we have made sure the new code works on all our test data. I am now looking into your data and code.

Regarding your question 1): what data are you using to compare the estimators? Generally, the Gaussian estimator expects the data to be jointly Gaussian, while the Kraskov estimator makes no assumptions about the distribution. If the data are not jointly Gaussian, you may get rather different results from the Gaussian versus the Kraskov estimator. Most likely, the Gaussian estimator will detect fewer links if the data and/or couplings are (highly) non-linear, hence returning a sparser network (this is a bit speculative without knowing what data you are using). You may also want to have a look at our paper by Leonardo Novelli, which compares Gaussian and Kraskov estimators on linear and non-linear systems.
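If it helps: switching between the two estimators only changes one entry in the settings dictionary, so you can rerun the identical analysis with both and compare. A rough, untested sketch (the toy data and lag values are placeholders, adapt them to your setup):

import numpy as np
from idtxl.data import Data
from idtxl.multivariate_te import MultivariateTE

# Toy data: 5 processes x 1000 samples x 1 replication (placeholder only).
data = Data(np.random.rand(5, 1000, 1), dim_order='psr')

settings = {
    'cmi_estimator': 'JidtGaussianCMI',    # linear-Gaussian estimator
    # 'cmi_estimator': 'OpenCLKraskovCMI', # non-parametric Kraskov estimator on the GPU
    'max_lag_sources': 15,
    'min_lag_sources': 1,
}

results = MultivariateTE().analyse_network(settings=settings, data=data)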

Generally, the GPU estimator will not always be faster. There is also the JidtKraskovCMI estimator, which runs on the CPU but uses multithreading. Depending on how many CPU cores you have available, that may be a faster option than using the GPU (we have a little benchmarking section in the wiki that may help).
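To try the multithreaded CPU estimator, you would again only swap the estimator entry; if I remember the setting name correctly, the JIDT estimators also take a num_threads option (by default they should use all available cores), but please double-check the docs:

settings = {
    'cmi_estimator': 'JidtKraskovCMI',  # Kraskov estimator on the CPU, multithreaded
    'num_threads': 4,                   # assumed setting name, see the docs/wiki
    'max_lag_sources': 15,
    'min_lag_sources': 1,
}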

2) How many samples are you using for reconstructing the lag? In the past, we have noticed that reconstructing the delay requires quite a lot of samples (again, depending on what data you are looking at). By default, IDTxl returns the lag of the past sample with the highest CMI with the target as the reconstructed delay. Another possible criterion for reconstructing the delay would be to use the lag of the past sample with the minimum lag with respect to the target. Does this give more accurate results in your case?
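You can read the reconstructed delays per target off the results object; if I remember correctly, there is a helper for this (the target index below is just an example):

# Reconstructed source-target delays for one target, using the default
# criterion (the past sample with the maximum statistic).
delays = results.get_target_delays(target=2)
print(delays)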

Regarding point 3): some variability over runs on the same data is to be expected, depending on what you are doing. There is some non-deterministic behaviour, e.g., the random generation of surrogate data for the network inference algorithms. To get rid of these random parts, you may want to switch off the noise that the Kraskov estimator adds to the data, and you can pass a random seed when creating the data object (my_data = Data(seed=0)), which controls the generation of surrogate data.
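Roughly, that would look like this (the noise_level name is from memory, please double-check it in the estimator docs):

import numpy as np
from idtxl.data import Data

# Fixing the seed makes surrogate generation reproducible across runs.
my_data = Data(np.random.rand(5, 1000, 1), dim_order='psr', seed=0)

settings = {
    'cmi_estimator': 'OpenCLKraskovCMI',
    'max_lag_sources': 15,
    'min_lag_sources': 1,
    'noise_level': 0,  # assumed setting name: switch off the noise the Kraskov estimator adds
}

Let me know if that helped.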

Best, Patricia 
