First: since you're computing TE with the Gaussian estimator (setting aside for the moment whether you want it to recover the ground truth), you should expect it to give you essentially the same thing as the Granger causality in the page you initially point to, since TE with a linear Gaussian model is a scaled version of GC. If you print out your TE results for each pair, you will find they match what is shown in the left-hand part of Fig. 4.6 in the text you point to (I've verified this).
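For reference, the scaling is exact for jointly Gaussian variables (Barnett, Barrett and Seth, 2009): Granger causality is simply twice the transfer entropy, F_{X -> Y} = 2 T_{X -> Y}, so a Gaussian-estimator TE analysis ranks source-target pairs exactly as GC does.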
So, the estimators are doing what they've been asked to do here.
Next, you've always got to be careful about expecting TE to give you a precise match to known causality, because the two are simply not measuring the same thing (TE reports on the observed dynamics via a model, not on interventions). There's a lot written about that, e.g. see Leo's
latest. Here, though, as you say, we've got no hidden nodes etc., so our results should converge to the true structure as we get enough data points.
So, to answer your question about improving the results, I would suggest the following:
1. Switch to the nonlinear estimator ('cmi_estimator': 'JidtKraskovCMI'). You're currently running the linear Gaussian estimator on what is known to be a nonlinear interaction, so it will miss interactions in precisely the same way as that article shows GC does in comparison to nonlinear TE. The KSG estimator will take longer to run, but if you want the accurate result here you have no choice. You could try the discretised estimator as well; it will run faster and give you results akin to RTransferEntropy, but the KSG estimator is best of breed for nonlinear estimation. (All three settings from this list are pulled together in the sketch below.)
2. Allow more data points from the past of the target to be embedded, rather than just 1 (e.g. set 'max_lag_target': 5). The false positives that we see here (e.g. 2 -> 0) look to me like classic spurious reverse inferences due to inadequate embedding of the target, which Michael originally pointed out in
this 2011 paper.
This will be required whether you switch to the nonlinear estimator or not.
3. Set the parameter for a Theiler window, which avoids false positives arising from autocorrelation in the samples (non-independent samples otherwise bias the estimation). To do this, set e.g. 'theiler_t': 20; that should be enough. (I'm fairly sure this is what removed one of the false positives.)
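Here's a minimal sketch of how the three settings above slot into an IDTxl bivariate TE run. The random array, the source-lag values and the fdr flag are stand-ins/assumptions; substitute your own data and the source lags from your original run:

```python
import numpy as np
from idtxl.bivariate_te import BivariateTE
from idtxl.data import Data

# Stand-in for your coupled-map time series: 5 processes x 10000 samples.
my_array = np.random.randn(5, 10000)
data = Data(my_array, dim_order='ps')  # 'ps' = (processes, samples)

settings = {
    'cmi_estimator': 'JidtKraskovCMI',  # 1. nonlinear KSG estimator
    'max_lag_sources': 5,               # assumed; keep your original value
    'min_lag_sources': 1,               # assumed; keep your original value
    'max_lag_target': 5,                # 2. embed more of the target's past
    'theiler_t': 20,                    # 3. Theiler window vs. autocorrelation
}

network_analysis = BivariateTE()
results = network_analysis.analyse_network(settings=settings, data=data)
results.print_edge_list(weights='max_te_lag', fdr=True)
```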
When I run it with all of the above set, the results look much better: all of the causal links are picked up, and the only false positive is 1 -> 4 (or 2 -> 5 in the article's numbering). You can see that one coming through for RTransferEntropy with their discrete TE in Fig. 4.6b of the article as well. It comes through specifically because nodes 1 and 3 are driven in a very similar nonlinear fashion by node 0, so node 1 will contain information about node 3 (the true driver of node 4), hence the false positive.
If you want to remove redundancies like that, you would need to move to the full multivariate algorithm: a false positive that only comes through due to correlation with a stronger source (as with 1 -> 4 above, which appears because node 1 is correlated with node 3) will be conditioned out by that stronger source.
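In IDTxl that's just a change of analysis class; a sketch, reusing the settings and data from above:

```python
from idtxl.multivariate_te import MultivariateTE

# The multivariate algorithm conditions each candidate source on the
# sources already selected into the parent set, so a redundant link like
# 1 -> 4 is conditioned out by the stronger true source 3 -> 4.
network_analysis = MultivariateTE()
results = network_analysis.analyse_network(settings=settings, data=data)
results.print_edge_list(weights='max_te_lag', fdr=True)
```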
I've run this at my end with the multivariate algorithm plus the same additional settings as detailed above (KSG estimator, max_lag_target = 5, theiler_t = 20), and can confirm that we then get precisely the correct causal structure. Network image is attached.