Auto-embedding on data with missing values?

Lukas Paulun

Feb 14, 2021, 6:09:52 PM
to Java Information Dynamics Toolkit (JIDT) discussion
Hi all,

I have one specific and one general question regarding an analysis I am planning to do in JIDT and potentially IDTxl:

I would like to analyse tracking data of a locust swarm. We have a video of ~8 min at a frame rate of 25 Hz. Throughout the video there are a few thousand locusts, but each individual locust appears in the video for only ~40 s, i.e. ~1,000 frames.
The overlap of two locusts in the video is usually a few hundred frames, i.e. the joint time series have a few hundred valid samples.
Sometimes there are missing values, so I load the data via .setObservations(double[][] source, double[][] destination, boolean[] sourceValid, boolean[] destValid).
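(For context, here is a sketch of how one might construct the validity arrays that this signature expects, assuming missing tracking samples are coded as NaN; the class and method names are hypothetical, not part of JIDT:)

```java
// Sketch: derive the boolean validity arrays that
// setObservations(source, dest, sourceValid, destValid) expects,
// assuming missing tracking samples are stored as Double.NaN.
public class ValidityMask {
    /** A sample is valid only if every variable in it is a real number. */
    public static boolean[] buildMask(double[][] series) {
        boolean[] valid = new boolean[series.length];
        for (int t = 0; t < series.length; t++) {
            valid[t] = true;
            for (double v : series[t]) {
                if (Double.isNaN(v)) { valid[t] = false; break; }
            }
        }
        return valid;
    }

    public static void main(String[] args) {
        double[][] source = { {1.0, 2.0}, {Double.NaN, 3.0}, {4.0, 5.0} };
        System.out.println(java.util.Arrays.toString(buildMask(source)));
        // prints [true, false, true]
    }
}
```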

The first and specific question is the following:
I run into different kinds of trouble when trying to use AutoEmbedding. I saw in a previous post that data with missing values do not work with AutoEmbedding. Is this still the case? If yes, what would you suggest for handling such data?

All problems occur only if I set K_SEARCH_MAX above a certain value (how large that is depends on the type of estimator and auto-embedding method):
For a MultiVariateGaussian estimator and MAX_CORR_AIS_DEST_ONLY I get an error in MatrixUtils.CholeskyDecomposition(): java.lang.Exception: CholeskyDecomposition is only performed on symmetric matrices.

For Gaussian and Kraskov estimators and both types of auto embedding I sometimes get a NegativeArraySizeException: -1 in MatrixUtils.makeDelayEmbeddingVector() while setting the observations and sometimes an ArrayIndexOutOfBoundsException during computation of the TE.

Second, more general question:
Is it a problem that there are different overlaps for every pair of time series and that every individual overlap only contains a few hundred samples? I would assume that if we can justify averaging over different individuals (as Crosato et al., 2018 did), we would have enough samples, but I am not entirely sure about that.

I would also like to infer the effective network with IDTxl and I'm currently waiting to be accepted to the IDTxl Google Group. Maybe you can already leave a short comment on whether you think network inference would be feasible at all with IDTxl. Some things I could not find out:
  • Can IDTxl handle data with missing values?
  • What's the proper way to deal with these fragmented time series, where the whole video has 9,000 frames but each animal only appears in ~1,000 frames and has different overlaps with all its neighbours?
  • Is it computationally feasible to run IDTxl on a network of >1,000 nodes? All papers I have seen only performed effective network inference on very small networks.

Thanks a lot in advance - any help will be much appreciated!
Lukas



Joseph Lizier

Feb 24, 2021, 8:13:55 AM
to JIDT Discussion list
Hi Lukas,

Again, apologies for the late responses here.

In short, your intuition is good: using the setObservations() method signature that takes validity arrays for the time series is a good way to address the issue that your pairs are not observed throughout the whole recording.

With that said, I have developed specific scripts for handling such data from flocks/swarms, where we treat every pairwise interaction as samples from a representative pair (all pairs are assumed to behave homogeneously). This is the recommended way to compute transfer entropy for flocking/swarming data.
The scripts are distributed in the folder demos/octave/FlockingAnalysis (not in the latest distribution yet, but available in the repo itself), and described on the wiki page https://github.com/jlizier/jidt/wiki/Flocking
I won't describe it in any further detail here to save repetition; you can find all of the details on the wiki page.
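The "representative pair" idea can be sketched as follows: rather than one estimate per locust pair, every pair's overlapping, fully valid segment is added to a single pooled set of observations. (This is only an illustration of the pooling; in JIDT itself this would map onto adding observations incrementally to a single calculator, not onto this hypothetical class.)

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of pooling observations across pairs: every pair's overlapping
// (source, destination) segment contributes samples to one combined
// calculation, as if all pairs were realisations of a single
// representative pair.  Hypothetical helper class, not part of JIDT.
public class ObservationPool {
    private final List<double[]> sourceSegments = new ArrayList<>();
    private final List<double[]> destSegments = new ArrayList<>();
    private int totalSamples = 0;

    /** Add one pair's overlapping, fully valid segment. */
    public void addSegment(double[] src, double[] dst) {
        if (src.length != dst.length) {
            throw new IllegalArgumentException("segments must align in time");
        }
        sourceSegments.add(src);
        destSegments.add(dst);
        totalSamples += src.length;
    }

    public int numSegments() { return sourceSegments.size(); }
    public int numSamples()  { return totalSamples; }
}
```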

So your specific questions aren't so relevant once you move to those scripts, but I'm answering them here for completeness:
1. The issue is that we handle the validity array by splitting the main time series into many shorter, fully valid series and adding the observations obtained from each of them separately. However, some of these shorter series (where there are very few samples between two invalid points) are not long enough to supply a full set of k history samples. Some estimators handle this better than others; overall I need to improve this so that the condition is detected without exceptions being thrown.
2. It is definitely better to put the samples from all pairs into a single calculation as above.
Regarding network inference -- this is a step up in complexity that requires more research. As per my response on the IDTxl mailing list, I've got some good ideas on where to take this, but need the time / student to do so. IDTxl is not ready for this with swarms/flocks yet. To the other subqueries here:
a. IDTxl cannot yet handle data with missing values
b. Proper way is as per the scripts above.
c. Running IDTxl on 1,000 nodes is feasible, but depends on: having only short time series for each node at the moment (say a few hundred time points), using the faster estimators (e.g. discrete or Gaussian; KSG would be challenging time-wise), and having access to a powerful high-performance cluster. We have run ~200 nodes with 10,000 time points using KSG, parallelised over targets on a cluster; that took about a week. Timing would be similar for 1,000 nodes with a few hundred time points.
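The splitting behaviour described in point 1 can be sketched as below: the validity array carves the series into fully valid sub-series, and any sub-series shorter than k + 1 samples cannot supply even one (k-history, next-value) observation, so it has to be discarded rather than embedded. (Hypothetical helper for illustration, not the actual JIDT internals.)

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: split one time series into fully valid sub-series using a
// validity mask, discarding sub-series too short to provide a full
// k-sample history plus one target sample.
public class ValiditySplitter {
    public static List<double[]> splitValid(double[] series, boolean[] valid, int k) {
        List<double[]> segments = new ArrayList<>();
        int start = -1;
        for (int t = 0; t <= series.length; t++) {
            boolean v = (t < series.length) && valid[t];
            if (v && start < 0) {
                start = t;                    // a valid segment begins
            } else if (!v && start >= 0) {
                int len = t - start;
                if (len >= k + 1) {           // long enough for k history + target
                    double[] seg = new double[len];
                    System.arraycopy(series, start, seg, 0, len);
                    segments.add(seg);
                }                             // else: too short, drop it
                start = -1;
            }
        }
        return segments;
    }
}
```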

--joe
+61 408 186 901 (Au mobile)



--
JIDT homepage: https://github.com/jlizier/jidt