Understanding conns data.frame

Ben Yang

Apr 12, 2022, 5:49:40 PM

to cicero-users

Hello,

Thank you for providing this package! I'm curious about how to interpret the conns data.frame from a Cicero (v1.3.5) analysis. The raw conns object contains 2,464,222 rows, but the Peak2 column is a factor with only 67,671 unique levels (structure shown below).

'data.frame': 2464222 obs. of 3 variables:
$ Peak1 : chr "chr10_100015385_100016646" "chr10_100015385_100016646" "chr10_100015385_100016646" "chr10_100015385_100016646" ...
$ Peak2 : Factor w/ 67671 levels "chr1_3123454_3123784",..: 7649 7650 7651 7653 7654 7655 7656 7657 7658 7659 ...
$ coaccess: num 0 0 0 0 0 0 0 0 0 0 ...

Additionally, co-accessible sites appear to be mirrored across Peak1 and Peak2. For example, sites involving chr10_100015385_100016646 are duplicated in the data.frame (shown below). Is this expected behavior of the conns output? Do I actually have 67671 unique co-accessible sites in this dataset?

Thanks,

Ben

> conns[conns$Peak1=="chr10_100015385_100016646" | conns$Peak2=="chr10_100015385_100016646", ]

Peak1 Peak2 coaccess
1 chr10_100015385_100016646 chr10_99758219_99758499 0.000000000
2 chr10_100015385_100016646 chr10_99839834_99840768 0.000000000
3 chr10_100015385_100016646 chr10_99913044_99913264 0.000000000
5 chr10_100015385_100016646 chr10_100023303_100023577 0.000000000
6 chr10_100015385_100016646 chr10_100024752_100024956 0.000000000
7 chr10_100015385_100016646 chr10_100025220_100025447 0.000000000
8 chr10_100015385_100016646 chr10_100031095_100031360 0.000000000
9 chr10_100015385_100016646 chr10_100032627_100033077 0.000000000
10 chr10_100015385_100016646 chr10_100038550_100038851 0.000000000
11 chr10_100015385_100016646 chr10_100059224_100059607 0.000000000
12 chr10_100015385_100016646 chr10_100297744_100297997 0.005845331
13 chr10_100015385_100016646 chr10_100302896_100304170 0.127522820
14 chr10_100015385_100016646 chr10_100305417_100306526 0.000000000
15 chr10_100015385_100016646 chr10_100486653_100487843 0.410737768
19 chr10_100023303_100023577 chr10_100015385_100016646 0.000000000
34 chr10_100024752_100024956 chr10_100015385_100016646 0.000000000
49 chr10_100025220_100025447 chr10_100015385_100016646 0.000000000
64 chr10_100031095_100031360 chr10_100015385_100016646 0.000000000
79 chr10_100032627_100033077 chr10_100015385_100016646 0.000000000
94 chr10_100038550_100038851 chr10_100015385_100016646 0.000000000
109 chr10_100059224_100059607 chr10_100015385_100016646 0.000000000
121 chr10_100297744_100297997 chr10_100015385_100016646 0.005845331
135 chr10_100302896_100304170 chr10_100015385_100016646 0.127522820
149 chr10_100305417_100306526 chr10_100015385_100016646 0.000000000
163 chr10_100486653_100487843 chr10_100015385_100016646 0.410737768
151257 chr10_99758219_99758499 chr10_100015385_100016646 0.000000000
151279 chr10_99839834_99840768 chr10_100015385_100016646 0.000000000
151301 chr10_99913044_99913264 chr10_100015385_100016646 0.000000000

Daniel Gingerich

Apr 15, 2022, 10:51:31 AM

to cicero-users

This is my understanding:

The Cicero conns data frame is similar to a correlation matrix stored in long form, i.e. columns 1 (Peak1) and 2 (Peak2) are the (i, j) indices of the matrix. It's not exactly a correlation matrix, but I find it easier to think of it this way.

As for the duplicated co-accessibility scores, there might be two reasons.

1) Search for "Reconciling overlapping local co-accessibility maps" in the Cicero paper: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6582963/. Since the windows overlap, peak pairs often have two co-accessibility calculations. The mean value is used in this case.

2) The other reason is that correlation matrices are symmetric across the diagonal: cor(a, b) = cor(b, a). See here: https://www.vertica.com/wp-content/uploads/2019/09/corr_matrix_Titanic.png. This might explain why the peaks are mirrored/reversed in the data frame.
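
To make the long-form analogy concrete, here is a rough base R sketch. The peak names and scores are made up for illustration, and this is not how Cicero builds conns internally; it only shows why a symmetric matrix lists every pair twice once it is flattened.

```r
# Toy symmetric score matrix for three hypothetical peaks (values are made up).
peaks <- c("chr1_100_200", "chr1_300_400", "chr1_500_600")
m <- matrix(c(1.0, 0.4, 0.0,
              0.4, 1.0, 0.1,
              0.0, 0.1, 1.0),
            nrow = 3, dimnames = list(peaks, peaks))

# Flatten to long form: one row per (i, j) cell of the matrix.
long <- as.data.frame(as.table(m), stringsAsFactors = FALSE)
names(long) <- c("Peak1", "Peak2", "coaccess")

# Drop the diagonal (a peak paired with itself).
long <- long[long$Peak1 != long$Peak2, ]

# Each unordered pair now appears twice, once as (a, b) and once as (b, a),
# with the same score in both rows.
print(long)
```

With three peaks there are three unordered pairs, so the long table has six rows.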

hpl...@gmail.com

Apr 17, 2022, 10:08:52 AM

to cicero-users

Seconding Daniel's answer.

I'll add that the conns table contains every tested pair of sites along with its score, so you won't know how many pairs of co-accessible sites you have until you subset on the scores (at least above 0). Because each site is tested against every other site within the window, there will be lots of duplicates in each Peak column.

As for the mirroring, all pairs are included in both directions (i.e. peak1 with peak2 and peak2 with peak1). This is merely for convenience, so that you can subset more easily on pairs based on other criteria.
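
If you want an actual count, something along these lines should work. This is only a sketch on a toy table: conns_toy and the cutoff of 0 are placeholders I made up, so swap in your own conns object and whatever threshold suits your data.

```r
# Toy stand-in for conns (made-up values); each pair is listed in both directions.
conns_toy <- data.frame(
  Peak1 = c("chr10_1_100", "chr10_200_300", "chr10_1_100",
            "chr10_400_500", "chr10_200_300", "chr10_400_500"),
  Peak2 = c("chr10_200_300", "chr10_1_100", "chr10_400_500",
            "chr10_1_100", "chr10_400_500", "chr10_200_300"),
  coaccess = c(0.41, 0.41, 0, 0, 0.13, 0.13),
  stringsAsFactors = FALSE
)

cutoff <- 0  # hypothetical threshold; choose your own

# Peak2 can be a factor (as in the str() output above), so compare as character.
p1 <- as.character(conns_toy$Peak1)
p2 <- as.character(conns_toy$Peak2)

# Keep pairs scoring above the cutoff, and only one direction of each mirrored
# pair (the row where Peak1 sorts before Peak2). Rows with NA scores are dropped.
keep <- !is.na(conns_toy$coaccess) & conns_toy$coaccess > cutoff & p1 < p2
unique_pairs <- conns_toy[keep, ]

n_pairs <- nrow(unique_pairs)                     # unique co-accessible pairs
n_sites <- length(unique(c(p1[keep], p2[keep])))  # distinct sites in those pairs
```

The p1 < p2 comparison is just a convenient way to pick one row from each mirrored pair; which direction gets kept doesn't matter.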

Best,

Hannah

Ben Yang

Apr 29, 2022, 12:53:52 AM

to cicero-users

Thank you for the replies, this makes sense to me!
