Understanding conns data.frame


Ben Yang

Apr 12, 2022, 5:49:40 PM
to cicero-users
Hello, 

Thank you for providing this package! I'm curious about how to interpret the conns data.frame from a Cicero (v1.3.5) analysis. The raw conns object contains 2,464,222 rows, but the Peak2 column is a factor with only 67,671 unique levels (structure shown below).

'data.frame':    2464222 obs. of  3 variables:
 $ Peak1   : chr  "chr10_100015385_100016646" "chr10_100015385_100016646" "chr10_100015385_100016646" "chr10_100015385_100016646" ...
 $ Peak2   : Factor w/ 67671 levels "chr1_3123454_3123784",..: 7649 7650 7651 7653 7654 7655 7656 7657 7658 7659 ...
 $ coaccess: num  0 0 0 0 0 0 0 0 0 0 ...

Additionally, co-accessible sites appear to be mirrored across Peak1 and Peak2. For example, every pair involving chr10_100015385_100016646 appears twice in the data.frame, once in each direction (shown below). Is this the expected behavior of the conns output? Do I actually have 67,671 unique co-accessible sites in this dataset?

Thanks,
Ben

> conns[conns$Peak1=="chr10_100015385_100016646" | conns$Peak2=="chr10_100015385_100016646", ]
                           Peak1                     Peak2    coaccess
1      chr10_100015385_100016646   chr10_99758219_99758499 0.000000000
2      chr10_100015385_100016646   chr10_99839834_99840768 0.000000000
3      chr10_100015385_100016646   chr10_99913044_99913264 0.000000000
5      chr10_100015385_100016646 chr10_100023303_100023577 0.000000000
6      chr10_100015385_100016646 chr10_100024752_100024956 0.000000000
7      chr10_100015385_100016646 chr10_100025220_100025447 0.000000000
8      chr10_100015385_100016646 chr10_100031095_100031360 0.000000000
9      chr10_100015385_100016646 chr10_100032627_100033077 0.000000000
10     chr10_100015385_100016646 chr10_100038550_100038851 0.000000000
11     chr10_100015385_100016646 chr10_100059224_100059607 0.000000000
12     chr10_100015385_100016646 chr10_100297744_100297997 0.005845331
13     chr10_100015385_100016646 chr10_100302896_100304170 0.127522820
14     chr10_100015385_100016646 chr10_100305417_100306526 0.000000000
15     chr10_100015385_100016646 chr10_100486653_100487843 0.410737768
19     chr10_100023303_100023577 chr10_100015385_100016646 0.000000000
34     chr10_100024752_100024956 chr10_100015385_100016646 0.000000000
49     chr10_100025220_100025447 chr10_100015385_100016646 0.000000000
64     chr10_100031095_100031360 chr10_100015385_100016646 0.000000000
79     chr10_100032627_100033077 chr10_100015385_100016646 0.000000000
94     chr10_100038550_100038851 chr10_100015385_100016646 0.000000000
109    chr10_100059224_100059607 chr10_100015385_100016646 0.000000000
121    chr10_100297744_100297997 chr10_100015385_100016646 0.005845331
135    chr10_100302896_100304170 chr10_100015385_100016646 0.127522820
149    chr10_100305417_100306526 chr10_100015385_100016646 0.000000000
163    chr10_100486653_100487843 chr10_100015385_100016646 0.410737768
151257   chr10_99758219_99758499 chr10_100015385_100016646 0.000000000
151279   chr10_99839834_99840768 chr10_100015385_100016646 0.000000000
151301   chr10_99913044_99913264 chr10_100015385_100016646 0.000000000

Daniel Gingerich

Apr 15, 2022, 10:51:31 AM
to cicero-users
This is my understanding: 

The Cicero conns table is similar to a correlation matrix in compressed long form, i.e. columns 1 (Peak1) and 2 (Peak2) are the (i, j) indices of the matrix. It's not exactly a correlation matrix, but I find it easier to think of it this way.

As for the duplicate co-accessibilities, there might be two reasons.
1) Ctrl+F "Reconciling overlapping local co-accessibility maps" in the Cicero paper: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6582963/. Since the windows overlap, a peak pair often has two co-accessibility calculations. The mean value is used in this case.
2) The other reason is that correlation matrices are symmetric across the diagonal: cor(a, b) = cor(b, a). See here: https://www.vertica.com/wp-content/uploads/2019/09/corr_matrix_Titanic.png. This might explain why the peaks are mirrored/reversed in the data frame.
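
To make the "long-form matrix" view concrete, here is a minimal base R sketch. It does not call Cicero at all: conns_toy is an invented stand-in with the same three columns as conns, and the peak names and scores are made up for illustration. It reshapes the long table into a square peak-by-peak matrix and checks whether the (i, j) and (j, i) scores agree.

```r
# Toy stand-in for a Cicero conns table (peak names and scores are invented).
conns_toy <- data.frame(
  Peak1 = c("chr1_100_200", "chr1_300_400", "chr1_100_200", "chr1_500_600"),
  Peak2 = c("chr1_300_400", "chr1_100_200", "chr1_500_600", "chr1_100_200"),
  coaccess = c(0.41, 0.41, 0.00, 0.00),
  stringsAsFactors = FALSE
)

# Build an empty square matrix with one row and one column per peak.
peaks <- sort(unique(c(conns_toy$Peak1, conns_toy$Peak2)))
mat <- matrix(NA_real_, nrow = length(peaks), ncol = length(peaks),
              dimnames = list(peaks, peaks))

# Fill cell (Peak1, Peak2) with the score from each row of the long table.
mat[cbind(conns_toy$Peak1, conns_toy$Peak2)] <- conns_toy$coaccess

# TRUE when every (i, j) score equals its (j, i) counterpart.
is_mirrored <- isTRUE(all.equal(mat, t(mat)))
print(is_mirrored)
```

On a real conns object with millions of rows, a dense matrix like this would likely be too large, so the same check is better run on a single chromosome or a small subset of peaks.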

hpl...@gmail.com

Apr 17, 2022, 10:08:52 AM
to cicero-users
Seconding Daniel's answer.

I'll add that the conns table contains every tested pair of sites along with its score, so you won't know how many pairs of co-accessible sites you have until you subset on the scores (at least above 0). Because each site is tested against every other site within the window, each Peak column will contain many repeated values.

As for the mirroring, all pairs are included in both directions (i.e. peak1 with peak2 and peak2 with peak1). This is merely for convenience, so that you can subset more easily on pairs based on other criteria. 
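
A minimal base R sketch of how one might count co-accessible pairs given the two points above: subset on score, then keep one row per mirrored pair. It runs on an invented conns_toy table rather than real Cicero output, and the cutoff of 0 is only a placeholder assumption; pick a threshold that suits your data.

```r
# Toy stand-in for a Cicero conns table (peak names and scores are invented).
conns_toy <- data.frame(
  Peak1 = c("chr1_100_200", "chr1_300_400", "chr1_100_200",
            "chr1_500_600", "chr1_300_400", "chr1_500_600"),
  Peak2 = factor(c("chr1_300_400", "chr1_100_200", "chr1_500_600",
                   "chr1_100_200", "chr1_500_600", "chr1_300_400")),
  coaccess = c(0.41, 0.41, 0.00, 0.00, 0.13, 0.13),
  stringsAsFactors = FALSE
)

# Placeholder threshold (an assumption, not a recommended value).
cutoff <- 0

# Keep only tested pairs whose score is above the cutoff, dropping any NAs.
linked <- conns_toy[!is.na(conns_toy$coaccess) & conns_toy$coaccess > cutoff, ]

# Order the two peak names within each row so that A-B and B-A share one key.
p1 <- as.character(linked$Peak1)
p2 <- as.character(linked$Peak2)
pair_key <- paste(pmin(p1, p2), pmax(p1, p2), sep = "|")

# Keep the first row seen for each unordered pair.
unique_pairs <- linked[!duplicated(pair_key), ]
n_pairs <- nrow(unique_pairs)
print(n_pairs)
```

Replacing conns_toy with your own conns object should give the number of distinct pairs above the chosen cutoff, which is a different quantity from the number of factor levels in Peak2 (that is the number of distinct peaks appearing in that column).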

Best,
Hannah

Ben Yang

Apr 29, 2022, 12:53:52 AM
to cicero-users
Thank you for the replies, this makes sense to me!