Di2multi

Enrichetta

Aug 4, 2024, 3:12:23 PM

to flavtherlaicon

I am currently using the excellent R packages 'phangorn' and 'ape' to do some parsimony-based phylogenetic analysis with the 'pratchet' function (parsimony ratchet), and I have encountered a problem with nodes that I feel should probably be collapsed into polytomies.

At present, the bootstrapped trees I am producing contain several nodes with bootstrap support of either zero or close to zero (that is, the trees contain forced bifurcations). I am calculating the mean bootstrap score for each tree (I am using numerous variants of the alignment to see which produces the best-supported tree on average), and I have handled nodes with a value of zero by dividing the sum of the node support scores by the number of non-zero nodes. The problem is that nodes with negligible but non-zero support (i.e. less than 10) are still counted, so a tree whose weak nodes score exactly zero ends up with a much higher mean than a comparable tree whose weak nodes score just above zero.

A simpler solution would be to collapse nodes with bootstrap scores below a given value into polytomies. This way I could calculate the mean node support in a straightforward manner without worrying about zero nodes or nodes with unacceptably low support.
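To make the effect concrete, here is a minimal sketch in base R with made-up support values (the vectors and function names are hypothetical, not taken from the analysis above). It compares the "divide by the number of non-zero nodes" mean with a mean taken only over nodes at or above a cutoff, which is what the mean would be once weaker nodes are collapsed.

```r
# Hypothetical bootstrap supports (percent) for the internal nodes of two trees.
tree_a <- c(95, 80, 60, 5, 3)  # two weak nodes with negligible support
tree_b <- c(95, 80, 60, 0, 0)  # the same two nodes, but with zero support

# Mean over non-zero nodes only: sum of supports / number of non-zero nodes.
mean_nonzero <- function(support) sum(support) / sum(support > 0)

# Mean over nodes at or above a cutoff, i.e. the nodes that would survive collapsing.
mean_above <- function(support, threshold) mean(support[support >= threshold])

mean_nonzero(tree_a)    # 243 / 5 = 48.6
mean_nonzero(tree_b)    # 235 / 3, about 78.3
mean_above(tree_a, 10)  # 235 / 3 for both trees
mean_above(tree_b, 10)
```

With a cutoff, the two trees are scored identically, so the comparison no longer depends on whether a weak node happens to score exactly zero.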

Which tree are you placing support values on? A majority-rule consensus tree? If this is the case, I would read the tree into R, identify branches with low support values, and replace those branch lengths with a value below the tolerance (the tol argument) passed to di2multi(). If you need a soft polytomy (i.e., still bifurcating but with zero-length branches), I would then randomly resolve polytomies using multi2di(), again in ape.
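A minimal sketch of that suggestion, assuming the support values are stored as node labels on a tree that already has branch lengths. The Newick string, the cutoff of 50, and the variable names are made up for illustration.

```r
library(ape)

# Hypothetical five-taxon tree; node labels hold bootstrap support (percent).
tree <- read.tree(text = "((A:1,B:1)90:1,((C:1,D:1)5:0.5,E:1)60:1);")

# Internal nodes are numbered Ntip + 1 ... Ntip + Nnode, in the same order as
# tree$node.label. The unlabeled root becomes NA and is never selected.
support <- suppressWarnings(as.numeric(tree$node.label))
low <- which(!is.na(support) & support < 50) + Ntip(tree)

# Set the branches leading to poorly supported nodes to zero length ...
tree$edge.length[tree$edge[, 2] %in% low] <- 0

# ... and collapse every internal branch shorter than tol into a polytomy.
collapsed <- di2multi(tree, tol = 1e-8)

# Optional: resolve the polytomies again at random, using zero-length branches.
resolved <- multi2di(collapsed, random = TRUE)
```

Here only the (C,D) node falls below the cutoff, so `collapsed` should have one internal node fewer than `tree`.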

Thank you for this, Brice. I will give di2multi() a try. After spending some time reading over the code, what I have is a parsimony ratchet tree that is then tested for support by bootstrapping, rather than a majority-rule consensus built after bootstrapping. This is potentially why some of the node support scores are so very low (I would not expect such low scores with a majority-rule consensus). On a side note, I will take a look at what can be done about producing a consensus tree from this data and how it might alter my results.

Is there another way that this can be done without needing to generate edge lengths? In other words, rather than creating polytomies based on edge lengths shorter than a tolerable value, is it possible to collapse nodes that simply have a poor bootstrap support regardless of edge length?
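One possible workaround, offered as a sketch rather than an established recipe: di2multi() needs branch lengths, so give the tree temporary dummy lengths, zero out the branches leading to weak nodes, collapse, and then drop the dummy lengths again. The helper name `collapse_by_support` and the example tree are hypothetical, and support values are assumed to be stored as node labels.

```r
library(ape)

# Collapse internal nodes whose support (stored in node.label) is below
# `threshold`, ignoring any real branch lengths. Returns a tree without
# branch lengths.
collapse_by_support <- function(tree, threshold) {
  support <- suppressWarnings(as.numeric(tree$node.label))
  low <- which(!is.na(support) & support < threshold) + Ntip(tree)

  # Temporary dummy lengths: 1 everywhere, 0 on branches leading to weak nodes.
  tree$edge.length <- rep(1, nrow(tree$edge))
  tree$edge.length[tree$edge[, 2] %in% low] <- 0

  out <- di2multi(tree, tol = 0.5)
  out$edge.length <- NULL  # remove the dummy lengths again
  out
}

# Hypothetical topology-only tree with bootstrap supports as node labels.
tree <- read.tree(text = "((A,B)90,((C,D)5,E)60);")
collapsed <- collapse_by_support(tree, 50)

# Mean support over the nodes that remain: (90 + 60) / 2 = 75
mean(as.numeric(collapsed$node.label), na.rm = TRUE)
```

Because the dummy lengths replace whatever lengths the tree had, this variant suits topology-only trees such as a parsimony ratchet result; keep a copy of the original tree if the real branch lengths matter.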

It has been years since the question was posted, but if anyone is still looking for an easy answer, iTOL allows you to delete branches whose bootstrap value falls below a threshold that you specify. You can export the resulting tree in any graphical or text format (SVG, Newick, PDF, etc.).

I just posted some code that collapses internal branches of zero length (or, more specifically, branches with length shorter than some arbitrarily specified value tol) to polytomies for trees with mapped discrete characters created using (for instance) make.simmap or read.simmap. This is functionally the same as di2multi in the ape package (I programmed it totally differently, hopefully not at the peril of users); however, it works for modified "phylo" objects with a mapped discrete trait.

The main reason this might be important is that densityMap runs into problems when your tree has internal edges that are very short. This can be addressed by first collapsing all zero/small branches in each of the stochastically mapped trees, and then computing the density map on the collapsed stochastically mapped trees. Code for that would look like the following:

Monitoring SARS-CoV-2 spread and evolution through genome sequencing is essential in handling the COVID-19 pandemic. Here, we sequenced 892 SARS-CoV-2 genomes collected from patients in Saudi Arabia from March to August 2020. We show that two consecutive mutations (R203K/G204R) in the nucleocapsid (N) protein are associated with higher viral loads in COVID-19 patients. Our comparative biochemical analysis reveals that the mutant N protein displays enhanced viral RNA binding and differential interaction with key host proteins. We found increased interaction of GSK3A kinase simultaneously with hyper-phosphorylation of the adjacent serine site (S206) in the mutant N protein. Furthermore, the host cell transcriptome analysis suggests that the mutant N protein produces dysregulated interferon response genes. Here, we provide crucial information in linking the R203K/G204R mutations in the N protein to modulations of host-virus interactions and underline the potential of the nucleocapsid protein as a drug target during infection.

The emergence of novel severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), which causes the respiratory coronavirus disease 2019 (COVID-19), resulted in a pandemic that has triggered an unparalleled public health emergency [1,2]. The global spread of SARS-CoV-2 depended fundamentally on human mobility patterns. This is highly pertinent to a country like the Kingdom of Saudi Arabia, which as of 22nd February 2021 had a total of 374,691 cases and 6457 deaths [3]. The kingdom frequently experiences major population movements, particularly religious mass gatherings. For instance, during Umrah and Hajj roughly 9.5 million pilgrims visit two Islamic sites in Makkah and Madinah annually [4,5]. The Ministry of Health takes public health measures to keep the pilgrims safe, and major outbreaks have by and large been avoided in recent years. Further, an estimated 5 million Shiite Saudi nationals travel to Iran for pilgrimage, which became an early source of COVID-19 infections in the region [5,6]. This movement was reflected in the early phase of COVID-19 transmission within Saudi Arabia, as the first case was officially reported in Qatif (Eastern Region) on March 2nd, 2020 [7].

Fig. 1. a Locations of the sampling cities within Saudi Arabia. b Stacked bars showing the numbers of samples retrieved from the 4 cities and the Eastern Region during the first six months of the pandemic. Cities are colored as in panel a. Months are shown at the bottom of the figure, and each month is divided into 5-day intervals. New daily cases for the city of Khobar are shown on the Eastern Region plot. Major restrictions imposed by the Ministry of Health and by Royal decrees are indicated above the plots. c Stacked bars showing the average numbers of new daily cases in sampling cities (Supplementary Note 1). d Estimate of the effective reproduction number [Rt] over time in Saudi Arabia (top) and the estimate of the effective population size [Ne], the relative population size required to produce the diversity seen in the sample (bottom). Central black lines show median estimates, and gray areas denote the 95% credible intervals. The horizontal red line represents an R of 1, the level required to sustain epidemic growth.

We sequenced and assembled SARS-CoV-2 genomes from 892 patient samples. This group includes 144 patients who were placed in quarantine and either had mild symptoms or were asymptomatic. The remaining patients were all hospitalized (Supplementary Table S1). Data on comorbidities were available for 689 patients, with diabetes (39%) and hypertension (35%) being the most common (Supplementary Table S2). Patient outcome data were available for 850 samples, and 199 patients (23%) died during hospitalization (Supplementary Table S1).

From the 892 assembled viral genomes collected over a period of 6 months, we found a total of 836 single-nucleotide polymorphisms (SNPs) compared to the SARS-CoV-2 Wuhan-Hu-1 isolate reference (GenBank accession: NC_045512) (Supplementary Fig. S2). The observed numbers of SNPs relative to the reference sequence are in general lower than the numbers observed in global samples, but with the exception of a period from mid-June to late July, the average number of SNPs in Saudi samples is within one standard deviation of samples deposited in GISAID (Supplementary Fig. S3). We further detected 41 indels, of which 26 reside in coding regions (Supplementary Table S3). Most indels were specific to a single sample, and no identical indel was found in more than four samples. Compared with global SNP data, seven SNPs were found in higher frequencies (absolute difference > 0.1) in samples from Saudi Arabia (Supplementary Fig. S2). These include the Spike protein D614G (A23403G) and three consecutive SNPs (G28881A, G28882A, and G28883C) causing the R203K and G204R changes in the nucleocapsid protein.

Together with all sequences from Saudi Arabia available on GISAID on December 31st 2020, the assembled sequences were used to estimate the effective population size and growth rate of SARS-CoV-2 over the course of the first wave of the epidemic. The skygrowth model [16] (Fig. 1d) shows a downward trend in the effective reproduction number (R(t)) over time with the timely introduction and maintenance of effective non-pharmaceutical interventions by the Saudi Ministry of Health. Although the bounds of the R(t) estimate include one for much of the study period, we can infer a decrease in R(t) over time until the lifting of restrictions in late June (Fig. 1d). By investigating the 1.6 million individual traces from the skygrowth analysis, we can infer a date of 27th April 2020 as the first point where over 50% of collected traces resulted in an R(t) estimate below one, suggesting an epidemic in decline for the first time.
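The "first point where over 50% of traces fall below one" criterion can be illustrated with a small base R sketch. The matrix below is toy data invented for illustration, not output from the study, and the variable names are hypothetical; rows stand for individual traces and columns for consecutive time points.

```r
# Toy R(t) traces: 4 traces (rows) at 4 consecutive time points (columns).
rt <- rbind(
  c(1.3, 1.1, 0.9, 0.8),
  c(1.2, 1.0, 0.9, 0.9),
  c(1.1, 0.9, 1.1, 0.7),
  c(1.4, 1.2, 0.8, 1.0)
)

# Fraction of traces with R(t) below one at each time point.
frac_below_one <- colMeans(rt < 1)  # 0, 0.25, 0.75, 0.75

# Index of the first time point at which more than half of the traces are below one.
first_point <- which(frac_below_one > 0.5)[1]  # 3
```

Mapping `first_point` back to a calendar date would need the time grid that goes with the traces, which is not shown here.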

The effective population size (Ne) represents the relative diversity of the sequences collected in Saudi Arabia over the course of the outbreak (Fig. 1d). The model predicts a peak in viral diversity at the beginning of June. This is ahead of the peak number of cases reported nationally and is likely influenced by the earlier peak in reported cases in the three cities, which contribute the most viral sequences to this analysis (Madinah, Makkah, and Jeddah).