Summary Of Mine Boy By Peter Abrahams Pdf

Prisc Chandola

Aug 4, 2024, 10:29:15 PM

to cenfillroro

Understanding the relationship between the millions of functional DNA elements and their protein regulators, and how they work in conjunction to manifest diverse phenotypes, is key to advancing our understanding of the mammalian genome. Next-generation sequencing technology is now widely used to probe these protein-DNA interactions and to profile gene expression at a genome-wide scale. As the cost of DNA sequencing continues to fall, interpreting the ever-increasing amount of data generated represents a considerable challenge.

The workflow of ngs.plot is depicted in Figure 1. Initially, ngs.plot searches through its database to find the genomic coordinates for the desired regions and uses them to query the alignment files of an NGS dataset. It then calculates the coverage vectors for each query region based on the retrieved alignments. It finally performs normalization and transformation on the coverage and generates two plots. One plot is an average profile that is generated from the mean of all regions. This plot provides the overall pattern at the regions of interest. The other plot is a heatmap that shows the enrichment of each region across the genome using color gradients. The heatmap can provide three-dimensional details (enrichment, region, and position) of the NGS samples under study.

Figure 1. The workflow of an ngs.plot run. The functional elements in the database are classified by type, such as TSS, CGI, enhancer, and DHS. The genomic coordinates of the functional elements are used to query a BAM file, which is indexed by an R-tree-like data structure. Coverage vectors are calculated from the retrieved alignments and are then represented as average profiles or heatmaps.
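The coverage and average-profile steps of this workflow can be illustrated with a minimal sketch. It is not ngs.plot's own code: the function names are hypothetical, alignments are assumed to be simple half-open (start, end) intervals, and all regions are assumed to have the same length so that they can be averaged position by position.

```python
def coverage_vector(region_start, region_end, alignments):
    """Per-base coverage over the half-open region [region_start, region_end).

    `alignments` is an iterable of half-open (start, end) intervals
    (an assumption made for this sketch).
    """
    cov = [0] * (region_end - region_start)
    for a_start, a_end in alignments:
        # Clip each alignment to the region before counting.
        lo = max(a_start, region_start)
        hi = min(a_end, region_end)
        for pos in range(lo, hi):
            cov[pos - region_start] += 1
    return cov


def average_profile(vectors):
    """Position-wise mean across equal-length coverage vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


region_cov = coverage_vector(10, 15, [(8, 12), (11, 14)])
profile = average_profile([[1, 2], [3, 4]])
```

Each row of the heatmap would correspond to one such coverage vector, while the average profile collapses all rows into a single curve.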

ngs.plot can accurately calculate coverage for RNA-seq (Figure 2A). RNA-seq experiments are unique because the short reads are derived from messenger RNAs and other expressed RNAs, many of which result from exon splicing. The ngs.plot database contains the exon coordinates for each transcript so that the coverage vectors for exons are concatenated to simulate RNA splicing in silico.
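The exon concatenation described above might look like the following sketch. The data layout (a dictionary from genomic position to depth, exons as half-open intervals) and the function name are assumptions for illustration only. Reversing the result for minus-strand transcripts is likewise an assumption that the source does not spell out.

```python
def spliced_coverage(genome_cov, exons, strand="+"):
    """Concatenate exon coverage to mimic a spliced transcript.

    genome_cov: dict mapping genomic position -> depth (hypothetical layout).
    exons: list of half-open (start, end) intervals for one transcript.
    """
    out = []
    for start, end in sorted(exons):
        # Positions without any reads count as zero coverage.
        out.extend(genome_cov.get(pos, 0) for pos in range(start, end))
    if strand == "-":
        # Assumption: report minus-strand transcripts in 5'-to-3' order.
        out.reverse()
    return out


cov = spliced_coverage({100: 3, 101: 4, 200: 5}, [(200, 202), (100, 102)])
```

Intronic positions never enter the output vector, so the profile reflects the mature transcript rather than the genomic locus.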

ngs.plot can also calculate the log2 ratios for one sample vs. another and display the values using two different colors in a heatmap. This is a very useful feature for ChIP-seq where a target sample is often contrasted with a control sample to determine bona fide differences in enrichment.
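A log2 ratio between two coverage vectors can be sketched as below. The pseudocount is an assumption added here to avoid division by zero and the logarithm of zero. The source does not say how ngs.plot handles positions with no coverage.

```python
import math


def log2_ratio(target, control, pseudocount=1.0):
    """Position-wise log2(target / control) for two equal-length vectors.

    The pseudocount is an illustrative choice, not a documented ngs.plot setting.
    """
    return [
        math.log2((t + pseudocount) / (c + pseudocount))
        for t, c in zip(target, control)
    ]


ratios = log2_ratio([3, 0, 1], [1, 0, 3])
```

Positive values indicate enrichment in the target sample and negative values indicate enrichment in the control, which is what the two heatmap colors encode.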

We have implemented several approaches to generate average profiles. Besides mean values, the standard error of the mean (SEM) across the regions is calculated and shown as a semi-transparent shade around the mean curve. This gives users a sense of statistical significance when two samples are compared. Because the mean is sensitive to extreme values, which can distort average profiles, we implemented robust statistics as an optional feature: a certain percentage of the extreme values is removed before the average is taken. Curve smoothing is also available as an option to remove spikes from average profiles, and is controlled by the moving window size. Heatmaps can be tuned with custom color scales and color saturation.
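The three statistics mentioned above (SEM, a trimmed mean, and moving-window smoothing) can be sketched with the Python standard library. This is an illustration, not ngs.plot's implementation. In particular, trimming the same fraction from both tails and using a centered window that shrinks at the edges are assumptions.

```python
import math
import statistics


def mean_and_sem(values):
    """Mean and standard error of the mean for one profile position."""
    mean = statistics.mean(values)
    if len(values) < 2:
        return mean, 0.0
    return mean, statistics.stdev(values) / math.sqrt(len(values))


def trimmed_mean(values, fraction):
    """Mean after dropping `fraction` of the values from each tail (assumed scheme)."""
    if not 0 <= fraction < 0.5:
        raise ValueError("fraction must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * fraction)
    return statistics.mean(ordered[k:len(ordered) - k])


def moving_average(profile, window):
    """Centered moving average; the window shrinks near the ends of the profile."""
    half = window // 2
    smoothed = []
    for i in range(len(profile)):
        lo = max(0, i - half)
        hi = min(len(profile), i + half + 1)
        smoothed.append(sum(profile[lo:hi]) / (hi - lo))
    return smoothed


robust = trimmed_mean([1, 2, 3, 4, 100], 0.2)
smooth = moving_average([0, 3, 0], 3)
```

In the trimmed example, the single outlier of 100 is discarded, so the result reflects the bulk of the regions rather than one extreme value.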

Hierarchical clustering. This method groups the most similar regions together first followed by the less similar ones. This process is performed repeatedly from bottom up until all regions are included in the grouping to form a tree-like structure. When dealing with multiple NGS samples, the clustering is applied to all of them together.

Difference. Regions are ranked by the difference of sums between two NGS samples. When two marks are mutually exclusive, such as H3K27ac and H3K27me3, this algorithm can maximize the appearance of such relationships.

Principal component analysis (PCA). PCA is performed on all NGS samples and then the first component is used to rank regions, which captures the largest proportion of the variance. This algorithm is complementary to the above mentioned methods.
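Two of the ranking strategies above, difference and PCA, can be sketched in plain Python. This is a hedged illustration: the function names are hypothetical, each region is assumed to be one row of coverage values (with multiple samples concatenated along the row for PCA), and power iteration stands in for whatever PCA routine ngs.plot actually uses. The sign of a principal component is arbitrary, so it is fixed here so that the loadings sum to a non-negative value.

```python
import math


def rank_by_difference(sample_a, sample_b):
    """Region indices ordered by sum(a) - sum(b), largest difference first."""
    diffs = [sum(a) - sum(b) for a, b in zip(sample_a, sample_b)]
    return sorted(range(len(diffs)), key=lambda i: diffs[i], reverse=True)


def rank_by_first_pc(rows, iterations=200):
    """Region indices ordered by their score on the first principal component."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the columns (d x d).
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    # Power iteration; a non-uniform start makes an orthogonal start less likely.
    v = [float(j + 1) for j in range(d)]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    if sum(v) < 0:
        v = [-x for x in v]
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return sorted(range(n), key=lambda i: scores[i], reverse=True)


order_diff = rank_by_difference([[5, 5], [1, 1], [3, 3]], [[0, 0], [4, 4], [3, 3]])
order_pca = rank_by_first_pc([[1, 1], [3, 3], [2, 2]])
```

The difference ranking places regions dominated by the first sample at the top and those dominated by the second at the bottom, which is what makes mutually exclusive marks stand out in a heatmap.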

In a multi-plot, an arbitrary number of plots can be combined into one figure and each plot can represent an NGS sample at a subset of the entire genomic region; a configuration file can be used to describe this combination. The configuration is a TAB-delimited text file where the first column contains the alignment file names; the second column contains the gene list names or BED file names; the third column contains the titles of the plots; the fourth and fifth columns are optional and contain fragment lengths and custom average profile colors, respectively. ngs.plot will parse a configuration file and obtain a list of unique BAM files and a list of unique regions (Figure 2B). Some pre-processing steps will be performed on each BAM file, such as calculating the number of alignments and indexing. The unique regions and unique BAM files are used to organize heatmaps into a grid so that each row represents a unique region and each column represents a BAM file.
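Parsing such a configuration and arranging the plots into a grid can be sketched as follows. The column order follows the description above. The dictionary keys, function names, and example file names are hypothetical, and the real ngs.plot parser may treat edge cases differently.

```python
import csv
import io


def parse_config(text):
    """Parse TAB-delimited configuration text into a list of plot entries."""
    entries = []
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        if not row:
            continue  # skip blank lines
        entries.append({
            "bam": row[0],
            "regions": row[1],
            "title": row[2],
            # Columns four and five are optional.
            "fragment_length": int(row[3]) if len(row) > 3 and row[3] else None,
            "color": row[4] if len(row) > 4 and row[4] else None,
        })
    return entries


def heatmap_grid(entries):
    """Rows are unique regions, columns are unique BAM files (first-seen order)."""
    bams = list(dict.fromkeys(e["bam"] for e in entries))
    regions = list(dict.fromkeys(e["regions"] for e in entries))
    grid = [[None] * len(bams) for _ in regions]
    for e in entries:
        grid[regions.index(e["regions"])][bams.index(e["bam"])] = e["title"]
    return bams, regions, grid


example = "a.bam\tgenes.txt\tSample A\t150\tred\nb.bam\tgenes.txt\tSample B\n"
bams, regions, grid = heatmap_grid(parse_config(example))
```

A cell left as None in the grid would mean that the configuration has no plot for that combination of region and BAM file.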

Included in the ngs.plot package are several additional useful tools. A Python script called ngsplotdb.py can be used to install downloaded genome files, list currently installed genomes, or remove existing genomes. An R script called plotCorrGram.r can be used to calculate all pairwise correlations for samples in a configuration and visually display them as a corrgram [26]. Another R script called replot.r can be used to re-generate an average profile or a heatmap with different visual options so that users can tune their figures without extracting data again.

Therefore, we developed another strategy that uses a two-step procedure (Figure 1). First, the query regions are grouped into chunks and the BAM index is loaded into memory to perform alignment retrieval. Second, the retrieved alignments are used to calculate coverage on-the-fly for each region. A BAM file is indexed using hierarchical binning and linear index to allow very efficient retrieval so that only one disk seek (moving the disk head to the desired location) is often required for each query [25, 27]. Grouping regions into chunks allows us to avoid frequent index loading which is very expensive in comparison to alignment reading. This strategy has an additional advantage: no extra files need to be generated to represent coverage vectors. When the storage of many NGS samples becomes problematic, this advantage is highly desirable.
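The chunking idea can be illustrated with a small sketch in which the index loader and the alignment fetcher are passed in as callables. Both callables are hypothetical stand-ins for real BAM access, and the chunk size is an arbitrary illustrative value.

```python
def chunk_regions(regions, chunk_size):
    """Yield consecutive slices of `regions`, each with at most `chunk_size` items."""
    for i in range(0, len(regions), chunk_size):
        yield regions[i:i + chunk_size]


def process_in_chunks(regions, chunk_size, load_index, fetch):
    """Load the index once per chunk, then query every region in that chunk.

    load_index: callable returning an index object (hypothetical).
    fetch: callable (index, region) -> result for one region (hypothetical).
    """
    results = []
    for chunk in chunk_regions(regions, chunk_size):
        index = load_index()  # the expensive step, amortized over the chunk
        for region in chunk:
            results.append(fetch(index, region))
    return results


chunks = list(chunk_regions(["r1", "r2", "r3", "r4", "r5"], 2))
```

With five regions and a chunk size of two, the index is loaded three times instead of five, and results are computed on the fly without writing any intermediate coverage files.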

We also explored additional alternatives (see the "Benchmarking the performance of ngs.plot" section). We used samtools to pre-calculate the genomic coverage vector for an NGS sample, merged neighbouring base pairs that contain the same value, and compressed the result using gzip to save space. We then used two different approaches to index the output file. Tabix [27] is a generic indexing program for TAB-delimited text files that contain a position column and a value column, and it uses the same indexing algorithm as BAM. It can directly create an index on a compressed text file. bigWig [28] is a binary format that includes an R-tree data structure as its index; bigWig files are converted from wiggle files. We first converted the output file to a variable-step wiggle file and then created the bigWig file using tools from the UCSC genome browser.
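Merging neighbouring base pairs that share the same coverage value is a run-length encoding. The sketch below shows that step and the gzip compression in isolation. The three-column output layout and the function names are assumptions for illustration and do not reproduce the exact file format used in the benchmark.

```python
import gzip
from itertools import groupby


def run_length_intervals(coverage, offset=0):
    """Collapse a per-base coverage list into half-open (start, end, value) runs."""
    intervals = []
    pos = offset
    for value, run in groupby(coverage):
        length = sum(1 for _ in run)
        intervals.append((pos, pos + length, value))
        pos += length
    return intervals


def to_compressed_text(intervals):
    """Serialize runs as TAB-delimited lines and gzip them (assumed layout)."""
    lines = ["{}\t{}\t{}".format(start, end, value) for start, end, value in intervals]
    return gzip.compress("\n".join(lines).encode("ascii"))


runs = run_length_intervals([0, 0, 3, 3, 3, 1], offset=100)
blob = to_compressed_text(runs)
```

Long stretches of identical coverage, which are common in sparse NGS data, collapse into single lines before compression.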

Genes and transcripts are categorized into five types: protein_coding, pseudogene, lincRNA, miRNA, and misc (everything else) according to GTF files. Gene/transcript IDs/names are indexed for random access. Each gene is represented by the isoform with the longest genomic span.
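Choosing the isoform with the longest genomic span for each gene can be sketched as below. The record layout (dictionaries with gene_id, transcript_id, start, and end keys) is hypothetical, and ties are resolved in favour of the first isoform seen, which is an assumption.

```python
def representative_isoforms(transcripts):
    """Map each gene ID to its transcript with the longest genomic span."""
    best = {}
    for t in transcripts:
        span = t["end"] - t["start"]
        current = best.get(t["gene_id"])
        # Keep the first isoform on ties (illustrative tie-breaking rule).
        if current is None or span > current["end"] - current["start"]:
            best[t["gene_id"]] = t
    return best


isoforms = [
    {"gene_id": "G1", "transcript_id": "T1", "start": 100, "end": 500},
    {"gene_id": "G1", "transcript_id": "T2", "start": 50, "end": 900},
    {"gene_id": "G2", "transcript_id": "T3", "start": 10, "end": 20},
]
chosen = representative_isoforms(isoforms)
```

Here the span is measured on the genome from transcript start to transcript end, not as the summed exon length.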

Enhancers are important transcriptional regulators that can activate distal promoters via DNA looping. They often regulate subsets of genes in a cell-type-specific way and are marked in part by enrichment of H3K4me1 and H3K27ac [33, 34]. We have built into our database the enhancers of 9 human cell types and 15 mouse cell types (Table 1) using data from the ENCODE [33] and muENCODE [34] projects. For human enhancers, we incorporated data from the ENCODE Analysis Working Group (AWG), which performs integrated analysis of all ENCODE data types based on uniform processing. We will continuously monitor the status of their download page and update our database as new data become available. We excluded enhancers that are within 5 Kb of a TSS. The 5 Kb cutoff was inspired by [33] and avoids accidentally classifying promoters as enhancers. Each enhancer is assigned to its nearest gene, whose ID/name is also indexed.
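The TSS-distance filter and nearest-gene assignment can be sketched for a single chromosome as follows. Several details are assumptions made for illustration: distance is measured from the enhancer midpoint, the 5 Kb boundary is treated as inclusive, and the function names and tuple layouts are hypothetical.

```python
import bisect


def nearest_tss(position, sorted_tss):
    """Return the (tss_position, gene) pair closest to `position`.

    sorted_tss must be non-empty and sorted by position.
    """
    positions = [p for p, _ in sorted_tss]
    i = bisect.bisect_left(positions, position)
    # Only the neighbours on either side of the insertion point can be nearest.
    candidates = sorted_tss[max(0, i - 1):i + 1]
    return min(candidates, key=lambda t: abs(t[0] - position))


def filter_and_assign(enhancers, tss_list, min_distance=5000):
    """Drop enhancers within `min_distance` of a TSS; label the rest with a gene."""
    sorted_tss = sorted(tss_list)
    kept = []
    for start, end in enhancers:
        midpoint = (start + end) // 2  # assumed reference point
        tss_pos, gene = nearest_tss(midpoint, sorted_tss)
        if abs(tss_pos - midpoint) <= min_distance:
            continue  # too close to a promoter
        kept.append((start, end, gene))
    return kept


tss = [(10000, "A"), (50000, "B")]
assigned = filter_and_assign([(12000, 13000), (30000, 31000), (20000, 20200)], tss)
```

In this example the first enhancer lies 2.5 Kb from the TSS of gene A and is removed, while the other two are kept and labelled with their nearest gene.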

The NGS data used in this manuscript were obtained from the Sequence Read Archive (SRA). The accession numbers and references of the datasets are listed in Table S1 [see Additional file 1]. ChIP-seq data were aligned to the reference genome with Bowtie [39]. Peak calling was performed with MACS [40] using default parameters. RNA-seq data were analyzed with the Tuxedo Suite [41]. Differential chromatin modification sites were detected with diffReps [42] using default parameters and an FDR cutoff of 0.1.

First, coverage needs to be pre-calculated for Tabix, bigWig, and RLE. This takes a long time to complete, and the run time is strongly associated with the alignment size (Figure 3A). It takes samtools around 1,000 s to calculate the coverage for a 10 million read BAM file and more than 5,000 s for a 160 million read BAM file. RLE is much faster, but its run time increases more rapidly than that of samtools: it takes 80 s for a 10 million read BAM file and more than 800 s for a 160 million read BAM file. This is because RLE tries to load all alignments into memory and then performs calculations in a batch, while samtools does the calculations by reading alignments in a stream. After coverage calculation, Tabix and bigWig also require the coverage files to be indexed. The indexing is more than 10 times faster than coverage calculation and shows a strong association with the alignment file size (Figure 3A). Tabix is faster than bigWig, most likely because bigWig uses more than one index for different zoom levels [28].