no informative reads

Aryn Wilder

unread,

Nov 29, 2019, 6:35:19 PM11/29/19

to STITCH imputation

Hi all,

I am trying to impute using STITCH from bam files generated from simulated data. When I run the command:

STITCH.R --chr="rep_1" --bamlist=bamlist_sorted_1x_wRGSM.txt --posfile=bam_list_160_1x_pos.txt --outputdir=./ --K=40 --nGen=100 --nCores=8

I get:

[2019-11-29 17:43:02] Running STITCH(chr = rep_1, nGen = 100, posfile = bam_list_160_1x_pos.txt, K = 40, S = 1, outputdir = ./, nStarts = , tempdir = NA, bamlist = bamlist_sorted_1x_wRGSM.txt, cramlist = , sampleNames_file = , reference = , genfile = , method = diploid, output_format = bgvcf, B_bit_prob = 16, outputInputInVCFFormat = FALSE, downsampleToCov = 50, downsampleFraction = 1, readAware = TRUE, chrStart = NA, chrEnd = NA, regionStart = NA, regionEnd = NA, buffer = NA, maxDifferenceBetweenReads = 1000, maxEmissionMatrixDifference = 1e+10, alphaMatThreshold = 1e-04, emissionThreshold = 1e-04, iSizeUpperLimit = 600, bqFilter = 17, niterations = 40, shuffleHaplotypeIterations = c(4, 8, 12, 16), splitReadIterations = 25, nCores = 8, expRate = 0.5, maxRate = 100, minRate = 0.1, Jmax = 1000, regenerateInput = TRUE, originalRegionName = NA, keepInterimFiles = FALSE, keepTempDir = FALSE, outputHaplotypeProbabilities = FALSE, switchModelIteration = NA, generateInputOnly = FALSE, restartIterations = NA, refillIterations = c(6, 10, 14, 18), downsampleSamples = 1, downsampleSamplesKeepList = NA, subsetSNPsfile = NA, useSoftClippedBases = FALSE, outputBlockSize = 1000, outputSNPBlockSize = 10000, inputBundleBlockSize = NA, genetic_map_file = , reference_haplotype_file = , reference_legend_file = , reference_sample_file = , reference_populations = NA, reference_phred = 20, reference_iterations = 40, reference_shuffleHaplotypeIterations = c(4, 8, 12, 16), output_filename = NULL, initial_min_hapProb = 0.2, initial_max_hapProb = 0.8, regenerateInputWithDefaultValues = FALSE, plotHapSumDuringIterations = FALSE, plot_shuffle_haplotype_attempts = FALSE, plotAfterImputation = TRUE, save_sampleReadsInfo = FALSE, gridWindowSize = NA, shuffle_bin_nSNPs = NULL, shuffle_bin_radius = 5000, keepSampleReadsInRAM = FALSE, useTempdirWhileWriting = FALSE, output_haplotype_dosages = FALSE)
[2019-11-29 17:43:02] Program start
[2019-11-29 17:43:02] Get and validate pos and gen
[2019-11-29 17:43:03] Done get and validate pos and gen
[2019-11-29 17:43:03] Get BAM sample names
[2019-11-29 17:43:03] Done getting BAM sample names
[2019-11-29 17:43:03] Generate inputs
[2019-11-29 17:43:04] WARNING - sample sample_28_sorted_1x has no informative reads. It is being given random reads. Consider removing from analysis
[2019-11-29 17:43:04] WARNING - sample sample_46_sorted_1x has no informative reads. It is being given random reads. Consider removing from analysis
[2019-11-29 17:43:04] WARNING - sample sample_100_sorted_1x has no informative reads. It is being given random reads. Consider removing from analysis

The last line repeats for every sample in the bam list. The name of the simulated chromosome is "rep_1" in the bams and the position file. Any ideas about the problem?

Thanks,

Aryn

Robbie Davies

unread,

Dec 2, 2019, 5:13:40 AM12/2/19

to Aryn Wilder, STITCH imputation

Hi Aryn,

Huh, that's weird. I just tried with that chromosome name using a modified version of one of my test scripts, and this seems to work just fine (see code under A below). Do your BAMs look fine (see B below for an example of the test data I used, top of samtools view). And the posfile (C below) makes sense as well?

One thing, what's the BQ? STITCH defaults to requiring >17, same as what GATK used to (still does?) require (see bqFilter). STITCH should be fine with paired end or single reads, but does filter out reads with insert size > 600 (iSizeUpperLimit).

Ps Let me know if you have any questions when thinking about parameter choices

Best,

Robbie

========= A ===========

if (1 == 0) {

library("testthat"); library("STITCH"); library("rrbgen")
dir <- "~/proj/STITCH/"
setwd(paste0(dir, "/STITCH/R"))
a <- dir(pattern = "*.R")
b <- grep("~", a)
if (length(b) > 0) {
a <- a[-b]
}
o <- sapply(a, source)
setwd(dir)
Sys.setenv(PATH = paste0(getwd(), ":", Sys.getenv("PATH")))

}

n_snps <- 20
reads_span_n_snps <- 6
chr <- "rep_1"
original_chr <- chr
n_reads <- 5 / (reads_span_n_snps / n_snps) ## want about 5X / sample
extension <- c("bgvcf" = ".vcf.gz", "bgen" = ".bgen")

K <- 10
phasemaster <- matrix(sample(c(0, 1), replace = TRUE, n_snps * K), ncol = K)
L <- 10 + seq(1, 3 * n_snps, 3)
L_original <- L

data_package <- make_acceptance_test_data_package(
n_samples = 10,
n_snps = n_snps,
n_reads = n_reads,
L = L,
seed = 3,
chr = chr,
K = 2,
reads_span_n_snps = reads_span_n_snps,
phasemaster = phasemaster
)

outputdir <- make_unique_tempdir()
set.seed(478)

STITCH(
chr = data_package$chr,
bamlist = data_package$bamlist,
posfile = data_package$posfile,
genfile = data_package$genfile,
outputdir = outputdir,
K = K,
S = 1,
nGen = 100,
nCores = 1
)

## seems to work
## read.table(file.path(outputdir, "stitch.rep_1.vcf.gz"))

========== B ============

r008 0 rep_1 11 60 16M * 0 0 GAAAAAGAAGAAAAAA ::::::::::::::::
r009 0 rep_1 11 60 16M * 0 0 GAAAAAAAAGAAAAAG ::::::::::::::::
r0017 0 rep_1 11 60 16M * 0 0 GAAAAAGAAGAAAAAA ::::::::::::::::

========= C =============

rep_1 11 A G
rep_1 14 A G
rep_1 17 A G
rep_1 20 A G

--
You received this message because you are subscribed to the Google Groups "STITCH imputation" group.
To unsubscribe from this group and stop receiving emails from it, send an email to stitch-imputat...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/stitch-imputation/f5d4f27e-e276-4a2e-acab-e0f9e357d7b7%40googlegroups.com.

Aryn Wilder

unread,

Dec 2, 2019, 12:16:05 PM12/2/19

to STITCH imputation

Hi Robbie,

Thanks for your reply. These are simulated paired-end 150bp reads. The mean insert size is 500bp, so some reads should be filtered out, but not all. I think the BQ is OK, unless you mean the BAQ field(?), in which case I'm not sure.

The top of my bam files look like this:

rep_1-1954106    99    rep_1    641    99    150=    =    946    455    AAGACTAGAGGGTTTTGTTCTTTGCGGGCTGTGCGAGGTTGAACTCCGCGCACCTGCTCGAGCAAGAGCCCTAAAGTATGCCATCTTTGTCCCATGCAACGTCATTAAGGTTATCGGCCGGGGTCCTGGAGTCTCTATTGACATATACCG    CCCGCGGGGGGGGGJJJJJJJ=JJJJJJJJJJJGJJJGCJJGJJJCGGJJJGJGJGCJJ8GGG=J=8GCCGGGCGCCGG=GGGGGCCGG=GCGGCGGCCGJGCGGGCCGGCGCGGCGG=GG1GCG8G=C=GGGCGGCG8GGCGCG(CGGC    RG:Z:1
rep_1-1326496    99    rep_1    733    99    137=1X12=    =    926    343    CATGCAACGTCATTAAGGTTATCGGCCGGGGTCCTGGAGTCTCTATTGACATATACCGTTACTCCGCTTCGAGCTTTTGGCATTACTGGACATAATGGAATTCAATATAGAAAATCGGAGTGGACCGGAGCCCTAGGCGTGGGCCCCCGT    CC=GGGGGGGGGGJJJGGJJGJJGJJCJJJ=JGGGJJCJ=J=JGGGJJGCGJJ=JGJJ8JJ=CCCGGJGGJ8GGCGGGGCGGGGJGGGG1GG=GGCCCCGJ=G=GGGCGGGGG=CGGCGCCCGCGCGCGGG1GC=G=1GGG=CCCGGGG=    RG:Z:1
rep_1-1326496    147    rep_1    926    99    150=    =    733    -343    GGGTCTCGAGGCGAATGGTACCCTGCGCCAAAGGGGCTAAGACTTAACCGTGTCCTTGCGGTTTCAAGCCGCTGAAGTAATGCAACACCCTGCCTACCCCCTTGCACACGAATCAACCGATTTGGGCGCGATACGCTTTGGCCCAATAAG    CGG8GGCGCCGGCGGGGCGGGGGGGGGGGGGGG=GGCG=GCGGJ1JJCCGGGGGGG=GGCCGGGG1GCGCCGGCGGGCCJ1CC8GGGJGCGGGGJJGGCCGJJJGGGJGCJJJGGJGGCJJCJJGGJGCGGJJJJJCGCGGGGGG=GCCC    RG:Z:1

And the top of my posfile looks like this (tab-delimited is OK, right?):

rep_1    292    G    C
rep_1    383    C    T
rep_1    407    G    A
rep_1    490    A    T
rep_1    530    A    T
rep_1    574    T    A
rep_1    953    T    C
rep_1    1026    C    T
rep_1    1096    T    A
rep_1    1097    A    T

Thanks again for your help. I'd love to have your advice choosing parameters once I've got it up and running!

Best,

Aryn

To unsubscribe from this group and stop receiving emails from it, send an email to stitch-i...@googlegroups.com.

Robbie Davies

unread,

Dec 2, 2019, 5:22:27 PM12/2/19

to Aryn Wilder, STITCH imputation

Hey,

Thanks for the info. It looks like the SAM/BAM spec can accept "=" in the alignment. If I change 150= to 150M, then I get your toy example to process that read correctly.

What aligner are you using? Reading the SAM/BAM spec, I'm blanking on the proper difference between M and =. I don't think I've seen "=" before, but I don't think I've used that many aligners

(if interested, somewhere like this is where the code would fall over)

https://github.com/rwdavies/STITCH/blob/master/STITCH/src/bam_access.cpp#L25

At this point, it would probably be to switch this conversation to a bug request on the STITCH github

Thanks,

Robbie

To unsubscribe from this group and stop receiving emails from it, send an email to stitch-imputat...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/stitch-imputation/655383b7-d344-499c-b67e-e70a1188b6d4%40googlegroups.com.

Aryn Wilder

unread,

Dec 2, 2019, 8:10:05 PM12/2/19

to STITCH imputation

Ah, OK. The reads were simulated in ART, and it looks like ART uses the extended CIGAR format. I reformatted the bams and now it's running. Thanks for the tip!

As for the parameter choices: We're comparing simulations of different reads depths and sample sizes to test how low coverage sequencing compares to high coverage of fewer samples (i.e. we're examining the trade-off between sample size and sequencing depth for inferring things like allele frequencies and population structure). I'm curious if we can get some improvement in allele frequency estimation if we use STITCH to impute genotype probabilities, then use something like ANGSD to estimate allele frequencies, etc from the posterior genotype probabilities.

We've started by simulating a large (Ne~10 000), neutrally evolving population for 10 000 generations. I assume we'll want to run STITCH in pseudo-haploid mode with a large K. Any suggestions on the appropriate K and nGen values under this scenario?

Thanks again,

Aryn

To view this discussion on the web visit https://groups.google.com/d/msgid/stitch-imputation/655383b7-d344-499c-b67e-e70a1188b6d4%40googlegroups.com.

Robbie Davies

unread,

Dec 3, 2019, 5:19:16 AM12/3/19

to Aryn Wilder, STITCH imputation

Hey,

Thanks, I'll take a look, and see if just something I want to make a note of on the webpage, or implement a fix for.

In experiences with other populations, I haven't found the pseudoHaploid to be as effective as I'd like, so I'd recommend trying diploid first, and seeing how that does. I have some methods development in the works on something that looks more effective, but it isn't ready for release yet, and teaching duties have thrown my ability to forecast timelines out the window.

K would be the same for both for a "wild" population - as large as you can get without affecting run time and memory constraints, given available coverage. I usually try first to set something like K = min(30, average depth * N / 10), so that on average each haplotype gets 10X of coverage. This means K can sometimes be quite small, but it should be OK, there shouldn't be bias in the model. Note also that "S" is a new parameter controlling how many repetitions of the initialization to perform, basically averaging results over S independent runs. It improves accuracy a bit, at a roughly linear cost to speed. So maybe S=4 or 10 might be a good idea if compute time isn't that long or limited.

The method is pretty insensitive to nGen. For a neutrally evolving population, nGen = 4 * K / Ne is pretty standard. nGen as a parameter choice controls recombination rate combined with other default parameters (usually fine for human or mouse like populations, less so for other populations like insects). As long as there is a moderate amount of data, the data usually matters quite a bit more than the parameter value chosen for this parameter (as opposed to K, which is important to set correctly).

Best,

Robbie

To unsubscribe from this group and stop receiving emails from it, send an email to stitch-imputat...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/stitch-imputation/ba130a4a-0f90-4b60-9a76-5329e15adb3f%40googlegroups.com.

Reply all

Reply to author

Forward