Bcftools Zip


Sean Vaidhyanathan

Aug 4, 2024, 6:50:54 PM
to rutraycheri
BCFtools is a set of utilities that manipulate variant calls in the Variant Call Format (VCF) and its binary counterpart BCF. All commands work transparently with both VCFs and BCFs, both uncompressed and BGZF-compressed.

Most commands accept VCF, bgzipped VCF and BCF, with the file type detected automatically even when streaming from a pipe. Indexed VCF and BCF will work in all situations. Un-indexed VCF and BCF and streams will work in most, but not all, situations. In general, whenever multiple VCFs are read simultaneously, they must be indexed and therefore also compressed. (Note that files with non-standard index names can be accessed as e.g. "bcftools view -r X:2928329 file.vcf.gz##idx##non-standard-index-name".)
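The "##idx##" convention in the note above can be illustrated with a small Python sketch. The helper name split_index_suffix is hypothetical and is not part of bcftools or htslib; it only shows how such a file specification breaks down into a data file and an index file.

```python
from typing import Optional, Tuple


def split_index_suffix(spec: str) -> Tuple[str, Optional[str]]:
    """Split 'DATA##idx##INDEX' into (DATA, INDEX); INDEX is None when absent."""
    data, separator, index = spec.partition("##idx##")
    return data, (index if separator else None)


# The specification quoted in the note above.
print(split_index_suffix("file.vcf.gz##idx##non-standard-index-name"))
# → ('file.vcf.gz', 'non-standard-index-name')
```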


-c, --collapse
Controls how to treat records with duplicate positions and defines compatible records across multiple input files. Here by "compatible" we mean records which should be considered as identical by the tools. For example, when performing line intersections, the desire may be to consider as identical all sites with matching positions (bcftools isec -c all), or only sites with matching variant type (bcftools isec -c snps -c indels), or only sites with all alleles identical (bcftools isec -c none).
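To make the notion of "compatible" records more concrete, here is a deliberately simplified Python model of four of the modes. It is a sketch under stated assumptions, not the bcftools implementation: the function names are hypothetical, a record is reduced to (CHROM, POS, REF, ALTs), and every record that is not a SNP is treated as an indel.

```python
def is_snp(ref, alts):
    """True when REF and every ALT allele are single bases."""
    return len(ref) == 1 and all(len(alt) == 1 for alt in alts)


def compatible(rec_a, rec_b, mode):
    """Decide whether two records count as the same site under a collapse mode.

    Records are (chrom, pos, ref, alts) tuples. Simplification: anything
    that is not a SNP is treated as an indel.
    """
    chrom_a, pos_a, ref_a, alts_a = rec_a
    chrom_b, pos_b, ref_b, alts_b = rec_b
    if (chrom_a, pos_a) != (chrom_b, pos_b):
        return False
    if mode == "all":  # matching position is enough
        return True
    identical = ref_a == ref_b and tuple(alts_a) == tuple(alts_b)
    if mode == "none":  # all alleles must be identical
        return identical
    snp_a, snp_b = is_snp(ref_a, alts_a), is_snp(ref_b, alts_b)
    if mode == "snps":  # SNPs match each other regardless of ALT
        return identical or (snp_a and snp_b)
    if mode == "indels":  # indels match each other regardless of ALT
        return identical or (not snp_a and not snp_b)
    raise ValueError(f"mode not covered by this sketch: {mode}")


a = ("20", 100, "A", ("C",))
b = ("20", 100, "A", ("G",))
for mode in ("all", "snps", "indels", "none"):
    print(mode, compatible(a, b, mode))
```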


-o, --output FILE
When output consists of a single stream, write it to FILE rather than to standard output, where it is written by default. The file type is determined automatically from the file name suffix; if a conflicting -O option is given, the file name suffix takes precedence.


-R, --regions-file FILE
Regions can be specified either on the command line or in a VCF, BED, or tab-delimited file (the default). The columns of the tab-delimited file can contain either positions (two-column format: CHROM, POS) or intervals (three-column format: CHROM, BEG, END), but not both. Positions are 1-based and inclusive. The columns of the tab-delimited BED file are also CHROM, POS and END (trailing columns are ignored), but coordinates are 0-based, half-open. To indicate that a file should be treated as BED rather than the 1-based tab-delimited file, the file must have the ".bed" or ".bed.gz" suffix (case-insensitive). Uncompressed files are stored in memory, while bgzip-compressed and tabix-indexed region files are streamed. Note that sequence names must match exactly: "chr20" is not the same as "20". Also note that chromosome ordering in FILE will be respected: the VCF will be processed in the order in which chromosomes first appear in FILE. However, within chromosomes, the VCF will always be processed in ascending genomic coordinate order no matter what order they appear in FILE. Note that overlapping regions in FILE can result in duplicated, out-of-order positions in the output. This option requires indexed VCF/BCF files. Note that -R cannot be used in combination with -r.
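The difference between the two coordinate conventions is easy to get wrong, so here is a minimal Python sketch of it. The helper names are hypothetical and the parsing is much simpler than what bcftools does; it only shows the suffix rule and the conversion of a 0-based, half-open BED interval to the 1-based, inclusive form.

```python
def is_bed(path):
    """Apply the suffix rule: '.bed' or '.bed.gz', case-insensitive."""
    return path.lower().endswith((".bed", ".bed.gz"))


def parse_region_line(line, bed):
    """Return (chrom, beg, end) as 1-based, inclusive coordinates."""
    fields = line.rstrip("\n").split("\t")
    chrom = fields[0]
    if bed:
        # BED is 0-based, half-open: shift the start, keep the end.
        return chrom, int(fields[1]) + 1, int(fields[2])
    beg = int(fields[1])
    # Two-column format gives a single position, three-column an interval.
    end = int(fields[2]) if len(fields) > 2 else beg
    return chrom, beg, end


print(parse_region_line("20\t99\t200\tsome_name", bed=True))   # → ('20', 100, 200)
print(parse_region_line("20\t100\t200", bed=False))            # → ('20', 100, 200)
print(parse_region_line("20\t100", bed=False))                 # → ('20', 100, 100)
```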


-s, --samples [^]LIST
Comma-separated list of samples to include, or to exclude if prefixed with "^". (Note that when multiple samples are to be excluded, the "^" prefix is still present only once, e.g. "^SAMPLE1,SAMPLE2".) The sample order is updated to reflect that given on the command line. Note that in general tags such as INFO/AC, INFO/AN, etc. are not updated to correspond to the subset samples. bcftools view is the exception where some tags will be updated (unless the -I, --no-update option is used; see bcftools view documentation). To use updated tags for the subset in another command one can pipe from view into that command. For example: "bcftools view -Ou -s sample1,sample2 file.vcf | bcftools query -f '%INFO/AC\t%INFO/AN\n'".
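The include/exclude syntax of the sample list can be modelled in a few lines of Python. This is only a sketch with a hypothetical helper name; raising an error for unknown sample names is an assumption of the sketch.

```python
def select_samples(header_samples, spec):
    """Resolve a '[^]NAME1,NAME2' list against the samples of a VCF header."""
    exclude = spec.startswith("^")
    names = (spec[1:] if exclude else spec).split(",")
    unknown = [name for name in names if name not in header_samples]
    if unknown:
        raise ValueError(f"samples not in header: {unknown}")
    if exclude:
        # A single leading "^" negates the whole list; header order is kept.
        return [s for s in header_samples if s not in names]
    # Included samples follow the order given in the list.
    return names


header = ["SAMPLE1", "SAMPLE2", "SAMPLE3"]
print(select_samples(header, "SAMPLE3,SAMPLE1"))    # → ['SAMPLE3', 'SAMPLE1']
print(select_samples(header, "^SAMPLE1,SAMPLE2"))   # → ['SAMPLE3']
```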


-S, --samples-file [^]FILE
File of sample names to include, or to exclude if prefixed with "^". One sample per line. See also the note above for the -s, --samples option. The sample order is updated to reflect that given in the input file. The command bcftools call accepts an optional second column indicating ploidy (0, 1 or 2) or sex (as defined by --ploidy, for example "F" or "M").


-t, --targets
Similar to -r, --regions, but the next position is accessed by streaming the whole VCF/BCF rather than using the tbi/csi index. Both -r and -t options can be applied simultaneously: -r uses the index to jump to a region and -t discards positions which are not in the targets. Unlike -r, targets can be prefixed with "^" to request logical complement. For example, "^X,Y,MT" indicates that sequences X, Y and MT should be skipped. Yet another difference between -t/-T and -r/-R is that -r/-R checks for proper overlaps and considers both POS and the end position of an indel, while -t/-T considers the POS coordinate only (by default; see also --regions-overlap and --targets-overlap). Note that -t cannot be used in combination with -T.
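The two default matching rules and the "^" complement can be sketched as follows. The function names are hypothetical, coordinates are 1-based and inclusive, and the end of a record is approximated as POS plus the REF length minus one; this models only the default behaviour described above, not the --regions-overlap or --targets-overlap settings.

```python
def region_hit(pos, ref, beg, end):
    """-r/-R style: the record's span POS..POS+len(REF)-1 overlaps the region."""
    record_end = pos + len(ref) - 1
    return pos <= end and record_end >= beg


def target_hit(pos, beg, end):
    """-t/-T style (default): only the POS coordinate is considered."""
    return beg <= pos <= end


def chrom_selected(chrom, spec):
    """Handle a comma-separated sequence list with an optional '^' complement."""
    negate = spec.startswith("^")
    names = set((spec[1:] if negate else spec).split(","))
    return (chrom in names) != negate


# A 10-base REF allele starting at 95 spans 95..104 and reaches into 100-200.
print(region_hit(95, "ACGTACGTAC", 100, 200))   # → True
print(target_hit(95, 100, 200))                 # → False
print(chrom_selected("X", "^X,Y,MT"))           # → False
```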


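-T, --targets-file [^]FILE
With the call -C alleles command, the third column of the targets file must be a comma-separated list of alleles, starting with the reference allele. Note that the file must be compressed and indexed. Such a file can be easily created from a VCF using: "bcftools query -f'%CHROM\t%POS\t%REF,%ALT\n' file.vcf | bgzip -c > als.tsv.gz && tabix -s1 -b2 -e2 als.tsv.gz".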
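For reference, the three-column layout described above looks like this when built by hand. The helper name alleles_line is hypothetical; the sketch only formats one line and does not compress or index anything.

```python
def alleles_line(chrom, pos, ref, alts):
    """Format CHROM, POS and 'REF,ALT1,ALT2,...' as one tab-delimited line."""
    return "\t".join([chrom, str(pos), ",".join([ref, *alts])])


print(repr(alleles_line("20", 14370, "G", ["A", "T"])))
# → '20\t14370\tG,A,T'
```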


-W, --write-index[=FMT]
Automatically index the output files. FMT is optional and can be one of "tbi" or "csi" depending on output file format. Defaults to CSI unless specified otherwise. Can be used only for compressed BCF and VCF output.


-C, --columns-file FILE
Read the list of columns from a file (normally given via the -c, --columns option). Use "-" to skip a column of the annotation file. One column name per row; an additional space- or tab-separated field can be present to indicate the merge logic (normally given via the -l, --merge-logic option). This is useful when many annotations are added at once.


--force
Continue even when parsing errors, such as undefined tags, are encountered. Note that this can be an unsafe operation and can result in corrupted BCF files. If this option is used, make sure to sanity check the result thoroughly.


-I, --set-id [+]FORMAT
Assign ID on the fly. The format is the same as in the query command (see below). By default all existing IDs are replaced. If the format string is preceded by "+", only missing IDs will be set. For example, one can use "bcftools annotate --set-id +'%CHROM\_%POS\_%REF\_%FIRST_ALT' file.vcf".


-i, --include EXPRESSION
Include only sites for which EXPRESSION is true. For valid expressions see EXPRESSIONS.


Additionally, the command bcftools annotate supports expressions that are updated dynamically from the annotation file for each record.


-l, --merge-logic
When multiple regions overlap a single record, this option defines how to treat multiple annotation values when setting a tag in the destination file: use the first encountered value, ignoring the rest (first); append, allowing duplicates (append); append even if the appended value is missing, i.e. is a dot (append-missing); append, discarding duplicate values (unique); sum the values (sum, numeric fields only); average the values (avg); use the minimum value (min) or the maximum (max). Note that this option is intended for use with BED or TAB-delimited annotation files only. Moreover, it is effective only when either REF and ALT or BEG and END --columns are present. Multiple rules can be given either as a comma-separated list or by giving the option multiple times. This is an experimental feature.
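The rules listed above can be summarised in a short Python sketch. It is a model of the described behaviour, not the bcftools code: the function name is hypothetical, values are strings with "." meaning missing, and dropping missing values for every rule except first and append-missing is an assumption of the sketch.

```python
def merge_values(values, rule):
    """Combine the annotation values of several overlapping regions."""
    present = [v for v in values if v != "."]
    if rule == "first":
        return values[0]
    if rule == "append-missing":
        return ",".join(values)
    if not present:
        return "."
    if rule == "append":
        return ",".join(present)
    if rule == "unique":
        # dict.fromkeys removes duplicates while keeping first-seen order.
        return ",".join(dict.fromkeys(present))
    numbers = [float(v) for v in present]
    if rule == "sum":
        return sum(numbers)
    if rule == "avg":
        return sum(numbers) / len(numbers)
    if rule == "min":
        return min(numbers)
    if rule == "max":
        return max(numbers)
    raise ValueError(f"unknown rule: {rule}")


values = ["1", "3", "3", "."]
for rule in ("first", "append", "append-missing", "unique", "sum", "min", "max"):
    print(rule, merge_values(values, rule))
```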


--min-overlap ANN:VCF
Minimum overlap required as a fraction of the variant in the annotation -a file (ANN), in the target VCF file (:VCF), or both for reciprocal overlap (ANN:VCF). By default overlaps of arbitrary length are sufficient. The option can be used only with the tab-delimited annotation -a file and with BEG and END columns present.
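The three ways of expressing the threshold come down to which interval the shared length is divided by. A minimal Python sketch, assuming 1-based inclusive (beg, end) intervals and a hypothetical function name:

```python
def overlap_ok(ann, vcf, min_ann=None, min_vcf=None):
    """Check overlap between an annotation interval and a VCF record interval.

    min_ann is the required fraction of the annotation interval, min_vcf the
    required fraction of the VCF interval; giving both models the reciprocal
    case. With neither set, any overlap is sufficient.
    """
    shared = min(ann[1], vcf[1]) - max(ann[0], vcf[0]) + 1
    if shared <= 0:
        return False
    if min_ann is not None and shared / (ann[1] - ann[0] + 1) < min_ann:
        return False
    if min_vcf is not None and shared / (vcf[1] - vcf[0] + 1) < min_vcf:
        return False
    return True


ann, vcf = (100, 199), (150, 249)  # 50 shared bases, half of each interval
print(overlap_ok(ann, vcf))                             # → True
print(overlap_ok(ann, vcf, min_ann=0.6))                # → False
print(overlap_ok(ann, vcf, min_ann=0.5, min_vcf=0.5))   # → True
```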


--pair-logic
Controls how to match records from the annotation file to the target VCF. Effective only when -a is a VCF or BCF. The option replaces the former unintuitive --collapse. See Common Options for more.


--rename-annots FILE
Rename annotations according to the map in FILE, with "old_name new_name\n" pairs separated by whitespace, each on a separate line. The old name must be prefixed with the annotation type: INFO, FORMAT, or FILTER.
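A map file of this shape can be read with a few lines of Python. The sketch below uses a hypothetical helper name and assumes the type prefix is joined to the old name with "/" (as in "INFO/FOO" elsewhere in this text); the tag names in the example are made up.

```python
def parse_rename_map(text):
    """Parse 'TYPE/old_name new_name' lines into {(type, old_name): new_name}."""
    mapping = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        old, new = line.split()
        kind, _, name = old.partition("/")
        if kind not in {"INFO", "FORMAT", "FILTER"} or not name:
            raise ValueError(f"old name needs an INFO/, FORMAT/ or FILTER/ prefix: {old}")
        mapping[(kind, name)] = new
    return mapping


example = "INFO/OldTag NewTag\nFORMAT/OldFmt\tNewFmt\n"
print(parse_rename_map(example))
# → {('INFO', 'OldTag'): 'NewTag', ('FORMAT', 'OldFmt'): 'NewFmt'}
```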


-s, --samples
Subset of samples to annotate. If the samples are named differently in the target VCF and the -a, --annotations VCF, the name mapping can be given as "src_name dst_name\n", separated by whitespace, each pair on a separate line.


--single-overlaps
Use this option to keep memory requirements low with very large annotation files. Note, however, that this comes at a cost: only single overlapping intervals are considered in this mode. This was the default mode until the commit af6f0c9 (Feb 24 2019).


-x, --remove LIST
List of annotations to remove. Use "FILTER" to remove all filters or "FILTER/SomeFilter" to remove a specific filter. Similarly, "INFO" can be used to remove all INFO tags and "FORMAT" to remove all FORMAT tags except GT. To remove all INFO tags except "FOO" and "BAR", use "^INFO/FOO,INFO/BAR" (and similarly for FORMAT and FILTER). "INFO" can be abbreviated to "INF" and "FORMAT" to "FMT".


The bcftools call command replaces the former bcftools view caller. Some of the original functionality has been temporarily lost in the process of transition under htslib, but will be added back on popular demand. The original calling model can be invoked with the -c option.


--ploidy-file FILE
Ploidy definition given as a space- or tab-delimited list of CHROM, FROM, TO, SEX, PLOIDY. The SEX codes are arbitrary and correspond to the ones used by --samples-file. The default ploidy can be given using the starred records (see below); unlisted regions have ploidy 2. The default ploidy definition is
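How such a definition is looked up can be sketched in Python. The rows used here are made up for illustration and are not the bcftools default definition; the function names are hypothetical, and the first matching row wins, which is an assumption of the sketch.

```python
def parse_ploidy(text):
    """Split CHROM FROM TO SEX PLOIDY rows into region rules and starred defaults."""
    rules, defaults = [], {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        chrom, beg, end, sex, ploidy = fields
        if chrom == "*":
            defaults[sex] = int(ploidy)  # starred record: per-sex default
        else:
            rules.append((chrom, int(beg), int(end), sex, int(ploidy)))
    return rules, defaults


def ploidy_at(rules, defaults, chrom, pos, sex):
    """Return the ploidy for one sex at one position; unlisted regions get 2."""
    for rule_chrom, beg, end, rule_sex, ploidy in rules:
        if rule_chrom == chrom and rule_sex == sex and beg <= pos <= end:
            return ploidy
    return defaults.get(sex, 2)


# Illustrative rows only, not the real defaults.
rules, defaults = parse_ploidy("X 1 1000 M 1\n* * * M 2\n* * * F 2\n")
print(ploidy_at(rules, defaults, "X", 500, "M"))   # → 1
print(ploidy_at(rules, defaults, "X", 500, "F"))   # → 2
```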


-a, --annotate LIST
Comma-separated list of FORMAT fields to output for each sample. Currently GQ and GP fields are supported. For convenience, the fields can be given as lower case letters. A "^" prefix indicates a request to remove auxiliary tags that are useful only for calling.


-G, --group-samples FILE|-
By default, all samples are assumed to come from a single population. This option groups samples into populations and applies the HWE assumption within but not across the populations. FILE is a tab-delimited text file with sample names in the first column and group names in the second column. If - is given instead, no HWE assumption is made at all and single-sample calling is performed. (Note that in low coverage data this inflates the rate of false positives.) The -G option requires the presence of the per-sample FORMAT/QS or FORMAT/AD tag generated with bcftools mpileup -a QS (or -a AD).
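The two-column grouping file can be read into populations as shown in this small Python sketch. The helper name and the sample and group names are hypothetical; the sketch covers only the parsing, not the calling model.

```python
from collections import defaultdict


def read_groups(text):
    """Map each group name to its samples from 'SAMPLE<TAB>GROUP' lines."""
    groups = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue
        sample, group = line.split("\t")
        groups[group].append(sample)
    return dict(groups)


example = "sampleA\tpop1\nsampleB\tpop1\nsampleC\tpop2\n"
print(read_groups(example))
# → {'pop1': ['sampleA', 'sampleB'], 'pop2': ['sampleC']}
```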
