Pbgzip

Arlyne Doepner

Aug 5, 2024, 2:58:41 PM

to clanadfabon

Output file for duplicate statistics. If the file exists, it will be opened in append mode. If the path ends with .gz or .lz4, the output is bgzip-/lz4c-compressed. By default, statistics are not printed.

Output file for by-tile duplicate statistics. Note that the readID must be provided and must contain tile information for this option. This analysis is only possible when pairtools is run on a dataset with the original Illumina-generated read IDs; SRA does not store the original read IDs from the sequencer. By default, by-tile duplicate statistics are not printed. If the file exists, it will be opened in append mode. If the path ends with .gz or .lz4, the output is bgzip-/lz4c-compressed.
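The suffix and append rules above can be sketched as follows. This is a minimal illustration, not the pairtools implementation; the names SUFFIX_TO_COMPRESSOR, guess_compressor and open_mode are hypothetical.

```python
from pathlib import Path

# Hypothetical mapping that mirrors only the suffix rule stated in the option text.
SUFFIX_TO_COMPRESSOR = {".gz": "bgzip", ".lz4": "lz4c"}


def guess_compressor(path):
    """Return the compressor implied by the path suffix, or None for plain text."""
    return SUFFIX_TO_COMPRESSOR.get(Path(path).suffix)


def open_mode(path):
    """Existing files are appended to; missing files are created."""
    return "a" if Path(path).exists() else "w"
```

For example, guess_compressor("dups.stats.gz") yields "bgzip", while a path such as "dups.stats.txt" yields None and would be written uncompressed.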

Engine for regular expression parsing for stats filtering. Python provides regex functionality, while pandas does not accept custom functions but works faster. [output stats filtering option]

A path to a chromosomes file (tab-separated, 1st column contains chromosome names) containing a chromosome subset of interest for stats filter. If provided, additionally filter pairs with both sides originating from the provided subset of chromosomes. This operation modifies the #chromosomes: and #chromsize: header fields accordingly. Note that this will not change the deduplicated output pairs. [output stats filtering option]
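A rough sketch of the chromosome-subset filter described above. It assumes pairs are held as dictionaries with chrom1 and chrom2 keys; the function names are hypothetical and the header rewriting is not shown.

```python
def read_chrom_subset(lines):
    """Collect chromosome names from the first tab-separated column."""
    return {line.split("\t")[0].strip() for line in lines if line.strip()}


def filter_pairs(pairs, chroms):
    """Keep only pairs with both sides on chromosomes from the subset."""
    return [p for p in pairs if p["chrom1"] in chroms and p["chrom2"] in chroms]


# Example chromosomes file content (name, then any further columns).
subset = read_chrom_subset(["chr1\t248956422\n", "chr2\t242193529\n"])
pairs = [
    {"chrom1": "chr1", "chrom2": "chr2"},
    {"chrom1": "chr1", "chrom2": "chrX"},
]
kept = filter_pairs(pairs, subset)
```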

Cast a given column to a given type for stats filtering. By default, only pos and mapq are cast to int; other columns are kept as str. Provide as -t followed by the column name and the type, e.g. -t read_len1 int. Multiple entries are allowed. [output stats filtering option]
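The casting rule can be illustrated with a short sketch. The column names pos1, pos2, mapq1 and mapq2 and the helper cast_row are assumptions made for the example, not taken from the option text.

```python
# Assumed default casts: positions and mapping qualities become int.
DEFAULT_CASTS = {"pos1": int, "pos2": int, "mapq1": int, "mapq2": int}
TYPE_NAMES = {"int": int, "float": float, "str": str}


def cast_row(row, extra_casts=()):
    """Cast selected columns of one row; untouched columns stay str."""
    casts = dict(DEFAULT_CASTS)
    for column, type_name in extra_casts:
        casts[column] = TYPE_NAMES[type_name]
    return {key: casts[key](value) if key in casts else value
            for key, value in row.items()}


row = {"pos1": "100", "mapq1": "30", "read_len1": "150", "strand1": "+"}
# Equivalent in spirit to passing: -t read_len1 int
typed = cast_row(row, [("read_len1", "int")])
```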

A command to decompress the input file. If provided, fully overrides the auto-guessed command. Does not work with stdin and pairtools parse. Must read input from stdin and print output into stdout. EXAMPLE: pbgzip -dc -n 3

A command to compress the output file. If provided, fully overrides the auto-guessed command. Does not work with stdout. Must read input from stdin and print output into stdout. EXAMPLE: pbgzip -c -n 8
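Both options expect a command that reads from stdin and writes to stdout. The sketch below shows that contract with a hypothetical helper, run_filter. Because pbgzip may not be installed, a small Python one-liner stands in for the decompressor; that stand-in is an assumption of the example, not part of pairtools.

```python
import gzip
import shlex
import subprocess
import sys


def run_filter(command, data):
    """Pipe bytes through a command that reads stdin and writes stdout."""
    result = subprocess.run(
        shlex.split(command), input=data, stdout=subprocess.PIPE, check=True
    )
    return result.stdout


# Stand-in for a real decompressor such as "pbgzip -dc -n 3".
STAND_IN_DECOMPRESSOR = (
    shlex.quote(sys.executable)
    + ' -c "import sys, gzip; '
    + 'sys.stdout.buffer.write(gzip.decompress(sys.stdin.buffer.read()))"'
)

payload = b"chr1\t100\tchr2\t200\n"
restored = run_filter(STAND_IN_DECOMPRESSOR, gzip.compress(payload))
```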

Find and remove pairs with >(MAX_COV-1) neighbouring pairs within a +/- MAX_DIST bp window around either side. Useful for single-cell Hi-C experiments, where coverage is naturally limited by the chromosome copy number.
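One way to read the coverage rule is sketched below. It is a simplified interpretation: every pair side on the same chromosome counts as a neighbour, including the other side of the same pair when it falls inside the window. The function name and the tuple layout (chrom1, pos1, chrom2, pos2) are assumptions.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


def filter_by_coverage(pairs, max_cov, max_dist):
    """Drop pairs where either side has more than max_cov - 1 neighbours."""
    positions = defaultdict(list)
    for chrom1, pos1, chrom2, pos2 in pairs:
        positions[chrom1].append(pos1)
        positions[chrom2].append(pos2)
    for chrom in positions:
        positions[chrom].sort()

    def neighbours(chrom, pos):
        sorted_pos = positions[chrom]
        lo = bisect_left(sorted_pos, pos - max_dist)
        hi = bisect_right(sorted_pos, pos + max_dist)
        return hi - lo - 1  # exclude the side itself

    return [
        p for p in pairs
        if neighbours(p[0], p[1]) <= max_cov - 1
        and neighbours(p[2], p[3]) <= max_cov - 1
    ]


pairs = [
    ("chr1", 100, "chr2", 5000),
    ("chr1", 110, "chr2", 9000),
    ("chr1", 120, "chr2", 13000),
    ("chr1", 50000, "chr2", 20000),
]
kept = filter_by_coverage(pairs, max_cov=2, max_dist=50)
```

With max_cov=2 the three pairs clustered around chr1:100-120 each have two neighbours on that side and are removed; only the isolated pair remains.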

Output file for statistics of multiple interactors. If the file exists, it will be opened in append mode. If the path ends with .gz or .lz4, the output is bgzip-/lz4c-compressed. By default, statistics are not printed.

Required. Chromosome order used to flip interchromosomal mates: path to a chromosomes file (e.g. UCSC chrom.sizes or similar) whose first column lists scaffold names. Any scaffolds not listed will be ordered lexicographically following the names provided.

Chromosome order used to flip interchromosomal mates: path to a chromosomes file (e.g. UCSC chrom.sizes or similar) whose first column lists scaffold names. Any scaffolds not listed will be ordered lexicographically following the names provided.
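The ordering rule can be sketched as a rank table: listed scaffolds keep their file order and unlisted ones follow in lexicographic order. The helper names are hypothetical, and flipping intrachromosomal pairs by position is an assumption added to make the example complete; strands are left out.

```python
def make_chrom_rank(listed, observed):
    """Rank listed names by file order, then unlisted names lexicographically."""
    rank = {name: index for index, name in enumerate(listed)}
    for name in sorted(set(observed) - set(listed)):
        rank[name] = len(rank)
    return rank


def flip_pair(pair, rank):
    """Swap the two sides so the lower-ranked side comes first."""
    chrom1, pos1, chrom2, pos2 = pair
    if (rank[chrom1], pos1) > (rank[chrom2], pos2):
        return (chrom2, pos2, chrom1, pos1)
    return pair


rank = make_chrom_rank(["chr2", "chr1"], ["chr1", "chr2", "scafB", "scafA"])
flipped = flip_pair(("chr1", 10, "chr2", 500), rank)
```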

Merge triu-flipped sorted pairs/pairsam files. If present, the @SQ records of the SAM header must be identical; the sorting order of these lines is taken from the first file in the list. The ID fields of the @PG records of the SAM header are modified with a numeric suffix to produce unique records. The other unique SAM and non-SAM header lines are copied into the output header.

PAIRS_PATH: upper-triangular flipped sorted .pairs/.pairsam files to merge, or a group/groups of .pairs/.pairsam files specified by a wildcard. For paths ending in .gz/.lz4, the files are decompressed by bgzip/lz4c.
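Merging already-sorted inputs is a k-way merge. The sketch below uses heapq.merge and assumes a sort key of (chrom1, chrom2, pos1, pos2) with plain lexicographic chromosome order; header handling is not shown and the function name is hypothetical.

```python
import heapq


def merge_sorted_pairs(streams):
    """Merge several sorted pair streams into one sorted list."""
    def sort_key(pair):
        chrom1, pos1, chrom2, pos2 = pair
        return (chrom1, chrom2, pos1, pos2)

    return list(heapq.merge(*streams, key=sort_key))


file_a = [("chr1", 10, "chr1", 200), ("chr1", 50, "chr2", 5)]
file_b = [("chr1", 20, "chr1", 100), ("chr2", 1, "chr2", 9)]
merged = merge_sorted_pairs([file_a, file_b])
```

Because heapq.merge only looks at the head of each stream, the inputs never need to be loaded fully into memory when real file iterators are used in place of lists.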

Find ligation pairs in .sam data, make .pairs.

SAM_PATH: an input .sam/.bam file with paired-end sequence alignments of Hi-C molecules. If the path ends with .bam, the input is decompressed from bam with samtools. By default, the input is read from stdin.

The maximal size of a Hi-C molecule; used to rescue single ligations (from molecules with three alignments) and to rescue complex ligations. The default is based on oriented P(s) at short ranges of multiple Hi-C. Not used with walks-policy all.

Output file for all parsed alignments, including walks. Useful for debugging and analysis of walks. If the file exists, it will be opened in append mode. If the path ends with .gz or .lz4, the output is bgzip-/lz4-compressed. By default, not used.

Extract pairs from .sam/.bam data with complex walks, make .pairs.

SAM_PATH: an input .sam/.bam file with paired-end or single-end sequence alignments of Hi-C (or Hi-C-like) molecules. If the path ends with .bam, the input is decompressed from bam with samtools. By default, the input is read from stdin.

Reported position of alignments in pairs of complex walks (pos columns). Each alignment in .bam/.sam Hi-C-like data has two ends, and you can report one or the other depending on the position of the alignment on a read or in a pair.

Reported orientation of pairs in complex walks (strand columns). Each alignment in .bam/.sam Hi-C-like data has an orientation, and you can report it relative to the read, pair or whole-walk coordinate system.

If specified, the input is single-end. Never use this for paired-end data, because the R1 read will be omitted. If single-end data is provided but this parameter is unset, the pairs will still be generated but may contain artificial UN pairs.

If specified, parse2 will report the pair index in the walk as additional columns (R1, R2, R1&R2 or R1-R2). See documentation: -complex-walks. For combinatorially expanded pairs, two numbers will be reported: the original pair index of the left and right segments.

If specified, do not flip pairs in genomic order and instead preserve the order in which they were sequenced. Note that no flip is recommended for analysis of walks, because flipping overrides the order of alignments in pairs. Flip is required for appropriate deduplication of sorted pairs. Flip is not required for cooler cload, which runs flipping internally.

Output file with all parsed alignments (one alignment per line). Useful for debugging and analysis of walks. If the file exists, it will be opened in append mode. If the path ends with .gz or .lz4, the output is bgzip-/lz4-compressed. By default, not used.

Phase pairs mapped to a diploid genome. A diploid genome is a genome with two sets of chromosome variants, where each chromosome has one of two suffixes (phase-suffixes) corresponding to the genome version.
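The suffix convention can be illustrated with a small helper that separates a chromosome name from its phase. The suffix values "_cast" and "_129" and the function name split_phase are hypothetical examples, not defaults taken from the text.

```python
def split_phase(chrom, phase_suffixes=("_cast", "_129")):
    """Return (base name, phase index), or (name, None) if no suffix matches."""
    for index, suffix in enumerate(phase_suffixes):
        if chrom.endswith(suffix):
            return chrom[: -len(suffix)], index
    return chrom, None


first = split_phase("chr1_cast")
second = split_phase("chr1_129")
unphased = split_phase("chrM")
```

Here both "chr1_cast" and "chr1_129" resolve to the base name "chr1" with phase indices 0 and 1, while "chrM" carries no phase-suffix and is returned unchanged.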