RTG Core 3.5 / RTG Tools 3.5

RTG Announcements

Jul 16, 2015, 12:23:21 AM
to rtg-an...@realtimegenomics.com
Real Time Genomics are pleased to announce the availability of new releases of our full analysis suite, RTG Core, and our free utility package, RTG Tools.  This release includes new features and performance improvements. Some of the highlights of this release:

* Several improvements to somatic variant calling, including the ability to specify site-specific somatic priors, control of output for gain-of-reference and loss-of-heterozygosity events, and changes to the VCF output to align with the TCGA VCF specification.

* Improvements to metagenomic species reference database management. Several new options allow better customization of a species reference, and extraction of genomic information for individual species contained within the reference database.

* Improvements to variant evaluation with vcfeval, primarily the ability to perform evaluation restricted to individual regions or sets of regions (for example, GIAB high-confidence intervals or exome target regions), and the inclusion of more accuracy metrics, both as a new summary file and included in the weighted ROC data file.

* We are also pleased to make the source code to RTG Tools available under the Simplified BSD License, on GitHub. (Source code for RTG Core remains available for non-commercial use.)

* Many other minor improvements (full release notes for this version are detailed below).

If you haven't used RTG Core before (or maybe even if you have), it includes a nice new demo script that runs through an end-to-end demonstration of sex-aware and pedigree-aware family variant calling, including de novo variant detection and variant evaluation with vcfeval. (It also makes a nice demo of our comprehensive simulation tools.)

Commercial users of RTG Core may download the update from our website at http://realtimegenomics.com/products/rtg-core-downloads. Non-commercial users can download the update from our website at http://realtimegenomics.com/products/rtg-core-non-commercial or build from the source on GitHub (note the updated build instructions).

Users of RTG Tools, which is made freely available for non-commercial or commercial use alike, can download the new version from our website at http://realtimegenomics.com/products/rtg-tools or build from the source code on GitHub.


Detailed changes are listed below by area.  Please read these through
fully, as some command-line flags have changed, so updates to your
pipeline scripts may be required. For more information on new
features, see the RTG Operations Manual.

RTG Core 3.5 (2015-07-16)
-------------------------

### Basic Formatting and Mapping

* format/map: When formatting or mapping reads supplied as SAM/BAM
  input data, any alignments marked as supplementary are ignored.
  Note that if the input data has already been aligned, it is
  recommended that the BAM file be shuffled to avoid biases during
  mapping arising from the data being presented in chromosomal
  order. See the user manual for more information.

* sdf2fasta/sdf2fastq: These commands have new flags --names and
  --id-file that operate the same as their counterparts in sdfsubset.

* sdfsubset: This command has new flags --start-id and --end-id that
  allow specifying a range of sequences by ID.

* sdf2sam: This new command allows the extraction of reads from SDF
  in the form of unaligned SAM/BAM.  This has a benefit over
  extraction as FASTQ in that some metadata (such as read group
  information) is preserved, paired end data is stored in a single
  file, and quality encoding is inherent in the format.

* chrstats: Reduce false positives in sex inconsistency detection that
  were due to applying the (tighter) sex-chromosome threshold also to
  autosomes. This threshold is now applied to sex-chromosomes only.
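
As a rough illustration of the supplementary-alignment rule in the
format/map entry above: in the SAM specification, bit 0x800 of the FLAG
field marks a supplementary alignment. The sketch below drops such
records from plain-text SAM lines. The function names are hypothetical
and this is not RTG's own implementation, only a minimal picture of the
filtering step.

```python
SUPPLEMENTARY = 0x800  # SAM FLAG bit marking a supplementary alignment


def is_supplementary(sam_line: str) -> bool:
    """Return True if a SAM alignment line has the supplementary bit set."""
    fields = sam_line.rstrip("\n").split("\t")
    return bool(int(fields[1]) & SUPPLEMENTARY)  # FLAG is the second column


def non_supplementary_records(sam_lines):
    """Yield alignment lines, skipping headers and supplementary alignments."""
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line, not an alignment record
        if not is_supplementary(line):
            yield line
```

Secondary alignments (bit 0x100) are deliberately left alone here, since
the release note only mentions supplementary alignments.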

### Variant Calling and Analysis

* somatic: Now allows the user to specify a BED file containing
  per-site somatic priors, which can be used (for example) to reduce
  the somatic prior at sites typical of false positives (e.g. presence
  in dbSNP) or increase the somatic prior at sites known to harbour
  somatic variants (e.g. presence in COSMIC).  For more information
  see the user manual.

* somatic: At the end of variant calling, the somatic caller produces
  an estimate of somatic sample contamination.  Previously this
  estimate was only available in the log file, but in this release
  this computation has been greatly improved, and the contamination
  estimate is now included in the standard summary statistics.

* somatic: "Gain of reference" calls are now disabled by default.
  These can be included by specifying the new flag
  --include-gain-of-reference.

* somatic: Calls that are indicative of loss of heterozygosity (LOH)
  are not produced by default (since loss of heterozygosity analysis
  is most useful in conjunction with additional data such as germline
  variant calls or CNV data).  These calls can be produced if desired
  by specifying --loh with a prior greater than 0.

* somatic: Previously, when LOH calls were enabled, they were output
  in haploid GT representation. They now use the ploidy appropriate
  for the chromosome (according to the reference), for compatibility
  with downstream processing tools.

* somatic: VCF output changes to bring the somatic representation in
  line with the TCGA 1.2 VCF specification. In particular:

  * Calls include a new FORMAT field SS that indicates the somatic
    status for the derived (tumor) sample. This field replaces the
    previous SOMATIC INFO field.

  * Calls include a new FORMAT field SSC which contains the somatic
    score for the derived (tumor) sample. This field replaces the
    previous RSS INFO field.

* lineage: Supports the input of pedigree in the form of VCF header
  annotations as output by the somatic caller, in the form:

  ##PEDIGREE=<Derived=TUMORSAMPLENAME,Original=NORMALSAMPLENAME>

* population: Fixed a rare case where, after complex call
  simplification, the only sample genotype containing a non-ref
  allele belonged to a member of the pedigree not being output. In
  this case the QUAL score was -10log10 prob(no variant) rather than
  -10log10 prob(variant) as required by the VCF specification.

* vcfmerge: Added a new flag --force-merge-all to always attempt to
  merge headers containing conflicting descriptions.

* vcfmerge: Previously vcfmerge would not process records containing
  symbolic alleles. These are now accepted.

* vcfmerge: More graceful handling when encountering records with a GT
  that refers to a non-existent ALT.

* vcfeval: Now outputs a summary containing various accuracy
  metrics. A first set of statistics is computed from the full set of
  variants evaluated (these will typically have highest sensitivity
  but potentially poor precision if the input call set has not been
  filtered). A second set of statistics is computed based on the ROC
  curve information, selected at a threshold which maximises the
  F-measure statistic (this provides some balance between sensitivity
  and precision, so may be a fairer point to gather statistics for
  cross-caller comparison).

* vcfeval: The weighted_roc.tsv file now includes columns containing
  additional accuracy metrics.

* vcfeval: Improved the detection that alerts the user when chromosome
  names are incompatible between reference, baseline, calls, and bed
  regions (if used). Improvements to other error and warning messages.

* vcfeval: Added a new flag --bed-regions to supply a BED file
  containing a list of regions that the VCF records must overlap with
  in order to be included in analysis.  For example, a common use case
  is to restrict to only evaluating calls contained within the GIAB
  high-confidence regions, or only within regions corresponding to
  exome target regions.

* vcfeval: Added a new flag --region to specify a single region to
  evaluate variants within. This is useful when evaluating calls on a
  single chromosome or within a small region of interest.

* vcfeval: Fixed a case where a ref-only call (i.e. containing no
  alts) could get output instead of an indel with a padding base at
  the same position.

* vcfeval: Disabled the output of slope analysis data files by default,
  as these are fairly special purpose (primary ROC files are still
  output). They can be re-enabled if desired by using the new
  expert/experimental flag --Xslope-files.

* vcffilter: The --remove-all-same-as-ref flag now does not consider a
  sample with missing GT as being variant, since the intent of this
  flag is to retain only records where at least one sample is called
  as variant.

* vcfannotate: Added two new flags --info-id and --info-description to
  allow specifying the name of the INFO ID and Description fields
  added to the header during annotation. These flags only take effect
  if the VCF header does not already contain an INFO declaration with
  that ID.
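
To make the ROC-based summary statistics in the vcfeval entries above
more concrete, here is a minimal sketch of choosing the score threshold
that maximises the F-measure. It assumes each ROC point is a
(threshold, true positives, false positives) tuple and that the total
number of baseline variants is known. That layout and the function
names are illustrative assumptions, not the actual column layout of
weighted_roc.tsv or vcfeval's own code.

```python
def f_measure(precision: float, sensitivity: float) -> float:
    """Harmonic mean of precision and sensitivity (0.0 if both are zero)."""
    if precision + sensitivity == 0:
        return 0.0
    return 2 * precision * sensitivity / (precision + sensitivity)


def best_threshold(roc_points, total_baseline):
    """Pick the ROC point with the highest F-measure.

    roc_points: iterable of (threshold, true_positives, false_positives),
    where the counts include all calls scoring at or above the threshold.
    total_baseline: number of variants in the baseline set.
    """
    best = None
    for threshold, tp, fp in roc_points:
        called = tp + fp
        precision = tp / called if called else 0.0
        sensitivity = tp / total_baseline if total_baseline else 0.0
        score = f_measure(precision, sensitivity)
        if best is None or score > best["f_measure"]:
            best = {
                "threshold": threshold,
                "precision": precision,
                "sensitivity": sensitivity,
                "f_measure": score,
            }
    return best
```

For example, with 100 baseline variants and the points (50, 60, 1),
(30, 80, 5) and (10, 90, 40), the middle point is selected: it gives up
some precision relative to the strictest threshold but gains enough
sensitivity to yield the highest F-measure.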

### Metagenomics

* taxfilter: Added a new flag --subtree which allows selecting entire
  taxonomic subtrees for inclusion in the output taxonomy.

* taxfilter: Added a new flag --remove-sequences to allow the removal
  of sequence data associated with specific taxon ids.

* sdf2fasta: Added a new flag --taxons that interprets each supplied
  ID as a taxon ID and outputs all sequences assigned to that taxon
  ID. This provides an easy way to extract genomic sequence for any
  species from the reference SDF.
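
The idea behind subtree selection, as in the taxfilter --subtree entry
above, can be sketched as a simple tree walk. The example assumes the
taxonomy is available as a child-to-parent mapping of taxon IDs. That
representation, the toy IDs, and the function name are assumptions made
for illustration and do not describe RTG's internal data structures.

```python
from collections import defaultdict


def subtree_ids(parent_of, root):
    """Return the set of taxon IDs in the subtree rooted at `root`.

    parent_of: dict mapping each taxon ID to its parent taxon ID.
    """
    # Invert the child -> parent mapping into parent -> children.
    children = defaultdict(list)
    for child, parent in parent_of.items():
        children[parent].append(child)

    selected = set()
    stack = [root]
    while stack:
        node = stack.pop()
        if node in selected:
            continue  # guards against a root listed as its own parent
        selected.add(node)
        stack.extend(children[node])
    return selected


# Toy taxonomy with made-up IDs: 1 is the root; 2 and 3 are its children.
toy_taxonomy = {1: 1, 2: 1, 3: 1, 4: 2, 5: 2, 6: 3}
selected = subtree_ids(toy_taxonomy, 2)
```

Here `selected` contains taxon 2 and its descendants 4 and 5, and
nothing from the sibling branch under taxon 3.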

### Other

* genomesim: Added a new flag --prefix to specify a prefix for
  generated sequence names.

* many: Update the base library used for SAM/BAM input and output to
  htsjdk 1.128.

* many: VCF reading now detects cases where a header specifies a field
  declaration using an ID that is already in use, preventing duplicate
  header declarations.

* extract: Fix a regression where extracting from VCF without any
  region specified would include the VCF header.
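
The duplicate header declaration check mentioned above can be pictured
roughly as follows. This sketch assumes that "already in use" means the
same ID declared twice within the same header field type (INFO, FORMAT
or FILTER) and that ID is the first key in each declaration, as is
conventional. The function name is hypothetical and this is not RTG's
implementation.

```python
import re
from collections import Counter

# Matches lines such as: ##INFO=<ID=DP,Number=1,Type=Integer,...>
DECLARATION = re.compile(r"^##(INFO|FORMAT|FILTER)=<ID=([^,>]+)")


def duplicate_declarations(header_lines):
    """Return sorted (field_type, id) pairs declared more than once."""
    counts = Counter()
    for line in header_lines:
        match = DECLARATION.match(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return sorted(key for key, n in counts.items() if n > 1)
```

An ID reused across different field types (for example DP as both INFO
and FORMAT) is not reported, since those are separate namespaces in
VCF.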
