elprep4.0.0 released

Charlotte Herzeel

Oct 19, 2018, 6:16:42 AM
to elprep
Hi,


The ExaScience Life Lab at Imec is happy to announce the release of elPrep4.0, an open-source, drop-in replacement for GATK4/Picard/SAMtools functionality that produces identical results, while greatly improving performance. See https://github.com/ExaScience/elprep

elPrep4.0 introduces multiple new features allowing us to process the preparation steps defined by the GATK Best Practices for variant calling. This includes new and improved functionality for optical duplicate marking, base quality score recalibration, FASTA, BED, and VCF parsing, and various filtering options. The implementations of these options in elPrep4.0 faithfully reproduce the exact outcomes of their counterparts in Picard/GATK4.0, while vastly improving performance. Our benchmarks show that elPrep4.0 executes the sort, deduplicate, recalibrate, and apply-BQSR pipeline from the GATK Best Practices up to 12x faster on WES data and 7x faster on WGS data, while using fewer compute resources than Picard/GATK4.0.
 
elPrep4.0 introduces the following new features:

Functionality
- Base quality score recalibration (BQSR).
The option --bqsr combines the semantics of the GATK4.0 commands BaseRecalibrator and ApplyBQSR, producing identical results.

- Optical duplicate marking.
The Picard/GATK4.0 option for duplicate marking (MarkDuplicates) automatically performs optical duplicate marking and generates metrics that distinguish between PCR and optical duplicates. elPrep4.0 now has the option --mark-optical-duplicates for this, producing identical results.

- Metrics (MultiQC compatible).
elPrep4.0 can now generate metrics files to produce statistics about the number of unmapped reads, duplicate reads, base quality scores etc. It generates .metrics and .recal files as in Picard/GATK4.0, with identical file formats and identical results. These files are compatible with MultiQC for visualisation.
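
As a rough illustration of how the options above might be combined, the Python sketch below assembles (but does not run) an elprep filter command line. The option names are taken from this announcement. The file names are hypothetical, and the argument shapes are assumptions: that --mark-optical-duplicates and --bqsr each take the path of the .metrics and .recal file, and that --bqsr-reference and --known-sites take elfasta and elsites files. Please check the README for the exact usage.

```python
import shlex

def build_filter_command(bam_in, bam_out, metrics, recal, reference, known_sites):
    """Assemble (but do not run) an elprep filter invocation as an argv list."""
    return [
        "elprep", "filter", bam_in, bam_out,
        "--mark-duplicates",
        "--mark-optical-duplicates", metrics,    # assumed: path of the .metrics file
        "--bqsr", recal,                         # assumed: path of the .recal file
        "--bqsr-reference", reference,           # assumed: reference in elfasta format
        "--known-sites", ",".join(known_sites),  # assumed: comma-separated elsites files
    ]

# Hypothetical file names, for illustration only.
cmd = build_filter_command(
    "sample.bam", "sample.out.bam",
    "sample.metrics", "sample.recal",
    "reference.elfasta", ["dbsnp.elsites"],
)
print(shlex.join(cmd))
```

Building the command as a list keeps each argument separate, so it can be handed to subprocess.run(cmd, check=True) without shell quoting issues.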


File formats
- Support for SAM File Format version 1.6, including support for large CIGAR strings.

- Support for FASTA and VCF files.
VCF files are supported directly. BCF or VCF.GZ files require bcftools to be present.

- Support for elPrep-specific elsites and elfasta formats, with converters from BED, FASTA, and VCF files.

- Support for BAM/BGZF files directly implemented in elPrep4.0; dependency on SAMtools is dropped.

- Support for CRAM files dropped. Please use external tools for CRAM conversion instead. elPrep supports input/output piping.
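
One possible way to keep CRAM at the edges of a workflow is to let SAMtools do the conversion and pipe SAM through elPrep. The Python sketch below only assembles the three stages of such a pipe. The file names are hypothetical, and using /dev/stdin and /dev/stdout as elPrep's input and output names (with SAM assumed on the pipe) is an assumption to verify against the README.

```python
import shlex

def cram_pipeline(cram_in, cram_out, reference_fasta, elprep_options):
    """Return the three stages of a CRAM -> elPrep -> CRAM pipe as argv lists."""
    # Decode CRAM to SAM on stdout; -h keeps the header, -T names the reference.
    decode = ["samtools", "view", "-h", "-T", reference_fasta, cram_in]
    # Assumed: elPrep reads SAM from /dev/stdin and writes SAM to /dev/stdout.
    process = ["elprep", "filter", "/dev/stdin", "/dev/stdout", *elprep_options]
    # Encode SAM from stdin ("-") back to CRAM (-C) and write it to cram_out.
    encode = ["samtools", "view", "-C", "-T", reference_fasta, "-o", cram_out, "-"]
    return [decode, process, encode]

# Hypothetical file names, for illustration only.
stages = cram_pipeline("in.cram", "out.cram", "ref.fasta", ["--mark-duplicates"])
print(" | ".join(shlex.join(stage) for stage in stages))
```

The printed line can be pasted into a shell, or the stages can be chained with subprocess.Popen by connecting each stage's stdout to the next stage's stdin.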

Tool changes
- Split/filter/merge (sfm) mode now implemented in Go; dependency on Python scripts is dropped.
elPrep offers two execution modes, a mode that operates entirely in RAM, and a mode that splits data using genomic regions for processing (sfm). This was previously implemented using Python scripts, but these are now replaced by an sfm subcommand, making elPrep both easier to install and use.

- split/merge: The split tool now groups entries in the sequence dictionary and splits the input data according to these groups rather than the individual “chromosomes”. The format of the file names that are generated this way is changed accordingly.

- --mark-duplicates-deterministic is deprecated; --mark-duplicates is now deterministic by default.

- Added --log-path option to all tools, which sets the path for writing the elPrep log file.

- New command line parameters for filter/sfm commands: --mark-optical-duplicates, --bqsr, --bqsr-reference, --known-sites, --quantize-levels, --sqq.

- New command line parameter for split command: --contig-group-size.

- New command line tools: sfm, fasta-to-elfasta, vcf-to-elsites, bed-to-elsites.
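
To give an intuition for the grouping of sequence dictionary entries mentioned in the split/merge item above, here is a minimal Python sketch. It is not elPrep's actual implementation: the greedy strategy, the function name group_contigs, and the choice of the largest contig length as the default target size are all assumptions made for illustration.

```python
def group_contigs(seq_dict, group_size=0):
    """Greedily pack (name, length) entries into groups of bounded total length.

    Hypothetical helper: group_size=0 means "use the largest contig length
    as the target", which is an assumption, not documented elPrep behaviour.
    """
    if not seq_dict:
        return []
    target = group_size if group_size > 0 else max(length for _, length in seq_dict)
    groups, current, total = [], [], 0
    for name, length in seq_dict:
        # Start a new group when adding this contig would exceed the target.
        if current and total + length > target:
            groups.append(current)
            current, total = [], 0
        current.append(name)
        total += length
    if current:
        groups.append(current)
    return groups

# Toy sequence dictionary with made-up lengths.
toy_dict = [("chr1", 100), ("chr2", 60), ("chr3", 30), ("chrM", 5), ("chrUn", 5)]
groups = group_contigs(toy_dict)
print(groups)
```

With this toy input, the large contig stays on its own while the smaller ones share a group, so each split file covers a comparable amount of sequence.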

API
- Improved internal representation of reads.
- Improved API for reading/writing SAM/BAM files.
- New data structures for representing FASTA files.
- New intervals data type for representing BED and VCF intervals.

Performance
- Multiple performance and memory usage improvements.

License
- elPrep4.0 is released and distributed as an open-source project under the terms of the GNU Affero General Public License version 3 as published by the Free Software Foundation, with Additional Terms.

Demo
- Updated demos at https://github.com/ExaScience/elprep/tree/master/demo

Kind regards,
Charlotte