Hi Kwat,
First off, I have no experience with GATK, so I can't comment on the best steps for that workflow. Other users may be able to chime in there.
There is quite a bit of variability in data types, so it is hard to describe a single workflow that will be useful in every case. Below are a few tools and libraries I use; keep in mind you could likely choose any of a dozen others for each step and get very similar results.
Mutation calling: Strelka, MutationSeq
Deep sequence alignment: I use bwa with the `aln` and `sampe` commands; I've found the `mem` algorithm to be a bit aggressive for this. I align to the entire genome, not just the target amplicons.
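In case it helps, the two-step alignment boils down to something like this (driven from Python since that is what I script in; `genome.fa` and the read file names are placeholders):

```python
import subprocess

# Placeholder paths; the reference should be indexed once with `bwa index genome.fa`.
ref = "genome.fa"
fastqs = ["reads_1.fastq", "reads_2.fastq"]
sais = ["reads_1.sai", "reads_2.sai"]

# Step 1: `bwa aln` writes a .sai alignment per read end (-f sets the output file).
for fastq, sai in zip(fastqs, sais):
    subprocess.run(["bwa", "aln", "-f", sai, ref, fastq], check=True)

# Step 2: `bwa sampe` pairs the two ends and emits SAM.
subprocess.run(
    ["bwa", "sampe", "-f", "aligned.sam", ref, sais[0], sais[1], fastqs[0], fastqs[1]],
    check=True,
)
```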
Copy number array: I mostly use OncoSNP but have also used PICNIC and ASCAT to check the robustness of results.
Copy number WGSS: Currently I am using an in-house tool, but I have also used TITAN and OncoSNP-Seq.
Collating the data: I typically write custom Python scripts to do this. Python has good built-in support for CSV (and TSV, via the same `csv` module). The "pandas" library is invaluable for working with tabular data, and the "PyYAML" module is good for writing the `.yaml` config files.
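As a toy example of the pandas side (the file and column names here are made up; PyYAML shows up in the config sketch further down):

```python
import pandas as pd

# Hypothetical TSV of per-site allele counts; pandas handles headers and types for you.
counts = pd.read_csv("deepseq_counts.tsv", sep="\t")

# Filtering, grouping and summarising are one-liners.
deep_sites = counts[counts["ref_counts"] + counts["var_counts"] >= 1000]
print(deep_sites.groupby("sample_id").size())

deep_sites.to_csv("filtered_counts.tsv", sep="\t", index=False)
```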
Pipeline: I write my own pipelines using the "ruffus" library.
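A toy ruffus pipeline looks like this; the task bodies are stubs, the point being that ruffus chains tasks by their input/output file names and only reruns stale steps:

```python
from ruffus import transform, suffix, pipeline_run

# Map each .fastq to a .sam; in a real pipeline this would shell out to bwa.
@transform("sample.fastq", suffix(".fastq"), ".sam")
def align(input_file, output_file):
    open(output_file, "w").close()  # stub

# Downstream tasks take the upstream task object as their input.
@transform(align, suffix(".sam"), ".counts.tsv")
def count_alleles(input_file, output_file):
    open(output_file, "w").close()  # stub

pipeline_run([count_alleles])
```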
Our basic workflow is as follows:
- WGSS
- Call mutations
- Estimate copy number and tumour content (we may use an array if we did exome sequencing)
- Design PCR primers to deep sequence the mutations
- Use MiSeq to deepseq the mutations
- Align with BWA `aln` and `sampe` to the whole genome
- Extract counts using a custom script
- Perform a binomial exact test to determine whether the variant is actually present (see the sketch after this list)
- Remove any sites that turn out to be germline or wildtype
- Join the copy number data and deepseq counts with a custom Python script
- Autogenerate a `.yaml` config file with another custom Python script (both steps are sketched after this list)
- Run PyClone 3+ times with different random seeds to check convergence (see the loop after this list)
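For the binomial exact test step, the question at each site is whether the variant allele count is higher than sequencing error alone would produce. A minimal sketch with SciPy (the counts and error rate are made up; older SciPy versions spell the function `binom_test`):

```python
from scipy.stats import binomtest

# Made-up counts at one deep-sequenced site.
var_counts, depth = 30, 5000
error_rate = 0.005  # assumed per-base sequencing error rate

# One-sided test: is the variant allele fraction above the error rate?
result = binomtest(var_counts, depth, p=error_rate, alternative="greater")
if result.pvalue < 0.01:
    print("variant likely present")
```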
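The joining and config-generation scripts boil down to roughly this (the column names and config keys are illustrative; check the PyClone documentation for the exact input format and required keys):

```python
import pandas as pd
import yaml

# Merge deepseq counts with copy number calls on a shared mutation ID.
counts = pd.read_csv("filtered_counts.tsv", sep="\t")
cn = pd.read_csv("copy_number.tsv", sep="\t")
merged = pd.merge(counts, cn, on="mutation_id")
merged.to_csv("pyclone_input.tsv", sep="\t", index=False)

# Autogenerate the config; these keys are placeholders, not the definitive schema.
config = {
    "working_dir": "pyclone_run",
    "num_iters": 10000,
    "samples": {
        "tumour_1": {
            "mutations_file": "pyclone_input.tsv",
            "tumour_content": 0.7,
        }
    },
}
with open("config.yaml", "w") as fh:
    yaml.safe_dump(config, fh, default_flow_style=False)
```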
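Checking convergence is then just a loop over seeds. I am writing the invocation from memory, so treat the subcommand and `--seed` flag as assumptions and confirm them against `PyClone --help` on your install:

```python
import subprocess

# Assumed CLI; verify the subcommand and --seed flag for your PyClone version.
for seed in (1, 2, 3):
    subprocess.run(
        ["PyClone", "run_analysis", "--config_file", "config.yaml", "--seed", str(seed)],
        check=True,
    )
```

If the inferred clusters look the same across runs, the chains have probably converged.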
Cheers,
Andy