I work in a lab focused mainly on evolution and genomics in fishes. We have been using CLC Genomics Workbench for the last year with great results. The software is really a well-integrated resource with a lot of small functionalities. Nothing that you couldn't find in the open-source world, but well put together.
RNA sequencing (RNA-seq) is a recently developed approach to transcriptome profiling using next-generation sequencing (NGS) technologies. Studies have shown that RNA-seq provides accurate measurement of transcript levels as well as their isoforms, which is useful for addressing complex transcriptomes. In addition, the growing number of publicly available sequencing datasets and the decreasing cost of sequencing promote the use of RNA-seq for hypothesis-generating studies. In this chapter, we demonstrate how to analyze RNA-seq data and generate interpretable results using the CLC Genomics Workbench software, and how to perform downstream pathway analysis using Ingenuity Pathway Analysis (IPA).
Ensure you have the most up-to-date version of the CLC Genomics Workbench (the software will tell you if a more recent version is available when you start it, or you can check the CLCbio website). Register the CLC Genomics Workbench and follow the steps to connect to the CLCbio license server.
If you would like to use extra plugins, click the Plug-ins button in the toolbar at the top of the CLC Genomics Workbench window. This will bring up the Manage Plugins dialog box. Find the plugin, click the Download and Install button, then close the Manage Plugins dialog box and restart the CLC Genomics Workbench (choose Yes when the dialog box asks if you want to restart the workbench now). If you are using a Windows machine, you may need to start the CLC Genomics Workbench as administrator to install plugins.
Computational genomics tasks require various reference genomes. In CLC_References, folders marked with a blue S on the folder icon hold the reference genomes. CLC_References is associated with the Biomedical Genomics Server, and its contents include the human, mouse, and rat genomes. Reference genomes for other species are installed inside the Genomes folder under CLC_References. If you need a special reference genome, please open a support ticket.
Once you start a job running on the HTC cluster, you will see the usual progress bars in the Process section of the Toolbox. When the job status is listed as "Running", you can close your Workbench, and the job will continue running on the remote server. When you relaunch your Workbench, it will reconnect to the server (as long as you checked "Automatic login" above; otherwise you can log in again manually), and the status of your job will be updated.
Our AI Workbench 1.0 focused on correcting RNA splicing as a means to restore protein expression. We recognized that a wide range of disorders, such as haploinsufficiencies, could potentially be treated by boosting gene expression. Version 2.0 of the workbench was expanded to include seven mechanisms for increasing expression and to reduce the work needed to identify efficacious compounds. We are currently developing AI Workbench 3.0 to support target identification and drug discovery for more common, complex diseases involving multiple genes. We are now advancing a pipeline of programs in a number of areas, including neurodevelopmental, neurodegenerative, and metabolic diseases.
As additional capabilities were added to the software platform, it was eventually split into several themed Workbenches and plugins with collections of features relevant to different applications (e.g. pathway analysis, genomics, and other omics). Features include read mapping and de novo assembly of high-throughput sequencing data, whole-genome detection of SNPs and structural variations, ChIP-seq, RNA-Seq, small RNA analysis, genome finishing, microbial genomics, structural biology, and functions to analyze, visualize, and compare genomic, transcriptomic, and epigenomic data.
My job is genotyping enteroviruses from patient samples. As you may know, there are many subtypes of enteroviruses, so we align our FASTQ files against multi-FASTA references. When we used your Galaxy tools, such as Bowtie2 and BWA, only very few reads aligned to the multi-FASTA references, but with the CLC Genomics Workbench, many reads were found to align.
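One plausible explanation for this difference is mapper stringency: Bowtie2 and BWA with default settings tend to reject reads that diverge substantially from the reference, while CLC's read mapper exposes length-fraction and similarity-fraction cutoffs that can be relaxed for divergent viral subtypes. The toy Python sketch below (hypothetical reads and thresholds, not the actual mapping algorithms) illustrates how an identity cutoff alone changes how many reads count as "mapped":

```python
def percent_identity(read, ref_window):
    """Fraction of matching positions between a read and an equal-length reference window."""
    matches = sum(1 for a, b in zip(read, ref_window) if a == b)
    return matches / len(read)

def count_mapped(reads, ref, min_identity):
    """Count reads whose best ungapped placement on ref meets the identity cutoff."""
    mapped = 0
    for read in reads:
        best = max(
            percent_identity(read, ref[i:i + len(read)])
            for i in range(len(ref) - len(read) + 1)
        )
        if best >= min_identity:
            mapped += 1
    return mapped

ref = "ACGTACGTACGTACGTACGT"
reads = [
    "ACGTACGT",  # exact match (100% identity)
    "ACGTACCT",  # 1 mismatch (87.5% identity)
    "ACCTACCT",  # 2 mismatches (75% identity)
]

strict = count_mapped(reads, ref, 0.95)   # strict, near-exact cutoff
relaxed = count_mapped(reads, ref, 0.80)  # relaxed similarity cutoff
print(strict, relaxed)  # → 1 2
```

With the strict cutoff only the exact-match read maps; relaxing the similarity requirement recovers the divergent read as well, which mirrors the behavior described above when tuning mapper stringency for heterogeneous viral references.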
Analyzing high throughput genomics data is a complex and compute intensive task, generally requiring numerous software tools and large reference data sets, tied together in successive stages of data transformation and visualisation. A computational platform enabling best practice genomics analysis ideally meets a number of requirements, including: a wide range of analysis and visualisation tools, closely linked to large user and reference data sets; workflow platform(s) enabling accessible, reproducible, portable analyses, through a flexible set of interfaces; highly available, scalable computational resources; and flexibility and versatility in the use of these resources to meet demands and expertise of a variety of users. Access to an appropriate computational platform can be a significant barrier to researchers, as establishing such a platform requires a large upfront investment in hardware, experience, and expertise.
However, the reality is that the necessary tools, platforms and data services for best practice genomics are generally complicated to install and customize, require significant computational and storage resources, and typically involve a high level of ongoing maintenance to keep the software, data and hardware up-to-date. It is also the case that a single workflow platform, however comprehensive, is rarely sufficient for all the steps of a real-world analysis. This is because analyses often involve analyst decisions based on feedback from visualisation and evaluation of processing steps, requiring a combination of various analysis, data-munging and visualisation tools to carry out an end-to-end analysis. This in turn requires expertise in software development, system administration, hardware and networking, as well as access to hardware resources, all of which can be a barrier for widespread adoption of genomics by domain researchers.
We argue that lack of widespread access to an appropriate environment for conducting best-practice analysis is a significant obstruction to reproducible, high quality research in the genomics community; and further, transitioning from training to practice places non-trivial technical and conceptual demands on researchers. Public analysis platforms, such as Galaxy, provide solutions to some of these issues (particularly accessibility), but are generally handicapped by rapid growth in per-user demand for compute resources and data storage, and the enforced constraints on flexibility that are a requirement of a centrally managed resource.
Reproducible genomics requires, at a minimum, a way of accessing the same tools and reference datasets used in an analysis, combined with a comprehensive record of the steps taken in that analysis in the form of a workflow, in sufficient detail to reliably produce the same outcome from the same input data, assuming a deterministic analysis [18]. At the most basic level reproducibility can be achieved with shell scripting and documentation, but issues in ease of use, maintenance and genuine reproducibility are well-known [19], [20]. This has catalysed a number of efforts in developing platforms for reproducible scientific analysis through structured workflows, including Galaxy, Yabi, Chipster, GenePattern and numerous commercial products (e.g., Igor [21], BaseSpace, Globus Genomics [22]). An environment supporting reproducible genomics requires at least a workflow platform and a system for ensuring stability of the underlying software and data [23].
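The minimum record described above amounts to: which tools, which versions, which commands, and which exact input data. A minimal Python sketch (a hypothetical helper, not any of the workflow platforms named here) of capturing that provenance as input checksums plus a step log, so a run can be verified later:

```python
import hashlib
import json

def sha256_of(path):
    """Checksum an input file so the exact data used can be verified later."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def record_step(log, tool, version, command, inputs):
    """Append one analysis step, with tool version and input checksums, to a provenance log."""
    log.append({
        "tool": tool,
        "version": version,
        "command": command,
        "inputs": {p: sha256_of(p) for p in inputs},
    })

# Example: create a tiny (hypothetical) input file and log one alignment step.
with open("reads.fastq", "w") as f:
    f.write("@r1\nACGT\n+\nIIII\n")

log = []
record_step(log, "bwa", "0.7.17",
            "bwa mem ref.fa reads.fastq > aln.sam", ["reads.fastq"])
print(json.dumps(log, indent=2))
```

Workflow platforms such as Galaxy automate exactly this bookkeeping (and more, such as pinning the tool binaries themselves), which is why hand-rolled scripts tend to fall short of genuine reproducibility.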
Building an analysis environment that guarantees good performance for a wide user base is especially challenging. In the case of a managed service for genomics, the more successful the service is in attracting users, the more likely it is that performance will suffer due to the number of users, particularly as those users explore larger data sets through a wider range of analysis options [24]. Good performance on a per-user basis is a combination of available resources, user access to those resources, underlying infrastructure limits and bottlenecks (for instance, disk I/O), and the inherent scalability of the environment. We would argue that performance in the context of a widely available, flexible genomics environment requires high-availability, scalable back-end compute resources. We will discuss performance design principles and implications in more detail in a later section, as this is a particularly challenging but critical characteristic of an environment that aims to support large genomics data analysis.
As users become more sophisticated in genomics analysis, they often move from a single intuitive analysis platform (such as Galaxy) to multiple platforms (R, command line, custom scripts) that provide more capability and flexibility (generally at the expense of simplicity). Therefore, a design principle for a general genomics environment should be for that environment to be able to be used for training (implying at least an accessible platform), but able to scale in flexibility by adding more options for interaction (such as command line and/or programmatic interfaces), and scale computationally to provide the performance for real data analysis. For all levels of the environment, we would provide high capability through access to best practice tools and availability of reference datasets, and ideally linked to low latency visualisation and data interpretation services.