Bioinformatics Tools for Analyzing High Throughput Sequencing (NGS) Data

  • This page lists tools commonly used in bioinformatics analyses of NGS datasets, although many of them can also be applied to other data types (such as microarray data, Sanger sequencing, proteomics).
  • The list is not exhaustive, but it covers some of the most up-to-date and most commonly used tools.
  • Most of these tools are open source and free.
  • Most of these tools are run from the command line, although some of them also provide a GUI (Graphical User Interface).

Keeping up to date with sequencing platforms and costs

Data manipulation

  • Picard Tools -A set of Java command-line tools for manipulating high-throughput sequencing (HTS) data and formats.
  • FastQC -A quality control tool for high-throughput sequence data.
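
As a rough illustration of the kind of per-read metric a quality control step looks at, the sketch below decodes FASTQ quality strings into Phred scores and averages them. It is not how FastQC itself is implemented. It assumes Phred+33 encoding (Sanger / Illumina 1.8+) and the simple four-lines-per-record layout, and the function names are made up for this example.

```python
PHRED_OFFSET = 33  # assumption: Phred+33 (Sanger / Illumina 1.8+) encoding


def phred_scores(quality_line):
    """Convert a FASTQ quality string into a list of integer Phred scores."""
    return [ord(ch) - PHRED_OFFSET for ch in quality_line]


def mean_quality(quality_line):
    """Return the mean Phred score of one read's quality string."""
    scores = phred_scores(quality_line)
    return sum(scores) / len(scores)


def parse_fastq(lines):
    """Yield (read_id, sequence, quality) tuples from FASTQ lines.

    Assumes four lines per record (no wrapped sequences); blank lines
    are skipped and an incomplete trailing record is ignored.
    """
    non_blank = (line.strip() for line in lines if line.strip())
    # Passing the same generator four times makes zip consume 4 lines per record.
    for header, seq, _plus, qual in zip(*[non_blank] * 4):
        yield header[1:], seq, qual


if __name__ == "__main__":
    example = ["@read1\n", "ACGT\n", "+\n", "II5!\n"]
    for read_id, seq, qual in parse_fastq(example):
        print(read_id, seq, mean_quality(qual))
```

With a real file, `parse_fastq(open("reads.fastq"))` would work the same way, where `reads.fastq` is a placeholder file name.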

Sequence alignment

  • The Genome Analysis Toolkit (GATK) -Software package developed at the Broad Institute to analyze high-throughput sequencing data.
  • bwa -Maps sequences (e.g. 454, Illumina) against a large reference genome, such as the human genome.
  • bowtie -Ultrafast, memory-efficient short read aligner.

de novo transcriptome assembly

  • trinity -Efficient and robust de novo reconstruction of transcriptomes from RNA-seq data
  • Trans-ABySS -de novo assembly of RNA-Seq data using ABySS
  • SOAPdenovo-Trans -de novo transcriptome assembler based on the SOAPdenovo framework, adapted to alternative splicing and differing expression levels among transcripts
  • MIRA -Sequence assembler and sequence mapper for whole genome shotgun and EST / RNA-Seq sequencing data.

de novo genome assembly

Variant calling (SNPs / short indels)

  • samtools -Samtools is a suite of programs for interacting with high-throughput sequencing data.
  • The Genome Analysis Toolkit (GATK) -A variety of tools with a primary focus on variant discovery and genotyping, as well as a strong emphasis on data quality assurance.
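
Variant callers generally start from aligned reads in SAM/BAM format, where the second column of each alignment line is a bitwise FLAG. The sketch below shows one way to decode that field in plain Python. The bit values follow the SAM format specification, but the label strings and function names are illustrative choices for this example, not the output of samtools or GATK.

```python
# Bit values as defined in the SAM format specification;
# the label strings are illustrative names chosen for this sketch.
SAM_FLAGS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse",
    0x20: "mate_reverse",
    0x40: "read1",
    0x80: "read2",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag):
    """Return the labels of all bits set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]


def flag_from_sam_line(line):
    """Extract the integer FLAG (second tab-separated column) from a SAM alignment line."""
    return int(line.split("\t")[1])


if __name__ == "__main__":
    # Hypothetical alignment line, made up for this example.
    sam_line = "read1\t99\tchr1\t100\t60\t4M\t=\t200\t104\tACGT\tIIII"
    print(decode_flag(flag_from_sam_line(sam_line)))
```

Filtering on these bits (for example, skipping reads labelled `unmapped` or `duplicate`) is a common first step before counting the bases that support a variant.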

Bioinformatics tools geared specifically towards GBS and RAD data

  • tassel -TASSEL is a bioinformatics software package that can analyze diversity for sequences, SNPs, or SSRs.
  • stacks -Software pipeline for building loci from short-read sequences, such as those generated on the Illumina platform. Stacks was developed to work with restriction enzyme-based data, such as RAD-seq, for the purpose of building genetic maps and conducting population genomics and phylogeography.
  • pyRAD -pyRAD can analyze RAD, ddRAD, GBS, paired-end ddRAD and paired-end GBS data sets.

Bioinformatics tools geared specifically towards gene expression (RNAseq) analyses

  • tophat -Aligns RNA-Seq reads to mammalian-sized genomes using the ultra high-throughput short read aligner Bowtie, and then analyzes the mapping results to identify splice junctions between exons. Useful for analysing splice variants and their expression from NGS datasets.
  • There are also several R packages, listed here, that are specifically geared towards gene expression analyses.

All-in-one proprietary software

  • geneious -Comprehensive bioinformatics software platform.
  • CLC Genomics Workbench -Software for analyzing and visualizing next generation sequencing data.

Gene Ontology (GO) analyses

  • Blast2GO -Functional annotation of (novel) sequences and the analysis of annotation data. Also has a GUI.
  • ErmineJ -Analyses of gene sets in high-throughput genomics data such as gene expression profiling studies. Also has a GUI.
  • DAVID -Comprehensive set of functional annotation tools for investigators to understand the biological meaning behind large lists of genes.

Microbial diversity / ecology

  • See also this page on the wiki.
  • mothur -A comprehensive bioinformatics software platform for microbial ecology (e.g. 16S rRNA gene sequence diversity).
  • Quantitative Insights Into Microbial Ecology (Qiime) -Another comprehensive bioinformatics software platform for microbial ecology primarily based on high-throughput amplicon sequencing data (such as SSU rRNA). Also has a GUI.

Others

  • cd-hit -Clustering and comparing protein or nucleotide sequences.