A guide prepared by Terrence Bell, postdoctoral fellow with the GenoRem project, Institut de recherche en biologie végétale
Terminology note: the sequence data output by massively parallel sequencing instruments (Roche 454, Illumina HiSeq, Ion Torrent…) are commonly termed “reads”. In this guide, “sequences” and “reads” are synonymous.
To make things easy, use the MoBio extraction kits, distributed by VWR in Canada. They are well tested, and your results will be easy to compare with most of the literature. You can even do dual DNA/RNA extractions if you need both.
If you have a ton of samples and money is a concern, I have also used the following protocol successfully:
Yergeau E, Bokhorst S, Huiskes AHL, Boschker HTS, Aerts R, Kowalchuk GA. (2007). Size and structure of bacterial, fungal and nematode communities along an Antarctic environmental gradient. FEMS Microbiol Ecol 59: 436-451.
An underrated but important step. Could someone easily identify your samples without you? Is it clear which samples are your DNA/RNA extracts, amplicons, purified amplicons, and amplicon pools? Minimally, your code should indicate the sample, project and MID # (for amplicons). It is also a good idea to put your initials and date on the side of each tube, not just the box.
You should also have well-organized Excel sheets that link each sample to a unique MID identifier, and samples that did not amplify properly or are otherwise excluded from sequencing should be flagged (see the example layout below). It is extremely easy to get confused when you are working with hundreds of samples with unique identifier codes, so be extra careful at this step!
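As a rough illustration only (the column names and codes below are hypothetical, not a required format), each row of such a sheet might look something like this:

Sample code   Project   MID #   Target gene   PCR result   Included in pool?
GR-S01        GenoRem   MID01   16S rRNA      good band    yes
GR-S02        GenoRem   MID02   16S rRNA      weak band    no (redo PCR)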
Criteria to consider:
Ion Torrent
Roche 454
Illumina
Primer sets adapted to Roche 454 are currently available at the Centre sur la Biodiversité for universal 16S rRNA, universal fungal ITS, and 18S rRNA primers that favour arbuscular mycorrhizal fungi (AMF). Other sets are also available for several bacterial functional genes, and possibly other regions. Ask around before ordering a huge, costly set of primers!
Primers for amplicon-pool sequencing are similar to regular primers, and contain the same target-specific sequence that you use for a regular PCR reaction. However, you must also add a sequencing adaptor, a library code, and an MID code (a schematic example follows below). You will order multiple forward primers with unique MID codes, and will then amplify different samples with different MID-coded primers. The purpose is to be able to pool your DNA fragments and later separate the reads by MID in silico, during the bioinformatics steps. If we did not use this approach, it would cost several thousand dollars to get a large number of reads for only one sample! Reverse primers generally do not contain MID codes, unless you will be sequencing from the reverse direction.
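As a schematic sketch only (the actual adaptor, library key, and MID sequences depend on the sequencing chemistry, so take them from the Roche documentation rather than from here), a fusion primer is built up like this:

Forward: 5'-[sequencing adaptor A]-[library key]-[MID code]-[target-specific forward primer]-3'
Reverse: 5'-[sequencing adaptor B]-[library key]-[target-specific reverse primer]-3' (no MID, unless you will sequence from the reverse direction)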
Consult Roche for the latest recommended primers and adaptors (you must create a login to access the documentation); these may vary depending on the chemistry used in the reaction. For the original recommended MIDs, consult ‘454 Sequencing Technical Bulletin No. 005-2009’.
Steps to prepare samples for sequencing
Can you add different concentrations of amplicons to the same pool? Yes, BUT only in very specific circumstances. Consider that many subsequent sequence-analysis steps will require you to standardize all samples, bringing each one down to the read count of the sample with the fewest reads. For instance, if your samples yield 2000, 3000, 2500, and 150 reads, you will have to subsample every sample down to 150 reads before comparing diversity metrics and so on (see the example command below). However, if you have amplicons from two different genes, or from samples that will never be compared directly, you could potentially select two concentrations (e.g. have one gene be 3x less concentrated).
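For reference, this standardization is a single Mothur command once you reach the shared-file stage of the pipeline described later in this guide; the file name and read number below are placeholders:

sub.sample(shared=example.shared, size=150)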
Can you use more cycles to amplify only your samples that produce weak bands? No. More cycles mean more errors and more bias towards certain taxa. The PCR-amplification protocol should be standardized across all samples, and you should use as few cycles as is reasonable. If you are getting weak bands from samples that had a good, measurable amount of DNA, try adding BSA or DMSO to your PCR mix, or try diluting the DNA samples 1:10 or 1:100 to dilute out PCR inhibitors.
If you will sequence at Genome Quebec, you must fill out the Roche Service request form, and you need to obtain a quotation number from them for the order. When your samples are ready to go, your best bet is to walk them over there yourself!
When you fill out your order, request the original .sff files. Some sequencing centres will offer to do a certain amount of the pre-processing for you, but you will want to have all of the quality data available for your analyses. Furthermore, many analysis pipelines start from this initial file, so it is preferable to work from one file rather than redoing the same steps on hundreds of pre-separated files!
You do not have to be a bioinformatics specialist to work with sequence data, especially when you are profiling commonly used phylogenetic markers such as 16S and 18S rRNA. Many pipelines are publicly available, and have excellent tutorials that will allow you to learn how to process your data. Some will provide sample datasets, so that you can walk through each step and make sure that you end up with the results that you expect.
This protocol details the processing of amplicons using Mothur, as I find their Wiki, tutorials and help forum to be the most comprehensive. If you use Mothur, you should cite the following reference: Schloss PD, Westcott SL, Ryabin T, Hall JR, Hartmann M, Hollister EB et al. (2009). Introducing mothur: open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol 75: 7537-7541.
If you use the protocol detailed here, you should also cite the Schloss standard operating procedure (SOP): Schloss PD, Gevers D, Westcott SL. (2011). Reducing the effects of PCR amplification and sequencing artifacts on 16S rRNA-based studies. PLoS One 6: e27310.
Below are some other resources that you may find useful. Be sure to locate and cite the appropriate reference when using each.
This step is meant to account for errors introduced in the amplification and sequencing steps, and to remove reads that are too short or do not match the MID/oligo requirements. The following sections will describe amplicon processing using Mothur. Sequence processing techniques are changing rapidly, and you should consult the Schloss SOP for the latest standard protocol and for options not described here.
The easiest way to start is to use a combined .sff (or .fastq) file, and not the files already separated by MID code. If you do not have access to a single file, you can run multiple .sff files in parallel and later merge your separate .fasta and .qual files.
## Split the .sff file into .fasta, .qual, and .flow files (for .fastq files, use fastq.info) ##
sffinfo(sff=example.sff, flow=T)
## If you run the above command on multiple .sff files, you can merge the .fasta and .qual files next. Use hyphens to separate each file name.##
merge.files(input=example1.fasta-example2.fasta-example3.fasta, output=example.fasta)
merge.files(input=example1.qual-example2.qual-example3.qual, output=example.qual)
## Take a look at your sequence stats. Do this at any step you want to monitor how many reads you have, the average size, etc. ##
summary.seqs(fasta=example.fasta)
## Now you will trim your reads to remove primers, MID tags, and bad-quality segments. Reads that end up being too short will be removed. You will have to create the oligos file yourself (an illustrative example is shown after the command below, and another can be found in the Schloss SOP sample data); it contains your primer sequence(s), MID codes, and group names. Descriptions of the parameters used here are on the Mothur wiki. If you have sequenced in the reverse direction, you need to use the parameter flip=T. ##
trim.seqs(fasta=example.fasta, oligos=example.oligos, qfile=example.qual, maxambig=0, maxhomop=8, bdiffs=1, pdiffs=2, qwindowaverage=30, qwindowsize=50, minlength=250, processors=2)
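For illustration only (the primer and MID sequences below are placeholders; use the primer you actually amplified with and the MIDs you actually ordered), an oligos file is a plain-text file along these lines, with one barcode line per sample linking an MID sequence to a group name:

forward CCTACGGGAGGCAGCAG
barcode ACGAGTGCGT sample01
barcode ACGCTCGACA sample02
barcode AGACGCACTC sample03

The group names in the third column are what Mothur uses to track your samples in all downstream steps.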
## One of the great advantages of Mothur is that it lets you work with a much smaller number of sequences than were actually contained in your .sff file. This is done by reducing the file to only the unique reads and creating a names file that maps all identical reads to a single representative. ##
unique.seqs(fasta=example.trim.fasta)
summary.seqs(fasta=example.trim.unique.fasta, name=example.trim.names)
If you are using 16S or 18S amplicons, or have another well aligned database, you should use the following steps in order to further improve the quality of your reads, and ensure that you are comparing your reads over the same gene region.
## First you will align your reads to a chosen database. In this case, I have selected the Mothur version of the bacterial Silva database. Other alignment options are explained in the Mothur wiki. ##
align.seqs(candidate=example.trim.unique.fasta, template=silva.bacteria.fasta, ksize=9, align=needleman, gapopen=-1, processors=2)
## It is very important to look at your reads at this point to make sure that everything appears to have aligned properly. ##
summary.seqs(fasta=example.trim.unique.align, name=example.trim.names, processors=2)
## Most of your reads should start or end in the same place, since you used primers that target a single region. If not, something has probably gone wrong earlier on (check your trim.seqs parameters). To make sure that we are comparing reads over the same region, we have to screen out those that did not align well. If most of your reads (e.g. 90%) start around position 1044, set start=1044, and the 10% that start after that position will be eliminated. At the other end, use ‘optimize’ to keep the best-aligned reads; in the example below, I discard the 5% of reads that end earliest in the alignment. ##
screen.seqs(fasta=example.trim.unique.align, name=example.trim.names, start=1044, group=example.groups, optimize=end, criteria=95, processors=2)
summary.seqs(fasta=example.trim.unique.good.align, name=example.trim.good.names)
## This step is extremely important, and is also the most likely to give you problems. Because the alignment database is many columns long and your aligned reads only cover a region of 1000-2000 bases, you need to remove the mostly blank columns. The best option for downstream applications is trump=., which removes every column containing even one blank. The problem is that a single poor sequence can cause you to eliminate all or most of the columns. Read more about this in the Mothur help forum. ##
filter.seqs(fasta=example.trim.unique.good.align, vertical=T, trump=., processors=2)
summary.seqs(fasta=example.trim.unique.good.filter.fasta, name=example.trim.good.names)
## You should now run unique.seqs again, as some of the differences between reads earlier may have been in regions in which they did not overlap. ##
unique.seqs(fasta=example.trim.unique.good.filter.fasta, name=example.trim.good.names)
summary.seqs(fasta=example.trim.unique.good.filter.unique.fasta, name=example.trim.unique.good.filter.names)
## To account for slight mismatches due to sequencing error, we use the pre.cluster command to group reads that are likely from the same OTU. ##
pre.cluster(fasta=example.trim.unique.good.filter.unique.fasta, name=example.trim.unique.good.filter.names, group=example.good.groups, diffs=2)
summary.seqs(fasta=example.trim.unique.good.filter.unique.precluster.fasta, name=example.trim.unique.good.filter.unique.precluster.names)
## As a last quality-check step, we will remove potential chimeras (artificial sequences formed when fragments of two or more different template molecules are joined during PCR) from our sequence data. ##
chimera.uchime(fasta=example.trim.unique.good.filter.unique.precluster.fasta, name=example.trim.unique.good.filter.unique.precluster.names, group=example.good.groups, processors=2)
remove.seqs(accnos=example.trim.unique.good.filter.unique.precluster.uchime.accnos, fasta=example.trim.unique.good.filter.unique.precluster.fasta, name=example.trim.unique.good.filter.unique.precluster.names, group=example.good.groups, dups=T)
summary.seqs(fasta=example.trim.unique.good.filter.unique.precluster.pick.fasta, name=example.trim.unique.good.filter.unique.precluster.pick.names)
See the Mothur wiki for more information on what comes next. At this point you can decide whether you want to work with OTU data (reads clustered by sequence similarity), phylotype data (reads binned by taxonomic classification), or a mix of both; a sketch of typical next commands is given below. There are plenty of downstream analyses built into Mothur to allow you to compare your samples and treatments.
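As a rough sketch only (the cutoffs, the reference training-set files, and the shortened file names final.fasta, final.names, and final.groups are all placeholders; consult the current Schloss SOP before copying anything here), typical next commands include dist.seqs/cluster/make.shared for the OTU route and classify.seqs for taxonomic assignment:

## Assume the chimera-checked outputs have been renamed to final.fasta, final.names, and final.groups. ##
dist.seqs(fasta=final.fasta, cutoff=0.20, processors=2)
cluster(column=final.dist, name=final.names)
## The name of the resulting .list file depends on the clustering method; adjust accordingly. ##
make.shared(list=final.an.list, group=final.groups, label=0.03)
classify.seqs(fasta=final.fasta, name=final.names, template=trainset.fasta, taxonomy=trainset.tax, cutoff=80, processors=2)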
Simply listing species names is not enough. You need to come up with new ways to relate community composition to function, or to answer real ecological questions. Try scanning the most recent articles of your target journal(s) to see what is new and exciting, and try to emulate some of the more interesting figures. Or come up with your own!
For working with data, you will minimally want to be comfortable using R and Excel. If you can think of something to do, it can be done in R, but you often have to do some searching to figure out how. If you have a copy of Adobe Illustrator available, this is very useful for cleaning up your figures, or combining multiple figure types.
Take the Analyse quantitative des données biologiques course with Pierre Legendre, or learn how to use R on your own with his Numerical Ecology with R book.
When you do massively parallel sequencing, you are technically supposed to submit all of your reads to the SRA (Sequence Read Archive). This is because readers of your article need to be able to access not only your fasta reads, but also the quality data associated with your sequencing runs, so that they can combine these in meta-analyses. The SRA may cease to be the go-to place for large datasets in the near future, but you will not be able to publish without making your data publicly available somehow. Make sure that you do not make your data available before you publish. You can leave the accession number blank in your manuscript until the last stage of the review process, and do the submission at that point. This is preferred anyway, as they like to link your data to your article.
The procedure changes often, but ensure that you are providing all of your MID data so that other users are able to process your .sff file as you have. You can put this in one of the descriptive paragraphs that they provide for your experiment.