Oligotyping Pipeline Explained
This document is an attempt to explain the software pipeline for oligotyping. Please do not hesitate to ask any questions you may have by sending an e-mail to meren at mbl.edu.
This article is about how to get the oligotyping pipeline up and running. Please see the published methods paper for technical details of the method:
(document version 0.56; last update Aug 15, 2014)
Examples in this post will periodically be updated using the latest version of the oligotyping pipeline that is available here:
I have two goals with this document. The first is to demonstrate how the oligotyping pipeline works, and the second is to demonstrate what it does. In order to accomplish both in one document, I came up with a mock problem to solve by using the pipeline. But I would strongly suggest you read some of the studies that used oligotyping. A list can be found here.
Going back to the mock example, here is the problem: Let's assume we have 10 samples collected from 10 different environments. Let's say the first 5 samples were collected during the summer, and the last 5 were collected during the winter. Assume the taxonomy analysis shows that Pelagibacter (SAR11) is present in all samples. And these are the questions we wish to answer:
- “Are all Pelagibacter in all 10 samples the same?”,
- and “if there are different types, how are they distributed?”.
In order to make a good case, I generated 10 samples, each containing 250 sequences, using 2 strains of Pelagibacter that differ from each other by 2 nucleotides (considering the sequence length, these two strains are 99% identical at the 16S rRNA gene region we sequenced). By using these two strains as templates, I generated the FASTA file below with pommunity (a simple tool I implemented to generate pseudo-communities with expected sequencing errors based on the literature on 454 (Roche Genome Sequencer FLX)):
By using pommunity, I introduced random sequencing errors, so they look just like what we would expect to see if we had isolated these reads from an actual 454 run.
These 10 samples contain two different types of Pelagibacter. Moreover, the ratio of the two types differs between samples: the first 5 samples are composed of 91% type 1 strain and 9% type 2 strain, while the last 5 samples are composed of 9% type 1 strain and 91% type 2 strain.
At this point you know everything about the content of this FASTA file. If you are up for a challenge, you can play with this FASTA file by using your favorite tools to extract the information I just explained. I already tried a couple of things: if you use RDP or GAST, all of these reads will be assigned to Pelagibacter. If you use UCLUST to cluster them, you will get 4 clusters at the 97% identity level, and 169 clusters at the 99% identity level.
You can jump to the Preparation section, but for the curious, this is what the config file I used to generate reads for Environment-01 through -05 (ratios were flipped for Environment-06 through -10) via pommunity looked like:
If you obtain the source code from GitHub, you can generate a sample by sending the config file above as a parameter to it by typing this command:
For installation instructions, please refer to this post:
Preparing the FASTA File
The oligotyping pipeline requires the input FASTA file to be formatted in a certain way.
It is not complicated, but it is a requirement for running the pipeline properly. First, all of your reads from all of your samples of interest have to be in a single FASTA file, and all reads have to have deflines formatted in the following format:
Basically, every defline starts with the sample name followed by a unique read ID, and these two bits of information must be separated from each other by an underscore (‘_’) character. More specifically, the very last underscore character in your deflines must separate your sample name from the unique identifier of the individual read (which means it is OK to have underscore characters in your sample names, as long as the last one separates the unique portion of each defline). To see an example that works well, please see the example FASTA file.
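To make the convention concrete, here is a minimal sketch of how this defline rule can be applied (the deflines below are made-up examples; this is not the pipeline's own parser):

```python
def parse_defline(defline):
    """Split a FASTA defline into (sample name, read ID) at the LAST
    underscore, so sample names may themselves contain underscores."""
    sample, read_id = defline.lstrip(">").rsplit("_", 1)
    return sample, read_id

print(parse_defline(">Environment-01_ReadX"))  # ('Environment-01', 'ReadX')
print(parse_defline(">my_sample_42"))          # ('my_sample', '42')
```

Note that `rsplit("_", 1)` splits from the right, which is exactly why underscores earlier in a sample name are harmless.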
If you want to make sure that your FASTA file is properly formatted and meets the expectations of the pipeline, simply run this command and examine the output to see whether the number of samples found in your file and sample names make sense:
This is what the report from this command looks like when I run it on the example FASTA file:
meren ~ $ o-get-sample-info-from-fasta mock-env.fasta
Samples and read counts found in the FASTA file:
Environment-03    250
Environment-02    250
Environment-10    250
Environment-01    250
Environment-06    250
Environment-07    250
Environment-04    250
Environment-05    250
Environment-08    250
Environment-09    250
Total number of samples: 10
Total number of reads: 2,500
Because we are working with amplicon sequences, our reads are already aligned in a sense (since, thanks to our primers, they are all coming from an evolutionarily homologous region of all genomes we found in our samples). I will try to give a practical recap here, but please read the “Materials and methods” section of the oligotyping methods paper for a much better description of why and when alignment is necessary.
There is one rule: all your reads should have the same length.
It is very easy if you are working with fixed-length Illumina reads (such as 100 nt long V4 reads on HiSeq): they are already aligned, so you can simply discard reads that were trimmed during QC and work with the rest without aligning them. If you are working with Illumina reads merged at a partial overlap (such as V4-V5 on MiSeq), you can use the o-pad-with-gaps script to add gaps to short reads so their length matches the longest read in the file. This is sufficient, because Illumina does not suffer from insertion/deletion errors, and any length variation is biologically relevant if no QC trimming was performed.
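The padding operation itself is trivial; this is a minimal sketch of the idea behind o-pad-with-gaps (my own illustration with made-up reads, not the script's code):

```python
def pad_with_gaps(reads):
    """Append gap characters so every read matches the longest read's length."""
    longest = max(len(read) for read in reads)
    return [read + "-" * (longest - len(read)) for read in reads]

# Made-up merged reads of slightly different lengths:
reads = ["TACGGAGGAT", "TACGGAGG", "TACGGAGGATG"]
for read in pad_with_gaps(reads):
    print(read)
```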
On the other hand, if you are working with Roche/454 reads, your reads will most likely have variable lengths, not only due to natural variation among different taxa, but also due to the vast number of in/del errors 454 introduces. The oligotyping pipeline is not designed to deal with this type of length variation, and artificial in/del errors can't be fixed by simply trimming all reads to the same length: trimming 454 reads without alignment will not correct the frame shifts caused by homopolymer-associated in/del problems. This is when alignment is necessary. But don't despair.
Since our mock example is generated with Roche/454 reads, I will continue with the alignment to show you how it could be done easily.
Alignment is an important step and needs to be done carefully. Alignment results will impact the downstream analysis, although these impacts will mostly remain within alpha diversity estimates. There are many different ways to align a collection of sequences, but for the purpose of preparing reads for oligotyping I suggest aligning them against a template. PyNAST does a great job with GreenGenes alignment templates, and that is what I suggest. Here is the GreenGenes alignment file that you can download and use:
Once you align all your reads, you can trim them to the same length.
Note: If you use PyNAST and GreenGenes alignments, your resulting aligned sequences will contain many uninformative gaps. You may want to clear those gaps before the entropy analysis. You can use the o-trim-uninformative-columns-from-alignment script.
For the mock problem I explained in the Introduction section, I am going to use Muscle to align the sequences in the FASTA file; for your own datasets, however, please stick with template-based alignment such as PyNAST rather than multiple sequence aligners. This is how I run Muscle:
In the next section I am going to be using the result of Muscle alignment, mock-env-aligned.fasta, for entropy analysis.
Shannon Entropy Analysis
The first step of oligotyping is to perform Shannon entropy analysis on the aligned sequences. By using Shannon entropy, we quantify the uncertainty in each column of our alignment. Since our goal is to find the most variable locations among our very closely related taxa in order to decompose them into meaningful groups, we really want to know how many components contribute to that uncertainty. Basically, the result of the entropy analysis helps us determine the appropriate number of components to identify oligotypes. Assume I have a directory called ‘sample-run‘ at the same level as the oligotyping directory (the directory where the oligotyping codebase resides), and the alignment file that will undergo the entropy analysis (mock-env-aligned.fasta in this example) is in the ‘sample-run‘ directory. Then, this would be the command line to perform entropy analysis on this alignment:
Performing entropy analysis: 100%
Once the entropy analysis is finished, you are going to see an entropy distribution figure. For mock-env-aligned.fasta, it should look like this one:
If you recall the explanation of the nature of these sequences from the Introduction section, this figure will make much more sense. We had two types of Pelagibacter that differ from each other by two nucleotides. The entropy analysis shows that, along with all the random sequencing errors down below, these two nucleotide positions are the ones that offer the most relevant information to separate these reads from each other. This means that if we use only the nucleotides from these two positions, instead of the entirety of our reads, we can generate distribution profiles for the different types.
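For intuition, per-column Shannon entropy is straightforward to compute; here is a minimal, self-contained sketch (not the pipeline's implementation, and the sequences are made up):

```python
import math
from collections import Counter

def column_entropy(column):
    """Shannon entropy (log2) of the characters in one alignment column."""
    counts = Counter(column)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def alignment_entropy(reads):
    """Per-column entropy of an alignment (all reads must be the same length)."""
    return [column_entropy(column) for column in zip(*reads)]

# Two mock 'strains' differing at positions 1 and 3, plus one read carrying
# a random error at position 3:
reads = ["ATCG", "ATCG", "AGCA", "AGCA", "ATCC"]
print(alignment_entropy(reads))  # positions 1 and 3 carry nearly all the entropy
```

Invariant columns yield zero entropy; only the truly variable positions stand out, which is exactly what the entropy figure visualizes.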
From the figure, it is easy to say that there are two nucleotide positions in this alignment that show high variation. Then, we can assume that using the top two components for the oligotyping analysis will be sufficient to decompose these different types from each other. Even though it is usually not this clear cut, real-world data present similar stories. Obviously, the number of entropy peaks and their magnitude change depending on the taxonomic depth you used to isolate your ‘closely related taxa‘ at the beginning of the analysis. For instance, the following figure shows a real-world case in which ~65K pyrotag reads from V3-V5 amplicons were classified as Gardnerella vaginalis with 99+% identity to a group of known G. vaginalis sequences (click here for more information on the G. vaginalis study, which was published recently):
The figure that shows the entropy distribution will be stored as a PNG file in the work directory.
The second product of the entropy analysis step is a text file that contains entropy values for every position in the alignment file. It also will be stored in the work directory:
This file (mock-env-aligned.fasta-ENTROPY in this case) will be the input of the next step, where the oligotyping pipeline will decompose our closely related taxa and show their distribution among our samples.
Now the entropy analysis is complete, and we know the number of initial components to use (which is 2 for our Pelagibacter example). At this stage we will be using the oligotype program for the analysis, which is a part of the codebase and has many command line options.
You can always see available command line options by typing python oligotype --help.
The oligotyping pipeline can offer a very quick overview, as well as an extensive analysis that takes more time but provides many essential output files for a deeper look.
A Quick Overview
An oligotyping analysis may take some time depending on the complexity of the data, the number of components used, and various user requests. However, it is possible to do a very quick analysis and get an idea by sending the --quick directive to the pipeline as a parameter. If we go back to our mock-env-aligned.fasta, we already know from the entropy analysis results that the initial number of components we should be interested in is 2. So, if we’d like to oligotype this dataset by using the 2 maximum entropy components, this is the command line we need (assuming that you are in the sample-run directory):
The first parameter is the aligned FASTA file and the second parameter is the entropy file that was generated during the entropy analysis step. -c indicates the number of components to be used; -a and -s are parameters for quality filtering, which I will explain in a minute. Finally, the --quick directive tells the oligotyping pipeline to produce results as quickly as possible, without generating the figures or output files that are necessary for further supervision.
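The idea behind the decomposition is easy to illustrate: read off the nucleotides each read carries at the chosen high-entropy positions, concatenate them into a short label, and treat that label as the read's oligotype. A sketch of that idea (not the pipeline's code; the reads and positions are made up):

```python
from collections import defaultdict

def bin_oligotypes(reads_by_sample, positions):
    """Label each read with the concatenation of its bases at the given
    high-entropy positions, and count reads per (sample, oligotype)."""
    counts = defaultdict(lambda: defaultdict(int))
    for sample, reads in reads_by_sample.items():
        for read in reads:
            oligotype = "".join(read[p] for p in positions)
            counts[sample][oligotype] += 1
    return counts

# Mock aligned reads; positions 1 and 3 are the two high-entropy components:
reads_by_sample = {
    "Environment-01": ["ATCG", "ATCG", "AGCA"],
    "Environment-06": ["AGCA", "AGCA", "ATCG"],
}
counts = bin_oligotypes(reads_by_sample, positions=[1, 3])
print(dict(counts["Environment-01"]))  # {'TG': 2, 'GA': 1}
```

The resulting sample-by-oligotype table is conceptually what ends up in the MATRIX-COUNT.txt output described below.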
What files are generated by this run? If you type ls in your work directory, you will realize that there is a new directory created by the oligotyping pipeline called mock-env-aligned-c2-s1-a1.0-A0-M0 (this directory name is actually a combination of the file we analyzed and parameters we used), and here is what you have in that directory:
Although you will not find anything in the FIGURES directory (thanks to the --quick directive), you can visualize your results in the simplest form by typing the following command (do not forget to replace /PATH/TO/ in this command with the directory that contains the oligotyping codebase):
As a result of this command a new file, mock.pdf, would be generated in the output directory, which would look like this:
If you remember the nature of the input data from the Introduction section, you should be able to appreciate what you see in this figure:
- Every bar in this stacked bar figure represents a sample in the dataset (from Environment-01 to -10).
- Different colors represent different oligotypes within these samples that are decomposed by using two maximum entropy components.
- The relative abundance of a color in a sample is proportional to the percent abundance of the oligotype that is denoted by that color.
This is a very important result, showing that we managed to decompose 2 different strains that are 99% identical at the 16S rRNA gene region of interest.
Let’s go back to the contents of the output directory.
One of the most important files in this output is MATRIX-COUNT.txt. It is a TAB-delimited file containing a contingency table that shows the actual number of reads per oligotype/sample pair. Third-party analysis environments such as Excel can be used to make more sense of these data. So, what were those -a and -s parameters? They are used as quality filters. By sending -s X as a parameter, we instruct the oligotyping pipeline to discard any oligotype that appears in fewer than X samples; similarly, by sending -a N, we require that an oligotype be more abundant than N percent in at least one sample.
Even though we use only a fraction of the available information in our reads to decompose oligotypes, which diminishes the impact of random sequencing errors on our results, the probability of having random errors at the positions of interest that define oligotypes is not zero. Certain information about the run, such as the number of raw oligotypes and how that number changes after the application of various quality filters, can be found in the RUNINFO file after each run.
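The effect of the -s and -a filters on a count matrix can be sketched as follows (my own illustration of the behavior described above, not the pipeline's code; the counts are made up):

```python
def filter_oligotypes(matrix, min_samples=1, min_percent=1.0):
    """matrix: {sample: {oligotype: count}}. Keep an oligotype only if it
    appears in at least `min_samples` samples (the -s filter) and reaches
    at least `min_percent` percent abundance in one sample (the -a filter)."""
    totals = {sample: sum(counts.values()) for sample, counts in matrix.items()}
    oligos = {o for counts in matrix.values() for o in counts}
    kept = set()
    for o in oligos:
        n_samples = sum(1 for counts in matrix.values() if counts.get(o, 0) > 0)
        max_pct = max(100.0 * counts.get(o, 0) / totals[sample]
                      for sample, counts in matrix.items())
        if n_samples >= min_samples and max_pct >= min_percent:
            kept.add(o)
    return kept

# Mock counts: 'TA' is a one-off error at 0.4% abundance and gets filtered out.
matrix = {"Env-01": {"TG": 228, "GA": 21, "TA": 1},
          "Env-06": {"TG": 23, "GA": 227}}
print(sorted(filter_oligotypes(matrix, min_samples=1, min_percent=1.0)))  # ['GA', 'TG']
```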
A More Extensive Analysis
If you would like to do an extensive analysis, the oligotyping pipeline will help you do that by making you wait a little longer to see your results. This time we will not include the --quick directive:
This time the output directory is richer:
One of the most important additions is the OLIGO-REPRESENTATIVES directory, which contains all essential files about each individual oligotype:
These are some of the files generated for each oligotype and what they are useful for:
- *_unique files: These files contain the unique reads (sorted by their frequencies) that were binned into an oligotype based on the nucleotides they had at the locations of high variation. For instance, the oligotype represented by GC in this example collects all the reads that have G and C nucleotides at the 2 locations in these alignments revealed by the entropy analysis. However, there may be other components among these reads that could explain diversity more deeply. In that case one might expect to see more than one major group of unique sequences (differentiated from each other at another location that is not covered by the components used to generate these oligotypes).
- *_unique.png files: These figures visualize what is going on in the *_unique files. They have two panels, showing the abundance curve of unique sequences and the entropy distribution of all reads that were claimed by a given oligotype. The figure that was generated for the 00000_GC_unique file is shown below. In the top panel the abundance curve of unique sequences can be seen (according to the figure, the most abundant unique sequence comprised almost all the reads, and all others, most likely slight variations of the most abundant one due to random sequencing errors, were present in very small numbers). In the bottom panel we have the Shannon entropy analysis for all reads (according to the figure, entropy was almost flat, which is what we would expect if the sequences in this collection do not harbor any hidden diversity):
- *_unique_BLAST.xml files: These XML files contain the BLAST result of the most abundant unique sequence against NCBI’s nr database. Even though it takes a lot of time to BLAST reads against nr, especially when there are many oligotypes, I find it very valuable to know if there is a perfect hit in the database for the oligotype of interest.
- *.cPickle files: These files contain a variety of information about each oligotype in the form of serialized Python objects. They are being used by the pipeline itself to make figures and charts (see the next section).
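The collapsing of an oligotype's reads into frequency-sorted unique sequences, as found in the *_unique files above, can be sketched like this (an illustration with made-up reads, not the pipeline's code):

```python
from collections import Counter

def unique_reads(reads):
    """Collapse reads into (sequence, count) pairs, most frequent first."""
    return Counter(reads).most_common()

# Mock reads: one dominant sequence plus two error-bearing variants.
reads = ["ATCG"] * 6 + ["ATCC", "TTCG"]
for sequence, count in unique_reads(reads):
    print(sequence, count)
```

A healthy oligotype looks like this: one dominant unique sequence followed by a steep drop, which is exactly the abundance-curve shape described for the *_unique.png figures.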
All of the information the oligotyping pipeline can offer is present at this point.
You may also want to use a mapping file to visually investigate the similarities or dissimilarities among your samples with respect to the oligotyping results. If you would like to use a mapping file, see this article:
A User Friendly Way of Analyzing Results
When it is run on real-world data, oligotyping requires user supervision. The best way to perform supervision is to examine results carefully and make informed decisions about the nucleotide positions to use to explain diversity (you will find more about this aspect of the oligotyping analysis on this blog). Luckily, the oligotyping pipeline offers an easy-to-use interface for viewing and examining analysis results. This is a static HTML page that is generated by the pipeline at the end of each analysis if the --gen-html directive is included in the command line:
At the end of this analysis you will be given an address to visit with your favorite browser. This is a screen shot of my screen with the web address:
This HTML output can be seen as a summary of what the analysis results show. Since it is a static web page, you can copy it onto a memory stick and carry it around. For instance, in order to demonstrate the results of this particular run, I copied that directory to this server:
In a real oligotyping analysis you will be especially interested in the ‘Oligotypes‘ section in the index page:
Even though it looks quite uninteresting for this mock problem, it can get very sophisticated and useful when the oligotyping pipeline is dealing with real-world analyses.
You may be wondering about the efficacy of oligotyping with actual data.
In this document I used a Pelagibacter dataset as a mock problem. It wasn’t a truly random choice. I recently analyzed 274,634 V4-V6 reads (approximately 450 nucleotides long, quality controlled, chimera checked and anchor trimmed at MBL) that were classified as Pelagibacter in a dataset that contained more than 160 samples. These 160+ samples were sequenced from environmental samples that were periodically collected from 7 geographical locations over a period of almost 2 years.
When I analyzed this data with the oligotyping pipeline explained in this document, this was the final stackbar figure, which shows the diversity within Pelagibacter (unpublished as of May 10, 2012):
Each white bar separates one station from another. Every color represents an oligotype, and each oligotype in this figure has a perfect hit to something in NCBI with 100% sequence identity and 100% query coverage. Green and blue types in this figure are 99.6% identical. Even though they are almost identical, there is some crucial information in these sequences that is a precursor to genomic variation that affects their survival through different seasons.
This is variation that would otherwise be missed.
When can oligotyping be useful? I had answered that question in an earlier post in this blog:
You have 16S ribosomal RNA gene tag sequences generated from a number of samples collected from various environments to investigate cross-sectional or temporal differences. After the classification (or clustering) analysis of your reads, you know the composition of your samples. There is a taxon (or an OTU) that appears in every sample, and you suspect there is more to this taxon (or OTU) than meets the eye; you feel that there is more than one type into which this unit could be broken. Oligotyping helps you investigate this question, and most of the time it comes back with surprising answers.
If you think oligotyping might be applicable to your study, and if you have questions about how to investigate your datasets via oligotyping, you can always reach me via meren at mbl.edu or via @oligotyping on Twitter.