Current Project: Brushing up on NGS
- 1977 – Sanger sequencing,
- aka capillary electrophoresis sequencing, di-deoxy sequencing or chain termination
- published using gel electrophoresis
- 1981 – Applied Biosystems launches first automated system
- 2008 – Applied Biosystems merges with Invitrogen to form Life Technologies
- 2013 – Life Technologies bought by Thermo Fisher Scientific
- Sanger sequencing leads for 25 years.
- Next generation sequencing uses a “massively” parallel process to reduce sequencing time to days or weeks
- Industry is breaking out. Some recent events:
- November 2013 – Illumina’s cystic fibrosis diagnostic tests for their MiSeqDx system became the first FDA-cleared NGS diagnostics to be sold through the classic in vitro diagnostics (IVD) kit-sales model.
- October 2014 – industry leader Illumina raised full year revenue expectations to 30% more than the previous year, 2013.
- January 2015 – Roche buys 56.3% of companion diagnostics company Foundation Medicine for $1B
- January 2015 – 10X Genomics closes $55.5M Series B.
- February 2015 – Invitae raised $101M in an IPO with a market cap of $594M
- Genentech and Pfizer announce large R&D deals with sequencing companies (what are these?)
- 2014 market size
- Instrument sales: $1.5B (15% CAGR)
- Consumables: $0.8B (28%)
- Products total: $2.3B
- Services: $0.4B (25%)
- Total: $2.7B (21%)
DNA is extracted from biological samples and prepared for the sequencing process. Two primary PCR methods are used to amplify library DNA without mixing the resulting amplicons: 1) classic emulsion PCR, which aims to sequester each library component in a discrete bubble of PCR mix where amplicons are collected on beads as they form, and 2) PCR-on-a-slide, which randomly immobilizes individual library components for discrete amplification as they spread out over a dense lawn of primers. PCR-on-a-slide technology is owned by Illumina and Life Technologies (ABI). All 2nd-generation sequencing methods begin with one of these amplification techniques (as opposed to 3rd-generation sequencing, which can sequence single DNA strands without amplification).
Platforms and Technologies:
- Sanger Sequencing
- 1977 – Sanger sequencing
- 1981 – Applied Biosystems launches first automated system
- 2008 – Applied Biosystems merges with Invitrogen to form Life Technologies
- 2013 – Life Technologies bought by Thermo Fisher Scientific
- Link to article on how it works
- Shotgun Sequencing
- Primer Walking
- Serial Analysis of Gene Expression (SAGE)
- 1995 – Victor Velculescu develops SAGE at Johns Hopkins
- Uses reverse transcription to convert mRNA into cDNA and collect a gene expression profile. Restriction enzymes help concatenate cDNA fragments into a long chain. This long chain is sequenced with the Sanger method to measure the diversity and frequency of expressed genes in a sample. (I may do a full article to add detail later)
- Sequencing by Hybridization
- Sequencing by Ligation
- Sequencing by ligation harnesses the sensitivity of DNA ligase to drive a fluorescent readout of the DNA sequence.
- Applied Biosystems’ SOLiD (introduced 2006, bought by Life Technologies in 2008) is one example.
- Massively Parallel Signature Sequencing (MPSS)
- 2000 – MPSS (massively parallel signature sequencing) published. Developed by Lynx Therapeutics
- 2005 – Solexa buys Lynx
- Sequencing by synthesis (SBS)
- 1997 – Balasubramanian and Klenerman develop SBS, Illumina’s now-leading technology
- 1998 – Solexa formed
- 2007 – Illumina buys Solexa
- 2003 – 454 Life Sciences buys exclusive license from Pyrosequencing AB for whole genome applications
- 2005 – 454 Life Sciences releases GS 20 instrument
- 2007 – Roche buys 454 Life Sciences
- 2008 – Qiagen buys Biosystems business from Pyrosequencing (Biotage) and currently sells PyroMark pyrosequencing instruments
- 2009 – 454 Life Science (Roche) releases GS FLX
- SOLiD System (2 Base Encoding)
- 2007 – Applied Biosystems launched SOLiD System (Supported Oligonucleotide Ligation and Detection)
- Ion Torrent
- 2010 – Life Technologies acquires Ion Torrent
- SMRT (single molecule real time) sequencing
- 2011 – PacBio RS system introduced
- tSMS (True Single Molecule Sequencing)
For each sequencing platform, gDNA (or mRNA in some applications) is fragmented and platform-specific tags and adapters are added. Platform-compatible fragment lengths are gel purified and PCR amplified. This first basic amplification is different from the platform-specific amplification steps described below (emulsion PCR and PCR-on-a-slide) and it ensures 1) that there is sufficient starting material for the run and 2) that all species of DNA fragments are captured and represented. After this basic amplification, the DNA fragments are immobilized to beads or a slide in preparation for the platform amplification step. The later platform-specific amplification steps described below are required to amplify the platform’s chemical output signal so that it can be read by the platform’s sensors.
- Acoustic shear and sonication (Covaris instrument: 100bp – 5kbp, Bioruptor: 150bp-1kbp can be used with whole tissue)
- Hydrodynamic shear (Hydroshear)
- Compressed air atomization (Life Tech Nebulizer: 100bp – 3kbp loses 30% of DNA so not good for small amounts of DNA)
- Enzymatic method
- restriction enzyme cocktail,
- non-specific nuclease (MBP-T7 Endo I and a non-specific Vibrio vulnificus nuclease Vvn)
- Fragmentase (NEB)
- Nextera tagmentation (transposase simultaneously fragments and inserts adapters)
- Heat and divalent metal cation (magnesium or zinc) usually reserved for RNA
Example platform fragment lengths: Illumina and Ion Torrent read lengths are under 600 bases. Roche 454 outputs reads at less than 1kb and PacBio less than 9kb in length.
Fragments of similar length are purified and platform-compatible adapters, barcodes, primers etc. are ligated to the ends.
Andy Vierstraete does an excellent job explaining emulsion PCR here. This bead-based technique is used by ABI and 454 Life Science technologies.
Interesting note, only 16.6% of emulsion bubbles produce useable amplification products:
Two primary methods for PCR-on-a-slide are bridge amplification owned by Illumina and Wildfire amplification owned by Life Technologies (via ABI).
Illumina’s Bridge Amplification (Used in Reversible Terminator Sequencing). Polymerase colonies (“polonies”) are generated through tethered bridging:
ABI/Life Technologies Wildfire Amplification (Used in SOLiD sequencing). Polymerase colonies (“polonies”) are generated by 60ºC “walk” partial melt step:
Example PCR-on-a-slide readout for a four color fluorescent system:
A sequencing run that proceeds in only one direction along a template strand is considered a single read run type.
Basic contig alignment in single-end read sequencing:
Paired-end reads sequence both ends of a DNA fragment (but these end-reads usually do not overlap or complement each other because the reads do not sequence the entire fragment). The technique (originally called “Double-Barrel Shotgun Sequencing”) offers several improvements over single-end reads. Paired-end reads improve the ability to align reads across DNA regions because there are two reads per fragment and because the two sequences (which are generated with the same amount of DNA as single-end reads) also come with an estimate of their distance apart (based on the known range of fragment lengths input into the sequencing run). This is particularly helpful for sequencing repetitive regions and for detecting rearrangements such as insertions, deletions, and inversions. Paired-end reads also produce longer contigs for de novo sequencing by filling gaps in the consensus sequence. A contig (from contiguous) is the stitched-together set of overlapping short DNA sequence segments that together represent a continuous region of DNA. A consensus sequence is the most frequently encountered sequence as calculated by multiple overlapping sequence alignments.
Diagram of general paired-end read sequencing:
Paired-end reads can span a repetitive region and reveal how many repeats it contains, locate a mutation in a repetitive region or discern between different copies of a repeat in some cases.
Paired-end reads can identify deletions and insertions
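As a rough sketch of the idea (not any particular aligner's method): if the two reads of a pair map much farther apart on the reference than the library's fragment lengths allow, the sample is probably missing sequence there (a deletion), and if they map much closer together, the sample probably carries extra sequence (an insertion). The function name, the three-standard-deviation cutoff and the example numbers are all assumptions for illustration.

```python
def classify_pair(mapped_distance, expected_mean, expected_sd, n_sd=3):
    """Compare a read pair's distance on the reference with the library's fragment size."""
    upper = expected_mean + n_sd * expected_sd
    lower = expected_mean - n_sd * expected_sd
    if mapped_distance > upper:
        # Reads sit too far apart on the reference: the sample lacks sequence in between.
        return "possible deletion"
    if mapped_distance < lower:
        # Reads sit too close on the reference: the sample has extra sequence in between.
        return "possible insertion"
    return "concordant"


# Hypothetical library: fragments of about 300 bp with a 20 bp standard deviation.
calls = [classify_pair(d, expected_mean=300, expected_sd=20) for d in (310, 900, 120)]
```

Real callers look for many pairs supporting the same event before reporting it; a single odd pair is usually noise.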
Great content for this part of the post provided by the Era7 Bioinformatics team found here
Paired-end reads can bridge gaps between contigs. Bridged contigs (with gaps of unknown sequence and approximately known length) are called scaffolds:
Mate-pair sequencing is essentially paired-end sequencing that uses longer fragments (2kbp – 5kbp instead of 100bp – 500bp). Practical applications include de novo sequencing, finishing (e.g. final arrangement of) full genome sequences, and detection of larger structural variations. Mate-pair sequencing requires additional steps:
- Long fragments of known and consistent size (can be up to 25kbp but commonly 2kbp – 5kbp) are repaired with biotinylated nucleotides
- Fragments are circularized, “mating” the paired ends together
- Circular DNA is fragmented into 400bp – 600bp sections
- Biotinylated mated pairs are affinity-purified and sequenced
Image fragments from Era7 Bioinformatics team found here
Using barcoded sequencing to pool a large number of samples in a single sequencing run results in substantial savings of cost and time. Individual “barcode” sequences are added to each sample before they are mixed and run together in the same sequencing reaction. After the sequencing reaction, fragment sequences belonging to individual samples can be distinguished and sorted during data analysis.
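The sorting step during data analysis can be sketched in a few lines. This is a toy version that assumes each read starts with its barcode and that barcodes must match exactly; the barcode sequences, sample names and reads are made up for illustration.

```python
from collections import defaultdict


def demultiplex(reads, barcodes):
    """Sort reads into per-sample bins by an exact-match barcode at the start of the read."""
    bins = defaultdict(list)
    for read in reads:
        for barcode, sample in barcodes.items():
            if read.startswith(barcode):
                # Keep only the insert; the barcode itself is trimmed off.
                bins[sample].append(read[len(barcode):])
                break
        else:
            # No barcode matched this read.
            bins["unassigned"].append(read)
    return dict(bins)


# Hypothetical barcodes and reads.
barcodes = {"ACGT": "sample_1", "TGCA": "sample_2"}
reads = ["ACGTTTGACCA", "TGCAGGATCCA", "ACGTCCATGGA", "GGGGAAATTTC"]
binned = demultiplex(reads, barcodes)
```

Production tools usually tolerate a mismatch or two in the barcode, which is why barcode sets are designed to differ from each other at several positions.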
The concept of coverage is indirectly discussed in the sections on Shotgun Sequencing and Run Types. Sequencing coverage is the average number of times that sequenced fragments overlap and align to, or “cover,” an individual base in the reference genome (currently GRCh38, Genome Reference Consortium human build 38). Put another way, coverage is the number of times a nucleotide is read during the sequencing process. Coverage (C) can be calculated using the total length of the target genome being sequenced (G), the number of fragments read in the sequencing reaction (F) and the average length of the fragments (L): C = (F x L)/G. For example, if you are sequencing a gene that is 10000 bases long and your fragments are on average 300 bases long and you sequence 1000 of these fragments, then your coverage is (300 x 1000)/10000 = 30. Each nucleotide was read an average of 30 times. Note: the numerator in this equation is called the yield, the total number of bases read in a run.
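The formula is easy to turn into a small calculator. The sketch below uses the same symbols (F, L, G) under longer names; the second function simply rearranges the formula to ask how many fragments are needed for a target coverage. Both function names are my own.

```python
import math


def coverage(num_fragments, mean_fragment_length, genome_length):
    """C = (F x L) / G: average number of times each base is read."""
    return (num_fragments * mean_fragment_length) / genome_length


def fragments_needed(target_coverage, mean_fragment_length, genome_length):
    """Rearranged formula: F = C x G / L, rounded up to a whole fragment."""
    return math.ceil(target_coverage * genome_length / mean_fragment_length)


# The worked example from the text: 1000 fragments of 300 bases over a 10000-base gene.
example_coverage = coverage(1000, 300, 10000)
example_fragments = fragments_needed(30, 300, 10000)
```

Keep in mind this is an average; real coverage is uneven, so some bases are read far more or far fewer times than C.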
Each sequencing platform has a maximum yield that limits the amount of coverage it can produce. Yield features prominently in sequencer marketing because it is an important indicator of platform productivity and utility. Consider the high level of similarity between 454 Life Science’s pyrosequencing and Ion Torrent’s ion semiconductor sequencing yet Ion Torrent’s technology is still used while pyrosequencing is obsolete. One major difference between the platforms has little to do with the reaction chemistry: The beads 454 used to immobilize each library fragment for emulsion PCR amplification are much larger than those used by Ion Torrent so far fewer fit on a slide. This means lower yield, lower coverage, lower productivity and higher cost. Another difference (unrelated to yield and coverage) is that Ion Torrent developed better panels of genes to be sequenced for their platform, panels that specifically targeted clinical needs.
The depth of coverage needed for a sequencing process depends on the application. For example, germ line mutations which appear in every cell in the body (e.g. BRCA) can be identified with less coverage than is required to identify a cancer mutation that only exists in a tumor or in a specific sub-population of cancer cells within the tumor (e.g. EGFR or KRAS).
Sanger sequencing is the “first generation” of DNA sequencing technology and it is still considered the gold standard. A number of improvements have been added to Sanger sequencing including automation and capillary electrophoresis but the technique has largely been replaced by “next generation” methods, especially for large-scale, automated genome analyses. The Sanger method is still used for small-scale projects, for validation of Next-Gen results and for long contiguous DNA sequence reads (>500 nucleotides, but generally <1kbp). Sanger sequencing has several names:
- capillary electrophoresis sequencing,
- di-deoxy sequencing or
- chain termination
Each of these describes a key aspect of how the process works.
Sanger sequencing incorporates replication-stopping dideoxynucleotide triphosphates (ddNTPs) into a DNA replication process. Each deoxynucleotide (dNTP) is added in approximately 100-fold excess of the corresponding dideoxynucleotide (ddNTP).
Compare dNTP with ddNTP:
The low concentration of ddNTPs allows them to be occasionally incorporated into the newly formed complementary strands, causing random early terminations of the strand replication process. This results in a mix of partial replicants of various lengths, each length representing a nucleotide position in the original template strand.
A discrete reaction is run for each of the four nucleotides (A, T, G and C) and the reaction products are run on a gel in side-by-side channels (wells). Either the ddNTPs or the primers are labelled with fluorescence or radioactivity and allow the sequence to be read with a corresponding UV or X-ray method. As with all electrophoresis and chromatography, each partial replicant has a different length and charge and travels through the gel at a proportional speed. The same primer is used for each of the four ddNTP reactions so that the fragments align.
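To make the gel-reading logic concrete, here is an idealized sketch. It assumes every possible termination length shows up as a band (real reactions are random and noisy) and it works directly on the newly synthesized strand. Each of the four lanes holds the fragment lengths that end in that lane's ddNTP, and reading the bands from shortest to longest across all lanes spells out the sequence. The function names are my own.

```python
def sanger_lanes(new_strand):
    """Per ddNTP lane, list the fragment lengths at which chain termination can occur."""
    lanes = {base: [] for base in "ACGT"}
    for position, base in enumerate(new_strand, start=1):
        # A ddNTP of this base can stop the chain here, giving a fragment this long.
        lanes[base].append(position)
    return lanes


def read_gel(lanes):
    """Read bands from shortest to longest fragment across the four lanes."""
    bands = sorted((length, base) for base, lengths in lanes.items() for length in lengths)
    return "".join(base for _, base in bands)


lanes = sanger_lanes("GATTACA")
sequence = read_gel(lanes)
```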
Special thanks to Sarah Obenrader who published a good starting point for this article here.
Shotgun sequencing is the original name given to the common technique that stitches overlapping short sequences of DNA together to produce much longer readouts. Sanger sequencing is only useful up to about 1kbp so several techniques were developed to produce longer reads. These are Primer Walking and Shotgun Sequencing.
DNA is fragmented into numerous small segments, which are sequenced (originally using the Sanger method). Several rounds of fragmentation and sequencing yield multiple overlapping reads for the target DNA. Computer programs use the overlapping ends to assemble short reads into a continuous sequence.
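A minimal sketch of the stitching step, assuming error-free reads from a single strand: repeatedly find the pair of reads with the longest suffix-to-prefix overlap and merge them. This greedy approach is only one simple way to do it (real assemblers are far more sophisticated and handle errors, both strands and repeats), and the function names and example reads are my own.

```python
def overlap(a, b, min_length=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if below min_length)."""
    for length in range(min(len(a), len(b)), min_length - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def greedy_assemble(reads, min_length=3):
    """Merge the best-overlapping pair of reads until no overlaps remain."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    olen = overlap(a, b, min_length)
                    if olen > best_len:
                        best_len, best_i, best_j = olen, i, j
        if best_len == 0:
            break  # nothing overlaps any more; what is left are separate contigs
        merged = reads[best_i] + reads[best_j][best_len:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)] + [merged]
    return reads


contigs = greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTTA"])
```

The `min_length` cutoff is what keeps chance matches of one or two bases from gluing unrelated reads together, and it is also where repeats cause trouble: a repeat longer than the overlap looks like a legitimate join.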
Two methods of shotgun sequencing are 1) Bottom-up or Whole Genome Shotgun Sequencing and 2) Top-down or Hierarchical Shotgun Sequencing.
Whole Genome Shotgun Sequencing
In whole genome shotgun sequencing, the original DNA sequence is reconstructed from shorter reads using sequence assembly software. Overlapping reads are stitched into longer composite sequences known as “contigs.” Contigs may be linked together into “scaffolds” using paired-end or mate-paired reads of approximated fragment length if the average fragment length of the library is known and the deviation is small. One drawback of whole genome shotgun sequencing is that it can have trouble correctly stitching together genomes with repeating regions.
Hierarchical Shotgun Sequencing
Hierarchical Shotgun Sequencing was an early attempt to address the challenges presented by repetitive DNA in large genomes (e.g. greater than 50% for the human genome) by creating a low-resolution map of the genome that lowered the computational load of sequence assembly. From this map, a minimal number of large fragments that cover the entire chromosome (and that are still much smaller than a full genome) are selected and subject to manageable, individual shotgun sequencing processes. This mapped, minimal number of large fragments that cover the entire genome resembles a meta-scaffold and is called a “tiling path“. Once a tiling path has been found, the fragments that form this path are sheared at random into smaller fragments and can be sequenced using the shotgun method on a smaller scale.
There are several methods to determine the order of these large fragments in the tiling path:
STS Content Mapping with Chromosome Walking: The amplified genome is sheared into larger fragments (50-200kb) and cloned into bacteria. A small radioactively- or chemically-labeled probe matching a sequence-tagged site (STS) is hybridized onto a microarray of the large genome fragments. An STS is a short (200 to 500 base pair) DNA sequence that has a single occurrence in the genome and whose location and base sequence are known. All the clones that contain a particular STS in the genome are identified. The end of one of these clones is sequenced to yield the sequence for a new probe and the process is repeated. Clones that hybridize with the new probe are known to be downstream from the clones that hybridize with the STS. This method is called chromosome walking (an application of primer walking that does not intend to yield a sequence but instead yields a scaffold or a tiling path). Once the overlap between the clones has been found and their order relative to the genome is known, a scaffold of a minimal subset of these contigs that covers the entire genome is shotgun-sequenced.
Restriction fingerprinting: The amplified genome is again sheared into larger fragments (50-200kb) and cloned into bacteria. Each piece is amplified by fermentation, cut with restriction enzymes and run on a gel. The resulting “fingerprint” pattern allows ordered assembly of clones for sequencing by overlapping cloned fragments that contain multiple similarly spaced restriction sites in common. Once the overlap between the clones has been found and their order relative to the genome is known, a scaffold of a minimal subset of these contigs that covers the entire genome is shotgun-sequenced.
Hierarchical shotgun sequencing is slower and more labor intensive than whole-genome shotgun sequencing (WGSS) because it involves first creating a low-resolution map of the genome. However, it relies less heavily on computer algorithms than whole-genome shotgun sequencing and helps to overcome (but does not completely solve) the challenge presented by long repeating sections. Hierarchical shotgun sequencing did not replace WGSS because WGSS data were proved to be reliable and the speed and cost efficiency made it the preferred choice. With next-gen improvements, WGSS has pulled away as the primary method.
Primer walking is used to sequence short fragments that are too long for Sanger sequencing (i.e. fragments between 1.3 and 7 kilo bases). This method divides the long sequence into several consecutive short ones. The DNA of interest is often a plasmid insert, a PCR product or a fragment representing a gap in a whole sequenced genome.
An adapter of known sequence is ligated to the long fragment and the first 1kbp are sequenced using the traditional Sanger method. Using the resulting sequence information, new primers that are complementary to the final 20 bases of the newly known sequence are designed to allow sequencing of the subsequent 1kbp. The process is repeated, thereby “walking” along the long DNA fragment.
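The walk can be sketched as a loop. This is a simplification under stated assumptions: reads are error-free and exactly `read_length` long, each new primer is designed against the final 20 known bases, and the next read starts immediately after the primer (in practice the first bases after a primer read poorly, so primers are usually placed a little upstream of the end). The function name and the random test template are my own.

```python
import random


def primer_walk(template, read_length=1000, primer_length=20):
    """Walk along a template; return the assembled sequence and the primers designed en route."""
    known = template[:read_length]  # first read, primed from the ligated adapter
    primers = []
    while len(known) < len(template):
        # Design the next primer against the final bases of the sequence known so far.
        primers.append(known[-primer_length:])
        # The next Sanger read extends the known sequence by up to read_length bases.
        known += template[len(known):len(known) + read_length]
    return known, primers


# Hypothetical 2.5 kb fragment built from a seeded random generator.
rng = random.Random(0)
template = "".join(rng.choice("ACGT") for _ in range(2500))
assembled, primers = primer_walk(template)
```

Each round has to wait for the previous read before the next primer can be designed and ordered, which is exactly why the method is slow compared with shotgun sequencing.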
The reduced labor, time and cost of shotgun sequencing significantly limited the practical applications of primer walking.
Serial Analysis of Gene Expression was first published in 1995 by Dr. Victor Velculescu at the Oncology Center of Johns Hopkins University. Unlike previous sequencing techniques that work on genomic DNA and output a genomic sequence – the genotype – SAGE profiles gene expression by isolating mRNA (messenger RNA) and converting it into cDNA (complementary DNA) before sequencing. SAGE outputs the identity and quantity of a sample’s expressed genes – the phenotype.
A short sequence of mRNA (10-14 nucleotides) – a tag – can provide enough information to identify the mRNA gene transcript that it came from. SAGE strings together a long series of these short tags to be sequenced all at once. The number of times that a tag appears in the long sequence reveals approximately how many copies of the original mRNA gene transcript existed in the starting sample.
- mRNA is converted to cDNA using biotinylated primers. Biotin from primers has been incorporated into cDNA and allows cDNA to be bound to streptavidin beads
- cDNA is shortened with an “anchoring” restriction enzyme.
- Beads anchor the short cDNA and the rest of the fragment is cleared away
- Adapters are added that contain
- Anchoring enzyme sticky end (sticks adapter to short cDNA fragment)
- Tagging enzyme recognition site and
- Amplification primer A or B.
- Tagging enzyme cuts off the bead (the bead has done its job) and the beads are removed, leaving only the tag behind.
- Tags are ligated together and amplified with the A/B primer pair.
- After ditags are amplified, adapters are cleaved off using the original anchoring enzyme. This yields the tag pair only plus the anchoring enzyme remainder.
- Ditags are concatenated into long strings and cloned into bacteria for amplification.
- Long strings of ditags are sequenced to identify and count fragments.
- Data output is a specific gene fragment sequence and a count for the frequency with which that fragment appeared in a sample. The fragment can often be used to ID the fragment when it matches a known gene, but SAGE will output all expressed genes, including unknown and foreign (viral) genes. Output looks like this:
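The counting at the end of this workflow can be sketched in code. This toy version makes several assumptions that are not from the steps above: the anchoring-enzyme site left between ditags is CATG, every tag is exactly 10 bases, the two tags in a ditag sit tail to tail (so the second is read as its reverse complement), and no tag happens to contain the site itself. All names and sequences are made up for illustration.

```python
from collections import Counter

ANCHOR_SITE = "CATG"  # assumed anchoring-enzyme site separating ditags
TAG_LENGTH = 10       # assumed tag length, within the 10-14 nucleotide range


def reverse_complement(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def count_tags(concatemer):
    """Split a sequenced concatemer into ditags, then into tags, and count each tag."""
    counts = Counter()
    for ditag in concatemer.split(ANCHOR_SITE):
        if len(ditag) != 2 * TAG_LENGTH:
            continue  # skip empty or malformed pieces
        counts[ditag[:TAG_LENGTH]] += 1
        # The second tag of a ditag is in the opposite orientation.
        counts[reverse_complement(ditag[TAG_LENGTH:])] += 1
    return counts


# Hypothetical tags: tag_a is expressed three times as often as tag_b.
tag_a, tag_b = "AAACCCGGGT", "TTGGAACCTT"
ditags = [tag_a + reverse_complement(tag_b), tag_a + reverse_complement(tag_a)]
concatemer = ANCHOR_SITE + ANCHOR_SITE.join(ditags) + ANCHOR_SITE
counts = count_tags(concatemer)
```

Matching each counted tag against a database of known transcripts is a separate lookup step, which is how tags with no match flag unknown or foreign genes.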
MPSS was published in 2000 and developed by Lynx Therapeutics. Biological samples were sent to Lynx’s laboratories where the complicated procedure was performed by the company’s staff. Lynx was bought by Solexa in a reverse merger in 2005 and Solexa was acquired by Illumina in 2007. Like Lynx, MPSS is also history, having been replaced by microarray technology (where expression of known genes is analyzed) and SuperSAGE (where expression of all genes, known and unknown, is analyzed). (Note: microarrays are genomics techniques but do not use NGS)
To call Massively Parallel Signature Sequencing (MPSS) a sequencing technique is not entirely accurate because the innovation and value developed by Lynx was really in its library prep and bead-based, bacteria-free “Megaclone” technology. Downstream of this library preparation and cloning method was a sequence-by-ligation process similar to SOLiD described HERE. MPSS (like SAGE – LINK) also measured mRNA-based gene expression levels.
Expression profiling with MPSS differed from profiling with DNA microarray chips in that MPSS used non-specific tags instead of probes to immobilize DNA fragments. Non-specific tags enabled the open-ended character of MPSS; mRNA transcripts did not need to be known and could be discovered de novo.
Two key differences between MPSS and SAGE are that 1) MPSS used fluorescence to record sequence information in a digital format and 2) MPSS used the improved bead-based Megaclone technique (all clones in one tube) while SAGE used bacteria (one clone per well) for its pre-sequencing amplification step. This digital aspect enabled use of a bead-based library preparation process that captured at least one million gene fragments (compared to SAGE’s 50,000 fragment libraries). This greater coverage (LINK discussing coverage) allowed MPSS to capture and count virtually all mRNA molecules in a tissue or cell sample. The greater efficiency created a scale and cost that enabled Lynx to build a business.
Even genes with low level expression could be quantified. The direct-to-digital output allowed all genes to be analysed simultaneously (“massively parallel”). Digital format also simplified use of bioinformatics tools to compare gene expression across cell types and lineages.
Large numbers of expressed genes make it difficult to track changes in expression patterns, particularly in view of the large fraction of genes that are expressed at very low levels: as much as 30% of mammalian mRNA consists of many thousands of distinct species each making up <<0.1% of the total. At the advent of MPSS, techniques based on direct sequence analysis or capture of specific probes by oligonucleotide microarrays provided the most comprehensive and sensitive analysis of gene expression. However, in both approaches, the sequences to be analyzed must either be known (microarray) or cloned and processed individually beforehand (SAGE), usually with the aid of complex robotics systems. This makes it difficult to isolate and/or monitor many potentially important genes that are differentially expressed at low absolute levels against a background of more abundantly expressed genes.
MPSS begins with Megaclone library preparation.
With Megaclone, millions of nucleic acid molecules, amplified with one set of common primers, can be cloned and specifically attached to 5-μm microbeads in a few single-tube reactions. Central to the method is the formation of a large (relative to the number of cDNA fragments in the sample) repertoire of oligonucleotide tags (called “barcodes” in the diagram) so that at least one fragment corresponding to virtually every cDNA species in the sample is captured and cloned into the process. cDNA fragment-barcode conjugates can be amplified and hybridized to their complementary sequences (anti-tags) on separate microbeads in a single reaction to form a library of microbeads, each having attached a clonal population of one cDNA from the original tissue or cell population.
The key advantage of this approach over biological cloning is that the DNA on the surface of each microbead is readily accessible for biochemical analysis without further processing. In addition, all clones in a microbead library can be sequenced simultaneously (“massively parallel”).
Approximately one million microbeads are then loaded into a specially designed flow-cell in a way that allows them to stack together along channels and form a tightly packed monolayer in the flow-cell.
The flow-cell is connected to a computer-controlled microfluidics network that delivers different reagents for the sequencing reactions. A high-resolution CCD camera is positioned directly over the flow-cell in order to capture fluorescent images from the microbeads at specific stages of the sequencing reactions.
The actual DNA sequencing reaction involves an automated series of adaptor ligations and enzymatic steps and is a sequence-by-ligation process described in essence HERE. The flowcell described above, as captured by the CCD camera, is represented below. At the top of this image is a representation of the sequence information captured from a single bead as the probe colors change over time with each passing nucleotide probe.
Pyrosequencing is a sequencing by synthesis method developed by Mostafa Ronaghi and Pål Nyrén at the Royal Institute of Technology in Stockholm in 1996. The technology measures the release of pyrophosphate (or diphosphate, PPi) that occurs when a nucleotide triphosphate (dNTP) is incorporated into a phosphodiester bond (that uses only a single phosphate) in the growing complementary DNA strand.
The process cycles through dATP, dGTP, dCTP, dTTP so that only one dNTP is available at a time. When the appropriate dNTP is present and incorporated, ATP sulfurylase combines the released PPi (pyrophosphate) with APS (adenosine phosphosulfate) to form ATP that in turn drives a luciferase reaction. This produces recordable light. Unincorporated nucleotides and ATP are degraded by apyrase, and the cycle restarts with another nucleotide.
When there are two or more bases repeated on the template strand, the amount of light increases linearly. Repeat bases are easily recorded as higher amplitude light signal:
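Here is an idealized sketch of that readout. It assumes the light signal is exactly proportional to the number of bases incorporated in each flow (real signals get harder to tell apart for long repeats) and it works on the sequence of the newly synthesized strand. The flow order follows the dATP, dGTP, dCTP, dTTP cycle described above; the function names are my own.

```python
def flowgram(new_strand, flow_order="AGCT"):
    """Return (nucleotide, signal) for each flow; signal = bases incorporated in that flow."""
    unknown = set(new_strand) - set(flow_order)
    if unknown:
        raise ValueError(f"bases not in flow order: {sorted(unknown)}")
    signals = []
    pos = 0
    while pos < len(new_strand):
        for nucleotide in flow_order:
            run = 0
            # A run of identical bases is incorporated in one flow and gives a taller peak.
            while pos < len(new_strand) and new_strand[pos] == nucleotide:
                run += 1
                pos += 1
            signals.append((nucleotide, run))
            if pos == len(new_strand):
                break
    return signals


def call_bases(signals):
    """Turn peak heights back into a sequence."""
    return "".join(nucleotide * height for nucleotide, height in signals)


signals = flowgram("GGATTTC")
called = call_bases(signals)
```

Flows with a signal of zero are the ones where the offered nucleotide did not match the template and was simply degraded by apyrase.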
Apyrase degradation of unused nucleotides eliminates a wash step and enables nanotechnology-based sequestration of individual fragments. This brings order to chip-based sequencing technology, with the potential to increase readout clarity and efficiency over polony-based methods.
A limitation of the method is that the lengths of individual reads of DNA sequence are 300bp-500bp, shorter than reads from Sanger sequencing (800bp – 1kbp). This makes genome assembly more difficult, particularly for sequences with repetitive DNA.
There are two methods for preparing pyrosequencing templates: solid phase and liquid phase. In the solid phase method, templates containing biotin are immobilized to streptavidin magnetic beads, sedimented out of solution and remaining PCR reagents are washed away. In the liquid phase method, nucleotide-degrading apyrase removes remaining nucleotides and exonuclease I degrades PCR primers remaining from the amplification step. After PCR reagents are degraded, the temperature of the solution is increased to heat-inactivate PCR polymerase and degradation enzymes before sequencing begins.
Sequencing by ligation harnesses the sensitivity of DNA ligase to drive a fluorescent readout of the DNA sequence. Applied Biosystems’ SOLiD (introduced 2006) is one example. Life Technologies bought ABI in 2008.
SOLiD sequencing (aka 2-base encoding) uses four colors of fluorescent probes. Each of the four colors is carefully reused to represent four of the sixteen possible combinations of two-nucleotide sequences. This two-nucleotide system seems complicated to read out, but it builds a natural accuracy check into the system because it analyzes each base twice, which allows fewer errors:
Using the information available (that the first two nucleotides permit DNA ligase to capture a green fluorescent probe), we know that the first two bases of the template strand are either CA, AC, TG or GT (see the color grid above).
Red fluorescence on the subsequent probe means the new template bases are either TA, AT, GC or CG.
This thymidine starting point eliminates three of four possible template base combinations and allows identification of specific bases along the template strand:
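The decoding logic can be sketched compactly. The green and red dinucleotide sets below match the ones listed above; the blue (identical bases) and yellow (AG, GA, CT, TC) assignments are my assumption, taken from the commonly described two-base scheme. With the bases numbered A=0, C=1, G=2, T=3, the color of a dinucleotide is the XOR of its two base numbers, so one known starting base is enough to decode the whole chain.

```python
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = {bits: base for base, bits in BASE_TO_BITS.items()}
# Index = XOR of the two base numbers. Blue and yellow are assumed, see the text above.
COLORS = ["blue", "green", "yellow", "red"]


def encode(sequence):
    """One color per overlapping pair of neighboring bases."""
    return [COLORS[BASE_TO_BITS[a] ^ BASE_TO_BITS[b]] for a, b in zip(sequence, sequence[1:])]


def decode(first_base, colors):
    """Given one known base, each color pins down the next base."""
    bases = [first_base]
    for color in colors:
        bases.append(BITS_TO_BASE[BASE_TO_BITS[bases[-1]] ^ COLORS.index(color)])
    return "".join(bases)


colors = encode("TGTA")
decoded = decode("T", colors)
```

Because every base contributes to two neighboring colors, a true single-base variant changes two adjacent colors, while a single wrong color is more likely a measurement error.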
It takes 5 iterations (with primers n, n-1, n-2, n-3 and n-4) to bridge a probe with 3 degenerate bases and two template bases:
Special thanks to Andy Vierstraete who published a good starting point for this article here.
Reversible terminator sequencing (RTS) is owned by Illumina and runs on their most popular machines. This makes it one of the (if not the) most widely used methods for DNA sequencing. Some numbers:
RTS begins with a PCR-on-a-slide library preparation.
A sequencing primer is added, followed by modified dNTPs containing a fluorescent molecule on the 3′ end. This modification allows the dNTP to be incorporated into the complementary strand (by a specially engineered polymerase) but it terminates extension of the complementary strand like ddNTPs do in Sanger sequencing. In this way, RTS is a chain-termination, sequencing by synthesis technology. Once extension has terminated, an image of the slide is captured, including the fluorescent molecules that identify which dNTP was incorporated into the complementary strand. After imaging, the fluorescent molecule is cleaved and washed away, allowing polymerase to incorporate the next base. What makes this “reversible terminator sequencing” is that the fluorescent molecule serves two purposes: 1) to terminate polymerase extension and 2) to identify the incorporated nucleotide. Cleavage of the fluorescent molecule allows the polymerase to resume extending the complementary strand and therefore reverses the termination.
The image below represents a time course of six cycles of chain termination, imaging and termination reversal:
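As a toy sketch of those cycles for a single cluster (assuming perfect chemistry, with my own function name): each cycle adds exactly one blocked, labelled base, the image records which one it was, and cleavage lets the next cycle proceed. Unlike pyrosequencing, a run of identical bases still takes one cycle per base.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def run_cycles(template, n_cycles):
    """One base per cycle: incorporate a terminator, image it, then cleave the label."""
    calls = []
    for cycle in range(min(n_cycles, len(template))):
        # Extension stops after a single labelled base is incorporated.
        incorporated = COMPLEMENT[template[cycle]]
        # Imaging records the dye color, which identifies the base.
        calls.append(incorporated)
        # Cleaving the dye reverses the termination so the next cycle can extend.
    return "".join(calls)


six_cycles = run_cycles("TTTGCA", 6)
```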
An NGS library ideally provides reads that are evenly distributed across the entire region of interest. (“library” = fragmented, amplified, tagged/adapter-ligated sample of either genomic DNA or cDNA of transcriptome mRNA) Unfortunately, in real NGS data, some regions are over-represented, others are poorly covered and some are missed altogether.
The image below shows the difference in coverage between two polymerases. The polymerase on the bottom has trouble amplifying GC-rich regions of DNA. (I got the image here, they credit the data to my former employer, the Broad Institute)
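One rough way to put numbers on this kind of bias is to compare GC content and mean depth window by window. The sketch below assumes you already have a per-base depth list for a reference sequence; the function names, the window size and the toy data are hypothetical and not taken from any particular tool.

```python
from statistics import mean, pstdev

def gc_fraction(seq):
    """Fraction of G and C bases in a sequence (0.0 for an empty string)."""
    seq = seq.upper()
    return sum(base in "GC" for base in seq) / len(seq) if seq else 0.0

def coverage_by_window(seq, depths, window=10):
    """Return (gc_fraction, mean_depth) for each non-overlapping window."""
    rows = []
    for start in range(0, len(seq) - window + 1, window):
        rows.append((gc_fraction(seq[start:start + window]),
                     mean(depths[start:start + window])))
    return rows

def evenness(depths):
    """Summarize how uniform coverage is across a region."""
    m = mean(depths)
    return {
        "mean": m,
        "cv": pstdev(depths) / m if m else float("inf"),  # lower is more even
        "uncovered": sum(d == 0 for d in depths) / len(depths),
    }

# Toy data: an AT-rich stretch with good coverage, then a GC-rich stretch
# that a GC-sensitive polymerase amplified poorly.
seq = "ATATATATAT" + "GCGCGCGCGC"
depths = [30] * 10 + [5] * 10
print(coverage_by_window(seq, depths))
print(evenness(depths))
```

A perfectly even library would give a coefficient of variation near zero; a polymerase that struggles with GC-rich templates shows up as low mean depth in the high-GC windows.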
The Helicos Genetic Analysis System (HGAS) platform was the first commercial NGS implementation to use the principle of single-molecule fluorescent sequencing. Working at the single-molecule level avoids amplification bias because no amplification of DNA is required. Helicos failed early in 2010 and filed for bankruptcy in 2012.
HGAS was a reversible terminator sequencing method similar to Illumina’s market-leading technology.
- DNA is fragmented by sonication into fragments between 100 and 200 bp.
- Fragments are ligated to a poly-dA tail.
- fun fact: poly-dA tails should be over 50 dA in length so that they have no trouble annealing to the oligo-dT50 primers on the commercial standard flow cell surface.
- other fun fact: if there is enough DNA to measure mass and average fragment length, it is possible to determine how much dATP is needed to generate poly(dA) tails around 100 nucleotides long.
- Typical flow cells used with HGAS had 25 channels that could each be loaded with the same samples (for higher coverage) or different samples (for experimental comparison).
- Sample loading: Poly-dA-ligated fragments are incubated with the flow cell and they spread out and anneal randomly in the field.
- Filling and locking: unlabelled dTTPs and fluorescent dye-labelled terminator dATPs, dGTPs and dCTPs are added first, together with polymerase.
- The dTTPs fill in any remaining overhanging poly-dAs (so they are not sequenced), and the dye-labelled nucleotides fill the first position that does not pair with dTTP.
- Sequencing is terminated (locked) at this point due to polymerase interference by the dye.
- The slide may be imaged and all fragments that are capable of being sequenced will be illuminated at this point.
- Sequencing: Chemistry-and-imaging cycle begins now.
- The polymerase-interfering dye label has played its role above and is cleaved
- Fluorescent dye-labelled terminator dNTPs are added one by one (with polymerase) and these incorporate where they are able.
- An image is taken at each step that shows which fragments have incorporated the given dye-labelled dNTP
- The dye is cleaved and the cycle repeats
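The fill-and-lock step and the one-base-at-a-time cycle above can be sketched in a few lines of Python. This is a toy model of a single template molecule under idealized assumptions (perfect incorporation, no dephasing); the function names and the `CTAG` flow order are my own assumptions, not taken from Helicos documentation.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def fill_and_lock(template):
    """Fill leftover poly-dA with unlabelled dTTP, then lock.

    `template` lists bases in the order the polymerase encounters them
    after the oligo-dT primer. Returns (next_position, locked_base);
    locked_base is None if the template was all A.
    """
    i = 0
    while i < len(template) and template[i] == "A":
        i += 1  # unlabelled dTTP fills in, nothing to image
    if i == len(template):
        return i, None
    # The first non-A template base takes a labelled terminator (A, G or C).
    return i + 1, COMPLEMENT[template[i]]

def one_color_cycles(template, start, flow_order="CTAG", n_flows=8):
    """Add one labelled terminator type per flow and record which flows light up."""
    pos = start
    read = []
    images = []
    for k in range(n_flows):
        nucleotide = flow_order[k % len(flow_order)]
        lit = pos < len(template) and COMPLEMENT[template[pos]] == nucleotide
        images.append((nucleotide, lit))
        if lit:
            # The terminator allows only one incorporation per flow;
            # cleaving the dye afterwards unlocks the strand again.
            read.append(nucleotide)
            pos += 1
    return "".join(read), images

template = "AAATGGCA"            # 3 leftover A's, then the insert
start, locked = fill_and_lock(template)
read, images = one_color_cycles(template, start, n_flows=12)
print(locked, read)              # A CCGT
```

With a single dye color, the base identity comes from which flow lit up rather than from the color, which is why the nucleotides have to be added one by one. Note that the `GG` in the template needs two full rounds of flows, one per base.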
HGAS runs two flow cells at a time: one for chemistry while the other images.
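Going back to the tailing fun fact above, here is a back-of-the-envelope version of the dATP estimate. It is a sketch under simplifying assumptions that I am adding myself: the input is double-stranded DNA, an average base pair weighs about 650 g/mol, every duplex contributes two 3′ ends, and all of the dATP is incorporated evenly across those ends. The function names are hypothetical.

```python
AVG_BP_MASS = 650.0  # g/mol per base pair, a common approximation

def pmol_three_prime_ends(mass_ng, avg_length_bp):
    """Estimate pmol of 3' ends in a double-stranded fragment pool."""
    # ng -> g is 1e-9, mol -> pmol is 1e12, so the net factor is 1e3.
    pmol_duplex = mass_ng * 1000.0 / (avg_length_bp * AVG_BP_MASS)
    return 2 * pmol_duplex  # two 3' ends per duplex

def datp_needed_pmol(mass_ng, avg_length_bp, tail_length=100):
    """pmol of dATP for tails of `tail_length` nt, assuming full, even incorporation."""
    return pmol_three_prime_ends(mass_ng, avg_length_bp) * tail_length

# Example: 100 ng of DNA sheared to an average of 150 bp.
print(round(pmol_three_prime_ends(100, 150), 2))  # 2.05
print(round(datp_needed_pmol(100, 150), 1))       # 205.1
```

In practice the tail length is a distribution rather than a fixed number, so this only gives the order of magnitude to aim for.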
Interestingly, Helicos uses only one dye color while Illumina/Solexa uses 4 (one for each dNTP). There are two hypotheses for why.
- Using one color allows measurement at higher resolution. This higher resolution may have been necessary for single-molecule readout, whereas amplification clusters (polonies) produce larger signals because each cluster carries many fluorescent labels.
- Solexa’s IP protection on a simultaneous four color readout may have blocked Helicos.
Helicos failed because of a high error rate and a short read length of around 30 bp.
This is attributed to the sizable “scar” left after the terminator functional group is cleaved. (Scars affect read length for most/all reversible terminator technologies.)
Brief comparison of reversible terminators: the two types (3′-blocked and 3′-unblocked) each have advantages and disadvantages. The 3′-O-blocked reversible terminator carries a reversible blocking group at the 3′ position and should therefore terminate more reliably. The 3′-unblocked reversible terminator, on the other hand, is more readily accepted by DNA polymerases because its 3′-OH is unmodified.
Ion Semiconductor Sequencing is a sequencing-by-synthesis method that uses an ion-sensitive transistor to detect hydrogen ions that are released during DNA synthesis.
Process overview: each microwell on a semiconductor chip contains DNA polymerase and many copies of one single-stranded template DNA molecule. The wells are sequentially flooded with unmodified A, C, G or T dNTPs. If an introduced dNTP is complementary to the next available nucleotide on the template strand, the dNTP is incorporated into the growing complementary strand by the DNA polymerase. The hydrogen ion released in the reaction changes the pH of the solution, which is detected by an ion-sensitive field-effect transistor (ISFET).
Unused dNTP molecules are washed out before the next dNTP flood is introduced. dNTPs that are not complementary are not incorporated and do not produce a biochemical reaction or hydrogen ion. The series of electrical pulses transmitted from the chip to a computer is signal-processed and translated into a DNA sequence, with no optical measurement or intermediate signal conversion required.
The process shares many similarities with pyrosequencing because the formation of a covalent bond between a dNTP and a growing DNA strand releases both pyrophosphate (used in pyrosequencing) and a positively charged hydrogen ion (used in ion semiconductor sequencing). Also, neither technology requires modified dNTPs. Lastly, in pyrosequencing, two identical template strand nucleotides in sequence produce twice the light signal, while in ion semiconductor sequencing, two identical template strand nucleotides produce twice the electrical signal.
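The flow-and-detect logic, including the doubled signal for a repeated base, can be illustrated with a small Python sketch. It is an idealized model of one well: the `TACG` flow order, the function names and the noise-free integer signals are assumptions for illustration, not Ion Torrent's actual signal processing.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def flowgram(template, flow_order="TACG", n_flows=8):
    """Return [(nucleotide, signal)] for successive dNTP floods over one well.

    `template` lists bases in the order the polymerase encounters them.
    The signal is the number of bases incorporated in that flow, which is
    proportional to the hydrogen ions released.
    """
    pos = 0
    signals = []
    for k in range(n_flows):
        nucleotide = flow_order[k % len(flow_order)]
        count = 0
        # Unmodified dNTPs do not terminate, so a homopolymer run is
        # extended completely within a single flow.
        while pos < len(template) and COMPLEMENT[template[pos]] == nucleotide:
            count += 1
            pos += 1
        signals.append((nucleotide, count))
    return signals

def call_bases(signals):
    """Translate a flowgram back into a read by rounding each signal."""
    return "".join(nucleotide * round(signal) for nucleotide, signal in signals)

signals = flowgram("AACGTT", n_flows=12)
print(signals[:4])          # [('T', 2), ('A', 0), ('C', 0), ('G', 1)]
print(call_bases(signals))  # TTGCAA
```

The first flow shows the point made above: two identical template bases in a row give a signal of 2 in a single flow. In real data the signals are noisy, which is why long homopolymers are harder to call accurately than isolated bases.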
Ion Torrent sequencing example readout:
No need to reinvent the wheel on this one. This video does a great job explaining how PacBio’s SMRT technology works and it looks truly revolutionary: