Low Diversity Libraries/Sequencing bias
Low diversity or biased libraries are those where all the DNA fragments within an NGS library start with the same sequence. This can happen in a number of library preparations, for example:
Cluster generation produces millions of different clusters, with all the DNA strands within each cluster containing the same sequence. The sequence of the strand is read by extending it: all 4 bases, each labeled with a different fluorescent dye, are added into the flow cell. At each cluster one base will incorporate. That base is read by exciting its fluorescence, which is detected by a CCD camera. Each position is read 4 times, once each for A, G, C and T. In order to collect the information for the growing strand of DNA, the software calculates each cluster's position from the images collected at each base over the first 4 cycles.
If a sample is of high diversity (unbiased), there will be roughly equal representation of all 4 bases across the flow cell at each cycle, and so the software will be able to distinguish the border between one cluster and the next.
But in a biased sample, it is more likely that adjacent clusters will show the same base at a given cycle, and the software will not be able to separate the 2 or more clusters that overlap.
Overlapping clusters are filtered out and do not appear in the final data.
The graph below shows the difference in quantity of data between a biased sample and a non-biased sample.
Krueger F, Andrews SR, Osborne CS (2011) Large Scale Loss of Data in Low-Diversity Illumina Sequencing Libraries Can Be Recovered by Deferred Cluster Calling. PLoS ONE 6(1): e16607.
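The effect of base composition on cluster calling described above can be checked on a set of reads before sequencing decisions are made. The following is a minimal sketch, not part of the facility's pipeline: the function names and the 0.8 threshold are assumptions chosen for illustration. It tabulates the fraction of each base at each of the first 4 cycles and flags cycles dominated by a single base.

```python
from collections import Counter

BASES = "ACGT"

def per_cycle_base_fractions(reads, n_cycles=4):
    """For each of the first n_cycles positions, return the fraction of reads showing each base."""
    fractions = []
    for cycle in range(n_cycles):
        # Count the base seen at this cycle in every read long enough to have one.
        counts = Counter(read[cycle].upper() for read in reads if len(read) > cycle)
        total = sum(counts[b] for b in BASES)
        fractions.append({b: (counts[b] / total if total else 0.0) for b in BASES})
    return fractions

def low_diversity_cycles(reads, n_cycles=4, max_fraction=0.8):
    """Return the 1-based cycles where one base exceeds max_fraction (an assumed, arbitrary cutoff)."""
    flagged = []
    for cycle, fractions in enumerate(per_cycle_base_fractions(reads, n_cycles), start=1):
        if max(fractions.values()) > max_fraction:
            flagged.append(cycle)
    return flagged

# Made-up reads: the first set all start with CGG, the second set is balanced.
biased = ["CGGTAC", "CGGAAT", "CGGCTA", "CGGGCA"]
diverse = ["ACGTAC", "CATGGA", "GTCATT", "TGACCG"]

print(low_diversity_cycles(biased))   # [1, 2, 3]
print(low_diversity_cycles(diverse))  # []
```

A library whose first cycles are flagged in this way is a candidate for the strategies described below.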
We have several strategies for producing a good quantity of reads when we know a library has low diversity. We have optimized our loading concentration to try to guarantee a minimum of 100 million reads per lane on the Illumina HiSeq2000. This guarantee applies only to libraries that have high diversity (unbiased in composition).
If a sample has known bias, we can try to prevent clusters being filtered out because their first few bases are the same by reducing the concentration of the library clustered on the flow cell. At a lower concentration there will be fewer clusters, and therefore fewer reads, but the clusters should be spaced further apart. Because they are spaced further apart, they should not be filtered out by the software as overlapping clusters.
A second strategy is to spike in a high percentage of a PhiX library to introduce diversity into the library. All HiSeq lanes contain a 0.5% PhiX spike-in as QC, but low diversity libraries can be helped by increasing this spike-in to 50%. Both strategies will result in a lower number of reads being generated.
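To make the cost of a PhiX spike-in concrete, here is a rough arithmetic sketch. It assumes the lane's total read count is known and that reads split in proportion to the spike-in percentage; both are simplifications, and the function name is hypothetical. The 100 million figure is the high-diversity minimum quoted above and is used only as an example input, since a low diversity lane may yield fewer reads overall.

```python
def expected_library_reads(total_reads, phix_fraction):
    """Estimate reads left for the library after removing the PhiX share of the lane.

    Assumes reads divide in proportion to the spike-in fraction (a simplification).
    """
    if not 0.0 <= phix_fraction <= 1.0:
        raise ValueError("phix_fraction must be between 0 and 1")
    return round(total_reads * (1.0 - phix_fraction))

# Standard 0.5% QC spike-in versus a 50% spike-in on a lane of 100 million reads.
print(expected_library_reads(100_000_000, 0.005))  # 99500000
print(expected_library_reads(100_000_000, 0.50))   # 50000000
```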
The HiSeq2000 uses a green laser to sequence G/T and a red laser to sequence A/C. At each cycle at least one of the 2 nucleotides for each color channel needs to be read to ensure proper image registration. It is important to maintain color balance for each base of the index read being sequenced; otherwise index read sequencing could fail due to registration failure.
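The color balance rule above can be checked mechanically for a proposed index pool. This is a minimal sketch under the channel assignment stated in the text (red for A/C, green for G/T); the function name is hypothetical and the second pool uses a made-up index sequence. It reports every index position at which the pool lacks a base in one of the two channels.

```python
RED = set("AC")    # bases read with the red laser, per the text above
GREEN = set("GT")  # bases read with the green laser, per the text above

def unbalanced_index_positions(indexes):
    """Return the 1-based index positions where the pool has no red or no green base."""
    if not indexes:
        raise ValueError("at least one index sequence is required")
    length = len(indexes[0])
    if any(len(index) != length for index in indexes):
        raise ValueError("all index sequences must have the same length")
    unbalanced = []
    for pos in range(length):
        # Collect the set of bases the pool shows at this index cycle.
        column = {index[pos].upper() for index in indexes}
        if not (column & RED) or not (column & GREEN):
            unbalanced.append(pos + 1)
    return unbalanced

# Illustrative 2-plex pools (sequences chosen only to demonstrate the check).
print(unbalanced_index_positions(["ATCACG", "CGATGT"]))  # [1, 2, 3, 6]
print(unbalanced_index_positions(["ATCACG", "GCTGTA"]))  # []
```

An empty list means every index cycle has signal in both channels; any listed position is a cycle at which registration could fail for that pool.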
For low plex library pools it is important to select indices carefully. Illumina provide the following guide to ensure there is sufficient variation within the pool for cluster detection and differentiation between the libraries in a low plex pool.
Adding Indexes into our LIMS to allow demultiplexing of pooled samples:
During analysis of libraries sequenced in our facility, CASAVA demultiplexes pooled samples based on the index sequences provided during order entry. If the index sequences are incorrect, all reads will be sent to the “Undetermined reads” folder and the libraries will not be demultiplexed. This can happen if the investigator provides the wrong sequence, or if the index sequence is not entered in the correct orientation. When this happens we always troubleshoot to find which indexes are actually present within the Undetermined reads folder and correct the errors, but this effectively doubles the time required for data analysis for all customers with libraries on that flow cell. We ask that you use this guide to ensure that your indexes are entered correctly into our system.
For instance, the Illumina adaptor containing index AR001 shows the sequence of the index in the primer as ATCACG
but it should be added into the LIMS as its reverse complement, CGTGAT
Following these rules should allow demultiplexing to proceed correctly first time.
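The AR001 example above is consistent with entering the reverse complement of the index as written in the primer. A minimal sketch of that conversion follows; the function name is hypothetical, and you should confirm the required orientation for your own adaptors before relying on it.

```python
# Translation table mapping each base to its complement (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(sequence):
    """Return the reverse complement of a DNA sequence."""
    return sequence.translate(_COMPLEMENT)[::-1]

# Index as shown in the primer for AR001, converted to the form entered in the example above.
print(reverse_complement("ATCACG"))  # CGTGAT
```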