Biotechnology - Massively parallel sequencing - Part 2: Quality evaluation of sequencing data
1 Scope
This document specifies general requirements and recommendations for quality assessments and control of massively parallel sequencing (MPS) data. It covers post raw data generation procedures, sequencing alignments, and variant calling.
This document also gives general guidelines for validation and documentation of MPS data.
This document does not apply to any processes related to de novo assembly.
2 Normative references
There are no normative references in this document.
3 Terms and definitions
For the purposes of this document, the following terms and definitions apply.
3.1
adapter sequence
adapter
artificial oligonucleotide of a known sequence that can be added to the 3' or 5' ends of a nucleic acid fragment
Note: It provides the primer site as well as other necessary sequences for sequencing the insert.
3.2
algorithm
completely determined finite sequence of instructions by which the values of the output variables may be calculated from the values of the input variables
[Source: IEC 60050-351:2013, 351-42-27, modified]
3.3
base calling
computational process in massively parallel sequencing of translating raw electrical signals to nucleotide sequence
Note: The performance of base calling applications and algorithms is characteristically defined by read accuracy and consensus accuracy.
3.4
bioinformatics pipeline
programs, scripts, or pieces of software linked together, where raw data or output from one program is used as input for the next step in data processing
Example: The output from a base quality trimming program may be used as input to a de-novo assembler.
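The chaining described in this definition can be illustrated with a minimal, non-normative Python sketch. The step functions `trim_trailing_n` and `drop_short`, and the helper `run_pipeline`, are hypothetical names chosen for illustration; they do not correspond to any particular tool.

```python
from typing import Callable, List

# A pipeline step takes a list of reads and returns a new list of reads.
Step = Callable[[List[str]], List[str]]


def trim_trailing_n(reads: List[str]) -> List[str]:
    """Hypothetical step: strip uncalled bases ('N') from the 3' end."""
    return [r.rstrip("N") for r in reads]


def drop_short(reads: List[str], min_len: int = 4) -> List[str]:
    """Hypothetical step: discard reads shorter than min_len bases."""
    return [r for r in reads if len(r) >= min_len]


def run_pipeline(raw_reads: List[str], steps: List[Step]) -> List[str]:
    """Feed the output of each step into the next one, in order."""
    data = raw_reads
    for step in steps:
        data = step(data)
    return data


result = run_pipeline(["ACGTNN", "ACN", "GGGTTT"], [trim_trailing_n, drop_short])
print(result)  # ['ACGT', 'GGGTTT']
```

Each step only depends on the output of the previous one, so steps can be reordered or replaced without changing `run_pipeline`.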
3.5
capture efficiency
percent of all sequenced or mapped reads that overlap the targeted regions
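As a non-normative illustration, capture efficiency can be computed from mapped read intervals and target intervals. The sketch below assumes a single reference sequence, half-open coordinates `(start, end)`, and the hypothetical function names `overlaps` and `capture_efficiency`.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open interval: start inclusive, end exclusive


def overlaps(a: Interval, b: Interval) -> bool:
    """True if the two half-open intervals share at least one position."""
    return a[0] < b[1] and b[0] < a[1]


def capture_efficiency(reads: List[Interval], targets: List[Interval]) -> float:
    """Percentage of reads that overlap at least one targeted region."""
    if not reads:
        return 0.0
    on_target = sum(1 for r in reads if any(overlaps(r, t) for t in targets))
    return 100.0 * on_target / len(reads)


reads = [(0, 50), (40, 90), (200, 250), (300, 350)]
targets = [(45, 100)]
print(capture_efficiency(reads, targets))  # 50.0
```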
3.6
coverage
coverage depth
number of times that a given base position is read in a sequencing run
Note: This is the number of reads that cover a particular position.
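A minimal, non-normative sketch of per-position coverage depth follows. It assumes reads are already aligned to a single reference and are given as half-open `(start, end)` intervals; `coverage_depth` is a hypothetical helper name.

```python
from typing import List, Tuple


def coverage_depth(reads: List[Tuple[int, int]], ref_length: int) -> List[int]:
    """Count, for every reference position, how many reads cover it."""
    depth = [0] * ref_length
    for start, end in reads:
        # Clip each read to the reference bounds before counting.
        for pos in range(max(start, 0), min(end, ref_length)):
            depth[pos] += 1
    return depth


print(coverage_depth([(0, 3), (2, 5)], 6))  # [1, 1, 2, 1, 1, 0]
```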
3.7
coverage breadth
fraction of the target genome size that is covered by reads in sequencing runs
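One way to make this definition concrete is sketched below, non-normatively. It assumes a list of per-position depths over the target and a caller-chosen minimum depth (`min_depth`, an illustrative parameter not defined in this document) for a position to count as covered.

```python
from typing import List


def coverage_breadth(depths: List[int], min_depth: int = 1) -> float:
    """Fraction of target positions covered by at least min_depth reads."""
    if not depths:
        return 0.0
    covered = sum(1 for d in depths if d >= min_depth)
    return covered / len(depths)


depths = [1, 1, 2, 1, 1, 0]
print(coverage_breadth(depths))               # 5 of 6 positions covered
print(coverage_breadth(depths, min_depth=2))  # 1 of 6 positions covered
```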
3.8
cluster density
number of clusters for each tile
Note 1: Cluster density applies to MPS (3.30) platforms that require an amplification step.
Note 2: It is the density of individual sequence clusters, each arising from a single molecule, on some sequencing platforms.
Note 3: Cluster density is usually expressed in K/mm² (thousands of clusters per square millimetre).
3.9
circular consensus sequencing; CCS
sequencing mode in which the insert is sequenced multiple times in a rolling-circle-amplification-type reaction, leading to high accuracy
Note: In this mode, multiple passes from the same molecule can be used to achieve higher single-molecule accuracy.
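The idea that multiple passes improve single-molecule accuracy can be illustrated with a deliberately simplified, non-normative sketch. It assumes the passes are already aligned and of equal length, and uses a plain per-position majority vote; actual CCS software uses more elaborate alignment and error models. The name `majority_consensus` is hypothetical.

```python
from collections import Counter
from typing import List


def majority_consensus(passes: List[str]) -> str:
    """Per-position majority vote over equal-length, pre-aligned passes."""
    if not passes:
        return ""
    if len({len(p) for p in passes}) != 1:
        raise ValueError("all passes must have the same length")
    consensus = []
    for column in zip(*passes):
        # most_common(1) returns the most frequent base in this column.
        base, _count = Counter(column).most_common(1)[0]
        consensus.append(base)
    return "".join(consensus)


# The second pass carries one error (C instead of G at position 3);
# the other two passes outvote it.
print(majority_consensus(["ACGT", "ACCT", "ACGT"]))  # ACGT
```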
3.10
coverage range
range of coverage depth across a genome for sequencing runs
3.11
copy number variation; CNV
copy number variant
variation of the number of copies of one or more sections of the DNA present in the genome of an organism
Note: CNVs are insertions, deletions, inversions and duplications of at least 1 kb in length.
3.12
deoxyribonucleic acid; DNA
polymer of deoxyribonucleotides occurring in a double-stranded (dsDNA) or single-stranded (ssDNA) form
[Source: ISO 22174:2005, 3.1.2]
3.13
deletion
loss of one (or more) nucleotide base pair(s) from a nucleic acid sequence compared to its reference sequence
3.14
duplication level
number of identical repeats for every sequence in a library
Note: The duplication level is usually displayed in a plot showing the relative number of sequences with different degrees of duplication.
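The data behind such a plot can be sketched, non-normatively, as follows. The sketch assumes that duplicates are exact string matches and reports, for each duplication level, the share of distinct sequences observed that many times; quality-control tools can use different conventions. `duplication_levels` is a hypothetical name.

```python
from collections import Counter
from typing import Dict, List


def duplication_levels(reads: List[str]) -> Dict[int, float]:
    """Map each duplication level to the fraction of distinct sequences at that level."""
    copies = Counter(reads)                  # sequence -> number of occurrences
    level_counts = Counter(copies.values())  # level -> number of distinct sequences
    total_distinct = len(copies)
    return {level: n / total_distinct for level, n in sorted(level_counts.items())}


library = ["ACGT", "ACGT", "TTGA", "GGCC", "GGCC", "GGCC", "CATG"]
print(duplication_levels(library))  # {1: 0.5, 2: 0.25, 3: 0.25}
```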
3.15
GC content
percentage of guanine and cytosine in one or more nucleic acid sequence(s)
Note: The amount of guanine and cytosine in a polynucleic acid is usually expressed as a mole fraction (or percentage) of total nitrogenous bases. Total nitrogenous bases comprise the total number of nucleotide bases of reads from one or more MPS runs.
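A minimal, non-normative sketch of this calculation over a set of reads follows. It assumes that only unambiguous bases (A, C, G, T, U) count towards the total and that other symbols such as N are ignored; `gc_content` is a hypothetical helper name.

```python
from typing import Iterable


def gc_content(sequences: Iterable[str]) -> float:
    """Percentage of G and C among all unambiguous bases in the given sequences."""
    gc = 0
    total = 0
    for seq in sequences:
        s = seq.upper()
        gc += s.count("G") + s.count("C")
        # Only A, C, G, T and U are counted; ambiguous symbols are skipped.
        total += sum(s.count(base) for base in "ACGTU")
    return 100.0 * gc / total if total else 0.0


print(gc_content(["ACGT", "GGCC"]))  # 75.0
```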
3.16
gene
sequence of nucleotides in DNA or RNA encoding either an RNA or a protein product
Note 1: Genes are recognized as the basic unit of heredity.
Note 2: A gene can consist of non-contiguous nucleic acid segments that are rearranged through a nuclear processing step.
Note 3: A gene may include or be part of an operon that includes elements for gene expression.
3.17
indel
insertion (3.18) and/or deletion (3.13) of nucleotides in genomic DNA
Note: Indels are less than 1 kb in length.
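The length boundary stated in the notes to 3.11 and 3.17 can be expressed as a short, non-normative sketch. It assumes 1 kb equals 1 000 base pairs and considers only the size of an insertion or deletion, not its type or copy-number effect; `size_class` is a hypothetical name.

```python
KB = 1000  # assumption: 1 kb taken as 1 000 base pairs


def size_class(length_bp: int) -> str:
    """Classify an insertion or deletion by length only."""
    if length_bp < 1:
        raise ValueError("length must be at least 1 bp")
    # Below 1 kb: indel range (3.17); 1 kb and above: CNV size range (3.11).
    return "indel" if length_bp < KB else "CNV-sized"


print(size_class(12))    # indel
print(size_class(1000))  # CNV-sized
```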
3.18
insertion
addition of one (or more) nucleotide base pair(s) into a nucleic acid sequence
[Source: ISO/TS 20428:2017, 3.19, modified]
3.19
sequencing
determining the order and the content of nucleotide bases (adenine, guanine, cytosine, thymine or uracil) of a nucleic acid molecule
Note: A sequence is generally described from the 5' to 3' end.
[Source: ISO/TS 17822-1:2020, 3.19, modified]
3.20
sequence alignment
arrangement of nucleic acid sequences according to regions of similarity
Note: Sequence alignment does not necessarily require a reference genome or a reference targeted nucleic acid region, and its aim is not necessarily to produce an assembly.
3.21
raw data
primary sequencing data produced by a sequencer without any software-based pre-filtering for analysis purposes
3.22
ribonucleic acid
polymer of ribonucleotides occurring in a double-stranded or single-stranded form
Note: Synthesis of proteins in cells is directed by genetic information carried in the sequence of nucleotides in a class of RNA known as messenger RNA (mRNA).
3.23
ribonucleotide
nucleotide containing ribose as its pentose component forming the basic building blocks for RNA
Note: The ribonucleotides consist of adenylate (AMP), guanylate (GMP), cytidylate (CMP), or uridylate (UMP).
3.24
read
sequence read
nucleotide sequence generated by a sequencing device
Note: A read is a deduced sequence of nucleic acid base pairs (or base pair probabilities) corresponding to all (or part of) a single nucleic acid fragment. The term "read" can be used to refer to sequences obtained from MPS experiments.
3.25
read type
category of sequence that depends on how the sequence reading experiment is designed and conducted
Example: Read type can be single-end, paired-end, mate-pair, continuous long read, or circular consensus.
3.26
reference sequence
nucleic acid sequence used either to align by mapping sequence reads or as the basis for annotations such as genes and sequence variations