Till KTH:s startsida Till KTH:s startsida

How should DNA reads be trimmed?

You task in this assignment is to suggest a way to trim DNA reads.

Background

Modern sequencers generate files in the TB range and can contain billions of reads. The read length varies, but are for many applications around 150 nt (nucleotides) long. Output files are in the FASTQ format, where each read (sequence) has an entry over four lines:

  • Line 1: A "@" character and then a sequence accession
  • Line 2: The DNA sequence, containing the letters A, C, G, and T for the four nucleotides, and sometimes an N representing "unknown", i.e., a failure to read a position.
  • Line 3: After a starting plus ("+"), the sequence accession may be repeated.
  • Line 4: Quality codes, given as a string of characters representing quality numbers. Position i in line 2 has its quality described on position i of line 4.

Since sequences tend to have lower quality at the end of a sequence, one often trims the sequences when the quality starts to go down. One could consider to chop off all sequences after a certain length, but that ignores the quality values. How do we get a good trade-off between length and quality? Your assignment is to propose at least two different ways to do this trimming and gather statistics of sequence length and quality.

Requirements

The following is required for passing grade on this project.

  • A program in Python that takes a FASTQ file as input and has a trimmed version of the input file as output in FASTQ format. Accessions should not be changed, and lines 2 and 4 in the entries may only be trimmed.
  • At least two described ways of trimming the input sequences must be developed and their impact must be described with statistics and plots: How much sequence is removed? How do the methods improve quality?

Note 1: the example from above ("cut after lenght L") is not an acceptable method of trimming.

Note 2: the two methods should be different, not just have different parametirisation (such as: "1 -- cut after length L1" and "2 -- cut after length L2").

Data

You have access to two files from a MISEQ sequencer: a smaller, miseq1.fq, and a larger, miseq2.fq. The smaller is meant for testing and quick experiments, while the larger is meant for real work. For practical reasons, the larger file is actually a reduced version of its original. Be sure to think of aspects of training and testing: do not use the same data for training parameters and then for evaluation.

Please contact me if you want larger datasets.

Quality codes

There are unfortunately slightly different ways of describing quality in the FASTQ format (depending on manufacturer and version), but this section describes what holds for the MISEQ data.

If the estimated probability of a position being incorrectly determined is p, then the quality code (also known as "PHRED score") is \(q=-10\log_{10}(p)\). The maximal quality value is 40 and corresponds to an error probability of \(10^{-4}\). To make it possible to fit a quality value in one character position, only integral values in the interval [0,40] are used and they are mapped to the character string

!"#$%&’()*+,-./0123456789:;<=>?@ABCDEFGHI

That is, the quality code "I" indicates quality value 40, while "!" indicates quality value 0 (which corresponds to p=1 and a sure error). In practice, q > 1.

It is not a coincidence that these characters are used. The exclamation point is the first visible character in the old (but still important) American text encoding standard ASCII (for details, see "man ascii"), and then the other characters follow in order. This makes it easy to convert a character c to a quality value q:

   q = ord(c) - 32

where the Python function ord(c) yields the ASCII code of c.

Lärare Lars Arvestad skapade sidan 17 november 2016

kommenterade 2 december 2016

Is the only option to trim sequences to remove n nucleotides from the end of the sequence? Or could one also argue to skip n at the beginning and remove m from the end?

Lärare kommenterade 2 december 2016

Quality tends to degrade at the end of sequences, with beginning being just fine.