
Translate DNA

Write a program that finds the longest open reading frame (ORF), i.e., the longest sequence of codons without a stop codon, in each input DNA sequence. The output is the ORFs translated to proteins using the genetic code (the standard code). The output should be formatted as in the previous assignment!

To present:

  1. What are the "stop codons" in the standard code?
  2. Why are we talking about a "standard code"?
  3. Looking for the longest ORF is a primitive way to find genes in prokaryotic genomes. Why does it not work for eukaryotes?
  4. Your code.
  5. How did you structure your code and why?
  6. Test runs showing that requirements are fulfilled.
  7. What is the longest protein snippet produced on the file an_exon.fa?
  8. Why should a real ORF finder also look at the so-called Watson-Crick complement?

Requirements

  • Input is one or more sequences in Fasta format.
  • The longest ORF in each sequence should be translated. It may start from any codon. Only the positive strand needs to be considered (i.e., reading left-to-right).
  • Your program must gracefully handle ambiguous characters. Translate a codon to X if it is not a regular codon.
  • Your program must be well structured and be written with functions performing important algorithmic steps.
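The requirements above can be sketched roughly as follows. This is only one possible structure, not a reference solution: the names CODON_TABLE, translate_codon, translate_frame and longest_orf are made up for illustration, a trailing incomplete codon is ignored, and ties between equally long ORFs are resolved in favour of the first one found (the assignment does not say how to break ties).

```python
from itertools import product

# Standard genetic code, written in the usual T, C, A, G ordering
# with the first base varying slowest; "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_codon(codon):
    """Return the amino acid for a codon, or X if it is not a regular codon."""
    return CODON_TABLE.get(codon.upper(), "X")


def translate_frame(dna, offset):
    """Translate every complete codon of dna, starting at position offset."""
    codons = (dna[i:i + 3] for i in range(offset, len(dna) - 2, 3))
    return "".join(translate_codon(codon) for codon in codons)


def longest_orf(dna):
    """Return the longest stop-free stretch over the three forward frames."""
    best = ""
    for offset in range(3):
        # Splitting on "*" yields the stretches between stop codons,
        # including a final stretch that runs to the end of the sequence.
        for segment in translate_frame(dna, offset).split("*"):
            if len(segment) > len(best):
                best = segment
    return best
```

For instance, longest_orf("AATTCGTAA") reads AAT TCG TAA in the first frame and returns "NS", while longest_orf("TAA") returns an empty string.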

Example session:

$ python dna2aa.py
Which file? translationtest.dna
>single_stop_codon

>stopcodons
NSDNSDNSDNSDNSDNSDNSDNSDNS
>ambiguities
XXXXXXXXXXXXXXXXXX
>proteinalphabet
ARNDCQEGHILKMFPSTWYV
>proteinalphabet2
ARNDCQEGHILKMFPSTWYV
>proteinalphabet3
ARNDCQEGHILKMFPSTWYV
>short
NS
>tooshort

Test data

This file contains several interesting tests, with translations as above.

The file an_exon.fa is the main test file.
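Reading Fasta input and printing one translation per record, as in the example session, might look something like the sketch below. The function names parse_fasta and format_records are hypothetical, and the output layout is inferred from the example session only; the exact format required by the previous assignment is not restated on this page, so check it before relying on this.

```python
def parse_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of Fasta lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # A new record starts; emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            # Sequence data may be spread over several lines.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def format_records(records, translate):
    """Build the output text: a header line followed by its translation."""
    out = []
    for header, sequence in records:
        out.append(">" + header)
        out.append(translate(sequence))
    return "\n".join(out)
```

Here translate is any function from a DNA string to a protein string, so the file handling stays separate from the ORF search. With a file, parse_fasta can be fed the open file object directly, since it only iterates over lines.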

Lars Arvestad created this page on 27 October 2016

commented 4 November 2016

Can we get the expected result for the main test file as well? Or at least the expected length.

commented 5 November 2016

There are several "ORF finder" tools online; you can compare your result with what those produce.

Teacher commented 5 November 2016

Note that this assignment has a lower bar: you only need to find the ORF on what is called the positive strand, reading from left to right (or 5' to 3', as biologists say), while ORF finders typically check both strands (i.e., they also read the Watson-Crick complement in the reverse direction).
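For reference, the other strand is obtained by complementing each base and reversing the result. A minimal sketch, not needed for this assignment, with the hypothetical name reverse_complement and with ambiguous characters such as N left unchanged:

```python
# Translation table swapping each base with its Watson-Crick partner.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(dna):
    """Return the opposite strand, read in its own 5' to 3' direction."""
    return dna.translate(COMPLEMENT)[::-1]
```

A two-strand ORF finder would then search both dna and reverse_complement(dna).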

commented 23 November 2016

I think I misunderstood this assignment: do we need to find the longest ORF regardless of whether it ends in a stop codon? When I look for all the stop codons and translate only what comes before them, the results are very different from the expected ones.

Teacher commented 23 November 2016

An ORF continues until a stop codon is found. There must be something else going on.

commented 23 November 2016

Lars, I think Juliette is referring to ORFs that continue to the end of the sequence.

Juliette, I passed the assignment interpreting ORFs to include any sequence not terminated by a stop codon. This is compatible with the example session above as well.
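The difference between the two readings can be shown on an already translated frame, where "*" marks a stop codon. This is just an illustration with a made-up helper name, orf_candidates:

```python
def orf_candidates(protein, require_stop=False):
    """Split a translated frame into stop-free stretches.

    With require_stop=True, the final stretch is dropped because it
    runs to the end of the sequence without meeting a stop codon.
    """
    segments = protein.split("*")
    if require_stop:
        segments = segments[:-1]
    return segments
```

For the frame "MK*ARNDC", the inclusive reading gives the candidates "MK" and "ARNDC", so the longest is "ARNDC"; requiring a terminating stop codon leaves only "MK".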

Teacher commented 23 November 2016

Aha, good point!