Version created by Lars Arvestad 2015-11-02 12:04


Translate DNA

Write a program that finds the longest open reading frame (ORF), i.e., the longest sequence of codons without a stop codon, in each input DNA sequence. The output is each ORF translated to protein using the standard genetic code, formatted as in the previous assignment.
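The translation step could be sketched as follows. This is a minimal illustration, not a reference solution: the names BASES, AMINO_ACIDS, CODON_TABLE and translate are hypothetical, the table is built from the usual compact listing of the standard code (bases in the order T, C, A, G), and dropping an incomplete trailing codon is an assumption.

```python
from itertools import product

# Standard genetic code, listed with the bases in the order T, C, A, G
# (first base varies slowest, third base fastest). "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate DNA codon by codon; anything not in the table becomes X.

    Assumption: an incomplete codon at the end of the sequence is dropped.
    """
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X")
        for i in range(0, len(dna) - 2, 3)
    )
```

For example, translate("ATGGCC") gives "MA", and translate("NNNNNN") gives "XX" because NNN is not a regular codon.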

To present:

  1. What are the "stop codons" in the standard code?
  2. Why are we talking about a "standard code"?
  3. Looking for the longest ORF is a primitive way to find genes in prokaryotic genomes. Why does it not work for eukaryotes?
  4. Your code.
  5. How did you structure your code and why?
  6. Test runs showing that requirements are fulfilled.
  7. What is the longest protein snippet produced on the file an_exon.fa?
  8. Why should a real ORF finder also look at the so-called Watson-Crick complement?
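Item 8 refers to the Watson-Crick complement. As an illustration of the operation only (the requirements ask for the positive strand alone), the reverse complement of a sequence could be computed roughly like this. The names COMPLEMENT and reverse_complement are hypothetical, and mapping N to N is an assumption about how ambiguous characters should be treated.

```python
# Pair each base with its Watson-Crick partner; N stays N (assumption).
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(dna):
    """Return the opposite strand, read in its own left-to-right direction."""
    return dna.translate(COMPLEMENT)[::-1]
```

For example, reverse_complement("ATGC") gives "GCAT".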

Requirements

  • Input is one or more sequences in Fasta format.
  • The longest ORF in each sequence should be translated. It may start from any codon. Only the positive strand needs to be considered (i.e., reading left-to-right).
  • Your program must gracefully handle ambiguous characters. Translate any codon that is not a regular codon to X.
  • Your program must be well structured and be written with functions performing important algorithmic steps.
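One possible way to split the requirements above into functions is sketched here: one function reads Fasta records and one finds the longest stop-free run of codons on the forward strand. The names parse_fasta and longest_orf are hypothetical. The sketch assumes that an incomplete codon at the end of a reading frame is ignored and that the first of several equally long ORFs is kept; neither choice is stated in the assignment.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # stop codons of the standard code


def parse_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of Fasta lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def longest_orf(dna):
    """Return the DNA of the longest stop-free codon run in frames 0, 1, 2."""
    dna = dna.upper()
    best = ""
    for frame in range(3):
        run_start = frame
        # Walk the frame one whole codon at a time.
        for pos in range(frame, len(dna) - 2, 3):
            if dna[pos:pos + 3] in STOP_CODONS:
                if pos - run_start > len(best):
                    best = dna[run_start:pos]
                run_start = pos + 3
        # Close the run that reaches the last whole codon of the frame.
        end = frame + max(len(dna) - frame, 0) // 3 * 3
        if end - run_start > len(best):
            best = dna[run_start:end]
    return best
```

For example, longest_orf("TAATAATAA") returns "AATAAT": frame 0 contains only stop codons, while frame 1 reads AAT AAT without meeting one. The returned DNA still has to be translated, with ambiguous codons mapped to X.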

Example session:

$ python dna2aa.py
Which file? translationtest.dna
>single_stop_codon

>stopcodons
NSDNSDNSDNSDNSDNSDNSDNSDNS
>ambiguities
XXXXXXXXXXXXXXXXXX
>proteinalphabet
ARNDCQEGHILKMFPSTWYV
>proteinalphabet2
ARNDCQEGHILKMFPSTWYV
>proteinalphabet3
ARNDCQEGHILKMFPSTWYV
>tooshort

Test data

This file contains several interesting tests. The ORF sequences should translate as follows:

  • stopcodons: All "stars" (*), because it only contains stop codons.
  • ambiguities: The input sequence is all "N", and should therefore translate to all "X".
  • proteinalphabet (2, and 3): There are three sequences that should translate to ARNDCQEGHILKMFPSTWYV.
  • tooshort: A single nucleotide should translate to "X".

The file an_exon.fa is the main test file.