Version created by Lars Arvestad 2015-11-13 11:17

Re-formatting DNA

Re-formatting sequences

Write a Python program that reads sequences from Stockholm files and reformats them into Fasta format, with each sequence broken into rows at most 60 characters wide.

For example, if the input looks like this:

# STOCKHOLM 1.0
prot17 MAGQDPRLRGEPLKHVLVIDDDVAMRHLIVEYLTIHAFKVTAVADSKQFNRVLCSETVDVVVVDLNLGREDGLEIVRSLATKSDVPIIIISGARLEEADKVIALELGATDFIAKPFGTRE
prot4711 AAGQDVRLRGEPL----VIDDDVAMRHLIVEYLTIDAFKVTAVADSKQFNRVLCSETVDVVVVDTILGFEDGLEIVDSLATKSDVPIIIISGARLEEADKVIALELGAIDFIAGPFGTRD
//

then the output should be

>prot17
MAGQDPRLRGEPLKHVLVIDDDVAMRHLIVEYLTIHAFKVTAVADSKQFNRVLCSETVDV
VVVDLNLGREDGLEIVRSLATKSDVPIIIISGARLEEADKVIALELGATDFIAKPFGTRE
>prot4711
AAGQDVRLRGEPL----VIDDDVAMRHLIVEYLTIDAFKVTAVADSKQFNRVLCSETVDV
VVVDTILGFEDGLEIVDSLATKSDVPIIIISGARLEEADKVIALELGAIDFIAGPFGTRD

Suggested approach

There are many ways of solving this problem, but all good solutions involve identifying smaller problems and writing functions that solve them. Here is one suggestion.

  • Write a function that takes a string (representing a sequence) and prints it as Fasta.
  • Use a separate function to break a string into several lines, each line at most 60 characters wide.
  • Write a function for handling reading of sequences, and nothing else.
  • Have a short main part of your code that opens the file you want and calls the sequence-reading function.
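
As an illustration of the first two points, here is a minimal sketch in Python 3. The function names split_into_rows, format_fasta and print_fasta are made up for this example, and passing the sequence name as a separate argument is an assumption, not something the assignment prescribes.

```python
def split_into_rows(sequence, width=60):
    """Return the sequence cut into pieces of at most `width` characters."""
    return [sequence[i:i + width] for i in range(0, len(sequence), width)]


def format_fasta(name, sequence, width=60):
    """Return one Fasta record as text: a '>' header line, then the rows."""
    lines = [">" + name] + split_into_rows(sequence, width)
    return "\n".join(lines)


def print_fasta(name, sequence, width=60):
    """Print one Fasta record. Kept separate so the formatting can be tested."""
    print(format_fasta(name, sequence, width))
```

Keeping the row-splitting in its own function makes it easy to check by hand: a sequence of 125 characters should come back as rows of 60, 60 and 5 characters.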

Data

We have three test cases. Your solution should handle these files gracefully.

  1. A simple test case with three sequences.
  2. Shorter sequences. How should your program handle this file?
  3. A "corner case": this file is empty, except for the prefix and suffix necessary in Stockholm files.
  4. As a fourth test case, go to pfam.sbc.su.se (or pfam.sanger.ac.uk) and download the protein alignment for domain family PF00041 in Stockholm format. Verify that your program works well with this data!

Requirements

  • Your script must be able to read from any Stockholm file you want, with markup (#=GF etc) removed.
  • Every sequence in a file must be read and reformatted.
  • Your script must allow for, but not require, empty lines in between sequences.
  • Your program's output must be readable by other programs, e.g. "muscle".
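
One possible way to meet the reading-related requirements is sketched below in Python 3. The function name read_stockholm is hypothetical. The sketch assumes that each sequence line has the form "name, whitespace, sequence", that lines starting with "#" are header or markup lines, and that "//" ends the alignment. As a further assumption, a name that occurs on several lines has its pieces concatenated.

```python
import io


def read_stockholm(handle):
    """Return a list of (name, sequence) pairs read from an open Stockholm file."""
    names = []       # remembers the order in which names first appear
    sequences = {}   # maps name -> sequence collected so far
    for line in handle:
        line = line.strip()
        if not line or line.startswith("#"):
            continue             # skip empty lines, the header and markup
        if line == "//":
            break                # end of the alignment
        parts = line.split()
        if len(parts) < 2:
            continue             # not a "name sequence" line; ignored in this sketch
        name, piece = parts[0], parts[1]
        if name not in sequences:
            names.append(name)
            sequences[name] = ""
        sequences[name] += piece
    return [(name, sequences[name]) for name in names]


# Small usage example with an in-memory "file" instead of a real one.
example = io.StringIO("# STOCKHOLM 1.0\n#=GF ID demo\n\nprot17 MAGQ\n\nprot4711 AAGQ\n//\n")
for name, sequence in read_stockholm(example):
    print(name, len(sequence))
```

Because the function only needs something it can loop over line by line, it works the same on a file opened with open() and on the in-memory example above. For a file that contains only the header and the "//" line, it returns an empty list.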

A typical session looks like this:

$ python reformat.py
Which sequence file? longseqs.sthlm

>gene4711
ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT
ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT
>hubba
ACGTACGTACGTACGTACGTACGTANNNNNNNNNNTACGTACGTACGTACGTACGTACGT
ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT
T
>gene4712
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
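
When trying the program on the test cases, output like the session above can also be checked automatically. The following Python 3 sketch is only a sanity check against this assignment's own rules (a header line starting with ">" before any sequence data, rows at most 60 characters wide). The name check_fasta is made up for this example, and passing the check does not guarantee that every other program will accept the output.

```python
def check_fasta(text, width=60):
    """Return a list of problems found in Fasta-formatted text (empty if none)."""
    problems = []
    seen_header = False
    for number, line in enumerate(text.splitlines(), start=1):
        if not line:
            continue  # empty lines are tolerated by this check
        if line.startswith(">"):
            if len(line) == 1:
                problems.append("line %d: header without a name" % number)
            seen_header = True
        elif not seen_header:
            problems.append("line %d: sequence data before any header" % number)
        elif len(line) > width:
            problems.append("line %d: row is %d characters, limit is %d"
                            % (number, len(line), width))
    return problems


# Usage example: a well-formed record gives an empty list of problems.
print(check_fasta(">hubba\n" + "ACGT" * 15 + "\nT\n"))
```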

To present:

  1. Your code solving this problem according to the given requirements.
  2. Show how your program works on the four test cases.
  3. What is Pfam?