
Re-formatting DNA

Re-formatting sequences

Write a Python program that reads sequences from Stockholm files and reformats them into Fasta format, with each sequence broken into rows at most 60 characters wide.

For example, if the input looks like this:

# STOCKHOLM 1.0 
prot17   MAGQDPRLRGEPLKHVLVIDDDVAMRHLIVEYLTIHAFKVTAVADSKQFNRVLCSETVDVVVV
prot4711 AAGQDVRLRGEPL----VIDDDVAMRHLIVEYLTIDAFKVTAVADSKQFNRVLCSETVDVVVV
//

then the output should be

>prot17
MAGQDPRLRGEPLKHVLVIDDDVAMRHLIVEYLTIHAFKVTAVADSKQFNRVLCSETVDV
VVV
>prot4711
AAGQDVRLRGEPL----VIDDDVAMRHLIVEYLTIDAFKVTAVADSKQFNRVLCSETVDV
VVV

Suggested approach

There are many ways of solving this problem, but all good solutions involve identifying smaller problems and writing functions that solve them. Here is one suggestion.

  • Write a function that takes a string (representing a sequence) and prints it as Fasta.
  • Use a separate function to break a string into several lines, each line at most 60 characters wide.
  • Write a function for handling reading of sequences, and nothing else.
  • Have a short main part of your code that opens the file you want and calls the sequence-reading function.
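
As an illustration of the first two points, here is a minimal sketch of the line-breaking and Fasta-formatting steps. The function names break_lines, format_fasta and print_fasta are hypothetical, chosen for this example only, and this is one possible decomposition rather than the expected solution. It uses plain string slicing, so no line-breaking module is needed.

```python
def break_lines(sequence, width=60):
    """Return the sequence cut into pieces of at most `width` characters."""
    pieces = []
    for start in range(0, len(sequence), width):
        pieces.append(sequence[start:start + width])
    return pieces


def format_fasta(name, sequence, width=60):
    """Return one Fasta record: a '>name' header followed by the wrapped sequence."""
    return "\n".join([">" + name] + break_lines(sequence, width))


def print_fasta(name, sequence, width=60):
    """Print one sequence as a Fasta record."""
    print(format_fasta(name, sequence, width))
```

Calling print_fasta("prot17", ...) with the 63-character prot17 sequence from the example above should give a header line, one row of 60 characters, and a final row holding the remaining "VVV".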

Data

We have three test cases. Your solution should handle these files gracefully.

  1. A simple test case with three sequences.
  2. Shorter sequences. How should your program handle this file?
  3. A "corner case": this file is empty, except for the prefix and suffix necessary in Stockholm files.
  4. As a fourth test case, go to the Pfam web site and download the protein alignment for domain family PF00041 in Stockholm format. Verify that your program works well with this data! Note: Mac users must use a browser other than Safari for the format to come out right. I have verified that Firefox and the command 'wget' work just fine.

Requirements

  • Your script must be able to read from any Stockholm file you want, with markup (#=GF etc) removed.
  • Every sequence in a file must be read and reformatted.
  • Do not use any Python modules (for example 'textwrap') for line breaking (as per the general Lab 1 instructions). 
  • Your script must allow for, but not require, empty lines in between sequences.
  • Your program's output must be readable by other programs, e.g. "muscle".
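
The reading requirements above could be met along these lines. This is only a sketch: the function name read_stockholm is hypothetical, and it assumes that each sequence sits on a single line of the form "name sequence", as in the example input. Lines starting with '#' are skipped, which covers both the "# STOCKHOLM 1.0" header and any remaining "#=GF" style markup.

```python
def read_stockholm(lines):
    """Return a list of (name, sequence) pairs from the lines of a Stockholm file.

    Assumes one sequence per line, written as 'name sequence'.
    """
    records = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # empty line, header line, or markup line
        if line == "//":
            break  # end of the alignment
        fields = line.split()
        if len(fields) == 2:
            # Lines that do not look like 'name sequence' are ignored in this sketch.
            records.append((fields[0], fields[1]))
    return records
```

Because the function accepts any iterable of lines, it can be given an open file object directly, and a file that contains only the Stockholm prefix and suffix simply yields an empty list.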

A typical session looks like this:

$ python reformat.py
Which sequence file? longseqs.sthlm

>gene4711
ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT
ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT
>hubba
ACGTACGTACGTACGTACGTACGTANNNNNNNNNNTACGTACGTACGTACGTACGTACGT
ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT
T
>gene4712
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
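
A session like this one could be driven by a short main part such as the sketch below. The names reformat and main are hypothetical. To keep the example self-contained, reading and line breaking are compressed into one function here; the suggested approach would split them into separate functions. It also assumes one "name sequence" pair per line.

```python
import sys


def reformat(infile, outfile, width=60):
    """Write every sequence found in infile (Stockholm) to outfile as Fasta."""
    for line in infile:
        fields = line.split()
        if len(fields) != 2 or fields[0].startswith("#"):
            continue  # skips empty lines, markup, the header, and the '//' line
        name, sequence = fields
        outfile.write(">" + name + "\n")
        for start in range(0, len(sequence), width):
            outfile.write(sequence[start:start + width] + "\n")


def main():
    """Ask for a file name and print the file's sequences in Fasta format."""
    filename = input("Which sequence file? ")
    with open(filename) as infile:
        reformat(infile, sys.stdout)
```

Ending the script with a call to main() gives the prompt shown in the session. Writing to a file object rather than calling print directly also makes it easy to redirect the output to a file for other programs to read.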

To present:

  1. Your code solving this problem according to the given requirements.
  2. Show how your program works on the four test cases.
  3. What is Pfam?

Lars Arvestad created the page on 2 November 2015

commented on 6 November 2015

I can't access the test cases. Does anyone else have the same problem?

Teacher commented on 6 November 2015

There appears to be a file server issue. I cannot access the files from the file system either.

Teacher commented on 6 November 2015

The files are available again.

Teacher Lars Arvestad changed the permissions on 10 November 2015

Can now be read by everyone and edited by teachers.

commented on 10 November 2015

"As a fourth test case, go to pfam.sbc.su.se (or pfam.sanger.ac.uk) and download the protein alignment for domain family PF00041 in Stockholm format. Verify that your program works well with this data!"

Could you please provide us with a direct link to this file? It is not easily found on that website.

Teacher commented on 11 November 2015

Sorry, I won't do that. :-) It is part of the assignment to figure out what Pfam is, what you can find there, and how to navigate the site.