Re-formatting DNA

Re-formatting sequences

Write a Python program that reads sequences from Stockholm files and re-formats them into Fasta format, with each sequence broken into rows at most 60 characters wide.

For example, if the input is like this:

# STOCKHOLM 1.0 
prot17   MAGQDPRLRGEPLKHVLVIDDDVAMRHLIVEYLTIHAFKVTAVADSKQFNRVLCSETVDVVVV
prot4711 AAGQDVRLRGEPL----VIDDDVAMRHLIVEYLTIDAFKVTAVADSKQFNRVLCSETVDVVVV
//

then the output should be

>prot17
MAGQDPRLRGEPLKHVLVIDDDVAMRHLIVEYLTIHAFKVTAVADSKQFNRVLCSETVDV
VVV
>prot4711
AAGQDVRLRGEPL----VIDDDVAMRHLIVEYLTIDAFKVTAVADSKQFNRVLCSETVDV
VVV

Suggested approach

There are many ways of solving this problem, but all good solutions involve identifying smaller problems and writing functions that solve them. Here is one suggestion.

Write a function that takes a string (representing a sequence) and prints it as Fasta.
Use a separate function to break a string into several lines, each line at most 60 characters wide.
Write a function for handling reading of sequences, and nothing else.
Have a short main part of your code that opens the file you want and calls the sequence-reading function.
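
As an illustration of the first two suggestions, here is a minimal sketch of the line-breaking and Fasta-formatting steps. The function names break_lines and format_fasta are hypothetical, and returning a string instead of printing directly is a design choice made here so that the functions are easy to test; it is not the only valid solution.

```python
def break_lines(sequence, width=60):
    """Return the sequence as a list of pieces, each at most `width` characters."""
    pieces = []
    # Step through the string in jumps of `width`; the last piece may be shorter.
    for start in range(0, len(sequence), width):
        pieces.append(sequence[start:start + width])
    return pieces


def format_fasta(name, sequence, width=60):
    """Return one Fasta record: a '>' header line followed by the wrapped sequence."""
    lines = [">" + name] + break_lines(sequence, width)
    return "\n".join(lines)


# Usage with the first sequence from the example above (63 characters long).
print(format_fasta(
    "prot17",
    "MAGQDPRLRGEPLKHVLVIDDDVAMRHLIVEYLTIHAFKVTAVADSKQFNRVLCSETVDVVVV",
))
```

Slicing past the end of a string is safe in Python, so no special case is needed for the final, shorter row.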

Data

We have three test cases. Your solution should handle these files gracefully.

A simple test case with three sequences.
Shorter sequences. How should your program handle this file?
A "corner case", this file is empty, except for the prefix and suffix necessary in Stockholm files.
As a fourth test case, go to the Pfam web site and download the protein alignment for domain family PF00041 in Stockholm format. Verify that your program works well with this data! Note: Mac users must use a browser other than Safari for the format to come out right. I have verified that Firefox and the command 'wget' work just fine.

Requirements

Your script must be able to read from any Stockholm file you want, with markup (#=GF etc) removed.
Every sequence in a file must be read and reformatted.
Do not use any Python modules (for example 'textwrap') for line breaking (as per the general Lab 1 instructions).
Your script must allow for, but not require, empty lines in between sequences.
Your program's output must be readable by other programs, e.g. "muscle".
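
One possible sketch of a reading function that skips markup lines and tolerates empty lines is shown below. The name read_stockholm is hypothetical, and the code makes two assumptions that you should check against your own data: every sequence line has the form "name sequence" separated by whitespace, and a name that appears more than once (as in an interleaved alignment) continues the same sequence.

```python
def read_stockholm(lines):
    """Collect (name, sequence) pairs from the lines of a Stockholm file."""
    names = []       # sequence names, in order of first appearance
    sequences = {}   # name -> sequence text collected so far
    for line in lines:
        line = line.strip()
        # Skip empty lines and markup; '# STOCKHOLM 1.0' and '#=GF ...' both start with '#'.
        if line == "" or line.startswith("#"):
            continue
        if line == "//":
            break  # end of the alignment
        fields = line.split()
        if len(fields) != 2:
            continue  # assumption: ignore anything that is not "name sequence"
        name, chunk = fields
        if name not in sequences:
            names.append(name)
            sequences[name] = ""
        sequences[name] += chunk
    return [(name, sequences[name]) for name in names]


# Usage on a small in-memory example; the '#=GF' line is made up for illustration.
example = [
    "# STOCKHOLM 1.0",
    "#=GF ID example",
    "prot17   MAGQDPRLRGEPLKHV",
    "",
    "prot4711 AAGQDVRLRGEPL---",
    "//",
]
print(read_stockholm(example))
```

Because the function only iterates over lines, it can be given an open file object just as well as a list of strings, which keeps the file handling in the main part of the program.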

A typical session looks like this:

$ python reformat.py
Which sequence file? longseqs.sthlm

>gene4711
ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT
ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT
>hubba
ACGTACGTACGTACGTACGTACGTANNNNNNNNNNTACGTACGTACGTACGTACGTACGT
ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT
T
>gene4712
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT

To present:

Your code solving this problem according to the given requirements.
Show how your program works on the four test cases.
What is Pfam?

"As a fourth test case, go to pfam.sbc.su.se (or pfam.sanger.ac.uk) and download the protein alignment for domain family PF00041 in Stockholm format. Verify that your program works well with this data!"

Could you please provide us with a direct link to this file? It is not easy to find on that website.