Version created by Lars Arvestad 2015-11-13 11:17

Re-formatting DNA

Re-formatting sequences

Write a Python program that reads sequences from Stockholm files and re-formats them into Fasta format, with each sequence broken into rows at most 60 characters wide.

For example, if the input is like this:

# STOCKHOLM 1.0
prot17 MAGQDPRLRGEPLKHVLVIDDDVAMRHLIVEYLTIHAFKVTAVADSKQFNRVLCSETVDVVVVDLNLGREDGLEIVRSLATKSDVPIIIISGARLEEADKVIALELGATDFIAKPFGTRE
prot4711 AAGQDVRLRGEPL----VIDDDVAMRHLIVEYLTIDAFKVTAVADSKQFNRVLCSETVDVVVVDTILGFEDGLEIVDSLATKSDVPIIIISGARLEEADKVIALELGAIDFIAGPFGTRD
//

then the output should be

>prot17
MAGQDPRLRGEPLKHVLVIDDDVAMRHLIVEYLTIHAFKVTAVADSKQFNRVLCSETVDV
VVVDLNLGREDGLEIVRSLATKSDVPIIIISGARLEEADKVIALELGATDFIAKPFGTRE
>prot4711
AAGQDVRLRGEPL----VIDDDVAMRHLIVEYLTIDAFKVTAVADSKQFNRVLCSETVDV
VVVDTILGFEDGLEIVDSLATKSDVPIIIISGARLEEADKVIALELGAIDFIAGPFGTRD

Suggested approach

There are many ways of solving this problem, but all good solutions involve identifying smaller problems and writing functions that solve them. Here is one suggestion.

Write a function that takes a string (representing a sequence) and prints it as Fasta.
Use a separate function to break a string into several lines, each line at most 60 characters wide.
Write a function for handling reading of sequences, and nothing else.
Have a short main part of your code that opens the file you want and calls the sequence-reading function.
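
As an illustration of the first two steps, here is a minimal sketch of a line-breaking helper and a Fasta formatter. The names wrap_sequence and format_fasta are hypothetical, chosen only for this example, and returning a string rather than printing directly is a design assumption that makes the functions easier to test. It is one possible decomposition, not the expected solution.

```python
def wrap_sequence(seq, width=60):
    """Split seq into consecutive pieces of at most `width` characters."""
    # Slicing past the end of a string is safe in Python, so the last
    # piece is simply shorter when len(seq) is not a multiple of width.
    return [seq[i:i + width] for i in range(0, len(seq), width)]


def format_fasta(name, seq, width=60):
    """Return one Fasta record: a '>' header line, then the wrapped sequence."""
    return "\n".join([">" + name] + wrap_sequence(seq, width))
```

Calling print(format_fasta("prot17", sequence)) would then write one record with rows of at most 60 characters. Gap characters such as "-" are kept as they are, as in the example output above.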

Data

We have three test cases. Your solution should handle these files gracefully.

A simple test case with three sequences.
Shorter sequences. How should your program handle this file?
A "corner case", this file is empty, except for the prefix and suffix necessary in Stockholm files.
As a fourth test case, go to pfam.sbc.su.se (or pfam.sanger.ac.uk) and download the protein alignment for domain family PF00041 in Stockholm format. Verify that your program works well with this data!

Requirements

Your script must be able to read from any Stockholm file you want, with markup (#=GF etc) removed.
Every sequence in a file must be read and reformatted.
Your script must allow for, but not require, empty lines in between sequences.
Your program's output must be readable by other programs, e.g. "muscle".
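
One way to meet the reading requirements is sketched below. The function name read_stockholm is hypothetical. The sketch assumes that each sequence line consists of a name followed by whitespace and the sequence, that every line starting with "#" (the header and any #=GF-style markup) can be skipped, and that "//" ends the alignment. It also assumes that a name occurring on several lines means the pieces should be concatenated, as happens in interleaved alignments.

```python
def read_stockholm(handle):
    """Return a list of (name, sequence) pairs read from an open Stockholm file."""
    sequences = {}  # name -> sequence; dicts keep insertion order (Python 3.7+)
    for line in handle:
        line = line.strip()
        if line == "//":
            break  # end-of-alignment marker
        if not line or line.startswith("#"):
            continue  # empty line, "# STOCKHOLM 1.0" header, or markup
        parts = line.split()
        if len(parts) < 2:
            continue  # a name without sequence data; ignored in this sketch
        name, seq = parts[0], "".join(parts[1:])
        # Append, so that interleaved blocks for the same name are joined.
        sequences[name] = sequences.get(name, "") + seq
    return list(sequences.items())
```

Because empty lines are skipped wherever they appear, they are allowed but not required. A file containing only the header and the "//" line yields an empty list, so nothing is printed for the corner case.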

A typical session looks like this:

$ python reformat.py
Which sequence file? longseqs.sthlm

>gene4711
ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT
ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT
>hubba
ACGTACGTACGTACGTACGTACGTANNNNNNNNNNTACGTACGTACGTACGTACGTACGT
ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT
T
>gene4712
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT

To present:

Your code solving this problem according to the given requirements.
Show how your program works on the four test cases.
What is Pfam?