Re-formatting DNA
Re-formatting sequences
Write a Python program that reads sequences from Stockholm files and reformats them into Fasta format, with each sequence broken into rows at most 60 characters wide.
For example, if the input looks like this:
# STOCKHOLM 1.0
prot17 MAGQDPRLRGEPLKHVLVIDDDVAMRHLIVEYLTIHAFKVTAVADSKQFNRVLCSETVDVVVVDLNLGREDGLEIVRSLATKSDVPIIIISGARLEEADKVIALELGATDFIAKPFGTRE
prot4711 AAGQDVRLRGEPL----VIDDDVAMRHLIVEYLTIDAFKVTAVADSKQFNRVLCSETVDVVVVDTILGFEDGLEIVDSLATKSDVPIIIISGARLEEADKVIALELGAIDFIAGPFGTRD
//
then the output should be
>prot17
MAGQDPRLRGEPLKHVLVIDDDVAMRHLIVEYLTIHAFKVTAVADSKQFNRVLCSETVDV
VVVDLNLGREDGLEIVRSLATKSDVPIIIISGARLEEADKVIALELGATDFIAKPFGTRE
>prot4711
AAGQDVRLRGEPL----VIDDDVAMRHLIVEYLTIDAFKVTAVADSKQFNRVLCSETVDV
VVVDTILGFEDGLEIVDSLATKSDVPIIIISGARLEEADKVIALELGAIDFIAGPFGTRD
Suggested approach
There are many ways of solving this problem, but all good solutions involve identifying smaller problems and writing functions that solve them. Here is one suggestion.
- Write a function that takes a string (representing a sequence) and prints it as Fasta.
- Use a separate function to break a string into several lines, each line at most 60 characters wide.
- Write a function for handling reading of sequences, and nothing else.
- Have a short main part of your code that opens the file you want and calls the sequence-reading function.
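The first two steps of this suggestion could be sketched as follows. This is only one possible shape, written for Python 3. The names `split_into_lines` and `format_fasta` are made up for illustration, and the second function returns the Fasta text instead of printing it, a choice made here only to make the result easy to check.

```python
def split_into_lines(sequence, width=60):
    """Return the sequence cut into pieces of at most `width` characters."""
    return [sequence[i:i + width] for i in range(0, len(sequence), width)]


def format_fasta(name, sequence, width=60):
    """Return one Fasta record: a '>' header line, then the wrapped sequence."""
    lines = [">" + name] + split_into_lines(sequence, width)
    return "\n".join(lines)


# A 70-character sequence becomes one full row of 60 and one row of 10.
print(format_fasta("prot17", "MAGQDPRLRG" * 7))
```

Because the slicing uses `range` with a step of `width`, a sequence shorter than 60 characters comes out as a single row, and an empty sequence produces only the header line.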
Data
We have three test cases. Your solution should handle these files gracefully.
- A simple test case with three sequences.
- Shorter sequences. How should your program handle this file?
- A "corner case": this file is empty, except for the prefix and suffix necessary in Stockholm files.
- As a fourth test case, go to pfam.sbc.su.se (or pfam.sanger.ac.uk) and download the protein alignment for domain family PF00041 in Stockholm format. Verify that your program works well with this data!
Requirements
- Your script must be able to read from any Stockholm file you want, with markup (#=GF etc) removed.
- Every sequence in a file must be read and reformatted.
- Your script must allow for, but not require, empty lines in between sequences.
- Your program's output must be readable by other programs, e.g. "muscle".
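One way to meet the reading requirements is sketched below, assuming Python 3. The name `read_stockholm` is hypothetical. The sketch also assumes that any line starting with `#` is a header or markup line that can be skipped, that `//` ends the alignment, and that a name seen a second time continues the same sequence (as in interleaved Stockholm files).

```python
import io


def read_stockholm(handle):
    """Collect (name, sequence) pairs from an open Stockholm file."""
    sequences = {}  # a dict keeps insertion order in Python 3.7+
    for line in handle:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # empty line, "# STOCKHOLM 1.0" header, or markup
        if line == "//":
            break  # end of the alignment
        parts = line.split()
        if len(parts) < 2:
            continue  # a name without a sequence; ignored in this sketch
        name, residues = parts[0], "".join(parts[1:])
        # Assumption: a repeated name extends the earlier sequence.
        sequences[name] = sequences.get(name, "") + residues
    return list(sequences.items())


# Empty lines between sequences are allowed but not required.
example = io.StringIO("# STOCKHOLM 1.0\n\nprot17 MAGQ\n\nprot4711 AAGQ\n//\n")
print(read_stockholm(example))  # [('prot17', 'MAGQ'), ('prot4711', 'AAGQ')]
```

A file containing only the `# STOCKHOLM 1.0` line and `//` gives an empty list, so a caller can loop over the result without a special case. The price of merging repeated names is that two different sequences sharing a name would be joined; whether that matters depends on the files you expect.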
A typical session looks like this:
$ python reformat.py
Which sequence file? longseqs.sthlm
>gene4711
ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT
ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT
>hubba
ACGTACGTACGTACGTACGTACGTANNNNNNNNNNTACGTACGTACGTACGTACGTACGT
ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT
T
>gene4712
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
To present:
- Your code solving this problem according to the given requirements.
- Show how your program works on the four test cases.
- What is Pfam?