
Changes between two versions

Showing changes to "Re-formatting DNA" between 2015-11-02 11:18 by Lars Arvestad and 2015-11-13 11:17 by Lars Arvestad.


Re-formatting DNA

Re-formatting sequences

Write a Python program that reads sequences from Stockholm files and re-formats them into Fasta format, with each sequence broken into rows at most 60 characters wide.

For example, if the input is like this:

# STOCKHOLM 1.0
prot17 MAGQDPRLRGEPLKHVLVIDDDVAMRHLIVEYLTIHAFKVTAVADSKQFNRVLCSETVDVVVVDLNLGREDGLEIVRSLATKSDVPIIIISGARLEEADKVIALELGATDFIAKPFGTRE
prot4711 AAGQDVRLRGEPL----VIDDDVAMRHLIVEYLTIDAFKVTAVADSKQFNRVLCSETVDVVVVDTILGFEDGLEIVDSLATKSDVPIIIISGARLEEADKVIALELGAIDFIAGPFGTRD
//

then the output should be

>prot17
MAGQDPRLRGEPLKHVLVIDDDVAMRHLIVEYLTIHAFKVTAVADSKQFNRVLCSETVDV
VVVDLNLGREDGLEIVRSLATKSDVPIIIISGARLEEADKVIALELGATDFIAKPFGTRE
>prot4711
AAGQDVRLRGEPL----VIDDDVAMRHLIVEYLTIDAFKVTAVADSKQFNRVLCSETVDV
VVVDTILGFEDGLEIVDSLATKSDVPIIIISGARLEEADKVIALELGAIDFIAGPFGTRD

Suggested approach

There are many ways of solving this problem, but all good solutions involve identifying smaller problems and writing functions that solve them. Here is one suggestion.


* Write a function that takes a string (representing a sequence) and prints it as Fasta.
* Use a separate function to break a string into several lines, each line at most 60 characters wide.
* Write a function for handling reading of sequences, and nothing else.
* Have a short main part of your code that opens the file you want and calls the sequence-reading function.
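
The two formatting steps in the list above could be sketched roughly as follows. This is only one possible decomposition, not the expected solution: the function names `wrap` and `format_fasta`, the `width` parameter, and the name "demo" are made up for illustration.

```python
def wrap(sequence, width=60):
    """Split a sequence string into chunks of at most `width` characters."""
    return [sequence[i:i + width] for i in range(0, len(sequence), width)]


def format_fasta(name, sequence, width=60):
    """Return one Fasta record: a '>name' header followed by the wrapped rows."""
    return "\n".join([">" + name] + wrap(sequence, width))


# Hypothetical 121-character sequence: two full rows and one leftover character.
print(format_fasta("demo", "ACGT" * 30 + "T"))
```

Returning strings instead of printing inside the functions keeps them easy to test; the main part of the program decides where the output goes.
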

Data

We have three test cases. Your solution should handle these files gracefully.


* A simple test case with three sequences.
* Shorter sequences. How should your program handle this file?
* A "corner case": this file is empty, except for the prefix and suffix necessary in Stockholm files.
* As a fourth test case, go to pfam.sbc.su.se (or pfam.sanger.ac.uk) and download the protein alignment for domain family PF00041 in Stockholm format. Verify that your program works well with this data!

Requirements
* Your script must be able to read from any Stockholm file you want, with markup (#=GF etc) removed.
* Every sequence in a file must be read and reformatted.
* Your script must allow for, but not require, empty lines in between sequences.
* Your program's output must be readable by other programs, e.g. "muscle".
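
A reader that respects these requirements might look like the sketch below. It assumes that every sequence line has the form "name, whitespace, residues", that any line starting with "#" is header or markup, and that "//" ends the alignment. The function name `read_stockholm` and the example data are hypothetical.

```python
import io


def read_stockholm(handle):
    """Collect (name, sequence) pairs from an open Stockholm file.

    Assumption: each sequence line is 'name<whitespace>residues'.
    """
    sequences = {}
    for line in handle:
        line = line.strip()
        # Skip empty lines, the '# STOCKHOLM 1.0' header and '#=GF'-style markup.
        if not line or line.startswith("#"):
            continue
        if line == "//":  # end-of-alignment marker
            break
        fields = line.split()
        if len(fields) != 2:
            continue  # not a 'name residues' line; ignored in this sketch
        name, residues = fields
        # Append, so a name that occurs on several lines is joined into one sequence.
        sequences[name] = sequences.get(name, "") + residues
    return list(sequences.items())


# Hypothetical input with markup and empty lines between the sequences.
example = io.StringIO("# STOCKHOLM 1.0\n#=GF ID demo\n\nseq1 ACGT\n\nseq2 AC-T\n//\n")
print(read_stockholm(example))
```

A file containing only the header and the "//" terminator yields an empty list, so the program prints nothing for it rather than failing.
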
A typical session looks like this:

$ python reformat.py
Which sequence file? longseqs.sthlm
>gene4711
ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT
ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT
>hubba
ACGTACGTACGTACGTACGTACGTANNNNNNNNNNTACGTACGTACGTACGTACGTACGT
ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT
T
>gene4712
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT

To present:


* Your code solving this problem according to the given requirements.
* Show how your program works on the four test cases.
* What is Pfam?