Changes between two versions

Showing changes in "Re-formatting DNA" between 2015-11-15 14:22 by Lars Arvestad and 2015-11-15 14:39 by Lars Arvestad.

Re-formatting DNA

Re-formatting sequences

Write a Python program that reads sequences from Stockholm files and re-formats them into Fasta format, with each sequence broken into rows at most 60 characters wide.

For example, if the input is like this:

    # STOCKHOLM 1.0
    prot17   MAGQDPRLRGEPLKHVLVIDDDVAMRHLIVEYLTIHAFKVTAVADSKQFNRVLCSETVDVVVV
    prot4711 AAGQDVRLRGEPL----VIDDDVAMRHLIVEYLTIDAFKVTAVADSKQFNRVLCSETVDVVVV
    //

then the output should be

    >prot17
    MAGQDPRLRGEPLKHVLVIDDDVAMRHLIVEYLTIHAFKVTAVADSKQFNRVLCSETVDV
    VVV
    >prot4711
    AAGQDVRLRGEPL----VIDDDVAMRHLIVEYLTIDAFKVTAVADSKQFNRVLCSETVDV
    VVV

Suggested approach

There are many ways of solving this problem, but all good solutions involve identifying smaller problems and writing functions that solve them. Here is one suggestion.


* Write a function that takes a string (representing a sequence) and prints it as Fasta.
* Use a separate function to break a string into several lines, each line at most 60 characters wide.
* Write a function for handling reading of sequences, and nothing else.
* Have a short main part of your code that opens the file you want and calls the sequence-reading function.
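
The first two suggestions could be sketched roughly as below. This is only one possible shape, not the expected solution: the function names `wrap_sequence` and `format_fasta` are made up for illustration, and the Fasta function here takes the sequence name as a separate argument and returns a string rather than printing directly, which is an assumption that makes it easier to test.

```python
def wrap_sequence(sequence, width=60):
    """Split a sequence string into pieces of at most `width` characters."""
    return [sequence[i:i + width] for i in range(0, len(sequence), width)]


def format_fasta(name, sequence, width=60):
    """Return one Fasta record: a '>' header line followed by the wrapped rows."""
    rows = [">" + name] + wrap_sequence(sequence, width)
    return "\n".join(rows)


if __name__ == "__main__":
    # The prot17 sequence from the example above (63 characters).
    prot17 = "MAGQDPRLRGEPLKHVLVIDDDVAMRHLIVEYLTIHAFKVTAVADSKQFNRVLCSETVDVVVV"
    print(format_fasta("prot17", prot17))
```

Keeping the line breaking in its own function means it can be checked separately, for instance with a sequence that is exactly 60 characters long or with an empty string.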
Data

We have three test cases. Your solution should handle these files gracefully.


* A simple test case with three sequences.
* Shorter sequences. How should your program handle this file?
* A "corner case": this file is empty, except for the prefix and suffix necessary in Stockholm files.
* As a fourth test case, go to the Pfam web site and download the protein alignment for domain family PF00041 in Stockholm format. Verify that your program works well with this data! Note: Mac users must use a browser other than Safari for the format to come out right. I have verified that Firefox and the command 'wget' work just fine.

Requirements
* Your script must be able to read from any Stockholm file you want, with markup (#=GF etc) removed.
* Every sequence in a file must be read and reformatted.
* Your script must allow for, but not require, empty lines in between sequences.
* Your program's output must be readable by other programs, e.g. "muscle".
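
A sequence-reading function that meets these requirements might look something like the sketch below. The name `read_stockholm` is hypothetical, and the sketch rests on several assumptions that go beyond the text: it targets Python 3.7 or later (so that dictionaries keep insertion order), it skips any line starting with `#` in case markup is still present, it silently ignores lines that are not of the form "name sequence", and it joins pieces that share a name in case a file spreads a sequence over several blocks.

```python
def read_stockholm(lines):
    """Collect (name, sequence) pairs from the lines of a Stockholm file."""
    sequences = {}  # relies on dicts keeping insertion order (Python 3.7+)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # empty line, '# STOCKHOLM 1.0' header, or '#=GF'-style markup
        if line == "//":
            break  # end of the alignment
        fields = line.split()
        if len(fields) != 2:
            continue  # assumption: skip anything that is not 'name sequence'
        name, residues = fields
        # Append, so that a sequence split over several blocks is joined up.
        sequences[name] = sequences.get(name, "") + residues
    return list(sequences.items())


if __name__ == "__main__":
    sample = ["# STOCKHOLM 1.0", "", "gene1 ACGT", "", "gene2 TT--", "//"]
    for name, sequence in read_stockholm(sample):
        print(name, sequence)
```

Because the function accepts any iterable of lines, the main part of the program can pass it an open file, while tests can pass a plain list of strings. For the empty "corner case" file it returns an empty list.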
A typical session looks like this:

    $ python reformat.py
    Which sequence file? longseqs.sthlm
    >gene4711
    ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT
    ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT
    >hubba
    ACGTACGTACGTACGTACGTACGTANNNNNNNNNNTACGTACGTACGTACGTACGTACGT
    ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT
    T
    >gene4712
    AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
    CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
    GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
    TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT

To present:


* Your code solving this problem according to the given requirements.
* Show how your program works on the four test cases.
* What is Pfam?