Changes between two versions

Showing changes in "Advanced assignment" between 2015-11-02 11:18 by Lars Arvestad and 2015-11-02 13:08 by Lars Arvestad.

Advanced assignment

Advanced assignment: Controlling Phylip programs

Completing this assignment will raise your grade one step.

The purpose of this assignment is to practice using a couple of advanced standard Python modules: tempfile and either subprocess or pexpect. Using such modules, you can use Python as a glue language.

Write a Python program that reads a protein alignment in FASTA format and runs a Phylip bootstrap analysis. The filename and the number of bootstrap replicates are to be given as command-line arguments, and the output should be a Newick tree written to stdout.
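As a minimal sketch of the command-line handling, assuming the program is called `bootstrap` and that the error text should resemble the session shown under Requirements: the function name `parse_args` and the extra checks (integer, positive) are illustrative assumptions, not part of the assignment text.

```python
import sys

USAGE = "Usage: bootstrap <filename> <number of bootstraps>"


def parse_args(argv):
    """Return (filename, replicates) from argv, or exit with a usage message."""
    if len(argv) < 2:
        sys.exit("Error: You have to specify an alignment file.\n" + USAGE)
    if len(argv) < 3:
        sys.exit("Error: You have to specify the number of bootstraps.\n" + USAGE)
    filename = argv[1]
    try:
        replicates = int(argv[2])
    except ValueError:
        sys.exit("Error: The number of bootstraps must be an integer.\n" + USAGE)
    if replicates < 1:
        sys.exit("Error: The number of bootstraps must be positive.\n" + USAGE)
    return filename, replicates


# In a real program this would be parse_args(sys.argv).
filename, replicates = parse_args(["bootstrap", "small.fa", "100"])
```

Passing a string to `sys.exit` prints it to stderr and exits with a non-zero status, which keeps stdout free for the Newick tree.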

You can read about the necessary Phylip programs on the web. Note, however, that on Ubuntu computers, you start neighbor with the command

> phylip neighbor

instead of plain "neighbor".
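One possible way to cope with both ways of starting a program is to build the command as an argument list. The sketch below is only an illustration: the helper name `phylip_command` is hypothetical, and it assumes that the presence of a `phylip` executable on the PATH means programs are started as `phylip <program>`.

```python
import shutil


def phylip_command(program):
    """Return the argument list for starting a Phylip program.

    Assumption: if a `phylip` wrapper is found on PATH, programs are
    started as `phylip <program>`; otherwise the program is called directly.
    """
    if shutil.which("phylip"):
        return ["phylip", program]
    return [program]


command = phylip_command("neighbor")
```

The resulting list can be passed as-is to the subprocess module, which avoids going through a shell.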

Requirements
* Phylip does not allow accessions longer than 10 characters. This is a hard limit and Phylip programs misbehave if it is violated. Your program must rename sequences internally so that Phylip works well. The output, however, must use the original names.
* Your program should not leave any temporary files lying around! In particular, there should be no file named "infile", "outfile", or similar in the directory where your program was called. Solve this using the tempfile module.
* A session should look something like:

> bootstrap small.fa 100
((((horse:100.0,(dog1:100.0,dog2:100.0):100.0):60.0,rat:100.0):100.0, orang:100.0):100.0,human:100.0);
> bootstrap small.fa
Error: You have to specify the number of bootstraps.
Usage: bootstrap <filename> <number of bootstraps>

Hints
* Use BioPython for reading and writing alignments.
* All Phylip programs read stdin for instructions. You will therefore have to create input to them dynamically.
* It is common to miss the actual bootstrap step: you have to instruct Phylip programs to read from multiple datasets!
* Use a Python module for controlling subprocesses. There are two alternatives:
  * The subprocess module: I have used this with success in the past, but some students have had issues with this module (at least when using the wait function, which it is a good idea to call).
  * Pexpect: An alternative to subprocess is the pexpect module. Try it if you don't like subprocess!
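The hints about stdin and subprocesses, together with the temporary-file requirement, can be combined roughly as in the sketch below. It is a sketch under stated assumptions, not a finished solution: `run_in_tempdir` is a hypothetical helper, and `FAKE_PROGRAM` is a stand-in child process used only so that the example runs without Phylip installed. The actual menu answers to send depend on the Phylip program and are not shown here.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_in_tempdir(command, stdin_text, infile_text, result_name="outfile"):
    """Run `command` in a fresh temporary directory and return one result file.

    `infile_text` is written to a file named "infile" in that directory,
    `stdin_text` is sent to the program's stdin, and the contents of
    `result_name` are returned. The directory and everything in it are
    removed when the `with` block exits.
    """
    with tempfile.TemporaryDirectory() as workdir:
        Path(workdir, "infile").write_text(infile_text)
        subprocess.run(
            command,
            input=stdin_text,           # the dynamically created instructions
            cwd=workdir,                # files are created here, not in the caller's directory
            text=True,
            stdout=subprocess.DEVNULL,  # discard the program's menu chatter
            check=True,                 # raise if the program exits with an error
        )
        return Path(workdir, result_name).read_text()


# Stand-in for a Phylip program (hypothetical, for demonstration only):
# it reads "infile", reads its instructions from stdin, and writes "outfile".
FAKE_PROGRAM = [
    sys.executable,
    "-c",
    "import sys; answers = sys.stdin.read().split(); "
    "data = open('infile').read(); "
    "open('outfile', 'w').write(data.upper() + '|' + ','.join(answers))",
]

result = run_in_tempdir(FAKE_PROGRAM, "Y\n", "acgt")
print(result)  # prints "ACGT|Y"
```

`subprocess.run` waits for the child process to finish before returning, so the result file is complete when it is read. Because the child runs with the temporary directory as its working directory, nothing is left behind in the directory where the program was called.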

Test data
* A small test dataset with short accessions.
* A small test dataset with long accessions.
* 18 seqs with long accessions.
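The datasets with long accessions exercise the 10-character limit. A minimal sketch of one way to rename sequences and restore the original names in the final tree follows. The function names and the `seqNNNNNNN` placeholder scheme are illustrative assumptions, and the sketch works on plain strings rather than on BioPython objects. It also assumes that the original names are unique and contain no characters with special meaning in Newick (parentheses, commas, colons, semicolons).

```python
import re


def make_short_names(names):
    """Map each original name to a unique placeholder of exactly 10 characters."""
    return {name: f"seq{i:07d}" for i, name in enumerate(names)}


def restore_names(newick, short_names):
    """Replace placeholders in a Newick string with the original names."""
    original = {short: name for name, short in short_names.items()}
    return re.sub(
        r"seq\d{7}",
        lambda match: original.get(match.group(0), match.group(0)),
        newick,
    )


names = ["Homo_sapiens_hemoglobin", "dog1"]
short_names = make_short_names(names)
tree = "(seq0000000:100.0,seq0000001:100.0);"
print(restore_names(tree, short_names))
# prints "(Homo_sapiens_hemoglobin:100.0,dog1:100.0);"
```

Renaming every sequence, including those with short names, keeps the mapping simple and guarantees that the placeholders never collide with an original name in the Phylip input.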

To present:
* Your Python program, code and test runs
* How have you dealt with temporary files?
* How have you worked with the subprocess module?