Advanced assignment: Controlling Phylip programs

Completing this assignment will raise your grade one step.

The purpose of this assignment is to practice using a couple of advanced Python modules: tempfile and either subprocess (both in the standard library) or the third-party pexpect. With such modules, you can use Python as a glue language.

Write a Python program that reads a protein alignment in FASTA format and runs a Phylip bootstrap analysis. The filename and the number of bootstrap replicates are given as command-line arguments, and the output should be a Newick tree written to stdout.
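The command-line handling could be sketched as follows. This is only a minimal sketch: the function name parse_args is hypothetical, and only the message about the missing number of bootstraps is taken from the example session further down; the other messages are made up for illustration.

```python
USAGE = "Usage: \n   bootstrap <filename> <number of bootstraps>"


def parse_args(argv):
    """Return (filename, replicates) or raise ValueError with a message.

    argv is expected to look like sys.argv: program name first.
    """
    if len(argv) < 2:
        # Hypothetical wording; the assignment shows no message for this case.
        raise ValueError("You have to specify an alignment file.")
    if len(argv) < 3:
        # Wording taken from the example session in the assignment.
        raise ValueError("You have to specify the number of bootstraps.")
    try:
        replicates = int(argv[2])
    except ValueError:
        # Hypothetical wording.
        raise ValueError("The number of bootstraps must be an integer.") from None
    if replicates < 1:
        # Hypothetical wording.
        raise ValueError("The number of bootstraps must be positive.")
    return argv[1], replicates


def format_error(message):
    """Build the error text in the layout shown in the example session."""
    return "Error: {}\n\n{}".format(message, USAGE)
```

A main function would call parse_args(sys.argv), print format_error(str(err)) when a ValueError is raised, and exit with a non-zero status.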

You can read about the necessary Phylip programs on the web. Note, however, that on Ubuntu computers, you start neighbor with the command


> phylip neighbor

instead of plain "neighbor".

Requirements

  • Phylip does not allow accessions longer than 10 characters. This is a hard limit, and Phylip programs misbehave if it is violated. Your program must rename sequences internally so that Phylip works well. The output must use the original names, however.
  • Your program should not leave any temporary files lying around! In particular, there should be no file named "infile", "outfile", or similar in the directory where your program was called. Solve this using the tempfile module.
  • A session should look something like:
    
    
> bootstrap small.fa 100
((((horse:100.0,(dog1:100.0,dog2:100.0):100.0):60.0,rat:100.0):100.0,
orang:100.0):100.0,human:100.0);
> bootstrap small.fa
Error: You have to specify the number of bootstraps.

Usage: 
   bootstrap <filename> <number of bootstraps>
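The renaming requirement above could be handled along these lines. This is a sketch under assumptions, not a prescribed solution: the function names and the placeholder scheme ("s" followed by nine digits, exactly 10 characters) are made up for illustration, and the sketch assumes that no original name looks like such a placeholder.

```python
import re

# Placeholder names look like "s000000042": exactly 10 characters.
PLACEHOLDER = re.compile(r"s\d{9}")


def make_short_names(names):
    """Map a unique 10-character placeholder to each original name."""
    return {"s{:09d}".format(i): name for i, name in enumerate(names)}


def restore_names(newick, short_to_original):
    """Replace every known placeholder in a Newick string by its original name.

    Placeholders that are not in the mapping are left untouched.
    """
    def lookup(match):
        return short_to_original.get(match.group(0), match.group(0))

    return PLACEHOLDER.sub(lookup, newick)
```

The placeholders are written to the Phylip input instead of the real accessions, and restore_names is applied to the final tree just before it is printed.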

Hints

  • Use BioPython for reading and writing alignments.
  • All Phylip programs read stdin for instructions. You will therefore have to create input to them dynamically.
  • It is common to miss the actual bootstrap step: you have to instruct Phylip programs to read from multiple datasets!
  • Use a Python module for controlling subprocesses. There are two alternatives:
    • The subprocess module: I have used this with success in the past, but some students have had issues with this module (at least when using the wait function, which is a good idea).
    • Pexpect: An alternative to subprocess is the pexpect module. Try it if you don't like subprocess!
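As an illustration of the subprocess alternative, here is a minimal sketch of a helper that starts one program and sends it a prepared string on stdin. The helper name run_with_input is hypothetical. Which answers a given Phylip program expects is not shown here and has to be taken from its documentation.

```python
import subprocess


def run_with_input(command, answers, workdir=None):
    """Run command in workdir, send answers on stdin, return its stdout.

    command is a list such as ["phylip", "neighbor"]; answers is the text
    the program would otherwise read from the keyboard. subprocess.run
    waits for the program to finish, so no separate wait call is needed.
    A non-zero exit status raises subprocess.CalledProcessError.
    """
    completed = subprocess.run(
        command,
        input=answers,
        cwd=workdir,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # exchange str instead of bytes
        check=True,
    )
    return completed.stdout
```

Passing a temporary directory as workdir makes the program read and write its files there instead of in the directory where the script was called.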

Test data

To present:

  1. Your Python program, code and test runs
  2. How have you dealt with temporary files?
  3. How have you worked with the subprocess module?

Lars Arvestad created the page on 27 October 2016

commented on 10 November 2016

To avoid others having a bad time: temporary files created by tempfile can't be opened by other programs in Windows.
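A possible workaround, assuming the problem is that the temporary file is still held open by Python when the other program tries to read it: create a temporary directory with tempfile and write ordinary files inside it, closing them before any other program is started. The helper name with_workdir is hypothetical.

```python
import os
import tempfile


def with_workdir(files, action):
    """Create a temporary directory, fill it, call action(workdir), clean up.

    files maps file names (for example "infile") to their text content.
    The directory and everything in it is deleted when action returns.
    """
    with tempfile.TemporaryDirectory() as workdir:
        for name, text in files.items():
            with open(os.path.join(workdir, name), "w") as handle:
                handle.write(text)
        # Every file is closed at this point, so other programs can open it.
        return action(workdir)
```

Because the whole directory is removed at the end, files such as "infile" and "outfile" never appear in the directory where the script was called.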

commented on 19 November 2016

Hey, we don't understand what the "Phylip bootstrap" analysis is. What is the command line argument to run this and where can we find information about how it works?

http://evolution.genetics.washington.edu/phylip/progs.algs.data.html

This is linked, but which one of them is the bootstrap analysis?

commented on 20 November 2016

As I understand it, you should run several phylip programs in sequence:

  • phylip seqboot: bootstrapping your samples to generate more (say K samples)
  • phylip protdist: computing the corresponding K distance matrices
  • phylip neighbor: computing the K Newick trees
  • phylip consense: merging the K trees into one

Now exactly how to do all these steps from within Python I do not know yet.
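The sequence of programs in this comment could be chained roughly as in the sketch below. Several things in it are assumptions: the menu keystrokes in menu_answers are a guess based on the usual Phylip text menus and must be checked against the documentation of the installed version, and run_step is a hypothetical callback that starts one program (for example with subprocess) in the working directory. The file names infile, outfile, intree and outtree are the defaults the Phylip programs use.

```python
import os


def menu_answers(replicates, seed=1):
    """Return (program, stdin text) pairs in pipeline order.

    ASSUMPTION: these keystrokes follow the usual Phylip menus (R or M to
    set the number of data sets, Y to accept, an odd random seed). Verify
    them against the installed version before relying on them.
    """
    return [
        ("seqboot", "R\n{0}\nY\n{1}\n".format(replicates, seed)),
        ("protdist", "M\nD\n{0}\nY\n".format(replicates)),
        ("neighbor", "M\n{0}\n{1}\nY\n".format(replicates, seed)),
        ("consense", "Y\n"),
    ]


def run_pipeline(workdir, replicates, run_step, seed=1):
    """Run the four programs in workdir and return the consensus tree text.

    workdir must already contain the alignment as "infile".
    run_step(program, answers, workdir) is a hypothetical callback that
    runs one program and feeds it the answers on stdin.
    """
    for program, answers in menu_answers(replicates, seed):
        run_step(program, answers, workdir)
        if program == "consense":
            break
        # The output of this step becomes the input of the next one:
        # neighbor's trees go to consense as "intree", the rest as "infile".
        source = "outtree" if program == "neighbor" else "outfile"
        target = "intree" if program == "neighbor" else "infile"
        os.replace(os.path.join(workdir, source), os.path.join(workdir, target))
        # Remove a leftover "outfile" so that the next program does not
        # stop to ask what to do with the existing file.
        leftover = os.path.join(workdir, "outfile")
        if os.path.exists(leftover):
            os.remove(leftover)
    with open(os.path.join(workdir, "outtree")) as handle:
        return handle.read()
```

Injecting run_step keeps the file shuffling separate from the process handling, so the order of steps can be tried out with a stand-in function before Phylip itself is involved.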

commented on 20 November 2016

The seqboot documentation has useful information on performing bootstrap analyses using phylip. Unfortunately the entire process is rather painful, but quite doable.
