
Advanced assignment: Controlling Phylip programs

Completing this assignment will raise your grade one step.

The purpose of this assignment is to practice using a couple of advanced Python modules: tempfile and subprocess from the standard library, or the third-party pexpect as an alternative to subprocess. With such modules, you can use Python as a glue language.

Write a Python program that reads a protein alignment in FASTA format and runs a Phylip bootstrap analysis. The filename and the number of bootstrap replicates are given as command-line arguments, and the output should be a Newick tree written to stdout.
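The command-line handling could be sketched as follows. This is only one possible layout, not a required one: the names parse_args, main, and USAGE are made up for illustration, and every error message other than the one shown in the example session under Requirements is an assumption.

```python
import sys

USAGE = "Usage:\n   bootstrap <filename> <number of bootstraps>"


def parse_args(argv):
    """Return (filename, replicates), or raise ValueError with a message."""
    if len(argv) == 0:
        # Hypothetical message; the assignment only shows the next one.
        raise ValueError("You have to specify an alignment file.")
    if len(argv) == 1:
        raise ValueError("You have to specify the number of bootstraps.")
    if len(argv) > 2:
        raise ValueError("Too many arguments.")  # hypothetical message
    filename, count = argv
    try:
        replicates = int(count)
    except ValueError:
        # Hypothetical message for non-numeric input.
        raise ValueError("The number of bootstraps must be an integer.") from None
    if replicates < 1:
        raise ValueError("The number of bootstraps must be at least 1.")
    return filename, replicates


def main(argv=None):
    """Parse arguments; print the error and usage text if they are invalid."""
    try:
        filename, replicates = parse_args(sys.argv[1:] if argv is None else argv)
    except ValueError as err:
        print(f"Error: {err}\n\n{USAGE}", file=sys.stderr)
        return 2
    # The bootstrap analysis itself would start here.
    return 0
```

In a script, main would typically be called as sys.exit(main()) under the usual __name__ == "__main__" guard. Writing errors to stderr keeps stdout free for the Newick tree.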

You can read about the necessary Phylip programs on the web. Note, however, that on Ubuntu computers, you start neighbor with the command


> phylip neighbor

instead of plain "neighbor".

Requirements

  • Phylip does not allow accessions longer than 10 characters. This is a hard limit, and Phylip programs misbehave if it is violated. Your program must rename sequences internally so that Phylip works well. The output, however, must use the original names.
  • Your program should not leave any temporary files lying around! In particular, there should be no file named "infile", "outfile", or similar in the directory where your program was called. Solve this using the tempfile module.
  • A session should look something like:
    
    
> bootstrap small.fa 100
((((horse:100.0,(dog1:100.0,dog2:100.0):100.0):60.0,rat:100.0):100.0,
orang:100.0):100.0,human:100.0);
> bootstrap small.fa
Error: You have to specify the number of bootstraps.

Usage: 
   bootstrap <filename> <number of bootstraps>
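One way to meet the renaming requirement is to give every sequence a fixed-width placeholder name before writing the Phylip input, and to substitute the original names back into the final Newick string. The sketch below is a minimal illustration under stated assumptions: the function names and the placeholder scheme ("s" plus nine digits) are hypothetical, the input is assumed to be a list of (name, sequence) pairs with unique names, and the sequential Phylip layout is assumed to be acceptable to the programs you call.

```python
import re


def short_names(original_names):
    """Map each original name to a unique placeholder of exactly 10 characters."""
    # Placeholders look like s000000001, s000000002, ... (assumed scheme).
    return {name: f"s{i:09d}" for i, name in enumerate(original_names, start=1)}


def to_phylip(records, mapping):
    """Format (name, sequence) pairs as sequential Phylip text with short names."""
    n_sites = len(records[0][1])  # an alignment: all sequences share one length
    lines = [f" {len(records)} {n_sites}"]
    for name, seq in records:
        # The name field is padded to 10 characters; the sequence follows.
        lines.append(f"{mapping[name]:<10}{seq}")
    return "\n".join(lines) + "\n"


def restore_names(newick, mapping):
    """Replace every placeholder in a Newick string with its original name."""
    reverse = {short: name for name, short in mapping.items()}
    return re.sub(
        r"\bs\d{9}\b",
        lambda match: reverse.get(match.group(0), match.group(0)),
        newick,
    )
```

Original names that contain characters with a special meaning in Newick, such as parentheses, commas, or colons, would need quoting or cleaning before being written back; the sketch does not handle that.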

Hints

  • Use BioPython for reading and writing alignments.
  • All Phylip programs read stdin for instructions. You will therefore have to create input to them dynamically.
  • It is common to miss the actual bootstrap step: you have to instruct Phylip programs to read from multiple datasets!
  • Use a Python module for controlling subprocesses. There are two alternatives:
    • The subprocess module: I have used this with success in the past, but some students have had issues with this module (at least when using the wait function, which is a good idea to use).
    • Pexpect: An alternative to subprocess is the pexpect module. Try it if you don't like subprocess!
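The hints about stdin instructions and subprocess control can be combined with the tempfile requirement roughly as below. This is a sketch, not a reference solution: run_in_tempdir and its parameters are hypothetical names, and the file names "infile" and "outfile" are taken from the requirement above. The menu keystrokes that each Phylip program expects are not shown here and have to be taken from the Phylip documentation.

```python
import subprocess
import tempfile
from pathlib import Path


def run_in_tempdir(command, infile_text, menu_answers, result_name="outfile"):
    """Run `command` inside a fresh temporary directory and return one result file.

    command      -- argument list, e.g. ["phylip", "neighbor"] on Ubuntu
    infile_text  -- content written to "infile" before the program starts
    menu_answers -- text fed to the program's stdin (its menu choices)
    result_name  -- file to read back once the program has finished
    """
    with tempfile.TemporaryDirectory() as workdir:
        Path(workdir, "infile").write_text(infile_text)
        # subprocess.run waits for the program to exit; check=True raises
        # CalledProcessError if the exit status is non-zero.
        subprocess.run(
            command,
            input=menu_answers,
            cwd=workdir,
            text=True,
            capture_output=True,
            check=True,
        )
        return Path(workdir, result_name).read_text()
    # The directory and everything in it are deleted when the with-block
    # ends, so nothing is left behind in the caller's directory.
```

Because the program runs with cwd set to the temporary directory, its default input and output files never appear in the directory where the script was called. A full pipeline would call such a helper once per Phylip program, passing the previous result as the next infile_text.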

Test data

To present:

  1. Your Python program, code and test runs
  2. How have you dealt with temporary files?
  3. How have you worked with the subprocess module?

Lars Arvestad created the page on 2 November 2015

Teacher Lars Arvestad changed the permissions on 10 November 2015

The page can therefore be read by everyone and edited by teachers.
Commented 25 November 2015

I have two questions:
When we're supposed to use BioPython to read and write alignments, does that mean we assume translated sequences and we use it to recover the sequence of nucleotides?
After I run the neighbor program on 100 bootstrapped samples, I have 100 phylograms. Is this the expected output, or are we supposed to aggregate them?

Teacher commented 27 November 2015

Yes, you can assume protein input.