Visa version

Version skapad av Lars Arvestad 2016-11-16 15:20

Comparing base composition

This assignment has a programming part and a science part.

Programming: Composition-based distances

Computing phylogenies for species based on whole genomes is hard for several reasons. Do you work with genes only? How do you then pick the genes? You cannot, in general, align whole genomes because of recombination events that cause large scale differences.

One solution that has been tried is comparing the difference in nucleotide composition. Let π_X be the nucleotide composition vector for species X. If all nucleotides have the same frequence in X (which would be surprising), then π_X=(0.25, 0.25, 0.25, 0.25).

If genome Y we are looking at is GC rich, then the composition vector might be π_Y = (0.2, 0.3, 0.3, 0.2), assuming that the nucleotide frequencies are in the order A, C, G, and T.

Measuring difference in composition

To be able to use composition for distances, we want to have a measure for the difference. In this assignment, we use the root-mean-square distance:

Take the elementwise differences.
Square the differences and add them up.
Divide the sum by 4 and take the square root.

diff(π_X,π_Y) = √ (0.25 * ((0.25-0.2)² + (0.25-0.3)² + (0.25-0.3)² + (0.25-0.2)²))

Assignment

Write a Python program that takes as input a number of filenames, each pointing to files containing one genomic sequence in Fasta format, and output a distance matrix using the composition difference above.

Test data

simple1.fa and simple2.fa have composition vectors (0.25, 0.25, 0.25, 0.25) and (0.3, 0.2, 0.2, 0.3), respectively, and their composition distance should therefore be 0.05.
Some actual data:
- human.fa
- mouse.fa
- fly.fa
- yeast.fa
- ecoli.fa
- plasmodium.fa
- thermus.fa
For efficiency reasons, this is just parts of the chromosomes/genomes.

Requirements

Your script is an executable Python script, i.e., if your Python file is named "basedist" then you should be able to run it like in the example below.
All files are read from the commandline.
Output is written to stdout, any errors or warnings go to stderr.
The output is a distance matrix in Phylip format. Note: The accessions are limited to 10 characters in this format!
Stafford Noble's paper, the course's Centerpiece, has a section on "Handling and preventing errors". Read it! Your code has to handle and prevent errors somewhere. What could possibly go wrong if someone else is using your program?

A typical session looks like this:

prompt> ./basedist human.fa mouse.fa fly.fa yeast.fa ecoli.fa plasmodium.fa thermus.fa 
   7
Hsapiens  0.0   0.009   0.014   0.041   0.042   0.073   0.126
Mus_muscul0.009	0.0     0.017   0.044   0.039   0.079   0.126
Fly       0.014	0.017	0.0     0.054   0.031   0.081   0.112
Yeast     0.041	0.044	0.054	0.0     0.083   0.048   0.166
Ecoli     0.042	0.039	0.031	0.083	0.0     0.111   0.088
Plasmodium0.073	0.079	0.081	0.048	0.111	0.0     0.182
Thermus_th0.126	0.126	0.112	0.166	0.088	0.182	0.0

The program is here called basedist and each Fasta file contains a single chromosome. Note that the matrix is symmetric.

Clarifications

You do not need to change the accessions to look "nice", but you do have to limit their length to 10 characters.
The numbers in the example are made up.

To present:

Your well-structured Python code.
Examples of possible user mistakes that your program handles gracefully.
Test-runs of your program on the genome data.