Till KTH:s startsida Till KTH:s startsida

Visa version

Version skapad av Lars Arvestad 2016-11-16 15:20

Visa < föregående
Jämför < föregående

Comparing base composition

This assignment has a programming part and a science part.

Programming: Composition-based distances

Computing phylogenies for species based on whole genomes is hard for several reasons. Do you work with genes only? How do you then pick the genes? You cannot, in general, align whole genomes because of recombination events that cause large scale differences.

One solution that has been tried is comparing the difference in nucleotide composition. Let πX be the nucleotide composition vector for species X. If all nucleotides have the same frequence in X (which would be surprising), then πX=(0.25, 0.25, 0.25, 0.25).

If genome Y we are looking at is GC rich, then the composition vector might be πY = (0.2, 0.3, 0.3, 0.2), assuming that the nucleotide frequencies are in the order A, C, G, and T.

Measuring difference in composition

To be able to use composition for distances, we want to have a measure for the difference. In this assignment, we use the root-mean-square distance:

  • Take the elementwise differences.
  • Square the differences and add them up.
  • Divide the sum by 4 and take the square root.

diffXY) = √ (0.25 * ((0.25-0.2)2 + (0.25-0.3)2 + (0.25-0.3)2 + (0.25-0.2)2))

Assignment

Write a Python program that takes as input a number of filenames, each pointing to files containing one genomic sequence in Fasta format, and output a distance matrix using the composition difference above.

Suggested solution structure

Define a function dist that takes two composition vectors as arguments and returns a difference.

Test data

Requirements

  • Your script is an executable Python script, i.e., if your Python file is named "basedist" then you should be able to run it like in the example below.
  • All files are read from the commandline.
  • Output is written to stdout, any errors or warnings go to stderr.
  • The output is a distance matrix in Phylip formatNote: The accessions are limited to 10 characters in this format!
  • Stafford Noble's paper, the course's Centerpiece, has a section on "Handling and preventing errors". Read it! Your code has to handle and prevent errors somewhere. What could possibly go wrong if someone else is using your program?

A typical session looks like this:

prompt> ./basedist human.fa mouse.fa fly.fa yeast.fa ecoli.fa plasmodium.fa thermus.fa 
   7
Hsapiens  0.0   0.009   0.014   0.041   0.042   0.073   0.126
Mus_muscul0.009	0.0     0.017   0.044   0.039   0.079   0.126
Fly       0.014	0.017	0.0     0.054   0.031   0.081   0.112
Yeast     0.041	0.044	0.054	0.0     0.083   0.048   0.166
Ecoli     0.042	0.039	0.031	0.083	0.0     0.111   0.088
Plasmodium0.073	0.079	0.081	0.048	0.111	0.0     0.182
Thermus_th0.126	0.126	0.112	0.166	0.088	0.182	0.0

The program is here called basedist and each Fasta file contains a single chromosome. Note that the matrix is symmetric.

Clarifications

  • You do not need to change the accessions to look "nice", but you do have to limit their length to 10 characters.
  • The numbers in the example are made up.

To present:

  1. Your well-structured Python code.
  2. Examples of possible user mistakes that your program handles gracefully.
  3. Test-runs of your program on the genome data.