Ändringar mellan två versioner

Här visas ändringar i "Comparing base composition" mellan 2016-10-27 09:46 av Lars Arvestad och 2016-11-16 15:20 av Lars Arvestad.

Comparing base composition

This assignment has a programming part and a science part.

Programming: Composition-based distances Computing phylogenies for species based on whole genomes is hard for several reasons. Do you work with genes only? How do you then pick the genes? You cannot, in general, align whole genomes because of recombination events that cause large scale differences.

One solution that has been tried is comparing the difference in nucleotide composition. Let πX be the nucleotide composition vector for species X. If all nucleotides have the same frequence in X (which would be surprising), then πX=(0.25, 0.25, 0.25, 0.25).

If genome Y we are looking at is GC rich, then the composition vector might be πY = (0.2, 0.3, 0.3, 0.2), assuming that the nucleotide frequencies are in the order A, C, G, and T.

Measuring difference in composition To be able to use composition for distances, we want to have a measure for the difference. In this assignment, we use the root-mean-square distance:

* Take the elementwise differences.
* Square the differences and add them up.
* Divide the sum by 4 and take the square root.
diff(πX,πY) = √ (0.25 * ((0.25-0.2)2 + (0.25-0.3)2 + (0.25-0.3)2 + (0.25-0.2)2))

Assignment Write a Python program that takes as input a number of filenames, each pointing to files containing one genomic sequence in Fasta format, and output a distance matrix using the composition difference above.

Suggested solution structure Define a function dist that takes two composition vectors as arguments and returns a difference.

Test data
* simple1.fa and simple2.fa have composition vectors (0.25, 0.25, 0.25, 0.25) and (0.3, 0.2, 0.2, 0.3), respectively, and their composition distance should therefore be 0.05.
* Some actual data:
* human.fa
* mouse.fa
* fly.fa
* yeast.fa
* ecoli.fa
* plasmodium.fa
* thermus.fa
For efficiency reasons, this is just parts of the chromosomes/genomes.
Requirements
* Your script is an executable Python script, i.e., if your Python file is named "basedist" then you should be able to run it like in the example below.
* All files are read from the commandline.
* Output is written to stdout, any errors or warnings go to stderr.
* The output is a distance matrix in Phylip format. Note: The accessions are limited to 10 characters in this format!
* Stafford Noble's paper, the course's Centerpiece, has a section on "Handling and preventing errors". Read it! Your code has to handle and prevent errors somewhere. What could possibly go wrong if someone else is using your program?
A typical session looks like this:

prompt> ./basedist human.fa mouse.fa fly.fa yeast.fa ecoli.fa plasmodium.fa thermus.fa 7 Hsapiens 0.0 0.009 0.014 0.041 0.042 0.073 0.126 Mus_muscul0.009 0.0 0.017 0.044 0.039 0.079 0.126 Fly 0.014 0.017 0.0 0.054 0.031 0.081 0.112 Yeast 0.041 0.044 0.054 0.0 0.083 0.048 0.11266 Ecoli 0.042 0.039 0.031 0.083 0.0 0.111 0.088 Plasmodium0.073 0.079 0.081 0.048 0.111 0.0 0.182 Thermus_th0.126 0.126 0.112 0.166 0.088 0.182 0.0 The program is here called basedist and each Fasta file contains a single chromosome. Note that the matrix is symmetric.

Clarifications
* You do not need to change the accessions to look "nice", but you do have to limit their length to 10 characters.
* The numbers in the example are made up.
To present:

* Your well-structured Python code.
* Examples of possible user mistakes that your program handles gracefully.
* Test-runs of your program on the genome data.