
Motif filter

Programming: A motif filter

Use BioPython to write a program that reads a Fasta file containing protein sequences and writes out those sequences that contain a given motif: K-L-[EI]{2-}-K (in Prosite notation). That is, we want to extract the sequences that contain KL followed by two or more residues, each either E or I, and then a K. Output must be written to stdout (the terminal).
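As a starting point, the motif can be expressed with Python's standard `re` module. The sketch below is only one reading of the description above ("two or more of either E or I" becomes `[EI]{2,}`); the names `MOTIF` and `has_motif` are made up for illustration and are not part of the assignment.

```python
import re

# K, then L, then two or more residues that are each E or I, then K.
# This pattern is an assumed translation of the motif described above.
MOTIF = re.compile(r"KL[EI]{2,}K")


def has_motif(sequence):
    """Return True if the motif occurs anywhere in the sequence string."""
    # search() scans the whole string; match() would only test the start.
    return MOTIF.search(sequence) is not None


print(has_motif("SLKLEEKSL"))  # True: contains KLEEK
print(has_motif("SLKEEKAR"))   # False: no L after the first K
```

Note the use of `search` rather than `match`, since the motif may occur anywhere in the protein and not only at its start.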

Test data

This file contains 62 proteins, and 23 of them contain the motif we want.

Here is an empty file.

You must also put together a small test set yourself.

Requirements

  1. You must use a regular expression for the filtering.
  2. You must write your own test file containing one sequence with the motif and one without it!
  3. If a sequence contains the motif, the entire sequence should be written out in Fasta format.
  4. Sequences not containing the motif should be discarded.
  5. Your program must handle all test files gracefully.
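The requirements above could be approached roughly as in the following sketch. It assumes BioPython is installed and uses `Bio.SeqIO.parse` and `Bio.SeqIO.write`; the function name `filter_fasta`, the `example` string, and the regular expression are illustrative assumptions, not a reference solution.

```python
import io
import re
import sys

from Bio import SeqIO

# Assumed translation of the motif: KL, two or more of E/I, then K.
MOTIF = re.compile(r"KL[EI]{2,}K")


def filter_fasta(in_handle, out_handle):
    """Copy records containing the motif from in_handle to out_handle.

    Both handles are open text streams. Returns the number of records written.
    """
    hits = (
        record
        for record in SeqIO.parse(in_handle, "fasta")
        if MOTIF.search(str(record.seq))
    )
    # SeqIO.write accepts any iterable of records and returns how many it wrote.
    return SeqIO.write(hits, out_handle, "fasta")


# Small in-memory demonstration using the two sequences from the example input.
example = (
    ">acc1  This sequence contains KLEEK\n"
    "SLKLEEKSL\n"
    ">acc2  There is no KLEEK-like motif in this sequence\n"
    "SLKEEKAR\n"
)
filter_fasta(io.StringIO(example), sys.stdout)
```

In a real program the input handle would come from opening the file named on the command line, for example `open(sys.argv[1])`, with `sys.stdout` as the output handle. An empty input yields no records, so nothing is written in that case.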

Using the program should look something like the following.

Suppose your input file a.fa looks like this:


>acc1  This sequence contains KLEEK
SLKLEEKSL
>acc2  There is no KLEEK-like motif in this sequence
SLKEEKAR

Running your program on this file would then look like this:


./motiffilter a.fa
>acc1  This sequence contains KLEEK
SLKLEEKSL

Note 1: Only those sequences containing the required motif are echoed to stdout. The sequences' descriptions are not changed.

Note 2: The above is a tiny example data set with tiny sequences. For the "big" test file, the full sequences must be written out as well.

To present:

  1. Your Python program.
  2. Your test case (see requirement 2).
  3. How many KLEEK proteins are there in the test data?

And then take a look at this! 

Lars Arvestad created this page on 27 October 2016
