Motif filter
Programming: A motif filter
Use BioPython to write a program that reads a Fasta file containing protein sequences and writes those sequences that contain a given motif: K-L-[EI]{2-}-K (in Prosite notation). I.e., we want to extract those sequences that contain KL followed by two or more of either E or I, then a K. Output must be written to stdout (the terminal).
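As a rough illustration of how the motif might be written as a regular expression, the sketch below uses Python's standard re module. The pattern KL[EI]{2,}K is one reading of the description above (K, L, two or more of E or I, then K), and the name MOTIF is only illustrative.

```python
import re

# Assumed regex form of the motif: K, L, at least two of E or I, then K.
MOTIF = re.compile(r"KL[EI]{2,}K")

# Two tiny sequences: the first contains KLEEK, the second has no "KL" at all.
print(bool(MOTIF.search("SLKLEEKSL")))  # True
print(bool(MOTIF.search("SLKEEKAR")))   # False
```

Using search rather than match matters here, because the motif may occur anywhere in the sequence and not only at its start.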
Test data
This file contains 62 proteins, and 23 of them contain the motif we want.
Here is an empty file.
You must also put together a small test set yourself.
Requirements
- You must use a regular expression for the filtering.
- You must write your own test file containing one sequence with the motif and one without it!
- If a sequence contains the motif, the entire sequence should be written out in Fasta format.
- Sequences not containing the motif should be discarded.
- Your program must handle all test files gracefully.
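One possible shape for a program meeting these requirements is sketched below. It assumes Biopython is installed and uses its SeqIO.parse and SeqIO.write functions; the regular expression and the name filter_fasta are illustrative choices, not part of the assignment.

```python
import re

from Bio import SeqIO

# Assumed regex form of the motif: K, L, at least two of E or I, then K.
MOTIF = re.compile(r"KL[EI]{2,}K")


def filter_fasta(in_handle, out_handle):
    """Copy the Fasta records whose sequence contains the motif.

    Returns the number of records written. The name and signature are
    hypothetical, chosen only for this sketch.
    """
    matching = (
        record
        for record in SeqIO.parse(in_handle, "fasta")
        if MOTIF.search(str(record.seq))
    )
    # An empty input yields no records, so nothing is written.
    return SeqIO.write(matching, out_handle, "fasta")
```

A script could open the file named on the command line and call filter_fasta(handle, sys.stdout). Biopython's Fasta writer wraps sequence lines at 60 characters, so line breaks may differ from the input while the header and residues stay the same.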
Using the program should look something like the following.
Suppose your input file a.fa looks like this:
    >acc1 This sequence contains KLEEK
    SLKLEEKSL
    >acc2 There is no KLEEK-like motif in this sequence
    SLKEEKAR
Running your program on this file would then look like this:
    ./motiffilter a.fa
    >acc1 This sequence contains KLEEK
    SLKLEEKSL
Note 1: Only those sequences containing the required motif are echoed to STDOUT. No change is made to the sequence descriptions.
Note 2: The above is a tiny example data set with tiny sequences. In the "big" example, we need the full sequences.
To present:
- Your Python program.
- Your test case (see requirement 2).
- How many KLEEK proteins are there in the test data?
And then take a look at this!