Till KTH:s startsida Till KTH:s startsida

Blast parsing

Use BioPython to filter the results from a Blast run. You want to be able to specify a substring that will match the accession, and you can assume that you always want to remove any hits with an E-value larger than 1e-20. The input is a Blast report in XML format.

Data

Here is some sample test data. You should however also manage to handle the output of the previous assignment.

Please note: Browsers get confused if you try to look at these XML files in the browser. Right click and download the files directly!

Requirements

Write a Python program that fulfills the following.

  • Your program is an executable Python file taking two inputs:
    1. A command line argument with a string that should match "Hit_id", "Hit_def", or "Hit_accession" field in the Blast hit list, e.g., "mouse".
    2. A file to read input from. The file contains Blast XML data.
    An extra smile from the tutor if you can read from stdin if the filename is a dash ("-")!  :-)
  • Output goes to stdout and is a table with four columns,
    1. Query accession: "BlastOutput_query-def"
    2. Target accession: Hit_accession
    3. Score
    4. E-value
    in which only sequences with an id, definition ("description"), or accession matching the input substring are listed. There might be several HSPs from the same target accession, but only the one with lowest E-value should be shown.

Example session

In this example, the first column has been simplified. The actual accession can be more complicated!


spel-01> ./blastparse MOUSE test1.xml
Warning: No hits
spel-01> ./blastparse MOUSE test2.xml
CST3    CYTC_MOUSE   185.3    3.5e-47

To present:

  1. Your code.
  2. A demonstration of your code running with the output from the last assignment (a search with CST3). What is the best MOUSE hit you find?

Lars Arvestad skapade sidan 2 november 2015

Lärare Lars Arvestad ändrade rättigheterna 30 november 2015

Kan därmed läsas av alla och ändras av lärare.
kommenterade 7 december 2015

test1.xml and test2.xml are not valid xml files as they have multiple xml declarations. See http://stackoverflow.com/a/20251895

Lärare kommenterade 7 december 2015

Not my fault. Welcome to the world of Bioinformatics. 

kommenterade 7 december 2015

Right. But the Python ElementTree won't deal with it, so it's not really "standard Python". Should we just hack together a solution?

Lärare kommenterade 8 december 2015

No, you should use the BioPython module for parsing Blast output.

kommenterade 8 december 2015

Yes, of course. How could I forget that BioPython has a module for that... :-)

kommenterade 8 december 2015

BioPython (ver 1.63 on Python 2.7.6) chokes on all three example files as well, when using the SearchIO module of BioPython. Errors are along the lines of "cElementTree.ParseError: junk after document element: "

SearchIO works perfectly with the (presumably valid) XML which I myself gathered from the remote NCBI BLAST service.

Using the older Bio.Blast.NCBIXML module seems to work with the example files however, so I'll have to use that. I'm still curious as to the origins of the broken XML files, because I haven't encountered that issue with either online or local BLAST.
kommenterade 8 december 2015

It seems the example output doesn't match the requirements. Col 3 is specified to be the hit accession, which in this example would be 24130, i.e.

<Hit_accession>24130</Hit_accession>

But the output shows it as "CYTC_MOUSE", which is a part of the Hit_def, which is not in the required output spec.