News feed
Do you have questions about the course?
If you are registered on a current course round, see the course room in Canvas. You will find the right course room under "Courses" in the personal menu.
If you are not registered, see the course memo (Kurs-PM) for DD2404 or contact your student office, study counsellor, or education office.
In the news feed you will find updates to pages, the schedule, and posts from teachers (when they also need to reach previously registered students).
Yes. I don't know what is "only" about that. :-) Distance 0 means the trees are identical.
Regarding the noisy column requirement "at least 50% of amino acids are unique": would a multialignment column of AACD count as having two unique letters (C and D) or three unique letters (A, C and D)?
The former, C and D are called unique here.
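A minimal sketch of that reading, where a letter counts as unique only if it occurs exactly once in the column. The function names and the `threshold` parameter are hypothetical, and treating gap characters like any other symbol is an assumption that the assignment may handle differently.

```python
from collections import Counter


def unique_fraction(column):
    """Fraction of positions in the column whose letter occurs exactly once."""
    counts = Counter(column)
    singles = sum(1 for n in counts.values() if n == 1)
    return singles / len(column)


def is_noisy(column, threshold=0.5):
    """True if at least `threshold` of the letters in the column are unique."""
    return unique_fraction(column) >= threshold


# In AACD only C and D occur once, so 2 of 4 letters are unique.
print(unique_fraction("AACD"), is_noisy("AACD"))
```

Under this reading the column AACD sits exactly on the 50% boundary and is counted as noisy.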
Hi all! If some other desperate person is starting the project just now and needs somebody to work with, PM me.
We were wondering some things regarding this project.
When we analyse our data we get a really low bit-score (~0.5) for the logo sequences we want to find (e.g. "GT"). After investigating, we found that on the positive strand all of our target sequences get a bit-score of 2, whilst the negative strand seems to be random, which indicates faulty retrieval of the sequences on the negative strand.
We have been stuck for a really long time trying to isolate the site sequences on the negative strand, but it is not working. Our main idea so far has been to isolate coordinates by taking the 3' UTR end site position minus the first exon chromosome start position. It seems we don't fully understand how the sequences and sequence positions are provided by Ensembl. Could you give us some indication, or point us to somewhere we can look it up?
Furthermore, is it necessary to use the negative strand as well? We can see no reason why investigating only the positive strand should introduce a bias in the results. On the other hand, we guess it is bad practice to exclude data if it is available.
Never mind! We finally solved it!
Good!
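For readers wondering where the values 2 and ~0.5 come from: a rough sketch of the per-column information content used in sequence logos for DNA, computed as 2 bits minus the Shannon entropy of the column. The function name is hypothetical, and the small-sample correction that some logo tools apply is left out, so this is an approximation and not necessarily what the posters computed.

```python
import math
from collections import Counter


def column_information(column):
    """Information content in bits of one DNA alignment column (maximum 2)."""
    counts = Counter(column)
    total = len(column)
    # Shannon entropy of the observed base frequencies.
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return 2.0 - entropy


print(column_information("GGGG"))  # fully conserved position
print(column_information("ACGT"))  # all four bases equally frequent
```

A fully conserved position reaches the maximum of 2 bits, while a position where all four bases are equally frequent carries no information, which is why sites that look random pull the score down.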
In this project, are you looking for a "yes/no" classification, or are you looking for results that predict exactly where the n/h/c regions as well as the cleavage site (C) are?
It is primarily the yes/no classification that I am looking for. The "real tools" are also interested in the cleavage site, while other details are of little interest.
I keep getting the error: sqlite3.DatabaseError: file is encrypted or is not a database, when I try to query the protdb.sqlite3 file with the sqlite3 module. This happens on my computer as well as the CSC computers. Has anyone encountered this and knows what's up?
When you test on the school computers, have you tried to access the data file directly, from /info/appbio10/data/protdb.sqlite3?
I am getting the same error when I open the database in sqlite3, but it works fine if I 'read' it (with the command ".read" in sqlite3).
I managed to do 'Your own database' with sqlite3 just fine without any error, but using this code in python for the same database file:
#!/usr/bin/python
import sqlite3 as lite

con = lite.connect('protdb.sqlite3')  # on my own computer
cur = con.cursor()
for row in cur.execute('SELECT * FROM species;'):
    print(row)
returns this error. On the school computers I always connect to the file directly.
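That ".read" works while opening the file fails suggests, though the thread does not confirm it, that the file is a plain SQL script and not a binary SQLite 3 database. A binary SQLite 3 file starts with the 16-byte header "SQLite format 3" followed by a NUL byte, so this can be checked before connecting. The sketch below uses hypothetical helper names and loads a script into an in-memory database as a fallback.

```python
import sqlite3

# Every binary SQLite 3 database file begins with this 16-byte header.
SQLITE_MAGIC = b"SQLite format 3\x00"


def is_sqlite3_file(path):
    """True if the file starts with the SQLite 3 header string."""
    with open(path, "rb") as handle:
        return handle.read(16) == SQLITE_MAGIC


def open_database(path):
    """Connect to a binary database, or load an SQL script into memory.

    Assumption: a file without the header is an SQL dump, as the
    ".read" observation in the thread hints.
    """
    if is_sqlite3_file(path):
        return sqlite3.connect(path)
    con = sqlite3.connect(":memory:")
    with open(path) as handle:
        con.executescript(handle.read())
    return con
```

With this, `open_database('protdb.sqlite3').execute('SELECT * FROM species;')` would work for either kind of file, provided the guess about the file's contents is right.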
I have set up a database which should work: /info/DD2404/appbio15/data/protdb2.sqlite3
Great, do you have an online version?
This link will live for 5 days: https://transfer.sh/11dlbr/p.db
This seems to work, thanks!
There is no table called gene_stable_id, but a column called gene.stable_id.
test1.xml and test2.xml are not valid xml files as they have multiple xml declarations. See http://stackoverflow.com/a/20251895
Not my fault. Welcome to the world of Bioinformatics.
Right. But the Python ElementTree won't deal with it, so it's not really "standard Python". Should we just hack together a solution?
No, you should use the BioPython module for parsing Blast output.
Yes, of course. How could I forget that BioPython has a module for that... :-)
BioPython (ver 1.63 on Python 2.7.6) chokes on all three example files as well, when using the SearchIO module of BioPython. Errors are along the lines of "cElementTree.ParseError: junk after document element: "
SearchIO works perfectly with the (presumably valid) XML which I myself gathered from the remote NCBI BLAST service.
Using the older Bio.Blast.NCBIXML module seems to work with the example files however, so I'll have to use that. I'm still curious as to the origins of the broken XML files, because I haven't encountered that issue with either online or local BLAST.
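To illustrate why a strict parser rejects these files: several XML documents concatenated in one file, each with its own declaration, are not a single well-formed document. The sketch below uses only the standard library and hypothetical function names. It splits the text at each "<?xml" declaration and parses the pieces separately. It is an illustration of the problem only; for the assignment, the advice in this thread is still to use BioPython's Blast parser.

```python
import xml.etree.ElementTree as ET


def split_xml_documents(text):
    """Split text holding concatenated XML documents at each '<?xml' declaration."""
    chunks = text.split("<?xml")
    if len(chunks) == 1:
        # No declaration found: treat the whole text as one document.
        return [text]
    # Anything before the first declaration is dropped (normally empty).
    return ["<?xml" + chunk for chunk in chunks[1:]]


def parse_concatenated_xml(text):
    """Return one root element per concatenated document."""
    return [ET.fromstring(doc) for doc in split_xml_documents(text)]
```

Each returned root can then be searched with the usual ElementTree calls, for example `root.iter('Hit_accession')`.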
It seems the example output doesn't match the requirements. Col 3 is specified to be the hit accession, which in this example would be 24130, i.e.
<Hit_accession>24130</Hit_accession>
But the output shows it as "CYTC_MOUSE", which is a part of the Hit_def, which is not in the required output spec.
Some problems with Biomart downtime right now. Try one of the mirrors:
http://www.ensembl.org/info/about/mirrors.html
How large should the project groups be? 2-4 people?
At most 3, preferably 2.
I'm looking for a partner for the bioinformatics project. My background is in mathematics and computer science. Send me a message if interested.
What do you mean by 'create a histogram of all scores for the first Blast result in the file.'? The first blast result - one result - has only one score?
What I mean is "for the first query". If the Blast contains results from several queries, you only need to produce a histogram for the first one (which then includes all the suboptimal hits). Does this make sense?
Yeah, I guess so. But there is only one query in the cst3 file? But you want us to handle multiple queries anyway?
No, I am saying your code does not need to handle multiple queries. I am making this assignment easier than it perhaps should be. :-)
Ok, but now it does handle multiple queries. I think it says that it should handle multiple, but maybe I'm mistaken.
I added a parenthesis which hopefully clarifies the assignment for future generations. But I won't fail you for writing a more general program which solves the assignment.
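As a rough sketch of the histogram step, assuming the scores of all hits for the first query have already been extracted into a list. The function names, the bin width and the example scores are made up for illustration, and the text rendering is only one possible output format.

```python
from collections import Counter


def score_histogram(scores, bin_width=10):
    """Count scores per bin; keys are the lower edge of each bin."""
    bins = Counter(int(s // bin_width) * bin_width for s in scores)
    return dict(sorted(bins.items()))


def render(histogram):
    """Draw one line per bin, with one star per hit in that bin."""
    return "\n".join("%5d %s" % (start, "*" * count)
                     for start, count in histogram.items())


# Hypothetical scores for the hits of a single query.
print(render(score_histogram([12, 15, 27, 31.5, 38])))
```

A program that handles several queries would simply call this once per query. For the assignment as clarified above, one call for the first query is enough.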
We are only using symmetric distance (treecompare.symmetric_difference()) to compare the trees, is that enough?
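For intuition about what a symmetric difference between two trees measures, here is a simplified sketch for rooted trees written as nested tuples. It counts the clades (sets of leaves under an internal node) found in one tree but not the other. This is not DendroPy's implementation, which works on bipartitions of the leaf set, and the function names are hypothetical.

```python
def clades(tree):
    """Set of leaf sets found under each internal node of a nested-tuple tree."""
    found = set()

    def leaves(node):
        if not isinstance(node, tuple):
            return frozenset([node])
        below = frozenset().union(*[leaves(child) for child in node])
        found.add(below)
        return below

    leaves(tree)
    return found


def symmetric_distance(t1, t2):
    """Number of clades present in exactly one of the two trees."""
    return len(clades(t1) ^ clades(t2))


t1 = (("A", "B"), ("C", "D"))
t2 = (("B", "A"), ("D", "C"))  # same topology, children listed in another order
print(symmetric_distance(t1, t2))
```

A distance of 0 here means both trees contain exactly the same clades, so they have the same topology regardless of the order in which children are listed.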