Till KTH:s startsida Till KTH:s startsida

Ändringar mellan två versioner

Här visas ändringar i "Querying Ensembl" mellan 2015-11-02 11:18 av Lars Arvestad och 2015-11-02 13:34 av Lars Arvestad.

Querying Ensembl

Connecting to Ensembl's MySQL system At the core of the Ensembl web site is a large database running MySQL which is generously open to the scientific community at large. You connect to an Ensembl database using a command like

mysql -u anonymous -h ensembldb.ensembl.org -P 3306 DB where you choose DB to name a database for the species and genome version you want. The Ensembl databases update frequently, and they do not want to throw away information, instead you have to keep track of which version there is. Some notable databases (the latest at this writing) are


* homo_sapiens_core_47_36i
* mus_musculus_core_47_37
* pan_troglodytes_core_47_21f
* felis_catus_core_47_1b
and to connect to the (latest?) human database, you would therefore write

mysql -u anonymous -h ensembldb.ensembl.org -P 3306 homo_sapiens_core_47_36i82_38 To see more available databases, you do:

mysql -u anonymous -h ensembldb.ensembl.org -P 3306 and write

show databases; or

show databases like 'homo%'; for everything starting with 'homo'. The percent sign is "wild card" that matches anything.

Assignment 1 How many databases are available for cat?

Quering Ensembl using SQL The database schema of Ensembl is documented, at least on their web site. A paper describing Ensembl (Stabenau et al, Genome Research, 2004) contains an outdated figure which briefly describes the central tables. Luckily, you can get more details using the SQL command describe.

Assignment 2 Use the show and describe commands to figure what tables are related to describing genes, transcripts, and exons. What information are these tables storing and how do the tables relate to each other?

Assignment 3 How many genes are registred for human? Make a breakdown based on the column "biotype" by using the GROUP BY directive.

Assignment 4 How many pseudogenes have a non-empty description string? How many of those does not start with "similar to..."?

Assignment 5 How many transcripts are associated with the two breast-cancer associated genes BRCA1 (ENSG00000012048) and BRCA2 (ENSG00000139618)? Your solution is supposed to be written as one SQL query. Hint: Use the table gene_stable_id.