Till KTH:s startsida Till KTH:s startsida

Visa version

Version skapad av Lars Arvestad 2015-11-02 13:34

Visa < föregående
Jämför < föregående

Querying Ensembl

Connecting to Ensembl's MySQL system

At the core of the Ensembl web site is a large database running MySQL which is generously open to the scientific community at large. You connect to an Ensembl database using a command like

   mysql -u anonymous -h ensembldb.ensembl.org -P 3306 DB

where you choose DB to name a database for the species and genome version you want. The Ensembl databases update frequently, and they do not want to throw away information, instead you have to keep track of which version there is. Some notable databases (the latest at this writing) are

  • homo_sapiens_core_47_36i
  • mus_musculus_core_47_37
  • pan_troglodytes_core_47_21f
  • felis_catus_core_47_1b

and to connect to the (latest?) human database, you would therefore write

   mysql -u anonymous -h ensembldb.ensembl.org -P 3306 homo_sapiens_core_82_38

To see more available databases, you do:

   mysql -u anonymous -h ensembldb.ensembl.org -P 3306

and write

   show databases;

or

   show databases like 'homo%';

for everything starting with 'homo'. The percent sign is "wild card" that matches anything.

Assignment 1

How many databases are available for cat?

Quering Ensembl using SQL

The database schema of Ensembl is documented, at least on their web site. A paper describing Ensembl (Stabenau et al, Genome Research, 2004) contains an outdated figure which briefly describes the central tables. Luckily, you can get more details using the SQL command describe.

Assignment 2

Use the show and describe commands to figure what tables are related to describing genes, transcripts, and exons. What information are these tables storing and how do the tables relate to each other?

Assignment 3

How many genes are registred for human? Make a breakdown based on the column "biotype" by using the GROUP BY directive.

Assignment 4

How many pseudogenes have a non-empty description string? How many of those does not start with "similar to..."?

Assignment 5

How many transcripts are associated with the two breast-cancer associated genes BRCA1 (ENSG00000012048) and BRCA2 (ENSG00000139618)? Your solution is supposed to be written as one SQL query. Hint: Use the table gene_stable_id.