Till KTH:s startsida Till KTH:s startsida

Querying Ensembl

Connecting to Ensembl's MySQL system

At the core of the Ensembl web site is a large database running MySQL which is generously open to the scientific community at large. You connect to an Ensembl database using a command like

   mysql -u anonymous -h ensembldb.ensembl.org -P 3306 DB

where you choose DB to name a database for the species and genome version you want. The Ensembl databases update frequently, and they do not want to throw away information, instead you have to keep track of which version there is. Some notable databases (the latest at this writing) are

  • homo_sapiens_core_47_36i
  • mus_musculus_core_47_37
  • pan_troglodytes_core_47_21f
  • felis_catus_core_47_1b

and to connect to the (latest?) human database, you would therefore write

   mysql -u anonymous -h ensembldb.ensembl.org -P 3306 homo_sapiens_core_82_38

To see more available databases, you do:

   mysql -u anonymous -h ensembldb.ensembl.org -P 3306

and write

   show databases;

or

   show databases like 'homo%';

for everything starting with 'homo'. The percent sign is "wild card" that matches anything.

Assignment 1

How many databases are available for cat?

Quering Ensembl using SQL

The database schema of Ensembl is documented, at least on their web site. A paper describing Ensembl (Stabenau et al, Genome Research, 2004) contains an outdated figure which briefly describes the central tables. Luckily, you can get more details using the SQL command describe.

Assignment 2

Use the show and describe commands to figure what tables are related to describing genes, transcripts, and exons. What information are these tables storing and how do the tables relate to each other?

Assignment 3

How many genes are registred for human? Make a breakdown based on the column "biotype" by using the GROUP BY directive.

Assignment 4

How many pseudogenes have a non-empty description string? How many of those does not start with "similar to..."?

Assignment 5

How many transcripts are associated with the two breast-cancer associated genes BRCA1 (ENSG00000012048) and BRCA2 (ENSG00000139618)? Your solution is supposed to be written as one SQL query. Hint: Use the table gene_stable_id.

Lars Arvestad skapade sidan 27 oktober 2016

kommenterade 19 november 2016

Jag hittar ingen tabell som heter 'gene_stable_id' i DB 'homo_sapiens_core_82_38'. Menar ni kolumnen 'gene_stable_id' under tabellen 'gene_archive'?

kommenterade 22 november 2016

Questions in English will for sure get answered more quickly and others can benefit, too!

The problem was that there is no table "gene_stable_id" as referenced in the hint of assignment 5.

I think it should be "gene.stable_id", so the field "stable_id" within the table "gene" within the database "homo_sapiens_core_xx_xx". So the gene BRCA1 will have "ENSG00000012048" as its stable_id and we should look for transcripts associated to it.