
Changes between two versions

Showing changes in "Using BioMart" between 2015-11-02 11:18 by Lars Arvestad and 2015-11-02 13:17 by Lars Arvestad.

Using BioMart

BioMart is a system that provides a uniform programmatic interface to many different online databases. It also offers a web-based interface for creating and executing queries, with convenient access to large result sets.

In this lab, we will explore Ensembl's BioMart interface.

Preliminaries

Before you start, make sure you understand the following terms:


* Gene
* Exons and introns
* Untranslated regions (UTR)
* Alternative transcripts

Trying Ensembl BioMart

Go to Ensembl's BioMart and choose the "Ensembl Genes 82" database (or a later one, if the database has been updated since this was written). Then choose the Homo sapiens dataset.


* Try the "count" button. Ensembl should respond by claiming it has 66,017 genes for human (as of this writing). You get this many genes because Ensembl includes RNA genes and pseudogenes. How many unique protein-coding genes are there? Use filters to restrict the list to genes of type "protein coding".
* How many of the protein-coding genes have been assigned an ID by the HUGO Gene Nomenclature Committee (HGNC)? (There should be significantly fewer than in the previous question.)
* How many genes code for transmembrane proteins?
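
The filter-and-count steps above can also be expressed as a BioMart XML query. The following is only a sketch of how such a query might be assembled in Python. The dataset name `hsapiens_gene_ensembl` and the filter names `biotype` and `with_hgnc` are assumptions that can differ between Ensembl releases, so compare them with what the web interface itself shows for your query. The helper `build_count_query` is hypothetical and not part of any BioMart library.

```python
import xml.etree.ElementTree as ET


def build_count_query(dataset, filters):
    """Return BioMart query XML that asks for a count of matching genes.

    `filters` maps filter names to values. A bool value is written as a
    boolean filter (excluded="0" keeps genes that have the property),
    anything else as an ordinary value filter.
    """
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",
        "header": "0",
        "uniqueRows": "1",
        "count": "1",  # request a count instead of result rows
        "datasetConfigVersion": "0.6",
    })
    ds = ET.SubElement(query, "Dataset",
                       {"name": dataset, "interface": "default"})
    for name, value in filters.items():
        if isinstance(value, bool):
            ET.SubElement(ds, "Filter",
                          {"name": name, "excluded": "0" if value else "1"})
        else:
            ET.SubElement(ds, "Filter", {"name": name, "value": str(value)})
    # One attribute is included so the query names something to count.
    ET.SubElement(ds, "Attribute", {"name": "ensembl_gene_id"})
    return ET.tostring(query, encoding="unicode")


# Assumed dataset and filter names; verify them in the web interface.
xml_text = build_count_query(
    "hsapiens_gene_ensembl",
    {"biotype": "protein_coding", "with_hgnc": True},
)
print(xml_text)
```

Adding or removing entries in the `filters` dictionary corresponds to ticking or unticking filters in the web interface.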

Retrieving results

The page you get when you click "Results" is just a sample of the full list of resulting data.


* In what basic formats can you download your results?
* Figure out and explain what the top buttons, labeled "URL", "XML", and "Perl", are for.
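
As a hint for the last question, here is a hedged sketch of how a query written as XML could be sent to BioMart from a script rather than from the browser. The endpoint `http://www.ensembl.org/biomart/martservice` and its `query` parameter are assumptions based on how Ensembl's BioMart service has commonly been accessed, and the function names are hypothetical. Check everything against what the web interface actually gives you.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed service endpoint; it may differ for mirrors or archive sites.
MART_SERVICE = "http://www.ensembl.org/biomart/martservice"


def query_url(query_xml, base=MART_SERVICE):
    """Return a URL that carries the whole XML query as one parameter."""
    # urlencode percent-escapes characters such as < > " and spaces.
    return base + "?" + urlencode({"query": query_xml})


def run_query(query_xml, base=MART_SERVICE):
    """Send the query and return the response text (needs network access)."""
    with urlopen(query_url(query_xml, base)) as response:
        return response.read().decode("utf-8")


# A tiny placeholder query, only to show the encoding step.
example_xml = '<Query formatter="TSV"><Dataset name="example"/></Query>'
print(query_url(example_xml))
```

Only `query_url` is exercised here; `run_query` contacts the server and is left for you to try with a query of your own.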

Downloading sequences

Restrict your gene set to contain only those genes coding for proteins containing a domain with the Pfam identifier PF00104. This should give you 53 genes.


* What sequence format is used for downloading sequences?
* In the "attributes" settings, you can choose what kind of sequences you download. What is the difference between "unspliced transcript" and "unspliced gene"?
* What is the difference between "unspliced transcript" and "cDNA"?
* Suppose you want to work with genes' coding regions. If you download the "coding sequence" for your 53 genes, how many sequences do you get?
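
For the last question, counting by hand quickly gets tedious. The sketch below counts the records in a downloaded sequence file, assuming the download is FASTA-like text in which every record starts with a `>` header line and the header fields are separated by `|`. Both the header layout and the file name `mart_export.txt` are assumptions, and the function `summarize_fasta` is hypothetical.

```python
def summarize_fasta(lines, sep="|"):
    """Count records and distinct first header fields.

    Returns (number of header lines, number of distinct values in the
    first header field). The first field is a gene ID only if that
    attribute was selected first when exporting.
    """
    n_records = 0
    first_fields = set()
    for line in lines:
        if line.startswith(">"):
            n_records += 1
            first_fields.add(line[1:].strip().split(sep)[0])
    return n_records, len(first_fields)


# Made-up records, for illustration only.
sample = [
    ">GENE_A|TRANSCRIPT_1",
    "ATGAAATTTTAA",
    ">GENE_A|TRANSCRIPT_2",
    "ATGCCCTAA",
    ">GENE_B|TRANSCRIPT_3",
    "ATGGGGTAG",
]
print(summarize_fasta(sample))
```

For a real download, pass an open file instead of the list, for example `with open("mart_export.txt") as fh: print(summarize_fasta(fh))`, and compare the two numbers with the 53 genes you started from.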