E-Utilities

Seven server-side programs that provide an interface to the Entrez database system:

  1. EInfo
  2. EGQuery
  3. ESearch
  4. ESummary
  5. EPost
  6. EFetch
  7. ELink

All of them use a fixed URL syntax.

Access takes two steps:

  1. Post a URL to NCBI
  2. Retrieve results in XML

Each database at NCBI refers to records within it by an integer ID called a UID.

UIDs are used for both data input and output. It is therefore often critical, especially for advanced data pipelines, to know how to find the UIDs associated with the desired data before beginning a project with the eUtils.

EGQuery, ESearch, and ESummary

Core Steps:

1. (you) Submit a query to a specific database using ESearch
2. (NCBI) Assemble list of UIDs that match query
3. (you) Retrieve summary (DocSum) for each UID using ESummary

EGQuery is a global version of ESearch to search all Entrez databases.
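
The three core steps can be sketched in Python as follows. This is a minimal illustration, not NCBI's own client: the function names are hypothetical, the base URL is the one used throughout these notes (written with https), and the sample reply is hand-written so that the sketch runs without contacting NCBI.

```python
import urllib.parse
import xml.etree.ElementTree as ET

# Base URL from these notes, written with https (an assumption of this sketch).
BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"

def esearch_url(db, term, retmax=20):
    """Step 1: build the ESearch URL for a query against one database."""
    params = {"db": db, "term": term, "retmax": retmax}
    return BASE + "esearch.fcgi?" + urllib.parse.urlencode(params)

def parse_uids(esearch_xml):
    """Step 2: pull the list of matching UIDs out of the ESearch XML reply."""
    root = ET.fromstring(esearch_xml)
    return [node.text for node in root.findall("./IdList/Id")]

def esummary_url(db, uids):
    """Step 3: build the ESummary URL for the UIDs found in step 2."""
    params = {"db": db, "id": ",".join(uids)}
    # safe="," keeps the comma-separated UID list readable in the URL.
    return BASE + "esummary.fcgi?" + urllib.parse.urlencode(params, safe=",")

# Hand-written sample reply (not real NCBI output).
SAMPLE_REPLY = """<eSearchResult>
  <Count>2</Count>
  <IdList><Id>11850928</Id><Id>11482001</Id></IdList>
</eSearchResult>"""

print(esearch_url("pubmed", "stem cells"))
print(esummary_url("pubmed", parse_uids(SAMPLE_REPLY)))
```

In a real script the ESearch URL would be fetched (for example with urllib.request.urlopen) and the reply passed to parse_uids.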

EInfo, EFetch and ELink

EInfo provides information about a given database
EFetch generates formatted output for a list of input UIDs (for example, fetching FASTA or GenBank records)
ELink generates a list of UIDs in a specified database that are linked to a set of input UIDs

EPost

Sets of UIDs can be stored temporarily at NCBI for use in subsequent queries. This is done by uploading the UIDs to the History server, which returns a query key (query_key) and a web environment string (WebEnv). These two values can be used in place of a UID list in ESummary, EFetch, and ELink.
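
A small Python sketch of how the two returned values might be used. The function names and the WebEnv string are made up for illustration, and the sample reply is hand-written rather than real NCBI output.

```python
import urllib.parse
import xml.etree.ElementTree as ET

# Base URL from these notes, written with https (an assumption of this sketch).
BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"

def parse_history(reply_xml):
    """Return (query_key, WebEnv) from an EPost reply."""
    root = ET.fromstring(reply_xml)
    return root.findtext("QueryKey"), root.findtext("WebEnv")

def history_url(utility, db, query_key, webenv):
    """Build a URL that refers to the stored UID set instead of listing UIDs."""
    params = {"db": db, "query_key": query_key, "WebEnv": webenv}
    return BASE + utility + ".fcgi?" + urllib.parse.urlencode(params)

# Hand-written sample reply; EXAMPLE_WEBENV is a placeholder, not a real value.
SAMPLE_REPLY = """<ePostResult>
  <QueryKey>1</QueryKey>
  <WebEnv>EXAMPLE_WEBENV</WebEnv>
</ePostResult>"""

key, env = parse_history(SAMPLE_REPLY)
print(history_url("esummary", "pubmed", key, env))
```

The same history_url helper would serve EFetch and ELink by changing the first argument.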

Guidelines for Constructing URLs

NOTE: NCBI recommends that you include the &tool and &email parameters in all of your eUtils URLs to identify your software. For &tool, choose a value that uniquely identifies your software. If your name is John Smith, use, for example, &tool=johnsmithsoft. If your email address is jsmith@hotmail.com, use &email=jsmith@hotmail.com. This email address is used only to inform the creator of the software of any problems.
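
One way to make sure every URL carries these two parameters is to add them in a single helper. The Python sketch below assumes a hypothetical helper named identify; the tool name and email address are the example values from the note above.

```python
import urllib.parse

# Base URL from these notes, written with https (an assumption of this sketch).
BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"

def identify(params, tool="johnsmithsoft", email="jsmith@hotmail.com"):
    """Return a copy of params with the tool and email parameters added."""
    tagged = dict(params)
    tagged["tool"] = tool
    tagged["email"] = email
    return tagged

# safe="@" keeps the email address readable instead of encoding it as %40.
url = BASE + "einfo.fcgi?" + urllib.parse.urlencode(identify({"db": "pubmed"}), safe="@")
print(url)
```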

EInfo

Base URL: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi?
Databases: db=pubmed
Example: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi?db=pubmed

EGQuery

Base URL: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/egquery.fcgi?
Search Terms: term=search strategy (ex. term=stem+cells)
Example: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/egquery.fcgi?term=stem+cells
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/egquery.fcgi?term=brca1+OR+brca2

ESearch

Base URL: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?
Databases: db=database name
Search Terms: term=search strategy (ex. term=asthma[mh]+OR+hay+fever[mh])
Retrieval Mode: retmode=xml
Example: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=journals&term=obstetrics
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=cancer&retmax=100&usehistory=y
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=nucleotide&term=biomol+trna[prop]
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=protein&term=200020[molecular+weight]

ESummary

Base URL: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?
Databases: db=database name
ID: id=UID
Example: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=pubmed&id=11850928,11482001&retmode=xml
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=journals&id=27731,439,735,905
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=protein&id=28800982,28628843&retmode=xml

EPost

Base URL: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/epost.fcgi?
Databases: db=database name
ID: id=UID
Example:
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/epost.fcgi?db=pubmed&id=11237011

EFetch

Base URL: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?
Databases: db=database name

ELink

Base URL: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?
Databases: dbfrom=source database name, db=target database name
ID: id=UID
Example: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=nucleotide&db=protein&id=48819,7140345
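
A hedged Python sketch of building the ELink example URL and reading linked UIDs from a reply. The function names are hypothetical. The reply layout (LinkSet / LinkSetDb / Link / Id) reflects my understanding of the ELink XML and should be checked against a real response; the linked IDs in the sample are invented.

```python
import urllib.parse
import xml.etree.ElementTree as ET

# Base URL from these notes, written with https (an assumption of this sketch).
BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"

def elink_url(dbfrom, db, uids):
    """Build an ELink URL from a source database, a target database, and UIDs."""
    params = {"dbfrom": dbfrom, "db": db, "id": ",".join(uids)}
    return BASE + "elink.fcgi?" + urllib.parse.urlencode(params, safe=",")

def parse_linked_uids(elink_xml):
    """Collect the UIDs in the target database from an ELink reply."""
    root = ET.fromstring(elink_xml)
    return [node.text for node in root.findall("./LinkSet/LinkSetDb/Link/Id")]

# Hand-written sample reply; the linked IDs 111 and 222 are placeholders.
SAMPLE_REPLY = """<eLinkResult>
  <LinkSet>
    <DbFrom>nucleotide</DbFrom>
    <IdList><Id>48819</Id></IdList>
    <LinkSetDb>
      <DbTo>protein</DbTo>
      <Link><Id>111</Id></Link>
      <Link><Id>222</Id></Link>
    </LinkSetDb>
  </LinkSet>
</eLinkResult>"""

print(elink_url("nucleotide", "protein", ["48819", "7140345"]))
print(parse_linked_uids(SAMPLE_REPLY))
```

Note that the input UIDs under IdList are deliberately not matched by the path used in parse_linked_uids, so only the linked records are returned.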

Basic Pipelines

  1. Retrieving data records matching an Entrez query
    1. ESearch -> ESummary
    2. ESearch -> EFetch
  2. Retrieving data records matching a list of UIDs
    1. EPost -> ESummary
    2. EPost -> EFetch
  3. Finding IDs linked to records matching an Entrez query
    1. ESearch -> ELink
  4. Finding IDs linked to other UIDs
    1. EPost -> ELink

Advanced Pipelines

  1. Retrieving data records in database B linked to records in database A matching an Entrez query
    1. ESearch -> ELink -> ESummary
    2. ESearch -> ELink -> EFetch
  2. Retrieving data records from a subset of an ID list defined by an Entrez query
    1. EPost -> ESearch -> ESummary
    2. EPost -> ESearch -> EFetch
  3. Retrieving a subset of data records, defined by an Entrez query, from a set of records in database B linked to a list of UIDs in database A
    1. ELink -> EPost -> ESearch -> ESummary
    2. ELink -> EPost -> ESearch -> EFetch

Sample Applications

http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=coursework.section.sample-apps

Application 1: Converting GI Numbers to Accession Numbers

I have a list of nucleotide GI numbers and I want the corresponding accession numbers.

Solution: Use EFetch with &rettype=acc

URL: efetch.fcgi?db=nucleotide&id=$gi_list&rettype=acc

Application 2: Converting Accession Numbers to Data

I have a list of genome Accession numbers ($acc_list) and I want the sequences in FASTA format.

Solution: Use EFetch with &rettype=fasta

URL: efetch.fcgi?db=genome&id=$acc_list&rettype=fasta
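
Applications 1 and 2 differ only in the database, the ID list, and the rettype value, so one small helper could cover both. The Python sketch below is illustrative: efetch_url is a hypothetical name, the GI numbers are borrowed from the ELink example earlier in these notes, and the accession strings are placeholders.

```python
import urllib.parse

# Base URL from these notes, written with https (an assumption of this sketch).
BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"

def efetch_url(db, ids, rettype):
    """Build an EFetch URL for a list of IDs and a requested output type."""
    params = {"db": db, "id": ",".join(str(i) for i in ids), "rettype": rettype}
    return BASE + "efetch.fcgi?" + urllib.parse.urlencode(params, safe=",")

# Application 1: GI numbers in, accession numbers out.
print(efetch_url("nucleotide", [48819, 7140345], "acc"))

# Application 2: accession numbers in, FASTA out (placeholder accessions).
print(efetch_url("genome", ["ACCESSION_1", "ACCESSION_2"], "fasta"))
```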

Application 3: Retrieving Large Datasets

I want to retrieve an arbitrary number of formatted records that match an Entrez query.

Solution: First, run ESearch in Web Environment mode to retrieve the total number of UIDs that match the Entrez query (<Count> tag in the ESearch output). Then store this number into $count, and store the values of WebEnv and query_key into $Webenv and $key. Next, run EFetch multiple times, each time retrieving a batch of size $retmax (for example, $retmax = 500). Accomplish this by incrementing $retstart iteratively in a “for” loop to retrieve successive batches of records of size $retmax:

URL 1: esearch.fcgi?db=database&term=$query&usehistory=y

URL 2+: produced by the following loop:

Perl:

    use LWP::Simple;    # provides get()

    # Fetch successive batches of $retmax records until $count is reached.
    for ($retstart = 0; $retstart < $count; $retstart += $retmax) {
        $efetch_url = $base . "db=$db&WebEnv=$Webenv&query_key=$key";
        $efetch_url .= "&retstart=$retstart&retmax=$retmax";
        $efetch_out = get($efetch_url);
        print "$efetch_out";
    }

where $base = http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?, $db is the database, and $efetch_url is a string containing the EFetch URL. This Perl code assumes that the LWP::Simple module is installed. This module allows the use of the get command for retrieving data from a URL.
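
For comparison, the same batching idea can be sketched in Python. This version only generates the EFetch URLs and does not download anything. The function names are hypothetical and the sample ESearch reply is hand-written with a placeholder WebEnv.

```python
import urllib.parse
import xml.etree.ElementTree as ET

# Base URL from these notes, written with https (an assumption of this sketch).
BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"

def parse_search(esearch_xml):
    """Return (count, WebEnv, query_key) from an ESearch reply run with usehistory=y."""
    root = ET.fromstring(esearch_xml)
    return (int(root.findtext("Count")),
            root.findtext("WebEnv"),
            root.findtext("QueryKey"))

def batch_urls(db, count, webenv, query_key, retmax=500):
    """Yield one EFetch URL per batch of retmax records."""
    for retstart in range(0, count, retmax):
        params = {"db": db, "WebEnv": webenv, "query_key": query_key,
                  "retstart": retstart, "retmax": retmax}
        yield BASE + "efetch.fcgi?" + urllib.parse.urlencode(params)

# Hand-written sample reply; EXAMPLE_WEBENV is a placeholder, not a real value.
SAMPLE_REPLY = """<eSearchResult>
  <Count>1200</Count>
  <QueryKey>1</QueryKey>
  <WebEnv>EXAMPLE_WEBENV</WebEnv>
</eSearchResult>"""

count, webenv, key = parse_search(SAMPLE_REPLY)
for url in batch_urls("pubmed", count, webenv, key):
    print(url)
```

With a count of 1200 and a batch size of 500, the loop produces three URLs with retstart values 0, 500, and 1000; the last batch simply returns fewer records.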

Application 4: Downloading Contigs

I want to download a flatfile with the full sequence of an assembly (e.g., a contig).

Solution: Use EFetch with &rettype=gbwithparts

URL: efetch.fcgi?db=nucleotide&id=27479347&rettype=gbwithparts

Application 5: Limiting and Converting GI Lists

I have a list of protein GI numbers from a BLAST search and I want to download the document summaries of only those protein records that are mammalian sequences with annotated SNPs.

Solution: Use EPost to upload the GI list, then use ESearch to limit the list, followed by ESummary to download the document summaries.

URL 1: epost.fcgi?db=protein&id=$gi_list

Result: In Perl, store WebEnv as $Webenv1, query_key as $key1

URL 2: esearch.fcgi?db=protein&term=%23$key1+AND+mammalia[organism]+AND+protein+snp[filter]&usehistory=y&WebEnv=$Webenv1

Result: In Perl, store WebEnv as $Webenv2, query_key as $key2

Note: The %23 resolves to the # symbol, so that %23$key1 becomes #$key1 (for example, #2 when $key1 is 2).

URL 3: esummary.fcgi?db=protein&WebEnv=$Webenv2&query_key=$key2
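
The %23 encoding in URL 2 does not have to be written by hand; a URL-encoding routine produces it from a plain # character. The Python sketch below illustrates this under stated assumptions: limit_posted_set is a hypothetical helper, and the query key and WebEnv values are placeholders.

```python
import urllib.parse

# Base URL from these notes, written with https (an assumption of this sketch).
BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"

def limit_posted_set(db, query_key, webenv, extra_terms):
    """Build an ESearch URL that ANDs a stored UID set with further Entrez terms."""
    term = "#" + str(query_key) + " AND " + extra_terms
    params = {"db": db, "term": term, "usehistory": "y", "WebEnv": webenv}
    # urlencode turns "#" into %23 and spaces into "+";
    # safe="[]" leaves the field tags such as [organism] readable.
    return BASE + "esearch.fcgi?" + urllib.parse.urlencode(params, safe="[]")

url = limit_posted_set("protein", 2, "EXAMPLE_WEBENV",
                       "mammalia[organism] AND protein snp[filter]")
print(url)
```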

Application 6: Finding Related Records in Other Entrez Databases

I want to find all available 3D structure records similar to protein BAA20519.

Solution: Use ESearch to find the GI number, then ELink to find sequences related to that protein. Then use ELink again to find linked MMDB-IDs, and finally ESummary to download the document summaries of the structure records.

URL 1: esearch.fcgi?db=protein&term=BAA20519

Result: Find GI 2208903.

URL 2: elink.fcgi?dbfrom=protein&db=protein&id=2208903

Result: Find 1084 related sequences, extract into $gi_list1

URL 3: elink.fcgi?dbfrom=protein&db=structure&id=$gi_list1

Result: Find 9 related structures, extract into $gi_list2

URL 4: esummary.fcgi?db=structure&id=$gi_list2

Application 7: Entrez TBLASTX

I want to download all mRNAs from green plants that are related at the protein level to human NM_001126, in flatfile format.

Motivation: For finding distant homologs, protein BLAST searches are generally more sensitive than nucleotide BLAST searches. In this specific case, a nucleotide BLAST search finds no significant matches to NM_001126 from green plants, whereas TBLASTX will find several homologous sequences. However, TBLASTX is the most time-consuming version of BLAST, and therefore using the pre-computed results in Entrez saves significant computing time.

Solution: Use ESearch to retrieve the record for NM_001126, and then use ELink to find the linked protein sequence. Use ELink again to find all sequences related to that protein, and a third time to find all nucleotide records linked to those related proteins, limiting them to mRNAs from green plants. Finally, download the formatted data with EFetch.

URL 1: esearch.fcgi?db=nucleotide&term=NM_001126

Result: Find GI = 4557270.

URL 2: elink.fcgi?dbfrom=nucleotide&db=protein&id=4557270

Result: Find GI = 4557271.

URL 3: elink.fcgi?dbfrom=protein&db=protein&id=4557271

Result: Extract the 507 GI numbers into $gi_list1, and if desired, the raw BLAST scores reported by ELink into @scores

URL 4: elink.fcgi?dbfrom=protein&db=nucleotide&id=$gi_list1&term=biomol+mrna[properties]+AND+viridiplantae[organism]

Result: Extract the 7 GI numbers into $gi_list2

URL 5: efetch.fcgi?db=nucleotide&id=$gi_list2&rettype=gb

Result: Download the 7 plant mRNAs, none of which are found using Related Sequences to NM_001126