The eUtils are 7 server-side programs that provide an interface to the Entrez database system.
They use a fixed URL syntax.
Access is provided by:
Each database at NCBI refers to the records within it by an integer ID called a UID.
UIDs are used for both data input and output, so it is often critical, especially for advanced data pipelines, to know how to find the UIDs associated with the desired data before beginning a project with the eUtils.
Core Steps:
1. (you) Submit a query to a specific database using ESearch
2. (NCBI) Assemble list of UIDs that match query
3. (you) Retrieve summary (DocSum) for each UID using ESummary
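The three core steps can be sketched in Perl as follows. This is only an illustration under stated assumptions: the helper name extract_uids, the abbreviated sample XML, and the second UID (12345678) are made up for the example, and the HTTPS base URL is the one NCBI currently serves. No network request is made here; in practice the XML would come from fetching the ESearch URL.

```perl
use strict;
use warnings;

my $base = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/';

# Step 1 (you): the ESearch URL for a query against one database.
my $esearch_url = $base . 'esearch.fcgi?db=pubmed&term=stem+cells';

# Step 2 (NCBI): ESearch answers with XML listing the matching UIDs.
# Abbreviated, made-up response used here instead of a live request.
my $sample_xml = '<eSearchResult><Count>2</Count><IdList>'
               . '<Id>11237011</Id><Id>12345678</Id>'
               . '</IdList></eSearchResult>';

# Pull every <Id> value out of the <IdList> element.
sub extract_uids {
    my ($xml) = @_;
    my ($id_list) = $xml =~ m{<IdList>(.*?)</IdList>}s;
    return () unless defined $id_list;
    my @uids = $id_list =~ m{<Id>(\d+)</Id>}g;
    return @uids;
}

# Step 3 (you): request the DocSums for those UIDs with ESummary.
my @uids = extract_uids($sample_xml);
my $esummary_url = $base . 'esummary.fcgi?db=pubmed&id=' . join(',', @uids);

print "$esearch_url\n$esummary_url\n";
```

A regular expression is enough for this flat structure; a real pipeline would more likely use an XML parser.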
EGQuery is a global version of ESearch to search all Entrez databases.
EInfo provides information about a given database.
EFetch generates formatted output for a list of input UIDs (for example, fetching FASTA or GenBank records).
ELink generates a list of UIDs in a specified database that are linked to a set of input UIDs.
Search results can be stored temporarily at NCBI for use in subsequent queries. This is done by uploading UIDs to the History server (with EPost, or by running ESearch with usehistory=y), which returns a query key (query_key) and a web environment string (WebEnv). These two values can be used in place of a list of UIDs in ESummary, EFetch, and ELink.
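As a rough sketch of how the two History values might be picked out of an ESearch response, the Perl below reads them from a sample XML string. The helper name parse_history and the sample response (including the placeholder WebEnv value) are assumptions for illustration; a real WebEnv is a long opaque string returned by NCBI.

```perl
use strict;
use warnings;

# Extract the total hit count and the two History values from ESearch XML.
# The first <Count> element in the response holds the total number of hits.
sub parse_history {
    my ($xml) = @_;
    my ($count)  = $xml =~ m{<Count>(\d+)</Count>};
    my ($key)    = $xml =~ m{<QueryKey>(\d+)</QueryKey>};
    my ($webenv) = $xml =~ m{<WebEnv>([^<]+)</WebEnv>};
    return (count => $count, query_key => $key, webenv => $webenv);
}

# Made-up, abbreviated response; not real NCBI output.
my $sample = '<eSearchResult><Count>42</Count><RetMax>20</RetMax>'
           . '<RetStart>0</RetStart><QueryKey>1</QueryKey>'
           . '<WebEnv>EXAMPLE_WEBENV</WebEnv></eSearchResult>';

my %h = parse_history($sample);

# The two values stand in for an explicit id= list in a later call.
print "efetch.fcgi?db=pubmed&query_key=$h{query_key}&WebEnv=$h{webenv}\n";
```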
NOTE: NCBI recommends that you use the &tool and &email parameters to identify all of your eUtils URLs. For &tool, choose a value that uniquely identifies your software. If your name is John Smith, use, for example, &tool=johnsmithsoft. If your email address is jsmith@hotmail.com, use &email=jsmith@hotmail.com. This email address is used only to inform the creator of the software of any problems.
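One way to follow this recommendation is to build every URL through a single helper that always appends tool and email. The sketch below assumes two hypothetical helpers, encode_param and eutils_url, and reuses the John Smith values from the note; it handles ASCII parameter values only and uses the HTTPS base URL that NCBI currently serves.

```perl
use strict;
use warnings;

my $BASE  = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/';
my %IDENT = (tool => 'johnsmithsoft', email => 'jsmith@hotmail.com');

# Percent-encode everything except unreserved URL characters.
# Sketch only: assumes the value contains ASCII characters.
sub encode_param {
    my ($value) = @_;
    $value =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord($1))/ge;
    return $value;
}

# Build an eUtils URL for $program, always adding tool and email.
sub eutils_url {
    my ($program, %params) = @_;
    %params = (%params, %IDENT);
    my $query = join '&',
        map { "$_=" . encode_param($params{$_}) } sort keys %params;
    return $BASE . $program . '.fcgi?' . $query;
}

print eutils_url('einfo', db => 'pubmed'), "\n";
print eutils_url('esearch', db => 'pubmed', term => 'stem cells'), "\n";
```

Sorting the parameter names keeps the generated URLs stable, which makes them easier to log and compare.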
EInfo:
Base URL: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi?
Databases: db=pubmed
Example: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi?db=pubmed
EGQuery:
Base URL: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/egquery.fcgi?
Search Terms: term=search strategy (ex. term=stem+cells)
Example: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/egquery.fcgi?term=stem+cells
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/egquery.fcgi?term=brca1+OR+brca2
ESearch:
Base URL: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?
Databases: db=database name
Search Terms: term=search strategy (ex. term=asthma[mh]+OR+hay+fever[mh])
Retrieval Mode: retmode=xml
Example: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=journals&term=obstetrics
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=cancer&retmax=100&usehistory=y
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=nucleotide&term=biomol+trna[prop]
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=protein&term=200020[molecular+weight]
EPost:
Base URL: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/epost.fcgi?
Databases: db=database name
ID: id=UID
Example:
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/epost.fcgi?db=pubmed&id=11237011
EFetch:
Base URL: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?
Databases: db=database name
ELink:
Base URL: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?
Databases: dbfrom=source database name (the database of the input UIDs), db=target database name
ID: id=UID
Example: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=nucleotide&db=protein&id=48819,7140345
Sample applications: http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=coursework.section.sample-apps
I have a list of nucleotide GI numbers and I want the corresponding accession numbers.
Solution: Use EFetch with &rettype=acc
URL: efetch.fcgi?db=nucleotide&id=$gi_list&rettype=acc
I have a list of genome Accession numbers ($acc_list) and I want the sequences in FASTA format.
Solution: Use EFetch with &rettype=fasta
URL: efetch.fcgi?db=genome&id=$acc_list&rettype=fasta
I want to retrieve an arbitrary number of formatted records that match an Entrez query.
Solution: First, run ESearch in Web Environment mode to retrieve the total number of UIDs that match the Entrez query (<Count> tag in the ESearch output). Then store this number into $count, and store the values of WebEnv and query_key into $Webenv and $key. Next, run EFetch multiple times, each time retrieving a batch of size $retmax (for example, $retmax = 500). Accomplish this by incrementing $retstart iteratively in a “for” loop to retrieve successive batches of records of size $retmax:
URL 1: esearch.fcgi?db=database&term=$query&usehistory=y
URL 2+: produced by the following loop:
Perl:
use LWP::Simple;
# Fetch successive batches of $retmax records from the History server
for ($retstart = 0; $retstart < $count; $retstart += $retmax) {
    $efetch_url = $base . "db=$db&WebEnv=$Webenv&query_key=$key";
    $efetch_url .= "&retstart=$retstart&retmax=$retmax";
    $efetch_out = get($efetch_url);
    print "$efetch_out";
}
where $base = http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?, $db is the database, and $efetch_url is a string containing the EFetch URL. This Perl code assumes that the LWP::Simple module is installed; the module provides the get function used to retrieve data from a URL.
I want to download a flatfile with the full sequence of an assembly (e.g., a contig).
Solution: Use EFetch with &rettype=gbwithparts
URL: efetch.fcgi?db=nucleotide&id=27479347&rettype=gbwithparts
I have a list of protein GI numbers from a BLAST search and I want to download the document summaries of only those protein records that are mammalian sequences with annotated SNPs.
Solution: Use EPost to upload the GI list, then use ESearch to limit the list, followed by ESummary to download the document summaries.
URL 1: epost.fcgi?db=protein&id=$gi_list
Result: In Perl, store WebEnv as $Webenv1, query_key as $key1
URL 2: esearch.fcgi?db=protein&term=%23$key1+AND+mammalia[organism]+AND+protein+snp[filter]&usehistory=y&WebEnv=$Webenv1
Result: In Perl, store WebEnv as $Webenv2, query_key as $key2
Note: %23 is the URL-encoded form of the # symbol, so %23$key1 resolves to # followed by the query key (for example, #2 if $key1 is 2).
URL 3: esummary.fcgi?db=protein&WebEnv=$Webenv2&query_key=$key2
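The EPost, ESearch, ESummary chain above can be sketched as three small URL builders. The sub names (epost_url, limit_url, esummary_url) and the placeholder WebEnv strings are assumptions for illustration, the two GI numbers are borrowed from the other examples in these notes, and no request is actually sent; in a real script each WebEnv and query key would be parsed from the previous response.

```perl
use strict;
use warnings;

my $base = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/';

# URL 1: upload the GI list to the History server with EPost.
sub epost_url {
    my (@gis) = @_;
    return $base . 'epost.fcgi?db=protein&id=' . join(',', @gis);
}

# URL 2: restrict the posted set. %23 is the encoded "#" that refers
# to an earlier query key inside the same WebEnv.
sub limit_url {
    my ($webenv, $key) = @_;
    return $base . 'esearch.fcgi?db=protein'
         . "&term=%23$key+AND+mammalia[organism]+AND+protein+snp[filter]"
         . "&usehistory=y&WebEnv=$webenv";
}

# URL 3: fetch the DocSums of the limited set.
sub esummary_url {
    my ($webenv, $key) = @_;
    return $base . "esummary.fcgi?db=protein&WebEnv=$webenv&query_key=$key";
}

print epost_url(2208903, 4557271), "\n";
print limit_url('WEBENV_FROM_EPOST', 1), "\n";
print esummary_url('WEBENV_FROM_ESEARCH', 2), "\n";
```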
I want to find all available 3D structure records similar to protein BAA20519.
Solution: Use ESearch to find the GI number, then ELink to find related sequences to that protein. Then use ELink again to find linked MMDB-IDs, and finally ESummary to download the document summaries of the structure records.
URL 1: esearch.fcgi?db=protein&term=BAA20519
Result: Find GI 2208903.
URL 2: elink.fcgi?dbfrom=protein&db=protein&id=2208903
Result: Find 1084 related sequences, extract into $gi_list1
URL 3: elink.fcgi?dbfrom=protein&db=structure&id=$gi_list1
Result: Find 9 related structures, extract into $gi_list2
URL 4: esummary.fcgi?db=structure&id=$gi_list2
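Each "extract into $gi_list" step above means reading the linked UIDs out of the ELink XML. A minimal sketch of that extraction follows; the helper name linked_uids and the sample response (including the linked IDs 1111 and 2222) are made up for illustration and are not real NCBI output.

```perl
use strict;
use warnings;

# Return the linked UIDs found in the <LinkSetDb> blocks of ELink XML.
# The input UIDs live in a separate <IdList> element and are skipped.
sub linked_uids {
    my ($xml) = @_;
    my @uids;
    while ($xml =~ m{<LinkSetDb>(.*?)</LinkSetDb>}gs) {
        my $block = $1;
        push @uids, $block =~ m{<Link>\s*<Id>(\d+)</Id>}g;
    }
    return @uids;
}

# Made-up, abbreviated ELink response for one input protein GI.
my $sample = '<eLinkResult><LinkSet><DbFrom>protein</DbFrom>'
           . '<IdList><Id>2208903</Id></IdList>'
           . '<LinkSetDb><DbTo>structure</DbTo>'
           . '<Link><Id>1111</Id></Link><Link><Id>2222</Id></Link>'
           . '</LinkSetDb></LinkSet></eLinkResult>';

my @ids = linked_uids($sample);

# The extracted list feeds the next call in the chain.
print 'esummary.fcgi?db=structure&id=' . join(',', @ids) . "\n";
```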
I want to download all mRNAs from green plants that are related at the protein level to human NM_001126, in flatfile format.
Motivation: For finding distant homologs, protein BLAST searches are generally more sensitive than nucleotide BLAST searches. In this specific case, a nucleotide BLAST search finds no significant matches to NM_001126 from green plants, whereas TBLASTX will find several homologous sequences. However, TBLASTX is the most time-consuming version of BLAST, and therefore using the pre-computed results in Entrez saves significant computing time.
Solution: Use ESearch to retrieve the record for NM_001126, and then use ELink to find the linked protein sequence. Then use ELink again to find all related sequences to that protein, and then use ELink a third time to find all nucleotide records linked to those related proteins and then limit them to mRNAs from green plants. Finally, download the formatted data with EFetch.
URL 1: esearch.fcgi?db=nucleotide&term=NM_001126
Result: Find GI = 4557270.
URL 2: elink.fcgi?dbfrom=nucleotide&db=protein&id=4557270
Result: Find GI = 4557271.
URL 3: elink.fcgi?dbfrom=protein&db=protein&id=4557271
Result: Extract the 507 GI numbers into $gi_list1, and if desired, the raw BLAST scores reported by ELink into @scores
URL 4: elink.fcgi?dbfrom=protein&db=nucleotide&id=$gi_list1&term=biomol+mrna[properties]+AND+viridiplantae[organism]
Result: Extract the 7 GI numbers into $gi_list2
URL 5: efetch.fcgi?db=nucleotide&id=$gi_list2&rettype=gb
Result: Download the 7 plant mRNAs, none of which are found using Related Sequences to NM_001126