Research Guides: Bioinformatics and Genomics: Data from the Command Line

NCBI Data

Command line access to NCBI using E-utilities, R, or Python

Want to pull data directly from NCBI into your data analysis pipeline? Learn how to do that here.

The NCBI makes data available through a web interface, an FTP server, and a REST API called the Entrez Utilities (E-utilities for short). You can access that data through the API at the command line, or by using an R or Python library.
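As a rough illustration of what calling the REST API directly looks like, here is a minimal Python sketch that uses only the standard library. It assumes the public E-utilities base URL and the ESearch endpoint with JSON output; the function names `build_esearch_url` and `esearch_ids` are made up for this example and are not part of any NCBI tool.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Base URL of the public E-utilities service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def build_esearch_url(db, term, retmax=20):
    """Return an ESearch URL that asks for results in JSON."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return EUTILS_BASE + "esearch.fcgi?" + urlencode(params)


def esearch_ids(db, term, retmax=20):
    """Run the search over the network and return the matching record IDs."""
    with urlopen(build_esearch_url(db, term, retmax)) as response:
        payload = json.load(response)
    # Assumes the usual JSON layout: {"esearchresult": {"idlist": [...]}}
    return payload["esearchresult"]["idlist"]
```

Calling `esearch_ids("pubmed", "CRISPR[Title]", retmax=5)` would send one request to NCBI and return up to five PubMed IDs. The R and Python libraries described on this page wrap this same kind of request and add parsing and rate limiting on top.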

The rentrez R package, last updated May 2019, provides functions to use the E-utilities API, allowing users to gather and combine data from multiple NCBI databases in the comfort of an R session or script. The package has a tutorial and an explanatory 2017 research article in The R Journal.

Python

Biopython is a set of freely available tools for biological computation. It is a distributed collaborative effort to develop Python libraries and applications which address the needs of current and future work in bioinformatics. It can extract information from Entrez (and PubMed), ExPASy, and SCOP. The code in these modules makes it easy to write Python code that interacts with the CGI scripts on these pages, so that you can get results in an easy-to-handle format. In some cases, the results can be tightly integrated with the Biopython parsers to make it even easier to extract information. Biopython is extensively documented. It was last updated in December 2018.
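To give a sense of how the Entrez access and the Biopython parsers fit together, here is a small sketch that fetches one nucleotide record and parses it as FASTA. It assumes Biopython is installed and uses its `Bio.Entrez` and `Bio.SeqIO` modules; the helper names `fetch_fasta_record` and `summarize` are invented for this example, and the accession and email address are supplied by the caller.

```python
def fetch_fasta_record(accession, email):
    """Fetch one nucleotide record from Entrez and parse it as FASTA."""
    # Imported here so this file can be loaded even where Biopython is absent.
    from Bio import Entrez, SeqIO

    # NCBI asks every E-utilities client to identify itself with an email.
    Entrez.email = email
    handle = Entrez.efetch(
        db="nucleotide", id=accession, rettype="fasta", retmode="text"
    )
    try:
        # SeqIO.read expects exactly one record in the handle.
        return SeqIO.read(handle, "fasta")
    finally:
        handle.close()


def summarize(record):
    """Return a one-line description: record ID and sequence length."""
    return f"{record.id}: {len(record.seq)} bp"
```

A call such as `summarize(fetch_fasta_record(accession, "you@example.org"))`, with an accession of your choice, makes one network request and returns a short string describing the record.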

Entrezpy is a dedicated Python library for interacting with NCBI Entrez databases [Entrez2016] via the E-utilities [Sayers2018]. It simplifies writing queries that search or download data from the Entrez databases, e.g. searching for specific sequences or publications, or fetching a genome. For more complex queries, entrezpy offers the class entrezpy.conduit.Conduit to run query pipelines or cache results. An article describing it was published in Bioinformatics in May 2019.
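The pipeline idea behind entrezpy.conduit.Conduit can be sketched roughly as follows: a search step feeds its results into a dependent fetch step. This sketch assumes entrezpy is installed and follows the usage pattern shown in the entrezpy documentation (`new_pipeline`, `add_search`, `add_fetch`, `run`); treat the exact parameters as assumptions and check them against the version you install. The helper names `build_search_params` and `run_search_then_fetch` are invented for this example.

```python
def build_search_params(term, db="nucleotide"):
    """Return E-utilities search parameters as a plain dictionary."""
    # "rettype": "count" asks the search step for the hit count only;
    # the fetch step that depends on it retrieves the actual records.
    return {"db": db, "term": term, "rettype": "count"}


def run_search_then_fetch(email, term, retmax=10):
    """Search nucleotide records, then fetch the first hits as FASTA."""
    # Imported here so this file can be loaded even where entrezpy is absent.
    import entrezpy.conduit

    conduit = entrezpy.conduit.Conduit(email)
    pipeline = conduit.new_pipeline()
    # The fetch step is linked to the search step through `dependency`.
    search_id = pipeline.add_search(build_search_params(term))
    pipeline.add_fetch(
        {"retmax": retmax, "retmode": "text", "rettype": "fasta"},
        dependency=search_id,
    )
    return conduit.run(pipeline)
```

Keeping the parameter dictionary in its own function makes the query easy to inspect or reuse before any network request is sent.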

eutils is another Python package to simplify searching, fetching, and parsing records from NCBI using their E-utilities interface.