Biopython Concise Tutorial

Biopython - Entrez Database

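Entrez is an online search system provided by NCBI. It provides access to nearly all known molecular biology databases through an integrated global query that supports Boolean operators and field search. It returns results from all the databases, including the number of hits from each database and records with links to the originating database.

Some of the popular databases that can be accessed through Entrez are listed below: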

  1. Pubmed

  2. Pubmed Central

  3. Nucleotide (GenBank Sequence Database)

  4. Protein (Sequence Database)

  5. Genome (Whole Genome Database)

  6. Structure (Three-dimensional macromolecular structures)

  7. Taxonomy (Organisms in GenBank)

  8. SNP (Single Nucleotide Polymorphism)

  9. UniGene (Gene-oriented clusters of transcript sequences)

  10. CDD (Conserved Protein Domain Database)

  11. 3D Domains (Domains from Entrez Structure)

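In addition to the databases above, Entrez provides many more databases that support field searches.

Biopython provides an Entrez-specific module, Bio.Entrez, to access the Entrez databases. Let us learn how to access Entrez using Biopython in this chapter.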

Database Connection Steps

To use the Entrez features, import the following module:

>>> from Bio import Entrez

Next, set your email address so that NCBI can identify who is making the requests:

>>> Entrez.email = '<youremail>'

Then, set the Entrez tool parameter. By default, it is Biopython.

>>> Entrez.tool = 'Demoscript'
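
The email and tool values are sent with each request as query parameters to the NCBI E-utilities service, which Bio.Entrez wraps. As a rough illustration only (this is not how Bio.Entrez is implemented internally), the following standard-library sketch builds such a URL. The helper name build_eutils_url and the example address are hypothetical.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def build_eutils_url(endpoint, email, tool, **params):
    """Return an E-utilities URL carrying the identifying parameters.

    Hypothetical helper for illustration; Bio.Entrez assembles its
    requests for you once Entrez.email and Entrez.tool are set.
    """
    query = {"tool": tool, "email": email}
    query.update(params)  # e.g. db="pubmed", term="genome"
    return EUTILS_BASE + endpoint + "?" + urlencode(query)


url = build_eutils_url("einfo.fcgi", "you@example.org", "Demoscript")
print(url)
```

Note that urlencode percent-encodes the address, so the "@" does not appear literally in the printed URL.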

Now, call the einfo function, which reports index term counts, last update, and available links for each database. Called without arguments, it returns the list of available database names, as shown below:

>>> info = Entrez.einfo()

The einfo function returns a handle whose read method gives access to the response, as shown below:

>>> data = info.read()
>>> print(data)
<?xml version = "1.0" encoding = "UTF-8" ?>
<!DOCTYPE eInfoResult PUBLIC "-//NLM//DTD einfo 20130322//EN"
   "https://eutils.ncbi.nlm.nih.gov/eutils/dtd/20130322/einfo.dtd">
<eInfoResult>
   <DbList>
      <DbName>pubmed</DbName>
      <DbName>protein</DbName>
      <DbName>nuccore</DbName>
      <DbName>ipg</DbName>
      <DbName>nucleotide</DbName>
      <DbName>nucgss</DbName>
      <DbName>nucest</DbName>
      <DbName>structure</DbName>
      <DbName>sparcle</DbName>
      <DbName>genome</DbName>
      <DbName>annotinfo</DbName>
      <DbName>assembly</DbName>
      <DbName>bioproject</DbName>
      <DbName>biosample</DbName>
      <DbName>blastdbinfo</DbName>
      <DbName>books</DbName>
      <DbName>cdd</DbName>
      <DbName>clinvar</DbName>
      <DbName>clone</DbName>
      <DbName>gap</DbName>
      <DbName>gapplus</DbName>
      <DbName>grasp</DbName>
      <DbName>dbvar</DbName>
      <DbName>gene</DbName>
      <DbName>gds</DbName>
      <DbName>geoprofiles</DbName>
      <DbName>homologene</DbName>
      <DbName>medgen</DbName>
      <DbName>mesh</DbName>
      <DbName>ncbisearch</DbName>
      <DbName>nlmcatalog</DbName>
      <DbName>omim</DbName>
      <DbName>orgtrack</DbName>
      <DbName>pmc</DbName>
      <DbName>popset</DbName>
      <DbName>probe</DbName>
      <DbName>proteinclusters</DbName>
      <DbName>pcassay</DbName>
      <DbName>biosystems</DbName>
      <DbName>pccompound</DbName>
      <DbName>pcsubstance</DbName>
      <DbName>pubmedhealth</DbName>
      <DbName>seqannot</DbName>
      <DbName>snp</DbName>
      <DbName>sra</DbName>
      <DbName>taxonomy</DbName>
      <DbName>biocollections</DbName>
      <DbName>unigene</DbName>
      <DbName>gencoll</DbName>
      <DbName>gtr</DbName>
   </DbList>
</eInfoResult>

The data is in XML format. To get it as a Python object, pass the handle returned by Entrez.einfo() to the Entrez.read function:

>>> info = Entrez.einfo()
>>> record = Entrez.read(info)

Here, record is a dictionary with a single key, DbList, as shown below:

>>> list(record.keys())
['DbList']

Accessing the DbList key returns the list of database names shown below:

>>> record['DbList']
['pubmed', 'protein', 'nuccore', 'ipg', 'nucleotide', 'nucgss',
   'nucest', 'structure', 'sparcle', 'genome', 'annotinfo', 'assembly',
   'bioproject', 'biosample', 'blastdbinfo', 'books', 'cdd', 'clinvar',
   'clone', 'gap', 'gapplus', 'grasp', 'dbvar', 'gene', 'gds', 'geoprofiles',
   'homologene', 'medgen', 'mesh', 'ncbisearch', 'nlmcatalog', 'omim',
   'orgtrack', 'pmc', 'popset', 'probe', 'proteinclusters', 'pcassay',
   'biosystems', 'pccompound', 'pcsubstance', 'pubmedhealth', 'seqannot',
   'snp', 'sra', 'taxonomy', 'biocollections', 'unigene', 'gencoll', 'gtr']
>>>

Essentially, the Entrez module parses the XML returned by the Entrez search system and provides it as Python dictionaries and lists.
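
To make that idea concrete, here is a minimal standard-library sketch that turns a trimmed copy of the einfo XML shown earlier into a dictionary of the same shape. It only illustrates the concept: Entrez.read is far more general, and the function name parse_db_list is hypothetical.

```python
import xml.etree.ElementTree as ET

# A trimmed copy of the einfo response shown earlier (three databases only).
SAMPLE_XML = b"""<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE eInfoResult PUBLIC "-//NLM//DTD einfo 20130322//EN"
   "https://eutils.ncbi.nlm.nih.gov/eutils/dtd/20130322/einfo.dtd">
<eInfoResult>
   <DbList>
      <DbName>pubmed</DbName>
      <DbName>protein</DbName>
      <DbName>nuccore</DbName>
   </DbList>
</eInfoResult>
"""


def parse_db_list(xml_bytes):
    """Collect every DbName element into a {'DbList': [...]} dictionary."""
    root = ET.fromstring(xml_bytes)
    names = [node.text for node in root.findall("./DbList/DbName")]
    return {"DbList": names}


record = parse_db_list(SAMPLE_XML)
print(record["DbList"])  # ['pubmed', 'protein', 'nuccore']
```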

Search Database

To search any one of the Entrez databases, we can use the Bio.Entrez.esearch() function. It is used as shown below:

>>> info = Entrez.esearch(db="pubmed", term="genome")
>>> record = Entrez.read(info)
>>> print(record)
DictElement({u'Count': '1146113', u'RetMax': '20', u'IdList':
['30347444', '30347404', '30347317', '30347292',
'30347286', '30347249', '30347194', '30347187',
'30347172', '30347088', '30347075', '30346992',
'30346990', '30346982', '30346980', '30346969',
'30346962', '30346954', '30346941', '30346939'],
u'TranslationStack': [DictElement({u'Count':
'927819', u'Field': 'MeSH Terms', u'Term': '"genome"[MeSH Terms]',
u'Explode': 'Y'}, attributes = {})
, DictElement({u'Count': '422712', u'Field':
'All Fields', u'Term': '"genome"[All Fields]', u'Explode': 'N'}, attributes = {}),
'OR', 'GROUP'], u'TranslationSet': [DictElement({u'To': '"genome"[MeSH Terms]
OR "genome"[All Fields]', u'From': 'genome'}, attributes = {})], u'RetStart': '0',
u'QueryTranslation': '"genome"[MeSH Terms] OR "genome"[All Fields]'},
attributes = {})
>>>
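
The values in the parsed result are strings, so Count and RetMax need converting before any arithmetic. Only RetMax identifiers come back per call (20 in the output above), so larger result sets are usually retrieved page by page with the E-utilities retstart and retmax parameters. The sketch below computes the retstart offsets for such paging. The function name page_offsets and its limit argument are hypothetical, and the sample dictionary is trimmed from the output above.

```python
def page_offsets(record, limit=None):
    """Return the retstart offsets needed to page through a search result.

    record is an esearch result (or any dict with 'Count' and 'RetMax');
    limit optionally caps how many hits we intend to retrieve.
    """
    total = int(record["Count"])      # values arrive as strings
    page_size = int(record["RetMax"])
    if limit is not None:
        total = min(total, limit)
    if page_size <= 0:
        return []                     # nothing was returned, nothing to page
    return list(range(0, total, page_size))


sample = {"Count": "1146113", "RetMax": "20", "RetStart": "0"}
offsets = page_offsets(sample, limit=100)
print(offsets)  # [0, 20, 40, 60, 80]
```

Each offset could then be passed to a call such as Entrez.esearch(db="pubmed", term="genome", retstart=offset, retmax=20).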

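If you search a db that holds no matches for the term (for example, because you picked the wrong database), it returns an empty IdList together with warning and error lists: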

>>> info = Entrez.esearch(db = "blastdbinfo",term = "books")
>>> record = Entrez.read(info)
>>> print(record)
DictElement({u'Count': '0', u'RetMax': '0', u'IdList': [],
u'WarningList': DictElement({u'OutputMessage': ['No items found.'],
   u'PhraseIgnored': [], u'QuotedPhraseNotFound': []}, attributes = {}),
   u'ErrorList': DictElement({u'FieldNotFound': [], u'PhraseNotFound':
      ['books']}, attributes = {}), u'TranslationSet': [], u'RetStart': '0',
      u'QueryTranslation': '(books[All Fields])'}, attributes = {})
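
Because the parsed result behaves like a dictionary, an empty search can be detected and reported in code. The following is a minimal sketch, assuming the keys shown in the output above. The function name describe_search is hypothetical, and the sample dictionary is copied from that output.

```python
def describe_search(record):
    """Summarise an esearch result: hit count plus any warnings or errors."""
    # WarningList and ErrorList appear only when something went wrong,
    # so fall back to empty dictionaries when they are absent.
    warnings = record.get("WarningList", {})
    errors = record.get("ErrorList", {})
    return {
        "count": int(record["Count"]),
        "messages": list(warnings.get("OutputMessage", [])),
        "not_found": list(errors.get("PhraseNotFound", [])),
    }


sample = {
    "Count": "0",
    "RetMax": "0",
    "IdList": [],
    "WarningList": {
        "OutputMessage": ["No items found."],
        "PhraseIgnored": [],
        "QuotedPhraseNotFound": [],
    },
    "ErrorList": {"FieldNotFound": [], "PhraseNotFound": ["books"]},
    "TranslationSet": [],
    "RetStart": "0",
    "QueryTranslation": "(books[All Fields])",
}

summary = describe_search(sample)
if summary["count"] == 0:
    print("No hits:", summary["messages"], "terms not found:", summary["not_found"])
```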

If you want to search across databases, you can use Entrez.egquery. It is similar to Entrez.esearch, except that you only specify the keyword and skip the database parameter.

>>> info = Entrez.egquery(term="entrez")
>>> record = Entrez.read(info)
>>> for row in record["eGQueryResult"]:
...     print(row["DbName"], row["Count"])
...
pubmed 458
pmc 12779
mesh 1
...
...
...
biosample 7
biocollections 0
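
The per-database counts are strings, so ranking the databases by number of hits takes a small conversion step. Here is a minimal sketch using rows taken from the output above. The function name databases_with_hits is hypothetical, and skipping non-numeric counts is an assumption about how rows that could not be searched should be handled.

```python
def databases_with_hits(rows):
    """Return (database, count) pairs with at least one hit, largest first."""
    hits = []
    for row in rows:
        count = row["Count"]
        # Assumption: a non-numeric Count means the database could not be
        # searched for this term, so such rows are skipped.
        if count.isdigit() and int(count) > 0:
            hits.append((row["DbName"], int(count)))
    return sorted(hits, key=lambda item: item[1], reverse=True)


sample_rows = [
    {"DbName": "pubmed", "Count": "458"},
    {"DbName": "pmc", "Count": "12779"},
    {"DbName": "mesh", "Count": "1"},
    {"DbName": "biosample", "Count": "7"},
    {"DbName": "biocollections", "Count": "0"},
]

ranked = databases_with_hits(sample_rows)
for name, count in ranked:
    print(name, count)
```

With a real result, record["eGQueryResult"] would be passed in place of sample_rows.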

Fetch Records

Entrez provides a special function, efetch, to retrieve and download the full details of records from Entrez. Consider the following simple example:

>>> handle = Entrez.efetch(
   db = "nucleotide", id = "EU490707", rettype = "fasta")

Now, we can simply read the record using the SeqIO module:

>>> from Bio import SeqIO
>>> record = SeqIO.read(handle, "fasta")
>>> record
SeqRecord(seq = Seq('ATTTTTTACGAACCTGTGGAAATTTTTGGTTATGACAATAAATCTAGTTTAGTA...GAA',
SingleLetterAlphabet()), id = 'EU490707.1', name = 'EU490707.1',
description = 'EU490707.1 Selenipedium aequinoctiale maturase K (matK) gene, partial cds; chloroplast',
dbxrefs = [])
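
efetch is not limited to a single record: the E-utilities id parameter also accepts a comma-separated list of identifiers. For long lists it is common to send the identifiers in batches. The sketch below shows one way to prepare them. The helper name batch_ids and the batch size are hypothetical, and the identifiers are the first five PubMed IDs from the esearch output earlier in this chapter.

```python
def batch_ids(ids, batch_size):
    """Split a list of identifiers into comma-separated batch strings."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [
        ",".join(ids[start:start + batch_size])
        for start in range(0, len(ids), batch_size)
    ]


pubmed_ids = ["30347444", "30347404", "30347317", "30347292", "30347286"]
batches = batch_ids(pubmed_ids, batch_size=2)
print(batches)
# ['30347444,30347404', '30347317,30347292', '30347286']
```

Each batch string could then be passed to a call such as Entrez.efetch(db="pubmed", id=batch, retmode="xml").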