mgkit.net.embl module

Access EMBL Services

exception mgkit.net.embl.EntryNotFound

Bases: exceptions.Exception

Raised if at least one entry was not found by get_sequences_by_ids(). NOT_FOUND is used to check if any entry wasn’t downloaded.

exception mgkit.net.embl.NoEntryFound

Bases: exceptions.Exception

Raised if no sequences were found by get_sequences_by_ids(). The check is based on the NONE_FOUND variable.

Changed in version 0.2.3: added fields parameter to retrieve tab separated information

New in version 0.1.13.

Perform a datawarehouse search on EMBL databases. Instructions on the query language used to query the datawarehouse are available at this page, with more details about the database domains at this page.

Parameters:
  • query (str) – query for the search engine
  • domain (str) – database domain to search
  • result (str) – domain result requested
  • display (str) – display option (format to retrieve the entries)
  • offset (int) – the offset of the search results, defaults to the first
  • length (int) – number of results to retrieve at the specified offset; the limit is automatically set to 100,000 records per query
  • contact (str) – email of the user
  • download (str) – type of response. Gzip responses are automatically decompressed
  • url (str) – base URL for the resource
  • fields (None, iterable) – must be an iterable of fields to be returned if display is set to report
Returns:

the raw request

Return type:

str

Examples

Querying EMBL for all rRNA sequences of the Clostridium genus, only from the EMBL release database, in fasta format:

>>> query = 'tax_tree(1485) AND mol_type="rRNA"'
>>> result = 'sequence_release'
>>> display = 'fasta'
>>> data = embl.datawarehouse_search(query, result=result,
... display=display)
>>> len(data)
35919

The taxon_id of each entry in the same data can be retrieved by using report as the display option and setting fields to an iterable containing just ('accession', 'tax_id'):

>>> query = 'tax_tree(1485) AND mol_type="rRNA"'
>>> result = 'sequence_release'
>>> display = 'report'
>>> fields = ('accession', 'tax_id')
>>> data = embl.datawarehouse_search(query, result=result,
... display=display, fields=fields)

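
The tab-separated text returned with the report display option can be turned into a mapping with the standard library. The following is only a sketch: it assumes the first line of the response is a header naming the requested fields, which is not confirmed by this documentation, and the sample text is made up for illustration rather than taken from a real query.

```python
import csv
import io

# Made-up sample standing in for the string returned by
# datawarehouse_search(..., display='report', fields=('accession', 'tax_id')).
# Assumption: the first line is a header with the field names.
sample = 'accession\ttax_id\nAB000001\t1485\nAB000002\t1491\n'


def parse_report(text):
    """Map each accession to its taxon id (hypothetical helper)."""
    reader = csv.DictReader(io.StringIO(text), delimiter='\t')
    return {row['accession']: int(row['tax_id']) for row in reader}


tax_ids = parse_report(sample)
```

If the real response carries no header line, csv.reader with explicit column positions would be the safer choice.
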
mgkit.net.embl.dbfetch(embl_ids, db='embl', contact=None, out_format='seqxml', num_req=10)

New in version 0.1.12.

Function that allows the use of the dbfetch service (REST). More information on the output formats and the databases available can be found at the service page.

Parameters:
  • embl_ids (str, iterable) – list or single sequence id to retrieve
  • db (str) – database from which to retrieve the sequence data
  • contact (str) – email contact to use as per EMBL guidelines
  • out_format (str) – output format, depends on database
  • num_req (int) – number of ids per request
Returns:

a list with the results from each request sent. Each request contains at most num_req ids, so the number of items in the list depends on the number of ids in embl_ids and the value of num_req.

Return type:

list
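
The relation between embl_ids, num_req and the length of the returned list can be illustrated with a small sketch. The helper split_into_batches is hypothetical and is not part of mgkit; it only shows how ids might be grouped into requests of at most num_req ids each, which may differ from the actual implementation.

```python
import math


def split_into_batches(ids, num_req=10):
    """Group ids into consecutive batches of at most num_req items.

    Hypothetical helper, not taken from mgkit.
    """
    if isinstance(ids, str):
        # a single id is treated as a one-element list
        ids = [ids]
    ids = list(ids)
    return [ids[i:i + num_req] for i in range(0, len(ids), num_req)]


ids = ['ID{}'.format(i) for i in range(25)]
batches = split_into_batches(ids, num_req=10)

# One request per batch, so the result list would have
# ceil(len(ids) / num_req) items under this assumption.
expected_requests = math.ceil(len(ids) / 10)
```

Under this assumption, 25 ids with num_req=10 produce three requests, the last one carrying the remaining five ids.
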

mgkit.net.embl.get_sequences_by_ids(embl_ids, contact=None, out_format='fasta', num_req=10, embl_db='embl_cds', strict=False)

Changed in version 0.3.4: removed compress as it’s based on the requests package

Downloads entries using the EBI REST API. It can download one entry at a time or accept an iterable, in which case all sequences are downloaded in batches of at most num_req.

It’s fairly general, so it can be customised, from the DB used to the output format. All batches are simply concatenated.

Note

There are some checks on errors that are reported by the EMBL API but not documented, in particular two errors that are just reported as text lines in the fasta file (the only format tested at this time).

There are two possible cases:

  • if no entry was found NoEntryFound will be raised.
  • if at least one entry wasn’t found:
    • if strict is False (the default) the error will be just logged as a debug message
    • if strict is True EntryNotFound is raised
Parameters:
  • embl_ids (iterable, str) – list of ids to download
  • contact (str) – email address to be passed in the query
  • out_format (str) – format of the entry
  • num_req (int) – number of entries to download with each request
  • embl_db (str) – db to which the ids refer
  • strict (bool) – if True, a check on the number of entries retrieved is performed
Returns:

the entries requested

Return type:

str

Raises:

Warning

The number of sequences that can be downloaded at a time seems to be 11, since the sequences returned for each request were at most 11. I didn’t find any mention of this in the API docs, but it may be a temporary restriction.
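
The two error cases described in the note for get_sequences_by_ids() can be sketched as follows. This is not the library's code: the marker strings are invented placeholders (the real values of NONE_FOUND and NOT_FOUND are not given here), the exception classes are local stand-ins for the ones documented above, and check_response is a hypothetical helper.

```python
import logging

LOG = logging.getLogger(__name__)

# Placeholder markers: the actual NONE_FOUND and NOT_FOUND values used by
# mgkit.net.embl are not documented here, so these strings are assumptions.
NONE_FOUND_MARKER = 'ERROR: no entries found'
NOT_FOUND_MARKER = 'ERROR: entry not found'


class NoEntryFound(Exception):
    """Local stand-in for mgkit.net.embl.NoEntryFound."""


class EntryNotFound(Exception):
    """Local stand-in for mgkit.net.embl.EntryNotFound."""


def check_response(text, strict=False):
    """Apply the documented rules to one downloaded batch of text."""
    if NONE_FOUND_MARKER in text:
        # no entry at all was returned
        raise NoEntryFound('no entries were found')
    if NOT_FOUND_MARKER in text:
        # at least one entry is missing
        if strict:
            raise EntryNotFound('at least one entry was not found')
        LOG.debug('at least one entry was not found')
    return text
```

With strict left at False, a partially failed download is returned unchanged and only a debug message is logged, so callers that need every entry would pass strict=True and handle EntryNotFound.
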