mgkit.net.embl module

Access EMBL Services

exception mgkit.net.embl.EntryNotFound

Bases: exceptions.Exception

Raised if at least one entry was not found by get_sequences_by_ids(). NOT_FOUND is used to check if any entry wasn’t downloaded.

exception mgkit.net.embl.NoEntryFound

Bases: exceptions.Exception

Raised if no sequences were found by get_sequences_by_ids(). The check is based on the NONE_FOUND variable.

mgkit.net.embl.datawarehouse_search(query, domain='sequence', result='sequence_release', display='fasta', offset=0, length=100000, contact=None, download='gzip', url='http://www.ebi.ac.uk/ena/data/warehouse/search?', fields=None)

Changed in version 0.2.3: added fields parameter to retrieve tab separated information

New in version 0.1.13.

Performs a data warehouse search on the EMBL databases. Instructions on the query language used by the data warehouse, and more details about the database domains, are available in the ENA documentation.

Parameters:
- query (str) – query for the search engine
- domain (str) – database domain to search
- result (str) – domain result requested
- display (str) – display option (format in which to retrieve the entries)
- offset (int) – the offset of the search results, defaults to the first
- length (int) – number of results to retrieve at the specified offset; the limit is automatically set at 100,000 records per query
- contact (str) – email of the user
- download (str) – type of response. Gzip responses are automatically decompressed
- url (str) – base URL for the resource
- fields (None, iterable) – must be an iterable of fields to be returned if display is set to report
Returns:	the raw request
Return type:	str

Examples

Query EMBL for all sequences of type rRNA in the Clostridium genus, restricted to the EMBL release database and returned in fasta format:

>>> from mgkit.net import embl
>>> query = 'tax_tree(1485) AND mol_type="rRNA"'
>>> result = 'sequence_release'
>>> display = 'fasta'
>>> data = embl.datawarehouse_search(query, result=result,
... display=display)
>>> len(data)
35919

The taxon_id of each entry in the same data can be retrieved by using report as the display option and setting fields to an iterable with just the required fields, ('accession', 'tax_id'):

>>> query = 'tax_tree(1485) AND mol_type="rRNA"'
>>> result = 'sequence_release'
>>> display = 'report'
>>> fields = ('accession', 'tax_id')
>>> data = embl.datawarehouse_search(query, result=result,
... display=display, fields=fields)
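
The tab-separated text returned for a report can then be turned into records. The following is only a sketch: it assumes the response is plain text with one entry per line, tab-separated, optionally preceded by a header line that repeats the field names. The helper parse_report and the sample accessions are hypothetical and not part of mgkit.

```python
import csv
import io


def parse_report(data, fields):
    """Turn tab-separated report text into a list of dicts.

    Hypothetical helper, not part of mgkit. Assumes one entry per line
    and an optional header line repeating the field names.
    """
    reader = csv.reader(io.StringIO(data), delimiter='\t')
    # Drop blank lines, which csv.reader yields as empty lists
    rows = [row for row in reader if row]
    # Skip the header line if the first row matches the field names
    if rows and rows[0] == list(fields):
        rows = rows[1:]
    return [dict(zip(fields, row)) for row in rows]


# Made-up sample standing in for the text returned by the search
sample = "accession\ttax_id\nAB000001\t1485\nAB000002\t1502\n"
records = parse_report(sample, ('accession', 'tax_id'))
```

Each record maps a field name to its value as a string, so tax_id would still need converting with int() before any numeric use.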

mgkit.net.embl.dbfetch(embl_ids, db='embl', contact=None, out_format='seqxml', num_req=10)

New in version 0.1.12.

Retrieves entries through the dbfetch service (REST). More information on the output formats and the databases available can be found on the service page.

Parameters:
- embl_ids (str, iterable) – list or single sequence id to retrieve
- db (str) – database from which to retrieve the sequence data
- contact (str) – email contact to use as per EMBL guidelines
- out_format (str) – output format, depends on the database
- num_req (int) – number of ids per request
Returns:	a list with the results from each request sent. Each request carries at most num_req ids, so the number of items in the list depends on the number of ids in embl_ids and the value of num_req.
Return type:	list
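
To illustrate how the length of the returned list relates to num_req, here is a minimal sketch of splitting ids into batches. It is not the mgkit implementation: batch_ids is a hypothetical helper, and the ids are placeholders.

```python
def batch_ids(embl_ids, num_req=10):
    """Split ids into consecutive batches of at most num_req items.

    Hypothetical helper, not part of mgkit. A single id passed as a
    string is treated as a one-item list.
    """
    if isinstance(embl_ids, str):
        embl_ids = [embl_ids]
    embl_ids = list(embl_ids)
    return [
        embl_ids[start:start + num_req]
        for start in range(0, len(embl_ids), num_req)
    ]


# 25 placeholder ids with num_req=10 give batches of 10, 10 and 5
ids = ['ID{}'.format(index) for index in range(25)]
batches = batch_ids(ids, num_req=10)
```

Under this scheme the number of requests, and so the number of items in the returned list, is the number of ids divided by num_req, rounded up.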

mgkit.net.embl.get_sequences_by_ids(embl_ids, contact=None, out_format='fasta', num_req=10, embl_db='embl_cds', strict=False)

Changed in version 0.3.4: removed compress, as the function is now based on the requests package

Downloads entries using the EBI REST API. It can download one entry at a time, or accept an iterable, in which case all sequences are downloaded in batches of at most num_req.

It's fairly general and can be customised, from the DB used to the output format. All batches are simply concatenated.

Note

There are checks for some errors that the EMBL API reports but does not document. In particular, two errors are reported only as text lines in the fasta file (the only format tested at this time).

There are two possible cases:

- if no entry was found, NoEntryFound is raised
- if at least one entry wasn't found:
  - if strict is False (the default), the error is just logged as a debug message
  - if strict is True, EntryNotFound is raised

Parameters:
- embl_ids (iterable, str) – list of ids to download
- contact (str) – email address to be passed in the query
- out_format (str) – format of the entry
- num_req (int) – number of entries to download with each request
- embl_db (str) – db to which the ids refer
- strict (bool) – if True, a check on the number of entries retrieved is performed
Returns:	the entries requested
Return type:	str
Raises:
- `EntryNotFound` – if at least one entry was not found
- `NoEntryFound` – if no entries were found
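
The two error cases described in the note can be sketched as follows. This only illustrates the logic and is not the mgkit code: the exception classes are local stand-ins, check_response is a hypothetical helper, and the marker strings assigned to NONE_FOUND and NOT_FOUND are invented placeholders, since the actual values are not given here.

```python
import logging

LOG = logging.getLogger(__name__)

# Placeholder marker strings: the real values used by mgkit are not
# documented here, so these are assumptions for illustration only
NONE_FOUND = 'ERROR: no entries found'
NOT_FOUND = 'ERROR: entry not found'


class EntryNotFound(Exception):
    """Local stand-in: at least one entry was not found."""


class NoEntryFound(Exception):
    """Local stand-in: no entry was found."""


def check_response(text, strict=False):
    """Apply the two checks to the text of a downloaded batch.

    Hypothetical helper, not part of mgkit.
    """
    if NONE_FOUND in text:
        raise NoEntryFound('no entries were found')
    if NOT_FOUND in text:
        if strict:
            raise EntryNotFound('at least one entry was not found')
        # Non-strict mode: only log the problem and carry on
        LOG.debug('at least one entry was not found')
    return text


# A partial result passes through unchanged when strict is False
partial = '>seq1\nACGT\n' + NOT_FOUND + '\n'
checked = check_response(partial)
```

With strict=True the same partial result would raise EntryNotFound instead, which is the safer choice when every requested id must be present in the output.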

Warning

The number of sequences that can be downloaded per request appears to be limited to 11, since each request returned at most 11 sequences. I didn't find any mention of this in the API docs, so it may be a temporary restriction.