Access EMBL Services
Raised if at least one entry was not found by get_sequences_by_ids().
NOT_FOUND is used to check whether any entry wasn’t downloaded.
Raised if no sequences were found by
get_sequences_by_ids(), the check is based on the
datawarehouse_search(query, domain='sequence', result='sequence_release', display='fasta', offset=0, length=100000, contact=None, download='gzip', url='http://www.ebi.ac.uk/ena/data/warehouse/search?', fields=None)
Changed in version 0.2.3: added fields parameter to retrieve tab separated information
New in version 0.1.13.
- query (str) – query for the search engine
- domain (str) – database domain to search
- result (str) – domain result requested
- display (str) – display option (format to retrieve the entries)
- offset (int) – the offset of the search results, defaults to the first result
- length (int) – number of results to retrieve starting at the specified offset; the limit is automatically set to 100,000 records per query
- contact (str) – email of the user
- download (str) – type of response. Gzip responses are automatically decompressed
- url (str) – base URL for the resource
- fields (None, iterable) – must be an iterable of fields to be returned if display is set to report
the raw request
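Because length is capped at 100,000 records per query, larger result sets have to be fetched in pages by advancing offset. The following is a minimal sketch of that loop, not part of the module: iter_pages and fake_search are hypothetical names, fake_search stands in for datawarehouse_search() so the example runs offline, and it is an assumption that a request past the last record returns an empty result.

```python
def iter_pages(search, query, page_size=100000, start=0, **kwargs):
    """Yield successive pages of results until an empty page is returned.

    `search` is any callable accepting `query`, `offset` and `length`,
    such as datawarehouse_search(). Stopping on an empty page is an
    assumption about how the service behaves past the last record.
    """
    offset = start
    while True:
        page = search(query, offset=offset, length=page_size, **kwargs)
        if not page:
            break
        yield page
        offset += page_size


def fake_search(query, offset=0, length=100000, **kwargs):
    """Offline stand-in for the real search: serves five made-up records."""
    records = ['rec{}'.format(i) for i in range(5)]
    return '\n'.join(records[offset:offset + length])


pages = list(iter_pages(fake_search, 'tax_tree(1485)', page_size=2))
print(len(pages))
```

With the real function, extra keyword arguments such as result and display would be passed through unchanged via kwargs.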
Querying EMBL for all rRNA sequences of the Clostridium genus, restricted to the EMBL release database and returned in fasta format:
>>> query = 'tax_tree(1485) AND mol_type="rRNA"'
>>> result = 'sequence_release'
>>> display = 'fasta'
>>> data = embl.datawarehouse_search(query, result=result,
...                                  display=display)
>>> len(data)
35919
The tax_id of each entry in the same data can be retrieved by using report as the display option and setting fields to ('accession', 'tax_id'):
>>> query = 'tax_tree(1485) AND mol_type="rRNA"'
>>> result = 'sequence_release'
>>> display = 'report'
>>> fields = ('accession', 'tax_id')
>>> data = embl.datawarehouse_search(query, result=result,
...                                  display=display, fields=fields)
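The tab-separated report can then be turned into a mapping from accession to tax_id. The sketch below assumes that the report is text whose first line is a header naming the requested fields; that layout is an assumption, not something stated above. parse_report is a hypothetical helper, and the sample records are made up for illustration.

```python
import csv
import io


def parse_report(text, key='accession', value='tax_id'):
    """Map one column of a tab-separated report to another.

    Assumes the first line of `text` is a header with the field names.
    """
    reader = csv.DictReader(io.StringIO(text), delimiter='\t')
    return {row[key]: row[value] for row in reader}


# Made-up sample standing in for the data returned by the search
sample = 'accession\ttax_id\nAB000001\t1485\nAB000002\t1502\n'
tax_ids = parse_report(sample)
print(tax_ids)
```

The values are kept as strings; if the response arrives as bytes, it would need decoding before being parsed.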
dbfetch(embl_ids, db='embl', contact=None, out_format='seqxml', num_req=10)
New in version 0.1.12.
Function that provides access to the dbfetch service (REST). More information on the output formats and the databases available can be found on the service page.
a list with the results from each request sent. Each request contains at most num_req ids, so the number of items in the list depends on the number of ids in embl_ids and the value of num_req.
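The relationship between embl_ids, num_req and the length of the returned list can be illustrated with a short sketch. batch_ids is a hypothetical helper written for this example and is not the module's own code; it only shows how ids split into groups of at most num_req, one group per request.

```python
def batch_ids(embl_ids, num_req=10):
    """Split ids into consecutive batches of at most `num_req` items."""
    if isinstance(embl_ids, str):
        # A single id is treated as a one-element list
        embl_ids = [embl_ids]
    embl_ids = list(embl_ids)
    return [
        embl_ids[index:index + num_req]
        for index in range(0, len(embl_ids), num_req)
    ]


# Made-up ids: 25 ids with num_req=10 need three requests
ids = ['ID{:03d}'.format(number) for number in range(25)]
batches = batch_ids(ids, num_req=10)
print([len(batch) for batch in batches])
```

The number of batches is the number of ids divided by num_req, rounded up.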
get_sequences_by_ids(embl_ids, contact=None, out_format='fasta', num_req=10, embl_db='embl_cds', strict=False)
Changed in version 0.3.4: removed compress, as the function is now based on the requests package
Downloads entries using the EBI REST API. It can download one entry at a time or accept an iterable, in which case all sequences are downloaded in batches of at most num_req.
It’s fairly general, so it can be customised, from the DB used to the output format: all batches are simply concatenated.
There are some checks for errors that are reported by the EMBL API but not documented; in particular, two errors that are simply reported as text lines in the fasta file (the only format tested at this time).
There are two possible cases:
- embl_ids (iterable, str) – list of ids to download
- contact (str) – email address to be passed in the query
- out_format (str) – format of the entry
- num_req (int) – number of entries to download with each request
- embl_db (str) – database to which the ids refer
- strict (bool) – if True, a check on the number of entries retrieved is performed
the entries requested
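One way to picture the check enabled by strict is to count the fasta headers in the downloaded text and compare the total with the number of ids requested. This is only a sketch of the idea, assuming fasta output: count_fasta_entries and check_entry_count are hypothetical names, ValueError stands in for the module's own exceptions, and the sample records are made up.

```python
def count_fasta_entries(fasta_text):
    """Count the records in fasta text, one per header line ('>')."""
    return sum(
        1 for line in fasta_text.splitlines() if line.startswith('>')
    )


def check_entry_count(fasta_text, embl_ids):
    """Raise ValueError if fewer or more entries than ids were returned."""
    expected = len(embl_ids)
    found = count_fasta_entries(fasta_text)
    if found != expected:
        raise ValueError(
            'Expected {} entries, found {}'.format(expected, found)
        )
    return found


# Made-up fasta text with two records
sample = '>id1 made-up entry\nACGT\n>id2 made-up entry\nGGCC\n'
print(check_entry_count(sample, ['id1', 'id2']))
```

A count lower than expected would indicate that some ids were not returned, which is the situation the exceptions described at the top of this page are meant to report.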
The number of sequences that can be downloaded per request appears to be 11, since each request returned at most 11 sequences. I didn’t find any mention of this in the API docs, so the restriction may be temporary.