mgkit.taxon module
This module gives access to Uniprot taxonomy data. It also defines classes to filter, order and group data by taxa.
-
exception
mgkit.taxon.
NoLcaFound
¶ Bases:
exceptions.Exception
New in version 0.1.13.
Raised if no lowest common ancestor can be found in the taxonomy
-
mgkit.taxon.
TaxonTuple
¶ alias of
mgkit.taxon.UniprotTaxonTuple
-
class
mgkit.taxon.
Taxonomy
(fname=None)¶ Bases:
future.types.newobject.newobject
Class that contains the whole Uniprot taxonomy. Defines some methods to ease access to the taxonomy. Follows the conventions of NCBI Taxonomy.
Defines:
- methods to load taxonomy from a pickle file or a generic file handle
- can be iterated over, returning a generator of its UniprotTaxon instances
- can be used as a dictionary, in which the key is a taxon_id and the value is its UniprotTaxon instance
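The container behaviour listed above can be sketched with a small stand-in class. This is an illustration only, not mgkit's implementation: `TinyTaxonomy` is a hypothetical name, the taxa are plain strings instead of UniprotTaxon instances, and membership is only checked by taxon_id.

```python
class TinyTaxonomy(object):
    """Hypothetical container mirroring the behaviour listed above."""

    def __init__(self, taxa):
        # taxa: dictionary of taxon_id -> taxon object
        self._taxa = dict(taxa)

    def __getitem__(self, taxon_id):
        # Dictionary behaviour: the key is a taxon_id
        return self._taxa[taxon_id]

    def __contains__(self, taxon_id):
        return taxon_id in self._taxa

    def __iter__(self):
        # Yield the taxon objects, not the keys (unlike a plain dict)
        return (taxon for taxon in self._taxa.values())

    def __len__(self):
        return len(self._taxa)


# IDs and names are invented for the example
taxonomy = TinyTaxonomy({2: 'Firmicutes', 3: 'Clostridia'})
names = sorted(taxonomy)
```

Note that iterating yields the stored taxa rather than the keys, which is the one place where the documented behaviour differs from a regular dictionary.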
-
__contains__
(taxon)¶ Returns True if the taxon is in the taxonomy
Accepts an int (check for taxon_id) or an instance of UniprotTaxon
-
__getitem__
(taxon_id)¶ Defines dictionary behavior. Key is a taxon_id, the returned value is a UniprotTaxon instance
-
__iter__
()¶ Defines iterable behavior. Returns a generator for UniprotTaxon instances
-
__len__
()¶ Returns the number of taxa contained
-
__repr__
()¶ New in version 0.2.5.
-
add_lineage
(**lineage)¶ New in version 0.3.1.
Adds a lineage to the taxonomy. It’s passed by keyword arguments, where each key is a rank in TAXON_RANKS and the value is the scientific name. Trailing underscores ‘_’ are stripped from the rank name, for cases such as class, where the key is a reserved word in Python. One extra node can also be added, such as strain/cultivar/subspecies and so on, but only one is expected to be passed.
Parameters: lineage (dict) – the lineage as keyword arguments
Returns: the taxon_id of the last element in the lineage
Return type:
Raises: ValueError – if more than one keyword argument is not contained in TAXON_RANKS
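The keyword handling described for add_lineage can be sketched as follows. This is a minimal illustration under stated assumptions, not mgkit's code: `normalise_lineage` is a hypothetical helper, and `RANKS` is copied from the default ranks shown in the get_ranked_taxon() signature, assumed here to match TAXON_RANKS.

```python
# Assumed to match TAXON_RANKS (taken from the get_ranked_taxon signature)
RANKS = ('superkingdom', 'kingdom', 'phylum', 'class', 'subclass',
         'order', 'family', 'genus', 'species')


def normalise_lineage(**lineage):
    """Strip trailing underscores from rank names and validate them."""
    ranked = {}
    extra = {}
    for key, name in lineage.items():
        rank = key.rstrip('_')  # 'class_' -> 'class'
        if rank in RANKS:
            ranked[rank] = name
        else:
            extra[rank] = name
    # Only one node outside the ranks (e.g. a strain) is tolerated
    if len(extra) > 1:
        raise ValueError('more than one key not in RANKS: %s' % sorted(extra))
    return ranked, extra


# 'class_' is used because 'class' is a reserved word in Python
ranked, extra = normalise_lineage(
    superkingdom='Bacteria', class_='Clostridia', strain='DSM 1234'
)
```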
-
add_taxon
(taxon_name, common_name='', rank='no rank', parent_id=None)¶ New in version 0.3.1.
Adds a taxon to the taxonomy. If a taxon with the same name and rank is found, its taxon_id is returned, otherwise a new taxon_id is returned.
Parameters: Returns: the taxon_id of the added taxon (if new), or the taxon_id of the taxon with the same name and rank found in the taxonomy
Return type:
Raises: KeyError – if more than one taxon already has the passed name and rank, and the ambiguity can’t be resolved by looking at the parent_id passed
-
drop_taxon
(taxon_id)¶ New in version 0.3.1.
Drops a taxon and all taxa below it in the taxonomy. Also resets the name map for consistency.
Parameters: taxon_id (int) – taxon_id to drop from the taxonomy
-
find_by_name
(s_name, rank=None, strict=True)¶ Changed in version 0.2.3: the search is now case insensitive
Changed in version 0.3.1: added rank and strict parameter
Returns the taxon IDs associated with the scientific name provided
Parameters:
Returns: a reference to the list of IDs associated with s_name, if rank is None. If rank is not None and one taxon is found, its taxon_id is returned, or None if no taxon is found. If strict is True and rank is not None, the set of taxon_ids found is returned.
Return type:
Raises: KeyError – if multiple taxa are found
-
gen_name_map
()¶ Changed in version 0.2.3: names are stored in the mapping as lowercase
Generates a name map, which associates each scientific name in the taxonomy with an id.
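A lowercase name map of this kind can be sketched as below. It assumes that several taxa may share a name, so each name maps to a list of IDs (find_by_name mentions "the list of IDs"); the `build_name_map` helper and the sample IDs are hypothetical.

```python
from collections import defaultdict

# (taxon_id, scientific name) pairs; the IDs are made up for illustration
taxa = [(1, 'Bacteria'), (2, 'Clostridia'), (3, 'bacteria')]


def build_name_map(pairs):
    """Map each lowercased scientific name to the list of IDs using it."""
    name_map = defaultdict(list)
    for taxon_id, s_name in pairs:
        name_map[s_name.lower()].append(taxon_id)
    return dict(name_map)


name_map = build_name_map(taxa)

# Lowercasing the query as well makes the lookup case insensitive
ids = name_map['BACTERIA'.lower()]
```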
-
get_lineage
(taxon_id, names=False, only_ranked=True, with_last=True)¶ New in version 0.3.1.
Proxy for get_lineage(), with changed defaults
Parameters:
Returns: the lineage of the passed taxon_id as a list of IDs or names
Return type:
-
get_lineage_string
(taxon_id, only_ranked=True, with_last=True, sep=';', rank=None)¶ New in version 0.3.3.
Generates a lineage string, optionally cut at another rank, such as phylum (the ranked taxon is obtained via Taxonomy.get_ranked_taxon()).
Parameters:
- taxon_id (int) – taxon_id whose lineage is returned
- only_ranked (bool) – only return the ranked taxa
- with_last (bool) – include the passed taxon_id in the list
- sep (str) – separator used to join the lineage string
- rank (int or None) – if None the full lineage is returned, otherwise the lineage will be cut to the specified rank
Returns: lineage string
Return type:
-
get_name_map
()¶ Returns a taxon_id->s_name dictionary
-
get_ranked_id
(taxon_id, rank=None, it=False, include_higher=True)¶ New in version 0.3.4.
Gets the ranked taxon of another one. Useful when it’s better to get a taxon_id instead of an instance of TaxonTuple. Internally, it relies on Taxonomy.get_ranked_taxon().
Parameters:
Returns: The type returned is based on the it parameter. If it is True, the return value is a list with the taxon_id of the ranked taxon as the sole value. If False, the returned value is the taxon_id. If include_higher is False and the exact rank was not found, None is returned.
Return type:
-
get_ranked_taxon
(taxon_id, rank=None, ranks=('superkingdom', 'kingdom', 'phylum', 'class', 'subclass', 'order', 'family', 'genus', 'species'), roots=False)¶ Changed in version 0.1.13: added roots argument
Traverses the branch of which the taxon argument is the leaf backward, to get the specific rank to which the taxon belongs.
Warning
the roots option is kept for backward compatibility and should be set to False
Parameters:
- taxon_id – id of the taxon or instance of UniprotTaxon
- rank (str) – string that specifies the rank; if None, the first valid rank will be searched (i.e. the first with a value different from ‘’)
- ranks – tuple of all taxonomy ranks, defaults to the module value
- roots (bool) – if True, uses TAXON_ROOTS to solve the root taxa
Returns: instance of TaxonTuple for the rank found.
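The backward traversal can be sketched on a toy taxonomy. This is a hedged illustration, not the mgkit implementation: `Node` is a cut-down stand-in for TaxonTuple, `ranked_node` is a hypothetical helper, and returning None when the rank is never reached is a choice made for this sketch only.

```python
from collections import namedtuple

# Stand-in for TaxonTuple with only the fields this sketch needs
Node = namedtuple('Node', ('taxon_id', 's_name', 'rank', 'parent_id'))

# Toy taxonomy; IDs are invented and the root has parent_id None
toy = {
    1: Node(1, 'Bacteria', 'superkingdom', None),
    2: Node(2, 'Firmicutes', 'phylum', 1),
    3: Node(3, 'Clostridia', 'class', 2),
    4: Node(4, 'Clostridium', 'genus', 3),
}


def ranked_node(taxonomy, taxon_id, rank):
    """Walk from taxon_id towards the root until a node has the given rank."""
    node = taxonomy[taxon_id]
    while node.rank != rank:
        if node.parent_id is None:
            # Reached the root without finding the rank (sketch-only choice)
            return None
        node = taxonomy[node.parent_id]
    return node


phylum = ranked_node(toy, 4, 'phylum')
missing = ranked_node(toy, 1, 'genus')
```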
-
is_ancestor
(leaf_id, anc_ids)¶ Changed in version 0.1.13: now uses is_ancestor() and changed behavior
Checks if a taxon is the leaf of another one, or a list of taxa.
Parameters:
Return bool: True if the ancestor taxon is in the leaf taxon lineage
-
load_data
(file_handle)¶ Changed in version 0.2.3: can now read msgpack serialised files
Changed in version 0.1.13: now accepts file handles and compressed files (if file names)
Loads serialised data from file name “file_handle” and accepts compressed files.
If the .msgpack string is found in the file name, the msgpack package is used instead of pickle
Parameters: file_handle (str, file) – file name or handle from which to load the instance data
Raises: DependencyError – if the file name contains .msgpack and the package is not installed
-
static
parse_gtdb_lineage
(lineage, sep=';')¶ New in version 0.3.3.
Parse a GTDB lineage, one that defines the rank as a single letter, followed by __ for each taxon name. Taxa are separated by semicolon by default. Also the domain rank is renamed into superkingdom to allow mixing of taxonomies.
Returns: dictionary with the parsed lineage, which can be passed to Taxonomy.add_lineage()
Return type: dict
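A parser for this kind of lineage string can be sketched as follows. The prefix table and the `parse_gtdb` helper are assumptions made for illustration (they are not taken from mgkit), and skipping ranks with an empty name is a choice of this sketch.

```python
# Single-letter GTDB prefixes mapped to rank names; 'd' (domain) is mapped
# to 'superkingdom' as the text describes. This table is an assumption.
GTDB_PREFIXES = {
    'd': 'superkingdom', 'p': 'phylum', 'c': 'class', 'o': 'order',
    'f': 'family', 'g': 'genus', 's': 'species',
}


def parse_gtdb(lineage, sep=';'):
    """Turn 'd__Bacteria;p__Firmicutes' into a rank -> name dictionary."""
    parsed = {}
    for item in lineage.split(sep):
        item = item.strip()
        if not item:
            continue
        prefix, _, name = item.partition('__')
        if name:  # skip ranks left empty, such as 's__'
            parsed[GTDB_PREFIXES[prefix]] = name
    return parsed


result = parse_gtdb('d__Bacteria;p__Firmicutes;c__Clostridia;s__')
```

A dictionary of this shape can be unpacked with `**` into a function that takes ranks as keyword arguments, since `**` accepts the 'class' key even though it is a reserved word.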
-
read_from_gtdb_taxonomy
(file_handle, use_gtdb_name=True, sep='\t')¶ New in version 0.3.0.
Changed in version 0.3.1: replaced domain with superkingdom to support get_lineage
Reads a GTDB taxonomy file (tab separated genome_id/taxonomy) and populates the taxonomy instance. The method also returns a dictionary of genome_id -> taxon_id.
Parameters:
- file_handle (file) – file with the taxonomy
- use_gtdb_name (bool) – if True, the names are kept as-is in the s_name attribute of TaxonTuple and the “cleaned” version in c_name (e.g. f__Ammonifexaceae -> Ammonifexaceae). If False, the values are switched
- sep (str) – separator between the columns of the file
Returns: dictionary of genome_id -> taxon_id, reflecting the created taxonomy
Return type:
Note
the taxon_id values are generated, so there’s no guarantee they will be the same in a successive execution
-
read_from_ncbi_dump
(nodes_file, names_file=None, merged_file=None)¶ New in version 0.2.3.
Uses the nodes.dmp and optionally names.dmp, merged.dmp files from ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/ to populate the taxonomy.
Parameters:
-
read_taxonomy
(f_handle, light=True)¶ Changed in version 0.2.1: added light parameter
Reads taxonomy from a file handle. The file needs to be in the tab separated format returned by a query on Uniprot. If light is True, lineage is not stored to decrease the memory usage. This is now the default.
New taxa will be added, duplicated taxa will be skipped.
Parameters: f_handle (handle) – file handle of the taxonomy file.
-
save_data
(file_handle)¶ Changed in version 0.2.3: now can use msgpack to serialise
Saves taxonomy data to a file handle or file name, can write compressed data if the file ends with “.gz”, “.bz2”
If the .msgpack string is found in the file name, the msgpack package is used instead of pickle
Parameters: file_handle (str, file) – file name or handle to which to save the instance data
Raises: DependencyError – if the file name contains .msgpack and the package is not installed
-
class
mgkit.taxon.
UniprotTaxonTuple
(taxon_id, s_name, c_name, rank, lineage, parent_id)¶ Bases:
tuple
-
__getnewargs__
()¶ Return self as a plain tuple. Used by copy and pickle.
-
__getstate__
()¶ Exclude the OrderedDict from pickling
-
__repr__
()¶ Return a nicely formatted representation string
-
_asdict
()¶ Return a new OrderedDict which maps field names to their values
-
_replace
(_self, **kwds)¶ Return a new UniprotTaxonTuple object replacing specified fields with new values
-
c_name
¶ Alias for field number 2
-
lineage
¶ Alias for field number 4
-
parent_id
¶ Alias for field number 5
-
rank
¶ Alias for field number 3
-
s_name
¶ Alias for field number 1
-
taxon_id
¶ Alias for field number 0
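The base class and the methods listed for UniprotTaxonTuple match what `collections.namedtuple` generates, so its behaviour can be sketched with one. `TaxonSketch` is a hypothetical stand-in and the field values are invented; only the field order is taken from the class signature.

```python
from collections import namedtuple

# Field order taken from the UniprotTaxonTuple signature
TaxonSketch = namedtuple(
    'TaxonSketch',
    ('taxon_id', 's_name', 'c_name', 'rank', 'lineage', 'parent_id'),
)

# Values are invented for illustration
taxon = TaxonSketch(2, 'Firmicutes', '', 'phylum', (), 1)

# Fields are reachable by name or by position ("Alias for field number 3")
same = taxon.rank == taxon[3]

# _replace returns a new tuple, the original is left unchanged
renamed = taxon._replace(c_name='firmicutes')

# _asdict maps field names to their values
as_dict = taxon._asdict()
```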
-
-
mgkit.taxon.
UniprotTaxonomy
¶ alias of
mgkit.taxon.Taxonomy
-
mgkit.taxon.
distance_taxa_ancestor
(taxonomy, taxon_id, anc_id)¶ New in version 0.1.16.
Function to calculate the distance between a taxon and the given ancestor
The distance is equal to the number of steps in the taxonomy taken to arrive at the ancestor.
Parameters:
Returns: int: distance between taxon_id and its ancestor anc_id
-
mgkit.taxon.
distance_two_taxa
(taxonomy, taxon_id1, taxon_id2)¶ New in version 0.1.16.
Calculate the distance between two taxa. The distance is equal to the sum of the steps it takes to traverse the taxonomy from each taxon until their last common ancestor.
Parameters:
Returns: int: distance between taxon_id1 and taxon_id2
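The distance definition can be made concrete with a child -> parent dictionary. This is a sketch under assumptions: the `PARENTS` map, its IDs and both helpers are hypothetical and do not use mgkit's Taxonomy class.

```python
# child -> parent map; None marks the root. IDs are invented.
PARENTS = {1: None, 2: 1, 3: 2, 4: 3, 5: 2}


def path_to_root(parents, taxon_id):
    """Return [taxon_id, parent, ..., root]."""
    path = []
    while taxon_id is not None:
        path.append(taxon_id)
        taxon_id = parents[taxon_id]
    return path


def distance(parents, id1, id2):
    """Steps from id1 up to the common ancestor plus steps from id2."""
    path1 = path_to_root(parents, id1)
    path2 = path_to_root(parents, id2)
    for steps1, node in enumerate(path1):
        if node in path2:
            # The first shared node is the last common ancestor
            return steps1 + path2.index(node)
    raise ValueError('no common ancestor')


# 4 -> 3 -> 2 takes two steps, 5 -> 2 takes one
d = distance(PARENTS, 4, 5)
```

When one of the two taxa is itself the ancestor, the same function gives the taxon-to-ancestor distance, for example `distance(PARENTS, 4, 2)`.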
-
mgkit.taxon.
get_ancestor_map
(leaf_ids, anc_ids, taxonomy)¶ This function returns a dictionary where every leaf taxon is associated with the right ancestors in anc_ids
ex. {clostridium: [bacteria, clostridia]}
-
mgkit.taxon.
get_lineage
(taxonomy, taxon_id, names=False, only_ranked=False, with_last=False)¶ New in version 0.2.1.
Changed in version 0.2.5: added only_ranked
Changed in version 0.3.0: added with_last
Returns the lineage of a taxon_id, as a list of taxon_id or taxa names
Parameters:
- taxonomy – a Taxonomy instance
- taxon_id (int) – taxon_id whose lineage to return
- names (bool) – if True, the returned list contains the names of the taxa instead of the taxon_id
- only_ranked (bool) – if True, only taxonomic levels whose rank is in TAXON_RANKS will be returned
- with_last (bool) – if True, the passed taxon_id is included in the lineage
Returns: lineage of the taxon_id, the elements are int if names is False, and str when names is True. If a taxon has no scientific name, the common name is used. If only_ranked is True, the returned list only contains ranked taxa (according to TAXON_RANKS).
Return type:
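How the three flags interact can be sketched on a toy taxonomy. This is not mgkit's implementation: `Node`, `lineage` and the sample IDs are hypothetical, `RANKS` is assumed to match TAXON_RANKS, and the root-first ordering of the result is an assumption of this sketch.

```python
from collections import namedtuple

Node = namedtuple('Node', ('taxon_id', 's_name', 'rank', 'parent_id'))

# Assumed to match TAXON_RANKS
RANKS = ('superkingdom', 'kingdom', 'phylum', 'class', 'subclass',
         'order', 'family', 'genus', 'species')

# Toy taxonomy with one unranked node (ID 10); all IDs are invented
toy = {
    1: Node(1, 'Bacteria', 'superkingdom', None),
    10: Node(10, 'Terrabacteria group', 'no rank', 1),
    2: Node(2, 'Firmicutes', 'phylum', 10),
    3: Node(3, 'Clostridia', 'class', 2),
}


def lineage(taxonomy, taxon_id, names=False, only_ranked=False,
            with_last=False):
    """Collect the ancestors of taxon_id, ordered from the root down."""
    chain = []
    current_id = taxon_id
    while current_id is not None:
        chain.append(taxonomy[current_id])
        current_id = taxonomy[current_id].parent_id
    if not with_last:
        chain = chain[1:]  # drop the taxon itself
    chain.reverse()  # root first (assumed order)
    if only_ranked:
        chain = [node for node in chain if node.rank in RANKS]
    return [node.s_name if names else node.taxon_id for node in chain]


ids = lineage(toy, 3)
ranked_names = lineage(toy, 3, names=True, only_ranked=True, with_last=True)
```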
-
mgkit.taxon.
is_ancestor
(taxonomy, taxon_id, anc_id)¶ Changed in version 0.1.16: if a taxon_id raises a KeyError, False is returned
Determines if the given taxon id (taxon_id) has anc_id as ancestor.
Parameters:
- taxonomy (Taxonomy) – taxonomy used to test
- taxon_id (int) – leaf taxon to test
- anc_id (int) – ancestor taxon to test against
Return bool: True if anc_id is an ancestor of taxon_id or they are the same
-
mgkit.taxon.
last_common_ancestor
(taxonomy, taxon_id1, taxon_id2)¶ New in version 0.1.13.
Finds the last common ancestor of two taxon IDs. An alias to this function is in the same module, called lowest_common_ancestor for compatibility.
Parameters:
Returns: int: taxon ID of the lowest common ancestor
Raises: NoLcaFound – if no common ancestor can be found
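One common way to find a last common ancestor is to collect the ancestors of the first taxon and walk up from the second until one of them is met. The sketch below uses that approach on a plain child -> parent dictionary; it is not necessarily how mgkit does it, and `NoCommonAncestor`, `ancestors`, `lca` and the IDs are hypothetical stand-ins.

```python
class NoCommonAncestor(Exception):
    """Stand-in for NoLcaFound in this sketch."""


# child -> parent map with two separate roots (1 and 100); IDs are invented
PARENTS = {1: None, 2: 1, 3: 2, 4: 2, 100: None, 101: 100}


def ancestors(parents, taxon_id):
    """Return [taxon_id, parent, ..., root]."""
    path = []
    while taxon_id is not None:
        path.append(taxon_id)
        taxon_id = parents[taxon_id]
    return path


def lca(parents, id1, id2):
    """Return the first node on id2's path that is also on id1's path."""
    seen = set(ancestors(parents, id1))
    for node in ancestors(parents, id2):
        if node in seen:
            return node
    raise NoCommonAncestor('%s and %s share no ancestor' % (id1, id2))


shared = lca(PARENTS, 3, 4)
```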
-
mgkit.taxon.
last_common_ancestor_multiple
(taxonomy, taxon_ids)¶ New in version 0.2.5.
Applies last_common_ancestor() to an iterable that yields taxon_id while removing any None values. If the list is of one element, that taxon_id is returned.
Parameters:
- taxonomy – instance of Taxonomy
- taxon_ids (iterable) – an iterable that yields taxon_id
Returns: the taxon_id that is the last common ancestor of all taxon_ids passed
Return type:
Raises: NoLcaFound – when no common ancestry is found or the number of taxon_ids is 0
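Extending a pairwise function to many taxa is a fold over the iterable. The sketch below shows that idea with `functools.reduce`; the helpers, the `PARENTS` map and the use of LookupError in place of NoLcaFound are assumptions of the example, not mgkit code.

```python
from functools import reduce

# child -> parent map; None marks the root. IDs are invented.
PARENTS = {1: None, 2: 1, 3: 2, 4: 2, 5: 1}


def lca(parents, id1, id2):
    """Pairwise last common ancestor on a child -> parent map."""
    seen = set()
    while id1 is not None:
        seen.add(id1)
        id1 = parents[id1]
    while id2 is not None:
        if id2 in seen:
            return id2
        id2 = parents[id2]
    raise LookupError('no common ancestor')


def lca_multiple(parents, taxon_ids):
    """Fold lca() over the IDs, ignoring None values."""
    ids = [taxon_id for taxon_id in taxon_ids if taxon_id is not None]
    if not ids:
        raise LookupError('no taxon IDs to compare')
    # reduce() returns the single element unchanged when len(ids) == 1
    return reduce(lambda a, b: lca(parents, a, b), ids)


common = lca_multiple(PARENTS, [3, None, 4, 5])
```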
-
mgkit.taxon.
lowest_common_ancestor
(taxonomy, taxon_id1, taxon_id2)¶ New in version 0.1.13.
Finds the last common ancestor of two taxon IDs. An alias to this function is in the same module, called lowest_common_ancestor for compatibility.
Parameters:
Returns: int: taxon ID of the lowest common ancestor
Raises: NoLcaFound – if no common ancestor can be found
-
mgkit.taxon.
parse_ncbi_taxonomy_merged_file
(file_handle)¶ New in version 0.2.3.
Parses the merged.dmp file where the merged taxon_id are stored. Available at ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/
Parameters: file_handle (str, file) – file name or handle to the file
Returns: dictionary with merged_id -> taxon_id
Return type: dict
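The NCBI dump files separate fields with a tab, a pipe and a tab, and end each line with a tab and a pipe. A minimal reader for merged.dmp under that layout could look like this; `parse_merged` is a hypothetical helper that only accepts an open handle, and the sample IDs are invented.

```python
import io

# Two lines in the NCBI dump layout: fields separated by "\t|\t",
# each line terminated by "\t|". The IDs are invented.
sample = io.StringIO("12\t|\t74109\t|\n30\t|\t29\t|\n")


def parse_merged(handle):
    """Return a dictionary of merged_id -> current taxon_id."""
    merged = {}
    for line in handle:
        line = line.rstrip('\n')
        if not line:
            continue
        # Drop the trailing "\t|" before splitting on the separator
        if line.endswith('\t|'):
            line = line[:-2]
        old_id, new_id = line.split('\t|\t')
        merged[int(old_id)] = int(new_id)
    return merged


merged = parse_merged(sample)
```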
-
mgkit.taxon.
parse_ncbi_taxonomy_names_file
(file_handle, name_classes=('scientific name', 'common name'))¶ New in version 0.2.3.
Parses the names.dmp file where the names associated to a taxon_id are stored. Available at ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/
Parameters: Returns: dictionary with merged_id -> taxon_id
Return type:
-
mgkit.taxon.
parse_ncbi_taxonomy_nodes_file
(file_handle, taxa_names=None)¶ New in version 0.2.3.
Parses the nodes.dmp file where the nodes of the taxonomy are stored. Available at ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/.
Parameters: - file_handle (str, file) – file name or handle to the file
- taxa_names (dict) – dictionary with the taxa names (returned from parse_ncbi_taxonomy_names_file())
Yields: TaxonTuple – TaxonTuple instance
-
mgkit.taxon.
parse_uniprot_taxon
(line, light=True)¶ Changed in version 0.1.13: now accepts empty scientific names, for root taxa
Changed in version 0.2.1: added light parameter
Parses a Uniprot taxonomy file (tab delimited) line and returns a UniprotTaxonTuple instance. If light is True, lineage is not stored to decrease the memory usage. This is now the default.
-
mgkit.taxon.
taxa_distance_matrix
(taxonomy, taxon_ids)¶ New in version 0.1.16.
Given a list of taxonomic identifiers, returns a pairwise distance matrix by using distance_two_taxa() on all possible two-element combinations of taxon_ids.
Parameters:
- taxonomy – Taxonomy instance
- taxon_ids (iterable) – list of taxonomic identifiers
Returns: matrix with the pairwise distances of all taxon_ids
Return type: pandas.DataFrame