taxon-utils - Taxonomy Utilities¶
Overview¶
The script provides commands that expose taxonomy-related functionality, so that steps which can be part of a workflow do not require ad-hoc code. One example is access to the last common ancestor function contained in mgkit.taxon.
Last Common Ancestor (lca and lca_line)¶
These commands expose the functionality of last_common_ancestor_multiple(), making it accessible via the command line. They differ in the input file format and the choice of output files.
The lca command can be used to define the last common ancestor of contigs from the annotations in a GFF file. The command uses the taxon_ids from all annotations belonging to a contig/sequence, if they have a bitscore higher than or equal to the one passed (50 by default). The default output of the command is a tab-separated file where the first column is the contig/sequence name, followed by the taxon_id of the last common ancestor, its scientific/common name and its lineage.
For example:
contig_21 172788 uncultured phototrophic eukaryote cellular organisms,environmental samples
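As a minimal sketch of how a downstream step might read this output, the snippet below splits one line into its columns. It assumes four tab-separated columns and a comma-separated lineage, as the example line suggests; `LcaRecord` and `parse_lca_line` are hypothetical names and not part of mgkit.

```python
from typing import List, NamedTuple


class LcaRecord(NamedTuple):
    """One row of the tab-separated lca output (hypothetical helper type)."""
    seq_id: str
    taxon_id: int
    taxon_name: str
    lineage: List[str]


def parse_lca_line(line: str) -> LcaRecord:
    """Split one output line into its columns.

    Assumes exactly four tab-separated columns and a lineage whose
    components are separated by commas.
    """
    seq_id, taxon_id, taxon_name, lineage = line.rstrip("\n").split("\t")
    return LcaRecord(seq_id, int(taxon_id), taxon_name, lineage.split(","))


example = (
    "contig_21\t172788\tuncultured phototrophic eukaryote\t"
    "cellular organisms,environmental samples"
)
record = parse_lca_line(example)
print(record.seq_id, record.taxon_id, record.lineage)
```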
If the -r option is used, passing the FASTA file that contains the nucleotide sequences, the output file is a GFF in which, for each contig, an annotation spanning the full contig length contains the same information as the tab-separated format.
The lca_line command accepts as input a file where each line consists of a list of taxon_ids. The separator for the list can be changed and defaults to TAB. The last common ancestor for all taxa on a line is searched. The output of this command is the same as the tab-separated file of the lca command, except for the first column, which in this command becomes a list of all taxon_ids that were used to find the last common ancestor for that line. The list of taxon_ids is separated by semicolons (“;”).
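The idea behind the per-line search can be sketched in plain Python. This is an illustration on a made-up toy taxonomy, not the mgkit implementation: the `PARENTS` map, its IDs and all function names are assumptions introduced here.

```python
from typing import Dict, List, Optional, Sequence, Tuple

# Toy taxonomy: child taxon_id -> parent taxon_id (None marks the root).
# The IDs are invented for illustration and are not NCBI taxon_ids.
PARENTS: Dict[int, Optional[int]] = {1: None, 2: 1, 3: 1, 4: 2, 5: 2, 6: 4}


def lineage(taxon_id: int, parents: Dict[int, Optional[int]]) -> List[int]:
    """Return the path from the root down to taxon_id (inclusive)."""
    path = []
    current: Optional[int] = taxon_id
    while current is not None:
        path.append(current)
        current = parents[current]
    return path[::-1]


def last_common_ancestor(
    taxon_ids: Sequence[int], parents: Dict[int, Optional[int]]
) -> Optional[int]:
    """Return the deepest taxon shared by all lineages, or None."""
    lca = None
    # Walk all lineages in parallel from the root until they diverge.
    for level in zip(*(lineage(t, parents) for t in taxon_ids)):
        if len(set(level)) != 1:
            break
        lca = level[0]
    return lca


def lca_for_line(
    line: str, parents: Dict[int, Optional[int]], separator: str = "\t"
) -> Tuple[str, Optional[int]]:
    """Return (semicolon-joined taxon_ids, LCA) for one input line."""
    taxon_ids = [int(field) for field in line.strip().split(separator)]
    return ";".join(str(t) for t in taxon_ids), last_common_ancestor(taxon_ids, parents)


print(lca_for_line("6\t5", PARENTS))
```

Because the toy tree has a single root, every pair of taxa in it has a common ancestor; a taxonomy with several unconnected roots would make `last_common_ancestor` return `None` for taxa under different roots.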
Note
Both commands also accept the -n option, which writes to a file the contig/line and the taxon_ids that had no common ancestor. These are treated as errors and do not appear in the output file.
Krona Output¶
New in version 0.3.0.
The lca command supports the writing of a file compatible with Krona. The output file can be used with the ktImportText/ImportText.pl script included with KronaTools. Specifically, the output from taxon_utils will be a file with all the lineages found (tab separated), that can be used with:
$ ktImportText -q taxon_utils_output
Note the use of -q to make the script count the lineages. Sequences with no LCA found will be marked as No LCA in the graph; the -n option is not required.
Note
Please note that the output won’t include any sequence that didn’t have a hit with the software used. If that’s important, the -kt option can be used to add a number of Unknown lines at the end, to reach the total supplied.
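The padding described in the note can be pictured with a short sketch. It assumes that each added line consists of the single label `Unknown` and that the total is the number of sequences supplied; `pad_with_unknown` is a hypothetical helper, not the code used by the script.

```python
from typing import Iterable, List


def pad_with_unknown(
    lineage_lines: Iterable[str], total: int, label: str = "Unknown"
) -> List[str]:
    """Append `label` lines until the list holds `total` entries.

    If there are already at least `total` lines, nothing is added.
    """
    lines = list(lineage_lines)
    missing = max(total - len(lines), 0)
    return lines + [label] * missing


# Two sequences had annotations, five sequences were supplied in total.
lineages = ["cellular organisms\tenvironmental samples", "No LCA"]
padded = pad_with_unknown(lineages, total=5)
print(len(padded))
```

Since ktImportText is run with -q, each line counts once, so the padded lines would show up in the graph as sequences without any hit.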
Filter by Taxon¶
The filter command of this script filters a GFF file using the taxon_id attribute, either including only some annotations or excluding some. The filter is based on the mgkit.taxon.is_ancestor function and on mgkit.filter.taxon.filter_taxon_by_id_list. A list of taxon_ids (or taxon names) can be passed to the script. The include filter will only output annotations that have one of the passed taxa as an ancestor, while the exclude filter will remove from the output the annotations that have one of the passed taxa as an ancestor.
A comma-separated list of taxon_ids can be supplied, and the same applies to names. If any of the supplied names matches multiple taxon_ids (e.g. Actinobacteria), the script exits and the list of duplicates is written to the log. In such cases it is preferable to supply a taxon_id, which can be looked up in the NCBI taxonomy (or Uniprot).
Warning
Annotations with no taxon_id are not included in the output of either filter.
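The include/exclude behaviour, including the handling of annotations without a taxon_id, can be sketched as follows. This is a simplified illustration on a toy taxonomy with annotations represented as dictionaries; `has_ancestor` and `filter_annotations` are hypothetical stand-ins and do not reproduce the signatures of the mgkit functions. The sketch also assumes that a taxon counts as its own ancestor.

```python
from typing import Dict, Iterable, Iterator, Optional, Set

# Toy taxonomy: child taxon_id -> parent taxon_id (None marks the root).
PARENTS: Dict[int, Optional[int]] = {1: None, 2: 1, 3: 1, 4: 2, 5: 3}


def has_ancestor(taxon_id: int, anc_id: int, parents: Dict[int, Optional[int]]) -> bool:
    """True if anc_id is taxon_id itself or lies on its path to the root."""
    current: Optional[int] = taxon_id
    while current is not None:
        if current == anc_id:
            return True
        current = parents.get(current)
    return False


def filter_annotations(
    annotations: Iterable[dict],
    anc_ids: Set[int],
    parents: Dict[int, Optional[int]],
    exclude: bool = False,
) -> Iterator[dict]:
    """Yield annotations under (include) or outside (exclude) the given taxa."""
    for annotation in annotations:
        taxon_id = annotation.get("taxon_id")
        if taxon_id is None:
            # Annotations without a taxon_id are dropped by both filters.
            continue
        matches = any(has_ancestor(taxon_id, anc, parents) for anc in anc_ids)
        if matches != exclude:
            yield annotation


annotations = [
    {"id": "a", "taxon_id": 4},
    {"id": "b", "taxon_id": 5},
    {"id": "c", "taxon_id": None},
]
included = [a["id"] for a in filter_annotations(annotations, {2}, PARENTS)]
excluded = [a["id"] for a in filter_annotations(annotations, {2}, PARENTS, exclude=True)]
print(included, excluded)
```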
Convert Taxa Tables to HDF5¶
This command converts the taxa tables downloaded from Uniprot and NCBI, using the scripts mentioned in download-data - Download Taxonomy from NCBI (download-uniprot-taxa.sh and download-ncbi-taxa), into an HDF5 file that can be used with the addtaxa command in add-gff-info - Add informations to GFF annotations.
One advantage is faster lookup of the IDs. Another is a smaller memory footprint when a large number of annotations are kept in memory.
Changes¶
Changed in version 0.3.1: added to_hdf command
Changed in version 0.3.1: added -j option to lca, which outputs a JSON file with the LCA results
Changed in version 0.3.0: added -k and -kt options for Krona output; the lineage now includes the LCA. Also added the -a option to restrict lineages to ranked taxa only; the default is to include all components.
Changed in version 0.2.6: added feat-type option to lca command, added phylum output to nolca
New in version 0.2.5.
Options¶
Taxonomy Utilities
usage: taxon_utils [-h] [-v | --quiet] [--cite] [--manual] [--version]
{lca,lca_line,filter,to_hdf} ...
Named Arguments¶
-v, --verbose | more verbose - includes debug messages Default: 20 |
--quiet | less verbose - only error and critical messages |
--cite | Show citation for the framework |
--manual | Show the script manual |
--version | show program’s version number and exit |
Sub-commands:¶
lca¶
Finds the last common ancestor for each sequence in a GFF file
taxon_utils lca [-h] [-b BITSCORE] [-s] [-a] [-ft FEAT_TYPE]
[-r REFERENCE | -k | -j] [-kt KRONA_TOTAL] [-n NO_LCA] -t
TAXONOMY [-v | --quiet] [--cite] [--manual] [--version]
[input_file] [output_file]
Positional Arguments¶
input_file | Input file, defaults to stdin Default: - |
output_file | Output file, defaults to stdout |
Named Arguments¶
-b, --bitscore | Minimum bitscore accepted Default: 0 |
-s, --sorted | Default: False |
-a, --only-ranked | Default: False |
-ft, --feat-type | Feature type used if the output is a GFF (default is LCA) Default: “LCA” |
-r, --reference | Reference file for the GFF; if supplied, the output is a GFF file |
-k, --krona | Output a file that can be read by Krona (text) Default: False |
-j, --json | If used, the output is a JSON file with the LCA information Default: False |
-kt, --krona-total | |
-n, --no-lca | File to which records with no LCA are written |
-t, --taxonomy | Taxonomy file |
-v, --verbose | more verbose - includes debug messages Default: 20 |
--quiet | less verbose - only error and critical messages |
--cite | Show citation for the framework |
--manual | Show the script manual |
--version | show program’s version number and exit |
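When the command is driven from a Python workflow, the argument list can be assembled from the options listed above. The helper below is a hedged sketch: `build_lca_command` and the file names are hypothetical, and only flags documented in the usage line are used. The command is built but not executed here.

```python
from typing import List, Optional


def build_lca_command(
    taxonomy: str,
    input_file: str = "-",
    output_file: Optional[str] = None,
    bitscore: Optional[float] = None,
    no_lca: Optional[str] = None,
    krona: bool = False,
) -> List[str]:
    """Assemble the argument list for `taxon_utils lca`."""
    cmd = ["taxon_utils", "lca", "-t", taxonomy]
    if bitscore is not None:
        cmd += ["-b", str(bitscore)]
    if krona:
        cmd.append("-k")
    if no_lca is not None:
        cmd += ["-n", no_lca]
    # Positional arguments go last: input file, then optional output file.
    cmd.append(input_file)
    if output_file is not None:
        cmd.append(output_file)
    return cmd


# Hypothetical file names, used only for illustration.
command = build_lca_command(
    "taxonomy.pickle", "annotations.gff", "lca.tab", bitscore=50, no_lca="no_lca.tab"
)
print(" ".join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` on a system where the script is installed.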
lca_line¶
Finds the last common ancestor for all IDs in a text file line
taxon_utils lca_line [-h] [-s SEPARATOR] [-n NO_LCA] -t TAXONOMY
[-v | --quiet] [--cite] [--manual] [--version]
[input_file] [output_file]
Positional Arguments¶
input_file | Input file, defaults to stdin Default: - |
output_file | Output file, defaults to stdout |
Named Arguments¶
-s, --separator | Separator for taxon_ids (defaults to TAB) |
-n, --no-lca | File to which records with no LCA are written |
-t, --taxonomy | Taxonomy file |
-v, --verbose | more verbose - includes debug messages Default: 20 |
--quiet | less verbose - only error and critical messages |
--cite | Show citation for the framework |
--manual | Show the script manual |
--version | show program’s version number and exit |
filter¶
Filter a GFF file based on taxonomy
taxon_utils filter [-h]
[-i INCLUDE_TAXON_ID | -in INCLUDE_TAXON_NAME | -e EXCLUDE_TAXON_ID | -en EXCLUDE_TAXON_NAME]
-t TAXONOMY [-v | --quiet] [--cite] [--manual] [--version]
[input_file] [output_file]
Positional Arguments¶
input_file | Input file, defaults to stdin Default: - |
output_file | Output file, defaults to stdout |
Named Arguments¶
-i, --include-taxon-id | Include only taxon_ids (comma separated) |
-in, --include-taxon-name | Include only taxon_names (comma separated) |
-e, --exclude-taxon-id | Exclude taxon_ids (comma separated) |
-en, --exclude-taxon-name | Exclude taxon_names (comma separated) |
-t, --taxonomy | Taxonomy file |
-v, --verbose | more verbose - includes debug messages Default: 20 |
--quiet | less verbose - only error and critical messages |
--cite | Show citation for the framework |
--manual | Show the script manual |
--version | show program’s version number and exit |
to_hdf¶
Convert a taxa table to HDF5
taxon_utils to_hdf [-h] [-n TABLE_NAME] [-w] [-s INDEX_SIZE] [-c CHUNK_SIZE]
[-v | --quiet] [--cite] [--manual] [--version]
[input_file] [output_file]
Positional Arguments¶
input_file | Input file, defaults to stdin Default: - |
output_file | Output file, defaults to (taxa-table.hf5) Default: “taxa-table.hf5” |
Named Arguments¶
-n, --table-name | Name of the table/storage to use Default: “taxa” |
-w, --overwrite | Overwrite the file, instead of appending to it Default: False |
-s, --index-size | Maximum number of characters for the gene_id Default: 12 |
-c, --chunk-size | Chunk size to use when reading the input file Default: 5000000 |
-v, --verbose | more verbose - includes debug messages Default: 20 |
--quiet | less verbose - only error and critical messages |
--cite | Show citation for the framework |
--manual | Show the script manual |
--version | show program’s version number and exit |