download-profiles - Download Custom Profiles

Overview

This script downloads sequence data for each gene of interest (ortholog) and for all the specified taxa. The downloaded files can then be used to create HMMER profiles, which are used to search for similarity in amino acid or nucleotide sequences.

Limitations

At the moment, the script uses Kegg Orthologs as the ortholog database.

Warning

Some taxa may still be blacklisted because they are not relevant to the rumen microbiome. If this affects you, please contact me or open an issue on the repository.

Required Data

The script requires data from Kegg Orthologs and Uniprot to be downloaded before it can be used. The script download_data (download-data - Download Taxonomy from NCBI) automates the process.

Workflow for Custom Profiles

[Workflow diagram. Data: Nucleotide Sequences, AA Sequences, Alignment Files, Custom Profiles, Results. Steps: download_profiles, translate_seq, Alignment (clustalo), hmmbuild, hmmscan, nhmmscan.]

Building the profiles to be used with HMMER involves several tasks (illustrated in the Workflow above):

  1. download of data
  2. alignment of sequences
  3. conversion into HMMER profiles.

The first step is to download, for every ortholog gene, all the sequences available for each taxon at the level of interest: this produces a series of files, each containing the amino acid sequences for one gene-taxon tuple. This script, download_profiles, can be used for that. The downloaded amino acid sequences are then aligned using Clustal Omega (or another aligner), and a profile is built from each alignment.
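The alignment and profile-building steps can be scripted once the sequence files are on disk. The following is a minimal sketch, not part of the framework: it only assembles the command lines for Clustal Omega (`clustalo -i <input> -o <output>`) and for `hmmbuild <hmmfile> <msafile>`. The helper name `profile_commands`, the `.afa`/`.hmm` output names and the example file name are assumptions made for illustration; adapt them to the files that download_profiles actually writes.

```python
from pathlib import Path


def profile_commands(fasta_files, out_dir):
    """Yield (align_cmd, build_cmd) argument lists for each sequence file.

    Hypothetical helper: the output naming scheme is an assumption,
    and no external program is executed here.
    """
    out_dir = Path(out_dir)
    for fasta in map(Path, fasta_files):
        aln = out_dir / f"{fasta.stem}.afa"  # alignment written by Clustal Omega
        hmm = out_dir / f"{fasta.stem}.hmm"  # profile written by hmmbuild
        align_cmd = ["clustalo", "-i", str(fasta), "-o", str(aln)]
        build_cmd = ["hmmbuild", str(hmm), str(aln)]
        yield align_cmd, build_cmd


# Print the commands for one made-up file name instead of running them.
for align_cmd, build_cmd in profile_commands(
    ["profile_files/K00016-example.fa"], "profiles"
):
    print(" ".join(align_cmd))
    print(" ".join(build_cmd))
```

Each argument list can be passed to `subprocess.run(cmd, check=True)` when Clustal Omega and HMMER are installed. Keeping one alignment and one profile per input file mirrors the one-file-per-gene-taxon layout described above.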

HMMER requires amino acid sequences to match against the profiles, so the translate_seq script can be used to translate nucleotide sequences into amino acid ones. The latest version of HMMER should, however, be able to match nucleotide sequences as well, but we have not tested this. The example Workflow above illustrates this.
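For readers unfamiliar with the translation step, here is a small sketch of what translating a nucleotide sequence into an amino acid one means. It uses the standard genetic code and a single forward reading frame. It is not the implementation of translate_seq, which may handle frames, reverse strands or ambiguous bases differently; the function name `translate` and the use of `X` for unknown codons are assumptions made for the example.

```python
BASES = "TCAG"
# Standard genetic code (NCBI translation table 1), codons in TCAG order:
# TTT, TTC, TTA, TTG, TCT, ... GGG. '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(seq, frame=0):
    """Translate a nucleotide string starting at forward frame 0, 1 or 2.

    Codons with characters other than A, C, G, T become 'X';
    a trailing partial codon is ignored.
    """
    seq = seq.upper().replace("U", "T")
    codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


print(translate("ATGGCCTAA"))           # MA*
print(translate("ATGGCCTAA", frame=1))  # WP
```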

Building profiles in this way, by going through all ortholog genes and choosing the desired taxon level, makes it possible to refine the profiling of a metagenome incrementally: only the new profiles need to be run, not all of them again. Filtering all the results is a much faster operation.

Usage

The default behaviour is to download all Kegg Orthologs for all taxa in the given taxonomy. Taxa can be filtered by both lineage (e.g. archaea, carnivora, etc.) and rank (e.g. genus, family, etc.). Another option is to specify the KO and taxa IDs to download.

Taxa Filters

Taxa are selected through a few different rules:

  • specific taxon ids in uniprot
  • a specific taxon rank (e.g.: genus, phylum, etc.)
  • optional lineage filter: the lineage filter makes sure that the specified name is included in the lineage attribute in the taxonomy.

As an example, if the rank chosen is genus, and the lineage option is set to archaea, only the taxa whose rank is genus and that belong to the archaea subtree will be downloaded:

$ download_profiles -m EMAIL -r genus -l archaea -k mg_data/kegg.pickle \
-t mg_data/taxonomy.pickle

This makes it possible to customise the level of specificity wanted in the profiling and makes the download faster. For metagenomic data, a good start is mixing different taxon ranks, using order or genus for the genes and then specifying a lineage of interest.
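The rank and lineage rules can be pictured with a toy taxonomy. The sketch below is an illustration only: the `Taxon` tuple, the `select_taxa` function and the made-up ids are assumptions, not the data structures stored in the taxonomy pickle. It also reflects the documented behaviour that explicit taxon ids take precedence over the rank and lineage filters.

```python
from collections import namedtuple

# Hypothetical, simplified stand-in for a taxonomy entry.
Taxon = namedtuple("Taxon", "taxon_id rank lineage")

# Made-up ids and lineages, for illustration only.
TAXA = [
    Taxon(1, "genus", ("archaea",)),
    Taxon(2, "family", ("archaea",)),
    Taxon(3, "genus", ("bacteria",)),
    Taxon(4, "genus", ("eukaryota", "carnivora")),
]


def select_taxa(taxa, rank=None, lineage=None, taxon_ids=None):
    """Return the taxa matching the filters.

    Explicit taxon ids win over rank and lineage; otherwise a taxon must
    have the requested rank and include the name in its lineage.
    """
    if taxon_ids:
        wanted = set(taxon_ids)
        return [t for t in taxa if t.taxon_id in wanted]
    selected = []
    for taxon in taxa:
        if rank is not None and taxon.rank != rank:
            continue
        if lineage is not None and lineage not in taxon.lineage:
            continue
        selected.append(taxon)
    return selected


genus_archaea = select_taxa(TAXA, rank="genus", lineage="archaea")
print([t.taxon_id for t in genus_archaea])  # [1]
```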

Because the profiles are independent of each other, it is useful to start the download with a certain rank and then run the profiling. While the profiling runs, a new download can be started, and so on.

Specific Genes and Taxa

It is possible to download only specific taxa and KOs, using the -i and -ko options respectively. When -ko is used, loading Kegg data with -k is not required, and it is up to the user to make sure that the genes and taxa specified are correct.

An example that downloads a single KO for 3 different taxa:

$ download_profiles -v -m EMAIL -ko K00016 -i 9611 9645 9682 \
-t mg_data/taxonomy.pickle

The same example using taxa filtering, instead (at the time of writing):

$ download_profiles -v -m EMAIL -ko K00016 -r genus -l carnivora \
-t mg_data/taxonomy.pickle
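Since sequences are grouped by gene-taxon tuple, it can help to enumerate the tuples a run covers before starting it. The sketch below simply pairs the KO and taxon ids from the first example; the function name `gene_taxon_pairs` is an assumption, and how the script names the resulting files is not shown here.

```python
from itertools import product


def gene_taxon_pairs(ko_ids, taxon_ids):
    """Return every (KO id, taxon id) combination, in input order."""
    return list(product(ko_ids, taxon_ids))


pairs = gene_taxon_pairs(["K00016"], [9611, 9645, 9682])
for ko_id, taxon_id in pairs:
    print(ko_id, taxon_id)
```

One KO and three taxa give three tuples; adding a second KO would double that, which is a quick way to estimate the size of a download.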

Changes

Changed in version 0.2.1: added -ko option, resolved issues caused by changes in library

Options

Download KO sequences from Uniprot

usage: download_profiles [-h] [-o OUTPUT_DIR] [-k KEGG_DATA] -m EMAIL
                         [-t TAXON_DATA] [-r TAXON_RANK] [-l LINEAGE]
                         [-i TAXON_ID [TAXON_ID ...]] [-ko KO_ID [KO_ID ...]]
                         [-R] [-a] [-v | --quiet] [--cite] [--manual]
                         [--version]

Named Arguments

-o, --output-dir
    directory in which to store the downloaded files
    Default: “profile_files”

-k, --kegg-data
    pickle file containing Kegg data
    Default: “data/kegg.pickle”

-m, --email
    email address to use for Uniprot communications

-t, --taxon-data
    pickle file containing taxonomy data
    Default: “data/taxonomy.pickle”

-r, --taxon-rank
    taxon rank to download

-l, --lineage
    lineage for filtering (e.g. archaea)

-i, --taxon-id
    id(s) of taxa to download. If specified take precedence over lineage-rank

-ko, --ko-id
    KO id(s) to download. If specified option -k is not needed

-R, --only-reviewed
    Only download reviewed sequences
    Default: False

-a, --all-path
    Download all KO from Kegg - exclude blacklist
    Default: False

-v, --verbose
    more verbose - includes debug messages
    Default: 20

--quiet
    less verbose - only error and critical messages

--cite
    Show citation for the framework

--manual
    Show the script manual

--version
    show program’s version number and exit