mgkit.workflow.download_profiles module

Overview

This script downloads sequence data for each gene of interest (ortholog) in all the specified taxa. The downloaded files can then be used to create HMMER profiles for similarity searches against amino acid or nucleotide sequences.

Limitations

At the moment, the script uses Kegg Orthologs as the ortholog database.

Warning

Some taxa may still be blacklisted because they are not relevant to the rumen microbiome. If this affects you, please contact me or open an issue on the repository.

Required Data

The script requires data from Kegg Orthologs and Uniprot to be downloaded before it can be used. The download_data script (see download-data - Download Taxonomy from NCBI) automates the process.

Workflow for Custom Profiles

[Figure: workflow diagram. Nodes: AA Sequences, Nucleotide Sequences, Alignment Files, Custom Profiles, Results. Steps: download_profiles, Alignment (clustalo), hmmbuild, translate_seq, hmmscan, nhmmscan.]

Building the profiles to be used with HMMER involves several tasks (illustrated in the Workflow above):

  1. download of data
  2. alignment of sequences
  3. conversion into HMMER profiles.

The first step is to download, for every ortholog gene, all the sequences available for each taxon of interest at the chosen level: this produces a series of files, each containing the amino acid sequences for one gene-taxon pair. This script, download_profiles, can be used for that. The downloaded amino acid sequences are then aligned using Clustal Omega (or another aligner), and a profile is built from each alignment.
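
The alignment and profile-building steps can be sketched as a pair of command lines per downloaded file. This is only an illustration: the function name, the file naming and the output directory are assumptions, not something this script produces or requires. The only tool options used are Clustal Omega's -i/-o and hmmbuild's "profile file, then alignment file" argument order.

```python
from pathlib import Path


def profile_commands(fasta_path, out_dir):
    """Return the (align, build) commands for one gene-taxon FASTA file.

    Hypothetical helper: file names and layout are assumptions.
    """
    fasta_path = Path(fasta_path)
    out_dir = Path(out_dir)
    # The alignment and the profile reuse the stem of the input file name
    alignment = out_dir / (fasta_path.stem + ".afa")
    profile = out_dir / (fasta_path.stem + ".hmm")
    # Clustal Omega: -i is the input sequences, -o the output alignment
    align_cmd = ["clustalo", "-i", str(fasta_path), "-o", str(alignment)]
    # hmmbuild takes the output profile first, then the alignment
    build_cmd = ["hmmbuild", str(profile), str(alignment)]
    return align_cmd, build_cmd
```

Each returned list can be passed to subprocess.run(cmd, check=True), running the alignment before the profile build.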

HMMER requires amino acid sequences to be matched against the profiles. The translate_seq script can be used to translate nucleotide sequences into amino acid ones. However, the latest version of HMMER should be able to match nucleotide sequences, although we have not tested this. The example Workflow above illustrates that.
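
To make the translation step concrete, the following is a minimal sketch of translating a nucleotide sequence in one reading frame with the standard genetic code. It is not the implementation of translate_seq; the names translate, BASES and CODON_TABLE are chosen for this example, and unknown codons are mapped to X as an assumption.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(nuc_seq, frame=0):
    """Translate a nucleotide sequence starting at offset `frame` (0, 1 or 2)."""
    # Accept RNA input by mapping U to T, then skip to the reading frame
    seq = nuc_seq.upper().replace("U", "T")[frame:]
    # Ignore a trailing partial codon
    usable = len(seq) - len(seq) % 3
    # Codons with ambiguous bases are not in the table and become "X"
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, usable, 3)
    )
```

A metagenomic read usually has no known frame, so in practice the three offsets would be translated on both the sequence and its reverse complement.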

Building profiles in this way, by going through all ortholog genes and choosing the desired taxon level, makes it possible to refine the profiling of a metagenome incrementally without rerunning all the profiles, as only the new ones need to be run. Filtering all the results is a much faster operation.

Usage

The default behaviour is to download all Kegg Orthologs for all taxa in the given taxonomy. Taxa can be filtered by both lineage (e.g. archaea, carnivora, etc.) and rank (e.g. genus, family, etc.). Another option is to specify the KO and taxa IDs to download.

Taxa Filters

Taxa are selected through a few different rules:

  • specific taxon ids in uniprot
  • a specific taxon rank (e.g.: genus, phylum, etc.)
  • optional lineage filter: the lineage filter makes sure that the name specified is included in the lineage attribute in the taxonomy.

As an example, if the rank chosen is genus, and the lineage option is set to archaea, only the taxa whose rank is genus and that belong to the archaea subtree will be downloaded:

$ download_profiles -m EMAIL -r genus -l archaea mg_data/kegg.pickle \
-t mg_data/taxonomy.pickle

This makes it possible to customise the level of specificity wanted in the profiling and makes the download faster. For metagenomic data, a good start is to mix different taxon ranks, using order or genus for the genes and then specifying a lineage of interest.

Because the profiles are independent of each other, it’s useful to start the download with a certain rank and then run the profiling. While the profiling runs, a new download can be started, and so on.
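
The combination of rank and lineage filters described in this section can be pictured with a toy taxonomy. This is a sketch under stated assumptions: the dictionary layout, the made-up taxon ids and the functions lineage_names and filter_taxa are hypothetical and do not reflect the taxonomy class used by the script.

```python
# Hypothetical toy taxonomy with made-up ids:
# taxon id -> (name, rank, parent id or None)
TOY_TAXONOMY = {
    1: ("archaea", "superkingdom", None),
    2: ("methanobrevibacter", "genus", 1),
    3: ("bacteria", "superkingdom", None),
    4: ("ruminococcus", "genus", 3),
    5: ("methanobrevibacter smithii", "species", 2),
}


def lineage_names(taxonomy, taxon_id):
    """Return the names of all ancestors of a taxon, nearest first."""
    names = []
    parent = taxonomy[taxon_id][2]
    while parent is not None:
        name, _, parent = taxonomy[parent]
        names.append(name)
    return names


def filter_taxa(taxonomy, rank=None, lineage=None):
    """Return the sorted ids that match the rank and lineage filters."""
    selected = []
    for taxon_id, (_, taxon_rank, _) in sorted(taxonomy.items()):
        # Rank filter: keep only taxa of the requested rank
        if rank is not None and taxon_rank != rank:
            continue
        # Lineage filter: the name must appear among the ancestors
        if lineage is not None and lineage not in lineage_names(taxonomy, taxon_id):
            continue
        selected.append(taxon_id)
    return selected
```

With this toy data, asking for rank genus and lineage archaea keeps only the genus placed under archaea, which mirrors the -r genus -l archaea example above.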

Specific Genes and Taxa

It is possible to download only specific taxa and KOs, using the -i and -ko options respectively. When -ko is used, loading Kegg Data with -k is not required, and it is up to the user to make sure that the genes and taxa are correct.

An example that downloads a single KO for 3 different taxa:

$ download_profiles -v -m EMAIL -ko K00016 -i 9611 9645 9682 \
-t mg_data/taxonomy.pickle

The same example using taxa filtering instead (at the time of writing):

$ download_profiles -v -m EMAIL -ko K00016 -r genus -l carnivora \
-t mg_data/taxonomy.pickle
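
When several KOs have to be downloaded one after another, the command lines above can be assembled programmatically. The helper below is a hypothetical sketch (build_download_command is not part of mgkit) that uses only the options shown in the two examples above and issues one command per KO.

```python
def build_download_command(email, ko_id, taxonomy_path,
                           taxon_ids=None, rank=None, lineage=None):
    """Assemble a download_profiles command line as a list of arguments."""
    cmd = ["download_profiles", "-v", "-m", email, "-ko", ko_id]
    if taxon_ids:
        # Explicit taxon ids, as in the first example
        cmd += ["-i"] + [str(taxon_id) for taxon_id in taxon_ids]
    if rank:
        # Rank filter, as in the second example
        cmd += ["-r", rank]
    if lineage:
        # Lineage filter, as in the second example
        cmd += ["-l", lineage]
    cmd += ["-t", taxonomy_path]
    return cmd
```

Each list can then be handed to subprocess.run(cmd, check=True), for instance in a loop over the KO ids of interest.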

Changes

Changed in version 0.2.1: added -ko option, resolved issues caused by changes in library

mgkit.workflow.download_profiles.add_profiles_to_length(seqs, length_data)

Adds the average profile length to the dictionary

mgkit.workflow.download_profiles.choose_ko_ids(kegg_data, options)

Returns the list of mapping ID->Name according to the options passed

mgkit.workflow.download_profiles.choose_taxa(taxonomy, options)

Returns the list of ids to look for in Uniprot

mgkit.workflow.download_profiles.download_ko_sequences(ko_id, taxon_ids, reviewed, contact)

Downloads the sequences associated with all taxon IDs provided

mgkit.workflow.download_profiles.filter_found_taxa(taxon_ids_found, taxon_ids, taxonomy)

Filters the taxa found in Uniprot, making sure that they are at a lower level than those requested
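
One plausible reading of this check is sketched below; it is not the library's code. It assumes a hypothetical mapping from each taxon id to its parent id and keeps a found taxon only if one of the requested taxa is among its ancestors. Whether a taxon equal to a requested one should also be kept is not stated here, so the sketch excludes it.

```python
def is_descendant(parents, taxon_id, ancestor_ids):
    """Return True if any ancestor of taxon_id is in ancestor_ids."""
    current = parents.get(taxon_id)
    while current is not None:
        if current in ancestor_ids:
            return True
        current = parents.get(current)
    return False


def keep_lower_level(parents, found_ids, requested_ids):
    """Keep the found ids that sit strictly below a requested taxon."""
    requested = set(requested_ids)
    return [
        taxon_id for taxon_id in found_ids
        if is_descendant(parents, taxon_id, requested)
    ]
```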

mgkit.workflow.download_profiles.filter_taxonomy_by_lineage(taxonomy, taxon_ids, lineage)

mgkit.workflow.download_profiles.filter_taxonomy_by_rank(taxonomy, taxon_ids, rank)

mgkit.workflow.download_profiles.load_data(taxon_data, length_data_name)

Loads data for script

mgkit.workflow.download_profiles.main()

Main function

mgkit.workflow.download_profiles.map_ko_to_uniprot(ko_id, taxon_ids, reviewed, contact)

Returns the taxon IDs found in Uniprot for a specific id

mgkit.workflow.download_profiles.set_parser()

Argument parser configuration

mgkit.workflow.download_profiles.write_ko_sequences(seqs, taxonomy, output_dir)

Writes FASTA sequences to disk