download-data - Download Taxonomy from NCBI

A bash script called download-taxonomy.sh is installed with MGKit. The script downloads the required file (taxdump.tar.gz) from the NCBI FTP taxonomy directory using wget.

Note

If the file is already present, it is not downloaded again. This is handy when wget is not installed on the system by default (e.g. Mac OS X).

The file is then decompressed using tar, and a files.txt file is created for use by a simple script in the same directory. Both the directory and files.txt are deleted at the end, but taxdump.tar.gz is kept. A taxonomy file, taxonomy.pickle, is created in the same directory.
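The download-if-missing and decompression steps described above can be sketched in Python. This is only an illustration of the logic, not the content of download-taxonomy.sh, which uses wget and tar. The URL, the function names `fetch_taxdump` and `extract_taxdump`, and the `taxdump` destination directory are assumptions made for the example.

```python
import os
import tarfile
import urllib.request

# Assumed location of the archive on the NCBI server (not stated in this page).
TAXDUMP_URL = "https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz"


def fetch_taxdump(path="taxdump.tar.gz", url=TAXDUMP_URL):
    """Download the archive only if it is not already present.

    Returns True when a download happened, False when the file was reused.
    """
    if os.path.exists(path):
        return False  # an existing file is reused, nothing is downloaded
    urllib.request.urlretrieve(url, path)
    return True


def extract_taxdump(path="taxdump.tar.gz", dest="taxdump"):
    """Unpack the archive into dest and return the sorted extracted names."""
    os.makedirs(dest, exist_ok=True)
    with tarfile.open(path, "r:gz") as archive:
        archive.extractall(dest)
    return sorted(os.listdir(dest))
```

Because an existing archive is reused, a copy of taxdump.tar.gz obtained by any other means and placed in the working directory is enough for the extraction step to run.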

Download Taxa IDs from Uniprot

A script is included to download and prepare a tab-separated list of all Taxa IDs associated with Uniprot IDs, so it can be used with add-gff-info - Add informations to GFF annotations. The script is called download-uniprot-taxa.sh and is installed with MGKit. By default both SwissProt and TrEMBL IDs are downloaded, but passing either sp or SP will download only SwissProt. The output file is called uniprot-taxa

Download Taxa IDs from NCBI

A script is included to download and prepare a tab-separated list of all Taxa IDs associated with NCBI (GenBank) IDs, so it can be used with add-gff-info - Add informations to GFF annotations. The script is called download-ncbi-taxa.sh and is installed with MGKit. By default nt (nucleotide) IDs are downloaded, but passing either prot or PROT will download nr (protein) IDs. The output file is called ncbi-nucl-taxa.gz or ncbi-prot-taxa.gz, depending on the downloaded data.
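As an illustration of how such a tab-separated table might be consumed, the sketch below loads it into a dictionary. It assumes two columns per line, the sequence ID followed by a numeric taxon ID, and no header line; the column order is not stated on this page, so check the actual file first. The function name `read_taxa_table` is hypothetical and not part of MGKit.

```python
import gzip


def read_taxa_table(path):
    """Map each sequence ID to an integer taxon ID from a tab-separated file."""
    # Compressed and plain files are both handled, based on the extension.
    opener = gzip.open if path.endswith(".gz") else open
    taxa = {}
    with opener(path, "rt") as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 2:
                continue  # skip blank or malformed lines
            # Assumed column order: sequence ID, then taxon ID.
            taxa[fields[0]] = int(fields[1])
    return taxa
```

For a full nt or nr table, a dictionary like this may need a large amount of memory, so filtering to the IDs of interest while reading is worth considering.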

Download Required Data (Deprecated)

The script downloads the data that the framework uses for some of its functions. It is mostly a shortcut to call the download_data function that is present in every module of the mappings package and in the kegg module.

The only required option is the contact email of the person using the script; this is used to make sure that the Uniprot API requirements are fulfilled and that they can contact the person using the script if any problem arises.

Note

The default behavior is to download the taxonomy data from Uniprot first, followed by the Kegg and additional mapping data.

Taxonomy

The taxonomy is downloaded from Uniprot and used to build a data structure that several scripts and functions in the package rely on. The download can take some time.

Warning

If only the taxonomy is to be downloaded, both the -x and -p options must be passed to the script.
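A taxonomy-only invocation, combining -x and -p with the required -m option, could be assembled as in the sketch below. The helper names `taxonomy_only_command` and `run_taxonomy_only` are hypothetical, and the email address is a placeholder; only the flags come from the usage message of the script.

```python
import subprocess


def taxonomy_only_command(email, output_dir="mg_data"):
    """Build the argument list for a taxonomy-only download."""
    # -x: only taxonomy (no Kegg data), -p: no mappings, -m: contact email
    return ["download_data", "-o", output_dir, "-x", "-p", "-m", email]


def run_taxonomy_only(email, output_dir="mg_data"):
    """Run download_data; requires MGKit to be installed and on the PATH."""
    subprocess.run(taxonomy_only_command(email, output_dir), check=True)
```

Calling `run_taxonomy_only("user@example.org")` is equivalent to typing `download_data -o mg_data -x -p -m user@example.org` in a shell.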

Kegg Orthologs

The Kegg data is the only “required” data at the moment, because it’s used to download the sequence data (via the donwload_profiles script) for the profiling. It is the only data that can’t be saved unless it’s fully downloaded.

Kegg data is required by the mappers currently supported, and its download takes longer. The mappers handle timeouts and if exceptions are raised the data is saved and the download is resumed when the script is started again.

Other Mappings

The other mappings (from KO) are downloaded by default; this can be disabled with the -p option. Mappings for Gene Ontology, eggNOG and CaZy are downloaded.

As the download of the mappings can take a lot of time, or break because of the number of requests to the web sites, checkpoints are saved often, so it can be resumed at a later time.
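The checkpoint-and-resume idea can be sketched generically as follows. This is not the code used by the MGKit mappers; the function `download_with_checkpoints`, its `fetch` callback and the `every` interval are assumptions chosen to illustrate the technique.

```python
import os
import pickle


def download_with_checkpoints(ids, fetch, checkpoint, every=100):
    """Fetch a value for each ID, saving partial results so a rerun resumes."""
    results = {}
    if os.path.exists(checkpoint):
        with open(checkpoint, "rb") as handle:
            results = pickle.load(handle)  # resume from the last saved state

    def save():
        with open(checkpoint, "wb") as handle:
            pickle.dump(results, handle)

    # Only IDs that are missing from the checkpoint are requested again.
    pending = [item for item in ids if item not in results]
    try:
        for count, item in enumerate(pending, start=1):
            results[item] = fetch(item)
            if count % every == 0:
                save()
    finally:
        save()  # also runs when fetch raises, so progress is kept
    return results
```

If `fetch` raises (for example on a timeout), the exception still propagates to the caller, but everything retrieved so far is written to the checkpoint file and skipped on the next run.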

Warning

The Gene Ontology module has specific requirements, so if they are not downloaded

Options

SNPs analysis, requires a vcf file and SNPDat results

usage: download_data [-h] [-o OUTPUT_DIR] [-k KEGG] [-c CAZY] [-g GO]
                     [-e EGGNOG] [-t TAXONOMY] [-p] [-x] -m EMAIL
                     [-v | --quiet] [--cite] [--manual] [--version]

Named Arguments

-o, --output-dir
 

Output directory

Default: “mg_data”

-k, --kegg

Kegg data file name

Default: “kegg.pickle”

-c, --cazy

CaZy data file name

Default: “cazy.pickle”

-g, --go

Gene Ontology data file name

Default: “go.pickle”

-e, --eggnog

eggNOG data file name

Default: “eggnog.pickle”

-t, --taxonomy

Taxonomy data file name

Default: “taxonomy.pickle”

-p, --no-mappings
 

Use to not download Mapping data

Default: True

-x, --only-taxonomy
 

Use to only download Taxonomy data, no Kegg data

Default: True

-m, --email

email address to use for Uniprot communications

-v, --verbose

more verbose - includes debug messages

Default: 20

--quiet

less verbose - only error and critical messages

--cite

Show citation for the framework

--manual

Show the script manual

--version

show program’s version number and exit