pnps-gen - pN/pS Table Generation

Overview

Calculates pN/pS values

This script calculates pN/pS using the data produced by the script snp_parser. The result table is a CSV file.

Calculate Rank pN/pS

The rank command reads the SNP information and calculates a pN/pS value for each element of a specific taxonomic rank (species, genus, family, etc.).

For example, choosing the rank genus produces a table similar to:

Prevotella,0.0001,1,1.1,0.4
Methanobrevibacter,1,0.5,0.6,0.8

A pN/pS value is calculated for each genus and each sample (4 in this case).
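A minimal sketch of reading such a table back into Python, using only the standard library. It assumes the file looks exactly like the excerpt above (taxon name first, then one value per sample, no header row); a real output file may carry a header with sample names, which this sketch does not handle. The function name read_rank_table is hypothetical.

```python
import csv
import io

# Rows copied from the example table above.
TABLE = """Prevotella,0.0001,1,1.1,0.4
Methanobrevibacter,1,0.5,0.6,0.8
"""


def read_rank_table(handle):
    """Map each taxon name to its list of per-sample pN/pS values."""
    values = {}
    for row in csv.reader(handle):
        if not row:
            continue  # skip blank lines
        taxon, *samples = row
        # An empty cell (no value for that sample) becomes None
        values[taxon] = [float(cell) if cell else None for cell in samples]
    return values


pnps = read_rank_table(io.StringIO(TABLE))
print(pnps["Prevotella"])
```

For a real file, replace the io.StringIO object with an open file handle.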

It is important to specify the taxonomic IDs to include in the calculations. By default only bacteria (taxon ID 2) are included. The IDs to use can be found by querying the taxonomy with taxon-utils get.

Calculate Gene/Rank pN/pS

The full command creates a gene/taxon table of pN/pS values. Internally this is a pandas MultiIndex DataFrame, which is written in CSV format when the script finishes. The difference from the rank command is that the pN/pS is calculated for each gene/taxon pair. By default the gene_id from the original GFF file is used (it is stored in the file generated by snp_parser). If other gene IDs need to be used, a table file can be provided, in one of two column formats.

The default in MGKit is to use UniProt gene IDs for the functions, but we may want to examine the KEGG Orthologs (KOs) instead. A table can be passed where the first column is the gene_id stored in the GFF file and the second is the KO:

Q7N6F9  K05685
Q7N6F9  K01242
G7E4F2  K05625

The Q7N6F9 gene_id is repeated because it maps to more than one KO. This format needs to be selected using the -2 option of the command.

The default type of table expected by the command has a gene ID as the first column, followed by one or more tab-separated columns with the mappings. In this format the previous table would look like this:

Q7N6F9  K05685  K01242
G7E4F2  K05625
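The relationship between the two formats can be sketched in Python. This is an illustration only, not MGKit's own code: the function name two_columns_to_default is hypothetical, and a tab is assumed as the column separator.

```python
from collections import defaultdict

# Two-column table from the example above (tab separated, repeated keys).
TWO_COLUMNS = "Q7N6F9\tK05685\nQ7N6F9\tK01242\nG7E4F2\tK05625\n"


def two_columns_to_default(text, separator="\t"):
    """Collapse repeated gene IDs into one line per gene ID."""
    mappings = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        gene_id, mapped_id = line.split(separator)
        mappings[gene_id].append(mapped_id)
    # dicts keep insertion order, so gene IDs stay in first-seen order
    return "\n".join(
        separator.join([gene_id, *mapped_ids])
        for gene_id, mapped_ids in mappings.items()
    )


default_table = two_columns_to_default(TWO_COLUMNS)
print(default_table)
```

Either format carries the same information, so the choice mostly depends on how the mapping file was produced.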

These tables can be created from the original GFF file, assuming that mappings to KOs or EC numbers are included, with a command line like this:

edit-gff view -a gene_id -a map_KO final.contigs-a3.gff.gz | tr ',' '\t'

This extracts the KOs (which are comma separated in a MGKit GFF file) and changes every comma to a tab. The resulting table can be passed to the script, making it possible to calculate the pN/pS for the KOs associated with the genes. Only gene IDs present in this file have a calculated pN/pS.
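The comma-to-tab conversion and the effect of the gene map on which genes are kept can be sketched in Python. This is a rough illustration under stated assumptions, not the script's implementation: the input lines are assumed to be a gene ID, a tab, then comma-separated KOs; load_gene_map is a hypothetical helper; and A0A000 is a made-up gene ID used only to show the filtering.

```python
def load_gene_map(lines, separator="\t"):
    """Read a default-format gene map: gene ID, then one or more mapped IDs."""
    gene_map = {}
    for line in lines:
        fields = line.rstrip("\n").split(separator)
        if len(fields) < 2:
            continue  # no mapping on this line
        gene_id, *mapped_ids = fields
        gene_map.setdefault(gene_id, []).extend(m for m in mapped_ids if m)
    return gene_map


# Assumed shape of the extracted lines before the comma-to-tab step
raw_lines = ["Q7N6F9\tK05685,K01242", "G7E4F2\tK05625"]

# Same effect as replacing every comma with a tab
converted = [line.replace(",", "\t") for line in raw_lines]
gene_map = load_gene_map(converted)

# Only gene IDs present in the map are kept; A0A000 is hypothetical
gene_ids = ["Q7N6F9", "A0A000"]
kept = [gene_id for gene_id in gene_ids if gene_id in gene_map]
print(gene_map)
print(kept)
```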

Changes

New in version 0.5.0.

Options

pnps-gen

Main function

pnps-gen [OPTIONS] COMMAND [ARGS]...

Options

--version

Show the version and exit.

--cite

full

Calculates pN/pS

pnps-gen full [OPTIONS] [TXT_FILE]

Options

-v, --verbose
-t, --taxonomy <taxonomy>

Required Taxonomy file

-s, --snp-data <snp_data>

Required SNP data, output of snp_parser

-r, --rank <rank>

Taxonomic rank

Options: superkingdom|kingdom|phylum|class|subclass|order|family|genus|species

-m, --min-num <min_num>

Minimum number of samples with a pN/pS to accept

Default: 2

-c, --min-cov <min_cov>

Minimum coverage for SNPs to be accepted

Default: 4

-i, --taxon_ids <taxon_ids>

Taxon IDs to include

Default: 2

-g, --gene-map <gene_map>

Dictionary to map gene_id to another ID

-2, --two-columns

gene-map is a two columns table with repeated keys

-p, --separator <separator>

column separator for gene-map file

Default:

Arguments

TXT_FILE

Optional argument

rank

Calculates pN/pS for a taxonomic rank

pnps-gen rank [OPTIONS] [TXT_FILE]

Options

-v, --verbose
-t, --taxonomy <taxonomy>

Required Taxonomy file

-s, --snp-data <snp_data>

Required SNP data, output of snp_parser

-r, --rank <rank>

Taxonomic rank

Default: order

Options: superkingdom|kingdom|phylum|class|subclass|order|family|genus|species

-m, --min-num <min_num>

Minimum number of samples with a pN/pS to accept

Default: 2

-c, --min-cov <min_cov>

Minimum coverage for SNPs to be accepted

Default: 4

-i, --taxon_ids <taxon_ids>

Taxon IDs to include

Default:2

Arguments

TXT_FILE

Optional argument
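The options listed for the rank command can be assembled into a command line from Python. The following is a minimal sketch: the file names taxonomy.pickle and snp_data.pickle are hypothetical placeholders, build_rank_command is not part of MGKit, and the optional TXT_FILE argument is left out. The command is only printed here, not executed.

```python
import shlex


def build_rank_command(taxonomy, snp_data, rank="order",
                       min_num=2, min_cov=4, taxon_id=2):
    """Build the argument list for 'pnps-gen rank' from the documented options."""
    return [
        "pnps-gen", "rank",
        "--taxonomy", taxonomy,
        "--snp-data", snp_data,
        "--rank", rank,
        "--min-num", str(min_num),
        "--min-cov", str(min_cov),
        "--taxon_ids", str(taxon_id),
    ]


# Hypothetical input file names; the other values are the documented defaults
command = build_rank_command("taxonomy.pickle", "snp_data.pickle", rank="genus")
print(shlex.join(command))
```

The resulting list can be handed to subprocess.run(command, check=True) once the input files exist.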