snp_parser - SNPs analysis

Overview

[Workflow diagram. Nodes: Alignments, Assembly, SNPs Calling, VCF files, VCF Merge, Add Information, GFF, snp_parser]

The workflow starts with a number of alignments passed to the SNP calling software, which produces one VCF file per alignment/sample. SNPDat then uses these VCF files, along with a GTF file and the reference genome, to add synonymous/non-synonymous information to the SNPs in the VCF files.

All VCF files are merged into a single VCF that includes all the SNPs called across all samples. This merged VCF is passed, along with the results from SNPDat and the GFF file, to snp_parser.py, which integrates the information from all data sources and writes its output in a format that can be used later by the rest of the pipeline. [1]

Note

The GFF file passed to the parser must have per sample coverage information.
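
As a rough illustration of this requirement, the sketch below checks one GFF line for per sample coverage attributes. It assumes that each coverage attribute is named as the sample ID followed by the coverage suffix (for example S1_cov), which is inferred from the -m and -c options and not confirmed here. The function name, the sample IDs and the example line are hypothetical.

```python
def missing_coverage(gff_line, sample_ids, cov_suffix="_cov"):
    """Return the sample IDs that have no coverage attribute on a GFF line."""
    # Column 9 of a GFF line holds the attributes as key=value pairs
    attributes = gff_line.rstrip("\n").split("\t")[8]
    keys = {
        pair.split("=", 1)[0].strip()
        for pair in attributes.split(";")
        if "=" in pair
    }
    # Assumption: the attribute name is the sample ID plus the suffix
    return [sample for sample in sample_ids if sample + cov_suffix not in keys]


# Hypothetical annotation line with coverage for S1 and S2 only
line = "contig1\tsrc\tCDS\t1\t300\t.\t+\t0\tgene_id=K00001;S1_cov=12;S2_cov=7"
print(missing_coverage(line, ["S1", "S2", "S3"]))  # ['S3']
```

A check like this can be run over the annotations before starting the parser, so that a missing sample is found early.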

[1] This step is done separately because it is time consuming and because doing so helps to parallelise later steps.

Script Reference

This script parses the results of SNP analysis from any SNP calling tool [2] and integrates them into a format that can be used later by other scripts in the pipeline.

It integrates coverage, the expected number of synonymous/non-synonymous changes and taxonomy from a GFF file with SNP data from a VCF file.

Note

The script accepts gzipped VCF files.
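
To inspect a merged VCF before passing it to the script, plain and gzipped files can be read in the same way. The sketch below is not the script's own implementation; it assumes that gzipped files end in .gz, and both function names are hypothetical.

```python
import gzip


def open_vcf(path):
    """Open a VCF file as text, handling gzip compression by file extension."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def count_records(path):
    """Count the data lines of a VCF file, skipping headers and empty lines."""
    with open_vcf(path) as handle:
        return sum(
            1 for line in handle
            if line.strip() and not line.startswith("#")
        )
```

Calling count_records on a merged VCF gives a quick check that the merge step produced a non-empty file.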

[2] The GATK pipeline was tested, but it is also possible to use samtools and bcftools.

Changes

Changed in version 0.2.1: added -s option for VCF files generated using bcftools

Changed in version 0.1.16: reworked internals and removed SNPDat; syn/nonsyn evaluation is now internal

Changed in version 0.1.13: reworked the internals and the classes used, including options -m and -s

Options

SNPs analysis, requires a vcf file and SNPDat results

usage: snp_parser [-h] [-o OUTPUT_FILE] [-q MIN_QUAL] [-f MIN_FREQ]
                  [-r MIN_READS] -g GFF_FILE -p VCF_FILE -a REFERENCE -m
                  SAMPLES_ID [-c COV_SUFF] [-s] [-v | --quiet] [--cite]
                  [--manual] [--version]
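
The usage line above can be turned into a call from Python. The sketch below only builds the argument list, using the options shown in the usage. The file names and sample IDs are hypothetical, and repeating -m once per sample is an assumption about how multiple sample IDs are passed.

```python
import shlex


def build_snp_parser_command(gff_file, vcf_file, reference, sample_ids,
                             output_file="snp_data.pickle",
                             min_qual=30, min_freq=0.01, min_reads=4,
                             bcftools_vcf=False):
    """Build the argument list for a snp_parser run."""
    command = [
        "snp_parser",
        "-o", output_file,
        "-q", str(min_qual),
        "-f", str(min_freq),
        "-r", str(min_reads),
        "-g", gff_file,
        "-p", vcf_file,
        "-a", reference,
    ]
    # Assumption: -m is given once for each sample ID
    for sample_id in sample_ids:
        command.extend(["-m", sample_id])
    # -s flags a VCF produced with bcftools call
    if bcftools_vcf:
        command.append("-s")
    return command


# Hypothetical input files and sample IDs
command = build_snp_parser_command(
    "assembly.gff", "merged.vcf.gz", "assembly.fa", ["S1", "S2"],
    bcftools_vcf=True,
)
print(shlex.join(command))
```

The resulting list can be passed to subprocess.run once the input files exist.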

Named Arguments

-o, --output-file

Output file

Default: snp_data.pickle

-q, --min-qual

Minimum SNP quality (Phred score)

Default: 30

-f, --min-freq

Minimum allele frequency

Default: 0.01

-r, --min-reads

Minimum number of reads to accept the SNP

Default: 4

-g, --gff-file

GFF file with annotations

-p, --vcf-file

Merged VCF file

-a, --reference

FASTA file with the GFF reference

-m, --samples-id

The IDs of the samples used in the analysis

-c, --cov-suff

Per sample coverage suffix in the GFF

Default: “_cov”

-s, --bcftools-vcf

bcftools call was used to produce the VCF file

Default: False

-v, --verbose

more verbose - includes debug messages

Default: 20

--quiet

Less verbose - only error and critical messages

--cite

Show citation for the framework

--manual

Show the script manual

--version

Show program’s version number and exit
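
The three thresholds (-q, -f and -r) can be read as a filter applied to each SNP. The sketch below is only an illustration of that idea with the documented defaults, not the script's actual logic. It assumes that the comparisons are inclusive, that the allele frequency is the fraction of reads carrying the alternative allele, and that the minimum number of reads refers to the total depth at the position. The function name is hypothetical.

```python
def passes_filters(qual, alt_reads, total_reads,
                   min_qual=30, min_freq=0.01, min_reads=4):
    """Return True if a SNP meets all three thresholds."""
    if total_reads <= 0:
        return False
    # Assumption: frequency is the share of reads with the alternative allele
    freq = alt_reads / total_reads
    # Assumption: thresholds are inclusive and min_reads applies to depth
    return (
        qual >= min_qual
        and freq >= min_freq
        and total_reads >= min_reads
    )


print(passes_filters(qual=50, alt_reads=5, total_reads=100))  # True
print(passes_filters(qual=20, alt_reads=5, total_reads=100))  # False
```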