snp_parser - SNPs analysis

Overview

The workflow starts with a number of alignments passed to the SNP calling software, which produces one VCF file per alignment/sample. SNPDat then uses these VCF files, along with a GTF file and the reference genome, to annotate the SNPs in the VCF files with synonymous/non-synonymous information.

All VCF files are merged into a single VCF that includes information about all the SNPs called across all samples. This merged VCF is passed, along with the results from SNPDat and the GFF file, to snp_parser.py, which integrates the information from all data sources and writes its output in a format that can later be used by the rest of the pipeline. [1]

Note

The GFF file passed to the parser must have per sample coverage information.
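As a minimal sketch of what this requirement means in practice, the check below looks for one coverage attribute per sample in the ninth column of a GFF line. It assumes that the attribute name is the sample ID followed by the coverage suffix (`_cov` by default, see the `-c` option) and that attributes are written as `key="value"` pairs separated by semicolons. The function names and the example annotation are hypothetical and are not part of the script.

```python
def parse_attributes(gff_line):
    """Return the attribute column of a GFF line as a dictionary."""
    # The attributes are the ninth tab-separated column
    column = gff_line.rstrip("\n").split("\t")[8]
    attributes = {}
    for pair in column.split(";"):
        if "=" not in pair:
            continue
        key, value = pair.split("=", 1)
        attributes[key.strip()] = value.strip().strip('"')
    return attributes


def missing_coverage(gff_line, sample_ids, cov_suffix="_cov"):
    """List the samples that have no coverage attribute in this line."""
    # Assumption: the attribute is named <sample id><suffix>
    attributes = parse_attributes(gff_line)
    return [s for s in sample_ids if s + cov_suffix not in attributes]


# Hypothetical annotation with coverage for two of three samples
line = "\t".join([
    "contig1", "BLAST", "CDS", "1", "300", ".", "+", "0",
    'gene_id="K00001";sample1_cov="12";sample2_cov="7"',
])
print(missing_coverage(line, ["sample1", "sample2", "sample3"]))
```

Running a check of this kind over the whole GFF file before calling the parser can catch a missing coverage step early.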

[1]	This step is done separately because it is time consuming and because doing so can help to parallelise later steps

Script Reference

This script parses the results of a SNP analysis from any SNP calling tool [2] and integrates them into a format that can later be used by other scripts in the pipeline.

It integrates coverage, the expected number of synonymous/non-synonymous changes and taxonomy from a GFF file with SNP data from a VCF file.
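To illustrate what synonymous and non-synonymous changes are, the sketch below classifies a single-base change in a codon and counts the outcomes of all nine possible single-base changes. It uses the standard genetic code and counts a change to or from a stop codon as non-synonymous. These are assumptions made for the example, as are the function names, and this is not necessarily how the script computes its expected values.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def is_synonymous(codon, position, alt_base):
    """True if changing codon[position] to alt_base keeps the amino acid."""
    changed = codon[:position] + alt_base + codon[position + 1:]
    return CODON_TABLE[codon] == CODON_TABLE[changed]


def expected_changes(codon):
    """Count (synonymous, non-synonymous) single-base changes of a codon."""
    syn = 0
    nonsyn = 0
    for position in range(3):
        for base in BASES:
            if base == codon[position]:
                continue  # not a change
            if is_synonymous(codon, position, base):
                syn += 1
            else:
                nonsyn += 1
    return syn, nonsyn


print(is_synonymous("CTT", 2, "C"))  # CTT and CTC both code for leucine
print(expected_changes("CTT"))
print(expected_changes("ATG"))
```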

Note

The script accepts gzipped VCF files.
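One common way to read both plain and gzipped files with the same code is to check the first two bytes of the file. The sketch below shows that approach. It is an assumption about how such support can be implemented, not a description of the script's internals, and `open_vcf` and `count_records` are hypothetical helper names.

```python
import gzip


def open_vcf(path):
    """Open a VCF file for reading as text, whether it is gzipped or not."""
    # Gzip files start with the two magic bytes 0x1f 0x8b
    with open(path, "rb") as handle:
        is_gzipped = handle.read(2) == b"\x1f\x8b"
    if is_gzipped:
        return gzip.open(path, "rt")
    return open(path, "r")


def count_records(path):
    """Count the non-header, non-empty lines of a VCF file."""
    with open_vcf(path) as handle:
        return sum(
            1 for line in handle
            if line.strip() and not line.startswith("#")
        )
```

With these helpers, `count_records("merged.vcf")` and `count_records("merged.vcf.gz")` would go through the same code path (the file names are examples only).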

[2]	The GATK pipeline was tested, but it is possible to use samtools and bcftools

Changes

Changed in version 0.2.1: added -s option for VCF files generated using bcftools

Changed in version 0.1.16: reworked internals and removed SNPDat; syn/nonsyn evaluation is now internal

Changed in version 0.1.13: reworked the internals and the classes used, including options -m and -s

Options

SNPs analysis, requires a vcf file and SNPDat results

usage: snp_parser [-h] [-o OUTPUT_FILE] [-q MIN_QUAL] [-f MIN_FREQ]
                  [-r MIN_READS] -g GFF_FILE -p VCF_FILE -a REFERENCE -m
                  SAMPLES_ID [-c COV_SUFF] [-s] [-v | --quiet] [--cite]
                  [--manual] [--version]
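The usage line above can be turned into a call from Python, for example when the parser is run as one step of a larger workflow. The sketch below only builds the argument list. It assumes that `-m` is repeated once per sample, which the usage line does not state, and the file names, sample IDs and the helper `build_snp_parser_command` are hypothetical.

```python
import shlex


def build_snp_parser_command(gff_file, vcf_file, reference, sample_ids,
                             output_file="snp_data.pickle", bcftools=False):
    """Return the argument list for a snp_parser call."""
    command = [
        "snp_parser",
        "-o", output_file,
        "-g", gff_file,
        "-p", vcf_file,
        "-a", reference,
    ]
    # Assumption: -m is given once per sample
    for sample_id in sample_ids:
        command.extend(["-m", sample_id])
    if bcftools:
        # -s marks a VCF file produced with bcftools call
        command.append("-s")
    return command


command = build_snp_parser_command(
    "assembly.gff", "merged.vcf.gz", "assembly.fa", ["sample1", "sample2"]
)
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` to run the script.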

Named Arguments

`-o, --output-file`
	Output file Default: snp_data.pickle
`-q, --min-qual`	Minimum SNP quality (Phred score) Default: 30
`-f, --min-freq`	Minimum allele frequency Default: 0.01
`-r, --min-reads`
	Minimum number of reads to accept the SNP Default: 4
`-g, --gff-file`	GFF file with annotations
`-p, --vcf-file`	Merged VCF file
`-a, --reference`
	Fasta file with the GFF Reference
`-m, --samples-id`
	the ids of the samples used in the analysis
`-c, --cov-suff`	Per sample coverage suffix in the GFF Default: “_cov”
`-s, --bcftools-vcf`
	bcftools call was used to produce the VCF file Default: False
`-v, --verbose`	more verbose - includes debug messages Default: 20
`--quiet`	less verbose - only error and critical messages
`--cite`	Show citation for the framework
`--manual`	Show the script manual
`--version`	show program’s version number and exit
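The three thresholds (`-q`, `-f` and `-r`) can be read as a filter applied to each SNP call. The sketch below shows one plausible interpretation using the default values listed above. It assumes that the read threshold applies to the total number of reads covering the position, that the frequency is the fraction of those reads supporting the alternative allele, and that all thresholds are inclusive. The script may define these quantities differently, and `SnpCall` and `passes_filters` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class SnpCall:
    """Hypothetical, simplified view of one SNP call in one sample."""
    qual: float       # Phred-scaled quality of the call
    alt_reads: int    # reads supporting the alternative allele
    total_reads: int  # reads covering the position


def passes_filters(call, min_qual=30, min_freq=0.01, min_reads=4):
    """Apply the quality, depth and frequency thresholds to one call."""
    if call.qual < min_qual:
        return False
    if call.total_reads < min_reads:
        return False
    if call.total_reads == 0:
        # Avoid a division by zero when min_reads is set to 0
        return False
    return call.alt_reads / call.total_reads >= min_freq


calls = [
    SnpCall(qual=50, alt_reads=10, total_reads=100),  # passes
    SnpCall(qual=20, alt_reads=10, total_reads=100),  # quality too low
    SnpCall(qual=50, alt_reads=2, total_reads=3),     # too few reads
    SnpCall(qual=50, alt_reads=1, total_reads=1000),  # frequency too low
]
print([passes_filters(call) for call in calls])
```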