snp_parser - SNPs analysis


The workflow starts with a number of alignments passed to the SNP calling software, which produces one VCF file per alignment/sample. SNPDat then uses these VCF files, along with a GTF file and the reference genome, to annotate the SNPs they contain with synonymous/non-synonymous information.

All VCF files are merged into a single VCF that includes information about all the SNPs called across all samples. This merged VCF is passed, along with the results from SNPDat and the GFF file, to snp_parser, which integrates information from all data sources and outputs files in a format that can be used later by the rest of the pipeline. [1]


The GFF file passed to the parser must include per-sample coverage information.
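A quick way to check this requirement before running the parser is sketched below. It assumes that each annotation stores coverage in an attribute named as the sample ID followed by the coverage suffix (the default of the -c option, "_cov"), and that attributes in column 9 are written as key=value pairs separated by semicolons. The function name missing_coverage and the example line are hypothetical and not part of the script.

```python
def missing_coverage(gff_line, sample_ids, cov_suff="_cov"):
    """Return the sample IDs that have no coverage attribute in a GFF line.

    Assumption: coverage is stored as <sample_id><cov_suff>, e.g. S1_cov.
    """
    # Column 9 (index 8) of a tab-separated GFF line holds the attributes
    attributes = gff_line.rstrip("\n").split("\t")[8]
    keys = set()
    for field in attributes.split(";"):
        if "=" in field:
            key, _ = field.split("=", 1)
            keys.add(key.strip())
    return [s for s in sample_ids if s + cov_suff not in keys]


# Hypothetical annotation with coverage for samples S1 and S2 only
line = 'contig1\tsketch\tCDS\t1\t300\t.\t+\t0\tgene_id="g1";S1_cov="12";S2_cov="7"'
print(missing_coverage(line, ["S1", "S2", "S3"]))  # prints ['S3']
```

An empty list means every requested sample has a coverage attribute on that line.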

[1]This step is done separately because it is time consuming, and running it on its own can help to parallelise later steps.

Script Reference

This script parses the results of SNP analysis from any SNP calling tool [2] and integrates them into a format that can be used later by other scripts in the pipeline.

It combines coverage, the expected number of synonymous/non-synonymous changes and taxonomy from a GFF file with SNP data from a VCF file.


The script accepts gzipped VCF files.
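For other scripts that need to read the same VCF files, the following is a minimal sketch of handling both plain and gzipped input. It is not taken from snp_parser and may differ from how the script detects compression. The helper name open_vcf is hypothetical.

```python
import gzip


def open_vcf(path):
    """Open a VCF file as text, whether or not it is gzip-compressed."""
    # gzip streams always start with the two magic bytes 0x1f 0x8b
    with open(path, "rb") as handle:
        magic = handle.read(2)
    if magic == b"\x1f\x8b":
        return gzip.open(path, "rt")
    return open(path, "rt")
```

Checking the magic bytes instead of the file extension means a compressed file without a .gz suffix is still read correctly.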

[2]The GATK pipeline was tested, but it is also possible to use samtools and bcftools.


Changed in version 0.2.1: added -s option for VCF files generated using bcftools

Changed in version 0.1.16: reworked internals and removed SNPDat, syn/nonsyn evaluation is internal

Changed in version 0.1.13: reworked the internals and the classes used, including options -m and -s


SNPs analysis, requires a vcf file and SNPDat results

usage: snp_parser [-h] [-o OUTPUT_FILE] [-q MIN_QUAL] [-f MIN_FREQ]
                  [-r MIN_READS] -g GFF_FILE -p VCF_FILE -a REFERENCE -m
                  SAMPLES_ID [-c COV_SUFF] [-s] [-v | --quiet] [--cite]
                  [--manual] [--version]
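As an illustration of the usage line above, the sketch below assembles a command line from Python using only the flags listed in it. The file names and sample IDs are hypothetical. It also assumes that several samples are given by repeating -m once per sample, which the usage line does not state, so check the output of snp_parser -h before relying on it.

```python
import shlex


def build_command(gff, vcf, reference, sample_ids, output="snp_data.pickle",
                  min_qual=30, min_freq=0.01, min_reads=4, bcftools=False):
    """Build the argument list for a snp_parser run (hypothetical helper)."""
    cmd = [
        "snp_parser",
        "-o", output,
        "-q", str(min_qual),
        "-f", str(min_freq),
        "-r", str(min_reads),
        "-g", gff,
        "-p", vcf,
        "-a", reference,
    ]
    # Assumption: -m is repeated once for each sample ID
    for sample_id in sample_ids:
        cmd += ["-m", sample_id]
    # -s marks a VCF produced with bcftools call
    if bcftools:
        cmd.append("-s")
    return cmd


cmd = build_command("assembly.gff", "merged.vcf.gz", "assembly.fa", ["S1", "S2"])
print(shlex.join(cmd))
```

The list form can be passed directly to subprocess.run, which avoids shell quoting problems with unusual file names.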

Named Arguments

-o, --output-file

Output file

Default: snp_data.pickle

-q, --min-qual

Minimum SNP quality (Phred score)

Default: 30

-f, --min-freq

Minimum allele frequency

Default: 0.01

-r, --min-reads

Minimum number of reads to accept the SNP

Default: 4

-g, --gff-file

GFF file with annotations

-p, --vcf-file

Merged VCF file

-a, --reference

Fasta file with the GFF Reference

-m, --samples-id

the ids of the samples used in the analysis

-c, --cov-suff

Per sample coverage suffix in the GFF

Default: “_cov”

-s, --bcftools-vcf

bcftools call was used to produce the VCF file

Default: False

-v, --verbose

more verbose - includes debug messages

Default: 20

--quiet

less verbose - only error and critical messages

--cite

Show citation for the framework

--manual

Show the script manual

--version

show program’s version number and exit
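To show how the three filtering options (-q, -f and -r) could interact, here is a minimal sketch using their documented defaults. How the script applies them internally is not described here, so the logic is an assumption: thresholds are treated as inclusive, the frequency is taken as alternative reads over total reads, and the read threshold is applied to the reads supporting the SNP. The names Call and passes are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Call:
    """A simplified SNP call for one sample (hypothetical structure)."""
    qual: float        # Phred-scaled quality of the SNP
    alt_reads: int     # reads supporting the alternative allele
    total_reads: int   # all reads covering the position


def passes(call, min_qual=30, min_freq=0.01, min_reads=4):
    """Return True if the call meets all three thresholds (assumed inclusive)."""
    if call.total_reads == 0:
        return False
    freq = call.alt_reads / call.total_reads
    return (
        call.qual >= min_qual
        and freq >= min_freq
        and call.alt_reads >= min_reads
    )


calls = [Call(50, 10, 100), Call(20, 10, 100), Call(50, 3, 100), Call(50, 5, 1000)]
kept = [c for c in calls if passes(c)]
```

Here only the first call is kept: the second fails on quality, the third on the number of reads and the fourth on allele frequency.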