MGKit GFF Specifications

GFF files produced by MGKit follow the conventions of GFF/GTF files, but provide some additional fields in the 9th column. These fields are translated into a Python dictionary when an annotation is loaded into an Annotation instance.

The 9th column is a list of key=value items, separated by semicolons (;). Each value is expected to be quoted with double quotes, and values must not include a semicolon or other characters that can make parsing difficult. MGKit uses urllib.quote() to encode those characters and also " ()/". The function mgkit.io.gff.from_gff() uses urllib.unquote() to decode the values.
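The encoding described above can be sketched with the Python standard library alone. This is not MGKit's implementation: the helper names encode_attributes and decode_attributes are hypothetical, urllib.parse.quote() and urllib.parse.unquote() are the Python 3 equivalents of urllib.quote() and urllib.unquote(), and percent-encoding every reserved character (safe='') is an assumption made to keep the example simple.

```python
from urllib.parse import quote, unquote


def encode_attributes(attributes):
    """Serialise a dict into a key="value";key="value" string (hypothetical helper)."""
    items = []
    for key, value in attributes.items():
        # safe='' percent-encodes every reserved character, including ';', '"' and '='
        items.append('{}="{}"'.format(key, quote(str(value), safe='')))
    return ';'.join(items)


def decode_attributes(column):
    """Parse a 9th column string back into a dict of decoded strings (hypothetical helper)."""
    attributes = {}
    for item in column.split(';'):
        if not item:
            continue
        key, value = item.split('=', 1)
        # Remove the surrounding double quotes, then undo the percent-encoding
        attributes[key.strip()] = unquote(value.strip().strip('"'))
    return attributes


# Illustrative values only; the taxon name contains characters that need encoding
column = encode_attributes({'gene_id': 'K00001', 'taxon_name': 'Prevotella; sp. "A"'})
decoded = decode_attributes(column)
```

Because the semicolon and the double quotes inside the value are percent-encoded, splitting the column on ; and stripping the outer quotes is enough to recover the original strings.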

Warning

As the last column translates to a dictionary in the data structures, duplicate keys are not allowed. mgkit.io.gff.from_gff() raises an exception if any are found.
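A minimal sketch of how duplicate keys could be detected while building the dictionary is shown here. The function name parse_unique_attributes is hypothetical, and since the text does not say which exception mgkit.io.gff.from_gff() raises, ValueError is used purely as an assumption.

```python
def parse_unique_attributes(column):
    """Parse key="value" pairs, refusing duplicate keys (hypothetical helper)."""
    attributes = {}
    for item in column.split(';'):
        if not item:
            continue
        key, _, value = item.partition('=')
        key = key.strip()
        if key in attributes:
            # Assumption: the real exception type used by MGKit is not stated in the text
            raise ValueError('duplicate key in 9th column: {}'.format(key))
        attributes[key] = value.strip().strip('"')
    return attributes


try:
    parse_unique_attributes('gene_id="K00001";taxon_id="838";gene_id="K00002"')
    duplicate_found = False
except ValueError:
    duplicate_found = True
```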

Reserved Values

Any key can be added to a GFF annotation, but MGKit expects a few keys to be present, as summarised in the following tables.

Reserved values, used by the scripts
Key                  Value                                             Explanation
gene_id              any string                                        used to identify the predicted gene
db                   any string, like UNIPROT-SP, UNIPROT-TR, NCBI-NT  identifies the database used to make the gene_id prediction
taxon_db             any string, like UNIPROT-SP, UNIPROT-TR, NCBI-NT  identifies the database used to make the taxon_id prediction
dbq                  integer                                           identifies the quality of the database, used when filtering annotations
taxon_id             integer                                           identifies the annotation taxon; the NCBI taxonomy is used
uid                  string                                            unique identifier for the annotation; any string is accepted, but a value is assigned using uuid.uuid4()
cov and {any}_cov    integer                                           coverage for the annotation over all samples; keys ending with _cov indicate coverage for each sample
exp_syn, exp_nonsyn  integer                                           expected number of synonymous and non-synonymous changes for the annotation
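To make the reserved keys more concrete, the sketch below builds the attributes of one annotation as a plain dictionary. All values are invented for illustration, the sample names Sample1 and Sample2 and the helper passes_quality are hypothetical, and treating a higher dbq as better quality is an assumption that the table does not confirm.

```python
import uuid

# Illustrative attribute dictionary for a single annotation (all values made up)
annotation_attrs = {
    'gene_id': 'K00001',
    'db': 'UNIPROT-SP',
    'taxon_db': 'NCBI-NT',
    'dbq': 10,
    'taxon_id': 838,
    'uid': str(uuid.uuid4()),  # random unique identifier
    'cov': 12,                 # coverage over all samples
    'Sample1_cov': 7,          # per-sample coverage, keys end with _cov
    'Sample2_cov': 5,
    'exp_syn': 3,
    'exp_nonsyn': 9,
}

# Collect per-sample coverage; 'cov' itself does not end with '_cov' and is skipped
sample_cov = {
    key[:-len('_cov')]: value
    for key, value in annotation_attrs.items()
    if key.endswith('_cov')
}


def passes_quality(attrs, min_dbq):
    """Keep an annotation if its dbq is at least min_dbq (assumes higher is better)."""
    return int(attrs.get('dbq', 0)) >= min_dbq
```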

The following keys are added by different scripts and may be used in different scripts or annotation methods.

Interpreted Values
Key           Value                   Explanation                                                    Used
taxon_name    string                  name of the taxon                                              not used
lineage       string                  taxon lineage                                                  not used
EC            comma-separated values  list of EC numbers associated with the annotation              used by mgkit.io.gff.Annotation.get_ec()
map_{any}     comma-separated values  list of mappings to a specific db (e.g. eggNOG -> map_EGGNOG)  used by mgkit.io.gff.Annotation.get_mapping()
counts_{any}  float                   stores the count data for a sample (e.g. counts_Sample1)       used by script add-gff-info
fpkms_{any}   float                   stores the FPKM data for a sample (e.g. fpkms_Sample1)         used by script add-gff-info
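The interpreted keys can be read with simple string handling, as in the following sketch. It does not call mgkit.io.gff.Annotation.get_ec() or get_mapping(), whose return types are not described here; the helper split_values is hypothetical, the values are invented, and storing every value as a string after decoding is an assumption.

```python
# Illustrative attributes as they might look after the 9th column is decoded
attrs = {
    'EC': '1.1.1.1,2.7.7.6',
    'map_EGGNOG': 'COG0001,COG0002',
    'counts_Sample1': '12.0',
    'fpkms_Sample1': '3.5',
}


def split_values(value):
    """Split a comma-separated value into a list, dropping empty items."""
    return [item for item in value.split(',') if item]


# EC numbers associated with the annotation
ec_numbers = split_values(attrs['EC'])

# Mappings keyed by database name, taken from the part after 'map_'
mappings = {
    key[len('map_'):]: split_values(value)
    for key, value in attrs.items()
    if key.startswith('map_')
}

# Per-sample counts, converted to float
counts = {
    key[len('counts_'):]: float(value)
    for key, value in attrs.items()
    if key.startswith('counts_')
}
```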