get-gff-info - Extract information from GFF annotations¶

Overview¶

Extract information from GFF files

sequence command¶

Extracts the nucleotide sequences of GFF annotations. It requires the FASTA file containing the sequences referenced by the GFF seq_id attribute (the first column of the raw GFF).

The extracted sequences use the uid stored in the GFF file as their identifier. By default, a sequence is not reverse complemented when its annotation is on the - strand; the -r option changes this.

The sequences are wrapped at 60 characters, as per FASTA specs, but this behavior can be disabled by specifying the -w option.

Warning

The reference file is loaded in memory
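
The behaviour described above (slicing an annotation out of its reference, optional reverse complement on the - strand, wrapping at 60 characters) can be illustrated with a minimal Python sketch. It is not the code used by get-gff-info: the function names, the assumption of 1-based inclusive GFF coordinates, and the handling of only unambiguous DNA bases are all choices made for this illustration.

```python
# Illustrative sketch only; these helpers are hypothetical, not part of MGKit.

# Complement table for unambiguous DNA bases (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def extract_annotation(reference, start, end, strand, reverse=False):
    """Slice a 1-based, inclusive interval out of a reference sequence.

    The slice is reverse complemented only when `reverse` is True and the
    annotation is on the - strand (mirroring what the -r option is described
    as doing).
    """
    seq = reference[start - 1:end]
    if reverse and strand == "-":
        seq = reverse_complement(seq)
    return seq


def format_fasta(uid, seq, wrap=60):
    """Format one FASTA record; wrap=None keeps the sequence on one line."""
    if wrap:
        body = "\n".join(seq[i:i + wrap] for i in range(0, len(seq), wrap))
    else:
        body = seq
    return ">{}\n{}\n".format(uid, body)
```

Passing wrap=None corresponds to the one-line output described for -w, and leaving reverse at its default of False corresponds to running without -r.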

dbm command¶

Creates a dbm DB using the semidbm package. The database can then be loaded using mgkit.db.dbm.GFFDB

mongodb command¶

Outputs annotations in a format supported by MongoDB. More information about it can be found in mgkit.db.mongo

gtf command¶

Outputs annotations in the GTF format

split command¶

Splits a GFF file into smaller chunks, ensuring that all annotations of a sequence end up in the same file.
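
One way to keep all annotations of a sequence in the same chunk is to assign each seq_id to a chunk the first time it is seen and reuse that assignment afterwards. The sketch below does this with a round-robin assignment. The strategy and the function name are assumptions for illustration; the source does not state how the split command distributes sequences.

```python
# Illustrative sketch only; the round-robin strategy is an assumption,
# not necessarily what the split command does.

def split_by_sequence(gff_lines, number=10):
    """Distribute GFF lines over `number` chunks, one chunk per seq_id."""
    chunks = [[] for _ in range(number)]
    assigned = {}  # seq_id -> index of the chunk it was assigned to
    for line in gff_lines:
        # Skip blank lines and comment/directive lines.
        if not line.strip() or line.startswith("#"):
            continue
        seq_id = line.split("\t", 1)[0]
        if seq_id not in assigned:
            # First time this sequence is seen: pick the next chunk in turn.
            assigned[seq_id] = len(assigned) % number
        chunks[assigned[seq_id]].append(line)
    return chunks
```

Because the assignment depends only on the seq_id, the input does not need to be sorted for annotations of the same sequence to land together.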

cov command¶

Calculates annotation coverage for each contig in a GFF file. The command can be run in strand-specific mode (off by default) and requires the reference file to which the annotations refer. The output is a tab-separated file: the first column is the sequence name, the second is the strand (+, -, or NA if not strand specific), and the third is the percentage of the sequence covered by annotations.
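
As a rough illustration of what "percentage of the sequence covered by annotations" means, the sketch below merges overlapping annotation intervals and divides the covered length by the sequence length. It assumes 1-based inclusive coordinates and that overlapping annotations are counted once; the function name is hypothetical and this is not the command's actual implementation.

```python
# Illustrative sketch only; assumes overlapping annotations count once.

def covered_percentage(intervals, seq_length):
    """Percentage of a sequence covered by (start, end) intervals.

    Coordinates are 1-based and inclusive, as in GFF.
    """
    covered = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end + 1:
            # Gap before this interval: close the previous merged block.
            if current_end is not None:
                covered += current_end - current_start + 1
            current_start, current_end = start, end
        else:
            # Overlapping or adjacent interval: extend the current block.
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start + 1
    return 100.0 * covered / seq_length
```

For a strand-specific calculation, the same function would be applied separately to the intervals of each strand.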

Warning

The GFF file is assumed to be sorted by sequence, or by sequence and strand for strand-specific coverage. The GFF file can be sorted using sort -s -k 1,1 -k 7,7 for strand specific, or sort -s -k 1,1 if not strand specific.
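
Before running the command, it may help to check that the annotations of each sequence (or each sequence and strand pair) are contiguous in the file, which is what the sort commands above produce. The following sketch performs that check. It assumes tab-separated GFF lines with the strand in the seventh column, and the function name is hypothetical.

```python
# Illustrative sketch only; checks contiguity of keys, not lexical order.

def is_grouped(gff_lines, strand_specific=False):
    """Return True if lines sharing a sequence (and strand) are contiguous."""
    seen = set()
    previous = None
    for line in gff_lines:
        # Skip blank lines and comment/directive lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        # Column 1 is the sequence name, column 7 is the strand.
        key = (fields[0], fields[6]) if strand_specific else fields[0]
        if key != previous:
            if key in seen:
                # This key appeared earlier and was interrupted by another.
                return False
            seen.add(key)
            previous = key
    return True
```

A file that fails this check can be passed through the appropriate sort command before calculating coverage.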

Changes¶

Changed in version 0.3.4: using click instead of argparse, renamed split command --json to --json-out

Changed in version 0.3.1: added cov command

Changed in version 0.3.0: added --split option to sequence command

Changed in version 0.2.6: added split command, --indent option to mongodb

Changed in version 0.2.3: added --gene-id option to gtf command

New in version 0.2.2: added gtf command

New in version 0.2.1: dbm and mongodb commands

New in version 0.1.15.

Options¶

get-gff-info¶

Main function

get-gff-info [OPTIONS] COMMAND [ARGS]...

Options

--version¶: Show the version and exit.

--cite¶

cov¶

Report how much of each sequence's length is covered by annotations in [gff-file]

get-gff-info cov [OPTIONS] [GFF_FILE] [OUTPUT_FILE]

Options

-v, --verbose¶

-f, --reference <reference>¶: Required. Reference FASTA file for the GFF

-j, --json-out¶: The output will be a JSON dictionary

-s, --strand-specific¶: If the coverage must be calculated on each strand

-r, --rename¶: Emulate BLAST output (use only the header part before the first space)

--progress¶: Shows Progress Bar

Arguments

GFF_FILE¶: Optional argument

OUTPUT_FILE¶: Optional argument

dbm¶

Creates a dbm database with annotations from file [gff-file] into db [output-dir]

get-gff-info dbm [OPTIONS] [GFF_FILE]

Options

-v, --verbose¶

-d, --output-dir <output_dir>¶

Directory for the database

Default:	gff-dbm

Arguments

GFF_FILE¶: Optional argument

gtf¶

Extract annotations from a GFF file [gff-file] to a GTF file [gtf-file]

get-gff-info gtf [OPTIONS] [GFF_FILE] [GTF_FILE]

Options

-v, --verbose¶

-g, --gene-id <gene_id>¶

GFF attribute to use for the GTF gene_id attribute

Default:	gene_id

Arguments

GFF_FILE¶: Optional argument

GTF_FILE¶: Optional argument

mongodb¶

Extract annotations from a GFF file [gff-file] and write output for MongoDB to [output-file]

get-gff-info mongodb [OPTIONS] [GFF_FILE] [OUTPUT_FILE]

Options

-v, --verbose¶

-t, --taxonomy <taxonomy>¶: Taxonomy used to populate the lineage

-c, --no-cache¶: No cache for the lineage function

-i, --indent <indent>¶: If used, the JSON will be written in a human-readable form

--progress¶: Shows Progress Bar

Arguments

GFF_FILE¶: Optional argument

OUTPUT_FILE¶: Optional argument

sequence¶

Extract the nucleotide sequences of annotations from [gff-file] to [fasta-file]

get-gff-info sequence [OPTIONS] [GFF_FILE] [FASTA_FILE]

Options

-v, --verbose¶

-r, --reverse¶: Reverse complement sequences on the - strand

-w, --no-wrap¶: Write the sequences on one line

-s, --split¶: Split the sequence header of the reference at the first space, to emulate BLAST behaviour

-f, --reference <reference>¶: Fasta file containing the reference sequences of the GFF file

--progress¶: Shows Progress Bar

Arguments

GFF_FILE¶: Optional argument

FASTA_FILE¶: Optional argument

split¶

Split annotations from a GFF file [gff-file] into several files starting with [prefix]

get-gff-info split [OPTIONS] [GFF_FILE]

Options

-v, --verbose¶

-p, --prefix <prefix>¶

Prefix for the file name in output

Default:	split

-n, --number <number>¶

Number of chunks into which the GFF file is split

Default:	10

-z, --gzip¶: gzip output files

Arguments

GFF_FILE¶: Optional argument