filter-gff - Filter GFF annotations

Overview

Filters GFF annotations in different ways.

Value Filtering

Enables filtering of GFF annotations based on the first 8 columns, which hold fixed values, as well as on the last column, which holds information as key=value pairs. There are some predefined key=value filters, like gene_id, but --str-eq, --str-in, --num-ge and --num-le allow additional filters to be defined.

The functions used to make the filters are located in the module mgkit.filter.gff, and their names start with filter_base, filter_attr and filter_len.

[Diagram: GFF, parse_gff, setup_filters, Filters, Filtered Annotations]
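As a rough illustration of the idea behind the key=value filters (not the mgkit implementation), the sketch below parses the last column into a dictionary and applies a string test and a numeric test. The names parse_attributes, str_eq and num_ge are hypothetical and exist only for this example, and the attribute column is assumed to use the key="value";key="value" layout.

```python
def parse_attributes(column):
    """Turn 'key1=value1;key2=value2' into a dictionary of strings."""
    pairs = (item.split('=', 1) for item in column.strip().split(';') if item)
    return {key.strip(): value.strip().strip('"') for key, value in pairs}


def str_eq(attributes, key, value):
    """Keep the annotation if 'key' exists and equals 'value' as a string."""
    return attributes.get(key) == value


def num_ge(attributes, key, value):
    """Keep the annotation if 'key' exists and its number is >= 'value'."""
    return key in attributes and float(attributes[key]) >= value


# a made-up GFF line, used only for illustration
line = 'contig1\tBLAST\tCDS\t10\t400\t.\t+\t0\tgene_id="K00001";bitscore="85.5"'
fields = line.split('\t')
attributes = parse_attributes(fields[8])

keep = str_eq(attributes, 'gene_id', 'K00001') and num_ge(attributes, 'bitscore', 80)
print(keep)
```

The values are kept as strings and converted only when a numeric comparison is requested, so the string and numeric filters can share the same parsed dictionary.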

Overlap Filtering

Filters overlapping annotations using the functions mgkit.filter.gff.choose_annotation() and mgkit.filter.gff.filter_annotations(), after the annotations are grouped by both sequence and strand. If the GFF is sorted by sequence name and strand, the -t option can be used to make the filtering use less memory. The file can be sorted in Unix using sort -s -k 1,1 -k 7,7 gff_file, which applies a stable sort using the sequence name as the first key and the strand as the second key.

Note

It is also recommended to set:

export LC_ALL=C

to speed up the sorting.
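The memory saving from a sorted file can be pictured with itertools.groupby: when all lines sharing a sequence name and strand are contiguous, each group can be processed and released before the next one is read. This is a minimal sketch of that idea, not the script's code; iter_groups is a hypothetical name and the GFF lines are made up.

```python
from itertools import groupby


def iter_groups(lines):
    """Yield ((seq_id, strand), lines) for contiguous runs in sorted GFF lines."""
    def key(line):
        fields = line.split('\t')
        # column 1 is the sequence name, column 7 is the strand
        return fields[0], fields[6]

    for group_key, group in groupby(lines, key=key):
        # only one group is held in memory at a time
        yield group_key, list(group)


gff_lines = [
    'contig1\tsrc\tCDS\t1\t90\t.\t+\t0\tgene_id="a"',
    'contig1\tsrc\tCDS\t50\t200\t.\t+\t0\tgene_id="b"',
    'contig1\tsrc\tCDS\t10\t80\t.\t-\t0\tgene_id="c"',
    'contig2\tsrc\tCDS\t5\t60\t.\t+\t0\tgene_id="d"',
]

groups = [(group_key, len(group)) for group_key, group in iter_groups(gff_lines)]
print(groups)
```

If the input were not sorted, the same sequence and strand could appear in several separate runs, which is why an unsorted file has to be read completely before grouping.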

[Diagram: sort, group_annotations, GFF, parse_gff, filter_annotations, Filtered Annotations]

The above diagram describes the internals of the script.

The annotations first need to be grouped by seq_id and strand, forming groups that can then be passed to mgkit.filter.gff.filter_annotations(). This function:

  1. sorts annotations by bit score, from the highest to the lowest

  2. loops over all combinations of N=2 annotations:

    1. chooses which of the two annotations to discard if they overlap by at least the required number of bp (defaults to 100bp)
    2. in that case, preference is given to the db quality first, then the bit score and finally the length of the annotation; the one with the highest values is kept
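The steps above can be sketched as follows. This is one possible reading of the description, written without mgkit: the Ann tuple, overlap_size, length and filter_group are hypothetical names, coordinates are assumed to be 1-based and inclusive, and skipping pairs that involve an already discarded annotation is an assumption of this sketch.

```python
from collections import namedtuple
from itertools import combinations

Ann = namedtuple('Ann', 'name start end dbq bitscore')


def overlap_size(a, b):
    """Number of shared positions between two 1-based inclusive intervals."""
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def length(a):
    """Length of an annotation in bp."""
    return a.end - a.start + 1


def filter_group(annotations, size=100):
    """Drop the lower-ranked annotation of every pair overlapping by >= size bp."""
    ordered = sorted(annotations, key=lambda a: a.bitscore, reverse=True)
    discarded = set()
    for a, b in combinations(ordered, 2):
        if a in discarded or b in discarded:
            continue
        if overlap_size(a, b) >= size:
            # the annotation with the lowest (dbq, bitscore, length) is discarded
            discarded.add(min(a, b, key=lambda x: (x.dbq, x.bitscore, length(x))))
    return [a for a in ordered if a not in discarded]


group = [
    Ann('a', 1, 300, 10, 200.0),
    Ann('b', 150, 400, 10, 150.0),   # overlaps 'a' by 151 bp, lower bit score
    Ann('c', 350, 500, 5, 90.0),     # overlaps 'b' by only 51 bp
]
kept = [a.name for a in filter_group(group)]
print(kept)
```

In this example 'b' is discarded in favour of 'a', while 'c' is kept because none of its overlaps reaches the 100bp threshold.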

While the default behaviour is the same, it is now possible to decide the function used to discard one of the two annotations. The -c argument accepts a string that defines the function. The string may or may not start with a +: a leading + translates into the builtin function max, while no + translates into min. From the second character on, any number of attributes can be used, separated by commas. The attributes, however, must be among the properties defined in mgkit.io.gff.Annotation, such as bitscore, which returns the value converted to a float. Internally the attributes are stored as strings, so for attributes that have no property in the class, such as evalue, the float builtin is applied.

The tuples built for both annotations are then passed to the comparison function, and the annotation it returns is discarded. The order of the elements in the string defines the priority given to each element in the comparison: the leftmost one has the highest priority.

Examples of function strings:

  • -dbq,bitscore,length becomes min((ann1.dbq, ann1.bitscore, ann1.length), (ann2.dbq, ann2.bitscore, ann2.length)); this is the default and was previously the only choice
  • -bitscore,length,dbq uses the same elements but gives the lowest priority to dbq
  • +evalue will discard the annotation with the highest evalue
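A function string of this kind could be turned into a comparison function along the following lines. This is a hedged sketch and not the script's own code: make_choose_func here is a stand-in written for this example, annotations are plain dictionaries rather than mgkit.io.gff.Annotation instances, and it assumes that a leading + selects max, that anything else selects min, and that the returned annotation is the one to discard.

```python
def make_choose_func(spec):
    """Build a function returning which of two annotations to discard.

    A leading '+' selects max, otherwise min is used. A leading '-' is
    accepted and stripped (assumption made for this sketch).
    """
    function = max if spec.startswith('+') else min
    names = spec.lstrip('+-').split(',')

    def key(annotation):
        # values are converted to float so they compare numerically,
        # the leftmost attribute has the highest priority in the tuple
        return tuple(float(annotation[name]) for name in names)

    return lambda ann1, ann2: function(ann1, ann2, key=key)


ann1 = {'dbq': 10, 'bitscore': 120.0, 'length': 300, 'evalue': '1e-30'}
ann2 = {'dbq': 10, 'bitscore': 95.0, 'length': 450, 'evalue': '1e-5'}

discard_default = make_choose_func('-dbq,bitscore,length')(ann1, ann2)
discard_evalue = make_choose_func('+evalue')(ann1, ann2)
print(discard_default is ann2, discard_evalue is ann2)
```

Both strings discard ann2 here: with the first it has the lower bit score once dbq ties, with the second it has the higher evalue. Tuples compare element by element, which is what gives the leftmost attribute the highest priority.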

Per Sequence Values

The sequence command allows filtering on a per-sequence basis, using functions such as the median, quantile and mean on attributes like evalue, bitscore and identity. The file can be passed as already sorted, saving memory (as in the overlap command), but it does not need to be sorted by strand, only by the first column.
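As a minimal sketch of per-sequence filtering, assuming annotations are plain dictionaries and using the ge comparison, the example below keeps the annotations whose bit score is at least the median of their own sequence. The name filter_per_sequence and the data are hypothetical.

```python
from collections import defaultdict
from statistics import median


def filter_per_sequence(annotations, attribute='bitscore', func=median):
    """Keep annotations whose attribute is >= func(values) of their sequence."""
    by_sequence = defaultdict(list)
    for annotation in annotations:
        by_sequence[annotation['seq_id']].append(annotation)

    kept = []
    for group in by_sequence.values():
        # the threshold is computed separately for each sequence
        threshold = func([a[attribute] for a in group])
        kept.extend(a for a in group if a[attribute] >= threshold)
    return kept


annotations = [
    {'seq_id': 'contig1', 'bitscore': 50.0},
    {'seq_id': 'contig1', 'bitscore': 80.0},
    {'seq_id': 'contig1', 'bitscore': 120.0},
    {'seq_id': 'contig2', 'bitscore': 40.0},
]
kept = filter_per_sequence(annotations)
print([(a['seq_id'], a['bitscore']) for a in kept])
```

Passing another function as func, for example max or min, changes the threshold without changing the rest of the logic.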

Coverage Filtering

The cov command calculates the coverage of annotations as the percentage of each reference sequence length that they cover. A minimum coverage percentage can be set to keep only the annotations of sequences whose coverage is greater than or equal to it.
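One way to picture the coverage calculation is as the union of the annotated intervals divided by the sequence length. The sketch below is an assumption about the general approach rather than the script's code: coverage_percent is a hypothetical name, coordinates are taken as 1-based and inclusive, and the threshold is treated as a percentage.

```python
def coverage_percent(intervals, seq_length):
    """Percentage of a sequence covered by the union of the intervals."""
    covered = 0
    current_end = 0
    for start, end in sorted(intervals):
        # skip positions already counted by a previous interval
        start = max(start, current_end + 1)
        if end >= start:
            covered += end - start + 1
            current_end = end
    return 100.0 * covered / seq_length


# three annotations on a 500 bp sequence, the first two overlap
intervals = [(1, 100), (51, 150), (301, 400)]
coverage = coverage_percent(intervals, 500)

min_coverage = 40.0
keep_sequence = coverage >= min_coverage
print(coverage, keep_sequence)
```

Overlapping annotations are counted once, so the first two intervals contribute 150 bp rather than 200 bp.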

Changes

New in version 0.1.12.

Changed in version 0.1.13: added --sorted option

Changed in version 0.2.0: changed option -c to accept a string to filter overlap

Changed in version 0.2.5: added sequence command

Changed in version 0.2.6: added length as attribute and min/max, ge is now the default comparison for the sequence command, added --sort-attr to overlap

Changed in version 0.3.1: added --num-gt and --num-lt to values command, added cov command

Options

Filter GFF files

usage: filter-gff [-h] [-v | --quiet] [--cite] [--manual] [--version]
                  {values,overlap,sequence,cov} ...

Named Arguments

-v, --verbose

more verbose - includes debug messages

Default: 20

--quiet less verbose - only error and critical messages
--cite Show citation for the framework
--manual Show the script manual
--version show program’s version number and exit

Sub-commands:

values

Filter based on values

filter-gff values [-h] [-s SEQ_ID] [--strand {+,-}]
                  [-sl START_LOWER | -sg START_HIGHER]
                  [-el END_LOWER | -eg END_HIGHER]
                  [-lg LENGTH | -ls LENGTH_SHORT] [--source SOURCE]
                  [-f FEAT_TYPE] [-g GENE_ID] [-d DB] [-q DB_QUAL]
                  [-b BITSCORE] [-t TAXON_ID] [--str-eq STR_EQ]
                  [--str-in STR_IN] [--num-ge NUM_GE] [--num-le NUM_LE]
                  [--num-gt NUM_GT] [--num-lt NUM_LT] [-v | --quiet] [--cite]
                  [--manual] [--version]
                  [input_file] [output_file]
Positional Arguments
input_file

Input GFF file, defaults to stdin

Default: -

output_file

Output GFF file, defaults to stdout

Default: stdout

Named Arguments
-s, --seq-id filter by sequence id
--strand

Possible choices: +, -

filter by strand

-sl, --start-lower returns only annotations where the start position is less than the given value
-sg, --start-higher returns only annotations where the start position is greater than the given value
-el, --end-lower returns only annotations where the end position is equal to or less than the given value
-eg, --end-higher returns only annotations where the end position is equal to or greater than the given value
-lg, --length filter by annotation length equal to or longer than
-ls, --length-short filter by annotation length equal to or shorter than
--source filter by source
-f, --feat-type filter by feature type
-g, --gene-id filter by gene_id
-d, --db filter by db
-q, --db-qual filter by db quality equal to or greater than
-b, --bitscore filter by bitscore equal to or greater than
-t, --taxon-id filter by taxon_id
--str-eq filter by custom key:value; if the argument is ‘key:value’ the annotation is kept if it contains an attribute ‘key’ whose value is exactly ‘value’ as a string
--str-in Same as ‘--str-eq’ but ‘value’ is contained in the attribute
--num-ge Same as ‘--str-eq’ but ‘value’ is a number and the attribute must be equal to or greater than it
--num-le Same as ‘--num-ge’ but the attribute must be equal to or less than ‘value’
--num-gt Same as ‘--num-ge’ but the attribute must be greater than ‘value’
--num-lt Same as ‘--num-ge’ but the attribute must be less than ‘value’
-v, --verbose

more verbose - includes debug messages

Default: 20

--quiet less verbose - only error and critical messages
--cite Show citation for the framework
--manual Show the script manual
--version show program’s version number and exit

overlap

Use overlapping filter

filter-gff overlap [-h] [-s SIZE] [-t] [-c CHOOSE_FUNC]
                   [-a {bitscore,identity,length}] [-v | --quiet] [--cite]
                   [--manual] [--version]
                   [input_file] [output_file]
Positional Arguments
input_file

Input GFF file, defaults to stdin

Default: -

output_file

Output GFF file, defaults to stdout

Default: stdout

Named Arguments
-s, --size

Size of the overlap that triggers the filter

Default: 100

-t, --sorted

If the GFF file is sorted (all annotations of a sequence are contiguous and sorted by strand), less memory is used; sort -s -k 1,1 -k 7,7 can be used to sort the file

Default: False

-c, --choose-func

Function to choose between two overlapping annotations

Default: dbq,bitscore,length

-a, --sort-attr

Possible choices: bitscore, identity, length

Attribute used to sort annotations before filtering (default bitscore)

Default: “bitscore”

-v, --verbose

more verbose - includes debug messages

Default: 20

--quiet less verbose - only error and critical messages
--cite Show citation for the framework
--manual Show the script manual
--version show program’s version number and exit

sequence

Filter on a per sequence basis

filter-gff sequence [-h] [-t] [-a {evalue,bitscore,identity,length}]
                    [-m | -d | -q QUANTILE | -s STD | -x | -n]
                    [-c {gt,ge,lt,le}] [-v | --quiet] [--cite] [--manual]
                    [--version]
                    [input_file] [output_file]
Positional Arguments
input_file

Input GFF file, defaults to stdin

Default: -

output_file

Output GFF file, defaults to stdout

Default: stdout

Named Arguments
-t, --sorted

If the GFF file is sorted (all annotations of a sequence are contiguous), less memory is used; sort -s -k 1,1 can be used to sort the file

Default: False

-a, --attribute

Possible choices: evalue, bitscore, identity, length

Attribute on which to apply the filter

Default: “bitscore”

-m, --mean

Filter by the mean value

Default: False

-d, --median

Filter by the median value

Default: False

-q, --quantile Filter by the quantile value
-s, --std Filter by: mean + (X * std), where X is the number supplied
-x, --max

Filter by the maximum value

Default: False

-n, --min

Filter by the minimum value

Default: False

-c, --comparison

Possible choices: gt, ge, lt, le

Type of comparison (e.g. ge -> greater than or equal to)

Default: “ge”

-v, --verbose

more verbose - includes debug messages

Default: 20

--quiet less verbose - only error and critical messages
--cite Show citation for the framework
--manual Show the script manual
--version show program’s version number and exit

cov

Filter based on coverage

filter-gff cov [-h] -f REFERENCE [-s] [-t] [-c MIN_COVERAGE] [-v | --quiet]
               [--cite] [--manual] [--version]
               [input_file] [output_file]
Positional Arguments
input_file

Input GFF file, defaults to stdin

Default: -

output_file

Output GFF file, defaults to stdout

Default: stdout

Named Arguments
-f, --reference Reference FASTA file for the GFF
-s, --strand-specific

If the coverage must be calculated on each strand

Default: False

-t, --sorted

Assumes the GFF to be correctly sorted

Default: False

-c, --min-coverage

Minimum coverage for the contig/strand

Default: 0.0

-v, --verbose

more verbose - includes debug messages

Default: 20

--quiet less verbose - only error and critical messages
--cite Show citation for the framework
--manual Show the script manual
--version show program’s version number and exit