blast2gff - Convert BLAST output to GFF

Overview

Converting BLAST output to GFF requires the BLAST+ tabular format, which can be obtained with the -outfmt 6 option and the default columns, as specified in mgkit.io.blast.parse_blast_tab(). By default, the script reads from standard input and writes GFF lines to standard output.
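
As a rough illustration of the expected input, the sketch below reads one line of BLAST+ tabular output using the 12 default columns of -outfmt 6. The function parse_tab_line and the example hit are hypothetical and are not taken from mgkit; mgkit.io.blast.parse_blast_tab() remains the reference for what the script actually accepts.

```python
# Default column order of BLAST+ tabular output (-outfmt 6).
BLAST_TAB_COLUMNS = (
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
)
INT_COLUMNS = {"length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send"}
FLOAT_COLUMNS = {"pident", "evalue", "bitscore"}


def parse_tab_line(line):
    """Turn one tab-separated BLAST+ line into a dict keyed by column name.

    Hypothetical helper, not part of mgkit.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(BLAST_TAB_COLUMNS):
        raise ValueError("expected 12 columns, found {}".format(len(fields)))
    hit = {}
    for name, value in zip(BLAST_TAB_COLUMNS, fields):
        if name in INT_COLUMNS:
            hit[name] = int(value)
        elif name in FLOAT_COLUMNS:
            hit[name] = float(value)
        else:
            hit[name] = value
    return hit


# Made-up hit values, for illustration only
example = (
    "contig1\tgi|160361034|gb|CP000884.1\t98.5\t200\t3\t0"
    "\t1\t200\t501\t700\t1e-50\t350.0"
)
hit = parse_tab_line(example)
```

A line with extra or missing columns raises a ValueError here, which mirrors the requirement that the default columns are used.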

Uniprot

The function mgkit.io.blast.parse_uniprot_blast() is used. It filters BLAST hits by bitscore and, by default, adds two attributes to each annotation: a db attribute with the value UNIPROT-SP, indicating that the SwissProt database was used, and a dbq attribute with the value 10. The feature type used in the GFF is CDS.

[Diagram: BLAST+ → parse_uniprot_blast → GFF]
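
The filtering and default attributes described above could be sketched as follows. This is only an assumption about the general shape of the logic, not the code of parse_uniprot_blast(): the function annotate_hits, the dictionary keys and the example hits are all hypothetical, and only the default values (UNIPROT-SP, 10, CDS, minimum bitscore 0.0) come from this page.

```python
def annotate_hits(hits, min_bitscore=0.0, db="UNIPROT-SP", dbq=10, feat_type="CDS"):
    """Yield one annotation dict per hit whose bitscore reaches the cutoff.

    Hypothetical helper, not part of mgkit.
    """
    for hit in hits:
        if hit["bitscore"] < min_bitscore:
            continue  # hit is below the minimum bitscore, drop it
        yield {
            "seq_id": hit["qseqid"],
            "feat_type": feat_type,
            "gene_id": hit["sseqid"],
            "bitscore": hit["bitscore"],
            "db": db,
            "dbq": dbq,
        }


# Made-up hits, for illustration only
hits = [
    {"qseqid": "contig1", "sseqid": "GENE_A", "bitscore": 80.0},
    {"qseqid": "contig2", "sseqid": "GENE_B", "bitscore": 30.0},
]
kept = list(annotate_hits(hits, min_bitscore=40.0))
```

With a cutoff of 40.0 only the first hit is kept. On the command line the equivalent settings are the -b, -db, -dbq and -ft options.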

BlastDB

If a BlastDB such as nt or nr was used, the blastdb command offers some quick defaults to parse the BLAST results.

It includes options to control how the sequence headers are parsed: the separator can be changed, as can the column used as gene_id. These were added because, at the moment, the GI identifier (the second column in the header) is used, but it is being phased out in favour of the embl/gb/dbj identifier (currently the fourth column in the header). This should ease the transition to the new format and make it easier to adapt an older pipeline/blastdb to newer files (like the ID to TAXA files).

A header from ncbi-nt looks like this:

gi|160361034|gb|CP000884.1

This is the default format accepted by the blastdb command. The fields are separated by | (pipe) and the GI is used (--gene-index 1: internally the string is split by the separator and the second element is taken, since list indices are 0-based in Python). This corresponds to the following options:

--header-sep '|' --gene-index 1

Notice the single quotes around the pipe symbol: without them, bash would interpret it as piping to the next command. These are the default values.

If, for the same header, we want to use the gb identifier instead, the only option that needs to be specified is:

--gene-index 3

This will get the fourth element of the header (since we’re splitting by pipe).
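
The effect of the two options can be reproduced in a few lines of Python. The function extract_gene_id is a hypothetical stand-in for what the script does internally; only the split-and-index behaviour is taken from the description above.

```python
def extract_gene_id(header, sep="|", index=1):
    """Split the header on the separator and return the element at index (0-based).

    Hypothetical helper, not part of mgkit.
    """
    return header.split(sep)[index]


header = "gi|160361034|gb|CP000884.1"

# Defaults, same as: --header-sep '|' --gene-index 1
gi_id = extract_gene_id(header)

# Same as: --gene-index 3
gb_id = extract_gene_id(header, index=3)
```

Here gi_id holds the GI (160361034) and gb_id holds the gb identifier (CP000884.1).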

As in the uniprot command, the gene_id can be set to the whole header with the -n option, which is useful when the BLAST db used was custom made. Pipe is the default separator because it is used in the major databases, but if the db used follows different conventions the separator can be changed. There is also the option of changing the gene_id later in the output GFF if necessary.
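
A minimal sketch of how these choices might combine is shown below. The function resolve_gene_id and the custom header are hypothetical. The handling of remove_version (the -r option listed under Options) assumes that the version is whatever follows the last dot of the identifier, which matches accession.version identifiers such as CP000884.1 but is not confirmed by this page.

```python
def resolve_gene_id(header, sep="|", index=1, no_split=False, remove_version=False):
    """Pick a gene_id from a sequence header.

    Hypothetical helper, not part of mgkit.
    """
    # no_split mirrors -n: the whole header is the gene_id
    gene_id = header if no_split else header.split(sep)[index]
    if remove_version:
        # Assumption: the version is the part after the last dot
        gene_id = gene_id.rsplit(".", 1)[0]
    return gene_id


# Custom database whose headers contain only the gene id (like -n)
whole = resolve_gene_id("custom_gene_42", no_split=True)

# Made-up custom header that uses ';' as separator
custom = resolve_gene_id("projectX;geneA;draft", sep=";", index=1)

# gb identifier from an ncbi-nt header, without its version
unversioned = resolve_gene_id(
    "gi|160361034|gb|CP000884.1", index=3, remove_version=True
)
```

Identifiers without a dot pass through unchanged when remove_version is set, so the option is harmless on headers that carry no version.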

Changes

Changed in version 0.2.6: added -r option to blastdb

Changed in version 0.2.5: added more options to give user control to the blastdb command

New in version 0.2.3: added --fasta-file option, added more data from a BLAST hit

New in version 0.2.2: added blastdb command

Changed in version 0.2.1: added -ft option

Changed in version 0.1.13.

  • added -n parameter to uniprot command
  • added -k option to uniprot command

New in version 0.1.12.

Options

Convert BLAST output to a GFF file

usage: blast2gff [-h] [-v | --quiet] [--cite] [--manual] [--version]
                 {uniprot,blastdb} ...

Named Arguments

-v, --verbose

more verbose - includes debug messages

Default: 20

--quiet less verbose - only error and critical messages
--cite Show citation for the framework
--manual Show the script manual
--version show program’s version number and exit

Sub-commands:

uniprot

Blast results from a Uniprot database, by default SwissProt

blast2gff uniprot [-h] [-db DB_USED] [-n] [-dbq DB_QUALITY] [-b BITSCORE]
                  [-k ATTR_VALUE] [-ft FEAT_TYPE] [-a FASTA_FILE]
                  [-v | --quiet] [--cite] [--manual] [--version]
                  [input_file] [output_file]
Positional Arguments
input_file

BLAST+ output file in tabular format, defaults to stdin

Default: -

output_file

Output GFF file, defaults to stdout

Default: stdout

Named Arguments
-db, --db-used

Uniprot database used with BLAST

Default: “UNIPROT-SP”

-n, --no-split
if used, the script assumes that the sequence header contains
only the gene id

Default: False

-dbq, --db-quality
 

Quality of the DB used

Default: 10

-b, --bitscore

Minimum bitscore to keep the annotation

Default: 0.0

-k, --attr-value
 
Additional attribute and value to add to each annotation,
in the form attr:value
-ft, --feat-type
 

Feature type to use in the GFF

Default: “CDS”

-a, --fasta-file
 
Fasta file with nucleotide sequences, used to calculate the frame. If not used, the frame on the ‘-’ strand will always be 0
-v, --verbose

more verbose - includes debug messages

Default: 20

--quiet less verbose - only error and critical messages
--cite Show citation for the framework
--manual Show the script manual
--version show program’s version number and exit

blastdb

Blast results from an NCBI database, like nt

blast2gff blastdb [-h] [-db DB_USED] [-n] [-s HEADER_SEP] [-i GENE_INDEX] [-r]
                  [-dbq DB_QUALITY] [-b BITSCORE] [-k ATTR_VALUE]
                  [-ft FEAT_TYPE] [-a FASTA_FILE] [-v | --quiet] [--cite]
                  [--manual] [--version]
                  [input_file] [output_file]
Positional Arguments
input_file

BLAST+ output file in tabular format, defaults to stdin

Default: -

output_file

Output GFF file, defaults to stdout

Default: stdout

Named Arguments
-db, --db-used

blastdb used

Default: “NCBI-NT”

-n, --no-split
if used, the script assumes that the sequence header will be
used as gene_id

Default: False

-s, --header-sep
 

The separator for the header, defaults to ‘|’ (pipe)

Default: “|”

-i, --gene-index
 
Which of the header columns (0-based) to use as gene_id
(defaults to 1 - the second column)

Default: 1

-r, --remove-version
 
if used, the script removes the version information from the
gene_id

Default: False

-dbq, --db-quality
 

Quality of the DB used

Default: 10

-b, --bitscore

Minimum bitscore to keep the annotation

Default: 0.0

-k, --attr-value
 
Additional attribute and value to add to each annotation,
in the form attr:value
-ft, --feat-type
 

Feature type to use in the GFF

Default: “CDS”

-a, --fasta-file
 
Fasta file with nucleotide sequences, used to calculate the frame. If not used, the frame on the ‘-’ strand will always be 0
-v, --verbose

more verbose - includes debug messages

Default: 20

--quiet less verbose - only error and critical messages
--cite Show citation for the framework
--manual Show the script manual
--version show program’s version number and exit