blast2gff - Convert BLAST output to GFF

Overview

Converting BLAST output to GFF requires the BLAST+ tabular format, which can be obtained with the -outfmt 6 option and the default columns, as specified in mgkit.io.blast.parse_blast_tab(). By default the script reads data from standard input and writes GFF lines to standard output.
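As a rough illustration of the expected input, the sketch below splits one line of -outfmt 6 output into the twelve default BLAST+ columns. It is not the code of mgkit.io.blast.parse_blast_tab(); the function name parse_tab_line and the example line are made up for this illustration.

```python
# Column names of BLAST+ tabular output (-outfmt 6) with the default columns.
BLAST_TAB_COLUMNS = (
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
)


def parse_tab_line(line):
    """Split one tab-separated BLAST line into a dict keyed by column name.

    Hypothetical helper, not part of mgkit.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(BLAST_TAB_COLUMNS):
        raise ValueError(
            "expected %d tab-separated columns, got %d"
            % (len(BLAST_TAB_COLUMNS), len(fields))
        )
    hit = dict(zip(BLAST_TAB_COLUMNS, fields))
    # Convert the numeric columns; the two identifiers stay as strings.
    for name in ("length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send"):
        hit[name] = int(hit[name])
    for name in ("pident", "evalue", "bitscore"):
        hit[name] = float(hit[name])
    return hit


# Made-up example line, for illustration only.
example = (
    "contig1\tgi|160361034|gb|CP000884.1\t98.5\t200\t3\t0"
    "\t1\t200\t1050\t1249\t1e-100\t360.0"
)
hit = parse_tab_line(example)
```

A file produced with non-default columns would fail the length check here, which is why the default column set matters.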

Uniprot

The function mgkit.io.blast.parse_uniprot_blast() is used, which filters BLAST hits based on bitscore. By default it adds to each annotation a db attribute with the value UNIPROT-SP, indicating that the SwissProt db is used, and a dbq attribute with the value 10. The feature type used in the GFF is CDS.

[Diagram: BLAST+ -> parse_uniprot_blast -> GFF]
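The filtering and annotation step described above can be pictured with the following minimal sketch. It is not the implementation of parse_uniprot_blast(): the function name hit_to_gff_line, the source column value "BLAST", the phase value and the exact attribute layout are all assumptions made for the example. Only the defaults (db UNIPROT-SP, dbq 10, feature type CDS, minimum bitscore 0.0) come from this page.

```python
def hit_to_gff_line(query_id, start, end, bitscore, gene_id,
                    min_bitscore=0.0, db="UNIPROT-SP", dbq=10,
                    feat_type="CDS"):
    """Return one GFF line for a hit, or None if the bitscore is too low.

    Hypothetical helper, not part of mgkit.
    """
    if bitscore < min_bitscore:
        return None
    # If start > end (as can happen for reverse-strand hits), record the
    # strand and swap, since GFF expects start <= end.
    strand = "+" if start <= end else "-"
    start, end = min(start, end), max(start, end)
    # Assumed attribute layout: key=value pairs joined by ';'.
    attributes = "gene_id={};db={};dbq={}".format(gene_id, db, dbq)
    columns = [
        query_id,        # sequence the annotation is placed on
        "BLAST",         # source (assumed value)
        feat_type,       # feature type, CDS by default
        str(start),
        str(end),
        str(bitscore),   # score column
        strand,
        "0",             # phase (assumed: hit starts at a codon boundary)
        attributes,
    ]
    return "\t".join(columns)


kept = hit_to_gff_line("contig1", 1, 200, 360.0, "P12345")
dropped = hit_to_gff_line("contig1", 1, 200, 30.0, "P12345", min_bitscore=50.0)
reverse = hit_to_gff_line("contig1", 300, 1, 100.0, "P12345")
```

Here kept is a nine-column GFF line, dropped is None because the hit falls below the cutoff, and reverse shows coordinates reordered with a "-" strand.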

BlastDB

If a BlastDB such as nt or nr was used, the blastdb command offers some quick defaults to parse BLAST results.

It now includes options to control the way the sequence headers are parsed: the separator used can be changed, as well as the column used as gene_id. These were added because at the moment the GI identifier (the second column in the header) is used, but it's being phased out in favour of the embl/gb/dbj identifier (right now the fourth column in the header). This should ease the transition to the new format and make it easier to adapt an older pipeline/blastdb to newer files (like the ID to TAXA files).

A header from ncbi-nt looks like this:

gi|160361034|gb|CP000884.1

This is the default format accepted by the blastdb command. The fields are separated by | (pipe) and the GI is used (--gene-index 1, since internally the string is split by the separator and the second element is taken; list indices are 0-based in Python). This corresponds to the following options:

--header-sep '|' --gene-index 1

Notice the single quotes around the pipe symbol: without them, bash would interpret it as piping to the next command. These values are the defaults.
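The quoting requirement can be checked from Python with the standard library's shlex module. The sketch below assumes Python 3.8 or later (for shlex.join) and uses the hypothetical file names hits.tab and hits.gff. It only builds the command line and does not run the script.

```python
import shlex

# Argument list for the blastdb command; file names are hypothetical.
args = [
    "blast2gff", "blastdb",
    "--header-sep", "|",
    "--gene-index", "1",
    "hits.tab", "hits.gff",
]

# shlex.join quotes each argument the way a POSIX shell needs it,
# so the bare pipe is wrapped in single quotes.
command_line = shlex.join(args)
print(command_line)
# prints: blast2gff blastdb --header-sep '|' --gene-index 1 hits.tab hits.gff
```

When the same list is passed directly to subprocess.run() without shell=True, no quoting is needed, because no shell is involved in interpreting the pipe character.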

If, for the same header, we want to use the gb identifier, the only option that needs to be specified is:

--gene-index 3

This will get the fourth element of the header (since we’re splitting by pipe).

As in the uniprot command, the gene_id can be set to the whole header using the -n option, which is useful when the BLAST db used was custom made. Since pipe is used in the major databases, it was made the default, but if the db used has different conventions the separator can be changed. There's also the option of changing the gene_id later in the output GFF if necessary.
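The effect of the header options on the example header can be sketched in a few lines of Python. The helper extract_gene_id is hypothetical and only mirrors the behaviour described on this page for --header-sep, --gene-index, -n and -r. In particular, treating the version as whatever follows the last dot is an assumption, not something confirmed by the script's source.

```python
def extract_gene_id(header, sep="|", index=1, no_split=False,
                    remove_version=False):
    """Pick the gene_id out of a BLAST subject header.

    Hypothetical helper, not part of mgkit.
    """
    if no_split:
        # Equivalent of -n: the whole header is the gene_id.
        gene_id = header
    else:
        # Split by the separator and take the element at the 0-based index.
        gene_id = header.split(sep)[index]
    if remove_version:
        # Assumption: the version is the part after the last '.'.
        gene_id = gene_id.rsplit(".", 1)[0]
    return gene_id


header = "gi|160361034|gb|CP000884.1"

gi = extract_gene_id(header)                    # defaults: sep '|', index 1
gb = extract_gene_id(header, index=3)           # --gene-index 3
gb_no_version = extract_gene_id(header, index=3, remove_version=True)
whole = extract_gene_id(header, no_split=True)  # -n
```

With the example header, gi is "160361034", gb is "CP000884.1", gb_no_version is "CP000884" and whole is the unchanged header.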

Changes

Changed in version 0.3.4: using click instead of argparse

Changed in version 0.2.6: added -r option to blastdb

Changed in version 0.2.5: added more options to give user control to the blastdb command

New in version 0.2.3: added --fasta-file option, added more data from a BLAST hit

New in version 0.2.2: added blastdb command

Changed in version 0.2.1: added -ft option

Changed in version 0.1.13: added -n and -k parameters to uniprot command

New in version 0.1.12.

Options

blast2gff

Main function

blast2gff [OPTIONS] COMMAND [ARGS]...

Options

--version

Show the version and exit.

--cite

blastdb

Reads a BLAST output file [blast-file] in tabular format (using -outfmt 6) and outputs a GFF file [gff-file]

blast2gff blastdb [OPTIONS] [BLAST_FILE] [GFF_FILE]

Options

-v, --verbose
-db, --db-used <db_used>

blastdb used

Default:NCBI-NT
-n, --no-split

if used, the script assumes that the sequence header will be used as gene_id

-s, --header-sep <header_sep>

The separator for the header, defaults to ‘|’ (pipe)

Default:|

-i, --gene-index <gene_index>

Which of the header columns (0-based) to use as gene_id (defaults to 1 - the second column)

Default:1
-r, --remove-version

if used, the script removes the version information from the gene_id

-a, --fasta-file <fasta_file>

Optional FASTA file with the query sequences

-dbq, --db-quality <db_quality>

Quality of the DB used

Default:10
-b, --bitscore <bitscore>

Minimum bitscore to keep the annotation

Default:0.0
-k, --attr-value <attr_value>

Additional attribute and value to add to each annotation, in the form attr:value

-ft, --feat-type <feat_type>

Feature type to use in the GFF

Default:CDS
--progress

Shows Progress Bar

Arguments

BLAST_FILE

Optional argument

GFF_FILE

Optional argument

uniprot

Reads a BLAST output file [blast-file] in tabular format (using -outfmt 6) from a Uniprot DB and outputs a GFF file [gff-file]

blast2gff uniprot [OPTIONS] [BLAST_FILE] [GFF_FILE]

Options

-v, --verbose
-db, --db-used <db_used>

Uniprot database used with BLAST

Default:UNIPROT-SP
-n, --no-split

if used, the script assumes that the sequence header will be used as gene_id

-a, --fasta-file <fasta_file>

Optional FASTA file with the query sequences

-dbq, --db-quality <db_quality>

Quality of the DB used

Default:10
-b, --bitscore <bitscore>

Minimum bitscore to keep the annotation

Default:0.0
-k, --attr-value <attr_value>

Additional attribute and value to add to each annotation, in the form attr:value

-ft, --feat-type <feat_type>

Feature type to use in the GFF

Default:CDS
--progress

Shows Progress Bar

Arguments

BLAST_FILE

Optional argument

GFF_FILE

Optional argument