blast2gff - Convert BLAST output to GFF¶

Overview¶

Converting BLAST output to GFF requires the BLAST+ tabular format, which can be obtained by using the -outfmt 6 option with the default columns, as specified in mgkit.io.blast.parse_blast_tab(). By default the script reads data from standard input and writes GFF lines to standard output.
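To illustrate the expected input, the sketch below splits one line of BLAST+ tabular output (-outfmt 6, default columns) into named fields. It is not MGKit's parser: the parse_tab_line helper and the sample line are made up for this example, and only the column order comes from the BLAST+ default.

```python
# Default column order of BLAST+ tabular output (-outfmt 6)
BLAST_TAB_COLUMNS = (
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
)

INT_FIELDS = ("length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send")
FLOAT_FIELDS = ("pident", "evalue", "bitscore")


def parse_tab_line(line):
    """Return a dict with the 12 default fields of one tabular BLAST line.

    Hypothetical helper for illustration, not part of MGKit.
    """
    values = line.rstrip("\n").split("\t")
    if len(values) != len(BLAST_TAB_COLUMNS):
        raise ValueError("expected 12 tab-separated columns")
    hit = dict(zip(BLAST_TAB_COLUMNS, values))
    # Convert the numeric columns, leave the sequence names as strings
    for key in INT_FIELDS:
        hit[key] = int(hit[key])
    for key in FLOAT_FIELDS:
        hit[key] = float(hit[key])
    return hit


# Invented sample line, only meant to show the layout
sample = (
    "contig1\tgi|160361034|gb|CP000884.1\t98.5\t200\t3\t0"
    "\t1\t200\t5000\t5199\t1e-100\t350.0"
)
hit = parse_tab_line(sample)
print(hit["sseqid"], hit["bitscore"])
```

The subject name (sseqid) is the sequence header that the uniprot and blastdb commands later turn into a gene_id.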

Uniprot¶

The function mgkit.io.blast.parse_uniprot_blast() is used, which filters BLAST hits based on bitscore. By default it adds two attributes to each annotation: a db attribute with the value UNIPROT-SP, indicating that the SwissProt db was used, and a dbq attribute with the value 10. The feature type used in the GFF is CDS.
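The filtering and the default attributes can be pictured with the following sketch. It does not call mgkit.io.blast.parse_uniprot_blast(): the keep_hits function and the dictionary layout are assumptions made for this example, as is the choice to keep hits whose bitscore equals the threshold.

```python
def keep_hits(hits, min_bitscore=0.0, db="UNIPROT-SP", dbq=10, feat_type="CDS"):
    """Yield annotation-like dicts for the hits that pass the bitscore filter.

    Hypothetical helper, not MGKit code. `hits` is an iterable of dicts
    that contain at least a 'bitscore' key.
    """
    for hit in hits:
        # Assumption: a hit exactly at the threshold is kept
        if hit["bitscore"] < min_bitscore:
            continue
        annotation = dict(hit)
        # Defaults mirror the -db, -dbq and -ft options
        annotation.update(db=db, dbq=dbq, feat_type=feat_type)
        yield annotation


# Invented hits
hits = [
    {"gene_id": "A", "bitscore": 30.0},
    {"gene_id": "B", "bitscore": 80.0},
]
kept = list(keep_hits(hits, min_bitscore=50.0))
print(kept)
```

With the default threshold of 0.0 every hit is kept, which matches the default of the -b option.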

BlastDB¶

If a BlastDB such as nt or nr was used, the blastdb command offers some quick defaults to parse BLAST results.

It now includes options to control the way the sequence headers are parsed: the separator can be changed, as well as the column used as gene_id. These were added because at the moment the GI identifier (the second column in the header) is used, but it’s being phased out in favour of the embl/gb/dbj identifier (right now the fourth column in the header). This should ease the transition to the new format and make it easier to adapt an older pipeline/blastdb to newer files (like the ID to TAXA files).

A header from the ncbi-nt database looks like this:

gi|160361034|gb|CP000884.1

This is the default format accepted by the blastdb command. The fields are separated by | (pipe) and the GI is used (--gene-index 1: internally the string is split by the separator and the second element is taken, since list indices are 0-based in Python). This corresponds to the following options:

--header-sep '|' --gene-index 1

Notice the single quotes around the pipe symbol: without them, bash would interpret it as piping to the next command. These values are the defaults.
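When the command line is assembled from a Python script, the quoting can be delegated to the standard library. A small sketch, assuming blast2gff is on the PATH and using placeholder file names (hits.tab, hits.gff); the command is only printed, not run:

```python
import shlex

# Arguments as a list: the pipe is just a one-character string here
args = [
    "blast2gff", "blastdb",
    "--header-sep", "|",
    "--gene-index", "1",
    "hits.tab", "hits.gff",  # placeholder input and output files
]

# shlex.join quotes every argument that the shell would otherwise interpret
command = shlex.join(args)
print(command)
```

If the list is passed directly to subprocess.run() without shell=True, no quoting is needed at all, since no shell is involved.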

To use the gb identifier from the same header instead, the only option that needs to be specified is:

--gene-index 3

This will get the fourth element of the header (since we’re splitting by pipe).

As in the uniprot command, the gene_id can be set to the whole header with the -n option, which is useful when a custom-made BLAST db was used. Pipe was made the default separator because it is used in the major databases, but the separator can be changed if the db used follows different conventions. There is also the option of changing the gene_id later in the output GFF if necessary.
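The effect of these options on a header can be sketched in a few lines of Python. The extract_gene_id function is a hypothetical helper that mirrors the -s, -i, -n and -r options; it is not the code used by the script, and the way the version is stripped (everything after the last dot) is an assumption.

```python
def extract_gene_id(header, sep="|", gene_index=1, no_split=False,
                    remove_version=False):
    """Pick the gene_id out of a sequence header.

    Hypothetical helper for illustration, not part of MGKit.
    """
    if no_split:
        # -n: the whole header is the gene_id
        gene_id = header
    else:
        # -s / -i: split by the separator, take the 0-based column
        gene_id = header.split(sep)[gene_index]
    if remove_version:
        # -r: assumption, the version is the suffix after the last dot
        gene_id = gene_id.rsplit(".", 1)[0]
    return gene_id


header = "gi|160361034|gb|CP000884.1"
print(extract_gene_id(header))                  # GI, the default
print(extract_gene_id(header, gene_index=3))    # gb identifier
print(extract_gene_id(header, gene_index=3, remove_version=True))
print(extract_gene_id(header, no_split=True))   # whole header
```

For the header above, the four calls select 160361034, CP000884.1, CP000884 and the full header, respectively.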

Changes¶

Changed in version 0.2.6: added -r option to blastdb

Changed in version 0.2.5: added more options to give user control to the blastdb command

New in version 0.2.3: added --fasta-file option, added more data from a BLAST hit

New in version 0.2.2: added blastdb command

Changed in version 0.2.1: added -ft option

Changed in version 0.1.13: added -n parameter and -k option to uniprot command

New in version 0.1.12.

Options¶

Convert BLAST output to a GFF file

usage: blast2gff [-h] [-v | --quiet] [--cite] [--manual] [--version]
                 {uniprot,blastdb} ...

Named Arguments¶

`-v, --verbose`	more verbose - includes debug messages Default: 20
`--quiet`	less verbose - only error and critical messages
`--cite`	Show citation for the framework
`--manual`	Show the script manual
`--version`	show program’s version number and exit

Sub-commands:¶

uniprot¶

Blast results from a Uniprot database, by default SwissProt

blast2gff uniprot [-h] [-db DB_USED] [-n] [-dbq DB_QUALITY] [-b BITSCORE]
                  [-k ATTR_VALUE] [-ft FEAT_TYPE] [-a FASTA_FILE]
                  [-v | --quiet] [--cite] [--manual] [--version]
                  [input_file] [output_file]

Positional Arguments¶

input_file

BLAST+ output file in tabular format, defaults to stdin

Default: -

output_file

Output GFF file, defaults to stdout

Default: standard output (stdout)

Named Arguments¶

`-db, --db-used`	Uniprot database used with BLAST Default: “UNIPROT-SP”
`-n, --no-split`	if used, the script assumes that the sequence header contains only the gene id Default: False
`-dbq, --db-quality`
	Quality of the DB used Default: 10
`-b, --bitscore`	Minimum bitscore to keep the annotation Default: 0.0
`-k, --attr-value`
	Additional attribute and value to add to each annotation, in the form attr:value
`-ft, --feat-type`
	Feature type to use in the GFF Default: “CDS”
`-a, --fasta-file`
	FASTA file with nucleotide sequences, used to calculate the frame; if not given, the frame on the ‘-’ strand will always be 0
`-v, --verbose`	more verbose - includes debug messages Default: 20
`--quiet`	less verbose - only error and critical messages
`--cite`	Show citation for the framework
`--manual`	Show the script manual
`--version`	show program’s version number and exit

blastdb¶

Blast results from an NCBI database, like nt

blast2gff blastdb [-h] [-db DB_USED] [-n] [-s HEADER_SEP] [-i GENE_INDEX] [-r]
                  [-dbq DB_QUALITY] [-b BITSCORE] [-k ATTR_VALUE]
                  [-ft FEAT_TYPE] [-a FASTA_FILE] [-v | --quiet] [--cite]
                  [--manual] [--version]
                  [input_file] [output_file]

Positional Arguments¶

input_file

BLAST+ output file in tabular format, defaults to stdin

Default: -

output_file

Output GFF file, defaults to stdout

Default: standard output (stdout)

Named Arguments¶

`-db, --db-used`	blastdb used Default: “NCBI-NT”
`-n, --no-split`	if used, the script assumes that the sequence header will be used as gene_id Default: False
`-s, --header-sep`
	The separator for the header, defaults to ‘|’ (pipe) Default: “|”
`-i, --gene-index`
	Which of the header columns (0-based) to use as gene_id (defaults to 1 - the second column) Default: 1
`-r, --remove-version`
	if used, the script removes the version information from the gene_id Default: False
`-dbq, --db-quality`
	Quality of the DB used Default: 10
`-b, --bitscore`	Minimum bitscore to keep the annotation Default: 0.0
`-k, --attr-value`
	Additional attribute and value to add to each annotation, in the form attr:value
`-ft, --feat-type`
	Feature type to use in the GFF Default: “CDS”
`-a, --fasta-file`
	FASTA file with nucleotide sequences, used to calculate the frame; if not given, the frame on the ‘-’ strand will always be 0
`-v, --verbose`	more verbose - includes debug messages Default: 20
`--quiet`	less verbose - only error and critical messages
`--cite`	Show citation for the framework
`--manual`	Show the script manual
`--version`	show program’s version number and exit