mgkit.io.blast module

Blast routines and parsers

mgkit.io.blast.add_blast_result_to_annotation(annotation, gi_taxa_dict, taxonomy, threshold=60)

Deprecated since version 0.4.0.

Adds blast information to a GFF annotation.

Parameters:

annotation – GFF annotation object
gi_taxa_dict (dict) – dictionary returned by parse_gi_taxa_table()
taxonomy – Uniprot taxonomy, used to add the taxon name to the annotation

mgkit.io.blast.parse_accession_taxa_table(file_handle, acc_ids=None, key=1, value=2, num_lines=1000000, no_zero=True)

New in version 0.2.5.

Changed in version 0.3.0: added no_zero

This function supersedes parse_gi_taxa_table(), since NCBI is deprecating GIDs in favor of accessions like X53318. The new files can be found on the NCBI FTP site at ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid; for DNA sequences (nt DB) the file is nucl_gb.accession2taxid.gz.

The file contains 4 columns: the first is the accession without its version, the second is the accession including the version, the third is the taxonomic identifier, and the fourth is either the old GID or na.

The column used as key is the second (index 1), since by default the FASTA headers used in NCBI DBs use the versioned identifier. To use the GID as key, set the key parameter to 3; lines for which no GID is available (na, as per the file README) are then skipped.

Parameters:

file_handle (str, file) – file name or open file handle
acc_ids (None, list) – if it’s not None only the keys included in the passed acc_ids list will be returned
key (int) – 0-based index for the column to use as accession. Defaults to the versioned accession that is used in GenBank fasta files.
num_lines (None, int) – number of lines after which a message is logged. If None, no message is logged
no_zero (bool) – if True (default) a key with a taxon_id of 0 is not yielded

Note

GIDs are being phased out in September 2016: http://www.ncbi.nlm.nih.gov/news/03-02-2016-phase-out-of-GI-numbers/
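The column layout and the key, acc_ids and no_zero options described above can be illustrated with a small stand-alone sketch. This is not mgkit's implementation: the function name iter_accession_taxa is hypothetical, the sample rows are made up, and the sketch assumes that the file starts with a header line whose first field is accession and that value is the 0-based index of the taxon id column.

```python
import io

# Made-up rows in the 4-column accession2taxid layout (not real NCBI records).
SAMPLE = (
    "accession\taccession.version\ttaxid\tgi\n"
    "ABC00001\tABC00001.1\t9606\t111\n"
    "ABC00002\tABC00002.2\t0\t222\n"
    "ABC00003\tABC00003.1\t562\tna\n"
)


def iter_accession_taxa(handle, acc_ids=None, key=1, value=2, no_zero=True):
    """Yield (key, taxon_id) pairs from accession2taxid-style lines.

    Hypothetical helper that mirrors the documented behaviour only.
    """
    wanted = None if acc_ids is None else set(acc_ids)
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        # Assumption: skip the header line and rows without 4 columns
        if len(fields) < 4 or fields[0] == "accession":
            continue
        acc = fields[key]
        # "na" marks a missing GID, so the line cannot be keyed by it
        if acc == "na":
            continue
        taxon_id = int(fields[value])
        if no_zero and taxon_id == 0:
            continue
        if wanted is not None and acc not in wanted:
            continue
        yield acc, taxon_id


# Default: keyed by the versioned accession (column index 1)
by_version = dict(iter_accession_taxa(io.StringIO(SAMPLE)))
# Keyed by GID (column index 3); the row whose GID is "na" is skipped
by_gid = dict(iter_accession_taxa(io.StringIO(SAMPLE), key=3))
```

With the sample rows, the row with a taxon id of 0 is dropped in both cases because no_zero defaults to True.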

mgkit.io.blast.parse_blast_tab(file_handle, seq_id=0, ret_col=(0, 1, 2, 6, 7, 11), key_func=None, value_funcs=None)

New in version 0.1.12.

Parses blast output tab format, returning for each line a key (the query id) and the columns requested in a tuple.

Parameters:

file_handle (file) – file name or file handle for the blast output
seq_id (int) – index for the column which has the query id
ret_col (list, None) – list of indexes for the columns to be returned or None if all columns must be returned
key_func (None, func) – function to transform the query id value into the key returned. If None, the query id is used
value_funcs (None, list) – list of functions to transform the value of all the requested columns. If None the values are not converted

Yields:

tuple – a tuple whose first element is the query id (after key_func is applied, if provided) and whose second element is a tuple with the columns requested in ret_col

BLAST+ used with -outfmt 6, default columns
column index	description
0	query name
1	subject name
2	percent identities
3	aligned length
4	number of mismatched positions
5	number of gap positions
6	query sequence start
7	query sequence end
8	subject sequence start
9	subject sequence end
10	e-value
11	bit score
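To make the roles of seq_id, ret_col, key_func and value_funcs concrete, here is a minimal stand-alone sketch of the parsing described above. It is not mgkit's code: the name iter_blast_tab is hypothetical, the sample line is made up, and skipping blank and # comment lines is an assumption of the sketch.

```python
import io

# One made-up line in the default BLAST+ -outfmt 6 layout (12 columns).
SAMPLE = "read1\tsp|P12345|AATM_RABIT\t98.5\t200\t3\t0\t1\t200\t11\t210\t1e-50\t250.0\n"


def iter_blast_tab(handle, seq_id=0, ret_col=(0, 1, 2, 6, 7, 11),
                   key_func=None, value_funcs=None):
    """Yield (key, values) for each tabular BLAST line (hypothetical helper)."""
    for line in handle:
        # Assumption: ignore blank lines and comment lines
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        key = fields[seq_id]
        if key_func is not None:
            key = key_func(key)
        # ret_col=None means "return every column"
        columns = range(len(fields)) if ret_col is None else ret_col
        values = [fields[index] for index in columns]
        if value_funcs is not None:
            # One conversion function per requested column, in order
            values = [func(val) for func, val in zip(value_funcs, values)]
        yield key, tuple(values)


hits = list(
    iter_blast_tab(
        io.StringIO(SAMPLE),
        key_func=str.upper,
        value_funcs=(str, str, float, int, int, float),
    )
)
```

Here the default ret_col picks the query name, subject name, percent identity, query start, query end and bit score, and value_funcs converts the numeric columns from strings.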

mgkit.io.blast.parse_fragment_blast(file_handle, bitscore=40.0)

New in version 0.1.13.

Parses BLAST output where the sequences are the single annotations, so the sequence names are the uid of the annotations.

Only the best hits are returned, selected by maximum bitscore and identity.

Parameters:

file_handle (str, file) – file name or open file handle
bitscore (float) – minimum bitscore for accepting a hit

Yields:

tuple – a tuple whose first element is the uid (the sequence name) and whose second element is a list of tuples; in each of these, the first element is the GID (NCBI identifier), the second is the identity and the third is the bitscore of the hit.
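One possible reading of "best hits" is sketched below: hits under the bitscore threshold are discarded, then for each uid only the hits that share the highest (bitscore, identity) pair are kept. This is an interpretation, not mgkit's implementation; the function name best_hits, the input format and the made-up hits are all assumptions.

```python
from collections import defaultdict


def best_hits(hits, bitscore=40.0):
    """Yield (uid, [(gid, identity, bitscore), ...]) with only the top hits.

    hits is an iterable of (uid, gid, identity, bitscore) tuples
    (hypothetical input format for this sketch).
    """
    grouped = defaultdict(list)
    for uid, gid, identity, score in hits:
        # Assumption: the threshold is inclusive
        if score >= bitscore:
            grouped[uid].append((gid, identity, score))
    for uid, uid_hits in grouped.items():
        # Rank by bitscore first, then by identity
        top = max((score, identity) for _, identity, score in uid_hits)
        yield uid, [hit for hit in uid_hits if (hit[2], hit[1]) == top]


# Made-up hits: (uid, gid, identity, bitscore)
HITS = [
    ("uid1", "gi1", 90.0, 100.0),
    ("uid1", "gi2", 95.0, 100.0),
    ("uid1", "gi3", 99.0, 30.0),   # below the default threshold
    ("uid2", "gi4", 80.0, 50.0),
    ("uid2", "gi5", 80.0, 50.0),   # tie: both hits are kept
]

best = dict(best_hits(HITS))
```

Keeping ties in a list matches the documented return type, where the second element of each yielded tuple is a list of hits.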

mgkit.io.blast.parse_uniprot_blast(file_handle, bitscore=40, db='UNIPROT-SP', dbq=10, name_func=None, feat_type='CDS', seq_lengths=None)

New in version 0.1.12.

Changed in version 0.1.13: added name_func argument

Changed in version 0.2.1: added feat_type

Changed in version 0.2.3: added seq_lengths and added subject start and end and e-value

Parses BLAST results in tabular format using parse_blast_tab(), applying a basic bitscore filter. Returns the annotations associated with each BLAST hit.

Parameters:

file_handle (str, file) – file name or open file handle
bitscore (int, float) – the minimum bitscore for an annotation to be accepted
db (str) – database used
dbq (int) – an index indicating the quality of the sequence database used; this value is used in the filtering of annotations
name_func (func) – function to convert the name of the database sequences. Defaults to lambda x: x.split('|')[1], which can be used with fasta files provided by Uniprot
feat_type (str) – feature type in the GFF
seq_lengths (dict) – dictionary with the sequence lengths, used to deduce the frame on the '-' strand

Yields:

Annotation – an mgkit.io.gff.Annotation instance for each BLAST hit.
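The default name_func can be tried on its own to see what it extracts from a Uniprot-style sequence name. The snippet below is a stand-alone illustration: the header string is made up, and plain_or_pipe_name is a hypothetical alternative for databases whose sequence names may not contain | separators.

```python
# The documented default: take the second |-separated field
default_name_func = lambda x: x.split('|')[1]


def plain_or_pipe_name(name):
    """Hypothetical name_func: fall back to the full name without '|'."""
    parts = name.split('|')
    return parts[1] if len(parts) > 1 else name


header = "sp|P12345|AATM_RABIT"  # made-up Uniprot-style subject name
accession = default_name_func(header)
fallback = plain_or_pipe_name("my_custom_sequence")
```

For a header in the sp|ACCESSION|ENTRY_NAME form, the default returns the accession; for a name without | it would raise an IndexError, which is why a custom function like the one above may be passed as name_func.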