mgkit.io.blast module
Blast routines and parsers
-
mgkit.io.blast.add_blast_result_to_annotation(annotation, gi_taxa_dict, taxonomy, threshold=60)
Deprecated since version 0.4.0.
Adds blast information to a GFF annotation.
Parameters: - annotation – GFF annotation object
- gi_taxa_dict (dict) – dictionary returned by parse_gi_taxa_table()
- taxonomy – Uniprot taxonomy, used to add the taxon name to the annotation
-
mgkit.io.blast.parse_accession_taxa_table(file_handle, acc_ids=None, key=1, value=2, num_lines=1000000, no_zero=True)
New in version 0.2.5.
Changed in version 0.3.0: added no_zero
This function supersedes parse_gi_taxa_table(), since NCBI is deprecating the GIDs in favor of accessions like X53318. The new file can be found on the NCBI FTP site at ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid; for DNA sequences (nt DB) it is nucl_gb.accession2taxid.gz.
The file contains 4 columns: the first is the accession without its version, the second includes the version, the third is the taxonomic identifier and the fourth is either the old GID or na.
The column used as key is the second, since by default the fasta headers used in NCBI DBs use the versioned identifier. To use the GID as key, the key parameter can be set to 3, but if no identifier is found (na as per the file README), the line is skipped.
Parameters: - file_handle (str, file) – file name or open file handle
- acc_ids (None, list) – if it’s not None only the keys included in the passed acc_ids list will be returned
- key (int) – 0-based index for the column to use as accession. Defaults to the versioned accession that is used in GenBank fasta files.
- num_lines (None, int) – number of lines after which a message is logged. If None, no message is logged
- no_zero (bool) – if True (default) a key with a taxon_id of 0 is not yielded
Note
GIDs are being phased out in September 2016: http://www.ncbi.nlm.nih.gov/news/03-02-2016-phase-out-of-GI-numbers/
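The column layout and the key, acc_ids and no_zero options described above can be illustrated with a minimal standard-library sketch. It is not the mgkit implementation: the function name parse_accession_table and the sample rows are invented for illustration, and the sketch assumes tab-separated input without a header line (a header, if present in the downloaded file, would need to be skipped first).

```python
import io


def parse_accession_table(handle, acc_ids=None, key=1, value=2, no_zero=True):
    """Yield (key, taxon_id) pairs from a 4-column, tab-separated stream.

    Hypothetical helper that mirrors the documented parameters; it is
    not the function shipped with mgkit.
    """
    wanted = None if acc_ids is None else set(acc_ids)
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        acc = fields[key]
        # "na" marks a missing GID in the fourth column: skip the line
        if acc == "na":
            continue
        # keep only the requested keys, if a list was passed
        if wanted is not None and acc not in wanted:
            continue
        taxon_id = int(fields[value])
        # drop entries with a taxon_id of 0 when no_zero is set
        if no_zero and taxon_id == 0:
            continue
        yield acc, taxon_id


# Invented rows: accession, accession.version, taxon id, GID (or "na")
SAMPLE = (
    "AAA00001\tAAA00001.1\t562\t111\n"
    "AAA00002\tAAA00002.1\t0\t222\n"
    "AAA00003\tAAA00003.1\t9913\tna\n"
)

# Default: versioned accession (column 1) as key
by_version = dict(parse_accession_table(io.StringIO(SAMPLE)))

# GID (column 3) as key: the "na" line and the zero taxon are skipped
by_gid = dict(parse_accession_table(io.StringIO(SAMPLE), key=3))
```

Reading the file line by line through a generator keeps memory use low, which matters because the real accession2taxid files are very large.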
-
mgkit.io.blast.parse_blast_tab(file_handle, seq_id=0, ret_col=(0, 1, 2, 6, 7, 11), key_func=None, value_funcs=None)
New in version 0.1.12.
Parses blast output tab format, returning for each line a key (the query id) and the columns requested in a tuple.
Parameters: - file_handle (file) – file name or file handle for the blast output
- seq_id (int) – index for the column which has the query id
- ret_col (list, None) – list of indexes for the columns to be returned or None if all columns must be returned
- key_func (None, func) – function to transform the query id value into the key returned. If None, the query id is used
- value_funcs (None, list) – list of functions to transform the value of all the requested columns. If None the values are not converted
Yields: tuple – a tuple whose first element is the query id (after key_func is applied, if one was passed) and whose second element is a tuple with the columns requested in ret_col
column index   description
0              query name
1              subject name
2              percent identities
3              aligned length
4              number of mismatched positions
5              number of gap positions
6              query sequence start
7              query sequence end
8              subject sequence start
9              subject sequence end
10             e-value
11             bit score
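The way seq_id, ret_col, key_func and value_funcs interact can be sketched with the standard library alone. This is an illustration, not mgkit's code: the name parse_tab and the sample line are invented, and two behaviours are assumptions, namely that value_funcs holds one function per requested column and that lines starting with # are skipped.

```python
import io


def parse_tab(handle, seq_id=0, ret_col=(0, 1, 2, 6, 7, 11),
              key_func=None, value_funcs=None):
    """Yield (key, values) for each line of a tab-separated BLAST output.

    Hypothetical stand-in for the documented function.
    """
    for line in handle:
        # assumption: comment and blank lines carry no hit
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        key = fields[seq_id]
        if key_func is not None:
            key = key_func(key)
        # None means "return every column"
        columns = range(len(fields)) if ret_col is None else ret_col
        values = [fields[index] for index in columns]
        # assumption: one conversion function per requested column
        if value_funcs is not None:
            values = [func(value) for func, value in zip(value_funcs, values)]
        yield key, tuple(values)


# Invented line with the 12 columns listed in the table above
LINE = "read1\tsp|P12345|TEST_ORG\t98.5\t120\t2\t0\t1\t120\t5\t124\t1e-50\t210.0\n"

hits = list(
    parse_tab(io.StringIO(LINE), value_funcs=(str, str, float, int, int, float))
)
```

With the default ret_col, each value tuple holds the query name, subject name, percent identity, query start, query end and bit score, in that order.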
-
mgkit.io.blast.parse_fragment_blast(file_handle, bitscore=40.0)
New in version 0.1.13.
Parse the output of a BLAST run where the sequences are the single annotations, so the sequence names are the uid of the annotations.
The only returned values are the best hits, maxed by bitscore and identity.
Parameters: Yields: tuple – a tuple whose first element is the uid (the sequence name) and the second is a list of tuples whose first element is the GID (NCBI identifier), the second one is the identity and the third is the bitscore of the hit.
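One possible reading of "best hits, maxed by bitscore and identity" is sketched below. The exact rule used by mgkit is not stated here, so this sketch makes assumptions: hits below the bitscore threshold are dropped, and for each uid only the hits sharing the highest (bitscore, identity) pair are kept. The function name best_hits and the input rows are invented.

```python
from collections import defaultdict


def best_hits(rows, bitscore=40.0):
    """Yield (uid, [(gid, identity, bitscore), ...]) keeping only top hits.

    rows is an iterable of (uid, gid, identity, bitscore) tuples.
    Hypothetical helper, not the mgkit implementation.
    """
    grouped = defaultdict(list)
    for uid, gid, identity, score in rows:
        # assumption: the threshold is inclusive
        if score >= bitscore:
            grouped[uid].append((gid, identity, score))
    for uid, hits in grouped.items():
        # rank by bitscore first, then by identity
        top = max((score, identity) for _, identity, score in hits)
        yield uid, [hit for hit in hits if (hit[2], hit[1]) == top]


# Invented hits: (uid, gid, identity, bitscore)
ROWS = [
    ("uid1", "1001", 90.0, 100.0),
    ("uid1", "1002", 95.0, 100.0),
    ("uid1", "1003", 99.0, 50.0),
    ("uid2", "1004", 80.0, 30.0),
]

result = dict(best_hits(ROWS))
```

In this sample uid2 disappears because its only hit scores below the threshold, and uid1 keeps the single hit with the highest bitscore and, among ties, the highest identity.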
-
mgkit.io.blast.parse_uniprot_blast(file_handle, bitscore=40, db='UNIPROT-SP', dbq=10, name_func=None, feat_type='CDS', seq_lengths=None)
New in version 0.1.12.
Changed in version 0.1.13: added name_func argument
Changed in version 0.2.1: added feat_type
Changed in version 0.2.3: added seq_lengths and added subject start and end and e-value
Parses BLAST results in tabular format using parse_blast_tab(), applying a basic bitscore filter. Returns the annotations associated with each BLAST hit.
Parameters: - file_handle (str, file) – file name or open file handle
- bitscore (int, float) – the minimum bitscore for an annotation to be accepted
- db (str) – database used
- dbq (int) – an index indicating the quality of the sequence database used; this value is used in the filtering of annotations
- name_func (func) – function to convert the name of the database sequences. Defaults to lambda x: x.split('|')[1], which can be used with fasta files provided by Uniprot
- feat_type (str) – feature type in the GFF
- seq_lengths (dict) – dictionary with the sequence lengths, used to deduce the frame of the '-' strand
Yields: Annotation – an mgkit.io.gff.Annotation instance for each BLAST hit.
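The default name_func and the bitscore filter can be tried on their own, without mgkit installed. The snippet below is only a sketch: the hit tuples and the Uniprot-style sequence names are invented, the threshold is assumed to be inclusive, and plain tuples stand in for the Annotation instances the real function yields.

```python
# Default conversion quoted in the parameter list: keep the second
# "|"-separated field of a Uniprot fasta header
name_func = lambda x: x.split("|")[1]

# Invented hits: (query id, subject name, bitscore)
HITS = [
    ("contig1", "sp|P12345|TEST_ORG", 250.0),
    ("contig1", "sp|Q99999|OTHER_ORG", 35.0),
]

MIN_BITSCORE = 40  # same value as the documented default for bitscore

# Keep hits at or above the threshold and shorten the subject name
kept = [
    (query, name_func(subject), score)
    for query, subject, score in HITS
    if score >= MIN_BITSCORE
]
```

A custom name_func is useful when the database headers do not follow the Uniprot "db|accession|name" layout, since the default would then raise an IndexError or pick the wrong field.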