mgkit.workflow.assembly module
Workflow associated with assembly statistics and evaluation
mgkit.workflow.assembly.assign_contigs_to_taxa(annotations, root_map=None, black_list=None)
Groups annotations by contig (seq_id) and counts how many contigs each taxon, or its root if root_map is supplied, has been assigned to.
The actual form of the dictionary is like this:
Note: the number of ranks for a taxon is not predetermined, but depends on the values returned by rank_annotations_by_attr().
Parameters:
- annotations (iterable) – list of gff.GFFKegg instances
- root_map (dict) – dictionary taxon->root
Return dict: dictionary
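The grouping and counting described above can be sketched as follows. This is an illustrative approximation, not the mgkit implementation: `Annotation`, `top_taxon_and_rank` and `assign_contigs_sketch` are hypothetical names, the nested taxon -> rank -> count layout is inferred from the description of filter_contig_assignments(), the rank formula is an assumption, and the undocumented black_list parameter is left out.

```python
from collections import Counter, defaultdict, namedtuple

# Hypothetical stand-in for a gff.GFFKegg instance; only seq_id and taxon are used.
Annotation = namedtuple("Annotation", ["seq_id", "taxon"])


def top_taxon_and_rank(annotations):
    """Return the most frequent taxon and an assumed 0-10 rank for one contig."""
    counts = Counter(annotation.taxon for annotation in annotations)
    taxon, count = counts.most_common(1)[0]
    # Assumption: rank is the top taxon's share of all counts, scaled to 0-10.
    return taxon, count * 10 // sum(counts.values())


def assign_contigs_sketch(annotations, root_map=None):
    """Count, per taxon and rank, how many contigs were assigned to that taxon."""
    # Group the annotations by the contig they belong to.
    by_contig = defaultdict(list)
    for annotation in annotations:
        by_contig[annotation.seq_id].append(annotation)

    assignments = defaultdict(Counter)
    for contig_annotations in by_contig.values():
        taxon, rank = top_taxon_and_rank(contig_annotations)
        if root_map is not None:
            # Assumption: a taxon missing from root_map is kept unchanged.
            taxon = root_map.get(taxon, taxon)
        assignments[taxon][rank] += 1
    return {taxon: dict(ranks) for taxon, ranks in assignments.items()}


annotations = [
    Annotation("contig1", "Prevotella"),
    Annotation("contig1", "Prevotella"),
    Annotation("contig1", "Prevotella"),
    Annotation("contig1", "Methanobrevibacter"),
    Annotation("contig2", "Prevotella"),
    Annotation("contig2", "Prevotella"),
    Annotation("contig3", "Methanobrevibacter"),
]
by_taxon = assign_contigs_sketch(annotations)
by_root = assign_contigs_sketch(
    annotations,
    root_map={"Prevotella": "Bacteria", "Methanobrevibacter": "Archaea"},
)
```

With these made-up annotations, Prevotella receives one contig at rank 7 and one at rank 10, which shows why the set of ranks per taxon depends on the data.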
mgkit.workflow.assembly.basic_stats(array, sep)
Returns formatted basic statistics for contig lengths
mgkit.workflow.assembly.filter_contig_assignments(contig_assign, threshold=5, min_counts=1)
Filters contig assignments using a threshold for the rank: for each taxon, the counts of all ranks greater than or equal to threshold are summed up.
Parameters:
- contig_assign (dict) – dictionary returned by assign_contigs_to_taxa()
- threshold (int) – the minimum rank for which the counts are summed up
Return dict: dictionary in the form taxon_name->count
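A minimal sketch of the thresholding step, assuming the input has the nested form taxon -> rank -> count. The function name `filter_assignments_sketch` is hypothetical, and the handling of min_counts (dropping taxa whose summed count falls below it) is an assumption, since that parameter is not documented here.

```python
def filter_assignments_sketch(contig_assign, threshold=5, min_counts=1):
    """Sum, per taxon, the counts of every rank >= threshold."""
    filtered = {}
    for taxon, rank_counts in contig_assign.items():
        total = sum(
            count for rank, count in rank_counts.items() if rank >= threshold
        )
        # Assumption: taxa with a summed count below min_counts are discarded.
        if total >= min_counts:
            filtered[taxon] = total
    return filtered


# Made-up input: only ranks 7 and 10 pass the default threshold of 5.
contig_assign = {
    "Prevotella": {3: 4, 7: 1, 10: 1},
    "Methanobrevibacter": {4: 2},
}
filtered = filter_assignments_sketch(contig_assign)
strict = filter_assignments_sketch(contig_assign, threshold=10)
```

Raising threshold keeps only contigs whose annotations agree more strongly on a single taxon, at the cost of fewer assigned contigs.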
mgkit.workflow.assembly.rank_annotations_by_attr(annotations, attr='taxon')
For all annotations in the list (usually all annotations for a contig), counts how many times each value of the attribute attr appears. The resulting dictionary is then sorted by count, and the value with the highest count is ranked by the share of the total count it represents.
The rank is an integer number between 0 and 10.
Parameters:
- annotations (iterable) – list of gff.GFFKegg instances
- attr (str) – the attribute for which the annotations are counted
Return tuple: the attr with the most counts and its rank
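The ranking can be illustrated with a short sketch. It is not the library's code: `Annotation` and `rank_by_attr_sketch` are hypothetical names, and the exact way the share is mapped to an integer between 0 and 10 (here, scaling by 10 and truncating) is an assumption.

```python
from collections import Counter, namedtuple

# Hypothetical stand-in for a gff.GFFKegg instance.
Annotation = namedtuple("Annotation", ["seq_id", "taxon"])


def rank_by_attr_sketch(annotations, attr="taxon"):
    """Return the most frequent value of attr and its 0-10 rank.

    Expects at least one annotation.
    """
    counts = Counter(getattr(annotation, attr) for annotation in annotations)
    top_value, top_count = counts.most_common(1)[0]
    # Assumption: rank = share of the total count, scaled to 0-10 and truncated.
    rank = top_count * 10 // sum(counts.values())
    return top_value, rank


contig_annotations = [
    Annotation("contig1", "Prevotella"),
    Annotation("contig1", "Prevotella"),
    Annotation("contig1", "Prevotella"),
    Annotation("contig1", "Methanobrevibacter"),
]
result = rank_by_attr_sketch(contig_annotations)
```

Here three of four annotations agree, so the top taxon holds 75% of the counts and gets rank 7 under the truncation assumption; a contig whose annotations all agree would get rank 10.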
mgkit.workflow.assembly.write_fasta_summary(file_handle, seq_lengths, seq_lengths_filt, sep='\t')
Write summary file for assembly
Parameters:
- file_handle – file handle for output
- seq_lengths (array) – array for sequence lengths
- seq_lengths_filt (array) – array for sequence lengths of annotated contigs
- sep (str) – string used as column separator