mgkit.utils.dictionary module

Dictionary utils

class mgkit.utils.dictionary.HDFDict(file_name, table, cast=<class 'int'>, cache=True)[source]

Bases: object

Changed in version 0.3.3: added cache in __init__

New in version 0.3.1.

Uses a table in an HDFStore (from pandas) as a dictionary. The table must be indexed to perform well. Read only.

Note

The dictionary cannot be modified, and a ValueError exception will be raised if the table is not in the file.

mgkit.utils.dictionary.apply_func_to_values(dictionary, func)[source]

New in version 0.1.12.

Assuming a dictionary whose values are iterables, func is applied to each element of each iterable, returning for every key a set of all transformed elements.

Parameters:
  • dictionary (dict) – dictionary whose values are iterables
  • func (func) – function to apply to the dictionary values
Returns:

dictionary with transformed values

Return type:

dict
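The behaviour described above can be sketched in plain Python. This is a minimal illustration, not mgkit's implementation; the name apply_func_to_values_sketch is hypothetical.

```python
def apply_func_to_values_sketch(dictionary, func):
    """Apply func to every element of each value, collecting results in a set."""
    return {
        key: {func(element) for element in values}
        for key, values in dictionary.items()
    }


# Each value is an iterable; duplicates in the transformed output collapse
lengths = apply_func_to_values_sketch({"a": ["x", "yy", "zz"], "b": []}, len)
print(lengths)
```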

class mgkit.utils.dictionary.cache_dict_file(iterator, skip_lines=0)[source]

Bases: object

New in version 0.3.0.

Used to cache the results of an iterator that yields tuples (key, value). The class behaves like a dictionary: if the key is found in the internal dictionary, the corresponding value is returned; otherwise the iterator is advanced until the key is found.

Example

>>> from mgkit.io.blast import parse_accession_taxa_table
>>> i = parse_accession_taxa_table('nucl_gb.accession2taxid.gz', key=0)
>>> d = cache_dict_file(i)
>>> d['AH001684']
4400
next()[source]
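The lazy caching idea can be sketched without mgkit. The class below is a hypothetical stand-in (LazyCacheDict is not part of the library); it ignores skip_lines and assumes that a KeyError is appropriate once the iterator is exhausted.

```python
class LazyCacheDict:
    """Hypothetical stand-in: caches (key, value) pairs pulled from an iterator."""

    def __init__(self, iterator):
        self._iterator = iter(iterator)
        self._cache = {}

    def __getitem__(self, key):
        # Serve from the cache when the key was already seen
        if key in self._cache:
            return self._cache[key]
        # Otherwise advance the iterator, caching every pair on the way
        for new_key, value in self._iterator:
            self._cache[new_key] = value
            if new_key == key:
                return value
        # Assumption: an exhausted iterator means the key does not exist
        raise KeyError(key)


pairs = iter([("AH001684", 4400), ("X1", 1), ("X2", 2)])
d = LazyCacheDict(pairs)
first = d["X1"]          # consumes the first two pairs
cached = d["AH001684"]   # answered from the cache, iterator untouched
remaining = list(pairs)  # only the unread pair is left
```

Only as much of the iterator as needed is read, which is the point when the source is a large file.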
mgkit.utils.dictionary.combine_dict(keydict, valuedict)[source]

Combines two dictionaries when the values of keydict are iterables. The combined dictionary has the same keys as keydict, and its values are sets containing all the values in valuedict associated with the keydict values.

key1 -> [v1, v2, .., vN]

v1 -> [u1, u2, .., uN]
v2 -> [t1, t2, .., tN]

Resulting dictionary will be

key1->{u1, u2, .., uN, t1, t2, .., tN}

Parameters:
  • keydict (dict) – dictionary whose keys are the same as the returned dictionary
  • valuedict (dict) – dictionary whose values are the same as the returned dictionary
Return dict:

combined dictionary
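A minimal sketch of the combination described above, written in plain Python. The name combine_dict_sketch is hypothetical, and skipping intermediate values that are missing from valuedict is an assumption rather than documented behaviour.

```python
def combine_dict_sketch(keydict, valuedict):
    """Map each key of keydict to the union of valuedict entries it points to."""
    combined = {}
    for key, intermediates in keydict.items():
        values = set()
        for intermediate in intermediates:
            # Assumption: intermediates missing from valuedict are skipped
            values.update(valuedict.get(intermediate, ()))
        combined[key] = values
    return combined


keydict = {"key1": ["v1", "v2"]}
valuedict = {"v1": ["u1", "u2"], "v2": ["t1"]}
combined = combine_dict_sketch(keydict, valuedict)
print(combined)
```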

mgkit.utils.dictionary.combine_dict_one_value(keydict, valuedict)[source]

Combines two dictionaries: each value of keydict is used as a key in valuedict, and the resulting dictionary is composed of keydict keys and valuedict values.

Same as combine_dict(), but each value in keydict is a single element that is a key in valuedict.

Parameters:
  • keydict (dict) – dictionary whose keys are the same as the returned dictionary
  • valuedict (dict) – dictionary whose values are the same as the returned dictionary
Return dict:

combined dictionary

mgkit.utils.dictionary.dict_to_text(stream, dictionary, header=None, comment=None, sep='\t')[source]

New in version 0.4.4.

Writes the content of a dictionary to a stream that supports write, such as io.StringIO or an opened file. Intended only for dictionaries whose keys and values are integers or strings; other data types are better served by more complex options, like JSON.

Warning

The file is expected to be opened for writing in text mode (e.g. ‘w’)

Parameters:
  • stream (file) – stream to write to, to output a string, use io.StringIO
  • dictionary (dict) – dictionary to write
  • header (iterable) – a tuple/list to be used as header
  • comment (str) – a comment at the start of the file - ‘# ‘ will be prepended to the value passed.
  • sep (str) – column separator to use
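The output layout implied by these parameters can be sketched as follows. This is an illustration under assumptions, not mgkit's code: the name dict_to_text_sketch is hypothetical, and the order (comment first, then header, then one key/value pair per line) is inferred from the parameter descriptions.

```python
import io


def dict_to_text_sketch(stream, dictionary, header=None, comment=None, sep="\t"):
    """Write key/value pairs to a text stream, one pair per line."""
    if comment is not None:
        # '# ' is prepended to the comment, as the parameter description states
        stream.write("# {}\n".format(comment))
    if header is not None:
        stream.write(sep.join(header) + "\n")
    for key, value in dictionary.items():
        stream.write("{}{}{}\n".format(key, sep, value))


# io.StringIO collects the output as a string instead of writing a file
buffer = io.StringIO()
dict_to_text_sketch(
    buffer,
    {"K00001": 3, "K00002": 7},
    header=("gene", "count"),
    comment="example",
)
text = buffer.getvalue()
print(text)
```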
mgkit.utils.dictionary.filter_nan(ratios)[source]

Returns a dictionary with the NaN values taken out

mgkit.utils.dictionary.filter_ratios_by_numbers(ratios, min_num)[source]

Returns only the items of a dictionary whose value (an iterable) has a length greater than or equal to min_num.

Parameters:
  • ratios (dict) – dictionary key->list
  • min_num (int) – minimum number of elements in the value iterable
Return dict:

filtered dictionary
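A short sketch of this filter in plain Python; the name filter_by_length_sketch is hypothetical and the values are assumed to support len().

```python
def filter_by_length_sketch(ratios, min_num):
    """Keep only the items whose value has at least min_num elements."""
    return {
        key: values
        for key, values in ratios.items()
        if len(values) >= min_num
    }


filtered = filter_by_length_sketch({"a": [0.1, 0.2, 0.3], "b": [0.5]}, 2)
print(filtered)
```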

mgkit.utils.dictionary.find_id_in_dict(s_id, s_dict)[source]

Finds a value ‘s_id’ in a dictionary in which the values are iterables. Returns a list of keys that contain the value.

Parameters:
  • s_id (object) – element to look for in the dictionary’s values
  • s_dict (dict) – dictionary to search in
Return list:

list of keys in which s_id was found
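The lookup can be sketched as a single comprehension. The name find_id_in_dict_sketch is hypothetical; the parameter names follow the signature (s_id, s_dict).

```python
def find_id_in_dict_sketch(s_id, s_dict):
    """Return the keys whose value (an iterable) contains s_id."""
    return [key for key, values in s_dict.items() if s_id in values]


mapping = {"k1": ["a", "b"], "k2": ["b"], "k3": ["c"]}
found = find_id_in_dict_sketch("b", mapping)
print(found)
```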

Given a dictionary whose values (iterables) can be linked back to other keys, it returns a dictionary in which the keys are the original keys and the values are sets of keys to which they can be linked.

key1->[v1, v2]
key2->[v3, v4]
key3->[v2, v4]

Becomes:

key1->[key1, key3]
key2->[key3]
key3->[key1, key2]

Parameters:
  • id_map (dict) – dictionary of keys to link
  • black_list (iterable) – iterable of values to skip in making the links
Return dict:

linked dictionary

mgkit.utils.dictionary.merge_dictionaries(dicts)[source]

New in version 0.3.1.

Merges keys and values from a list/iterable of dictionaries. The resulting dictionary’s values are converted into sets, with the assumption that the values are one of the following: float, str, int, bool
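A minimal sketch of the merge, assuming each input value is a single hashable item (float, str, int or bool, as stated above). The name merge_dictionaries_sketch is hypothetical.

```python
def merge_dictionaries_sketch(dicts):
    """Merge dictionaries, collecting the values seen for each key in a set."""
    merged = {}
    for dictionary in dicts:
        for key, value in dictionary.items():
            merged.setdefault(key, set()).add(value)
    return merged


merged = merge_dictionaries_sketch([{"a": 1, "b": 2}, {"a": 3}, {"b": 2}])
print(merged)
```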

mgkit.utils.dictionary.reverse_mapping(map_dict)[source]

Given a dictionary in the form: key->[v1, v2, .., vN], returns a dictionary in the form: v1->[key1, key2, .., keyN]

Parameters:
  • map_dict (dict) – dictionary to reverse
Return dict:

reversed dictionary
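The reversal can be sketched with a defaultdict. The name reverse_mapping_sketch is hypothetical, and storing the reversed values as sets is an assumption; the description above only shows them in list notation.

```python
from collections import defaultdict


def reverse_mapping_sketch(map_dict):
    """Map each value back to the keys whose iterable contained it."""
    reversed_map = defaultdict(set)
    for key, values in map_dict.items():
        for value in values:
            reversed_map[value].add(key)
    # Return a plain dict so that missing keys raise KeyError as usual
    return dict(reversed_map)


reversed_map = reverse_mapping_sketch({"key1": ["v1", "v2"], "key2": ["v2"]})
print(reversed_map)
```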
mgkit.utils.dictionary.split_dictionary_by_value(value_dict, threshold, aggr_func=<function median>, key_filter=None)[source]

Splits a dictionary, whose values are iterables, into two dictionaries based on a threshold:

  • one in which the result of aggr_func is lower than the threshold (first)
  • one in which the result of aggr_func is equal to or greater than the threshold (second)
Parameters:
  • value_dict (dict) – dictionary to be split
  • threshold (number) – must be comparable to the result of aggr_func
  • aggr_func (func) – function used to aggregate the dictionary values
  • key_filter (iterable) – if specified, only these keys will be in the resulting dictionaries
Returns:

two dictionaries
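A sketch of the split under stated assumptions: statistics.median stands in for the default aggr_func so the example stays within the standard library, every key in key_filter is assumed to exist in value_dict, and the name split_by_threshold_sketch is hypothetical.

```python
from statistics import median


def split_by_threshold_sketch(value_dict, threshold, aggr_func=median,
                              key_filter=None):
    """Return (lower, higher) dictionaries split on aggr_func(values)."""
    # Assumption: all keys in key_filter are present in value_dict
    keys = value_dict.keys() if key_filter is None else key_filter
    lower, higher = {}, {}
    for key in keys:
        values = value_dict[key]
        # Values equal to the threshold go to the second dictionary
        target = lower if aggr_func(values) < threshold else higher
        target[key] = values
    return lower, higher


data = {"a": [1, 2, 3], "b": [5, 6, 7], "c": [3, 3, 3]}
lower, higher = split_by_threshold_sketch(data, 3)
print(lower, higher)
```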

mgkit.utils.dictionary.text_to_dict(stream, skip_lines=0, sep='\t', key_index=0, value_index=1, key_func=<class 'str'>, value_func=<class 'str'>, encoding=None)[source]

New in version 0.4.4.

Reads a dictionary from a table file. The passed file is assumed to be opened as text; if it is opened as binary, you need to pass the encoding (e.g. ascii). The file may have multiple columns, so the key and value columns can be chosen with key_index and value_index, respectively.

Parameters:
  • stream (file) – stream that can be read as a file
  • skip_lines (int) – number of lines to skip at the start of the file
  • sep (str) – column separator to use
  • key_index (int) – zero-based column number of keys
  • value_index (int) – zero-based column number of values
  • key_func (func) – function to apply to the keys (defaults to str)
  • value_func (func) – function to apply to the values (defaults to str)
  • encoding (None, str) – if None is passed, the file is assumed to be opened in text mode, otherwise the encoding of the file must be passed
Yields:

tuple – the keys and values that can be passed to dict
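The parsing described above can be sketched as a generator. This is an illustration rather than mgkit's implementation: the name text_to_pairs_sketch is hypothetical, and skipping blank lines is an assumption.

```python
import io


def text_to_pairs_sketch(stream, skip_lines=0, sep="\t", key_index=0,
                         value_index=1, key_func=str, value_func=str,
                         encoding=None):
    """Yield (key, value) tuples read from a delimited text table."""
    for index, line in enumerate(stream):
        if index < skip_lines:
            continue
        if encoding is not None:
            # The stream was opened as binary, so each line is bytes
            line = line.decode(encoding)
        line = line.rstrip("\r\n")
        if not line:
            # Assumption: blank lines carry no data and are skipped
            continue
        fields = line.split(sep)
        yield key_func(fields[key_index]), value_func(fields[value_index])


# Text stream with a header line and an extra third column
table = io.StringIO("gene\tcount\tnote\nK00001\t3\tx\nK00002\t7\ty\n")
counts = dict(text_to_pairs_sketch(table, skip_lines=1, value_func=int))

# Binary stream: the encoding must be passed
raw = io.BytesIO(b"a\t1\n")
decoded = dict(text_to_pairs_sketch(raw, encoding="ascii"))
```

Because the function yields tuples, the result can be passed straight to dict, as in the two calls above.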