mgkit.counts.func module

New in version 0.1.13.

Misc functions for count data

mgkit.counts.func.batch_load_htseq_counts(count_files, samples=None, cut_name=None)

Loads a list of htseq count result files and returns a DataFrame (IDxSAMPLE)

The sample names are the file names if samples and cut_name are both None. Supplying a list of sample names with samples is the preferred way; cut_name is kept for backward compatibility and as an option for cases where a simple string replacement is enough.

Parameters:
  • count_files (iterable of file or str) – file handles or file names of the count files
  • samples (iterable) – list of sample names, in the same order as count_files
  • cut_name (str) – string to delete from the file names to get the sample names
Returns:

with sample names as columns and gene_ids as index

Return type:

pandas.DataFrame
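
The naming rule described above can be sketched in plain Python. This is not the library's code: derive_sample_names is a hypothetical helper, it assumes count_files holds file names, that cut_name is removed with a plain string replace, and that samples takes priority when both options are given.

```python
def derive_sample_names(count_files, samples=None, cut_name=None):
    """Hypothetical helper mirroring the documented naming rule."""
    if samples is not None:
        # Preferred: explicit names, in the same order as count_files
        return list(samples)
    if cut_name is not None:
        # Fallback: delete a fixed substring from each file name
        return [name.replace(cut_name, '') for name in count_files]
    # Default: the file names themselves
    return list(count_files)


files = ['sample1-counts.txt', 'sample2-counts.txt']
by_default = derive_sample_names(files)
by_cut = derive_sample_names(files, cut_name='-counts.txt')
by_list = derive_sample_names(files, samples=['S1', 'S2'])
```

The resulting names would become the columns of the returned DataFrame.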

mgkit.counts.func.filter_counts(counts_iter, info_func, gfilters=None, tfilters=None)

Yields the counts for which the gene_id and taxon_id associated with each uid pass the filters.

Parameters:
  • counts_iter (iterable) – iterator that yields a tuple (uid, count)
  • info_func (func) – function accepting a uid that returns a tuple (gene_id, taxon_id)
  • gfilters (iterable) – list of filters to apply to the gene_id associated with each uid
  • tfilters (iterable) – list of filters to apply to the taxon_id associated with each uid
Yields:

tuple(uid, count) that pass filters
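
A minimal sketch of this filtering pattern follows. It is an illustrative stand-in, not the mgkit implementation: filter_counts_sketch and the sample data are hypothetical, and it assumes each filter is a callable that returns True to keep an ID and that every filter must pass.

```python
def filter_counts_sketch(counts_iter, info_func, gfilters=None, tfilters=None):
    """Illustrative stand-in, not the mgkit implementation."""
    for uid, count in counts_iter:
        gene_id, taxon_id = info_func(uid)
        # Assumption: each filter is a callable returning True to keep the ID
        if gfilters is not None and not all(f(gene_id) for f in gfilters):
            continue
        if tfilters is not None and not all(f(taxon_id) for f in tfilters):
            continue
        yield uid, count


# Hypothetical data: uid -> (gene_id, taxon_id)
info = {'u1': ('K00001', 2), 'u2': ('K00002', 2157), 'u3': ('K00001', 2157)}
counts = [('u1', 10), ('u2', 5), ('u3', 7)]

kept = list(
    filter_counts_sketch(
        counts,
        info.get,
        gfilters=[lambda gene_id: gene_id == 'K00001'],
        tfilters=[lambda taxon_id: taxon_id == 2157],
    )
)
```

Only u3 satisfies both the gene and the taxon filter, so it is the only pair left in kept.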

mgkit.counts.func.from_gff(annotations, samples, ann_func=None, sample_func=None)

New in version 0.3.1.

Loads count data from a GFF file, only for the requested samples. By default the function returns a DataFrame where the index is the uid of each annotation and the columns the requested samples.

This can be customised by supplying ann_func and sample_func. sample_func is a function that accepts a sample name and is expected to return a string or a tuple; its return value is used to change the columns in the DataFrame. ann_func must accept an mgkit.io.gff.Annotation instance and return an iterable, with each iteration yielding either a single element or a tuple (for a MultiIndex DataFrame). The count of that annotation is added to each element yielded.

Parameters:
  • annotations (iterable) – iterable yielding annotations
  • samples (iterable) – list of samples to keep
  • ann_func (func) – function used to customise the output
  • sample_func (func) – function to customise the column elements
Returns:

dataframe with the count data, columns are the samples and rows the annotation counts (unless mapped with ann_func)

Return type:

DataFrame

Examples:

Assuming we have a list of annotations and the samples SAMPLE1 and SAMPLE2, we can obtain the count table for all annotations with:

>>> from_gff(annotations, ['SAMPLE1', 'SAMPLE2'])

Assuming we want to group the samples, for example treatment1, treatment2 and control1, control2 into a MultiIndex DataFrame column

>>> sample_func = lambda x: ('T' if x.startswith('t') else 'C', x)
>>> from_gff(annotations, ['treatment1', 'treatment2', 'control1', 'control2'], sample_func=sample_func)

Annotations can be mapped to other levels. For example, instead of the uid, which is the default, an annotation can be mapped to the gene_id and taxon_id information it includes, resulting in a MultiIndex for the rows, with (gene_id, taxon_id) as key.

>>> ann_func = lambda x: [(x.gene_id, x.taxon_id)]
>>> from_gff(annotations, ['SAMPLE1', 'SAMPLE2'], ann_func=ann_func)

mgkit.counts.func.get_uid_info(info_dict, uid)

Simple function to get a value from a dictionary of tuples (gene_id, taxon_id)

mgkit.counts.func.get_uid_info_ann(annotations, uid)

Simple function to get a value from a dictionary of annotations
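
Several functions in this module take an info_func that accepts only a uid. One plausible way to obtain such a callable from a two-argument lookup is functools.partial. The sketch below uses get_uid_info_sketch, a hypothetical stand-in with the documented signature that is assumed to do a plain dictionary lookup; the data is made up.

```python
from functools import partial


def get_uid_info_sketch(info_dict, uid):
    # Stand-in with the documented signature: look up uid in a dictionary
    # whose values are (gene_id, taxon_id) tuples
    return info_dict[uid]


info_dict = {'u1': ('K00001', 2), 'u2': ('K00002', 2157)}

# Bind the dictionary so the result accepts a uid as its sole argument
info_func = partial(get_uid_info_sketch, info_dict)
gene_id, taxon_id = info_func('u2')
```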

mgkit.counts.func.load_counts_from_gff(annotations, elem_func=<function <lambda>>, sample_func=None, nozero=True)

New in version 0.2.5.

Loads the counts of each annotation, which are stored in the annotation's counts_ attributes. Annotations with a total of 0 counts are skipped by default (nozero=True). The row index is set to the uid of the annotation and the column to the sample name. The functions used to transform the indices expect the annotation (for the row, elem_func) and the sample name (for the column, sample_func).

Parameters:
  • annotations (iter) – iterable of annotations
  • elem_func (func) – function that accepts an annotation and returns a str/int for an Index or a tuple for a MultiIndex, defaults to returning the uid of the annotation
  • sample_func (func, None) – function that accepts the sample name and returns a tuple for a MultiIndex. Defaults to None so no transformation is performed
  • nozero (bool) – if True, annotations with no counts are skipped

mgkit.counts.func.load_deseq2_results(file_name, taxon_id=None)

New in version 0.1.14.

Reads a CSV file output with DESeq2 results, adding a taxon_id to the index for concatenating multiple results from different taxonomic groups.

Parameters:
  • file_name (str) – file name of the CSV
Returns:

a MultiIndex DataFrame with the results

Return type:

pandas.DataFrame

mgkit.counts.func.load_htseq_counts(file_handle, conv_func=<type 'int'>)

Changed in version 0.1.15: added conv_func parameter

Loads an HTSeq-count result file

Parameters:
  • file_handle (file or str) – file handle or string with file name
  • conv_func (func) – function to convert the number from string, defaults to int, but float can be used as well
Yields:

tuple – first element is the gene_id and the second is the count
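
The effect of conv_func can be illustrated with a small stand-alone parser. This is a sketch under stated assumptions, not the library's code: parse_htseq_counts is a hypothetical name, the input is assumed to be tab-separated gene_id and count pairs, and skipping the htseq-count summary rows (names starting with a double underscore) is a choice of this sketch that the documentation above does not confirm.

```python
import io


def parse_htseq_counts(handle, conv_func=int):
    """Illustrative parser for tab-separated 'gene_id<TAB>count' lines."""
    for line in handle:
        line = line.strip()
        if not line:
            continue
        gene_id, count = line.split('\t')
        # Assumption of this sketch: skip htseq-count's summary rows,
        # whose names start with a double underscore (e.g. __no_feature)
        if gene_id.startswith('__'):
            continue
        yield gene_id, conv_func(count)


example = io.StringIO('geneA\t12\ngeneB\t0\n__no_feature\t30\n')
as_int = list(parse_htseq_counts(example))

example.seek(0)
as_float = list(parse_htseq_counts(example, conv_func=float))
```

Passing float instead of int is useful when the counts were normalised or otherwise transformed before being written.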

mgkit.counts.func.load_sample_counts(info_dict, counts_iter, taxonomy, inc_anc=None, rank=None, gene_map=None, ex_anc=None, include_higher=True, cached=True, uid_used=None)

Changed in version 0.1.14: added cached argument

Changed in version 0.1.15: added uid_used parameter

Changed in version 0.2.0: info_dict can be a function

Reads sample counts, filtering and mapping them if requested. It’s an example of the usage of the above functions.

Parameters:
  • info_dict (dict) – dictionary that has uid as key and (gene_id, taxon_id) as value. Alternatively, a function that accepts a uid as sole argument and returns (gene_id, taxon_id)
  • counts_iter (iterable) – iterable that yields a (uid, count)
  • taxonomy – taxonomy instance
  • inc_anc (int, list) – ancestor taxa to include
  • rank (str) – rank to which the counts are mapped
  • gene_map (dict) – dictionary with the gene mappings
  • ex_anc (int, list) – ancestor taxa to exclude
  • include_higher (bool) – if False, any rank different than the requested one is discarded
  • cached (bool) – if True, the function will use mgkit.simple_cache.memoize to cache some of the functions used
  • uid_used (None, dict) – an empty dictionary in which to store the uid that were assigned to each key of the returned pandas.Series. If None, no information is saved
Returns:

array with MultiIndex (gene_id, taxon_id) with the filtered and mapped counts

Return type:

pandas.Series

mgkit.counts.func.load_sample_counts_to_genes(info_func, counts_iter, taxonomy, inc_anc=None, gene_map=None, ex_anc=None, cached=True, uid_used=None)

New in version 0.1.14.

Changed in version 0.1.15: added uid_used parameter

Reads sample counts, filtering and mapping them if requested. It’s a variation of load_sample_counts(), with the counts being mapped only to each specific gene_id. Another difference is that no assumption is made about the first parameter, other than that it is a callable expected to return a (gene_id, taxon_id) tuple.

Parameters:
  • info_func (callable) – any callable that accepts a uid as the only parameter and returns (gene_id, taxon_id) as value
  • counts_iter (iterable) – iterable that yields a (uid, count)
  • taxonomy – taxonomy instance
  • inc_anc (int, list) – ancestor taxa to include
  • rank (str) – rank to which the counts are mapped
  • gene_map (dict) – dictionary with the gene mappings
  • ex_anc (int, list) – ancestor taxa to exclude
  • cached (bool) – if True, the function will use mgkit.simple_cache.memoize to cache some of the functions used
  • uid_used (None, dict) – an empty dictionary in which to store the uid that were assigned to each key of the returned pandas.Series. If None, no information is saved
Returns:

array with Index gene_id with the filtered and mapped counts

Return type:

pandas.Series

mgkit.counts.func.load_sample_counts_to_taxon(info_func, counts_iter, taxonomy, inc_anc=None, rank=None, ex_anc=None, include_higher=True, cached=True, uid_used=None)

New in version 0.1.14.

Changed in version 0.1.15: added uid_used parameter

Reads sample counts, filtering and mapping them if requested. It’s a variation of load_sample_counts(), with the counts being mapped only to each specific taxon. Another difference is that no assumption is made about the first parameter, other than that it is a callable expected to return a (gene_id, taxon_id) tuple.

Parameters:
  • info_func (callable) – any callable that accepts a uid as the only parameter and returns (gene_id, taxon_id) as value
  • counts_iter (iterable) – iterable that yields a (uid, count)
  • taxonomy – taxonomy instance
  • inc_anc (int, list) – ancestor taxa to include
  • rank (str) – rank to which the counts are mapped
  • ex_anc (int, list) – ancestor taxa to exclude
  • include_higher (bool) – if False, any rank different than the requested one is discarded
  • cached (bool) – if True, the function will use mgkit.simple_cache.memoize to cache some of the functions used
  • uid_used (None, dict) – an empty dictionary in which to store the uid that were assigned to each key of the returned pandas.Series. If None, no information is saved
Returns:

array with Index taxon_id with the filtered and mapped counts

Return type:

pandas.Series

mgkit.counts.func.map_counts(counts_iter, info_func, gmapper=None, tmapper=None, index=None, uid_used=None)

Changed in version 0.1.14: added index parameter

Changed in version 0.1.15: added uid_used parameter

Maps counts according to the gmapper and tmapper functions. Each mapped gene ID count is the sum of all uid that have the same ID(s). The same is true for the taxa.

Parameters:
  • counts_iter (iterable) – iterator that yields a tuple (uid, count)
  • info_func (func) – function accepting a uid that returns a tuple (gene_id, taxon_id)
  • gmapper (func) – function that accepts a gene_id and returns a list of mapped IDs
  • tmapper (func) – function that accepts a taxon_id and returns a new taxon_id
  • index (None, str) – if None, the index of the Series is (gene_id, taxon_id); if a str, it can be either gene or taxon, to specify a single value
  • uid_used (None, dict) – an empty dictionary in which to store the uid that were assigned to each key of the returned pandas.Series. If None, no information is saved
Returns:

array with MultiIndex (gene_id, taxon_id) with the mapped counts

Return type:

pandas.Series
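
The summing behaviour described above can be sketched with a plain dictionary. This is an illustration only: map_counts_sketch is a hypothetical stand-in that returns a dict instead of a pandas.Series, omits the index and uid_used parameters, and assumes that an ID is left unchanged when its mapper is None. All data is made up.

```python
from collections import defaultdict


def map_counts_sketch(counts_iter, info_func, gmapper=None, tmapper=None):
    """Illustrative stand-in returning a dict instead of a pandas.Series."""
    mapped = defaultdict(int)
    for uid, count in counts_iter:
        gene_id, taxon_id = info_func(uid)
        # gmapper returns a list of mapped IDs; without it keep the gene_id
        gene_ids = gmapper(gene_id) if gmapper is not None else [gene_id]
        # tmapper returns a single new taxon_id
        if tmapper is not None:
            taxon_id = tmapper(taxon_id)
        for mapped_id in gene_ids:
            # Counts from every uid sharing the same key are summed
            mapped[(mapped_id, taxon_id)] += count
    return dict(mapped)


# Hypothetical data: uid -> (gene_id, taxon_id)
info = {'u1': ('K1', 10), 'u2': ('K2', 11), 'u3': ('K1', 11)}
gene_map = {'K1': ['ko00010'], 'K2': ['ko00010', 'ko00020']}
parent_taxon = {10: 1, 11: 1}
counts = [('u1', 3), ('u2', 4), ('u3', 5)]

result = map_counts_sketch(
    counts,
    info.get,
    gmapper=lambda gene_id: gene_map.get(gene_id, []),
    tmapper=parent_taxon.get,
)
```

Because K2 maps to two categories, its count of 4 contributes to both keys, while all three uid contribute to ('ko00010', 1).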

mgkit.counts.func.map_counts_to_category(counts, gene_map, nomap=False, nomap_id='NOMAP')

Used to map the counts from a certain gene identifier to another. Genes with no mappings are not counted, unless nomap=True, in which case they are counted as nomap_id.

Parameters:
  • counts (iterator) – an iterator that yields a tuple, with the first value being the gene_id and the second value the count for it
  • gene_map (dictionary) – a dictionary whose keys are the gene_id yielded by counts and whose values are iterables of mapping identifiers
  • nomap (bool) – if False, counts for genes with no mappings in gene_map are discarded; if True, they are counted as nomap_id
  • nomap_id (str) – name of the mapping for genes with no mappings
Returns:

mapped counts

Return type:

pandas.Series
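
The role of nomap and nomap_id can be shown with a short sketch. It is not the mgkit implementation: map_to_category_sketch is a hypothetical name, it returns a dict rather than a pandas.Series, and it assumes that a gene missing from gene_map and a gene with an empty mapping list are both treated as having no mappings.

```python
from collections import defaultdict


def map_to_category_sketch(counts, gene_map, nomap=False, nomap_id='NOMAP'):
    """Illustrative stand-in returning a dict instead of a pandas.Series."""
    totals = defaultdict(int)
    for gene_id, count in counts:
        mappings = gene_map.get(gene_id)
        if not mappings:
            # No mapping: count under nomap_id only when requested
            if nomap:
                totals[nomap_id] += count
            continue
        for map_id in mappings:
            totals[map_id] += count
    return dict(totals)


# Hypothetical data: K9 has no entry in gene_map
counts = [('K1', 3), ('K2', 4), ('K9', 2)]
gene_map = {'K1': ['C'], 'K2': ['C', 'E']}

discarded = map_to_category_sketch(counts, gene_map)
kept = map_to_category_sketch(counts, gene_map, nomap=True)
```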

mgkit.counts.func.map_gene_id_to_map(gene_map, gene_id)

Function that extracts a list of gene mappings from a dictionary and returns an empty list if the gene_id is not found.

mgkit.counts.func.map_taxon_id_to_rank(taxonomy, rank, taxon_id, include_higher=True)

Maps a taxon_id to the requested taxon rank. Returns None if include_higher is False and the found rank is not the one requested.

Internally uses mgkit.taxon.UniprotTaxonomy.get_ranked_taxon()

Parameters:
  • taxonomy – taxonomy instance
  • rank (str) – taxonomic rank requested
  • taxon_id (int) – taxon_id to map
  • include_higher (bool) – if False, any rank different than the requested one is discarded
Returns:

if the mapping is successful, the ranked taxon_id is returned, otherwise None is returned

Return type:

(int, None)
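
The effect of include_higher can be sketched on a toy taxonomy. Everything here is hypothetical: the taxonomy is a plain dictionary rather than a taxonomy instance, and toy_ranked_taxon only guesses at how the ranked lookup behaves, assuming it climbs the lineage and falls back to the starting taxon when no ancestor has the requested rank.

```python
# Toy taxonomy (hypothetical): taxon_id -> (rank, parent_id)
TOY_TAXONOMY = {
    1: ('phylum', None),
    2: ('genus', 1),
    3: ('species', 2),
}


def toy_ranked_taxon(taxonomy, taxon_id, rank):
    # Assumption: climb the lineage until the rank matches; if no ancestor
    # has that rank, fall back to the starting taxon
    current = taxon_id
    while current is not None:
        current_rank, parent_id = taxonomy[current]
        if current_rank == rank:
            return current
        current = parent_id
    return taxon_id


def map_taxon_id_to_rank_sketch(taxonomy, rank, taxon_id, include_higher=True):
    ranked_id = toy_ranked_taxon(taxonomy, taxon_id, rank)
    found_rank = taxonomy[ranked_id][0]
    # Discard the result when the found rank is not the requested one
    if not include_higher and found_rank != rank:
        return None
    return ranked_id


species_to_genus = map_taxon_id_to_rank_sketch(TOY_TAXONOMY, 'genus', 3)
phylum_kept = map_taxon_id_to_rank_sketch(TOY_TAXONOMY, 'genus', 1)
phylum_dropped = map_taxon_id_to_rank_sketch(
    TOY_TAXONOMY, 'genus', 1, include_higher=False
)
```

In this sketch the species resolves to its genus, while the phylum-level taxon is either kept as it is or replaced by None, depending on include_higher.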