mgkit.align module

Module dealing with BAM/SAM files

class mgkit.align.SamtoolsDepth(file_handle, num_seqs=10000)

Bases: future.types.newobject.newobject

New in version 0.3.0.

A class used to cache the results of read_samtools_depth(), while reading only the necessary data from a `samtools depth -aa` file.

data = None
file_handle = None
region_coverage(seq_id, start, end)

Returns the mean coverage of a region. The start and end parameters are expected to be 1-based coordinates, like the corresponding attributes in mgkit.io.gff.Annotation or mgkit.io.gff.GenomicRange.

If the sequence for which the coverage is requested is not found, the depth file is read (and cached) until it is found.

Parameters:
  • seq_id (str) – sequence for which to return mean coverage
  • start (int) – start of the region
  • end (int) – end of the region
Returns:

mean coverage of the requested region

Return type:

float
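
The coordinate handling described above can be illustrated with a minimal sketch. It is not the class's implementation: the helper name mean_region_coverage and the plain list used for the depth values are hypothetical, and the end coordinate is assumed to be inclusive, as in GFF.

```python
def mean_region_coverage(depth, start, end):
    """Mean depth over the 1-based interval [start, end] (end assumed inclusive)."""
    # Position `start` is stored at index start - 1. Slicing excludes the end
    # index, so `end` needs no adjustment to keep the interval inclusive.
    region = depth[start - 1:end]
    if not region:
        return 0.0
    return sum(region) / float(len(region))


# Hypothetical per-base depth for a 6 bp sequence
depth = [0, 2, 4, 6, 8, 10]
# Positions 2..4 hold the values 2, 4 and 6
print(mean_region_coverage(depth, 2, 4))
```

Passing 0-based coordinates to a function written this way would shift the region by one base, which is why the 1-based convention matters.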

mgkit.align.add_coverage_info(annotations, bam_files, samples, attr_suff='_cov')

Changed in version 0.3.4: the coverage is now returned as float instead of int

Adds coverage information to annotations, using BAM files.

The coverage information is added for each sample as a ‘sample_cov’ attribute, and the total coverage as a ‘cov’ attribute, in the annotations.

Note

The bam_files and samples arguments must have the same order

Parameters:
  • annotations (iterable) – iterable of annotations
  • bam_files (iterable) – iterable of pysam.Samfile instances
  • samples (iterable) – names of the samples for the BAM files
mgkit.align.covered_annotation_bp(files, annotations, min_cov=1, progress=False)

New in version 0.1.14.

Returns the number of base pairs of each annotation that are covered, across multiple samples.

Parameters:
  • files (iterable) – an iterable that returns the alignment file names
  • annotations (iterable) – an iterable that returns annotations
  • min_cov (int) – minimum coverage for a base to be counted
  • progress (bool) – if True, a progress bar is used
Returns:

a dictionary whose keys are the annotation uids and whose values are the number of bases that are covered by reads among all samples

Return type:

dict
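
The counting for a single annotation can be sketched as follows. This is an illustration, not the function's code: the helper covered_bases is hypothetical, the depth values are plain lists rather than data read from alignment files, and a base is assumed to count once if at least one sample reaches min_cov at that position. That reading of "among all samples" is an assumption.

```python
def covered_bases(per_sample_depths, min_cov=1):
    """Count positions whose depth reaches min_cov in at least one sample."""
    covered = set()
    for depth in per_sample_depths:
        # Collect the indices of the positions that pass the threshold
        covered.update(
            index for index, value in enumerate(depth) if value >= min_cov
        )
    return len(covered)


# Hypothetical depth of one 4 bp annotation in two samples
sample_a = [0, 1, 0, 0]
sample_b = [0, 0, 2, 0]
print(covered_bases([sample_a, sample_b]))
print(covered_bases([sample_a, sample_b], min_cov=2))
```

Using a set means a position covered in several samples is counted only once.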

mgkit.align.get_region_coverage(bam_file, seq_id, feat_from, feat_to)

Return coverage for an annotation.

Note

feat_from and feat_to are 1-based indexes

Parameters:
  • bam_file (Samfile) – instance of pysam.Samfile
  • seq_id (str) – sequence id
  • feat_from (int) – start position of feature
  • feat_to (int) – end position of feature
Returns:

coverage array for the annotation

Return type:

int

mgkit.align.read_samtools_depth(file_handle, num_seqs=10000)

Changed in version 0.3.4: num_seqs can be None to avoid a log message

New in version 0.3.0.

Reads a samtools depth file, returning a generator that yields the array of per-base coverage on a per-sequence basis.

Note

The position column is not used, so that the coverage can be stored in a numpy array and memory is saved. For this reason, samtools depth should be called with the -aa option:

`samtools depth -aa bamfile`

This option outputs both base positions with 0 coverage and sequences with no aligned reads

Parameters:
  • file_handle (file) – file handle of the coverage file
  • num_seqs (int or None) – number of sequences read that triggers a log message. If None, no message is triggered
Yields:

tuple – the first element is the sequence identifier and the second one is the numpy array with the coverage at each position
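
The per-sequence grouping can be sketched with the standard library alone. This is a simplified stand-in, not the function's code: the name iter_depth_by_sequence is hypothetical, plain lists replace numpy arrays, and the input is assumed to be the three tab-separated columns (sequence, position, depth) that samtools depth writes for a single BAM file.

```python
import io
from itertools import groupby


def iter_depth_by_sequence(file_handle):
    """Yield (seq_id, depths) for each run of lines sharing a sequence name."""
    rows = (
        line.rstrip("\n").split("\t")
        for line in file_handle
        if line.strip()
    )
    # samtools depth writes all positions of a sequence consecutively, so
    # grouping adjacent rows by the first column yields one group per sequence.
    for seq_id, group in groupby(rows, key=lambda row: row[0]):
        # The position column (row[1]) is ignored: with -aa every position is
        # present, so the list index alone identifies the base.
        yield seq_id, [int(row[2]) for row in group]


# Hypothetical output of `samtools depth -aa` for two short sequences
example = io.StringIO(
    "contig1\t1\t0\n"
    "contig1\t2\t3\n"
    "contig1\t3\t5\n"
    "contig2\t1\t0\n"
    "contig2\t2\t0\n"
)
result = list(iter_depth_by_sequence(example))
print(result)
```

Without -aa, positions with 0 coverage would be missing and the list index would no longer correspond to the base position.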