mgkit.align module

Module dealing with BAM/SAM files

class mgkit.align.SamtoolsDepth(file_handle, num_seqs=10000, max_size=1000000, max_size_dict=None, calc_density=False, dtype='uint32')[source]

Bases: object

Changed in version 0.4.2: several optimisations and changes to support a scanning approach, instead of lookup table. No exception is raised when a sequence is not found in the file, instead assuming that the coverage is 0

Changed in version 0.4.0: uses pandas.SparseArray now. It should use less memory, but needs pandas version > 0.24

New in version 0.3.0.

This a class that helps use the results from read_samtools_depth().

There are 2 modes of operations:

  1. Request a region coverage via SamtoolsDepth.region_coverage()
  2. Advance the samtools depth file until a sequence coverage is read

The method 1) was the default in MGKit < 0.4.2, when the user would request a region coverage and this class would advance the reading until the requested region is found. While scanning, all the sequences encountered are kept as pandas Series with SparseArray declared, reindexed from 0 to the max_size_dict values if provided, or max_size. The advantage is that the it’s easier to use, but with bigger datasets will keep high memory usage. This is the case for case 2) which involves calling directly SamtoolsDepth.advance_file() returning the sequence ID read. Then the region coverage can be requested the same way as before, but won’t involve scanning the file.

The use of a SparseArray in a pandas Series allows the use of samtools depth files that weren’t produced with samtools depth -aa and save memory.

Note

The amount of memory used is fairly high, especially with the increasing number of sequences in a GFF/Depth file. It is recoommended to use max_size_dict to 1) only creates arrays for sequences needed and 2) use arrays of smaller size

Warning

In MGKit 0.4.2, the internal dictionary to keep the SparseArray(s) is a weakref.WeakValueDictionary, to improve the release of memory. That on Linux seems to release the array too fast. Upgrade to MGKit version 0.4.3 if this class is needed

advance_file()[source]

New in version 0.4.2.

Requests the scan of the file and returns the sequence found, the coverage can then be retrivied with SamtoolsDepth.region_coverage()

Returns:the return value is None if the file is fully read, otherwise the the sequnece ID found is returned
Return type:(str, None)
closed = False
data = None
density = None
drop_sequence(seq_id)[source]

New in version 0.4.2.

Remove the sequence passed from the internal dictionary

file_handle = None
max_size = None
max_size_dict = None
not_found = None
region_coverage(seq_id, start, end)[source]

Changed in version 0.4.2: now using SamtoolsDepth.advance_file() to scan the file

Returns the mean coverage of a region. The start and end parameters are expected to be 1-based coordinates, like the correspondent attributes in mgkit.io.gff.Annotation or mgkit.io.gff.GenomicRange.

If the sequence for which the coverage is requested is not found, the depth file is read (and cached) until it is found.

Parameters:
  • seq_id (str) – sequence for which to return mean coverage
  • start (int) – start of the region
  • end (int) – end of the region
Returns:

mean coverage of the requested region

Return type:

float

mgkit.align.add_coverage_info(annotations, bam_files, samples, attr_suff='_cov')[source]

Changed in version 0.3.4: the coverage now is returned as floats instead of int

Adds coverage information to annotations, using BAM files.

The coverage information is added for each sample as a ‘sample_cov’ and the total coverage as as ‘cov’ attribute in the annotations.

Note

The bam_files and sample variables must have the same order

Parameters:
  • annotations (iterable) – iterable of annotations
  • bam_files (iterable) – iterable of pysam.Samfile instances
  • sample (iterable) – names of the samples for the BAM files
mgkit.align.covered_annotation_bp(files, annotations, min_cov=1, progress=False)[source]

New in version 0.1.14.

Returns the number of base pairs covered of annotations over multiple samples.

Parameters:
  • files (iterable) – an iterable that returns the alignment file names
  • annotations (iterable) – an iterable that returns annotations
  • min_cov (int) – minumum coverage for a base to counted
  • progress (bool) – if True, a progress bar is used
Returns:

a dictionary whose keys are the uid and the values the number of bases that are covered by reads among all samples

Return type:

dict

mgkit.align.get_region_coverage(bam_file, seq_id, feat_from, feat_to)[source]

Return coverage for an annotation.

Note

feat_from and feat_to are 1-based indexes

Parameters:
  • bam_file (Samfile) – instance of pysam.Samfile
  • seq_id (str) – sequence id
  • feat_from (int) – start position of feature
  • feat_to (int) – end position of feature
Return int:

coverage array for the annotation

mgkit.align.read_samtools_depth(file_handle, num_seqs=10000, seq_ids=None)[source]

Changed in version 0.4.2: the function returns lists instead of numpy arrays for speed (at least in my tests it seems ~4x increase)

Changed in version 0.4.0: now returns 3 array, instead of 2. Also added seq_ids to skip lines

Changed in version 0.3.4: num_seqs can be None to avoid a log message

New in version 0.3.0.

Reads a samtools depth file, returning a generator that yields the array of each base coverage on a per-sequence base.

Note

There’s no need anymore to use samtools depth -aa, because the function returns the position array and this can be used to create a Pandas SparseArray which can be reindexed to include missing positions (with values of 0)

Valid for version < 0.4.0:

The information on position is not used, to use numpy and save memory. samtools depth should be called with the -aa option:

`samtools depth -aa bamfile`

This options will output both base position with 0 coverage and sequneces with no aligned reads

Parameters:
  • file_handle (file) – file handle of the coverage file
  • num_seqs (int or None) – number of sequence that fires a log message. If None, no message is triggered
  • seq_ids (dict, set) – a hashed container like a dictionary or set with the sequences to return
Yields:

tuple – the first element is the sequence identifier, the second one is the list with the positions, the third element is the list with the coverages