mgkit.workflow.sampling_utils module

New in version 0.3.1.

Resampling Utilities

sample command

This command samples from a Fasta or FastQ file, based on a probability defined by the user (0.001 or 1 / 1000 by default, -r parameter), for a maximum number of sequences (100,000 by default, -x parameter). By default 1 sample is extracted, but as many as desired can be taken, by using the -n parameter.

The input sequence file can be passed either on the standard input or as the last parameter on the command line. By default a Fasta file is expected, unless the -q parameter is passed.

The -p parameter specifies the prefix to be used for the output files, which can be gzipped using the -z parameter.
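The sampling rule described above (keep each sequence with a fixed probability, up to a maximum count) can be sketched in a few lines of Python. This is a minimal illustration, not the code used by the command: the function name `sample_records` and its arguments are hypothetical, and records are assumed to be simple `(header, sequence)` tuples.

```python
import random


def sample_records(records, prob=0.001, max_seq=100000, seed=None):
    """Yield each record with probability `prob`, stopping after `max_seq`.

    Hypothetical helper: `prob` and `max_seq` only mirror the meaning of
    the -r and -x parameters, they are not part of the mgkit API.
    """
    rng = random.Random(seed)
    kept = 0
    for record in records:
        if kept >= max_seq:
            break
        # rng.random() is uniform in [0, 1), so this keeps ~prob of records
        if rng.random() < prob:
            kept += 1
            yield record


# Usage: sample roughly 10% of 10,000 toy records, capped at 50
records = [("seq%d" % i, "ACGT") for i in range(10000)]
sampled = list(sample_records(records, prob=0.1, max_seq=50, seed=1))
```

Because records are visited in file order and never reordered, the sampled output keeps the same ordering as the input.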

sample_stream command

It works in the same way as sample; however, the file is sampled only once and the output is written to stdout by default. This can be convenient if streams are a preferred way to sample the file.

sync command

Used to keep forward and reverse read files in sync in paired-end FASTQ. The scenario is that the sample command was used to resample a FASTQ file, usually the forward one, but the reverse is needed as well. In this case, the resampled file, called the master, is passed to the -m option and the input file is the file that is to be synced (reverse). The input file is scanned until the same header is found in the master file and, when that happens, the sequence is written. The next sequence is then read from the master file and the process is repeated until all sequences in the master file are found in the input file. This implies having the 2 files sorted in the same way, which is what the sample command does.


The old Casava format is not supported by this command at the moment, as it is unusual to find it in SRA or other repositories.
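The scan described for sync can be sketched as a single pass over both files. This is an illustrative sketch under stated assumptions, not the command's implementation: `read_id` and `sync_reads` are hypothetical names, records are assumed to be `(header, sequence)` tuples, and two headers are assumed to match when their first whitespace-separated token is equal (as in the newer Casava header style).

```python
def read_id(header):
    """Return the part of the header shared by both mates (assumption:
    the first whitespace-separated token identifies the read pair)."""
    return header.split()[0]


def sync_reads(master, mate):
    """Yield records from `mate` whose IDs match, in order, those in `master`.

    Both inputs must be sorted in the same way; the `mate` iterator is
    consumed only once and is never rewound.
    """
    mate_iter = iter(mate)
    for master_header, _ in master:
        wanted = read_id(master_header)
        # Resume scanning the mate file from where the last match stopped
        for header, seq in mate_iter:
            if read_id(header) == wanted:
                yield header, seq
                break


# Usage: the master holds 2 sampled forward reads, the mate file all 6 reverse reads
master = [("r2 1:N:0:A", "AAAA"), ("r5 1:N:0:A", "CCCC")]
mate = [("r%d 2:N:0:A" % i, "GGGG") for i in range(1, 7)]
synced = list(sync_reads(master, mate))
```

The single shared iterator is what makes the same-order requirement necessary: a master record that appears earlier in the mate file than the previous match can no longer be found.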

rand_seq command

Generate random FastA/Q sequences, allowing the specification of GC content and number of sequences being coding or random. If the output format chosen is FastQ, qualities are generated using a decreasing model with added noise. A constant model can be specified instead with a switch. Parameters such as GC content, length and the type of model can be inferred by passing a FastA/Q file, with the quality model fitted using a LOWESS (using mgkit.utils.sequence.extrapolate_model()). The noise in that case is modelled as a normal distribution fitted from the deviations of the qualities along the sequence from the fitted LOWESS, scaled back by half to avoid too drastic changes in the qualities. The qualities are also clipped at 40 to avoid compatibility problems with FastQ readers. If inferred, the model can be saved (as a pickle file) and loaded back for analysis.
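As a rough illustration of the idea of a decreasing quality model with added noise, the sketch below generates a random sequence with a target GC content and a matching list of qualities. It is not the rand_seq implementation: all function names and default values are hypothetical, the decrease is assumed to be linear, the noise is assumed to be Gaussian with a fixed standard deviation, and the lower clip at 0 is an added assumption (the text above only mentions the clip at 40).

```python
import random


def random_sequence(length, gc=0.5, rng=random):
    """Random DNA string where G and C together have probability `gc`."""
    weights = [(1 - gc) / 2, gc / 2, gc / 2, (1 - gc) / 2]  # A, C, G, T
    return "".join(rng.choices("ACGT", weights=weights, k=length))


def random_qualities(length, start=38.0, end=25.0, noise_sd=2.0,
                     max_qual=40, rng=random):
    """Qualities decreasing linearly from `start` to `end`, plus noise."""
    quals = []
    for pos in range(length):
        frac = pos / max(length - 1, 1)
        mean = start + (end - start) * frac
        value = round(rng.gauss(mean, noise_sd))
        # Clip to [0, max_qual]; the lower bound is an assumption
        quals.append(min(max(value, 0), max_qual))
    return quals


def to_fastq(name, seq, quals):
    """Format one record using the Phred+33 quality encoding."""
    qual_str = "".join(chr(q + 33) for q in quals)
    return "@{}\n{}\n+\n{}\n".format(name, seq, qual_str)


# Usage: one 100 bp record with 60% GC content
rng = random.Random(0)
seq = random_sequence(100, gc=0.6, rng=rng)
quals = random_qualities(len(seq), rng=rng)
record = to_fastq("random_1", seq, quals)
```

A constant model would correspond to setting `start` and `end` to the same value in this sketch.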


Changed in version 0.3.4: using click instead of argparse. Now rand_seq can save and reload models

Changed in version 0.3.3: added sync, sample_stream and rand_seq commands

mgkit.workflow.sampling_utils.compare_header(header1, header2, header_type=None)
mgkit.workflow.sampling_utils.infer_parameters(file_handle, fastq_bool)