sampling-utils - Resampling Utilities¶

Overview¶

New in version 0.3.1.

Resampling Utilities¶

sample command¶

This command samples sequences from a Fasta or FastQ file. Each sequence is picked with a probability defined by the user (-r parameter, 0.001 or 1 / 1000 by default), up to a maximum number of sequences (-x parameter, 100,000 by default). By default 1 sample is extracted, but as many as desired can be taken using the -n parameter.

The input sequence file can be passed either on the standard input or as the last parameter on the command line. By default a Fasta file is expected, unless the -q parameter is passed.

The -p parameter specifies the prefix used for the output file names, and the output files can be gzipped using the -z parameter.
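The sampling logic described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not MGKit's implementation: the function name `bernoulli_sample` and its arguments are hypothetical, records are plain (header, sequence) tuples, and each sample is taken with a separate pass over the records.

```python
import random


def bernoulli_sample(records, prob=0.001, max_seq=100000, rng=None):
    """Yield each record with probability `prob`, stopping after `max_seq`.

    Hypothetical helper: it mirrors the -r and -x parameters described
    above, but is not taken from MGKit.
    """
    rng = rng or random.Random()
    picked = 0
    for record in records:
        if picked >= max_seq:
            break
        if rng.random() < prob:
            picked += 1
            yield record


# Toy input: 10,000 (header, sequence) tuples
records = [("seq{}".format(i), "ACGT") for i in range(10000)]

# Two independent samples, similar to passing "-n 2"; the probability is
# raised to 0.01 so the toy input yields a visible number of sequences
samples = [
    list(bernoulli_sample(records, prob=0.01, max_seq=50, rng=random.Random(seed)))
    for seed in (1, 2)
]
for index, sample in enumerate(samples):
    print("sample {}: {} sequences".format(index, len(sample)))
```

Because records are picked in the order they are read, each sample keeps the ordering of the input file.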

sample_stream command¶

It works in the same way as sample, but the file is sampled only once and the output is written to the standard output by default. This can be convenient if working with streams is preferred.
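A single-pass, stream-oriented version of the same idea might look like the sketch below. The names `read_fasta` and `sample_stream` are hypothetical and the Fasta parser is deliberately minimal; in a real pipeline the input handle would be `sys.stdin`, while a `StringIO` stands in for it here.

```python
import io
import random
import sys


def read_fasta(handle):
    """Yield (header, sequence) tuples from a Fasta file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def sample_stream(in_handle, out_handle, prob, max_seq, rng):
    """Single pass: write each record with probability `prob`, up to `max_seq`.

    Returns the number of records written.
    """
    written = 0
    for header, seq in read_fasta(in_handle):
        if written >= max_seq:
            break
        if rng.random() < prob:
            out_handle.write(">{}\n{}\n".format(header, seq))
            written += 1
    return written


# A StringIO replaces the standard input for this example
fasta = io.StringIO(">a\nACGT\n>b\nGG\nCC\n>c\nTTTT\n")
count = sample_stream(fasta, sys.stdout, prob=1.0, max_seq=2, rng=random.Random(0))
```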

sync command¶

Used to keep the forward and reverse read files of a paired-end FASTQ in sync. The scenario is that the sample command was used to resample one FASTQ file, usually the forward, but the matching reverse reads are needed as well. In this case the resampled file, called the master, is passed to the -m option and the input file is the one that is to be synced (the reverse). The input file is scanned until the header of the current master sequence is found, and when that happens the sequence is written. The next sequence is then read from the master file and the process is repeated until all sequences in the master file are found in the input file. This requires the 2 files to be sorted in the same way, which is what the sample command does.

Note

The old Casava format is not supported by this command at the moment, as it is unusual to find it in SRA or other repositories.
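The scanning procedure used by sync can be illustrated with a short Python sketch. The function names are hypothetical, and the way headers are compared is an assumption made for this example: the read identifier is taken to be the part of the header before the first space, which is not necessarily how MGKit does it.

```python
def read_id(header):
    # Assumption: the text before the first space identifies the read pair
    return header.split(" ", 1)[0]


def sync_reads(master_headers, input_records):
    """Yield the records of `input_records` that match the master headers.

    Both inputs must be sorted in the same way; the input is scanned once
    and the master advances only when a match is found.
    """
    master = (read_id(header) for header in master_headers)
    wanted = next(master, None)
    for header, seq, qual in input_records:
        if wanted is None:
            # every master sequence was found, stop scanning
            break
        if read_id(header) == wanted:
            yield header, seq, qual
            wanted = next(master, None)


# Headers of a resampled forward file (the master)
forward_sampled = ["r2 1:N:0", "r5 1:N:0"]
# Full reverse file as (header, sequence, qualities) tuples
reverse_all = [("r{} 2:N:0".format(i), "ACGT", "IIII") for i in range(1, 7)]

synced = list(sync_reads(forward_sampled, reverse_all))
print([header for header, _, _ in synced])
```

If the two files were not sorted in the same way, a master header could be skipped during the scan and the sequences after it would never be matched, which is why the ordering requirement matters.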

rand_seq command¶

Generates random FastA/Q sequences, allowing the specification of the GC content and of the proportion of sequences that are coding rather than random. If the output format chosen is FastQ, qualities are generated using a decreasing model with added noise. A constant model can be specified instead with a switch. Parameters such as GC content, length and the type of model can be inferred by passing a FastA/Q file, with the quality model fitted using a LOWESS (using mgkit.utils.sequence.extrapolate_model()). The noise in that case is modelled as a normal distribution fitted to the deviations of the qualities along the sequence from the fitted LOWESS, scaled back by half to avoid too drastic changes in the qualities. The qualities are also clipped at 40 to avoid compatibility problems with FastQ readers. If inferred, the model can be saved (as a pickle file) and loaded back for analysis.
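To give an idea of what a decreasing quality model with noise produces, here is a minimal sketch that uses only the standard library. It is not the MGKit code: the function names, the linear decline (`slope`) and the noise standard deviation (`noise_sd`) are assumptions chosen for illustration, while the starting quality of 30, the length of 150, the GC content of 0.5 and the clipping at 40 follow the defaults and description given in this page.

```python
import random


def random_sequence(length, gc_content, rng):
    """Draw bases so that each one is G or C with probability `gc_content`."""
    return "".join(
        rng.choice("GC") if rng.random() < gc_content else rng.choice("AT")
        for _ in range(length)
    )


def decreasing_qualities(length, start, rng, slope=0.05, noise_sd=2.0, max_qual=40):
    """Phred scores that decline along the read, with Gaussian noise added.

    `slope` and `noise_sd` are illustrative assumptions; the scores are
    clipped to the range 0..max_qual.
    """
    quals = []
    for pos in range(length):
        value = start - slope * pos + rng.gauss(0, noise_sd)
        quals.append(int(round(min(max(value, 0), max_qual))))
    return quals


rng = random.Random(42)
seq = random_sequence(150, 0.5, rng)
quals = decreasing_qualities(150, 30.0, rng)

# FastQ record with Phred+33 encoded qualities
record = "@random_1\n{}\n+\n{}\n".format(seq, "".join(chr(q + 33) for q in quals))
print(record)
```

A constant model would correspond to `slope=0` in this sketch, leaving only the noise around the starting quality.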

Changes¶

Changed in version 0.3.4: using click instead of argparse. Now rand_seq can save and reload models.

Changed in version 0.3.3: added sync, sample_stream and rand_seq commands

Options¶

sampling-utils¶

Main function

sampling-utils [OPTIONS] COMMAND [ARGS]...

Options

--version¶: Show the version and exit.

--cite¶

rand_seq¶

Generates random FastA/Q sequences

sampling-utils rand_seq [OPTIONS] [OUTPUT_FILE]

Options

-v, --verbose¶

-n, --num-seqs <num_seqs>¶

Number of sequences to generate

Default:	1000

-gc, --gc-content <gc_content>¶

GC content (defaults to .5 out of 1)

Default:	0.5

-i, --infer-params <infer_params>¶: Infer parameters GC content and Quality model from file

-r, --coding-prop <coding_prop>¶

Proportion of coding sequences

Default:	0.0

-l, --length <length>¶

Sequence length

Default:	150

-d, --const-model¶: Use a model with constant qualities + noise

-x, --dist-loc <dist_loc>¶

Use as the starting point quality

Default:	30.0

-q, --fastq¶: The output file is a FastQ file

-m, --save-model <save_model>¶: Save inferred qualities model to a pickle file

-a, --read-model <read_model>¶: Load qualities model from a pickle file

--progress¶: Shows Progress Bar

Arguments

OUTPUT_FILE¶: Optional argument

sample¶

Sample a FastA/Q multiple times

sampling-utils sample [OPTIONS] [INPUT_FILE]

Options

-v, --verbose¶

-p, --prefix <prefix>¶

Prefix for the file name(s) in output

Default:	sample

-n, --number <number>¶

Number of samples to take

Default:	1

-r, --prob <prob>¶

Probability of picking a sequence

Default:	0.001

-x, --max-seq <max_seq>¶

Maximum number of sequences

Default:	100000

-q, --fastq¶: The input file is a fastq file

-z, --gzip¶: gzip output files

Arguments

INPUT_FILE¶: Optional argument

sample_stream¶

Samples a FastA/Q one time, alternative to sample if multiple sampling is not needed

sampling-utils sample_stream [OPTIONS] [INPUT_FILE] [OUTPUT_FILE]

Options

-v, --verbose¶

-r, --prob <prob>¶: Probability of picking a sequence

-x, --max-seq <max_seq>¶: Maximum number of sequences

-q, --fastq¶: The input file is a fastq file

Arguments

INPUT_FILE¶: Optional argument

OUTPUT_FILE¶: Optional argument

sync¶

Syncs a FastQ file generated with sample with the original pair of files.

sampling-utils sync [OPTIONS] [INPUT_FILE] [OUTPUT_FILE]

Options

-v, --verbose¶

-m, --master-file <master_file>¶: Required. Resampled FastQ file that is out of sync with the original pair

Arguments

INPUT_FILE¶: Optional argument

OUTPUT_FILE¶: Optional argument