mgkit.io.fastq module

Fastq utility functions

mgkit.io.fastq.check_fastq_type(qualities)

Tries to guess the type of quality string used in a Fastq file

Parameters:qualities (str) – string with the quality scores as in the Fastq file
Return str:a string with the guessed quality score encoding

Note

Possible values are the following, classified by the names usually used in other software:

  • ASCII33: sanger, illumina-1.8
  • ASCII64: illumina-1.3, illumina-1.5, solexa-old
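
One common way to make this kind of guess is to look at the lowest character code in the quality string. The sketch below is not the mgkit implementation; the helper name `guess_quality_offset` and the cutoff are assumptions made for illustration. It relies on the fact that offset-64 encodings never use characters below code 59 (the old Solexa scale starts at -5).

```python
def guess_quality_offset(qualities):
    """Guess the ASCII offset of a Fastq quality string (hypothetical helper)."""
    if not qualities:
        raise ValueError("empty quality string")
    lowest = min(ord(char) for char in qualities)
    # Codes below 59 (';') cannot occur with an offset of 64, because the
    # lowest Solexa score (-5) maps to 64 - 5 = 59. They imply offset 33.
    if lowest < 59:
        return "ASCII33"
    # Ambiguous range: assumed to be offset 64 in this sketch.
    return "ASCII64"
```

A read whose ASCII33 qualities are all 26 or higher falls in the ambiguous range and would be reported as ASCII64 by this sketch, so a guess based on a single read should be treated with caution.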
mgkit.io.fastq.choose_header_type(seq_id)

Return the guessed compiled regular expression

Parameters:seq_id (str) – sequence header to test

Returns:compiled regular expression object or None if no match found
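
The selection logic can be sketched as trying each known pattern in turn and returning the first one that matches. The two regular expressions and the name `pick_header_pattern` below are written for this example only and are assumptions; the patterns mgkit actually compiles may differ.

```python
import re

# Hypothetical patterns for the two Illumina header styles (without '@').
OLD_HEADER = re.compile(
    r"^(?P<instrument>[\w-]+):(?P<lane>\d+):(?P<tile>\d+):(?P<x>\d+):(?P<y>\d+)"
    r"#(?P<index>\w+)/(?P<pair>[12])$"
)
NEW_HEADER = re.compile(
    r"^(?P<instrument>[\w-]+):(?P<run>\d+):(?P<flowcell>\w+):(?P<lane>\d+):"
    r"(?P<tile>\d+):(?P<x>\d+):(?P<y>\d+) "
    r"(?P<pair>[12]):(?P<filtered>[YN]):(?P<control>\d+):(?P<index>\w+)$"
)


def pick_header_pattern(seq_id):
    """Return the first compiled pattern matching seq_id, or None."""
    for pattern in (NEW_HEADER, OLD_HEADER):
        if pattern.search(seq_id):
            return pattern
    return None
```

Returning the compiled pattern rather than a label lets the caller reuse it on every following header without testing the alternatives again.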
mgkit.io.fastq.convert_seqid_to_new(seq_id)

Convert the old seq_id format for Illumina reads to the new format found in Casava 1.8+

Parameters:seq_id (str) – seq_id of the sequence (stripped of ‘@’)
Return str:the new format seq_id

Note

Example from Wikipedia:

old casava seq_id:
@HWUSI-EAS100R:6:73:941:1973#0/1
new casava seq_id:
@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCAC
mgkit.io.fastq.convert_seqid_to_old(seq_id, index_as_seq=True)

Deprecated since version 0.3.3.

Convert the new seq_id format for Illumina reads to the old format used in Casava until version 1.8, which introduced the new format.

Parameters:
  • seq_id (str) – seq_id of the sequence (stripped of ‘@’)
  • index_as_seq (bool) – if True, the index for the multiplex will be the sequence found at the end of the new format seq_id. Otherwise, 0 will be used
Return str:

the old format seq_id

mgkit.io.fastq.load_fastq(file_handle, num_qual=False)

New in version 0.3.1.

Loads a fastq file and returns a generator of tuples in which the first element is the name of the sequence, the second is the sequence and the third is the quality scores (converted to a numpy array if num_qual is True).

Note

This is a simple parser that assumes each record spans exactly 4 lines: the 1st and 3rd for the headers, the 2nd for the sequence and the 4th for the quality scores.

Parameters:

file_handle (str, file) – fastq file to open, can be a file name or a file handle

Yields:

tuple – first element is the sequence name/header, the second element is the sequence, the third is the quality score. The quality scores are kept as a string if num_qual is False (default) and converted to a numpy array with correct values (0-41) if num_qual is True

Raises:
  • ValueError – if the headers in both sequence and quality scores are not valid. This implies that the sequence/qualities have carriage returns or the file is truncated.
  • TypeError – if the qualities are in a format different than sanger (min 0, max 40) or illumina-1.8 (0, 41)
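
The 4-line assumption described above can be illustrated with a minimal parser. This is a sketch, not the mgkit code: the name `read_fastq` is hypothetical, it accepts only an open handle, it returns the numeric qualities as a plain list instead of a numpy array, and it assumes an ASCII offset of 33 without checking the range of the values.

```python
import io


def read_fastq(handle, num_qual=False):
    """Yield (name, sequence, qualities) from a 4-lines-per-record handle."""
    while True:
        header = handle.readline()
        if not header:
            break  # end of file reached on a record boundary
        seq = handle.readline().strip()
        plus = handle.readline()
        qual = handle.readline().strip()
        # Both header lines must be in place; otherwise the record is
        # wrapped over more lines or the file is truncated.
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("invalid headers: wrapped lines or truncated file")
        if num_qual:
            # Assumption: offset 33 (sanger, illumina-1.8)
            qual = [ord(char) - 33 for char in qual]
        yield header[1:].strip(), seq, qual


example = io.StringIO("@r1 extra\nACGT\n+\nIIII\n")
records = list(read_fastq(example, num_qual=True))
```

Because the function is a generator, records are parsed one at a time as the caller iterates, which keeps memory use flat on large files.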
mgkit.io.fastq.load_fastq_rename(file_handle, num_qual=False, name_func=None)

New in version 0.3.3.

Mirrors the same functionality in mgkit.io.fasta.load_fasta_rename(). Renames the header of the sequences using name_func, which is called on each header. By default, the behaviour is to keep the header to the left of the first space (BLAST behaviour).
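
The default renaming described above, keeping the header to the left of the first space, can be sketched as follows. The names `keep_first_word` and `rename_records` are hypothetical and stand in for the library's own code, which is not shown in this page.

```python
def keep_first_word(header):
    """Keep the text to the left of the first space (BLAST-like behaviour)."""
    return header.split(" ", 1)[0]


def rename_records(records, name_func=keep_first_word):
    """Apply name_func to the name of each (name, seq, qual) tuple."""
    for name, seq, qual in records:
        yield name_func(name), seq, qual


renamed = list(rename_records([
    ("EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCAC", "ACGT", "IIII"),
]))
```

Any callable that takes a header string and returns a string can be passed as `name_func` in this sketch, for example `str.upper`.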

mgkit.io.fastq.write_fastq_sequence(file_handle, name, seq, qual, write_mode='a')

Changed in version 0.3.3: if qual is not a string it’s converted to chars (phred33)

Write a fastq sequence to file. If the file_handle is a string, the file will be opened using write_mode.

Parameters:
  • file_handle – file handle or string.
  • name (str) – header to write for the sequence
  • seq (str) – sequence to write
  • qual (str) – quality string
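
The phred33 conversion mentioned in the version note can be sketched like this. The helper `format_fastq_record` is hypothetical, and the exact layout written by mgkit (for example, whether the name is repeated on the '+' line) is an assumption here.

```python
import io


def format_fastq_record(name, seq, qual):
    """Return one Fastq record as a 4-line string (hypothetical helper)."""
    if not isinstance(qual, str):
        # phred33: each numeric score is stored as chr(score + 33)
        qual = "".join(chr(score + 33) for score in qual)
    return "@{0}\n{1}\n+\n{2}\n".format(name, seq, qual)


buffer = io.StringIO()
buffer.write(format_fastq_record("r1", "ACGT", [40, 40, 30, 2]))
```

Here the scores 40, 40, 30 and 2 become the characters `I`, `I`, `?` and `#`.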