Changes

0.3.3

Added

Changed

  • mgkit.io.fastq.write_fastq_sequence()
  • added seq_id as a special attribute to mgkit.io.gff.Annotation.get_attr()
  • mgkit.io.gff.from_prodigal_frag() is tested and fixed
  • added cache in mgkit.utils.dictionary.HDFDict
  • mgkit.utils.sequence.sequence_gc_content() now returns 0.5 when denominator is 0
  • add-gff-info addtaxa -a now accepts seq_id as lookup, to use output from taxon-utils lca (after cutting output)
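
The zero-denominator behaviour noted above for sequence_gc_content() can be illustrated with a minimal sketch. This is not MGKit's implementation: the function name gc_content_sketch and the choice to count only A, C, G and T are assumptions made for illustration.

```python
def gc_content_sketch(seq):
    """Return the fraction of G and C among the A, C, G and T bases of seq.

    Hypothetical helper, not part of MGKit. When the sequence contains
    none of the four bases the denominator is 0, and 0.5 is returned
    instead of raising ZeroDivisionError.
    """
    seq = seq.upper()
    gc_count = seq.count("G") + seq.count("C")
    at_count = seq.count("A") + seq.count("T")
    total = gc_count + at_count
    if total == 0:
        # no informative bases: fall back to the neutral value
        return 0.5
    return gc_count / total
```

Returning 0.5 keeps downstream arithmetic working on empty or fully ambiguous sequences, at the cost of hiding that no bases were counted.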

0.3.2

Removed deprecated code

0.3.1

This release adds several scripts and commands. Successive 0.3.x releases will be used to fix bugs and refine the APIs and CLI. Most importantly, since the publication of the first paper using the framework, the releases will move toward the removal of as much deprecated code as possible. At the same time, a general review of the code to make it run on Python3 (probably via the six package) will start. The general idea is to reach a full removal of legacy code in 0.4.0, while full Python3 compatibility is the aim of 0.5.0, which also means dropping dependencies that are not compatible with Python3.

Added

Changed

Fixed

Besides smaller fixes:

Deprecated

0.3.0

Many bugs were fixed in this release, especially in reading the NCBI taxonomy and in using the msgpack format to save a UniprotTaxonomy instance. A tutorial for profiling a microbial community using MGKit and BLAST was also added (Profile a Community with BLAST).

Added

Changed

  • added no_zero parameter to mgkit.io.blast.parse_accession_taxa_table()
  • changed behaviour of mgkit.kegg.KeggModule and some of its methods.
  • added with_last parameter to mgkit.taxon.get_lineage()
  • added --split option to add-gff-info exp_syn and get-gff-info sequence scripts, to emulate BLAST behaviour in parsing sequence headers
  • added -c option to add-gff-info addtaxa

0.2.5

Changed

Added

0.2.4

Changed

  • mgkit.utils.sequence.get_contigs_info() now accepts a dictionary name->seq or a list of sequences, besides a file name (r536)
  • add-gff-info counts command now removes trailing commas from the samples list
  • the axes are turned off after the dendrogram is plotted

Fixed

  • the snp_parser script requirements were set wrong in setup.py (r540)
  • uncommented lines to download sample data to build documentation (r533)
  • add-gff-info uniprot command now writes the lineage attribute correctly (r538)

0.2.3

The installation dependencies are more flexible, with only numpy being required. To install every needed package, you can use:

$ pip install mgkit[full]

Added

  • new option to pass the query sequences to blast2gff, which allows adding the correct frame of the annotation in the GFF
  • added the attributes evalue, subject_start and subject_end to the output of blast2gff. The subject start and end positions make it possible to determine in which frame of the subject sequence the match was found
  • added the options to annotate the heatmap with the numbers. Also updated the related example notebook
  • added the option to read the taxonomy from NCBI dump files, using mgkit.taxon.UniprotTaxonomy.read_from_ncbi_dump(). This makes it faster to get the taxonomy file
  • added argument to return information from mgkit.net.embl.datawarehouse_search(), in the form of tab separated data. The argument fields can be used when display is set to report. An example of how to use it is in the function documentation
  • added a bash script download-taxonomy.sh that downloads the taxonomy
  • added script venv-docs.sh to build the documentation in HTML under a virtual environment. matplotlib on MacOS X raises a RuntimeError because of a bug in virtualenv; the documentation can first be built with this script, after the script create-apidoc.sh has created the API documentation. The rest of the documentation (e.g. the PDF) can be created with make as usual, afterwards
  • added mgkit.net.pfam, with only one function at the moment, that returns the descriptions of the families.
  • added pfam command to add-gff-info, using the mentioned function, it adds the description of the Pfam families in the GFF file
  • added a new exception, used internally when an additional dependency is needed

Changed

  • using the NCBI taxonomy dump has two side effects:

    • the scientific/common names are kept as is, not lower cased as before
    • a merged file is provided for each taxon_id that changed. While the old taxon_id is kept in the taxonomy, it points to the new taxon, to keep backward compatibility
  • renamed the add-gff-info gitaxa command to addtaxa. It now accepts more data sources (dictionaries) and is more general

  • changed mgkit.net.embl.datawarehouse_search() to automatically set the limit at 100,000 records

  • the taxonomy can now be saved using msgpack, making it faster to read/write it. It is also more compact and has a better compression ratio

  • mgkit.plots.heatmap.grouped_spine() now accepts the rotation of the labels as an option

  • added option to use another attribute for the gene_id in the get-gff-info script gtf command

  • added a function to compare the version of MGKit used, throwing a warning, when it’s different (mgkit.check_version())

  • removed test for old SNPs structures and added the same tests for the new one

  • mgkit.snps.classes.GeneSNP now caches the number of synonymous and non-synonymous SNPs for better speed

  • mgkit.io.gff.GenomicRange.__contains__() now also accepts a tuple (start, end) or another GenomicRange instance
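
The extended membership test described in the last item can be sketched as follows. The class SimpleRange is hypothetical and only mirrors the idea; it assumes inclusive start and end coordinates, as used in GFF files, and is not the GenomicRange implementation.

```python
class SimpleRange:
    """Hypothetical stand-in for a genomic range with inclusive coordinates."""

    def __init__(self, start, end):
        self.start = start
        self.end = end

    def __contains__(self, item):
        # a single position: test it against both ends
        if isinstance(item, int):
            return self.start <= item <= self.end
        # another range: compare its own start and end
        if isinstance(item, SimpleRange):
            start, end = item.start, item.end
        else:
            # anything else is assumed to be a (start, end) tuple
            start, end = item
        return self.start <= start and end <= self.end
```

With this sketch, 5 in SimpleRange(1, 10), (2, 8) in SimpleRange(1, 10) and SimpleRange(3, 4) in SimpleRange(1, 10) are all true, while a tuple that extends past either end is not contained.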

Fixed

  • a bug in the gitaxa (now addtaxa) command: when a taxon_id was not found in the table, the wrong taxon_name and lineage were inserted
  • bug in mgkit.snps.classes.GeneSNP that prevented the correct addition of values
  • fixed bug in mgkit.snps.funcs.flat_sample_snps() with the new class
  • mgkit.io.gff.parse_gff() now correctly handles comment lines and stops parsing if the fasta file at the end of a GFF is found
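
The parsing behaviour in the last item (skip comment lines, stop at the embedded FASTA section) can be sketched roughly as below. The function iter_gff_records is hypothetical and is not the code of parse_gff(); it assumes that the FASTA section is introduced either by the GFF3 "##FASTA" directive or by a bare ">" header line.

```python
def iter_gff_records(lines):
    """Yield the tab separated fields of each annotation line.

    Hypothetical sketch: comment lines and empty lines are skipped,
    and iteration stops as soon as the FASTA section begins.
    """
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##FASTA") or line.startswith(">"):
            # everything after this point is sequence data
            break
        if not line or line.startswith("#"):
            continue
        yield line.split("\t")
```

The FASTA check comes before the comment check because "##FASTA" also starts with "#" and would otherwise be skipped as an ordinary comment.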

0.2.2

Added

Changed

Removed

  • deprecated code from the snps package

0.2.1

Added

  • added mgkit.db.mongo
  • added mgkit.db.dbm
  • added mgkit.io.gff.Annotation.get_mappings()
  • added mgkit.io.gff.Annotation.to_json()
  • added mgkit.io.gff.Annotation.to_mongodb()
  • added mgkit.io.gff.from_json()
  • added mgkit.io.gff.from_mongodb()
  • added mgkit.taxon.get_lineage()
  • added mgkit.utils.sequence.get_contigs_info()
  • added mongodb and dbm commands to script get-gff-info
  • added kegg command to add-gff-info script, caching results and -d option to uniprot command
  • added -ft option to blast2gff script
  • added -ko option to download_profiles
  • added new HMMER tutorial
  • added another notebook to the plot examples, for misc. tips
  • added a script that downloads from figshare the tutorial data
  • added function to get an enzyme full name (mgkit.mappings.enzyme.get_enzyme_full_name())
  • added example notebook for using GFF annotations and the mgkit.db.dbm, mgkit.db.mongo modules

Changed

Deprecated

  • mgkit.filter.taxon.filter_taxonomy_by_lineage()
  • mgkit.filter.taxon.filter_taxonomy_by_rank()

Removed

  • removed old filter_gff script

0.2.0

  • added creation of wheel distribution
  • changes to ensure compatibility with later pandas versions
  • mgkit.io.gff.Annotation.get_ec() now returns a set, reflected changes in tests
  • added a --cite option to scripts
  • fixes to tutorial
  • updated documentation for sphinx 1.3
  • changes to diagrams
  • added decoration to raise warnings for deprecated functions
  • added possibility for mgkit.counts.func.load_sample_counts() info_dict to be a function instead of a dictionary
  • consolidation of some eggNOG structures
  • added more spine options in mgkit.plots.heatmap.grouped_spine()
  • added a length property to mgkit.io.gff.Annotation
  • changed filter-gff script to customise the filtering function, from the default one, also updated the related documentation
  • fixed a few plot functions
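
A decorator that raises warnings for deprecated functions, as mentioned in the list above, is commonly written along these lines. This is a generic sketch built on the standard warnings module; the names deprecated and old_sum are hypothetical and the decorator MGKit ships may differ.

```python
import functools
import warnings


def deprecated(func):
    """Wrap func so that every call emits a DeprecationWarning."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        warnings.warn(
            "{} is deprecated".format(func.__name__),
            DeprecationWarning,
            stacklevel=2,  # report the caller's line, not the wrapper's
        )
        return func(*args, **kwargs)

    return wrapper


@deprecated
def old_sum(a, b):
    # hypothetical function kept only for backward compatibility
    return a + b
```

The decorated function still returns its normal result; functools.wraps preserves its name and docstring so generated documentation is unaffected.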

0.1.16

  • changed default parameter for mgkit.plots.boxplot.add_values_to_boxplot()
  • Added include_only filter option to the default snp filters mgkit.consts.DEFAULT_SNP_FILTER
  • the default filter for SNPs now uses an include only option, by default including only protozoa, archaea, fungi and bacteria in the matrix
  • added widths parameter to the mgkit.plots.boxplot.boxplot_dataframe() function, added function mgkit.plots.boxplot.add_significance_to_boxplot() and updated the example boxplot notebook with an example of the new function
  • added use_dist and dist_func parameters to the mgkit.plots.heatmap.dendrogram() function
  • added a few constants and functions to calculate the distance matrices of taxa: mgkit.taxon.taxa_distance_matrix(), mgkit.taxon.distance_taxa_ancestor() and mgkit.taxon.distance_two_taxa()
  • mgkit.kegg.KeggClientRest.link_ids() now accepts a dictionary as list of ids
  • if the conversion of an Annotation attribute (first 8 columns) raises a ValueError in mgkit.io.gff.from_gff(), by default the parser keeps the string version (as in the case of phase, where the value is ‘.’ instead of a number)
  • treat cases where an attribute is set with no value in mgkit.io.gff.from_gff()
  • added mgkit.plots.colors.palette_float_to_hex() to convert floating value palettes to string
  • forces vertical alignment of tick labels in heatmaps
  • added parameter to get a consensus sequence for an AA alignment, by adding the nucl parameter to mgkit.utils.sequence.Alignment.get_consensus()
  • added mgkit.utils.sequence.get_variant_sequence() to get variants of a sequence, essentially changing the sequence according to the SNPs passed
  • added method to get an aminoacid sequence from Annotation in mgkit.io.gff.Annotation.get_aa_seq() and added the possibility to pass a SNP to get the variant sequence of an Annotation in mgkit.io.gff.Annotation.get_nuc_seq().
  • added exp_syn command to add-gff-info script
  • changed GTF file conversion
  • changed behaviour of mgkit.taxon.is_ancestor(): if a taxon_id raises a KeyError, False is now returned. In other words, if the taxon_id is not found in the taxonomy, it’s not an ancestor
  • added mgkit.io.gff.GenomicRange.__contains__(). It tests if a position is inside the range
  • added mgkit.io.gff.GenomicRange.get_relative_pos(). It returns a position relative to the GenomicRange start
  • fixed documentation and bugs (Annotation.get_nuc_seq)
  • added mgkit.io.gff.Annotation.is_syn(). It returns True if a SNP is synonymous and False if non-synonymous
  • added to_nuc parameter to mgkit.io.gff.from_nuc_blast() function. If to_nuc is False, it is assumed that the hit was against an amino acid DB, in which case the phase should always be set to 0
  • reworked internal of snp_parser script. It doesn’t use SNPDat anymore
  • updated tutorial
  • added ipython notebook as an example to explore data from the tutorial
  • cleaned deprecated code, fixed imports, added tests and documentation
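
The change to is_ancestor() listed above (a missing taxon_id yields False instead of a KeyError) can be sketched with a plain dictionary that maps each taxon_id to its parent. The function is_ancestor_sketch and the dictionary layout are assumptions for illustration; the sketch also assumes that a taxon counts as its own ancestor, which may not match MGKit.

```python
def is_ancestor_sketch(parents, taxon_id, anc_id):
    """Return True if anc_id is found walking up from taxon_id.

    parents is a hypothetical mapping taxon_id -> parent taxon_id,
    with None as the parent of the root. A taxon_id missing from the
    mapping raises KeyError internally, which is turned into False.
    """
    current = taxon_id
    while current is not None:
        if current == anc_id:
            return True
        try:
            current = parents[current]
        except KeyError:
            # not in the taxonomy, so it cannot have this ancestor
            return False
    return False
```

Callers that filter large annotation sets can then use the result directly in a condition, without wrapping every call in their own try/except.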

0.1.15

  • changed name of mgkit.taxon.lowest_common_ancestor() to mgkit.taxon.last_common_ancestor(), the old function name points to the new one
  • added mgkit.counts.func.map_counts_to_category() to remap counts from one ID to another
  • added get-gff-info script to extract information from GFF files
  • script download_data can now download only taxonomy data
  • added more script documentation
  • added examples on gene prediction
  • added function mgkit.io.gff.from_hmmer() to parse HMMER results and return mgkit.io.gff.Annotation instances
  • added mgkit.io.gff.Annotation.to_gtf() to return a GTF line, mgkit.io.gff.Annotation.add_gc_content() and mgkit.io.gff.Annotation.add_gc_ratio() to calculate GC content and ratio respectively
  • added mgkit.io.gff.parse_gff_files() to parse multiple GFF files
  • added uid_used parameter to several functions in mgkit.counts.func
  • added mgkit.plots.abund to plot abundance plots
  • added example notebooks for plots
  • HTSeq is now required only by the scripts that use it, snp_parser and fastq_utils
  • added function to convert numbers when reading from htseq count files
  • changed behavior of -b option in add-gff-info taxonomy command
  • added mgkit.io.gff.get_annotation_map()
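
The rename in the first item of this list, where the old function name points to the new one, follows a common backward compatibility pattern. The sketch below applies it to a toy last common ancestor function over a hypothetical taxon_id -> parent dictionary; it is not the code in mgkit.taxon and its signature is an assumption.

```python
def lineage_sketch(parents, taxon_id):
    """Return the list of taxon ids from the root down to taxon_id."""
    path = []
    current = taxon_id
    while current is not None:
        path.append(current)
        current = parents[current]
    return path[::-1]


def last_common_ancestor(parents, taxon_a, taxon_b):
    """Return the deepest taxon shared by both lineages, or None."""
    common = None
    pairs = zip(lineage_sketch(parents, taxon_a), lineage_sketch(parents, taxon_b))
    for id_a, id_b in pairs:
        if id_a != id_b:
            break
        common = id_a
    return common


# old name kept as an alias, so existing code keeps working
lowest_common_ancestor = last_common_ancestor
```

Because the alias is the same function object, both names always behave identically and the old one can later be removed in a single line.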

0.1.14

0.1.13

0.1.12

0.1.11

  • removed rst2pdf for generating a PDF for documentation. Latex is preferred
  • corrections to documentation and example script
  • removed need for joblib library in translate_seq script: used only if available (for using multiple processors)
  • deprecated mgkit.snps.funcs.combine_snps_in_dataframe() and mgkit.snps.funcs.combine_snps_in_dataframe(): mgkit.snps.funcs.combine_sample_snps() should be used
  • refactored some tests and added more
  • added docs_req.txt to help build the documentation on readthedocs.org
  • renamed mgkit.snps.classes.GeneSyn gid and taxon attributes to gene_id and taxon_id. The old names are still available for use (via properties), but they will be taken out in later versions. Old pickle data should be loaded and saved again in this release
  • added a few convenience functions to ease the use of combine_sample_snps()
  • added function mgkit.snps.funcs.significance_test() to test the distributions of genes shared between two taxa.
  • fixed an issue with deinterleaving sequence data from khmer
  • added mgkit.snps.funcs.flat_sample_snps()
  • Added method to mgkit.kegg.KeggClientRest to get names for all ids of a certain type (more generic than the various get_*_names)
  • added first implementation of mgkit.kegg.KeggModule class to parse a Kegg module entry
  • mgkit.snps.conv_func.get_rank_dataframe(), mgkit.snps.conv_func.get_gene_map_dataframe()