Changelog

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog and this project adheres to Semantic Versioning.

Unreleased

v0.8.5 - 2021-11-21

Added

  • Minimal compatibility support for running GECCO inside of Galaxy workflows.

v0.8.4 - 2021-09-26

Fixed

  • gecco convert gbk --format bigslice failing to run because of outdated code (#5).

  • gecco convert gbk --format bigslice not creating files with names conforming to BiG-SLiCE expected input.

Changed

  • Bump minimum pyrodigal version to v0.6.2 to use platform-accelerated code if supported.

v0.8.3-post1 - 2021-08-23

Fixed

  • Wrong default value for --threshold being shown in gecco run help message.

v0.8.3 - 2021-08-23

Changed

  • Default probability threshold for segmentation to 0.3 (from 0.4).

v0.9.0 - 2021-08-10 - YANKED

Changed

  • Retrain internal model using --select=0.35 instead of --select=0.25 like before.

  • Change default p-value filter from 1e-9 to 1e-5 to detect more features.

v0.8.2 - 2021-07-31

Fixed

  • gecco run crashing on Python 3.6 because of missing contextlib.nullcontext class.
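
For context, contextlib.nullcontext only exists from Python 3.7 onwards. The following is a minimal sketch of one common way to keep code working on Python 3.6; it is an illustration under that assumption, not necessarily the fix applied in GECCO.

```python
import contextlib

try:
    # Available in the standard library from Python 3.7 onwards.
    from contextlib import nullcontext
except ImportError:
    # Hypothetical fallback for Python 3.6: a do-nothing context manager
    # that simply hands back the value it was given.
    @contextlib.contextmanager
    def nullcontext(enter_result=None):
        yield enter_result


# Usage: wrap an optional resource without branching on whether it exists.
with nullcontext("no-op") as value:
    print(value)
```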

Changed

  • gecco run and gecco annotate will not try to count the number of profiles when given an external HMM file with the --hmm flag.

  • PyHMMER.run now reports the p-value of each domain in addition to the e-value as a /note qualifier.

v0.8.1 - 2021-07-29

Changed

  • gecco run now filters out unneeded features before annotating, making it easier to analyze the results of a run with a custom --model.

Fixed

  • gecco reporting about using Pfam v33.1 while actually using v34.0 because of an outdated field in gecco/hmmer/Pfam.ini.

Added

  • Missing documentation for the strand attribute of gecco.model.Gene.

v0.8.0 - 2021-07-03

Changed

  • Retrain internal model using new sequence embeddings and remove broken/duplicate BGCs from MIBiG 2.0.

  • Bump minimum pyhmmer version to v0.4.0 to improve exception handling.

  • Bump minimum pyrodigal version to v0.5.0 to fix sequence decoding on some platforms.

  • Use p-values instead of e-values to filter domains obtained with HMMER.

  • gecco cv and gecco train now seed the RNG with a user-defined seed before shuffling rows of training data.
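
To illustrate why seeding matters here, the sketch below shuffles rows with a dedicated, seeded random generator so that repeated runs give the same order. The function name and the use of the standard random module are assumptions made for the example, not GECCO's actual implementation.

```python
import random


def shuffled_rows(rows, seed=0):
    """Return a shuffled copy of `rows` that is reproducible for a given seed."""
    rng = random.Random(seed)  # private generator, leaves the global RNG untouched
    rows = list(rows)
    rng.shuffle(rows)
    return rows


# Usage: the same seed always yields the same ordering.
first = shuffled_rows(range(10), seed=42)
second = shuffled_rows(range(10), seed=42)
print(first == second)
```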

Fixed

  • Extraction of BGC compositions for the type predictor while training.

  • ClusterCRF.trained failing to open an external model.

Added

  • Domain.pvalue attribute to access the p-value of a domain annotation.

  • Mandatory pvalue column to FeatureTable objects.

  • Support for loading several feature tables in gecco train and gecco cv.

  • Warnings to ClusterCRF.fit when selecting uninformative features.

  • --correction flag to gecco train and gecco cv, to specify a multiple testing correction method when computing p-values with Fisher’s Exact Test.

Removed

  • Outdated gecco embed command.

  • Unused --truncate flag from the gecco train CLI.

  • Tigrfam domains, which do not improve performance on the new training data.

v0.7.0 - 2021-05-31

Added

  • Support for writing an AntiSMASH sideload JSON file after a gecco run workflow.

  • Code for converting GenBank files to a BiG-SLiCE-compatible format with the gecco convert subcommand.

  • Documentation about using GECCO in combination with AntiSMASH or BiG-SLiCE.

Changed

  • Minimum Biopython version to v1.73 for compatibility with older bioinformatics tooling.

  • Replaced the internal domain composition shipped in gecco.types with a newer composition array obtained directly from MIBiG files.

Removed

  • Outdated notice about -vvv verbosity level in the help message of the main gecco command.

v0.6.3 - 2021-05-10

Fixed

  • HMMER annotation not properly handling inputs with multiple contigs.

  • Some progress bar totals displaying as floats in the CLI.

Changed

  • PyHMMER now sets the Z and domZ values from the number of proteins given to the search pipeline.

  • gecco.cli delegates imports to make CLI more responsive.

  • pkg_resources has been replaced with importlib.resources and importlib.metadata where applicable.

  • multiprocessing.cpu_count has been replaced with os.cpu_count where applicable.
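
As a rough sketch of what the standard-library replacements mentioned above look like in practice (the helper names are hypothetical, and importlib.metadata requires Python 3.8 or later):

```python
import os
from importlib import metadata


def cpu_total() -> int:
    """Number of logical CPUs, falling back to 1 when it cannot be determined."""
    # Unlike multiprocessing.cpu_count, os.cpu_count returns None
    # instead of raising when the count is unknown.
    return os.cpu_count() or 1


def installed_version(distribution: str):
    """Version string of an installed distribution, or None if it is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


print(cpu_total())
print(installed_version("a-distribution-that-is-not-installed"))
```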

v0.6.2 - 2021-05-04

Fixed

  • gecco cv loto crashing because of outdated code.

Changed

  • Logging-style prompt will only display if GECCO is running with -vv flag.

Added

  • GECCO bioRxiv paper reference to Cluster.to_seq_record output record.

v0.6.1 - 2021-03-15

Fixed

  • Progress bar not being disabled by -q flag in CLI.

  • Fall back to using HMM name if accession is not available in PyHMMER.

  • Group genes by source contig and process them separately in PyHMMER to avoid bogus E-values.

Added

  • psutil dependency to get the number of physical CPU cores on the host machine.

  • Support for using an arbitrary mapping of positives to negatives in gecco embed.

Removed

  • Unused and outdated HMMER and DomainRow classes from gecco.hmmer.

v0.6.0 - 2021-02-28

Changed

  • Updated internal model with a cleaned-up version of the MIBiG-2.0 Pfam-33.1/Tigrfam-15.0 embedding.

  • Updated internal InterPro catalog.

Fixed

  • Features not being grouped together in gecco cv and gecco train when provided with a feature table where rows were not sorted by protein IDs.

v0.5.5 - 2021-02-28

Fixed

  • gecco cv bug causing only the last fold to be written.

v0.5.4 - 2021-02-28

Changed

  • Replaced verboselogs, coloredlogs and better-exceptions with rich.

Removed

  • tqdm training dependency.

Added

  • gecco annotate command to produce a feature table from a genomic file.

  • gecco embed to embed BGCs into non-BGC regions using feature tables.

v0.5.3 - 2021-02-21

Fixed

  • Coordinates of genes in output GenBank files.

  • Potential issue with the number of CPUs in PyHMMER.run.

Changed

  • Bump required pyrodigal version to v0.4.2 to fix buffer overflow.

v0.5.2 - 2021-01-29

Added

  • Support for downloading HMM files directly from GitHub releases assets.

  • Validation of filtered HMMs with MD5 checksum.
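
A checksum validation of this kind can be sketched with the standard hashlib module. The function name and chunk size below are assumptions for illustration and do not reflect GECCO's own code.

```python
import hashlib


def md5_matches(path, expected_hexdigest, chunk_size=65536):
    """Check whether the file at `path` has the expected MD5 hex digest."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so large HMM files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_hexdigest.lower()
```

A downloaded file would then be rejected, or downloaded again, whenever md5_matches returns False.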

Fixed

  • Invalid coordinates of protein domains in GenBank output files.

  • gecco.interpro module not being added to wheel distribution.

Changed

  • Bump required pyhmmer version to v0.2.1.

v0.5.1 - 2021-01-15

Fixed

  • --hmm flag being ignored in gecco run command.

  • PyHMMER using HMM names instead of accessions, causing issues with Pfam HMMs.

v0.5.0 - 2021-01-11

Added

  • Explicit support for Python 3.9.

Changed

  • pyhmmer is used to annotate protein sequences instead of HMMER3 binary hmmsearch.

  • HMM files are stored in binary format to speed up parsing and reduce storage size.

  • tqdm is now a training-only dependency.

  • gecco cv now requires training dependencies.

v0.4.5 - 2020-11-23

Added

  • Additional fold column to cross-validation table output.

Changed

  • Use sequence ID instead of protein ID to extract type from cluster in gecco cv.

  • Install HMM data in pre-pressed format to make hmmsearch runs faster on short sequences.

  • gecco.orf was rewritten to extract genes from input sequences in parallel.

v0.4.4 - 2020-09-30

Added

  • gecco cv loto command to run LOTO cross-validation using BGC types for stratification.

  • header keyword argument to FeatureTable.dump and ClusterTable.dump to write the table without the column header, which allows appending to an existing table.

  • __getitem__ implementation for FeatureTable and ClusterTable that returns a single row or a sub-table from a table.
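
The row-or-sub-table behaviour of __getitem__ can be pictured with a toy class. ToyTable below is entirely hypothetical and much simpler than FeatureTable or ClusterTable; it only shows the usual pattern of dispatching on integer versus slice indices.

```python
class ToyTable:
    """Hypothetical stand-in for a table holding a list of rows."""

    def __init__(self, rows):
        self.rows = list(rows)

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, item):
        if isinstance(item, slice):
            # A slice yields a new table containing the selected rows.
            return ToyTable(self.rows[item])
        # An integer index yields a single row.
        return self.rows[item]


table = ToyTable(["row0", "row1", "row2"])
print(table[0])        # a single row
print(len(table[1:]))  # size of the sub-table
```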

Fixed

  • gecco cv command now writes results iteratively instead of holding the tables for every fold in memory.

Changed

  • Bumped pandas training dependency to v1.0.

v0.4.3 - 2020-09-07

Fixed

  • GenBank files being written with invalid /cds feature type.

Changed

  • Blocked installation of Biopython v1.78 or newer as it removes Bio.Alphabet and breaks the current code.

v0.4.2 - 2020-08-07

Fixed

  • TypeClassifier.predict_types using inverse type probabilities when given several clusters to process.

v0.4.1 - 2020-08-07

Fixed

  • gecco run command crashing on input sequences not containing any genes.

v0.4.0 - 2020-08-06

Added

  • gecco.model.ProductType enum to model the biosynthetic class of a BGC.

Removed

  • pandas interaction from internal data model.

  • ClusterCRF code specific to cross-validation.

Changed

  • pandas, fisher and statsmodels dependencies are now optional.

  • gecco train command expects a cluster table in addition to the feature table to know the types of the input BGCs.

v0.3.0 - 2020-08-03

Changed

  • Replaced Nearest-Neighbours classifier with Random Forest to perform type prediction for candidate BGCs.

  • gecco.knn module was renamed to implementation-agnostic name gecco.types.

Fixed

  • Extraction of domain composition taking a long time in gecco train command.

Removed

  • --metric argument to the gecco run CLI command.

v0.2.2 - 2020-07-31

Changed

  • Domain and Gene can now carry qualifiers that are used when they are translated to a sequence feature.

Added

  • InterPro names, accessions, and HMMER e-value for each annotated domain in GenBank output files.

v0.2.1 - 2020-07-23

Fixed

  • Various potential crashes in ClusterRefiner code.

Removed

  • Unneeded feature dictionary filtering in ClusterCRF for models with Fisher Exact Test feature selection.

v0.2.0 - 2020-07-23

Fixed

  • pandas warning about unsorted columns in gecco run.

Removed

  • Gene.probability property, replaced by Gene.maximum_probability and Gene.average_probability properties to be explicit.

Changed

  • Internal model now uses Pfam and Tigrfam with the top 35% features selected with Fisher’s Exact Test.

  • ClusterRefiner now removes genes on Cluster edges if they do not contain any domain annotation.

v0.1.1 - 2020-07-22

Added

  • ClusterCRF.predict_probabilities to annotate a list of Gene.

Changed

  • BGC probability is now stored at the Domain level instead of at the Gene level, independently of the feature extraction level used by the CRF.

  • ClusterKNN will use the model path provided to gecco run if any.

Docs

  • Added this changelog file to document changes in the code.

  • Added documentation to gecco submodules that were missing it.

  • Included the CHANGELOG.md file in the generated docs.

v0.1.0 - 2020-07-17

Initial release.

v0.0.1 - 2018-08-13

Proof-of-concept.