All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog and this project adheres to Semantic Versioning.


v0.9.2 - 2022-04-11


  • Padding of short sequences with empty genes when predicting probabilities in ClusterCRF.
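The padding idea in this entry can be sketched as follows. This is only an illustration, assuming a fixed-size sliding window and a placeholder value for "empty" genes; `pad_to_window` and `sliding_windows` are hypothetical names and not part of the GECCO API.

```python
from typing import List, Optional, Sequence


def pad_to_window(genes: Sequence[str], window_size: int,
                  filler: Optional[str] = None) -> List[Optional[str]]:
    """Append filler entries until the sequence holds at least window_size items."""
    missing = max(0, window_size - len(genes))
    return list(genes) + [filler] * missing


def sliding_windows(items: Sequence, window_size: int) -> List[Sequence]:
    """Return every full window; an unpadded short sequence yields none."""
    return [items[i:i + window_size]
            for i in range(len(items) - window_size + 1)]
```

Without padding, a contig with fewer genes than the window size produces no window at all; after padding it produces exactly one.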

v0.9.1 - 2022-04-05


  • Make the genes.tsv and features.tsv tables contain all genes, even when they come from a contig too short to be processed by the CRF sliding window.

  • Replaced the --force-clusters-tsv flag with a --force-tsv flag to force writing TSV tables even when no genes or clusters were found in gecco run or gecco annotate.

v0.9.1-alpha4 - 2022-03-31

Retrain internal model with:

$ python -m gecco -vv train --c1 0.4 --c2 0 --select 0.25 --window-size 20 \
         -f mibig-2.0.proG2.Pfam-v35.0.features.tsv \
         -c mibig-2.0.proG2.clusters.tsv \
         -g GECCO-data/data/embeddings/mibig-2.0.proG2.genes.tsv \
         -o models/v0.9.1-alpha4

v0.9.1-alpha3 - 2022-03-23


  • gecco.model.GeneTable class to store gene coordinates independently of protein domains.


  • Refactored implementation of load and dump methods for Table classes into a dedicated base class.

  • gecco run and gecco annotate now output a gene table in addition to the feature and cluster tables.

  • gecco train expects a gene table instead of a GFF file for the gene coordinates.

v0.9.1-alpha2 - 2022-03-23


  • TypeClassifier.trained not being able to read unknown types from type tables.

v0.9.1-alpha1 - 2022-03-20

Candidate release with support for a sliding window in the CRF prediction algorithm.

v0.8.10 - 2022-02-23


  • --antismash-sideload flag of gecco run causing command to crash.

v0.8.9 - 2022-02-22


  • Prediction and support for the Other biosynthetic type of MIBiG clusters.

v0.8.8 - 2022-02-21


  • ClusterRefiner filtering method for edge genes not working as intended.

  • gecco run and gecco annotate commands crashing on missing input files instead of nicely rendering the error.

v0.8.7 - 2022-02-18


  • interpro.json metadata file not being included in distribution files.

  • Missing docstring for Protein.with_domains method.


  • Bump minimum scikit-learn version to v1.0 for Python 3.7+.

v0.8.6 - 2022-02-17 - YANKED


  • CLI flag for enabling region masking for contigs processed by Prodigal.

  • CLI flag for controlling region distance used for edge distance filtering.


  • gecco.model.Gene and gecco.model.Protein are now immutable data classes.

  • Bump minimum pyrodigal version to v0.6.4 to use region masking.

  • Implement filtering for extracted clusters based on distance to the contig edge.

  • Store InterPro metadata file uncompressed for version-control integration.


  • Mark BGC0000930 as Terpene in the type classifier data.

  • Progress bar messages are now in consistent format.

v0.8.5 - 2021-11-21


  • Minimal compatibility support for running GECCO inside of Galaxy workflows.

v0.8.4 - 2021-09-26


  • gecco convert gbk --format bigslice failing to run because of outdated code (#5).

  • gecco convert gbk --format bigslice not creating files with names conforming to BiG-SLiCE expected input.


  • Bump minimum pyrodigal version to v0.6.2 to use platform-accelerated code if supported.

v0.8.3-post1 - 2021-08-23


  • Wrong default value for --threshold being shown in gecco run help message.

v0.8.3 - 2021-08-23


  • Default probability threshold for segmentation to 0.3 (from 0.4).

v0.8.2 - 2021-07-31


  • gecco run crashing on Python 3.6 because of missing contextlib.nullcontext class.
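`contextlib.nullcontext` was added to the standard library in Python 3.7, which is why it is missing on Python 3.6. One common way to work around this is a small fallback like the one below; this is a generic sketch and not necessarily how the fix was implemented in GECCO.

```python
import contextlib

try:
    # Available in the standard library from Python 3.7 onwards.
    nullcontext = contextlib.nullcontext
except AttributeError:
    # Minimal stand-in for older interpreters: a context manager that
    # does nothing and hands back the value it was given.
    @contextlib.contextmanager
    def nullcontext(enter_result=None):
        yield enter_result
```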


  • gecco run and gecco annotate will not try to count the number of profiles when given an external HMM file with the --hmm flag.

  • now reports the p-value of each domain in addition to the e-value as a /note qualifier.

v0.8.1 - 2021-07-29


  • gecco run now filters out unneeded features before annotating, making it easier to analyze the results of a run with a custom --model.


  • gecco reporting about using Pfam v33.1 while actually using v34.0 because of an outdated field in gecco/hmmer/Pfam.ini.


  • Missing documentation for the strand attribute of gecco.model.Gene.

v0.8.0 - 2021-07-03


  • Retrain internal model using new sequence embeddings and remove broken/duplicate BGCs from MIBiG 2.0.

  • Bump minimum pyhmmer version to v0.4.0 to improve exception handling.

  • Bump minimum pyrodigal version to v0.5.0 to fix sequence decoding on some platforms.

  • Use p-values instead of e-values to filter domains obtained with HMMER.

  • gecco cv and gecco train now seed the RNG with a user-defined seed before shuffling rows of training data.
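Seeding before shuffling makes the row order reproducible between runs. A minimal sketch with the standard library, assuming rows are held in a plain list (`shuffled_rows` is a hypothetical helper, not a GECCO function):

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def shuffled_rows(rows: Iterable[T], seed: int) -> List[T]:
    """Return a shuffled copy of rows, using a generator seeded with seed."""
    rng = random.Random(seed)  # private generator, global state is untouched
    result = list(rows)
    rng.shuffle(result)
    return result
```

Calling the helper twice with the same seed yields the same order, so cross-validation folds built from the shuffled rows stay comparable across runs.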


  • Extraction of BGC compositions for the type predictor while training.

  • ClusterCRF.trained failing to open an external model.


  • Domain.pvalue attribute to access the p-value of a domain annotation.

  • Mandatory pvalue column to FeatureTable objects.

  • Support for loading several feature tables in gecco train and gecco cv.

  • Warnings to when selecting uninformative features.

  • --correction flag to gecco train and gecco cv, to select a multiple testing correction method when computing p-values with Fisher's Exact Test.
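For readers unfamiliar with multiple testing correction, the sketch below shows two widely used methods, Bonferroni and Benjamini-Hochberg, in plain Python. Which methods the --correction flag actually accepts is not stated in this changelog, so the choice of these two is an assumption made for illustration only.

```python
from typing import List, Sequence


def bonferroni(pvalues: Sequence[float]) -> List[float]:
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def benjamini_hochberg(pvalues: Sequence[float]) -> List[float]:
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted
```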


  • Outdated gecco embed command.

  • Unused --truncate flag from the gecco train CLI.

  • Tigrfam domains, which do not improve performance on the new training data.

v0.7.0 - 2021-05-31


  • Support for writing an AntiSMASH sideload JSON file after a gecco run workflow.

  • Code for converting GenBank files into a BiG-SLiCE-compatible format with the gecco convert subcommand.

  • Documentation about using GECCO in combination with AntiSMASH or BiG-SLiCE.


  • Minimum Biopython version to v1.73 for compatibility with older bioinformatics tooling.

  • Internal domain composition shipped in the gecco.types with newer composition array obtained directly from MIBiG files.


  • Outdated notice about -vvv verbosity level in the help message of the main gecco command.

v0.6.3 - 2021-05-10


  • HMMER annotation not properly handling inputs with multiple contigs.

  • Some progress bar totals displaying as floats in the CLI.


  • PyHMMER now sets the Z and domZ values from the number of proteins given to the search pipeline.

  • gecco.cli delegates imports to make CLI more responsive.

  • pkg_resources has been replaced with importlib.resources and importlib.metadata where applicable.

  • multiprocessing.cpu_count has been replaced with os.cpu_count where applicable.
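The two standard-library replacements named above can be used roughly as follows. This is a generic sketch, not code taken from GECCO; `available_cpus` and `installed_version` are hypothetical helper names, and `importlib.metadata` requires Python 3.8 or later.

```python
import os
from importlib import metadata


def available_cpus() -> int:
    """Number of logical CPUs; os.cpu_count() may return None, so fall back to 1."""
    return os.cpu_count() or 1


def installed_version(distribution: str) -> str:
    """Version of an installed distribution, or "unknown" if it is not installed."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return "unknown"
```

Unlike `pkg_resources`, `importlib.metadata` does not scan every installed distribution at import time, which fits the goal of a more responsive CLI mentioned in the same release.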

v0.6.2 - 2021-05-04


  • gecco cv loto crashing because of outdated code.


  • Logging-style prompt will only display if GECCO is running with -vv flag.


  • GECCO bioRxiv paper reference to Cluster.to_seq_record output record.

v0.6.1 - 2021-03-15


  • Progress bar not being disabled by -q flag in CLI.

  • Fallback to using HMM name if accession is not available in PyHMMER.

  • Group genes by source contig and process them separately in PyHMMER to avoid bogus E-values.


  • psutil dependency to get the number of physical CPU cores on the host machine.

  • Support for using an arbitrary mapping of positives to negatives in gecco embed.


  • Unused and outdated HMMER and DomainRow classes from gecco.hmmer.

v0.6.0 - 2021-02-28


  • Updated internal model with a cleaned-up version of the MIBiG-2.0 Pfam-33.1/Tigrfam-15.0 embedding.

  • Updated internal InterPro catalog.


  • Features not being grouped together in gecco cv and gecco train when provided with a feature table where rows were not sorted by protein IDs.

v0.5.5 - 2021-02-28


  • gecco cv bug causing only the last fold to be written.

v0.5.4 - 2021-02-28


  • Replaced verboselogs, coloredlogs and better-exceptions with rich.


  • tqdm training dependency.


  • gecco annotate command to produce a feature table from a genomic file.

  • gecco embed to embed BGCs into non-BGC regions using feature tables.

v0.5.3 - 2021-02-21


  • Coordinates of genes in output GenBank files.

  • Potential issue with the number of CPUs in


  • Bump required pyrodigal version to v0.4.2 to fix buffer overflow.

v0.5.2 - 2021-01-29


  • Support for downloading HMM files directly from GitHub releases assets.

  • Validation of filtered HMMs with MD5 checksum.
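Checksum validation of a downloaded file can be sketched with `hashlib` as shown below. The helper names and the chunked-read approach are assumptions for illustration; the changelog does not describe how GECCO stores or compares the expected digests.

```python
import hashlib
from typing import Iterable


def md5_hexdigest(chunks: Iterable[bytes]) -> str:
    """Compute the MD5 hex digest of data supplied as successive byte chunks."""
    digest = hashlib.md5()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


def file_matches_md5(path: str, expected_hex: str, chunk_size: int = 65536) -> bool:
    """Read the file in fixed-size chunks and compare its digest to expected_hex."""
    with open(path, "rb") as handle:
        chunks = iter(lambda: handle.read(chunk_size), b"")
        return md5_hexdigest(chunks) == expected_hex.lower()
```

Reading in chunks keeps memory use constant regardless of the size of the HMM file.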


  • Invalid coordinates of protein domains in GenBank output files.

  • gecco.interpro module not being added to wheel distribution.


  • Bump required pyhmmer version to v0.2.1.

v0.5.1 - 2021-01-15


  • --hmm flag being ignored in gecco run command.

  • PyHMMER using HMM names instead of accessions, causing issues with Pfam HMMs.

v0.5.0 - 2021-01-11


  • Explicit support for Python 3.9.


  • pyhmmer is used to annotate protein sequences instead of HMMER3 binary hmmsearch.

  • HMM files are stored in binary format to speed up parsing and reduce storage size.

  • tqdm is now a training-only dependency.

  • gecco cv now requires training dependencies.

v0.4.5 - 2020-11-23


  • Additional fold column to cross-validation table output.


  • Use sequence ID instead of protein ID to extract type from cluster in gecco cv.

  • Install HMM data in pre-pressed format to make hmmsearch runs faster on short sequences.

  • gecco.orf was rewritten to extract genes from input sequences in parallel.

v0.4.4 - 2020-09-30


  • gecco cv loto command to run LOTO cross-validation using BGC types for stratification.

  • header keyword argument to FeatureTable.dump and ClusterTable.dump to write the table without the column header, which makes it possible to append to an existing table.

  • __getitem__ implementation for FeatureTable and ClusterTable that returns a single row or a sub-table from a table.
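The "single row or sub-table" behaviour of `__getitem__` can be illustrated with a toy class. `ToyTable` is hypothetical and only mirrors the idea; the real FeatureTable and ClusterTable classes have their own storage and row types.

```python
from typing import Iterable, List, Union


class ToyTable:
    """Hypothetical row container used only to illustrate the indexing pattern."""

    def __init__(self, rows: Iterable[dict]) -> None:
        self.rows: List[dict] = list(rows)

    def __len__(self) -> int:
        return len(self.rows)

    def __getitem__(self, item: Union[int, slice]) -> Union[dict, "ToyTable"]:
        # A slice returns a new table of the same type; an integer returns one row.
        if isinstance(item, slice):
            return type(self)(self.rows[item])
        return self.rows[item]
```

With this pattern, `table[0]` gives one row while `table[1:]` gives a smaller table that still supports the table methods.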


  • gecco cv command now writes results iteratively instead of holding the tables for every fold in memory.


  • Bumped pandas training dependency to v1.0.

v0.4.3 - 2020-09-07


  • GenBank files being written with invalid /cds feature type.


  • Blocked installation of Biopython v1.78 or newer as it removes Bio.Alphabet and breaks the current code.

v0.4.2 - 2020-08-07


  • TypeClassifier.predict_types using inverse type probabilities when given several clusters to process.

v0.4.1 - 2020-08-07


  • gecco run command crashing on input sequences not containing any genes.

v0.4.0 - 2020-08-06


  • gecco.model.ProductType enum to model the biosynthetic class of a BGC.


  • pandas interaction from internal data model.

  • ClusterCRF code specific to cross-validation.


  • pandas, fisher and statsmodels dependencies are now optional.

  • gecco train command expects a cluster table in addition to the feature table to know the types of the input BGCs.

v0.3.0 - 2020-08-03


  • Replaced Nearest-Neighbours classifier with Random Forest to perform type prediction for candidate BGCs.

  • gecco.knn module was renamed to implementation-agnostic name gecco.types.


  • Extraction of domain composition taking a long time in gecco train command.


  • --metric argument to the gecco run CLI command.

v0.2.2 - 2020-07-31


  • Domain and Gene can now carry qualifiers that are used when they are translated to a sequence feature.


  • InterPro names, accessions, and HMMER e-value for each annotated domain in GenBank output files.

v0.2.1 - 2020-07-23


  • Various potential crashes in ClusterRefiner code.


  • Unneeded feature dictionary filtering in ClusterCRF for models with Fisher Exact Test feature selection.

v0.2.0 - 2020-07-23


  • pandas warning about unsorted columns in gecco run.


  • Gene.probability property, replaced by Gene.maximum_probability and Gene.average_probability properties to be explicit.


  • Internal model now uses Pfam and Tigrfam with the top 35% features selected with Fisher’s Exact Test.

  • ClusterRefiner now removes genes on Cluster edges if they do not contain any domain annotation.

v0.1.1 - 2020-07-22


  • ClusterCRF.predict_probabilities to annotate a list of Gene.


  • BGC probability is now stored at the Domain level instead of at the Gene level, independently of the feature extraction level used by the CRF.

  • ClusterKNN will use the model path provided to gecco run if any.


  • Added this changelog file to document changes in the code.

  • Added documentation to gecco submodules that were missing it.

  • Included the file in the generated docs.

v0.1.0 - 2020-07-17

Initial release.

v0.0.1 - 2018-08-13