Changelog¶

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog and this project adheres to Semantic Versioning.

Unreleased ¶

v0.9.10 - 2024-02-27¶

Fixed¶

Progress reading display when reading from compressed files.
Change labeling routine to use broad overlaps when annotating genes with cluster tables (#15).

Changed¶

Bump supported polars dependency to v0.20.
Bump supported statsmodels dependency to v0.14.
Report identifier of sequences with uni-valued labels when training.

v0.9.9 - 2023-11-23¶

Added¶

Support for gzip, bzip2, lz4 and xz-compressed input files.

Fixed¶

Outdated use of pandas API in gecco cv command.

Changed¶

Bump pyhmmer dependency to v0.10.0.
Bump pyrodigal dependency to v3.0.0.
Make gecco cv output a gene table with a ground truth column.

v0.9.8 - 2023-06-09¶

Fixed¶

ClusterTable.from_clusters extracting cluster IDs in the wrong column.
Deprecation warnings in polars.read_csv and polars.write_csv with recent polars versions.
Deprecation warnings in importlib_resources with recent Python versions.

v0.9.7 - 2023-05-26¶

Added¶

Command line option to annotate proteins using bitscore cutoffs from HMMs.
Command line option to disentangle overlapping domains after HMM annotation.

Changed¶

Bump pyhmmer dependency to v0.8.0.
Bump pyrodigal dependency to v2.1.0.
Rewrite gecco.model to use polars for managing tabular data.
Replace pandas dependencies with polars
Update gecco run to skip type classification for tasks without an assigned cluster type.

Fixed¶

Cluster.to_seq_record crashing when called on a cluster with types attribute unset.
Progress bar resetting when performing domain annotation with multiple HMMs.

Removed¶

Support for Python 3.7.

v0.9.6 - 2023-01-11¶

Added¶

Gene Ontology annotations to gecco.interpro local metadata.
Reference to Gene Ontology terms and derived functions to gecco.model.Domain objects.
Gene color based on predicted function in gecco.model.Gene.to_seq_feature.

Fixed¶

Missing gzip import in the CLI preventing usage of gzip-compressed inputs.
Invalid coordinates of domains found in reverse-strand genes.
Detection of entry points with importlib.metadata on older Python versions.

Changed¶

bgc_id columns of cluster tables are renamed cluster_id.
gecco.model.ProductType is renamed to gecco.model.ClusterType.
Bumped pyrodigal dependency to v2.0.
Bumped pyhmmer dependency to v0.7.

v0.9.5 - 2022-08-10¶

Added¶

gecco predict command to predict BGCs from an annotated genome.
Protein.with_seq function to assign a new sequence to a protein object.

Fixed¶

Issue with antiSMASH sideload JSON file generation in gecco run and gecco predict.
Make gecco.orf handle STOP codons consistently (#9).

v0.9.4 - 2022-05-31¶

Added¶

classes_ property to TypeClassifier to access the classes_ attribute of the TypeBinarizer.
Alternative ORF finder CDSFinder which simply extracts CDS features from input sequences (#8).
Support for annotating domains with “exclusive” HMMs to annotate genes with at most one HMM from the library.

Changed¶

ProductType is not restricted to MIBiG types anymore and can support any string as a base type identifier.
PyrodigalFinder now uses multiprocessing.pool.ThreadPool instead of custom thread code thanks to OrfFinder.find_genes reentrancy introduced in Pyrodigal v1.0.
PyrodigalFinder can now be used in single / non-meta mode from the API.
BUmped minimum rich version to 12.3 to use None total in progress bars when the size of an HMM library is unknown.

Fixed¶

Broken MyPy type annotations in the gecco.model and gecco.cli modules.

v0.9.3 - 2022-05-13¶

Changed¶

--format flag of gecco annotate and gecco run CLI commands is now made lowercase before giving value to Bio.SeqIO.

Fixed¶

Genes with duplicate IDs being silently ignored in HMMER.run.

v0.9.2 - 2022-04-11¶

Added¶

Padding of short sequences with empty genes when predicting probabilities in ClusterCRF.

v0.9.1 - 2022-04-05¶

Changed¶

Make the genes.tsv and features.tsv table contain all genes even when they come from a contig too short to be processed by the CRF sliding window.
Replaced the --force-clusters-tsv flag with a --force-tsv flag to force writing TSV tables even when no genes or clusters were found in gecco run or gecco annotate.

v0.9.1-alpha4 - 2022-03-31¶

Retrain internal model with:

$ python -m gecco -vv train --c1 0.4 --c2 0 --select 0.25 --window-size 20 \
         -f mibig-2.0.proG2.Pfam-v35.0.features.tsv \
         -c mibig-2.0.proG2.clusters.tsv \
         -g GECCO-data/data/embeddings/mibig-2.0.proG2.genes.tsv \
         -o models/v0.9.1-alpha4

v0.9.1-alpha3 - 2022-03-23¶

Added¶

gecco.model.GeneTable class to store gene coordinates independently of protein domains.

Changed¶

Refactored implementation of load and dump methods for Table classes into a dedicated base class.
gecco run and gecco annotate now output a gene table in addition to the feature and cluster tables.
gecco train expects a gene table instead of a GFF file for the gene coordinates.

v0.9.1-alpha2 - 2022-03-23¶

Fixed¶

TypeClassifier.trained not being able to read unknown types from type tables.

v0.9.1-alpha1 - 2022-03-20¶

Candidate release with support for a sliding window in the CRF prediction algorithm.

v0.8.10 - 2022-02-23¶

Fixed¶

--antismash-sideload flag of gecco run causing command to crash.

v0.8.9 - 2022-02-22¶

Removed¶

Prediction and support for the Other biosynthetic type of MIBiG clusters.

v0.8.8 - 2022-02-21¶

Fixed¶

ClusterRefiner filtering method for edge genes not working as intended.
gecco run and gecco annotate commands crashing on missing input files instead of nicely rendering the error.

v0.8.7 - 2022-02-18¶

Fixed¶

interpro.json metadata file not being included in distribution files.
Missing docstring for Protein.with_domains method.

Changed¶

Bump minimum scikit-learn version to v1.0 for Python3.7+.

v0.8.6 - 2022-02-17 - YANKED¶

Added¶

CLI flag for enabling region masking for contigs processed by Prodigal.
CLI flag for controlling region distance used for edge distance filtering.

Changed¶

gecco.model.Gene and gecco.model.Protein are now immutable data classes.
Bump minimum pyrodigal version to v0.6.4 to use region masking.
Implement filtering for extracted clusters based on distance to the contig edge.
Store InterPro metadata file uncompressed for version-control integration.

Fixed¶

Mark BGC0000930 as Terpene in the type classifier data.
Progress bar messages are now in consistent format.

v0.8.5 - 2021-11-21¶

Added¶

Minimal compatibility support for running GECCO inside of Galaxy workflows.

v0.8.4 - 2021-09-26¶

Fixed¶

gecco convert gbk --format bigslice failing to run because of outdated code (#5).
gecco convert gbk --format bigslice not creating files with names conforming to BiG-SLiCE expected input.

Changed¶

Bump minimum pyrodigal version to v0.6.2 to use platform-accelerated code if supported.

v0.8.3-post1 - 2021-08-23¶

Fixed¶

Wrong default value for --threshold being shown in gecco run help message.

v0.8.3 - 2021-08-23¶

Changed¶

Default probability threshold for segmentation to 0.3 (from 0.4).

v0.8.2 - 2021-07-31¶

Fixed¶

gecco run crashing on Python 3.6 because of missing contextlib.nullcontext class.

Changed¶

gecco run and gecco annotate will not try to count the number of profiles when given an external HMM file with the --hmm flag.
PyHMMER.run now reports the p-value of each domain in addition to the e-value as a /note qualifier.

v0.8.1 - 2021-07-29¶

Changed¶

gecco run now filters out unneeded features before annotating, making it easier to analyze the results of a run with a custom --model.

Fixed¶

gecco reporting about using Pfam v33.1 while actually using v34.0 because of an outdated field in gecco/hmmer/Pfam.ini.

Added¶

Missing documentation for the strand attribute of gecco.model.Gene.

v0.8.0 - 2021-07-03¶

Changed¶

Retrain internal model using new sequence embeddings and remove broken/duplicate BGCs from MIBiG 2.0.
Bump minimum pyhmmer version to v0.4.0 to improve exception handling.
Bump minimum pyrodigal version to v0.5.0 to fix sequence decoding on some platforms.
Use p-values instead of e-values to filter domains obtained with HMMER.
gecco cv and gecco train now seed the RNG with a user-defined seed before shuffling rows of training data.

Fixed¶

Extraction of BGC compositions for the type predictor while training.
ClusterCRF.trained failing to open an external model.

Added¶

Domain.pvalue attribute to access the p-value of a domain annotation.
Mandatory pvalue column to FeatureTable objects.
Support for loading several feature tables in gecco train and gecco cv.
Warnings to ClusterCRF.fit when selecting uninformative features.
--correction flag to gecco train and gecco cv, allowing to give a multiple testing correction method when computing p-values with the Fisher Exact Tests.

Removed¶

Outdated gecco embed command.
Unused --truncate flag from the gecco train CLI.
Tigrfam domains, which is not improving performance on the new training data.

v0.7.0 - 2021-05-31¶

Added¶

Support for writing an AntiSMASH sideload JSON file after a gecco run workflow.
Code for converting GenBank files in BiG-SLiCE compatible format with the gecco convert subcommand.
Documentation about using GECCO in combination with AntiSMASH or BiG-SLiCE.

Changed¶

Minimum Biopython version to v1.73 for compatibility with older bioinformatics tooling.
Internal domain composition shipped in the gecco.types with newer composition array obtained directly from MIBiG files.

Removed¶

Outdated notice about -vvv verbosity level in the help message of the main gecco command.

v0.6.3 - 2021-05-10¶

Fixed¶

HMMER annotation not properly handling inputs with multiple contigs.
Some progress bar totals displaying as floats in the CLI.

Changed¶

PyHMMER now sets the Z and domZ values from the number of proteins given to the search pipeline.
gecco.cli delegates imports to make CLI more responsive.
pkg_resources has been replaced with importlib.resources and importlib.metadata where applicable.
multiprocessing.cpu_count has been replaced with os.cpu_count where applicable.

v0.6.2 - 2021-05-04¶

Fixed¶

gecco cv loto crashing because of outdated code.

Changed¶

Logging-style prompt will only display if GECCO is running with -vv flag.

Added¶

GECCO bioRxiv paper reference to Cluster.to_seq_record output record.

v0.6.1 - 2021-03-15¶

Fixed¶

Progress bar not being disabled by -q flag in CLI.
Fallback to using HMM name if accession is not available in PyHMMER.
Group genes by source contig and process them separately in PyHMMER to avoid bogus E-values.

Added¶

psutil dependency to get the number of physical CPU cores on the host machine.
Support for using an arbitrary mapping of positives to negatives in gecco embed.

Removed¶

Unused and outdated HMMER and DomainRow classes from gecco.hmmer.

v0.6.0 - 2021-02-28¶

Changed¶

Updated internal model with a cleaned-up version of the MIBiG-2.0 Pfam-33.1/Tigrfam-15.0 embedding.
Updated internal InterPro catalog.

Fixed¶

Features not being grouped together in gecco cv and gecco train when provided with a feature table where rows were not sorted by protein IDs.

v0.5.5 - 2021-02-28¶

Fixed¶

gecco cv bug causing only the last fold to be written.

v0.5.4 - 2021-02-28¶

Changed¶

Replaced verboselogs, coloredlogs and better-exceptions with rich.

Removed¶

tqdm training dependency.

Added¶

gecco annotate command to produce a feature table from a genomic file.
gecco embed to embed BGCs into non-BGC regions using feature tables.

v0.5.3 - 2021-02-21¶

Fixed¶

Coordinates of genes in output GenBank files.
Potential issue with the number of CPUs in PyHMMER.run.

Changed¶

Bump required pyrodigal version to v0.4.2 to fix buffer overflow.

v0.5.2 - 2021-01-29¶

Added¶

Support for downloading HMM files directly from GitHub releases assets.
Validation of filtered HMMs with MD5 checksum.

Fixed¶

Invalid coordinates of protein domains in GenBank output files.
gecco.interpro module not being added to wheel distribution.

Changed¶

Bump required pyhmmer version to v0.2.1.

v0.5.1 - 2021-01-15¶

Fixed¶

--hmm flag being ignored in in gecco run command.
PyHMMER using HMM names instead of accessions, causing issues with Pfam HMMs.

v0.5.0 - 2021-01-11¶

Added¶

Explicit support for Python 3.9.

Changed¶

pyhmmer is used to annotate protein sequences instead of HMMER3 binary hmmsearch.
HMM files are stored in binary format to speedup parsing and reduce storage size.
tqdm is now a training-only dependency.
gecco cv now requires training dependencies.

v0.4.5 - 2020-11-23¶

Added¶

Additional fold column to cross-validation table output.

Changed¶

Use sequence ID instead of protein ID to extract type from cluster in gecco cv.
Install HMM data in pre-pressed format to make hmmsearch runs faster on short sequences.
gecco.orf was rewritten to extract genes from input sequences in parallel.

v0.4.4 - 2020-09-30¶

Added¶

gecco cv loto command to run LOTO cross-validation using BGC types for stratification.
header keyword argument to FeatureTable.dump and ClusterTable.dump to write the table without the column header allowing to append to an existing table.
__getitem__ implementation for FeatureTable and ClusterTable that returns a single row or a sub-table from a table.

Fixed¶

gecco cv command now writes results iteratively instead of holding the tables for every fold in memory.

Changed¶

Bumped pandas training dependency to v1.0.

v0.4.3 - 2020-09-07¶

Fixed¶

GenBank files being written with invalid /cds feature type.

Changed¶

Blocked installation of Biopython v1.78 or newer as it removes Bio.Alphabet and breaks the current code.

v0.4.2 - 2020-08-07¶

Fixed¶

TypeClassifier.predict_types using inverse type probabilities when given several clusters to process.

v0.4.1 - 2020-08-07¶

Fixed¶

gecco run command crashing on input sequences not containing any genes.

v0.4.0 - 2020-08-06¶

Added¶

gecco.model.ProductType enum to model the biosynthetic class of a BGC.

Removed¶

pandas interaction from internal data model.
ClusterCRF code specific to cross-validation.

Changed¶

pandas, fisher and statsmodels dependencies are now optional.
gecco train command expects a cluster table in addition to the feature table to know the types of the input BGCs.

v0.3.0 - 2020-08-03¶

Changed¶

Replaced Nearest-Neighbours classifier with Random Forest to perform type prediction for candidate BGCs.
gecco.knn module was renamed to implementation-agnostic name gecco.types.

Fixed¶

Extraction of domain composition taking a long time in gecco train command.

Removed¶

--metric argument to the gecco run CLI command.

v0.2.2 - 2020-07-31¶

Changed¶

Domain and Gene can now carry qualifiers that are used when they are translated to a sequence feature.

Added¶

InterPro names, accessions, and HMMER e-value for each annotated domain in GenBank output files.

v0.2.1 - 2020-07-23¶

Fixed¶

Various potential crashes in ClusterRefiner code.

Removed¶

Uneeded feature dictionary filtering in ClusterCRF for models with Fisher Exact Test feature selection.

v0.2.0 - 2020-07-23¶

Fixed¶

pandas warning about unsorted columns in gecco run.

Removed¶

Gene.probability property, replaced by Gene.maximum_probability and Gene.average_probability properties to be explicit.

Changed¶

Internal model now uses Pfam and Tigrfam with the top 35% features selected with Fisher’s Exact Test.
ClusterRefiner now removes genes on Cluster edges if they do not contain any domain annotation.

v0.1.1 - 2020-07-22¶

Added¶

ClusterCRF.predict_probabilities to annotate a list of Gene.

Changed¶

BGC probability is now stored at the Domain level instead of at the Gene level, independently of the feature extraction level used by the CRF.
ClusterKNN will use the model path provided to gecco run if any.

Docs¶

Added this changelog file to document changes in the code.
Added documentation to gecco submodules missing some.
Included the CHANGELOG.md file to the generated docs.

v0.1.0 - 2020-07-17¶

Initial release.

v0.0.1 - 2018-08-13¶

Proof-of-concept.

Changelog¶

Unreleased¶

v0.9.10 - 2024-02-27¶

Fixed¶

Changed¶

v0.9.9 - 2023-11-23¶

Added¶

Fixed¶

Changed¶

v0.9.8 - 2023-06-09¶

Fixed¶

v0.9.7 - 2023-05-26¶

Added¶

Changed¶

Fixed¶

Removed¶

v0.9.6 - 2023-01-11¶

Added¶

Fixed¶

Changed¶

v0.9.5 - 2022-08-10¶

Added¶

Fixed¶

v0.9.4 - 2022-05-31¶

Added¶

Changed¶

Fixed¶

v0.9.3 - 2022-05-13¶

Changed¶

Fixed¶

v0.9.2 - 2022-04-11¶

Added¶

v0.9.1 - 2022-04-05¶

Changed¶

v0.9.1-alpha4 - 2022-03-31¶

v0.9.1-alpha3 - 2022-03-23¶

Added¶

Changed¶

v0.9.1-alpha2 - 2022-03-23¶

Fixed¶

v0.9.1-alpha1 - 2022-03-20¶

v0.8.10 - 2022-02-23¶

Fixed¶

v0.8.9 - 2022-02-22¶

Removed¶

v0.8.8 - 2022-02-21¶

Fixed¶

v0.8.7 - 2022-02-18¶

Fixed¶

Changed¶

v0.8.6 - 2022-02-17 - YANKED¶

Added¶

Changed¶

Fixed¶

v0.8.5 - 2021-11-21¶

Added¶

v0.8.4 - 2021-09-26¶

Fixed¶

Changed¶

v0.8.3-post1 - 2021-08-23¶

Fixed¶

v0.8.3 - 2021-08-23¶

Changed¶

v0.8.2 - 2021-07-31¶

Fixed¶

Changed¶

v0.8.1 - 2021-07-29¶

Changed¶

Fixed¶

Added¶

v0.8.0 - 2021-07-03¶

Changed¶

Fixed¶

Added¶

Removed¶

v0.7.0 - 2021-05-31¶

Added¶

Changed¶

Removed¶

v0.6.3 - 2021-05-10¶

Unreleased ¶