Changelog¶
All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog and this project adheres to Semantic Versioning.
Unreleased¶
v0.9.10 - 2024-02-27¶
Fixed¶
Progress reading display when reading from compressed files.
Change labeling routine to use broad overlaps when annotating genes with cluster tables (#15).
Changed¶
Bump supported polars dependency to v0.20.
Bump supported statsmodels dependency to v0.14.
Report identifier of sequences with uni-valued labels when training.
v0.9.9 - 2023-11-23¶
Added¶
Support for gzip, bzip2, lz4 and xz-compressed input files.
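One common way to support several compression formats is to look at the first bytes of the file. The sketch below illustrates that idea with the Python standard library only; it is not taken from the GECCO code base, and the helper names `detect_compression` and `open_maybe_compressed` are hypothetical.

```python
import bz2
import gzip
import lzma

# Magic bytes found at the start of each compressed stream format.
_MAGIC = {
    b"\x1f\x8b": "gzip",
    b"BZh": "bzip2",
    b"\xfd7zXZ\x00": "xz",
    b"\x04\x22\x4d\x18": "lz4",  # LZ4 frame format
}


def detect_compression(head: bytes):
    """Return the compression format name, or None if unrecognized."""
    for magic, name in _MAGIC.items():
        if head.startswith(magic):
            return name
    return None


def open_maybe_compressed(path):
    """Open `path` for binary reading, decompressing if needed (hypothetical helper)."""
    with open(path, "rb") as handle:
        head = handle.read(6)
    kind = detect_compression(head)
    if kind == "gzip":
        return gzip.open(path, "rb")
    if kind == "bzip2":
        return bz2.open(path, "rb")
    if kind == "xz":
        return lzma.open(path, "rb")
    if kind == "lz4":
        # lz4 is not in the standard library; a third-party package is needed.
        raise NotImplementedError("lz4 requires a third-party package")
    return open(path, "rb")
```

Detecting the format from the content rather than from the file extension means renamed or extension-less files are still handled.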
Fixed¶
Outdated use of pandas API in gecco cv command.
Changed¶
Bump pyhmmer dependency to v0.10.0.
Bump pyrodigal dependency to v3.0.0.
Make gecco cv output a gene table with a ground truth column.
v0.9.8 - 2023-06-09¶
Fixed¶
ClusterTable.from_clusters extracting cluster IDs in the wrong column.
Deprecation warnings in polars.read_csv and polars.write_csv with recent polars versions.
Deprecation warnings in importlib_resources with recent Python versions.
v0.9.7 - 2023-05-26¶
Added¶
Command line option to annotate proteins using bitscore cutoffs from HMMs.
Command line option to disentangle overlapping domains after HMM annotation.
Changed¶
Bump pyhmmer dependency to v0.8.0.
Bump pyrodigal dependency to v2.1.0.
Rewrite gecco.model to use polars for managing tabular data.
Replace pandas dependencies with polars.
Update gecco run to skip type classification for tasks without an assigned cluster type.
Fixed¶
Cluster.to_seq_record crashing when called on a cluster with types attribute unset.
Progress bar resetting when performing domain annotation with multiple HMMs.
Removed¶
Support for Python 3.7.
v0.9.6 - 2023-01-11¶
Added¶
Gene Ontology annotations to gecco.interpro local metadata.
Reference to Gene Ontology terms and derived functions to gecco.model.Domain objects.
Gene color based on predicted function in gecco.model.Gene.to_seq_feature.
Fixed¶
Missing gzip import in the CLI preventing usage of gzip-compressed inputs.
Invalid coordinates of domains found in reverse-strand genes.
Detection of entry points with importlib.metadata on older Python versions.
Changed¶
bgc_id columns of cluster tables are renamed cluster_id.
gecco.model.ProductType is renamed to gecco.model.ClusterType.
Bumped pyrodigal dependency to v2.0.
Bumped pyhmmer dependency to v0.7.
v0.9.5 - 2022-08-10¶
Added¶
gecco predict command to predict BGCs from an annotated genome.
Protein.with_seq function to assign a new sequence to a protein object.
Fixed¶
Issue with antiSMASH sideload JSON file generation in gecco run and gecco predict.
Make gecco.orf handle STOP codons consistently (#9).
v0.9.4 - 2022-05-31¶
Added¶
classes_ property to TypeClassifier to access the classes_ attribute of the TypeBinarizer.
Alternative ORF finder CDSFinder which simply extracts CDS features from input sequences (#8).
Support for annotating domains with “exclusive” HMMs to annotate genes with at most one HMM from the library.
Changed¶
ProductType is not restricted to MIBiG types anymore and can support any string as a base type identifier.
PyrodigalFinder now uses multiprocessing.pool.ThreadPool instead of custom thread code thanks to OrfFinder.find_genes reentrancy introduced in Pyrodigal v1.0.
PyrodigalFinder can now be used in single / non-meta mode from the API.
Bumped minimum rich version to 12.3 to use None total in progress bars when the size of an HMM library is unknown.
Fixed¶
Broken MyPy type annotations in the gecco.model and gecco.cli modules.
v0.9.3 - 2022-05-13¶
Changed¶
--format flag of gecco annotate and gecco run CLI commands is now lowercased before its value is passed to Bio.SeqIO.
Fixed¶
Genes with duplicate IDs being silently ignored in HMMER.run.
v0.9.2 - 2022-04-11¶
Added¶
Padding of short sequences with empty genes when predicting probabilities in ClusterCRF.
v0.9.1 - 2022-04-05¶
Changed¶
Make the genes.tsv and features.tsv tables contain all genes even when they come from a contig too short to be processed by the CRF sliding window.
Replaced the --force-clusters-tsv flag with a --force-tsv flag to force writing TSV tables even when no genes or clusters were found in gecco run or gecco annotate.
v0.9.1-alpha4 - 2022-03-31¶
Retrain internal model with:
$ python -m gecco -vv train --c1 0.4 --c2 0 --select 0.25 --window-size 20 \
-f mibig-2.0.proG2.Pfam-v35.0.features.tsv \
-c mibig-2.0.proG2.clusters.tsv \
-g GECCO-data/data/embeddings/mibig-2.0.proG2.genes.tsv \
-o models/v0.9.1-alpha4
v0.9.1-alpha3 - 2022-03-23¶
Added¶
gecco.model.GeneTable class to store gene coordinates independently of protein domains.
Changed¶
Refactored implementation of load and dump methods for Table classes into a dedicated base class.
gecco run and gecco annotate now output a gene table in addition to the feature and cluster tables.
gecco train expects a gene table instead of a GFF file for the gene coordinates.
v0.9.1-alpha2 - 2022-03-23¶
Fixed¶
TypeClassifier.trained not being able to read unknown types from type tables.
v0.9.1-alpha1 - 2022-03-20¶
Candidate release with support for a sliding window in the CRF prediction algorithm.
v0.8.10 - 2022-02-23¶
Fixed¶
--antismash-sideload flag of gecco run causing command to crash.
v0.8.9 - 2022-02-22¶
Removed¶
Prediction and support for the Other biosynthetic type of MIBiG clusters.
v0.8.8 - 2022-02-21¶
Fixed¶
ClusterRefiner filtering method for edge genes not working as intended.
gecco run and gecco annotate commands crashing on missing input files instead of nicely rendering the error.
v0.8.7 - 2022-02-18¶
Fixed¶
interpro.json metadata file not being included in distribution files.
Missing docstring for Protein.with_domains method.
Changed¶
Bump minimum scikit-learn version to v1.0 for Python 3.7+.
v0.8.6 - 2022-02-17 - YANKED¶
Added¶
CLI flag for enabling region masking for contigs processed by Prodigal.
CLI flag for controlling region distance used for edge distance filtering.
Changed¶
gecco.model.Gene and gecco.model.Protein are now immutable data classes.
Bump minimum pyrodigal version to v0.6.4 to use region masking.
Implement filtering for extracted clusters based on distance to the contig edge.
Store InterPro metadata file uncompressed for version-control integration.
Fixed¶
Mark BGC0000930 as Terpene in the type classifier data.
Progress bar messages are now in consistent format.
v0.8.5 - 2021-11-21¶
Added¶
Minimal compatibility support for running GECCO inside of Galaxy workflows.
v0.8.4 - 2021-09-26¶
Fixed¶
gecco convert gbk --format bigslice failing to run because of outdated code (#5).
gecco convert gbk --format bigslice not creating files with names conforming to BiG-SLiCE expected input.
Changed¶
Bump minimum pyrodigal version to v0.6.2 to use platform-accelerated code if supported.
v0.8.3-post1 - 2021-08-23¶
Fixed¶
Wrong default value for --threshold being shown in gecco run help message.
v0.8.3 - 2021-08-23¶
Changed¶
Default probability threshold for segmentation to 0.3 (from 0.4).
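To illustrate what a segmentation threshold does, the sketch below groups consecutive items whose probability exceeds the threshold into candidate segments. It is an assumption-laden illustration and not GECCO's actual implementation: the function name `segment` is hypothetical, and whether the real comparison is strict is not stated in this changelog.

```python
from itertools import groupby


def segment(probabilities, threshold=0.3):
    """Return (start, end) index pairs, end exclusive, of runs above `threshold`."""
    segments = []
    index = 0
    # Group consecutive probabilities by whether they exceed the threshold
    # (strict comparison is an assumption made for this sketch).
    for above, run in groupby(probabilities, key=lambda p: p > threshold):
        length = len(list(run))
        if above:
            segments.append((index, index + length))
        index += length
    return segments
```

With such a scheme, lowering the threshold from 0.4 to 0.3 can only extend or add segments, never remove them.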
v0.8.2 - 2021-07-31¶
Fixed¶
gecco run crashing on Python 3.6 because of missing contextlib.nullcontext class.
Changed¶
gecco run and gecco annotate will not try to count the number of profiles when given an external HMM file with the --hmm flag.
PyHMMER.run now reports the p-value of each domain in addition to the e-value as a /note qualifier.
v0.8.1 - 2021-07-29¶
Changed¶
gecco run now filters out unneeded features before annotating, making it easier to analyze the results of a run with a custom --model.
Fixed¶
gecco reporting about using Pfam v33.1 while actually using v34.0 because of an outdated field in gecco/hmmer/Pfam.ini.
Added¶
Missing documentation for the strand attribute of gecco.model.Gene.
v0.8.0 - 2021-07-03¶
Changed¶
Retrain internal model using new sequence embeddings and remove broken/duplicate BGCs from MIBiG 2.0.
Bump minimum pyhmmer version to v0.4.0 to improve exception handling.
Bump minimum pyrodigal version to v0.5.0 to fix sequence decoding on some platforms.
Use p-values instead of e-values to filter domains obtained with HMMER.
gecco cv and gecco train now seed the RNG with a user-defined seed before shuffling rows of training data.
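Seeding the random number generator before shuffling makes the row order reproducible between runs. The following minimal sketch shows the general technique with the standard library; the function name `shuffled_rows` and the default seed are hypothetical and do not reflect how GECCO implements it.

```python
import random


def shuffled_rows(rows, seed=42):
    """Return a shuffled copy of `rows`, reproducible for a given `seed`."""
    rng = random.Random(seed)  # dedicated generator, leaves global state untouched
    rows = list(rows)
    rng.shuffle(rows)
    return rows
```

Using a dedicated `random.Random` instance rather than the module-level functions keeps the shuffle independent of any other code that draws random numbers.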
Fixed¶
Extraction of BGC compositions for the type predictor while training.
ClusterCRF.trained failing to open an external model.
Added¶
Domain.pvalue attribute to access the p-value of a domain annotation.
Mandatory pvalue column to FeatureTable objects.
Support for loading several feature tables in gecco train and gecco cv.
Warnings to ClusterCRF.fit when selecting uninformative features.
--correction flag to gecco train and gecco cv, to select a multiple testing correction method when computing p-values with Fisher's Exact Test.
Removed¶
Outdated gecco embed command.
Unused --truncate flag from the gecco train CLI.
Tigrfam domains, which do not improve performance on the new training data.
v0.7.0 - 2021-05-31¶
Added¶
Support for writing an AntiSMASH sideload JSON file after a gecco run workflow.
Code for converting GenBank files in BiG-SLiCE compatible format with the gecco convert subcommand.
Documentation about using GECCO in combination with AntiSMASH or BiG-SLiCE.
Changed¶
Minimum Biopython version to v1.73 for compatibility with older bioinformatics tooling.
Internal domain composition shipped in the gecco.types with newer composition array obtained directly from MIBiG files.
Removed¶
Outdated notice about -vvv verbosity level in the help message of the main gecco command.
v0.6.3 - 2021-05-10¶
Fixed¶
HMMER annotation not properly handling inputs with multiple contigs.
Some progress bar totals displaying as floats in the CLI.
Changed¶
PyHMMER now sets the Z and domZ values from the number of proteins given to the search pipeline.
gecco.cli delegates imports to make CLI more responsive.
pkg_resources has been replaced with importlib.resources and importlib.metadata where applicable.
multiprocessing.cpu_count has been replaced with os.cpu_count where applicable.
v0.6.2 - 2021-05-04¶
Fixed¶
gecco cv loto crashing because of outdated code.
Changed¶
Logging-style prompt will only display if GECCO is running with -vv flag.
Added¶
GECCO bioRxiv paper reference to Cluster.to_seq_record output record.
v0.6.1 - 2021-03-15¶
Fixed¶
Progress bar not being disabled by -q flag in CLI.
Fallback to using HMM name if accession is not available in PyHMMER.
Group genes by source contig and process them separately in PyHMMER to avoid bogus E-values.
Added¶
psutil dependency to get the number of physical CPU cores on the host machine.
Support for using an arbitrary mapping of positives to negatives in gecco embed.
Removed¶
Unused and outdated HMMER and DomainRow classes from gecco.hmmer.
v0.6.0 - 2021-02-28¶
Changed¶
Updated internal model with a cleaned-up version of the MIBiG-2.0 Pfam-33.1/Tigrfam-15.0 embedding.
Updated internal InterPro catalog.
Fixed¶
Features not being grouped together in gecco cv and gecco train when provided with a feature table where rows were not sorted by protein IDs.
v0.5.5 - 2021-02-28¶
Fixed¶
gecco cv bug causing only the last fold to be written.
v0.5.4 - 2021-02-28¶
Changed¶
Replaced verboselogs, coloredlogs and better-exceptions with rich.
Removed¶
tqdm training dependency.
Added¶
gecco annotate command to produce a feature table from a genomic file.
gecco embed to embed BGCs into non-BGC regions using feature tables.
v0.5.3 - 2021-02-21¶
Fixed¶
Coordinates of genes in output GenBank files.
Potential issue with the number of CPUs in PyHMMER.run.
Changed¶
Bump required pyrodigal version to v0.4.2 to fix buffer overflow.
v0.5.2 - 2021-01-29¶
Added¶
Support for downloading HMM files directly from GitHub releases assets.
Validation of filtered HMMs with MD5 checksum.
Fixed¶
Invalid coordinates of protein domains in GenBank output files.
gecco.interpro module not being added to wheel distribution.
Changed¶
Bump required pyhmmer version to v0.2.1.
v0.5.1 - 2021-01-15¶
Fixed¶
--hmm flag being ignored in gecco run command.
PyHMMER using HMM names instead of accessions, causing issues with Pfam HMMs.
v0.5.0 - 2021-01-11¶
Added¶
Explicit support for Python 3.9.
Changed¶
pyhmmer is used to annotate protein sequences instead of HMMER3 binary hmmsearch.
HMM files are stored in binary format to speed up parsing and reduce storage size.
tqdm is now a training-only dependency.
gecco cv now requires training dependencies.
v0.4.5 - 2020-11-23¶
Added¶
Additional fold column to cross-validation table output.
Changed¶
Use sequence ID instead of protein ID to extract type from cluster in gecco cv.
Install HMM data in pre-pressed format to make hmmsearch runs faster on short sequences.
gecco.orf was rewritten to extract genes from input sequences in parallel.
v0.4.4 - 2020-09-30¶
Added¶
gecco cv loto command to run LOTO cross-validation using BGC types for stratification.
header keyword argument to FeatureTable.dump and ClusterTable.dump to write the table without the column header, which allows appending to an existing table.
__getitem__ implementation for FeatureTable and ClusterTable that returns a single row or a sub-table from a table.
Fixed¶
gecco cv command now writes results iteratively instead of holding the tables for every fold in memory.
Changed¶
Bumped pandas training dependency to v1.0.
v0.4.3 - 2020-09-07¶
Fixed¶
GenBank files being written with invalid /cds feature type.
Changed¶
Blocked installation of Biopython v1.78 or newer as it removes Bio.Alphabet and breaks the current code.
v0.4.2 - 2020-08-07¶
Fixed¶
TypeClassifier.predict_types using inverse type probabilities when given several clusters to process.
v0.4.1 - 2020-08-07¶
Fixed¶
gecco run command crashing on input sequences not containing any genes.
v0.4.0 - 2020-08-06¶
Added¶
gecco.model.ProductType enum to model the biosynthetic class of a BGC.
Removed¶
pandas interaction from internal data model.
ClusterCRF code specific to cross-validation.
Changed¶
pandas, fisher and statsmodels dependencies are now optional.
gecco train command expects a cluster table in addition to the feature table to know the types of the input BGCs.
v0.3.0 - 2020-08-03¶
Changed¶
Replaced Nearest-Neighbours classifier with Random Forest to perform type prediction for candidate BGCs.
gecco.knn module was renamed to implementation-agnostic name gecco.types.
Fixed¶
Extraction of domain composition taking a long time in gecco train command.
Removed¶
--metric argument to the gecco run CLI command.
v0.2.2 - 2020-07-31¶
Changed¶
Domain and Gene can now carry qualifiers that are used when they are translated to a sequence feature.
Added¶
InterPro names, accessions, and HMMER e-value for each annotated domain in GenBank output files.
v0.2.1 - 2020-07-23¶
Fixed¶
Various potential crashes in ClusterRefiner code.
Removed¶
Unneeded feature dictionary filtering in ClusterCRF for models with Fisher Exact Test feature selection.
v0.2.0 - 2020-07-23¶
Fixed¶
pandas warning about unsorted columns in gecco run.
Removed¶
Gene.probability property, replaced by Gene.maximum_probability and Gene.average_probability properties to be explicit.
Changed¶
Internal model now uses Pfam and Tigrfam with the top 35% features selected with Fisher’s Exact Test.
ClusterRefiner now removes genes on Cluster edges if they do not contain any domain annotation.
v0.1.1 - 2020-07-22¶
Added¶
ClusterCRF.predict_probabilities to annotate a list of Gene.
Changed¶
BGC probability is now stored at the Domain level instead of at the Gene level, independently of the feature extraction level used by the CRF.
ClusterKNN will use the model path provided to gecco run if any.
Docs¶
Added this changelog file to document changes in the code.
Added documentation to gecco submodules missing some.
Included the CHANGELOG.md file in the generated docs.
v0.1.0 - 2020-07-17¶
Initial release.
v0.0.1 - 2018-08-13¶
Proof-of-concept.
