Biosynthetic Gene Cluster prediction with Conditional Random Fields.



GECCO is a fast and scalable method for identifying putative novel Biosynthetic Gene Clusters (BGCs) in genomic and metagenomic data using Conditional Random Fields (CRFs).

GECCO is developed in the Zeller group and is part of the suite of computational microbiome analysis tools hosted at EMBL.



GECCO is implemented in Python and supports Python 3.6 and later. Install GECCO with pip:

$ pip install gecco-tool

Or with Conda, using the bioconda channel:

$ conda install -c bioconda gecco


GECCO works with DNA sequences, and loads them using Biopython, allowing it to support a large variety of formats, including the common FASTA and GenBank files.

Run a prediction on a FASTA file named sequence.fna and output the predictions to the current directory:

$ gecco -v run --genome sequence.fna
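When several genomes need to be processed, the same command can be driven from Python. The following is a minimal sketch, assuming only the `-v`, `run` and `--genome` arguments shown above; the helper names `build_gecco_command` and `run_gecco` are hypothetical and not part of GECCO.

```python
import shutil
import subprocess


def build_gecco_command(genome, verbose=True):
    """Return the argument list for one `gecco run` call (hypothetical helper)."""
    cmd = ["gecco"]
    if verbose:
        cmd.append("-v")
    # Only the arguments shown in this README are used here.
    cmd += ["run", "--genome", str(genome)]
    return cmd


def run_gecco(genomes):
    """Run GECCO on each genome in turn and return the exit codes."""
    if shutil.which("gecco") is None:
        raise RuntimeError("the `gecco` executable was not found on PATH")
    return [subprocess.run(build_gecco_command(g)).returncode for g in genomes]


# Show the command that would be run for the example file.
print(" ".join(build_gecco_command("sequence.fna")))
```

Calling `run_gecco(["a.fna", "b.fna"])` would then run the predictions one after another, each writing its results to the current directory.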


GECCO will create the following files once done (using the same prefix as the input file):

  • {sequence}.genes.tsv: The genes file, containing the genes found by Pyrodigal and per-gene BGC probabilities predicted by the CRF.

  • {sequence}.features.tsv: The features file, containing the domains identified in the predicted genes.

  • {sequence}.clusters.tsv: If any BGCs were found, a clusters file, containing the coordinates of the predicted clusters, along with their putative biosynthetic type.

  • {sequence}_cluster_{N}.gbk: If any BGCs were found, one GenBank file per cluster, containing the cluster sequence annotated with its member proteins and domains. These files can be opened with a standard GenBank viewer, such as UGENE.
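The naming scheme above can be used to locate and load the results from a script. The sketch below assumes that the prefix is the input file name without its extension (as in the `sequence.fna` example) and that the tables are tab-separated with a header row; the function names are hypothetical, and no particular column names are assumed.

```python
import csv
from pathlib import Path


def expected_tables(genome, output_dir="."):
    """Map each table kind to the path it is expected to have."""
    prefix = Path(genome).stem  # assumption: prefix = file name without extension
    out = Path(output_dir)
    return {
        kind: out / f"{prefix}.{kind}.tsv"
        for kind in ("genes", "features", "clusters")
    }


def cluster_genbank_files(genome, output_dir="."):
    """List the per-cluster GenBank files found for a genome, sorted by name."""
    start = f"{Path(genome).stem}_cluster_"
    return sorted(
        p for p in Path(output_dir).iterdir()
        if p.name.startswith(start) and p.suffix == ".gbk"
    )


def read_table(path):
    """Read a tab-separated table into a list of dicts keyed by its header."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))
```

Since the clusters table and the GenBank files are only written when BGCs are found, a caller should check `path.exists()` before reading them.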


GECCO can be cited using the following preprint:

Accurate de novo identification of biosynthetic gene clusters with GECCO. Laura M Carroll, Martin Larralde, Jonas Simon Fleck, Ruby Ponnudurai, Alessio Milanese, Elisa Cappio Barazzone, Georg Zeller. bioRxiv 2021.05.03.442509; doi:10.1101/2021.05.03.442509



If you have any questions about GECCO, run into an issue, or would like to request a feature, please create an issue in the GitHub repository. You can also contact Martin Larralde directly via email.


If you want to contribute to GECCO, please have a look at the contribution guide first, and feel free to open a pull request on the GitHub repository.





GECCO is released under the GNU General Public License v3 or later, and is fully open-source. The LICENSE file distributed with the software contains the complete license text.


GECCO is developed by the Zeller Team at the European Molecular Biology Laboratory in Heidelberg. The following individuals contributed to the development of GECCO: