# Training

## Software Requirements

In order to train GECCO, you need to have it installed with the `train` optional dependencies. This can be done with pip:

```
$ pip install gecco-tool[train]
```

This will install additional Python packages, such as `pandas`, which is needed to process the feature tables, and `fisher`, which is used to select the most informative domains.

## Domain database

GECCO needs HMM domains to use as features. Installing the `gecco-tool` package will also install a subset of the Pfam database that can be used for making predictions. However, this subset should not be used for training, since a different subset of domains may be selected with different training data. You can get the latest version of Pfam (35.0 as of December 2021) from the EMBL FTP server:

```
$ wget "ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam35.0/Pfam-A.hmm.gz" -O Pfam35.hmm.gz
```


You are also free to get additional HMMs from other databases, such as TIGRFAM or PANTHER, or even to build your own HMMs, as long as they are in HMMER format.

## Training sequences

### Regions in their genomic context

The easiest case for training GECCO is when you have entire genomes with regions of interest. In that case, you can directly use these sequences, and you will only have to prepare a cluster table with the coordinates of each positive region.
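As a minimal sketch of preparing such a cluster table (using Python's standard `csv` module; the format and example coordinates are those shown in the Cluster tables section below):

```python
import csv

# Example regions taken from the cluster table shown later in this guide:
# (sequence_id, bgc_id, start, end, type)
regions = [
    ("AFPU01000001", "BGC0000001", 806243, 865563, "Polyketide"),
    ("MTON01000024", "BGC0001910", 129748, 142173, "Terpene"),
]

# Write a tab-separated cluster table with a header row.
with open("clusters.tsv", "w", newline="") as handle:
    writer = csv.writer(handle, delimiter="\t")
    writer.writerow(["sequence_id", "bgc_id", "start", "end", "type"])
    writer.writerows(regions)
```

Any tool producing the same tab-separated layout works just as well; the table only needs to list one region per row.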

### Regions from separate datasets

In the event you don’t have the genomic context available for your regions of interest, you will have to provide a “fake” context by embedding the positive regions into contigs that don’t contain any positives.

GECCO was trained to detect Biosynthetic Gene Clusters, so the MIBiG database was used to get the positives, with some additional filtering to remove redundancy and entries with invalid annotations. For the negative regions, we used representative genomes from the proGenomes database, and masked known BGCs using antiSMASH.
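As a rough sketch of what this embedding step looks like (plain Python strings stand in for the sequences; the helper and its 1-based, inclusive coordinate convention are assumptions for illustration, not part of GECCO itself):

```python
import random

def embed_region(negative_contig: str, positive_region: str, seed: int = 42):
    """Insert a positive region at a random position inside a negative
    contig, and return the new sequence together with the region's
    coordinates (assumed 1-based and inclusive, for the cluster table)."""
    rng = random.Random(seed)
    # Pick an insertion point anywhere in the negative contig.
    pos = rng.randrange(len(negative_contig) + 1)
    embedded = negative_contig[:pos] + positive_region + negative_contig[pos:]
    start = pos + 1                       # 1-based start of the region
    end = pos + len(positive_region)      # inclusive end of the region
    return embedded, start, end
```

The recorded `start` and `end` coordinates would then go into the cluster table for the embedded contig.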

## Feature tables

GECCO does not train on sequences directly, but on feature tables. You can build the feature table yourself (see below for the expected format), but the easiest way to obtain one from your sequences is the `gecco annotate` subcommand. To build a table from a collection of nucleotide sequences in `sequences.fna` and HMMs in `Pfam35.hmm.gz`, use:

```
$ gecco annotate --genome sequences.fna --hmm Pfam35.hmm.gz -o features.tsv
```

> **Hint:** If you have more than one HMM file, you can add additional `--hmm` flags so that all of them are used.

The feature table is a TSV file that looks like this, with one row per domain, per protein, per sequence:

| sequence_id | protein_id | start | end | strand | domain | hmm | i_evalue | pvalue | domain_start | domain_end |
|---|---|---|---|---|---|---|---|---|---|---|
| AFPU01000001 | AFPU01000001_1 | 3 | 2555 | + | PF01573 | Pfam35 | 95.95209 | 0.00500 | 2 | 27 |
| AFPU01000001 | AFPU01000001_2 | 2610 | 4067 | - | PF17032 | Pfam35 | 0.75971 | 3.961e-05 | 83 | 142 |
| AFPU01000001 | AFPU01000001_2 | 2610 | 4067 | - | PF13719 | Pfam35 | 4.89304 | 0.000255 | 85 | 98 |

> **Hint:** If this step takes too long, you can also split the file containing your input sequences, process the chunks independently in parallel, and combine the results.

## Cluster tables

The cluster table provides additional information to GECCO: the location of each positive region in the input data, and the type of each region (if it makes sense). You need to build this table manually, but it should be quite straightforward.

> **Hint:** If a region has more than one type, use `;` to separate the types in the `type` column. For instance, a Polyketide/NRP hybrid cluster can be marked with the type `Polyketide;NRP`.

The cluster table is a TSV file that looks like this, with one row per region:

| sequence_id | bgc_id | start | end | type |
|---|---|---|---|---|
| AFPU01000001 | BGC0000001 | 806243 | 865563 | Polyketide |
| MTON01000024 | BGC0001910 | 129748 | 142173 | Terpene |

> **Hint:** If the concept of “type” makes no sense for the regions you are trying to detect, you can omit the `type` column entirely. This will effectively mark all the regions from the training sequences as “Unknown”.

## Fitting the model

Now that you have everything needed, it’s time to train GECCO! Use the following command to fit the CRF model and the type classifier:

```
$ gecco -vv train --features features.tsv --clusters clusters.tsv -o model
```


GECCO will create a directory named model containing all the required files to make predictions later on.

### L1/L2 regularisation

Use the `--c1` and `--c2` flags to control the weights of the L1 and L2 regularisation, respectively. Both default to 0.15 on the command line; however, for training GECCO, we disabled L2 regularisation and selected a value of 0.4 for $C_1$ by optimizing on an external validation dataset.
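As a sketch of what these two weights control (with $w$ the CRF weights; the exact objective depends on the underlying CRF implementation), the penalised training objective has the elastic-net form:

$$
\min_w \; -\sum_{i} \log p\!\left(y^{(i)} \mid x^{(i)}; w\right) + C_1 \lVert w \rVert_1 + C_2 \lVert w \rVert_2^2
$$

so the settings described above correspond to $C_1 = 0.4$ and $C_2 = 0$, keeping only the sparsity-inducing L1 term.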

### Feature selection

GECCO supports selecting the most informative features from the training dataset using a simple contingency test for the presence/absence of each domain in the regions of interest. Reducing the number of features helps the CRF model achieve better accuracy. It also greatly reduces the time needed to make predictions, since the HMM annotation step can skip the uninformative domains.
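As an illustrative sketch of such a contingency test (a hand-rolled, one-sided Fisher exact test on a 2×2 table of domain presence/absence inside and outside the positive regions; GECCO itself relies on the `fisher` package, so details may differ):

```python
from math import comb

def fisher_one_sided(a: int, b: int, c: int, d: int) -> float:
    """One-sided Fisher exact p-value for a 2x2 contingency table:

                       in regions   outside regions
        domain seen        a              b
        domain absent      c              d

    Returns P(X >= a) under the hypergeometric null distribution;
    a small p-value suggests the domain is enriched inside the
    regions of interest.
    """
    n = a + b + c + d          # total observations
    k_domain = a + b           # observations where the domain is seen
    m_region = a + c           # observations inside the regions
    p = 0.0
    # Sum the hypergeometric tail for counts at least as extreme as `a`.
    for k in range(a, min(k_domain, m_region) + 1):
        p += comb(k_domain, k) * comb(n - k_domain, m_region - k) / comb(n, m_region)
    return p
```

Domains would then be ranked by p-value, and only the fraction with the lowest p-values kept as features.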

Use the `--select` flag to select a fraction of the most informative features before training, reducing the total feature set (for instance, use `--select 0.3` to keep the 30% of features with the lowest Fisher p-values).

```
$ gecco train --features features.tsv --clusters clusters.tsv -o model --select 0.3
```

> **Hint:** You will get a warning in case the selected p-value threshold is still too high, resulting in non-informative domains being included in the selected features.

## Predicting with the new model

To make predictions with a model different from the one embedded in GECCO, you will need the folder from a previous `gecco train` run, as well as the HMMs used to build the feature tables in the first place.

```
$ gecco run --model model --hmm Pfam35.hmm.gz --genome genome.fa -o ./predictions/
```


Congratulations, you trained GECCO with your own dataset, and successfully used it to make predictions!