BGC Extraction

Algorithm to smooth contiguous BGC predictions into single regions.

class gecco.refine.ClusterRefiner(object)[source]

A post-processor to extract contiguous BGCs from CRF predictions.

__init__(threshold: float = 0.3, criterion: str = 'gecco', n_cds: int = 5, n_biopfams: int = 5, average_threshold: float = 0.6) → None[source]

Create a new ClusterRefiner instance.

Parameters
  • threshold (float) – The probability threshold to use to consider a protein to be part of a BGC region.

  • criterion (str) – The criterion to use when checking for BGC validity. See gecco.bgc.BGC.is_valid documentation for allowed values and expected behaviours.

  • n_cds (int) – The minimum number of CDS a gene cluster must contain to be considered valid. If criterion is gecco, then this is the minimum number of annotated CDS.

  • n_biopfams (int) – The minimum number of biosynthetic Pfam domains a gene cluster must contain to be considered valid (only when the criterion is antismash).

  • average_threshold (float) – The average probability threshold to use to consider a BGC valid (only when the criterion is antismash).
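To make the interplay of these parameters concrete, here is a minimal sketch of a validity check written only from the descriptions above. It is not the gecco.bgc.BGC.is_valid implementation: ToyGene, toy_is_valid and the biosynthetic_pfams argument are hypothetical names introduced for illustration, and counting distinct biosynthetic domains is an assumption.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List, Sequence


@dataclass
class ToyGene:
    """Hypothetical stand-in for a gene; not the gecco Gene class."""
    probability: float
    domains: List[str] = field(default_factory=list)  # Pfam accessions, if any


def toy_is_valid(
    genes: Sequence[ToyGene],
    criterion: str = "gecco",
    n_cds: int = 5,
    n_biopfams: int = 5,
    average_threshold: float = 0.6,
    biosynthetic_pfams: FrozenSet[str] = frozenset(),
) -> bool:
    """Illustrative validity check following the parameter descriptions only."""
    if not genes:
        return False
    if criterion == "gecco":
        # Count only the CDS that carry at least one domain annotation.
        annotated = sum(1 for g in genes if g.domains)
        return annotated >= n_cds
    if criterion == "antismash":
        if len(genes) < n_cds:
            return False
        # Assumption: distinct biosynthetic Pfam domains are what is counted.
        found = {d for g in genes for d in g.domains if d in biosynthetic_pfams}
        mean_p = sum(g.probability for g in genes) / len(genes)
        return len(found) >= n_biopfams and mean_p >= average_threshold
    raise ValueError(f"unknown criterion: {criterion!r}")
```

With the defaults, a candidate of five annotated genes passes the gecco criterion, while the same candidate trimmed to four genes does not.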

iter_clusters(genes)[source]

Find all clusters in a list of genes annotated with CRF predictions.

Parameters

genes (list of Gene) – A list of genes with probability annotations estimated by ClusterCRF.

Yields

gecco.model.Cluster – Valid clusters found in the input with respect to the postprocessing criterion given at initialisation.
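The general idea of extracting contiguous regions from per-gene probabilities can be sketched as follows. This is a simplified illustration, not the ClusterRefiner code: toy_iter_regions is a hypothetical helper, it works on bare probabilities rather than Gene objects, it yields index lists rather than gecco.model.Cluster instances, and the strict "greater than" comparison against the threshold is an assumption.

```python
from itertools import groupby
from typing import Iterable, Iterator, List


def toy_iter_regions(
    probabilities: Iterable[float],
    threshold: float = 0.3,
    n_cds: int = 5,
) -> Iterator[List[int]]:
    """Yield index lists of contiguous runs whose probability exceeds threshold."""
    indexed = enumerate(probabilities)
    # groupby splits the sequence into maximal runs sharing the same key,
    # here "is this gene above the threshold?".
    for above, run in groupby(indexed, key=lambda item: item[1] > threshold):
        indices = [i for i, _ in run]
        # Keep only runs above the threshold that are long enough.
        if above and len(indices) >= n_cds:
            yield indices


probs = [0.1, 0.8, 0.9, 0.7, 0.2, 0.6, 0.1]
regions = list(toy_iter_regions(probs, threshold=0.3, n_cds=2))
```

Here regions contains the single run [1, 2, 3]; the isolated gene at index 5 is above the threshold but is discarded because it forms a run shorter than n_cds.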