BGC Extraction
Algorithm to smooth contiguous gene cluster predictions into single regions.
- class gecco.refine.ClusterRefiner(object)
A post-processor to extract contiguous clusters from CRF predictions.
- __init__(threshold: float = 0.8, criterion: str = 'gecco', n_cds: int = 5, n_biopfams: int = 5, average_threshold: float = 0.6, edge_distance: int = 0) -> None
Create a new ClusterRefiner instance.
- Parameters:
threshold (float) – The probability threshold above which a protein is considered part of a gene cluster.
criterion (str) – The criterion to use when checking for cluster validity.
n_cds (int) – The minimum number of genes a gene cluster must contain to be considered valid. If criterion is gecco, then this is the minimum number of annotated CDS.
n_biopfams (int) – The minimum number of biosynthetic Pfam domains a gene cluster must contain to be considered valid (only when the criterion is antismash).
average_threshold (float) – The average probability threshold to use to consider a gene cluster valid (only when the criterion is antismash).
edge_distance (int) – The minimum distance from the edge at which the gene cluster must be located, in number of annotated genes. The cluster may start at an edge, but must then span more than edge_distance genes (only when the criterion is gecco).
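To make the interplay of these parameters more concrete, the following is a minimal sketch of how a validity check might combine them under each criterion. It is not GECCO's implementation: the `ToyGene` type, the `is_valid` helper, the choice of `>=` comparisons, and summing biosynthetic domains across genes are all assumptions made for illustration, and the `edge_distance` check is left out.

```python
from typing import List, NamedTuple


class ToyGene(NamedTuple):
    # Hypothetical stand-in for a gene with CRF output; not gecco.model.Gene.
    probability: float  # predicted probability of belonging to a cluster
    annotated: bool     # whether the CDS carries at least one domain annotation
    n_biopfams: int     # number of biosynthetic Pfam domains found in the gene


def is_valid(
    genes: List[ToyGene],
    criterion: str = "gecco",
    n_cds: int = 5,
    n_biopfams: int = 5,
    average_threshold: float = 0.6,
) -> bool:
    """Illustrative validity check for a candidate cluster (assumed semantics)."""
    if criterion == "gecco":
        # Under the 'gecco' criterion, only annotated CDS count towards n_cds.
        return sum(g.annotated for g in genes) >= n_cds
    if criterion == "antismash":
        # Assumption: all genes count, domains are summed over the cluster,
        # and the mean probability is compared with average_threshold.
        mean_p = sum(g.probability for g in genes) / len(genes) if genes else 0.0
        return (
            len(genes) >= n_cds
            and sum(g.n_biopfams for g in genes) >= n_biopfams
            and mean_p >= average_threshold
        )
    raise ValueError(f"unknown criterion: {criterion!r}")
```

With the defaults, a candidate of five annotated genes passes the gecco check, while a candidate of four does not.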
- iter_clusters(genes: List[Gene]) -> Iterator[Cluster]
Find all clusters in a table of CRF predictions.
- Parameters:
genes (list of Gene) – A list of genes with probability annotations estimated by ClusterCRF.
- Yields:
gecco.model.Cluster – Valid clusters found in the input with respect to the postprocessing criterion given at initialisation.
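The general idea of extracting contiguous regions from per-gene probabilities can be sketched in plain Python. This is an illustrative approximation, not the code behind iter_clusters: the `iter_segments` name, the strict `>` comparison, and the `min_genes` filter are assumptions, and the sketch works on bare probabilities rather than Gene objects.

```python
from typing import Iterator, Optional, Sequence


def iter_segments(
    probabilities: Sequence[float],
    threshold: float = 0.8,
    min_genes: int = 5,
) -> Iterator[range]:
    """Yield index ranges of contiguous genes whose probability exceeds `threshold`."""
    start: Optional[int] = None  # index where the current run began, if any
    for i, p in enumerate(probabilities):
        if p > threshold:
            if start is None:
                start = i  # open a new run
        elif start is not None:
            # The run ended at i - 1; keep it only if it is long enough.
            if i - start >= min_genes:
                yield range(start, i)
            start = None
    # Flush a run that extends to the end of the sequence.
    if start is not None and len(probabilities) - start >= min_genes:
        yield range(start, len(probabilities))


probs = [0.1, 0.9, 0.95, 0.9, 0.2, 0.9, 0.9]
segments = list(iter_segments(probs, min_genes=2))
```

Because the function is a generator, segments are produced lazily, which mirrors the Iterator return type in the signature above.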