BGC Detection
Gene cluster prediction using a conditional random field.
- class gecco.crf.ClusterCRF(object)
A wrapper for sklearn_crfsuite.CRF to work with the GECCO data model.
- classmethod trained(model_path: Optional[str] = None) -> ClusterCRF
Create a new pre-trained ClusterCRF instance from a model path.
- Parameters:
model_path (str, optional) – The path to the model directory obtained with the gecco train command. If None is given, the embedded model is used.
- Returns:
ClusterCRF – A CRF model that can be used to perform predictions without training first.
- Raises:
ValueError – If the model data does not match its hash.
- __init__(feature_type: str = 'protein', algorithm: str = 'lbfgs', window_size: int = 5, window_step: int = 1, **kwargs: Any) -> None
Create a new ClusterCRF instance.
- Parameters:
feature_type (str) – Defines how features should be extracted. Should be either domain or protein.
algorithm (str) – The optimization algorithm for the model. See https://sklearn-crfsuite.readthedocs.io/en/latest/api.html for available values.
window_size (int) – The size of the sliding window to use when training and predicting probabilities on sequences of genes.
window_step (int) – The step between consecutive sliding windows to use when training and predicting probabilities on sequences of genes.
Any additional keyword argument is passed as-is to the internal CRF constructor.
- Raises:
ValueError – If feature_type has an invalid value.
TypeError – If one of the *_columns arguments is not iterable.
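The interplay of window_size and window_step can be pictured with a short sketch. The helper below is hypothetical and not part of GECCO; it only shows how a window of window_size genes advancing by window_step would cover a sequence, assuming that windows running past the end of the sequence are dropped.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def sliding_windows(
    items: Sequence[T], window_size: int = 5, window_step: int = 1
) -> Iterator[List[T]]:
    """Yield consecutive windows of `window_size` items, `window_step` apart.

    Hypothetical helper for illustration; GECCO's internal windowing may differ.
    """
    if window_size < 1 or window_step < 1:
        raise ValueError("window_size and window_step must be positive")
    # Stop at the last start position where a full window still fits.
    for start in range(0, len(items) - window_size + 1, window_step):
        yield list(items[start:start + window_size])


genes = ["g1", "g2", "g3", "g4", "g5", "g6", "g7"]
windows = list(sliding_windows(genes, window_size=5, window_step=1))
```

With seven genes and the default values, this yields three overlapping windows. A sequence shorter than window_size yields none, which is the case the pad argument of predict_probabilities addresses.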
- predict_probabilities(genes: Iterable[Gene], *, pad: bool = True, progress: Optional[Callable[[int, int], None]] = None) -> List[Gene]
Predict how likely each given gene is to be part of a gene cluster.
- Parameters:
genes (iterable of Gene) – The genes to compute probabilities for.
- Keyword Arguments:
batch_size (int) – The number of samples to load per batch. Ignored, always 1 with the CRF.
pad (bool) – Whether to pad sequences too small for a single window. Setting this to False will skip probability prediction entirely for sequences smaller than the window size.
progress (callable) – A callable that accepts two int arguments: the current batch index and the total number of batches.
- Returns:
list of Gene – A list of new Gene objects with their probability set.
- Raises:
NotFittedError – When calling this method on an object that has not been fitted yet.
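The pad and progress arguments can be illustrated with a small stand-in that does not call GECCO. The function name, the None filler and the zero-based batch index are all assumptions made for this sketch: a sequence shorter than the window is padded up to one full window when pad is true, and skipped otherwise.

```python
from typing import Callable, List, Optional, Sequence


def windows_or_skip(
    sequence: Sequence[str],
    window_size: int,
    *,
    pad: bool = True,
    progress: Optional[Callable[[int, int], None]] = None,
) -> List[List[Optional[str]]]:
    """Split `sequence` into windows, padding or skipping short sequences."""
    items: List[Optional[str]] = list(sequence)
    if len(items) < window_size:
        if not pad:
            return []  # too short and padding disabled: nothing to predict on
        # Pad with a placeholder so that exactly one full window exists.
        items += [None] * (window_size - len(items))
    starts = range(len(items) - window_size + 1)
    batches = []
    for index, start in enumerate(starts):
        batches.append(items[start:start + window_size])
        if progress is not None:
            # Report the current batch index and the total number of batches.
            progress(index, len(starts))
    return batches


calls = []
padded = windows_or_skip(
    ["g1", "g2", "g3"], 5, pad=True, progress=lambda i, n: calls.append((i, n))
)
skipped = windows_or_skip(["g1", "g2", "g3"], 5, pad=False)
```

Here padded holds a single window filled out with placeholders, while skipped is empty because padding was disabled.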
- fit(genes: Iterable[Gene], *, select: Optional[float] = None, shuffle: bool = True, cpus: Optional[int] = None, correction_method: Optional[str] = None) -> None
Fit the CRF model to the given training data.
- Parameters:
genes (iterable of Gene) – The genes to extract domains from for training the CRF.
select (float, optional) – The fraction of features to select based on Fisher-tested significance. Leave as None to skip feature selection.
shuffle (bool) – Whether or not to shuffle the contigs after having grouped the genes together.
correction_method (str, optional) – The correction method to use for correcting p-values used for feature selection. Ignored if select is None.
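Selecting a fraction of features by Fisher-tested significance can be sketched with the standard library alone. The code below is an illustration under assumptions, not GECCO's implementation: the contingency layout, the function names and the toy counts are hypothetical, the test is two-sided, and p-value correction is left out.

```python
from math import ceil, comb
from typing import Dict, Set, Tuple


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of observing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / total

    observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of every table no more likely than the observed one.
    return sum(
        p
        for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= observed * (1 + 1e-9)
    )


def select_features(counts: Dict[str, Tuple[int, int, int, int]], fraction: float) -> Set[str]:
    """Keep the given fraction of features with the smallest p-values.

    Each tuple is assumed to hold (present inside clusters, present outside,
    absent inside clusters, absent outside).
    """
    pvalues = {name: fisher_exact_two_sided(*table) for name, table in counts.items()}
    ranked = sorted(pvalues, key=pvalues.get)
    return set(ranked[:ceil(fraction * len(ranked))])


# Toy counts, made up for the example.
counts = {
    "PF_A": (8, 2, 1, 9),
    "PF_B": (5, 5, 5, 5),
    "PF_C": (7, 3, 3, 7),
    "PF_D": (5, 5, 4, 6),
}
selected = select_features(counts, fraction=0.5)
```

With a fraction of 0.5, the two features whose presence is most strongly associated with cluster membership are retained.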
- save(model_path: PathLike[str]) -> None
Save the ClusterCRF to an on-disk location.
Models serialized at a given location can later be loaded from that same location using the ClusterCRF.trained class method.
- Parameters:
model_path (str) – The path to the directory in which to write the model files.
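The round trip between save and trained, including the hash check that makes trained raise ValueError, can be sketched as follows. The file names, the use of pickle and the choice of SHA-256 are assumptions made for this example; the actual files and hash written by GECCO are not described here.

```python
import hashlib
import pickle
import tempfile
from pathlib import Path

# Hypothetical file names, chosen for this sketch only.
MODEL_FILE = "model.pkl"
HASH_FILE = "model.pkl.sha256"


def save_model(model: object, model_path: str) -> None:
    """Write the pickled model and a hex digest of its bytes into a directory."""
    directory = Path(model_path)
    directory.mkdir(parents=True, exist_ok=True)
    data = pickle.dumps(model)
    (directory / MODEL_FILE).write_bytes(data)
    (directory / HASH_FILE).write_text(hashlib.sha256(data).hexdigest())


def load_model(model_path: str) -> object:
    """Load a model written by `save_model`, refusing data that fails the hash check."""
    directory = Path(model_path)
    data = (directory / MODEL_FILE).read_bytes()
    expected = (directory / HASH_FILE).read_text().strip()
    if hashlib.sha256(data).hexdigest() != expected:
        raise ValueError("model data does not match its hash")
    return pickle.loads(data)


with tempfile.TemporaryDirectory() as tmp:
    save_model({"window_size": 5}, tmp)
    restored = load_model(tmp)
    # Corrupt the stored bytes to show the integrity check firing.
    (Path(tmp) / MODEL_FILE).write_bytes(b"tampered")
    try:
        load_model(tmp)
        tampered_rejected = False
    except ValueError:
        tampered_rejected = True
```

Checking the digest before unpickling means that a corrupted or partially written model directory fails with an explicit error instead of producing an unusable model.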
