BGC Detection

Gene cluster prediction using a conditional random field.

class gecco.crf.ClusterCRF(object)

A wrapper for sklearn_crfsuite.CRF to work with the GECCO data model.

classmethod trained(model_path: Optional[str] = None) → ClusterCRF

Create a new pre-trained ClusterCRF instance from a model path.

Parameters: model_path (str, optional) – The path to the model directory obtained with the gecco train command. If None is given, the embedded model is used.
Returns: ClusterCRF – A CRF model that can be used to perform predictions without training first.
Raises: ValueError – If the model data does not match its hash.
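
The hash check mentioned above can be illustrated with a minimal standard-library sketch. It is not GECCO's implementation: the file names (model.pkl, model.pkl.sha256), the choice of SHA-256, and the helper names save_with_hash and load_checked are all assumptions made for illustration only.

```python
import hashlib
import pathlib
import pickle
import tempfile


def save_with_hash(obj, model_dir):
    """Pickle `obj` into `model_dir` next to a digest of the payload.

    Hypothetical helper: file names and hash algorithm are assumptions.
    """
    model_dir = pathlib.Path(model_dir)
    model_dir.mkdir(parents=True, exist_ok=True)
    data = pickle.dumps(obj)
    (model_dir / "model.pkl").write_bytes(data)
    (model_dir / "model.pkl.sha256").write_text(hashlib.sha256(data).hexdigest())


def load_checked(model_dir):
    """Load the pickled object, raising ValueError on a digest mismatch."""
    model_dir = pathlib.Path(model_dir)
    data = (model_dir / "model.pkl").read_bytes()
    expected = (model_dir / "model.pkl.sha256").read_text().strip()
    if hashlib.sha256(data).hexdigest() != expected:
        raise ValueError("model data does not match its hash")
    return pickle.loads(data)


with tempfile.TemporaryDirectory() as tmp:
    save_with_hash({"weights": [0.1, 0.2]}, tmp)
    restored = load_checked(tmp)
    # Corrupt the payload: the digest check should now fail.
    (pathlib.Path(tmp) / "model.pkl").write_bytes(b"corrupted")
    try:
        load_checked(tmp)
        corruption_detected = False
    except ValueError:
        corruption_detected = True
```

Verifying the digest before unpickling means that a truncated or modified model file fails with a clear ValueError before anything is deserialized.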

__init__(feature_type: str = 'protein', algorithm: str = 'lbfgs', window_size: int = 5, window_step: int = 1, **kwargs: Any) → None

Create a new ClusterCRF instance.

Parameters:

feature_type (str) – Defines how features should be extracted. Should be either domain or protein.
algorithm (str) – The optimization algorithm for the model. See https://sklearn-crfsuite.readthedocs.io/en/latest/api.html for available values.
window_size (int) – The size of the sliding window to use when training and predicting probabilities on sequences of genes.
window_step (int) – The step between consecutive sliding windows to use when training and predicting probabilities on sequences of genes.

Any additional keyword argument is passed as-is to the internal CRF constructor.
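
To make the roles of window_size and window_step concrete, here is a minimal sketch of a sliding window over a sequence of genes. The helper sliding_windows is hypothetical and only illustrates the general technique; how GECCO builds its windows internally is not stated here and may differ.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def sliding_windows(
    items: Sequence[T], window_size: int = 5, window_step: int = 1
) -> Iterator[Sequence[T]]:
    """Yield windows of `window_size` items, advancing by `window_step`.

    Hypothetical helper, not part of the GECCO API.
    """
    if window_size < 1 or window_step < 1:
        raise ValueError("window_size and window_step must be positive")
    # A sequence shorter than one window yields no window at all.
    for start in range(0, len(items) - window_size + 1, window_step):
        yield items[start:start + window_size]


genes = ["g1", "g2", "g3", "g4", "g5", "g6", "g7"]
windows = [list(w) for w in sliding_windows(genes, window_size=5, window_step=1)]
strided = [list(w) for w in sliding_windows(genes, window_size=5, window_step=2)]
too_short = list(sliding_windows(genes[:3], window_size=5))
```

With seven genes and the default values, this produces three overlapping windows. A sequence of three genes produces none, which is the situation the pad option of predict_probabilities addresses.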

Raises:

predict_probabilities(genes: Iterable[Gene], *, pad: bool = True, progress: Optional[Callable[[int, int], None]] = None) → List[Gene]

Predict how likely each given gene is to be part of a gene cluster.

Parameters:

genes (iterable of Gene) – The genes to compute probabilities for.

Keyword Arguments:

batch_size (int) – The number of samples to load per batch. Ignored: the CRF always uses a batch size of 1.
pad (bool) – Whether to pad sequences too small for a single window. Setting this to False will skip probability prediction entirely for sequences smaller than the window size.
progress (callable) – A callable that accepts two int arguments: the current batch index and the total number of batches.

Returns:

list of Gene – A list of new Gene objects with their probability set.

Raises:

NotFittedError – When calling this method on an object that has not been fitted yet.
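
A callable with the shape expected by progress might look like the sketch below. The run_batches function is a hypothetical stand-in for a batched loop, not the GECCO implementation, and the choice to report a 1-based batch index is an assumption.

```python
from typing import Callable, List, Optional, Tuple


def run_batches(
    total: int, progress: Optional[Callable[[int, int], None]] = None
) -> None:
    """Hypothetical batched loop that reports progress after each batch."""
    for index in range(total):
        # ... the work for one batch would happen here ...
        if progress is not None:
            # Assumption: the index is reported 1-based.
            progress(index + 1, total)


calls: List[Tuple[int, int]] = []


def record_progress(current: int, total: int) -> None:
    """Progress callback: takes two ints and returns nothing."""
    calls.append((current, total))


run_batches(3, progress=record_progress)
```

Any callable taking two integers fits the same signature, for example one that updates a progress bar instead of appending to a list.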

fit(genes: Iterable[Gene], *, select: Optional[float] = None, shuffle: bool = True, cpus: Optional[int] = None, correction_method: Optional[str] = None) → None

Fit the CRF model to the given training data.

Parameters:

genes (iterable of Gene) – The genes to extract domains from for training the CRF.
select (float, optional) – The fraction of features to select based on Fisher-tested significance. Leave as None to skip feature selection.
shuffle (bool) – Whether or not to shuffle the contigs after having grouped the genes together.
correction_method (str, optional) – The correction method to use for correcting p-values used for feature selection. Ignored if select is None.
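
The effect of select can be sketched as keeping the given fraction of features with the smallest p-values. This is an illustration under stated assumptions, not GECCO's code: the helper select_features, the rounding rule (ceiling), and the feature names and p-values are all made up, and the p-values are taken as already computed and corrected.

```python
import math
from typing import Dict, Optional, Set


def select_features(
    pvalues: Dict[str, float], select: Optional[float] = None
) -> Set[str]:
    """Keep the `select` fraction of features with the lowest p-values.

    Hypothetical helper; `None` disables selection and keeps every feature.
    """
    if select is None:
        return set(pvalues)
    if not 0.0 < select <= 1.0:
        raise ValueError("select must be in the interval (0, 1]")
    # Assumption: round the number of kept features up.
    keep = math.ceil(select * len(pvalues))
    ranked = sorted(pvalues, key=pvalues.get)  # most significant first
    return set(ranked[:keep])


# Illustrative p-values only, not real data.
pvalues = {"dom_a": 0.001, "dom_b": 0.20, "dom_c": 0.03, "dom_d": 0.75}
half = select_features(pvalues, select=0.5)
everything = select_features(pvalues, select=None)
```

Here select=0.5 keeps the two most significant features out of four, while None keeps all of them.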

save(model_path: PathLike[str]) → None

Save the ClusterCRF to an on-disk location.

Models serialized at a given location can be later loaded from that same location using the ClusterCRF.trained class method.

Parameters: model_path (str) – The path to the directory in which to write the model files.