BGC Detection

Gene cluster prediction using a conditional random field.

class gecco.crf.ClusterCRF(object)

A wrapper around sklearn_crfsuite.CRF that works with the GECCO data model.

classmethod trained(model_path: Optional[str] = None) → ClusterCRF

Create a new pre-trained ClusterCRF instance from a model path.

Parameters:

model_path (str, optional) – The path to the model directory obtained with the gecco train command. If None is given, the embedded model is used.

Returns:

ClusterCRF – A CRF model that can be used to perform predictions without training first.

Raises:

ValueError – If the model data does not match its hash.
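The hash check behind this ValueError can be pictured with a minimal sketch. It is not GECCO's implementation: the helper name read_verified, the file names, and the choice of SHA-256 are all assumptions made for illustration.

```python
import hashlib
import tempfile
from pathlib import Path


def read_verified(model_file: Path, hash_file: Path) -> bytes:
    """Hypothetical helper: return the model bytes only if they match the stored digest."""
    data = model_file.read_bytes()
    expected = hash_file.read_text().strip()
    # SHA-256 is an assumption; the digest actually used by GECCO is not stated here.
    actual = hashlib.sha256(data).hexdigest()
    if actual != expected:
        raise ValueError("model data does not match its hash")
    return data


with tempfile.TemporaryDirectory() as tmp:
    model_file = Path(tmp) / "model.bin"      # hypothetical file name
    hash_file = Path(tmp) / "model.sha256"    # hypothetical file name
    model_file.write_bytes(b"example model data")
    hash_file.write_text(hashlib.sha256(b"example model data").hexdigest())

    # Intact data passes the check.
    data = read_verified(model_file, hash_file)

    # Modified data no longer matches the stored digest.
    model_file.write_bytes(b"tampered")
    try:
        read_verified(model_file, hash_file)
    except ValueError:
        mismatch_detected = True
    else:
        mismatch_detected = False
```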

__init__(feature_type: str = 'protein', algorithm: str = 'lbfgs', window_size: int = 5, window_step: int = 1, **kwargs: Any) → None

Create a new ClusterCRF instance.

Parameters:
  • feature_type (str) – Defines how features should be extracted. Should be either domain or protein.

  • algorithm (str) – The optimization algorithm for the model. See https://sklearn-crfsuite.readthedocs.io/en/latest/api.html for available values.

  • window_size (int) – The size of the sliding window to use when training and predicting probabilities on sequences of genes.

  • window_step (int) – The step between consecutive sliding windows to use when training and predicting probabilities on sequences of genes.

Any additional keyword argument is passed as-is to the internal CRF constructor.

Raises:
  • ValueError – If feature_type has an invalid value.

  • TypeError – If one of the *_columns arguments is not iterable.
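To show how window_size and window_step interact, here is a minimal sketch of a generic sliding window over a sequence of genes. The function sliding_windows is a hypothetical helper written for this example, not part of GECCO, and the way GECCO builds its windows internally may differ.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def sliding_windows(
    items: Sequence[T], window_size: int = 5, window_step: int = 1
) -> List[Sequence[T]]:
    """Hypothetical helper: every full window of `window_size` items, advancing by `window_step`."""
    if window_size < 1 or window_step < 1:
        raise ValueError("window_size and window_step must be positive")
    # Index of the last position at which a full window still fits.
    last_start = len(items) - window_size
    return [items[i:i + window_size] for i in range(0, last_start + 1, window_step)]


genes = ["g1", "g2", "g3", "g4", "g5", "g6", "g7"]  # placeholder gene names

# The defaults documented above: windows of 5 genes, shifted by 1 gene each time.
default_windows = sliding_windows(genes, window_size=5, window_step=1)

# A larger step yields fewer windows that overlap less.
sparse_windows = sliding_windows(genes, window_size=5, window_step=2)

# A sequence shorter than the window yields no full window at all.
short_windows = sliding_windows(genes[:3], window_size=5, window_step=1)
```

In this sketch a sequence shorter than window_size produces no window, which is the situation the pad argument of predict_probabilities addresses.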

predict_probabilities(genes: Iterable[Gene], *, pad: bool = True, progress: Optional[Callable[[int, int], None]] = None) → List[Gene]

Predict how likely each given gene is part of a gene cluster.

Parameters:

genes (iterable of Gene) – The genes to compute probabilities for.

Keyword Arguments:
  • batch_size (int) – The number of samples to load per batch. Ignored: the CRF always uses a batch size of 1.

  • pad (bool) – Whether to pad sequences shorter than a single window. Setting this to False skips probability prediction entirely for sequences smaller than the window size.

  • progress (callable) – A callable that accepts two int arguments: the current batch index and the total number of batches.

Returns:

list of Gene – A list of new Gene objects with their probability set.

Raises:

NotFittedError – When calling this method on an object that has not been fitted yet.
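A callable suitable for the progress argument only needs to accept two integers. The sketch below defines such a callback and exercises it with run_batches, a hypothetical stand-in for a method that reports progress. Whether the batch index GECCO passes starts at 0 or 1 is not stated here, so the 0-based index is an assumption.

```python
from typing import Callable, List, Optional, Tuple

# Every (current, total) pair received by the callback, kept for inspection.
calls: List[Tuple[int, int]] = []


def report(current: int, total: int) -> None:
    """Progress callback matching Callable[[int, int], None]."""
    calls.append((current, total))
    print(f"batch {current} of {total}")


def run_batches(
    total: int, progress: Optional[Callable[[int, int], None]] = None
) -> None:
    """Hypothetical stand-in for a method that reports progress once per batch."""
    for index in range(total):  # 0-based index: an assumption
        if progress is not None:
            progress(index, total)


run_batches(3, progress=report)
```

With a fitted model, the same report function could be passed as crf.predict_probabilities(genes, progress=report).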

fit(genes: Iterable[Gene], *, select: Optional[float] = None, shuffle: bool = True, cpus: Optional[int] = None, correction_method: Optional[str] = None) → None

Fit the CRF model to the given training data.

Parameters:
  • genes (iterable of Gene) – The genes from which to extract domains for training the CRF.

  • select (float, optional) – The fraction of features to select based on Fisher-tested significance. Leave as None to skip feature selection.

  • shuffle (bool) – Whether or not to shuffle the contigs after having grouped the genes together.

  • correction_method (str, optional) – The correction method to use for correcting p-values used for feature selection. Ignored if select is None.
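The meaning of select as a fraction can be sketched as follows: given one p-value per feature, keep the most significant fraction. The helper select_features, the feature names, and the rounding rule (rounding up) are assumptions for illustration. The Fisher test and the p-value correction are left out, and GECCO's actual selection code may differ.

```python
import math
from typing import Dict, Set


def select_features(pvalues: Dict[str, float], select: float) -> Set[str]:
    """Hypothetical helper: keep the `select` fraction of features with the smallest p-values."""
    if not 0 < select <= 1:
        raise ValueError("select must be in the interval (0, 1]")
    # Rounding up is an assumption; it guarantees at least one feature is kept.
    keep = math.ceil(select * len(pvalues))
    ranked = sorted(pvalues, key=pvalues.get)  # most significant first
    return set(ranked[:keep])


# Made-up p-values for four placeholder features.
pvalues = {"domain_a": 0.001, "domain_b": 0.2, "domain_c": 0.04, "domain_d": 0.9}

# Keep the most significant half of the features.
selected = select_features(pvalues, select=0.5)
```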

save(model_path: PathLike[str]) → None

Save the ClusterCRF to an on-disk location.

Models serialized at a given location can later be loaded from that same location using the ClusterCRF.trained class method.

Parameters:

model_path (str) – The path to the directory in which to write the model files.
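The round trip between save and ClusterCRF.trained can be sketched with a small helper. The name save_and_reload is hypothetical. It relies only on the two calls documented above and accepts any object that exposes them, so it does not import GECCO itself.

```python
import os
from typing import Any, Union


def save_and_reload(crf: Any, model_dir: Union[str, "os.PathLike[str]"]) -> Any:
    """Hypothetical helper: save a fitted model, then load a fresh instance from the same directory."""
    # Write the model files to the target directory.
    crf.save(model_dir)
    # `trained` is documented as taking a str path, so convert path-like objects first.
    return type(crf).trained(os.fspath(model_dir))
```

Assuming crf is a fitted ClusterCRF, calling save_and_reload(crf, "my_model") would be expected to return a second instance ready for predict_probabilities without training again.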