BGC Detection

BGC prediction using a conditional random field.

class gecco.crf.ClusterCRF(object)

A wrapper for sklearn_crfsuite.CRF to work with the GECCO data model.

classmethod trained(model_path: Optional[str] = None) → gecco.crf.ClusterCRF

Create a new pre-trained ClusterCRF instance from a model path.

Parameters

model_path (str, optional) – The path to the model directory obtained with the gecco train command. If None is given, the embedded model is used.

Returns

ClusterCRF – A CRF model that can be used to perform predictions without training first.

Raises

ValueError – If the model data does not match its hash.
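The kind of integrity check behind this error can be sketched as follows. This is only an illustration: the hash algorithm (SHA-256 here) and the helper name check_model_hash are assumptions, not taken from the GECCO source.

```python
import hashlib


def check_model_hash(data: bytes, expected_hex: str) -> None:
    """Raise ValueError if `data` does not hash to `expected_hex`.

    Hypothetical helper: SHA-256 is an assumed algorithm, chosen only
    to illustrate a data-versus-hash comparison.
    """
    actual = hashlib.sha256(data).hexdigest()
    if actual != expected_hex:
        raise ValueError(
            f"model hash mismatch: expected {expected_hex}, got {actual}"
        )


# Stand-in bytes for serialized model data, with a matching digest.
payload = b"serialized model bytes"
good_digest = hashlib.sha256(payload).hexdigest()

# Passes silently because the data matches its digest.
check_model_hash(payload, good_digest)
```

Calling the helper with modified data and the original digest raises ValueError, which mirrors the failure mode documented above.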

__init__()

Create a new ClusterCRF instance.

Parameters
  • feature_type (str) – Defines how features should be extracted. Must be one of domain, protein, or overlap.

  • algorithm (str) – The optimization algorithm for the model. See https://sklearn-crfsuite.readthedocs.io/en/latest/api.html for available values.

  • overlap (int) – When feature_type is "overlap", the number of neighbouring elements included on each side of the sliding window. The resulting window width is 2*overlap+1.

  • pool_factory (multiprocessing.pool.Pool subclass, or callable) – The factory used to create a new pool instance for methods that can perform operations in parallel. It is called with a single argument: the number of workers to create, or None to create as many workers as there are CPUs.

Any additional keyword argument is passed as-is to the internal CRF constructor.
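The window width stated for the overlap parameter (2*overlap+1) can be illustrated with a small sketch. The function sliding_windows is hypothetical and is not part of GECCO; in particular, clipping the window at the sequence edges is an assumption, since the behaviour at the boundaries is not described here.

```python
from typing import List, Sequence


def sliding_windows(items: Sequence[str], overlap: int) -> List[List[str]]:
    """Return one window per item, with up to `overlap` neighbours per side.

    Hypothetical helper: windows are clipped at the sequence edges,
    which is an assumption made for this sketch.
    """
    windows = []
    for i in range(len(items)):
        start = max(0, i - overlap)             # clip at the left edge
        end = min(len(items), i + overlap + 1)  # clip at the right edge
        windows.append(list(items[start:end]))
    return windows


genes = ["g1", "g2", "g3", "g4", "g5"]
windows = sliding_windows(genes, overlap=1)
```

Away from the edges, each window holds 2*overlap+1 items: the central element plus overlap neighbours on each side.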

Raises
  • ValueError – If feature_type has an invalid value.

  • TypeError – If one of the *_columns arguments is not iterable.
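As a sketch of what a pool_factory callable might look like, the following wraps the standard library's multiprocessing.pool.ThreadPool. The name thread_pool_factory is hypothetical; the only contract assumed is the one documented above, namely a single argument giving the number of workers, or None.

```python
from multiprocessing.pool import ThreadPool
from typing import Optional


def thread_pool_factory(workers: Optional[int]) -> ThreadPool:
    """Create a thread-based pool with `workers` threads.

    ThreadPool falls back to the CPU count when `processes` is None,
    which matches the documented meaning of None for pool_factory.
    """
    return ThreadPool(processes=workers)


# Exercise the factory on its own, independently of ClusterCRF.
with thread_pool_factory(2) as pool:
    squares = pool.map(lambda x: x * x, [1, 2, 3])
```

Such a callable would presumably be passed as the pool_factory argument when creating a ClusterCRF. A thread-based pool avoids pickling the work items, at the cost of being limited by the interpreter lock for CPU-bound work.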

predict_probabilities()

Predict how likely each given gene is to be part of a gene cluster.