BGC Detection
Gene cluster prediction using a conditional random field.
- class gecco.crf.ClusterCRF(object)
A wrapper for sklearn_crfsuite.CRF to work with the GECCO data model.
- classmethod trained(model_path: Optional[str] = None) -> ClusterCRF
Create a new pre-trained ClusterCRF instance from a model path.
- Parameters:
model_path (str, optional) – The path to the model directory obtained with the gecco train command. If None is given, the embedded model is used.
- Returns:
ClusterCRF – A CRF model that can be used to perform predictions without training first.
- Raises:
ValueError – If the model data does not match its hash.
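The ValueError above points to an integrity check on the stored model. The on-disk layout and hash algorithm GECCO uses are not described here, so the following is only a minimal sketch of the general idea, assuming a SHA-256 digest stored next to the model file. The names `write_with_hash`, `load_checked`, `demo` and the `.sha256` suffix are hypothetical and chosen for illustration.

```python
import hashlib
import os
import tempfile


def write_with_hash(path: str, data: bytes) -> None:
    """Write `data` to `path` and its SHA-256 hex digest to `path + '.sha256'`."""
    with open(path, "wb") as f:
        f.write(data)
    with open(path + ".sha256", "w") as f:
        f.write(hashlib.sha256(data).hexdigest())


def load_checked(path: str) -> bytes:
    """Read `path`, raising ValueError if it does not match the stored digest."""
    with open(path, "rb") as f:
        data = f.read()
    with open(path + ".sha256") as f:
        expected = f.read().strip()
    if hashlib.sha256(data).hexdigest() != expected:
        raise ValueError(f"model data in {path!r} does not match its hash")
    return data


def demo() -> tuple:
    """Round-trip a payload, then corrupt it and report whether loading fails."""
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "model.bin")
        write_with_hash(path, b"model weights")
        loaded = load_checked(path)
        with open(path, "ab") as f:  # simulate corruption by appending a byte
            f.write(b"!")
        try:
            load_checked(path)
            corrupted_rejected = False
        except ValueError:
            corrupted_rejected = True
    return loaded, corrupted_rejected
```

Checking the digest before deserializing means a truncated or modified file is rejected early, instead of producing a model that fails later in a less obvious way.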
- __init__(feature_type: str = 'protein', algorithm: str = 'lbfgs', window_size: int = 5, window_step: int = 1, **kwargs: Any) -> None
Create a new ClusterCRF instance.
- Parameters:
feature_type (str) – Defines how features should be extracted. Should be either domain or protein.
algorithm (str) – The optimization algorithm for the model. See https://sklearn-crfsuite.readthedocs.io/en/latest/api.html for available values.
window_size (int) – The size of the sliding window to use when training and predicting probabilities on sequences of genes.
window_step (int) – The step between consecutive sliding windows to use when training and predicting probabilities on sequences of genes.
Any additional keyword argument is passed as-is to the internal CRF constructor.
- Raises:
ValueError – If feature_type has an invalid value.
TypeError – If one of the *_columns arguments is not iterable.
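To make the roles of window_size and window_step concrete, here is a minimal sketch of how a sliding window can be cut from a sequence of genes. It is not GECCO's implementation: `sliding_windows` is a hypothetical helper, and dropping incomplete trailing windows is an assumption made for simplicity.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def sliding_windows(
    items: Sequence[T], window_size: int = 5, window_step: int = 1
) -> List[Sequence[T]]:
    """Return every full window of `window_size` items, `window_step` apart."""
    if window_size < 1 or window_step < 1:
        raise ValueError("window_size and window_step must be positive")
    # Last index at which a full window still fits; negative if none fits.
    last_start = len(items) - window_size
    return [
        items[start:start + window_size]
        for start in range(0, last_start + 1, window_step)
    ]


genes = ["g1", "g2", "g3", "g4", "g5", "g6", "g7"]
windows = sliding_windows(genes, window_size=5, window_step=1)
# Three windows of five genes each: g1..g5, g2..g6, g3..g7.
```

A larger window_step yields fewer, less overlapping windows, and a sequence shorter than window_size yields none in this sketch.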
- predict_probabilities(genes: Iterable[Gene], *, pad: bool = True, progress: Optional[Callable[[int, int], None]] = None) -> List[Gene]
Predict how likely each given gene is to be part of a gene cluster.
- Parameters:
genes (iterable of Gene) – The genes to compute probabilities for.
- Keyword Arguments:
batch_size (int) – The number of samples to load per batch. Ignored, always 1 with the CRF.
pad (bool) – Whether to pad sequences too small for a single window. Setting this to False will skip probability prediction entirely for sequences smaller than the window size.
progress (callable) – A callable that accepts two int arguments, the current batch index and the total number of batches.
- Returns:
list of Gene – A list of new Gene objects with their probability set.
- Raises:
NotFittedError – When calling this method on an object that has not been fitted yet.
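The pad and progress keyword arguments can be illustrated without the library. The sketch below is a stand-in, not GECCO code: `prepare_sequences` and the `PAD` placeholder are hypothetical, padding at the end of the sequence is an assumption, and so is the 1-based batch index passed to the callback.

```python
from typing import Callable, List, Optional, Sequence

PAD = "<pad>"  # hypothetical placeholder for padded positions


def prepare_sequences(
    sequences: Sequence[Sequence[str]],
    window_size: int = 5,
    *,
    pad: bool = True,
    progress: Optional[Callable[[int, int], None]] = None,
) -> List[List[str]]:
    """Pad or skip sequences shorter than one window, reporting progress."""
    prepared = []
    total = len(sequences)
    for index, sequence in enumerate(sequences, start=1):
        items = list(sequence)
        if len(items) >= window_size:
            prepared.append(items)
        elif pad:
            # Extend the short sequence so that one full window fits.
            prepared.append(items + [PAD] * (window_size - len(items)))
        # With pad=False, short sequences are skipped entirely.
        if progress is not None:
            progress(index, total)
    return prepared


calls = []
padded = prepare_sequences(
    [["a", "b"], list("abcdef")],
    progress=lambda current, total: calls.append((current, total)),
)
```

Any callable taking two integers works as the progress argument, for example a function that updates a progress bar.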
- fit(genes: Iterable[Gene], *, select: Optional[float] = None, shuffle: bool = True, cpus: Optional[int] = None, correction_method: Optional[str] = None) -> None
Fit the CRF model to the given training data.
- Parameters:
genes (iterable of Gene) – The genes to extract domains from for training the CRF.
select (float, optional) – The fraction of features to select based on Fisher-tested significance. Leave as None to skip feature selection.
shuffle (bool) – Whether or not to shuffle the contigs after having grouped the genes together.
correction_method (str, optional) – The correction method to use for correcting p-values used for feature selection. Ignored if select is None.
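Feature selection by Fisher-tested significance can be sketched as follows. This is an illustration under stated assumptions rather than GECCO's code: each gene is reduced to a pair of (set of domain names, in-cluster flag), `fisher_exact` and `select_features` are hypothetical helpers, the test is two-sided, and multiple-testing correction is left out.

```python
from math import ceil, comb
from typing import Dict, List, Set, Tuple


def fisher_exact(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / total

    observed = prob(a)
    support = range(max(0, col1 - row2), min(row1, col1) + 1)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    p = sum(prob(x) for x in support if prob(x) <= observed * (1 + 1e-9))
    return min(1.0, p)


def select_features(genes: List[Tuple[Set[str], bool]], select: float) -> Set[str]:
    """Keep the `select` fraction of features with the smallest p-values."""
    features = sorted(set().union(*(domains for domains, _ in genes)))
    pvalues: Dict[str, float] = {}
    for feature in features:
        a = sum(1 for doms, inside in genes if inside and feature in doms)
        b = sum(1 for doms, inside in genes if inside and feature not in doms)
        c = sum(1 for doms, inside in genes if not inside and feature in doms)
        d = sum(1 for doms, inside in genes if not inside and feature not in doms)
        pvalues[feature] = fisher_exact(a, b, c, d)
    ranked = sorted(features, key=lambda f: pvalues[f])
    return set(ranked[:ceil(select * len(features))])


# Domain "A" occurs only in cluster genes; domain "B" is evenly spread.
training = (
    [({"A", "B"}, True)] * 2 + [({"A"}, True)] * 2
    + [({"B"}, False)] * 2 + [(set(), False)] * 2
)
selected = select_features(training, select=0.5)
```

Here the domain found only inside clusters is kept, while the evenly distributed one is dropped because it carries no information about cluster membership.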
- save(model_path: PathLike[str]) -> None
Save the ClusterCRF to an on-disk location.
Models serialized at a given location can be later loaded from that same location using the ClusterCRF.trained class method.
- Parameters:
model_path (str) – The path to the directory where to write the model files.