Gene cluster prediction using a conditional random field.
- class gecco.crf.ClusterCRF(object)
A wrapper for sklearn_crfsuite.CRF to work with the GECCO data model.
- classmethod trained(model_path: Optional[str] = None) -> gecco.crf.ClusterCRF
Create a new pre-trained ClusterCRF instance from a model path.
model_path (str, optional) – The path to the model directory obtained with the gecco train command. If None is given, use the embedded model.
ClusterCRF – A CRF model that can be used to perform predictions without training first.
ValueError – If the model data does not match its hash.
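The hash check behind this ValueError can be illustrated with a minimal sketch. It assumes a SHA-256 digest and a hypothetical load_verified helper. Neither is taken from GECCO, whose actual hash function and model file layout are not described in this section.

```python
import hashlib
import tempfile
from pathlib import Path


def load_verified(model_file: Path, expected_hex: str) -> bytes:
    """Read a file and raise ValueError if its digest differs (hypothetical helper)."""
    data = model_file.read_bytes()
    # SHA-256 is an assumption made for this sketch only.
    actual_hex = hashlib.sha256(data).hexdigest()
    if actual_hex != expected_hex:
        raise ValueError(f"hash mismatch for {model_file.name}")
    return data


# Demonstration on a throwaway file standing in for serialized model data.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "model.bin"
    path.write_bytes(b"example model data")
    good_hex = hashlib.sha256(b"example model data").hexdigest()

    # A matching digest returns the data unchanged.
    loaded = load_verified(path, good_hex)

    # A wrong digest is rejected with ValueError.
    try:
        load_verified(path, "0" * 64)
        mismatch_detected = False
    except ValueError:
        mismatch_detected = True
```

Verifying the digest before deserializing means a corrupted or mismatched model file is rejected before it is used for predictions.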
- __init__(feature_type: str = 'protein', algorithm: str = 'lbfgs', window_size: int = 5, window_step: int = 1, **kwargs: Any) -> None
Create a new ClusterCRF instance.
feature_type (str) – Defines how features should be extracted. Should be either domain or protein.
algorithm (str) – The optimization algorithm for the model. See https://sklearn-crfsuite.readthedocs.io/en/latest/api.html for available values.
window_size (int) – The size of the sliding window to use when training and predicting probabilities on sequences of genes.
window_step (int) – The step between consecutive sliding windows to use when training and predicting probabilities on sequences of genes.
Any additional keyword argument is passed as-is to the internal
ValueError – if feature_type has an invalid value.
TypeError – if one of the *_columns arguments is not iterable.
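The roles of window_size and window_step can be pictured with a generic sliding-window sketch. The sliding_windows function below is hypothetical and only shows how a size and a step partition a sequence. It is not GECCO's internal implementation, and the gene names are placeholders.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def sliding_windows(
    items: Sequence[T], window_size: int = 5, window_step: int = 1
) -> Iterator[List[T]]:
    """Yield consecutive windows of `window_size` items, `window_step` apart."""
    if window_size < 1 or window_step < 1:
        raise ValueError("window_size and window_step must be positive")
    # The last valid start index is len(items) - window_size.
    for start in range(0, len(items) - window_size + 1, window_step):
        yield list(items[start:start + window_size])


genes = ["g1", "g2", "g3", "g4", "g5", "g6", "g7"]  # placeholder gene names

# Seven genes with a window of five and a step of one give three windows.
windows = list(sliding_windows(genes, window_size=5, window_step=1))

# A larger step skips start positions.
strided = list(sliding_windows(genes, window_size=5, window_step=2))

# A sequence shorter than the window yields no window at all.
too_short = list(sliding_windows(genes[:3], window_size=5))
```

In this sketch a sequence shorter than the window produces no windows, which is the situation where padding a short sequence becomes relevant.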
- predict_probabilities(genes: Iterable[gecco.model.Gene], *, pad: bool = True, progress: Optional[Callable[[int, int], None]] = None) -> List[gecco.model.Gene]
Predict how likely each given gene is part of a gene cluster.
genes (iterable of Gene) – The genes to compute probabilities for.
- Keyword Arguments
int) – The number of samples to load per batch. Ignored, always 1 with the CRF.
pad (bool) – Whether to pad sequences too small for a single window. Setting this to False will skip probability prediction entirely for sequences smaller than the window size.
progress (callable) – A callable that accepts two int arguments: the current batch index and the total number of batches.
Gene – A list of new Gene objects with their probability set.
NotFittedError – When calling this method on an object that has not been fitted yet.
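A callable matching the documented progress signature (two int arguments, no return value) might look like the sketch below. The process_batches driver is a hypothetical stand-in for whatever loop invokes the callback. Whether the real batch index starts at 0 or 1 is not stated here, so the 1-based index is an assumption.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# Collected (current, total) pairs, so the calls can be inspected afterwards.
calls: List[Tuple[int, int]] = []


def record_progress(current: int, total: int) -> None:
    """Progress callback with the documented Callable[[int, int], None] shape."""
    calls.append((current, total))


def process_batches(
    batches: Sequence[Sequence[int]],
    progress: Optional[Callable[[int, int], None]] = None,
) -> List[int]:
    """Hypothetical driver: handle each batch, then report progress if asked."""
    total = len(batches)
    results = []
    for index, batch in enumerate(batches):
        results.append(sum(batch))  # placeholder for the real per-batch work
        if progress is not None:
            progress(index + 1, total)  # 1-based index is an assumption
    return results


results = process_batches([[1, 2], [3], [4, 5, 6]], progress=record_progress)
```

A function such as record_progress could then be passed through the progress keyword argument of predict_probabilities, for instance to drive a progress bar.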
- fit(genes: Iterable[gecco.model.Gene], *, select: Optional[float] = None, shuffle: bool = True, cpus: Optional[int] = None, correction_method: Optional[str] = None) -> None
Fit the CRF model to the given training data.
genes (iterable of Gene) – The genes to extract domains from for training the CRF.
select (float, optional) – The fraction of features to select based on Fisher-tested significance. Leave as None to skip feature selection.
shuffle (bool) – Whether or not to shuffle the contigs after having grouped the genes together.
correction_method (str, optional) – The correction method to use for correcting p-values used for feature selection. Ignored if
- save(model_path: os.PathLike[str]) -> None
Save the ClusterCRF to an on-disk location.
Models serialized at a given location can be later loaded from that same location using the
model_path (str) – The path to the directory where the model files are written.