Data Model¶
Data layer classes storing information needed for gene cluster detection.
Python Layer¶
- class gecco.model.ClusterType(object)[source]¶
An immutable storage for the type of a gene cluster.
- __init__(*names: str) None [source]¶
Create a new product type from one or more base types.
Example
>>> t1 = ClusterType() # unknown type >>> t2 = ClusterType("Polyketide") # single type >>> t3 = ClusterType("Polyketide", "NRP") # multiple types
- unpack() List[ClusterType] [source]¶
Unpack a composite
ClusterType
into a list of individual types.Example
>>> ty = ClusterType("Polyketide", "Saccharide") >>> ty.unpack() [ClusterType('Polyketide'), ClusterType('Saccharide')]
- class gecco.model.Strand(enum.IntEnum)[source]¶
A flag to declare on which DNA strand a gene is located.
- class gecco.model.Domain(object)[source]¶
A conserved region within a protein.
- start¶
The start coordinate of the domain within the protein sequence (first amino-acid at 1).
- Type:
- i_evalue¶
The independent e-value reported by
hmmsearch
that measures how reliable the domain annotation is.- Type:
- probability¶
The probability that this domain is part of a gene cluster, or
None
if no prediction has been made yet.- Type:
float
, optional
- cluster_weight¶
The weight for this domain, measuring its importance as infered from the training clusters by the CRF model.
- Type:
float
, optional
- go_functions¶
The Gene Ontology term families for this particular domain. Term families are extracted by taking the highest superclasses (excluding the root) of each Gene Ontology term in the
molecular_function
namespace associated with this domain.- Type:
list
ofGOTerm
- qualifiers¶
A dictionary of feature qualifiers that is added to the
SeqFeature
built from thisDomain
.- Type:
- with_probability(probability: Optional[float]) Domain [source]¶
Copy the current domain and assign it a cluster probability.
- class gecco.model.Gene(object)[source]¶
A nucleotide sequence coding a protein.
- start¶
The index of the leftmost nucleotide of the gene within the source sequence, independent of the strandedness.
- Type:
- qualifiers¶
A dictionary of feature qualifiers that is added to the
SeqFeature
built from thisGene
.- Type:
dict
, optional
- property average_probability: Optional[float]¶
The average of domain probabilities of being in a cluster.
- Type:
- property maximum_probability: Optional[float]¶
The highest of domain probabilities of being in a cluster.
- Type:
- to_seq_feature(color: bool = True) SeqFeature [source]¶
Convert the gene to a single feature.
- with_protein(protein: Protein) Gene [source]¶
Copy the current gene and assign it a different protein.
- with_source(source: SeqRecord) Gene [source]¶
Copy the current gene and assign it a different source.
- class gecco.model.Cluster(object)[source]¶
A sequence of contiguous genes.
- types¶
The putative types of this gene cluster, according to similarity in domain composition with curated clusters.
- Type:
- types_probabilities¶
The probability with which each cluster type was identified (same dimension as the
types
attribute).
- property average_probability: Optional[float]¶
The average of proteins probability of being biosynthetic.
- Type:
- property maximum_probability: Optional[float]¶
The highest of proteins probability of being biosynthetic.
- Type:
- domain_composition(all_possible: Optional[Sequence[str]] = None, normalize: bool = True, minlog_weights: bool = False, pvalue: bool = True) NDArray[numpy.double] [source]¶
Compute weighted domain composition with respect to
all_possible
.- Parameters:
all_possible (sequence of
str
, optional) – A sequence containing all domain names to consider when computing domain composition for the cluster. IfNone
given, then only domains within the cluster are taken into account.normalize (
bool
) – Normalize the composition vector so that it sums to 1.minlog_weights (
bool
) – Compute weight for each domain as \(-log_10(v)\) (where \(v\) is either thepvalue
or thei_evalue
, depending on the value ofnormalize
). Use \(1 - v\) otherwise.pvalue (
bool
) – Compute composition weights using thepvalue
of each domain, instead of thei_evalue
.
- Returns:
ndarray
– A numerical array containing the relative domain composition of the gene cluster.
Report Tables¶
- class gecco.model.ClusterTable(collections.Sized)[source]¶
A table storing condensed information from several clusters.
- classmethod from_clusters(clusters: Iterable[Cluster]) ClusterTable [source]¶
Create a new cluster table from an iterable of clusters.