Data Model

Data layer classes storing information needed for gene cluster detection.

Python Layer

class gecco.model.ClusterType(object)

An immutable storage for the type of a gene cluster.

__init__(*names: str) -> None

Create a new cluster type from zero or more base types.

Example

>>> t1 = ClusterType()                    # unknown type
>>> t2 = ClusterType("Polyketide")        # single type
>>> t3 = ClusterType("Polyketide", "NRP") # multiple types

unpack() -> List[ClusterType]

Unpack a composite ClusterType into a list of individual types.

Example

>>> ty = ClusterType("Polyketide", "Saccharide")
>>> ty.unpack()
[ClusterType('Polyketide'), ClusterType('Saccharide')]
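
To make the composite behaviour concrete, here is a minimal self-contained sketch of an immutable type container. SketchClusterType is a hypothetical stand-in, not GECCO's implementation; storing the names in a frozenset and sorting them in unpack() are assumptions made for the illustration.

```python
from typing import FrozenSet, List


class SketchClusterType:
    """Hypothetical stand-in for an immutable, possibly composite type."""

    __slots__ = ("_names",)

    def __init__(self, *names: str) -> None:
        # Bypass the __setattr__ guard below for the one-time initialisation.
        object.__setattr__(self, "_names", frozenset(names))

    def __setattr__(self, key: str, value: object) -> None:
        raise AttributeError("SketchClusterType is immutable")

    @property
    def names(self) -> FrozenSet[str]:
        return self._names

    def __eq__(self, other: object) -> bool:
        if not isinstance(other, SketchClusterType):
            return NotImplemented
        return self._names == other._names

    def __hash__(self) -> int:
        return hash(self._names)

    def __repr__(self) -> str:
        args = ", ".join(repr(name) for name in sorted(self._names))
        return f"SketchClusterType({args})"

    def unpack(self) -> List["SketchClusterType"]:
        # One single-name object per base type (assumption: sorted by name).
        return [SketchClusterType(name) for name in sorted(self._names)]
```

With this sketch, an empty type unpacks to an empty list, and two composite types built from the same names compare equal regardless of argument order.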

class gecco.model.Strand(enum.IntEnum)

A flag to declare on which DNA strand a gene is located.

property sign: str

The strand as a single sign (+ or -).

Type:

str
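
As an illustration of how an IntEnum can expose a sign property, consider the sketch below. SketchStrand, its member names and the values 1 and -1 are assumptions chosen for the example; they are not taken from this reference.

```python
import enum


class SketchStrand(enum.IntEnum):
    """Hypothetical strand flag (member names and values are assumptions)."""

    CODING = 1
    REVERSE = -1

    @property
    def sign(self) -> str:
        # Map the enum member to the usual one-character strand notation.
        return "+" if self is SketchStrand.CODING else "-"
```

Because the members are integers, such a flag can be passed directly wherever a numeric strand (1 or -1) is expected.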

class gecco.model.Domain(object)

A conserved region within a protein.

name

The accession of the protein domain in the source HMM.

Type:

str

start

The start coordinate of the domain within the protein sequence (first amino-acid at 1).

Type:

int

end

The end coordinate of the domain within the protein sequence (inclusive).

Type:

int

hmm

The name of the HMM library this domain belongs to (e.g. Pfam, Panther).

Type:

str

i_evalue

The independent e-value reported by hmmsearch that measures how reliable the domain annotation is.

Type:

float

pvalue

The p-value reported by hmmsearch that measures how likely the domain score is.

Type:

float

probability

The probability that this domain is part of a gene cluster, or None if no prediction has been made yet.

Type:

float, optional

cluster_weight

The weight for this domain, measuring its importance as inferred from the training clusters by the CRF model.

Type:

float, optional

go_terms

The Gene Ontology terms for this particular domain.

Type:

list of GOTerm

go_functions

The Gene Ontology term families for this particular domain. Term families are extracted by taking the highest superclasses (excluding the root) of each Gene Ontology term in the molecular_function namespace associated with this domain.

Type:

list of GOTerm

qualifiers

A dictionary of feature qualifiers that is added to the SeqFeature built from this Domain.

Type:

dict

with_probability(probability: Optional[float]) -> Domain

Copy the current domain and assign it a cluster probability.

with_cluster_weight(cluster_weight: Optional[float]) -> Domain

Copy the current domain and assign it a cluster weight.
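
The copy-and-assign pattern used by these with_* methods can be sketched with a frozen dataclass. SketchDomain is a hypothetical, reduced stand-in for Domain (it keeps only a few of the attributes listed above), and using dataclasses.replace is an assumption about one way to implement the pattern, not a description of GECCO's code.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class SketchDomain:
    """Hypothetical, reduced stand-in for a domain annotation."""

    name: str
    start: int
    end: int
    probability: Optional[float] = None
    cluster_weight: Optional[float] = None

    def with_probability(self, probability: Optional[float]) -> "SketchDomain":
        # Return a new object; the original is left untouched.
        return replace(self, probability=probability)

    def with_cluster_weight(self, cluster_weight: Optional[float]) -> "SketchDomain":
        return replace(self, cluster_weight=cluster_weight)
```

Returning a copy instead of mutating in place means that annotations computed at one stage stay valid even after a later stage assigns probabilities or weights.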

to_seq_feature(protein_coordinates: bool = False) -> SeqFeature

Convert the domain to a single feature.

Parameters:

protein_coordinates (bool) – Set to True to express the feature coordinates in amino acids, or to False to express them in nucleotides.
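
The difference between the two coordinate systems can be illustrated with a small conversion helper. This is only a sketch under stated assumptions: domain_to_nucleotide_span is a hypothetical function, gene coordinates are taken to be 1-based and inclusive, the gene is assumed to start at its first codon and to contain no introns, and the method itself may handle these cases differently.

```python
from typing import Tuple


def domain_to_nucleotide_span(
    gene_start: int,
    gene_end: int,
    reverse: bool,
    dom_start: int,
    dom_end: int,
) -> Tuple[int, int]:
    """Map a 1-based inclusive domain span (in amino acids) onto the source.

    Hypothetical helper: assumes 1-based inclusive gene coordinates and
    one codon (3 nucleotides) per amino acid, without introns.
    """
    length = 3 * (dom_end - dom_start + 1)
    if not reverse:
        # Forward strand: count codons from the leftmost nucleotide.
        nt_start = gene_start + 3 * (dom_start - 1)
        return nt_start, nt_start + length - 1
    # Reverse strand: translation starts at the rightmost nucleotide.
    nt_end = gene_end - 3 * (dom_start - 1)
    return nt_end - length + 1, nt_end
```

For a gene spanning nucleotides 101 to 400, a domain covering amino acids 1 to 10 maps to nucleotides 101 to 130 on the forward strand and to nucleotides 371 to 400 on the reverse strand.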

class gecco.model.Protein(object)

A sequence of amino-acids translated from a gene.

id

The identifier of the protein.

Type:

str

seq

The sequence of amino-acids of this protein.

Type:

Seq

domains

A list of domains found in the protein sequence.

Type:

list of Domain

to_seq_record() -> SeqRecord

Convert the protein to a single record.

with_seq(seq: Seq) -> Protein

Copy the current protein and assign it a new sequence.

with_domains(domains: Iterable[Domain]) -> Protein

Copy the current protein and assign it new domains.

class gecco.model.Gene(object)

A nucleotide sequence coding a protein.

source

The DNA sequence this gene was found in, as a Biopython record.

Type:

SeqRecord

start

The index of the leftmost nucleotide of the gene within the source sequence, independent of the strandedness.

Type:

int

end

The index of the rightmost nucleotide of the gene within the source sequence.

Type:

int

strand

The strand where the gene is located.

Type:

Strand

protein

The protein translated from this gene.

Type:

Protein

qualifiers

A dictionary of feature qualifiers that is added to the SeqFeature built from this Gene.

Type:

dict, optional

property id: str

The identifier of the gene (same as the protein identifier).

Type:

str

property average_probability: Optional[float]

The average of the domain probabilities of being in a cluster.

Type:

float, optional

property maximum_probability: Optional[float]

The highest of the domain probabilities of being in a cluster.

Type:

float, optional
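
A minimal sketch of how such optional summary statistics can be computed from per-domain probabilities follows. The function names are hypothetical, and skipping domains without a probability (rather than returning None as soon as one is missing) is an assumption that this reference does not confirm.

```python
from typing import Iterable, Optional


def sketch_average_probability(
    probabilities: Iterable[Optional[float]],
) -> Optional[float]:
    # Assumption: domains without a prediction (None) are simply skipped.
    known = [p for p in probabilities if p is not None]
    return sum(known) / len(known) if known else None


def sketch_maximum_probability(
    probabilities: Iterable[Optional[float]],
) -> Optional[float]:
    known = [p for p in probabilities if p is not None]
    return max(known) if known else None
```

Both helpers return None when no probability is available, which mirrors the Optional[float] type of the two properties.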

to_seq_feature(color: bool = True) -> SeqFeature

Convert the gene to a single feature.

with_protein(protein: Protein) -> Gene

Copy the current gene and assign it a different protein.

with_source(source: SeqRecord) -> Gene

Copy the current gene and assign it a different source.

with_probability(probability: float) -> Gene

Copy the current gene and assign it a different probability.

functions() -> Set[str]

Predict the function(s) of the gene from its domain annotations.

class gecco.model.Cluster(object)

A sequence of contiguous genes.

id

The identifier of the gene cluster.

Type:

str

genes

A list of the genes belonging to this gene cluster.

Type:

list of Gene

types

The putative types of this gene cluster, according to similarity in domain composition with curated clusters.

Type:

gecco.model.ClusterType

types_probabilities

The probability with which each cluster type was identified (same dimension as the types attribute).

Type:

list of float

property source: SeqRecord

The sequence this cluster was found in.

Type:

SeqRecord

property start: int

The start of this cluster in the source sequence.

Type:

int

property end: int

The end of this cluster in the source sequence.

Type:

int

property average_probability: Optional[float]

The average of the proteins' probabilities of being biosynthetic.

Type:

float, optional

property maximum_probability: Optional[float]

The highest of the proteins' probabilities of being biosynthetic.

Type:

float, optional

domain_composition(all_possible: Optional[Sequence[str]] = None, normalize: bool = True, minlog_weights: bool = False, pvalue: bool = True) -> NDArray[numpy.double]

Compute weighted domain composition with respect to all_possible.

Parameters:
  • all_possible (sequence of str, optional) – A sequence containing all domain names to consider when computing domain composition for the cluster. If None given, then only domains within the cluster are taken into account.

  • normalize (bool) – Normalize the composition vector so that it sums to 1.

  • minlog_weights (bool) – Compute the weight of each domain as \(-\log_{10}(v)\), where \(v\) is either the pvalue or the i_evalue, depending on the value of pvalue. Use \(1 - v\) otherwise.

  • pvalue (bool) – Compute composition weights using the pvalue of each domain, instead of the i_evalue.

Returns:

ndarray – A numerical array containing the relative domain composition of the gene cluster.
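
The weighting scheme described by these parameters can be sketched in plain Python. sketch_domain_composition is a hypothetical function that takes (name, i_evalue, pvalue) tuples instead of Domain objects and returns a list instead of a NumPy array. Summing the weights of repeated domains, ignoring domains absent from all_possible, and sorting the names when all_possible is None are assumptions made for the illustration.

```python
import math
from typing import List, Optional, Sequence, Tuple


def sketch_domain_composition(
    domains: Sequence[Tuple[str, float, float]],
    all_possible: Optional[Sequence[str]] = None,
    normalize: bool = True,
    minlog_weights: bool = False,
    pvalue: bool = True,
) -> List[float]:
    """Weighted composition from (name, i_evalue, pvalue) tuples."""
    if all_possible is None:
        # Only consider the domains present in the cluster itself.
        all_possible = sorted({name for name, _, _ in domains})
    index = {name: i for i, name in enumerate(all_possible)}
    composition = [0.0] * len(all_possible)
    for name, i_evalue, pval in domains:
        if name not in index:
            continue  # assumption: unknown domains are ignored
        v = pval if pvalue else i_evalue
        # math.log10 raises ValueError for v == 0, so a real
        # implementation would need to clamp such values first.
        weight = -math.log10(v) if minlog_weights else 1.0 - v
        composition[index[name]] += weight
    if normalize:
        total = sum(composition)
        if total > 0:
            composition = [x / total for x in composition]
    return composition
```

For instance, two domains with p-values 0.25 and 0.75 receive weights 0.75 and 0.25 under the default \(1 - v\) scheme; a domain listed in all_possible but absent from the cluster contributes 0.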

to_seq_record() -> SeqRecord

Convert the cluster to a single record.

Annotations of the source sequence are kept intact if they don’t overlap with the cluster boundaries. Component genes are added on the record as CDS features. Annotated protein domains are added as misc_feature.

Report Tables

class gecco.model.ClusterTable(collections.Sized)

A table storing condensed information from several clusters.

classmethod from_clusters(clusters: Iterable[Cluster]) -> ClusterTable

Create a new cluster table from an iterable of clusters.

class gecco.model.FeatureTable(collections.Sized)

A table storing condensed domain annotations from different genes.

classmethod from_genes(genes: Iterable[Gene]) -> FeatureTable

Create a new feature table from an iterable of genes.

to_genes() -> Iterable[Gene]

Convert a feature table to actual genes.

Since the source sequence cannot be known, a dummy sequence of length gene.end is built for each gene, so that each gene can still be converted to a SeqRecord if needed.