# Data Model

Data layer classes storing information needed for gene cluster detection.

## Python Layer

class gecco.model.ClusterType(object)

An immutable storage for the type of a gene cluster.

__init__(*names: str) -> None

Create a new cluster type from zero or more base types. Passing no name creates an unknown type.

Example

>>> t1 = ClusterType()                    # unknown type
>>> t2 = ClusterType("Polyketide")        # single type
>>> t3 = ClusterType("Polyketide", "NRP") # multiple types

unpack()

Unpack a composite ClusterType into a list of individual types.

Example

>>> ty = ClusterType("Polyketide", "Saccharide")
>>> ty.unpack()
[ClusterType('Polyketide'), ClusterType('Saccharide')]
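The behaviour shown above can be approximated with a small stand-in class. This is only a sketch: `ClusterTypeSketch` is a hypothetical name, and storing the names in a `frozenset` and returning them in sorted order are assumptions, not a description of how GECCO implements `ClusterType`.

```python
from typing import List


class ClusterTypeSketch:
    """Hypothetical stand-in for gecco.model.ClusterType (not the real class)."""

    __slots__ = ("names",)

    def __init__(self, *names: str) -> None:
        # Store the names in a frozenset so the value cannot change afterwards.
        object.__setattr__(self, "names", frozenset(names))

    def __setattr__(self, name, value):
        # Reject any later assignment to keep instances immutable.
        raise AttributeError("ClusterTypeSketch is immutable")

    def __eq__(self, other):
        return isinstance(other, ClusterTypeSketch) and self.names == other.names

    def __hash__(self):
        return hash(self.names)

    def __repr__(self):
        args = ", ".join(repr(name) for name in sorted(self.names))
        return f"ClusterTypeSketch({args})"

    def unpack(self) -> List["ClusterTypeSketch"]:
        # One single-name instance per base type, in sorted order (assumption).
        return [ClusterTypeSketch(name) for name in sorted(self.names)]
```

Because instances are hashable and compare by value, they can be used as dictionary keys, which is what a mapping from cluster type to probability would need.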

class gecco.model.Strand(enum.IntEnum)

A flag to declare on which DNA strand a gene is located.

property sign: str

The strand as a single sign (+ or -).

Type

str
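As an illustration of how an `IntEnum` can expose such a `sign` property, here is a minimal sketch. The class name `StrandSketch`, the member names, and the values `1` and `-1` are assumptions made for the example; they are not taken from the documentation above.

```python
import enum


class StrandSketch(enum.IntEnum):
    """Hypothetical stand-in for gecco.model.Strand."""

    # Member names and values are assumptions for this sketch.
    CODING = 1
    REVERSE = -1

    @property
    def sign(self) -> str:
        # Map the forward strand to "+" and the reverse strand to "-".
        return "+" if self is StrandSketch.CODING else "-"
```

Since the members are integers, `StrandSketch(-1)` looks up the reverse strand directly from a numeric strand value.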

class gecco.model.Domain(object)

A conserved region within a protein.

name

The accession of the protein domain in the source HMM.

Type

str

start

The start coordinate of the domain within the protein sequence (first amino-acid at 1).

Type

int

end

The end coordinate of the domain within the protein sequence (inclusive).

Type

int

hmm

The name of the HMM library this domain belongs to (e.g. Pfam, Panther).

Type

str

i_evalue

The independent e-value reported by hmmsearch that measures how reliable the domain annotation is.

Type

float

pvalue

The p-value reported by hmmsearch that measures how likely the domain score is.

Type

float

probability

The probability that this domain is part of a gene cluster, or None if no prediction has been made yet.

Type

float, optional

cluster_weight

The weight for this domain, measuring its importance as inferred from the training clusters by the CRF model.

Type

float, optional

go_terms

The Gene Ontology terms for this particular domain.

Type

list of GOTerm

go_functions

The Gene Ontology term families for this particular domain. Term families are extracted by taking the highest superclasses (excluding the root) of each Gene Ontology term in the molecular_function namespace associated with this domain.

Type

list of GOTerm

qualifiers

A dictionary of feature qualifiers that is added to the SeqFeature built from this Domain.

Type

dict

with_probability(probability)

Copy the current domain and assign it a cluster probability.

with_cluster_weight(cluster_weight)

Copy the current domain and assign it a cluster weight.
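The copy-and-assign pattern used by `with_probability` and `with_cluster_weight` can be sketched with a frozen dataclass. `DomainSketch` is a hypothetical, reduced stand-in with only a few of the attributes listed above, and the use of `dataclasses.replace` is an assumption about one way to get this behaviour, not a statement about GECCO's code.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class DomainSketch:
    """Hypothetical, reduced stand-in for gecco.model.Domain."""

    name: str
    start: int
    end: int
    probability: Optional[float] = None
    cluster_weight: Optional[float] = None

    def with_probability(self, probability: Optional[float]) -> "DomainSketch":
        # Return a new object; the original domain is left untouched.
        return replace(self, probability=probability)

    def with_cluster_weight(self, cluster_weight: Optional[float]) -> "DomainSketch":
        return replace(self, cluster_weight=cluster_weight)


domain = DomainSketch("PF00001", 1, 100)
scored = domain.with_probability(0.9).with_cluster_weight(1.5)
```

Returning copies rather than mutating in place lets the calls be chained, and keeps earlier results valid when the same domain is scored more than once.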

to_seq_feature(protein_coordinates: bool = False)

Convert the domain to a single feature.

Parameters

protein_coordinates (bool) – Set to True to express the feature coordinates in amino acids, or to False to express them in nucleotides.
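To make the difference between the two coordinate systems concrete, the following sketch converts a domain span given in amino acids into a nucleotide span on the source sequence. The function name `domain_nucleotide_span`, the 1-based inclusive convention for every coordinate, and the handling of the reverse strand are assumptions for illustration; the actual conversion done by `to_seq_feature` may differ.

```python
from typing import Tuple


def domain_nucleotide_span(
    gene_start: int,
    gene_end: int,
    strand: str,
    domain_start: int,
    domain_end: int,
) -> Tuple[int, int]:
    """Map a domain span in amino acids to nucleotides (hypothetical helper).

    All coordinates are assumed to be 1-based and inclusive.
    """
    if strand == "+":
        # Each amino acid covers three nucleotides, counted from the gene start.
        start = gene_start + (domain_start - 1) * 3
        end = gene_start + domain_end * 3 - 1
    else:
        # On the reverse strand the protein starts at the rightmost nucleotide.
        end = gene_end - (domain_start - 1) * 3
        start = gene_end - domain_end * 3 + 1
    return start, end
```

For a gene spanning nucleotides 101 to 400, a domain covering amino acids 1 to 10 maps to nucleotides 101 to 130 on the forward strand, and to 371 to 400 on the reverse strand.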

class gecco.model.Protein(object)

A sequence of amino-acids translated from a gene.

id

The identifier of the protein.

Type

str

seq

The sequence of amino-acids of this protein.

Type

Seq

domains

A list of domains found in the protein sequence.

Type

list of Domain

to_seq_record()

Convert the protein to a single record.

with_seq(seq: Bio.Seq.Seq)

Copy the current protein and assign it a new sequence.

with_domains(domains)

Copy the current protein and assign it new domains.

class gecco.model.Gene(object)

A nucleotide sequence coding for a protein.

source

The DNA sequence this gene was found in, as a Biopython record.

Type

SeqRecord

start

The index of the leftmost nucleotide of the gene within the source sequence, independent of the strandedness.

Type

int

end

The index of the rightmost nucleotide of the gene within the source sequence.

Type

int

strand

The strand where the gene is located.

Type

Strand

protein

The protein translated from this gene.

Type

Protein

qualifiers

A dictionary of feature qualifiers that is added to the SeqFeature built from this Gene.

Type

dict, optional

property id: str

The identifier of the gene (same as the protein identifier).

Type

str

property average_probability: Optional[float]

The average probability of the gene's domains being in a cluster.

Type

float

property maximum_probability: Optional[float]

The highest probability among the gene's domains of being in a cluster.

Type

float
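The two properties above return `Optional[float]`, which suggests they yield `None` when no domain has been scored. A minimal sketch of that aggregation follows; the function names are hypothetical, and skipping domains whose probability is `None` is an assumption rather than documented behaviour.

```python
from statistics import mean
from typing import Iterable, Optional


def average_probability(probabilities: Iterable[Optional[float]]) -> Optional[float]:
    # Ignore domains without a prediction (assumption for this sketch).
    known = [p for p in probabilities if p is not None]
    return mean(known) if known else None


def maximum_probability(probabilities: Iterable[Optional[float]]) -> Optional[float]:
    known = [p for p in probabilities if p is not None]
    return max(known) if known else None
```

With domain probabilities `[0.25, None, 0.75]`, the average is computed over the two known values only, and an empty or fully unscored list gives `None` for both functions.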

to_seq_feature(color: bool = True)

Convert the gene to a single feature.

with_protein(protein: gecco.model.Protein)

Copy the current gene and assign it a different protein.

with_source(source: Bio.SeqRecord.SeqRecord)

Copy the current gene and assign it a different source.

with_probability(probability: float)

Copy the current gene and assign it a different probability.

functions() -> Set[str]

Predict the function(s) of the gene from its domain annotations.

class gecco.model.Cluster(object)

A sequence of contiguous genes.

id

The identifier of the gene cluster.

Type

str

genes

A list of the genes belonging to this gene cluster.

Type

list of Gene

types

The putative types of this gene cluster, according to similarity in domain composition with curated clusters.

Type

gecco.model.ClusterType

types_probabilities

The probability with which each cluster type was identified (same dimension as the types attribute).

Type

property source: Bio.SeqRecord.SeqRecord

The sequence this cluster was found in.

Type

SeqRecord

property start: int

The start of this cluster in the source sequence.

Type

int

property end: int

The end of this cluster in the source sequence.

Type

int

property average_probability: Optional[float]

The average probability of the cluster's proteins being biosynthetic.

Type

float

property maximum_probability: Optional[float]

The highest probability among the cluster's proteins of being biosynthetic.

Type

float

domain_composition(all_possible=None, normalize: bool = True, minlog_weights: bool = False, pvalue: bool = True) -> NDArray[numpy.double]

Compute weighted domain composition with respect to all_possible.

Parameters
• all_possible (sequence of str, optional) – A sequence containing all domain names to consider when computing domain composition for the cluster. If None given, then only domains within the cluster are taken into account.

• normalize (bool) – Normalize the composition vector so that it sums to 1.

• minlog_weights (bool) – Compute the weight of each domain as $$-\log_{10}(v)$$ (where $$v$$ is either the pvalue or the i_evalue, depending on the value of the pvalue parameter). Use $$1 - v$$ otherwise.

• pvalue (bool) – Compute composition weights using the pvalue of each domain, instead of the i_evalue.

Returns

ndarray – A numerical array containing the relative domain composition of the gene cluster.
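The weighting scheme described by these parameters can be sketched in plain Python. This is not GECCO's implementation: the function name, the `(name, pvalue, i_evalue)` tuple layout, the use of a plain list instead of a NumPy array, and the choice to sum the weights of repeated domains are all assumptions made for the example.

```python
import math
from typing import List, Optional, Sequence, Tuple


def domain_composition_sketch(
    domains: Sequence[Tuple[str, float, float]],
    all_possible: Optional[Sequence[str]] = None,
    normalize: bool = True,
    minlog_weights: bool = False,
    pvalue: bool = True,
) -> List[float]:
    """Weighted domain composition for (name, pvalue, i_evalue) tuples."""
    if all_possible is None:
        # Only consider the domains that occur in the cluster itself.
        all_possible = sorted({name for name, _, _ in domains})
    index = {name: i for i, name in enumerate(all_possible)}
    composition = [0.0] * len(all_possible)
    for name, pv, ev in domains:
        if name not in index:
            continue
        v = pv if pvalue else ev
        weight = -math.log10(v) if minlog_weights else 1.0 - v
        # Repeated domains accumulate their weights (assumption).
        composition[index[name]] += weight
    if normalize:
        total = sum(composition)
        if total > 0:
            composition = [c / total for c in composition]
    return composition
```

Passing the same `all_possible` sequence for every cluster gives vectors of equal length with matching positions, which is what makes compositions comparable across clusters.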

to_seq_record()

Convert the cluster to a single record.

Annotations of the source sequence are kept intact if they don’t overlap with the cluster boundaries. Component genes are added on the record as CDS features. Annotated protein domains are added as misc_feature.

## Report Tables

class gecco.model.ClusterTable(collections.Sized)

A table storing condensed information from several clusters.

class Row(sequence_id: str, cluster_id: str, start: int, end: int, average_p: Optional[float], max_p: Optional[float], type: gecco.model.ClusterType, type_p: Dict[gecco.model.ClusterType, float], proteins: Optional[List[str]], domains: Optional[List[str]])

A single row in a cluster table.

sequence_id: str

Alias for field number 0

cluster_id: str

Alias for field number 1

start: int

Alias for field number 2

end: int

Alias for field number 3

average_p: Optional[float]

Alias for field number 4

max_p: Optional[float]

Alias for field number 5

type: gecco.model.ClusterType

Alias for field number 6

type_p: Dict[gecco.model.ClusterType, float]

Alias for field number 7

proteins: Optional[List[str]]

Alias for field number 8

domains: Optional[List[str]]

Alias for field number 9
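The "Alias for field number N" descriptions are the default docstrings Python generates for named tuple fields: each field can be read either by name or by position. A reduced sketch with the first four fields only (`RowSketch` is a hypothetical name, and the values are made up):

```python
from typing import NamedTuple


class RowSketch(NamedTuple):
    """Hypothetical row with the first four ClusterTable.Row fields."""

    sequence_id: str
    cluster_id: str
    start: int
    end: int


row = RowSketch("seq1", "seq1_cluster_1", 1, 1000)

# Field access by name and by index returns the same value.
same_value = row.start == row[2]

# Rows unpack like ordinary tuples.
sequence_id, cluster_id, start, end = row
```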

classmethod from_clusters(clusters)

Create a new cluster table from an iterable of clusters.

dump(fh: TextIO, dialect: str = 'excel-tab', header: bool = True) -> None

Write the table in CSV format to the given file.

Parameters

classmethod load(fh: TextIO, dialect: str = 'excel-tab')

Load a table in CSV format from a file handle in text mode.
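The `dialect` argument matches the dialect names of Python's standard `csv` module, where `'excel-tab'` produces tab-separated output. The sketch below shows a round trip through a text buffer using that dialect; `dump_rows`, `load_rows`, and the four-column layout are hypothetical and do not reproduce the columns or parsing done by `ClusterTable`.

```python
import csv
import io
from typing import Dict, List, TextIO

# Hypothetical subset of columns, for illustration only.
COLUMNS = ["sequence_id", "cluster_id", "start", "end"]


def dump_rows(rows: List[Dict[str, str]], fh: TextIO,
              dialect: str = "excel-tab", header: bool = True) -> None:
    writer = csv.writer(fh, dialect=dialect)
    if header:
        writer.writerow(COLUMNS)
    for row in rows:
        writer.writerow([row[column] for column in COLUMNS])


def load_rows(fh: TextIO, dialect: str = "excel-tab") -> List[Dict[str, str]]:
    reader = csv.reader(fh, dialect=dialect)
    columns = next(reader)  # the first line is assumed to be the header
    return [dict(zip(columns, record)) for record in reader]


rows = [{"sequence_id": "seq1", "cluster_id": "seq1_cluster_1",
         "start": "1", "end": "1000"}]
buffer = io.StringIO()
dump_rows(rows, buffer)
buffer.seek(0)
reloaded = load_rows(buffer)
```

Values come back as strings in this sketch; a real loader would also need to convert numeric columns back to their declared types.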

class gecco.model.FeatureTable(collections.Sized)

A table storing condensed domain annotations from different genes.

class Row(sequence_id: str, protein_id: str, start: int, end: int, strand: str, domain: str, hmm: str, i_evalue: float, pvalue: float, domain_start: int, domain_end: int, cluster_probability: Optional[float])

A single row in a feature table.

sequence_id: str

Alias for field number 0

protein_id: str

Alias for field number 1

start: int

Alias for field number 2

end: int

Alias for field number 3

strand: str

Alias for field number 4

domain: str

Alias for field number 5

hmm: str

Alias for field number 6

i_evalue: float

Alias for field number 7

pvalue: float

Alias for field number 8

domain_start: int

Alias for field number 9

domain_end: int

Alias for field number 10

cluster_probability: Optional[float]

Alias for field number 11

classmethod from_genes(genes)

Create a new feature table from an iterable of genes.

to_genes()

Convert a feature table to actual genes.

Since the source sequence cannot be recovered from the table, a dummy sequence of length gene.end is built for each gene, so that each gene can still be converted to a SeqRecord if needed.
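The idea of rebuilding genes from table rows can be sketched as follows. Each row describes one domain, so rows sharing a protein identifier are grouped into a single gene. The reduced row layout, the dictionary output, the `"N"` placeholder character, and the assumption that rows of the same gene are consecutive are all choices made for this example, not a description of `to_genes` itself.

```python
from itertools import groupby
from operator import itemgetter
from typing import Dict, List, Tuple

# Hypothetical reduced rows:
# (sequence_id, protein_id, start, end, strand, domain)
FeatureRow = Tuple[str, str, int, int, str, str]

rows: List[FeatureRow] = [
    ("seq1", "seq1_1", 1, 300, "+", "PF00001"),
    ("seq1", "seq1_1", 1, 300, "+", "PF00002"),
    ("seq1", "seq1_2", 400, 900, "-", "PF00003"),
]


def rows_to_genes(rows: List[FeatureRow]) -> List[Dict[str, object]]:
    genes = []
    # groupby only merges consecutive rows with the same protein identifier.
    for protein_id, group in groupby(rows, key=itemgetter(1)):
        members = list(group)
        _, _, start, end, strand, _ = members[0]
        genes.append({
            "id": protein_id,
            "start": start,
            "end": end,
            "strand": strand,
            "domains": [member[5] for member in members],
            # Placeholder source, just long enough to contain the gene.
            "source": "N" * end,
        })
    return genes


genes = rows_to_genes(rows)
```

A placeholder of length `end` is the shortest sequence in which the gene coordinates remain valid, which is why the dummy source does not need to match the length of the original contig.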