Data Model
Data layer classes storing information needed for gene cluster detection.
Python Layer
- class gecco.model.ClusterType(object)
An immutable storage for the type of a gene cluster.
- __init__(*names: str) -> None
Create a new product type from one or more base types.
Example
>>> t1 = ClusterType()                     # unknown type
>>> t2 = ClusterType("Polyketide")         # single type
>>> t3 = ClusterType("Polyketide", "NRP")  # multiple types
- unpack() -> List[gecco.model.ClusterType]
Unpack a composite ClusterType into a list of individual types.
Example
>>> ty = ClusterType("Polyketide", "Saccharide")
>>> ty.unpack()
[ClusterType('Polyketide'), ClusterType('Saccharide')]
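The composite behaviour described above can be illustrated with a small stand-in class. This is a minimal sketch, not the gecco implementation: SketchClusterType is a hypothetical name, and storing the names in a frozenset and sorting them in unpack() are assumptions made for the example.

```python
from typing import List


class SketchClusterType:
    """Hypothetical stand-in for an immutable, possibly composite, cluster type."""

    __slots__ = ("_names",)

    def __init__(self, *names: str) -> None:
        # A frozenset keeps the stored names immutable and order-independent.
        object.__setattr__(self, "_names", frozenset(names))

    def __setattr__(self, name, value):
        # Reject any later attribute assignment to keep instances immutable.
        raise AttributeError("SketchClusterType is immutable")

    def __eq__(self, other):
        return isinstance(other, SketchClusterType) and self._names == other._names

    def __hash__(self):
        return hash(self._names)

    def __repr__(self):
        args = ", ".join(repr(n) for n in sorted(self._names))
        return f"SketchClusterType({args})"

    def unpack(self) -> List["SketchClusterType"]:
        # One single-name type per base type, sorted for a stable order.
        return [SketchClusterType(n) for n in sorted(self._names)]


ty = SketchClusterType("Polyketide", "Saccharide")
parts = ty.unpack()
```

An unknown type (no names) unpacks to an empty list in this sketch, which mirrors the idea that it has no base types.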
- class gecco.model.Strand(enum.IntEnum)
A flag to declare on which DNA strand a gene is located.
- class gecco.model.Domain(object)
A conserved region within a protein.
- start
The start coordinate of the domain within the protein sequence (first amino-acid at 1).
- Type
int
- i_evalue
The independent e-value reported by hmmsearch that measures how reliable the domain annotation is.
- Type
float
- probability
The probability that this domain is part of a gene cluster, or None if no prediction has been made yet.
- Type
float, optional
- cluster_weight
The weight for this domain, measuring its importance as inferred from the training clusters by the CRF model.
- Type
float, optional
- go_functions
The Gene Ontology term families for this particular domain. Term families are extracted by taking the highest superclasses (excluding the root) of each Gene Ontology term in the molecular_function namespace associated with this domain.
- Type
list of GOTerm
- qualifiers
A dictionary of feature qualifiers that is added to the SeqFeature built from this Domain.
- Type
dict
- with_probability(probability: Optional[float]) -> gecco.model.Domain
Copy the current domain and assign it a cluster probability.
- with_cluster_weight(cluster_weight: Optional[float]) -> gecco.model.Domain
Copy the current domain and assign it a cluster weight.
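The with_* methods return a modified copy instead of mutating the object. One common way to get this behaviour in Python is a frozen dataclass combined with dataclasses.replace. The sketch below assumes that approach for illustration only; SketchDomain and its name field are hypothetical, and gecco may implement the copy differently.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class SketchDomain:
    """Hypothetical stand-in with only a few of the documented attributes."""

    name: str  # illustrative field, not taken from the documentation above
    start: int
    i_evalue: float
    probability: Optional[float] = None
    cluster_weight: Optional[float] = None

    def with_probability(self, probability: Optional[float]) -> "SketchDomain":
        # replace() builds a new instance; the original is left untouched.
        return replace(self, probability=probability)

    def with_cluster_weight(self, cluster_weight: Optional[float]) -> "SketchDomain":
        return replace(self, cluster_weight=cluster_weight)


domain = SketchDomain(name="PF00001", start=1, i_evalue=1e-10)
scored = domain.with_probability(0.9).with_cluster_weight(0.5)
```

Because each call returns a new object, the calls can be chained, and the unscored domain stays available for comparison.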
- class gecco.model.Protein(object)
A sequence of amino-acids translated from a gene.
- to_seq_record() -> Bio.SeqRecord.SeqRecord
Convert the protein to a single record.
- with_seq(seq: Bio.Seq.Seq) -> gecco.model.Protein
Copy the current protein and assign it a new sequence.
- with_domains(domains: Iterable[gecco.model.Domain]) -> gecco.model.Protein
Copy the current protein and assign it new domains.
- class gecco.model.Gene(object)
A nucleotide sequence coding a protein.
- start
The index of the leftmost nucleotide of the gene within the source sequence, independent of the strandedness.
- Type
int
- qualifiers
A dictionary of feature qualifiers that is added to the SeqFeature built from this Gene.
- Type
dict, optional
- property average_probability: Optional[float]
The average probability of the gene's domains being in a cluster.
- property maximum_probability: Optional[float]
The highest probability of any of the gene's domains being in a cluster.
- to_seq_feature(color: bool = True) -> Bio.SeqFeature.SeqFeature
Convert the gene to a single feature.
- with_protein(protein: gecco.model.Protein) -> gecco.model.Gene
Copy the current gene and assign it a different protein.
- with_source(source: Bio.SeqRecord.SeqRecord) -> gecco.model.Gene
Copy the current gene and assign it a different source.
- with_probability(probability: float) -> gecco.model.Gene
Copy the current gene and assign it a different probability.
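The average_probability and maximum_probability properties of a gene summarize the probabilities of its domains and are typed Optional[float]. A minimal sketch of such a summary is shown here, assuming that domains without a prediction (None) are skipped and that None is returned when no domain has a probability. Both function names are hypothetical helpers, not gecco functions.

```python
from statistics import mean
from typing import Iterable, Optional


def sketch_average_probability(probabilities: Iterable[Optional[float]]) -> Optional[float]:
    # Assumption: domains without a prediction (None) are ignored.
    known = [p for p in probabilities if p is not None]
    return mean(known) if known else None


def sketch_maximum_probability(probabilities: Iterable[Optional[float]]) -> Optional[float]:
    known = [p for p in probabilities if p is not None]
    return max(known) if known else None


domain_probabilities = [0.2, None, 0.6]
average = sketch_average_probability(domain_probabilities)
highest = sketch_maximum_probability(domain_probabilities)
```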
- class gecco.model.Cluster(object)
A sequence of contiguous genes.
- types
The putative types of this gene cluster, according to similarity in domain composition with curated clusters.
- types_probabilities
The probability with which each cluster type was identified (same dimension as the types attribute).
- property source: Bio.SeqRecord.SeqRecord
The sequence this cluster was found in.
- property average_probability: Optional[float]
The average probability of the cluster's proteins being biosynthetic.
- property maximum_probability: Optional[float]
The highest probability of any of the cluster's proteins being biosynthetic.
- domain_composition(all_possible: Optional[Sequence[str]] = None, normalize: bool = True, minlog_weights: bool = False, pvalue: bool = True) -> NDArray[numpy.double]
Compute weighted domain composition with respect to all_possible.
- Parameters
all_possible (sequence of str, optional) – A sequence containing all domain names to consider when computing domain composition for the cluster. If None is given, then only domains within the cluster are taken into account.
normalize (bool) – Normalize the composition vector so that it sums to 1.
minlog_weights (bool) – Compute the weight for each domain as \(-\log_{10}(v)\) (where \(v\) is either the pvalue or the i_evalue, depending on the value of pvalue). Use \(1 - v\) otherwise.
pvalue (bool) – Compute composition weights using the pvalue of each domain, instead of the i_evalue.
- Returns
ndarray – A numerical array containing the relative domain composition of the gene cluster.
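The weighting scheme described by these parameters can be sketched in plain Python. This is an illustration under stated assumptions, not the gecco implementation: sketch_domain_composition is a hypothetical function, domains are passed as (name, v) pairs where v is already the chosen p-value or e-value, weights of repeated domains are summed, and a list is returned instead of a NumPy array.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple


def sketch_domain_composition(
    domains: Sequence[Tuple[str, float]],
    all_possible: Optional[Sequence[str]] = None,
    normalize: bool = True,
    minlog_weights: bool = False,
) -> List[float]:
    """Return one weight per name in all_possible (hypothetical helper)."""
    if all_possible is None:
        # Only consider the domains seen in the cluster, in sorted order.
        all_possible = sorted({name for name, _ in domains})
    totals: Dict[str, float] = {name: 0.0 for name in all_possible}
    for name, v in domains:
        if name not in totals:
            continue  # domain outside of the requested names
        # -log10(v) is undefined for v == 0, so callers of this sketch must
        # avoid zero values when minlog_weights is enabled.
        weight = -math.log10(v) if minlog_weights else 1.0 - v
        totals[name] += weight  # assumption: repeated domains are summed
    composition = [totals[name] for name in all_possible]
    total = sum(composition)
    if normalize and total > 0:
        # Scale the vector so that its components sum to 1.
        composition = [w / total for w in composition]
    return composition


composition = sketch_domain_composition(
    [("PF00109", 0.5), ("PF02801", 0.5), ("PF00109", 0.0)],
    all_possible=["PF00109", "PF02801", "PF00550"],
)
```

Passing the same all_possible sequence for every cluster gives vectors of equal length and ordering, which is what makes compositions comparable across clusters.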
- to_seq_record() -> Bio.SeqRecord.SeqRecord
Convert the cluster to a single record.
Annotations of the source sequence are kept intact if they don’t overlap with the cluster boundaries. Component genes are added on the record as CDS features. Annotated protein domains are added as misc_feature.
Report Tables
- class gecco.model.ClusterTable(collections.Sized)
A table storing condensed information from several clusters.
- class Row(sequence_id: str, cluster_id: str, start: int, end: int, average_p: Optional[float], max_p: Optional[float], type: gecco.model.ClusterType, type_p: Dict[gecco.model.ClusterType, float], proteins: Optional[List[str]], domains: Optional[List[str]])
A single row in a cluster table.
- type: gecco.model.ClusterType
Alias for field number 6
- type_p: Dict[gecco.model.ClusterType, float]
Alias for field number 7
- classmethod from_clusters(clusters: Iterable[gecco.model.Cluster]) -> gecco.model.ClusterTable
Create a new cluster table from an iterable of clusters.
- dump(fh: TextIO, dialect: str = 'excel-tab', header: bool = True) -> None
Write the table in CSV format to the given file.
- Parameters
fh (file-like object) – A writable file-handle opened in text mode to write the table to.
dialect (str) – The CSV dialect to use. See csv.list_dialects for allowed values.
header (bool) – Whether or not to include the column header when writing the table (useful for appending to an existing table). Defaults to True.
- classmethod load(fh: TextIO, dialect: str = 'excel-tab') -> gecco.model.ClusterTable
Load a table in CSV format from a file handle in text mode.
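The dump and load round trip, including the use of header=False when appending, can be sketched with the standard csv module. The code below is only an approximation of the documented behaviour: SketchRow, sketch_dump and sketch_load are hypothetical names, and the row carries just four of the documented columns.

```python
import csv
import io
from typing import List, NamedTuple, TextIO


class SketchRow(NamedTuple):
    """Hypothetical row with a subset of the documented columns."""

    sequence_id: str
    cluster_id: str
    start: int
    end: int


def sketch_dump(rows: List[SketchRow], fh: TextIO, dialect: str = "excel-tab", header: bool = True) -> None:
    writer = csv.writer(fh, dialect=dialect)
    if header:
        # The column names come straight from the namedtuple fields.
        writer.writerow(SketchRow._fields)
    writer.writerows(rows)


def sketch_load(fh: TextIO, dialect: str = "excel-tab") -> List[SketchRow]:
    reader = csv.DictReader(fh, dialect=dialect)
    # CSV cells are read as strings, so integer columns are converted back.
    return [
        SketchRow(r["sequence_id"], r["cluster_id"], int(r["start"]), int(r["end"]))
        for r in reader
    ]


buffer = io.StringIO()
sketch_dump([SketchRow("seq1", "seq1_cluster_1", 100, 2500)], buffer)
# Append a second batch without repeating the column header.
sketch_dump([SketchRow("seq1", "seq1_cluster_2", 9000, 12000)], buffer, header=False)
buffer.seek(0)
rows = sketch_load(buffer)
```

The names accepted for the dialect argument can be listed with csv.list_dialects(), which includes 'excel-tab' by default.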
- class gecco.model.FeatureTable(collections.Sized)
A table storing condensed domain annotations from different genes.
- class Row(sequence_id: str, protein_id: str, start: int, end: int, strand: str, domain: str, hmm: str, i_evalue: float, pvalue: float, domain_start: int, domain_end: int, cluster_probability: Optional[float])
A single row in a feature table.
- classmethod from_genes(genes: Iterable[gecco.model.Gene]) -> gecco.model.FeatureTable
Create a new feature table from an iterable of genes.
- to_genes() -> Iterable[gecco.model.Gene]
Convert a feature table to actual genes.
Since the source sequence cannot be known, a dummy sequence of size gene.end is built for each gene, so that each gene can still be converted to a SeqRecord if needed.
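The idea of rebuilding genes from per-domain rows with a placeholder source can be sketched as follows. Everything here is an assumption made for illustration: SketchFeatureRow and sketch_group_rows are hypothetical, rows are grouped by protein_id, genes are plain dictionaries, and the dummy sequence is a string of "N" characters, which is not necessarily what gecco builds.

```python
from typing import Dict, List, NamedTuple


class SketchFeatureRow(NamedTuple):
    """Hypothetical row with a subset of the documented columns."""

    protein_id: str
    start: int
    end: int
    domain: str


def sketch_group_rows(rows: List[SketchFeatureRow]) -> Dict[str, dict]:
    genes: Dict[str, dict] = {}
    for row in rows:
        if row.protein_id not in genes:
            genes[row.protein_id] = {
                "start": row.start,
                "end": row.end,
                # Placeholder source of length `end`, so that the gene
                # coordinates remain valid positions on it.
                "source": "N" * row.end,
                "domains": [],
            }
        # Each row carries one domain annotation of the gene.
        genes[row.protein_id]["domains"].append(row.domain)
    return genes


genes = sketch_group_rows(
    [
        SketchFeatureRow("seq1_1", 100, 900, "PF00109"),
        SketchFeatureRow("seq1_1", 100, 900, "PF02801"),
        SketchFeatureRow("seq1_2", 1000, 1600, "PF00550"),
    ]
)
```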