# Data Model¶

Data layer classes storing information needed for BGC detection.

## Python Layer¶

class gecco.model.ProductType(enum.IntFlag)[source]

A flag to declare the type of product synthesized by a gene cluster.

unpack()[source]

Unpack a composite ProductType into a list of individual types.

Example

>>> ty = ProductType.Polyketide | ProductType.Saccharide
>>> ty.unpack()
[<ProductType.Polyketide: 4>, <ProductType.Saccharide: 16>]

class gecco.model.Strand(enum.IntEnum)[source]

A flag to declare on which DNA strand a gene is located.

property sign

The strand as a single sign (+ or -).

Type

str

class gecco.model.Domain(object)[source]

A conserved region within a protein.

name

The accession of the protein domain in the source HMM.

Type

str

start

The start coordinate of the domain within the protein sequence (first amino-acid at 1).

Type

int

end

The end coordinate of the domain within the protein sequence (inclusive).

Type

int

hmm

The name of the HMM library this domain belongs to (e.g. Pfam, Panther).

Type

str

i_evalue

The independent e-value reported by hmmsearch that measures how reliable the domain annotation is.

Type

float

pvalue

The p-value reported by hmmsearch that measure how likely the domain score is.

Type

float

probability

The probability that this domain is part of a BGC, or None if no prediction has been made yet.

Type

float, optional

qualifiers

A dictionary of feature qualifiers that is added to the SeqFeature built from this Domain.

Type

dict, optional

with_probability(probability: Optional[float]) → gecco.model.Domain[source]

Copy the current domain and assign it a BGC probability.

to_seq_feature(protein_coordinates: bool = False) → Bio.SeqFeature.SeqFeature[source]

Convert the domain to a single feature.

Parameters

protein_coordinates (bool) – Set to True for the feature coordinates to be given in amino-acids, or to False in nucleotides.

class gecco.model.Protein(object)[source]

A sequence of amino-acids translated from a gene.

id

The identifier of the protein.

Type

str

seq

The sequence of amino-acids of this protein.

Type

Seq

domains

A list of domains found in the protein sequence.

Type
to_seq_record() → Bio.SeqRecord.SeqRecord[source]

Convert the protein to a single record.

class gecco.model.Gene(object)[source]

A nucleotide sequence coding a protein.

source

The DNA sequence this gene was found in, as a Biopython record.

Type

SeqRecord

start

The index of the leftmost nucleotide of the gene within the source sequence, independent of the strandedness.

Type

int

end

The index of the rightmost nucleotide of the gene within the source sequence.

Type

int

strand

The strand where the gene is located.

Type

Strand

protein

The protein translated from this gene.

Type

Protein

qualifiers

A dictionary of feature qualifiers that is added to the SeqFeature built from this Gene.

Type

dict, optional

property id

The identifier of the gene (same as the protein identifier).

Type

str

property average_probability

The average of domain probabilities of being biosynthetic.

Type

float

property maximum_probability

The highest of domain probabilities of being biosynthetic.

Type

float

to_seq_feature() → Bio.SeqFeature.SeqFeature[source]

Convert the gene to a single feature.

class gecco.model.Cluster(object)[source]

A sequence of contiguous genes with biosynthetic activity.

id

The identifier of the gene cluster.

Type

str

genes

A list of the genes belonging to this gene cluster.

Type
types

The putative types of product synthesized by this gene cluster, according to similarity in domain composition with curated clusters.

Type

gecco.model.ProductType

types_probabilities

The probability with which each BGC type was identified (same dimension as the types attribute).

Type
property source

The sequence this cluster was found in.

Type

SeqRecord

property start

The start of this cluster in the source sequence.

Type

int

property end

The end of this cluster in the source sequence.

Type

int

property average_probability

The average of proteins probability of being biosynthetic.

Type

float

property maximum_probability

The highest of proteins probability of being biosynthetic.

Type

float

domain_composition()[source]

Compute weighted domain composition with respect to all_possible.

Parameters
• all_possible (sequence of str, optional) – A sequence containing all domain names to consider when computing domain composition for the BGC. If None given, then only domains within the cluster are taken into account.

• normalize (bool) – Normalize the composition vector so that it sums to 1.

• minlog_weights (bool) – Compute weight for each domain as $$-log_10(v)$$ (where $$v$$ is either the pvalue or the i_evalue, depending on the value of normalize). Use $$1 - v$$ otherwise.

• pvalue (bool) – Compute composition weights using the pvalue of each domain, instead of the i_evalue.

Returns

ndarray – A numerical array containing the relative domain composition of the BGC.

to_seq_record() → Bio.SeqRecord.SeqRecord[source]

Convert the cluster to a single record.

Annotations of the source sequence are kept intact if they don’t overlap with the cluster boundaries. Component genes are added on the record as CDS features. Annotated protein domains are added as misc_feature.

## Report Tables¶

class gecco.model.ClusterTable(collections.Sized)[source]

A table storing condensed information from several clusters.

class Row[source]

A single row in a cluster table.

sequence_id

Alias for field number 0

bgc_id

Alias for field number 1

start

Alias for field number 2

end

Alias for field number 3

average_p

Alias for field number 4

max_p

Alias for field number 5

type

Alias for field number 6

alkaloid_probability

Alias for field number 7

polyketide_probability

Alias for field number 8

ripp_probability

Alias for field number 9

saccharide_probability

Alias for field number 10

terpene_probability

Alias for field number 11

nrp_probability

Alias for field number 12

other_probability

Alias for field number 13

proteins

Alias for field number 14

domains

Alias for field number 15

classmethod from_clusters()[source]

Create a new cluster table from an iterable of clusters.

dump(fh: TextIO, dialect: str = 'excel-tab', header: bool = True) → None[source]

Write the cluster table in CSV format to the given file.

Parameters
classmethod load(fh: TextIO, dialect: str = 'excel-tab') → gecco.model.ClusterTable[source]

Load a cluster table in CSV format from a file handle in text mode.

class gecco.model.FeatureTable(collections.Sized)[source]

A table storing condensed domain annotations from different genes.

class Row[source]

A single row in a feature table.

sequence_id

Alias for field number 0

protein_id

Alias for field number 1

start

Alias for field number 2

end

Alias for field number 3

strand

Alias for field number 4

domain

Alias for field number 5

hmm

Alias for field number 6

i_evalue

Alias for field number 7

pvalue

Alias for field number 8

domain_start

Alias for field number 9

domain_end

Alias for field number 10

bgc_probability

Alias for field number 11

classmethod from_genes()[source]

Create a new feature table from an iterable of genes.

to_genes()[source]

Convert a feature table to actual genes.

Since the source sequence cannot be known, a dummy sequence is built for each gene of size gene.end, so that each gene can still be converted to a SeqRecord if needed.

to_dataframe() → pandas.core.frame.DataFrame[source]

Convert the feature table to a DataFrame.

Raises

ImportError – if the pandas module could not be imported.

dump(fh: TextIO, dialect: str = 'excel-tab', header: bool = True) → None[source]

Write the feature table in CSV format to the given file.

Parameters
classmethod load(fh: TextIO, dialect: str = 'excel-tab') → gecco.model.FeatureTable[source]

Load a feature table in CSV format from a file handle in text mode.