Integrations
The files written by GECCO are standard TSV and GenBank files, so they should be easy to use in downstream analyses. Some common use cases are nevertheless covered out of the box to reduce the need for custom scripts.
Genome Feature
GECCO outputs tables containing the location of BGCs in TSV format to retain
as much metadata as possible for each predicted BGC. These tables are easy to
manipulate with a library such as pandas or polars. However, most biology
visualization tools can load arbitrary features in GFF format, which may be
used to visualize predicted BGCs across an entire genome.
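As a rough sketch of the table manipulation mentioned above, the snippet below parses a clusters table with Python's standard csv module and keeps only the clusters above a probability cutoff. The column names (sequence_id, cluster_id, start, end, average_p), the sample rows, and the load_clusters helper are assumptions made for illustration. Check the header of your own *.clusters.tsv file before relying on them. The same filtering could be written with pandas or polars.

```python
import csv
import io

# Hypothetical excerpt of a clusters table; real GECCO tables contain more
# columns, and the column names used here are assumptions.
SAMPLE_TSV = (
    "sequence_id\tcluster_id\tstart\tend\taverage_p\n"
    "KC188778.1\tKC188778.1_cluster_1\t3\t33953\t0.95\n"
    "KC188778.1\tKC188778.1_cluster_2\t40000\t52000\t0.62\n"
)


def load_clusters(handle, min_average_p=0.0):
    """Return cluster rows whose average probability is at least the cutoff."""
    clusters = []
    for row in csv.DictReader(handle, delimiter="\t"):
        # Convert the numeric columns from text to Python numbers.
        row["start"] = int(row["start"])
        row["end"] = int(row["end"])
        row["average_p"] = float(row["average_p"])
        if row["average_p"] >= min_average_p:
            clusters.append(row)
    return clusters


if __name__ == "__main__":
    for cluster in load_clusters(io.StringIO(SAMPLE_TSV), min_average_p=0.8):
        length = cluster["end"] - cluster["start"] + 1
        print(cluster["cluster_id"], length)
```

To run it on a real table, pass an open file handle instead of the in-memory sample, for instance open("output_dir/KC188778.1.clusters.tsv", newline="").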
Since v0.9.7, GECCO offers the option to generate a GFF file from its output:
$ gecco run -g KC188778.1.gbk -o output_dir
$ gecco convert clusters -i output_dir --format gff
The output folder will contain an additional GFF file with each cluster as a single feature:
$ tree output_dir
output_dir
├── KC188778.1_cluster_1.gbk
├── KC188778.1.clusters.gff
├── KC188778.1.clusters.tsv
└── KC188778.1.features.tsv
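To illustrate how such a file can be consumed outside a genome browser, here is a minimal GFF3 reader using only the Python standard library. The nine tab-separated columns follow the GFF3 specification. The sample line (its source, type, and attributes) is made up for the example and may differ from what GECCO actually writes, and parse_gff3 is a hypothetical helper name.

```python
from urllib.parse import unquote

GFF_COLUMNS = ("seqid", "source", "type", "start", "end",
               "score", "strand", "phase", "attributes")

# Made-up feature line for illustration; the source, type and attribute
# values that GECCO actually writes may differ.
SAMPLE_GFF = (
    "##gff-version 3\n"
    "KC188778.1\tGECCO\tBGC\t3\t33953\t.\t.\t.\tID=KC188778.1_cluster_1\n"
)


def parse_gff3(text):
    """Yield one dict per feature line, skipping comments and blank lines."""
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        record = dict(zip(GFF_COLUMNS, line.split("\t")))
        record["start"] = int(record["start"])
        record["end"] = int(record["end"])
        # Attributes are 'key=value' pairs separated by semicolons.
        attributes = {}
        for pair in record["attributes"].split(";"):
            if "=" in pair:
                key, value = pair.split("=", 1)
                attributes[unquote(key)] = unquote(value)
        record["attributes"] = attributes
        yield record


if __name__ == "__main__":
    for feature in parse_gff3(SAMPLE_GFF):
        print(feature["seqid"], feature["start"], feature["end"],
              feature["attributes"].get("ID"))
```

For a real run, read the text of output_dir/KC188778.1.clusters.gff and pass it to parse_gff3.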
Feature Coloring
Starting from v0.9.6, GECCO will attempt to color the gene features in the
output GenBank files based on their molecular function. GenBank has no
standard way of doing so, but many programs offer their own way of coloring
features. At the moment, GECCO outputs color qualifiers that should be
supported by APE, EasyFig, Benchling or SnapGene, as shown below:
The color code is the same as in MIBiG: green for regulatory proteins, blue for transporters, red for biosynthetic proteins, and grey for proteins of unknown function.
AntiSMASH
Since v0.7.0, GECCO can natively output JSON files that can be loaded into
the AntiSMASH viewer as external annotations. To do so, run your analysis
with the --antismash-sideload option to generate an additional file:
$ gecco run -g KC188778.1.gbk -o output_dir --antismash-sideload
The output folder will contain an additional JSON file compared to usual runs:
$ tree output_dir
output_dir
├── KC188778.1_cluster_1.gbk
├── KC188778.1.clusters.tsv
├── KC188778.1.features.tsv
└── KC188778.1.sideload.json
0 directories, 4 files
That JSON file can be loaded into the AntiSMASH result viewer. Check
"Upload extra annotations", and upload the *.sideload.json file:
When AntiSMASH is done processing your sequences, the Web viewer will display BGCs found by GECCO as subregions next to the AntiSMASH clusters.
GECCO-specific metadata (such as the probability of the predicted type) and
configuration (recording the --threshold and --cds values passed to the
gecco run command) can be seen in the dedicated GECCO tab.
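Before uploading, it can be handy to check what a sideload file contains. The sketch below lists the subregions found in such a document. The key names (records, name, subregions, start, end, label) reflect our understanding of the antiSMASH sideload JSON layout and are assumptions here, as are the sample values and the list_subregions helper. Inspect a real *.sideload.json file to confirm its structure.

```python
import json

# Hypothetical document shaped like the antiSMASH sideload format as we
# understand it; the keys and values in a real GECCO file may differ.
SAMPLE_SIDELOAD = json.dumps({
    "tool": {"name": "GECCO", "configuration": {"threshold": "0.8"}},
    "records": [
        {
            "name": "KC188778.1",
            "subregions": [
                {"start": 3, "end": 33953, "label": "Polyketide"},
            ],
        }
    ],
})


def list_subregions(text):
    """Return (record name, start, end, label) tuples from sideload JSON."""
    document = json.loads(text)
    rows = []
    for record in document.get("records", []):
        for subregion in record.get("subregions", []):
            rows.append((
                record.get("name"),
                subregion.get("start"),
                subregion.get("end"),
                subregion.get("label"),
            ))
    return rows


if __name__ == "__main__":
    for row in list_subregions(SAMPLE_SIDELOAD):
        print(*row)
```

Missing keys yield empty results or None values rather than errors, so the helper degrades gracefully if the real layout differs from the assumed one.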
BiG-SLiCE
GECCO outputs GenBank files that only contain standard features, but BiG-SLiCE requires additional metadata to load BGCs for analysis.
Since v0.7.0, the gecco convert subcommand can convert GenBank files
obtained with a typical GECCO run into files that can be loaded by BiG-SLiCE.
Just run the command after gecco run, using the same folder as the input:
$ gecco run -g KC188778.1.gbk -o bigslice_dir/dataset_1/KC188778.1/
$ gecco convert gbk -i bigslice_dir/dataset_1/KC188778.1/ --format bigslice
This will create a new region file for each GenBank file, which will be detected by BiG-SLiCE. Provided you organised the folders in the appropriate structure, it should look like this:
$ tree bigslice_dir
bigslice_dir
├── dataset_1
│   └── KC188778.1
│       ├── KC188778.1_cluster_1.gbk
│       ├── KC188778.1.clusters.tsv
│       ├── KC188778.1.features.tsv
│       └── KC188778.1.region1.gbk
├── datasets.tsv
└── taxonomy
    └── dataset_1_taxonomy.tsv
3 directories, 6 files
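A quick way to verify the folder structure before launching BiG-SLiCE is sketched below. It only checks the elements visible in the tree above (a datasets.tsv file, a taxonomy folder, and region files two levels below the root). The helper names find_region_files and check_layout are hypothetical, and the script does not validate the contents of any file.

```python
from pathlib import Path
import tempfile


def find_region_files(root):
    """Map 'dataset/genome' folders to the region GenBank files they hold."""
    root = Path(root)
    regions = {}
    # Layout assumed from the tree above: <root>/<dataset>/<genome>/*.regionN.gbk
    for path in sorted(root.glob("*/*/*.region*.gbk")):
        key = f"{path.parent.parent.name}/{path.parent.name}"
        regions.setdefault(key, []).append(path.name)
    return regions


def check_layout(root):
    """Return a list of problems found in the folder structure."""
    root = Path(root)
    problems = []
    if not (root / "datasets.tsv").is_file():
        problems.append("missing datasets.tsv")
    if not (root / "taxonomy").is_dir():
        problems.append("missing taxonomy folder")
    if not find_region_files(root):
        problems.append("no *.region*.gbk files found")
    return problems


if __name__ == "__main__":
    # Build a throwaway copy of the example layout to exercise the helpers.
    with tempfile.TemporaryDirectory() as tmp:
        genome = Path(tmp) / "dataset_1" / "KC188778.1"
        genome.mkdir(parents=True)
        (genome / "KC188778.1.region1.gbk").touch()
        (Path(tmp) / "datasets.tsv").touch()
        (Path(tmp) / "taxonomy").mkdir()
        print(find_region_files(tmp))
        print(check_layout(tmp))
```

On a real dataset, call check_layout("bigslice_dir") and fix any reported problem before starting BiG-SLiCE.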
BiG-SLiCE will be able to load and render the BGCs found by GECCO:
Warning
Because of the way BiG-SLiCE loads BGCs coming from GECCO, they are always marked as being fragmented.