Engineering Biology in Cambridge

 

The Idea:

The idea is to develop a computational system and corresponding web site to couple information on metabolic and biochemical networks within bacteria (focussing initially on cyanobacteria) with networks of protein evolutionary history and homology. The system is designed to be easily used by biologists with minimal computing experience. The main principle is that the variability in the enzymatic components of pathways across a large sample of different genomes provides a valuable resource for the understanding and manipulation of biosynthesis.

We aim to create a web-based tool to easily analyse the gene and protein families involved in the complete multi-organism network of biosynthetic pathways across hundreds of genomes. This will allow points of interest, such as conserved, specialist and missing biosynthetic steps, to be quickly and easily identified within a clade of organisms. Thus the synthetic pathways within individual species can be better understood and underpinned with concrete data on genes, homologues and so on.
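As a minimal sketch of how conserved, specialist and missing steps might be flagged, the Python below labels each pathway step by how many genomes in a clade carry its enzyme family. The function name, genome labels and step labels are hypothetical, and the three definitions used here (present in all genomes, in some, in none) are assumptions made for illustration rather than the project's own criteria.

```python
from typing import Dict, Set


def classify_steps(presence: Dict[str, Set[str]], genomes: Set[str]) -> Dict[str, str]:
    """Label each pathway step by how widely its enzyme family occurs in the clade."""
    labels = {}
    for step, found_in in presence.items():
        hits = found_in & genomes  # ignore genomes outside the clade of interest
        if hits == genomes:
            labels[step] = "conserved"   # assumption: present in every genome
        elif hits:
            labels[step] = "specialist"  # assumption: present in only some genomes
        else:
            labels[step] = "missing"     # assumption: absent from the whole clade
    return labels


# Hypothetical clade of three genomes and three pathway steps.
genomes = {"genome_A", "genome_B", "genome_C"}
presence = {
    "step_1": {"genome_A", "genome_B", "genome_C"},
    "step_2": {"genome_B"},
    "step_3": set(),
}
labels = classify_steps(presence, genomes)
```

A real analysis would probably work with graded conservation rather than three hard categories, but the same presence/absence table underlies both.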

Importantly, the genome-specific differences will enable identification of useful pathway components that can be recombined in novel ways to introduce foreign biosynthetic pathways into an organism of interest.

This project involves the interconnection of two networks of information: (1) the traditional biochemical network of enzyme-driven metabolic pathways and (2) the evolutionary connections between homologous genes and proteins across a wide range of organisms. These networks can readily be described by a computational system that models each enzyme as a "node" with connections to other nodes. In the first case the connections correspond to metabolite (product-substrate) links between genes, and in the second case the connections represent significant sequence similarity, from which homology can be inferred. These networks will be computed and/or curated and stored in a database that will then underpin the web-based tool. Gene families not involved in metabolic pathways will also be collated and stored in the same system, and though not presented in exactly the same way, they will be equally accessible.
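One way to picture the two networks sharing a single set of enzyme nodes is the sketch below, which uses Python's built-in `sqlite3` module. The table names, column names and example rows are all hypothetical and chosen only for illustration; the actual schema is not described here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE enzyme (
    id INTEGER PRIMARY KEY,
    genome TEXT NOT NULL,
    name TEXT NOT NULL
);
-- Network 1: the product of one enzyme is the substrate of the next.
CREATE TABLE metabolic_link (
    producer INTEGER REFERENCES enzyme(id),
    consumer INTEGER REFERENCES enzyme(id),
    metabolite TEXT NOT NULL
);
-- Network 2: significant sequence similarity between two enzymes.
CREATE TABLE homology_link (
    enzyme_a INTEGER REFERENCES enzyme(id),
    enzyme_b INTEGER REFERENCES enzyme(id),
    score REAL NOT NULL
);
""")

# Made-up example data: two enzymes in one genome, one homologue in another.
conn.executemany("INSERT INTO enzyme VALUES (?, ?, ?)", [
    (1, "genome_A", "enzA1"),
    (2, "genome_A", "enzA2"),
    (3, "genome_B", "enzB1"),
])
conn.execute("INSERT INTO metabolic_link VALUES (1, 2, 'metabolite_X')")
conn.execute("INSERT INTO homology_link VALUES (1, 3, 250.0)")


def downstream(conn, enzyme_id):
    """Enzymes that consume a product of the given enzyme (metabolic network)."""
    return conn.execute(
        "SELECT e.name, m.metabolite FROM metabolic_link m "
        "JOIN enzyme e ON e.id = m.consumer WHERE m.producer = ?",
        (enzyme_id,),
    ).fetchall()


def homologues(conn, enzyme_id):
    """Enzymes linked to the given enzyme by similarity, in either direction."""
    return conn.execute(
        "SELECT e.genome, e.name FROM homology_link h "
        "JOIN enzyme e ON e.id = CASE WHEN h.enzyme_a = ? "
        "THEN h.enzyme_b ELSE h.enzyme_a END "
        "WHERE h.enzyme_a = ? OR h.enzyme_b = ? "
        "ORDER BY e.genome, e.name",
        (enzyme_id, enzyme_id, enzyme_id),
    ).fetchall()
```

Because both edge tables point at the same `enzyme` rows, a query can step along a pathway through `metabolic_link` and then across genomes through `homology_link`, which is the coupling the project describes.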

The web site will display the information in a manner that aims to illustrate how biosynthetic pathways differ between genomes. The presentation of data will be primarily visual, anchoring it to a metabolic pathway map: either a complete multi-genome "superset" map or just a subset, focussing on a particular pathway. The display of gene presence/absence and conservation across the genomes will also be graphical, including for genes not involved in a metabolic pathway. The user will then be able to investigate the full depth of sequence-based information, connect to external bioinformatics databases (GenBank, UniProt etc.) and extract family trees and alignments for any and all points of interest. All data will be downloadable as spreadsheets and in a variety of popular bioinformatics formats. The metabolic map may also be overlaid with other, orthogonal data (e.g. transcriptomics, ChIP-seq, metabolomics) that can be anchored to the genes, so the information can be displayed and analysed in a pathway-wide context, rather than in the more common linear genome or array-based representations. Free, open-source software allowing mapping of genomics data to a complete metabolic gene map (and also to genes not involved in metabolism) is currently not available and would make many biological analyses simpler.
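To illustrate the spreadsheet download mentioned above, here is a small sketch that writes a gene-family presence/absence matrix as CSV using Python's standard `csv` module. The function name, family labels and genome labels are hypothetical, and the 1/0 encoding is an assumption rather than a format the project has specified.

```python
import csv
import io


def presence_matrix_csv(families, genomes):
    """Write one row per gene family and one column per genome (1 = present, 0 = absent)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["family"] + list(genomes))
    for family in sorted(families):
        row = [family] + [1 if g in families[family] else 0 for g in genomes]
        writer.writerow(row)
    return buffer.getvalue()


# Made-up example: which genomes contain a member of each family.
genomes = ["genome_A", "genome_B"]
families = {"family_2": {"genome_B"}, "family_1": {"genome_A", "genome_B"}}
table = presence_matrix_csv(families, genomes)
```

The same matrix could feed the graphical presence/absence display, with the CSV form offered as the downloadable version.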

David Lea-Smith will be primarily responsible for creating a holistic biosynthetic pathway map from a wide-ranging review of databases and published literature relating to metabolism and biosynthetic pathways. This will be completed for cyanobacteria in the first instance. Escherichia coli, the best-annotated bacterium, and the model cyanobacteria Synechocystis sp. PCC 6803 and Nostoc sp. PCC 7120 will be used to create the anchor points for a consistent annotation of homologous gene clusters. David will also be involved in the preliminary testing of the web site and will provide feedback to ensure that the eventual outcome suits the needs of the biological community. The clustering of genes into families naturally aims to make clear the homology relationships between different proteins, e.g. to identify orthologues. Where the traditional naming of genes and proteins is either inconsistent or missing, combining knowledge about orthology with knowledge of where a protein is likely to act in a particular pathway makes the identity of any component unambiguous. Thus, as a necessary side-effect, the system will generate a single consistent nomenclature for all of the enzymes in the network of pathways, across all the organisms of interest. In the future this could provide a basis for automatically annotating new genome sequences within the same scheme.
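The idea of anchoring a consistent nomenclature to well-annotated reference genes can be sketched as follows. This is only an illustration under stated assumptions: the function name, gene identifiers and family names are hypothetical, and treating clusters with no anchor or with conflicting anchors as "unresolved" is one possible policy, not the project's stated rule.

```python
def name_clusters(clusters, anchor_names):
    """Propagate curated anchor names to every gene in the same homologous cluster.

    Returns (names, unresolved): names maps gene id -> family name, and
    unresolved lists the indices of clusters that contain no anchor gene or
    anchors with conflicting names, which would need manual curation.
    """
    names, unresolved = {}, []
    for index, cluster in enumerate(clusters):
        found = {anchor_names[g] for g in cluster if g in anchor_names}
        if len(found) == 1:
            family = next(iter(found))
            for gene in cluster:
                names[gene] = family
        else:
            unresolved.append(index)
    return names, unresolved


# Made-up clusters; only the "ref_" genes carry curated names.
clusters = [
    {"ref_gene_1", "gB_0042", "gC_0007"},  # one anchor: name propagates
    {"gB_0100", "gC_0200"},                # no anchor: unresolved
    {"ref_gene_2", "ref_gene_3"},          # two different anchors: unresolved
]
anchor_names = {"ref_gene_1": "enzymeX", "ref_gene_2": "enzymeY", "ref_gene_3": "enzymeZ"}
names, unresolved = name_clusters(clusters, anchor_names)
```

Conflicting anchors within one cluster may indicate that the cluster sits too high in the family hierarchy, so resolving them could mean descending to a tighter sub-cluster rather than choosing one name.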

Tim Stevens will be responsible for the bioinformatics analyses and for presenting the pathway maps as a website. This work involves several sequential steps: comparison of all protein sequences from all genomes under study to generate a matrix of detectable similarities; the hierarchical clustering of sequences into family groups; the detection of remote homology and identification of missing genes; and the connection of clusters, and thus also individual genes, to anchor points on the biosynthetic pathway map. All of this information will be stored in an SQL database (as is standard) and presented as an interactive, searchable, graphically oriented web page. A hierarchical approach will be used for clustering protein sequences into family groups because the amount of conservation within a family can vary substantially from case to case. Also, by moving up and down a familial hierarchy, a user of the website will be able to see how specialisation of function arises as species and sequences diverge. This will allow sequence variation (or absence) to be related to metabolic capabilities. Initial work will focus on cyanobacteria because they are a mainstay of the Howe lab and because a large amount of analysis of cyanobacterial synthetic pathways has already been performed, with a practical application towards generating biofuels. This group of bacteria will also serve as a test bed for the system, allowing problems to be fixed and the web site refined before the project is opened up to bacteria at large. The project must be of limited scope to be achievable within a limited time period, and so will focus on a subset of Bacteria, but the technology would naturally be expandable to further clades, including those from Archaea and Eukarya.
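The clustering method is not specified here, so the following is only a sketch of one simple possibility: single-linkage clustering on a pairwise similarity matrix, where lowering the threshold merges families and so corresponds to moving up the hierarchy. The function name, sequence identifiers and similarity scores are all made up for illustration.

```python
def cluster_at(similarity, sequences, threshold):
    """Single-linkage clusters: join any pair whose similarity is >= threshold."""
    parent = {s: s for s in sequences}

    def find(s):
        # Follow parent links to the cluster representative (with path halving).
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    for (a, b), score in similarity.items():
        if score >= threshold:
            parent[find(a)] = find(b)  # merge the two clusters

    groups = {}
    for s in sequences:
        groups.setdefault(find(s), set()).add(s)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical pairwise similarities on a 0-1 scale; absent pairs are undetectable.
sequences = ["p1", "p2", "p3", "p4"]
similarity = {("p1", "p2"): 0.9, ("p2", "p3"): 0.5, ("p3", "p4"): 0.1}

tight = cluster_at(similarity, sequences, 0.8)  # close families only
loose = cluster_at(similarity, sequences, 0.4)  # one level up the hierarchy
```

Running the same function at a series of thresholds gives nested groupings, so a family seen at a strict threshold is always contained in a single family at a looser one. Other linkage rules would give different hierarchies, and this sketch does not attempt the remote-homology detection step.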

System with bacteria metabolic and biochemical network information: full application (.pdf)