Engineering Biology in Cambridge

 

To develop a hybrid approach to solve multi-parametric experimental design problems via a simple-to-use and freely available graphical user interface (GUI) to empower a wider audience of experimental biologists to employ genetic algorithms in solving their optimisation problems.

The Idea

Biological problems are usually complex because they are multi-parametric and because their parameters are often interdependent. A common way of attacking such problems is to use background knowledge, or informed guesswork, to prioritise the parameters. For novel systems there may be insufficient background knowledge to enable successful prioritisation. Moreover, identifying and testing the effect of individual parameters one at a time is often ineffective because it ignores the interactive effects of mutually dependent parameters.

Design of experiments (DoE) is currently the most commonly employed methodology for optimising multi-parametric design problems. The statistical DoE approach assumes that the factors affecting the optimum are known ab initio. This means that the factors need to be prioritised in advance and limited to a manageable number, each tested at a limited number of levels. A factorial design is the first step of this approach: the factors are tested at combinations of their assigned levels (every combination in a full factorial design, a structured subset in a fractional factorial design) to define the boundaries of the solution space. Then, additional values of the variables are included either outside the solution space, at the same distance from the centre point as the corners (central composite circumscribed design; CCC), or on the surface of the solution space between the initial values used in the factorial design (central composite face-centred design; CCF). In the next stage, response surface methodology (RSM) is employed to fit polynomial models to the data obtained from these experiments, using either multiple linear regression or partial least squares regression (Mandenius and Brundin 2008). The resulting models are then used to optimise the output parameter or to evaluate the sensitivity of the output to changes in individual input parameters. A number of commercially available software packages implement DoE, making it an attractive approach that simplifies this complex problem for experimental scientists in both academia and industry.
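As an illustration of the two composite designs described above, the following minimal sketch generates the design points in coded units, where -1 and +1 are the low and high factorial levels. The function name `central_composite` is hypothetical, and the sketch assumes two-level corner points and a single centre point; commercial DoE packages may choose the axial distance differently.

```python
from itertools import product
import math

def central_composite(n_factors, kind="CCC"):
    """Design points in coded units, where -1 and +1 are the factorial levels."""
    # Corner points: two-level full factorial (an assumption of this sketch).
    corners = [tuple(float(v) for v in p)
               for p in product((-1, 1), repeat=n_factors)]
    # CCC: axial points are as far from the centre as the corners, sqrt(n).
    # CCF: axial points lie on the faces of the factorial cube, distance 1.
    alpha = math.sqrt(n_factors) if kind == "CCC" else 1.0
    axial = []
    for i in range(n_factors):
        for sign in (-1.0, 1.0):
            point = [0.0] * n_factors
            point[i] = sign * alpha
            axial.append(tuple(point))
    centre = [(0.0,) * n_factors]
    return corners + axial + centre

for kind in ("CCC", "CCF"):
    design = central_composite(3, kind)
    largest = max(abs(v) for point in design for v in point)
    print(kind, len(design), "runs, largest coded value", round(largest, 3))
```

For three factors both designs contain 8 corner points, 6 axial points and 1 centre point; they differ only in how far the axial points sit from the centre.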

The major problem with this approach is that there are practical limits on the number of experiments that can be conducted. In DoE, the required number can explode quite readily, since a full factorial design needs L^n experiments, where n is the number of factors and L is the number of levels defined for each factor under investigation (Rao et al. 2008). Limiting the number of parameters (factors) creates the risk of overlooking some critical components, while limiting the number of levels introduces the risk of settling on one of the many local optima rather than finding the global optimum. Both limitations mean that the full potential of the system will not be exploited.
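The growth of the experiment count with the number of factors can be made concrete with a few lines of arithmetic. The helper name `full_factorial_runs` is hypothetical; it only evaluates the L^n expression given above.

```python
def full_factorial_runs(levels: int, factors: int) -> int:
    """Number of experiments in a full factorial design: L ** n."""
    return levels ** factors

# Three levels per factor, increasing number of factors.
for n in (2, 4, 6, 9):
    print(f"{n} factors at 3 levels: {full_factorial_runs(3, n)} experiments")
```

Even at only three levels per factor, nine factors already require 19,683 experiments.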

These issues with DoE have led to the exploration of non-statistical approaches, such as artificial intelligence, as tools for solving multi-parametric optimisation problems. Genetic algorithms (GA) have frequently been employed as a search heuristic to explore the solution space in order to find the global optimum (Camacho-Rodríguez et al. 2015; Malhotra, Singh, and Singh 2011; Sarma, Sahai, and Bisaria 2009; Weuster-Botz 2000). GA has the advantage of exploring a large variable space without exponentially increasing the number of experiments that need to be conducted. GA only requires the maximum and minimum values of each factor, for any number of factors, in order to define the boundaries of the possible solution space. Unlike factorial design, the number of factors and levels does not need to be limited to “manageable” values. The possible solution space is spanned through a heuristic search that begins from a random “population” of candidate solutions, each of which assigns a randomly chosen level to every factor under investigation. The population “evolves” over generations of mating and random mutation towards a solution sub-space where the global optimum resides. Despite its potential, the lack of commercial software employing GA as a search algorithm to span the complete solution space for a multi-parametric optimisation problem limits its practice to specialist users experienced in handling algorithms.
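The search loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the algorithm used in the project: the function name `genetic_search`, the truncation selection, the uniform crossover, the mutation rate and the toy response function are all hypothetical choices. In a laboratory setting each fitness evaluation would be a measured experiment, recorded once rather than recomputed.

```python
import random

def genetic_search(fitness, bounds, pop_size=16, generations=30,
                   mutation_rate=0.1, seed=0):
    """Maximise `fitness` over the box `bounds` = [(low, high), ...]."""
    rng = random.Random(seed)
    # Random initial population: one level per factor, drawn within its bounds.
    population = [[rng.uniform(lo, hi) for lo, hi in bounds]
                  for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[:pop_size // 2]  # the fitter half survives unchanged
        children = []
        while len(parents) + len(children) < pop_size:
            mother, father = rng.sample(parents, 2)
            # Uniform crossover: each factor is inherited from either parent.
            child = [rng.choice(pair) for pair in zip(mother, father)]
            # Random mutation: occasionally redraw a factor within its bounds.
            child = [rng.uniform(lo, hi) if rng.random() < mutation_rate else gene
                     for gene, (lo, hi) in zip(child, bounds)]
            children.append(child)
        population = parents + children
    return max(population, key=fitness)

def toy_response(x):
    """Stand-in for a measured output; its optimum is at (3, -1)."""
    return -((x[0] - 3.0) ** 2 + (x[1] + 1.0) ** 2)

bounds = [(0.0, 10.0), (-5.0, 5.0)]
best = genetic_search(toy_response, bounds)
print(best, toy_response(best))
```

Only the minimum and maximum of each factor are supplied; the number of candidate settings evaluated is the population size times the number of generations, independent of how finely each factor could be subdivided.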

We propose to develop a hybrid approach to solve this multi-parametric experimental design problem and to develop a simple-to-use and freely available graphical user interface (GUI) to empower a wider audience of experimental biologists to employ GA in solving their optimisation problems. The approach will adopt the best of both worlds from the DoE and GA methodologies. Employing GA to span the possible solution space is intended to direct the search towards the sub-space where the global optimum resides, without imposing artificial restrictions on the number of factors (or their levels) that need to be investigated. We will then adopt the regression modelling approach employed in DoE to investigate this optimal solution sub-space, using the experimental data to fit polynomial models by regression. This hybrid approach will allow experimental biologists to: (i) collect extensive data over the complete solution space, (ii) describe the solution space and how the output is affected by the interdependent parameters under investigation, and (iii) identify the optimal solution in the global optimum sub-space refined over the generations. We believe this tool will be an attractive alternative to the commercially available DoE software for both academic users and experimental users in SMEs whose limited funds do not allow them to employ dedicated statisticians to carry out such tasks or to purchase commercial software.
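The regression step of the hybrid approach can be sketched as an ordinary least-squares fit of a full quadratic model, including the pairwise interaction terms that capture interdependent parameters. The function names below are hypothetical, and the sketch solves the normal equations in plain Python only to stay self-contained; a production tool would more likely rely on an established numerical library.

```python
from itertools import combinations_with_replacement

def quadratic_terms(x):
    """Expand one settings vector into [1, x_i ..., x_i * x_j (i <= j) ...]."""
    pairs = combinations_with_replacement(range(len(x)), 2)
    return [1.0] + list(x) + [x[i] * x[j] for i, j in pairs]

def solve(a, b):
    """Solve a * coef = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    coef = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * coef[c] for c in range(r + 1, n))
        coef[r] = (m[r][n] - tail) / m[r][r]
    return coef

def fit_quadratic(settings, responses):
    """Least-squares fit via the normal equations (X^T X) coef = X^T y."""
    X = [quadratic_terms(x) for x in settings]
    k = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)]
           for i in range(k)]
    xty = [sum(row[i] * y for row, y in zip(X, responses)) for i in range(k)]
    return solve(xtx, xty)

# Synthetic two-factor data generated from a known quadratic response.
grid = [(a, b) for a in (-1.0, 0.0, 1.0) for b in (-1.0, 0.0, 1.0)]
truth = [2.0, 1.0, -3.0, -1.0, 0.5, 0.0]
ys = [sum(c * t for c, t in zip(truth, quadratic_terms(x))) for x in grid]
print([round(c, 6) for c in fit_quadratic(grid, ys)])
```

For two factors the coefficients are ordered as intercept, x0, x1, x0², x0·x1, x1². The x0·x1 coefficient is the term that describes how the effect of one parameter depends on the level of the other.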

Who we are

 

Implementation

The work we propose involves two tasks, which will be conducted in parallel and will continuously feed information back to one another. The first task will demonstrate the applicability of the proposed approach and the second task will develop the software tool. Within the scope of the first task, we will experimentally validate the applicability of the hybrid approach that we propose. For this purpose, we have selected the optimisation of cultivation conditions during inducible recombinant protein production by a production strain of the yeast Pichia (Komagataella) pastoris. The accumulated knowledge in Prof Oliver’s lab on this host strain indicates that the major effectors of cultivation are pH, several macronutrients (carbon, nitrogen and sulphur sources), several micronutrients (potassium, magnesium and iron), and the main chemicals required during the induction of protein production (methanol and sorbitol).

We have selected this 9-parameter system, with each parameter defined at 32 different levels, as a test case to demonstrate the efficacy of GA over factorial design. The proposed population size is 16 experiments per generation, in contrast to the 32^9 (approximately 3.5 × 10^13) experiments that a standard full factorial DoE investigation would require. We propose to employ a combined objective to tune growth in a phase-dependent manner (to be maximised pre-induction and to be minimised post-induction of recombinant protein production) and to maximise recombinant protein production. We will employ GA as the search heuristic to determine the solution sub-space fulfilling these combined objectives. We will then proceed to determine the optimal solution in that sub-space and construct a regression-based model to describe the interdependencies among the input parameters that affect the output parameters.
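One simple way to turn the combined objective into a single fitness value for the GA is a weighted sum. The sketch below is purely illustrative: the proposal does not specify how the objectives are combined, so the `Measurement` fields, the `combined_fitness` function, the equal default weights and the assumption that all three quantities are scaled to comparable units are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Measurement:
    growth_pre: float   # growth before induction (to be maximised)
    growth_post: float  # growth after induction (to be minimised)
    protein: float      # recombinant protein output (to be maximised)

def combined_fitness(m, w_pre=1.0, w_post=1.0, w_protein=1.0):
    """Weighted-sum scalarisation; larger values are better."""
    return (w_pre * m.growth_pre
            - w_post * m.growth_post
            + w_protein * m.protein)

# Two hypothetical conditions that differ only in post-induction growth.
quiet = Measurement(growth_pre=0.3, growth_post=0.1, protein=2.0)
leaky = Measurement(growth_pre=0.3, growth_post=0.25, protein=2.0)
print(combined_fitness(quiet), combined_fitness(leaky))
```

With this form, a condition that keeps growing after induction is penalised relative to an otherwise identical one, and the weights let the experimenter state how much each objective matters.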

The optimised set of environmental conditions will be experimentally verified by benchmarking its performance against other conditions reported in the literature. This task will provide an optimised set of cultivation conditions and a predictive model for the selected case study. This work will be conducted by DD and ACC. The GUI, which will facilitate the use of the tool by a wider user group, will be developed within the scope of the second task, and its initial structure will be based on the experience gained during the application. The GA, as well as the regression analysis protocols, is currently available in the MATLAB environment, which is itself commercial software.

The aim of the proposed project is to remove these commercial dependencies, allowing free-to-all access. MATLAB Runtime is a standalone set of shared libraries that enables compiled MATLAB applications or components to run without installing MATLAB, and is free of charge. It is therefore considered suitable for the purpose of this task. The GUI will have a modular structure to accommodate: (i) selection of the input parameters and the ranges that will determine the largest possible solution space, (ii) determination of the objective function describing the output parameter, (iii) implementation of the GA to narrow down the solution space towards the global optimum, and (iv) determination of the global optimum and construction of the regression model. This task will be mainly carried out by JMLD. Bi-weekly meetings will be held between JMLD, DD and ACC to monitor and discuss progress and to provide feedback based on the progress achieved in the first task.

The GUI development process described in this task will be carried out in close communication with ZuvaSyntha, which employs commercial DoE software in their multi-parametric optimisation problems. Dr John Kendall is an experienced user of commercially available DoE software and will therefore provide valuable feedback during every stage of the development process from an industrial perspective.

Proposed timeline

The experimental task of the project will be completed within the first 3 months and the optimised set of conditions will be benchmarked against previously reported conditions during the 4th month. The first draft of the GUI for the tool will also be ready by the end of the first half of the project period. Academic partners, other academic institutions (through the 3rd Meeting of Applied Synthetic Biology in Europe) and ZuvaSyntha will test this pilot version and provide feedback on its development. The second version of the software will be ready by the end of the 4th month of the project. The version will go through a second round of internal evaluations and, by the end of the 5th month, the software will be ready to launch. The last month of the project will be focused on the development of the documentation and the website.

 

Benefits and outcomes

The proposed work lies within the scope of synthetic biology, since many questions addressed by experts in the field are multi-parametric in nature and require the optimisation of a set of interdependent parameters. The proposed project targets this problem. As a solution, it will provide a tool with the potential to replace the commercial products available in the field, which are often beyond the means of both academic and industrial (SME) users.

This work falls under the “software development and documentation” subtitle of the Call. The software we propose to develop for the hybrid approach for multi-parametric optimisation will be made accessible by the end of the project period. Therefore, our Proposal will lead to a tangible, publicly documented, and open outcome. We propose to make the tool and its user documentation accessible to all potential users under free licensing and have any associated publications in open-access journals. The data generated in the experiments will also be made publicly available through the CamData Repository. The timeline of the proposed work falls within the duration of the funding and the detailed costing we provide below for the Proposal remains within the allowed budget limits.

The work involves collaboration between the Departments of Biochemistry and Haematology as well as the Sanger Institute. Furthermore, an industrial partner will provide the view of the preferences on the DoE protocols and requests of non-academic research environments, specifically that of SMEs with limited funding for maintaining dedicated staff and purchasing commercial software. All main applicants undertaking the proposed work are postdoctoral workers at the University of Cambridge. They have the agreement of the University cost-code holder (Prof Stephen G Oliver) that the proposed project and management of the allocated funding will fit with his existing work.

Our team comprises dedicated and enthusiastic scientists with a variety of different and complementary skills, all of which are indispensable for the success of this project. ACC is an experienced fermentation scientist with extensive knowledge of the multi-parametric nature of fermentation problems and of different types of cultivation environments and modes. DD is an experienced data analyst with a specific focus on the employment of model-based techniques. JMLD is experienced in the development of GUI applications for model-based tools. Both JMLD and DD are experienced in collaborating with non-academic partners and feel confident in their ability to carry the communication with ZuvaSyntha forward. Furthermore, we believe that a non-academic partner’s views and ideas during the development process will be invaluable in ensuring that the tool meets the needs not only of academia but also of experimental biologists in SMEs in the biotechnology and pharmacology sectors.

Budget

The salaries of the project members are secured from other funds throughout the duration of the proposed project. The requested funding and its justification are detailed below. Costs that will be incurred during the course of the project:

  1. Laboratory consumables required for the evolutionary experiments for cultivation parameter optimisation in the domain of recombinant protein production using microbial hosts: growth medium chemicals, consumable plasticware (aerated tubes, micropipette tips, microcentrifuge tubes (Eppendorf), Stericup Filter Units (Millipore), PVC spectrophotometer cuvettes, 96-well plates (Nunc) for enzymatic assays to determine protein activity), and 2 x EnzChek (ThermoFisher Scientific) enzymatic assay kits for the determination of lysozyme activity (400 assays per kit): £1500. NB: Chemicals for the growth medium will be used from Prof Oliver’s laboratory in the first instance. The requested budget will be used to replace any chemical that is fully consumed during the course of the project.
  2. Travel between the Sanger Institute (JMLD) and the Department of Biochemistry (DD and ACC) and meeting expenses for regular scheduled meetings among team members: £150
  3. 1 x Portable computer: £1000
  4. 1 x Attendance at the 3rd Meeting of Applied Synthetic Biology in Europe (http://www.efb-central.org/index.php/3rdSyntheticBiology/), to receive interim feedback from a mainly academic audience on the developed approach and the initial version of the tool. Any suggestions will be used to modify the course of the development process to better suit the needs of end-users. The target topic area at the meeting will be “Technology & tool development”: £1000
  5. Travel and meeting expenses for industrial partner feedback on the development of the GUI of the software (meetings held at ZuvaSyntha Ltd., Welwyn Garden City): £100
  6. Costs for building and maintaining a website for the tool. The only cost will be associated with the building of the website. The webpage for the tool will be embedded under the Cambridge Systems Biology Website and will be regularly maintained along with other website components in the future. This will facilitate continued access to the tool: £100

Costs that will be incurred during the follow-on and outreach:

  1. Half-day full training and Q&A session on the use of the developed tool, aimed at all potential users in ZuvaSyntha: £100
  2. Securing the uninterrupted accessibility of the tool and the relevant data via the CamData Repository: £5/GB
  3. 1 x Attendance at a dissemination event to improve the visibility of the software, the 17th European Congress on Biotechnology (http://ecb2016.com/): £900

Project application slides

Design files and documentation used during this application can be found here.

 

References

  • Camacho-Rodríguez, J., M. C. Cerón-García, J. M. Fernández-Sevilla, and E. Molina-Grima. 2015. “Genetic Algorithm for the Medium Optimization of the Microalga Nannochloropsis gaditana Cultured to Aquaculture.” Bioresource Technology 177:102–9. Retrieved (http://linkinghub.elsevier.com/retrieve/pii/S0960852414016599).
  • Malhotra, Rahul, Narinder Singh, and Yaduvir Singh. 2011. “Genetic Algorithms: Concepts, Design for Optimization of Process Controllers.” Computer and Information Science 4(2):39–54.
  • Mandenius, Carl-Fredrik, and Anders Brundin. 2008. “Bioprocess Optimization Using Design-of-Experiments Methodology.” Biotechnology Progress 24(6):1191–1203.
  • Rao, Ch Subba et al. 2008. “Modelling and Optimization of Fermentation Factors for Enhancement of Alkaline Protease Production by Isolated Bacillus circulans Using Feed-Forward Neural Network and Genetic Algorithm.” Journal of Applied Microbiology 104(3):889–98. Retrieved (http://www.ncbi.nlm.nih.gov/pubmed/17953681).
  • Sarma, M. V. R. K., Vikram Sahai, and V. S. Bisaria. 2009. “Genetic Algorithm-Based Medium Optimization for Enhanced Production of Fluorescent Pseudomonad R81 and Siderophore.” Biochemical Engineering Journal 47(1-3):100–108.
  • Weuster-Botz, D. 2000. “Experimental Design for Fermentation Media Development: Statistical Design or Global Random Search?” Journal of Bioscience and Bioengineering 90(5):473–83.