Introduction

The phenotypic behaviour of living organisms is determined by the underlying and highly complex interactions of molecules, for example proteins, DNA, RNA or other biochemical substances (Kitano 2002). These interactions can occur at extremely fast rates, and the overall dynamics of the cell or higher organism are therefore highly nonlinear. One of the challenges of systems biology is to utilize proven techniques that have been developed in other areas, such as control engineering, and apply these to biological systems in order to try to gain a better understanding of the function and behaviour of the underlying molecular processes (Wolkenhauer et al. 2005; Wellstead 2007).

This paper investigates two such processes that have been widely studied in the literature: the mitogen-activated protein kinase (MAPK) cascade (Gormley et al. 2007; Sasagawa et al. 2005; Huang and Ferrell 1996; Kholodenko 2000) and a biological oscillator known as the Brusselator (Karafyllis et al. 1997; Peng and Wang 2005; Wang et al. 2002; Zimmerman 2006). The MAPK cascade can be found in all eukaryotic cells and is an important signal transduction pathway that helps to activate several transcription factors involved in the regulation of cell cycle activity (Widmann et al. 1999). The Brusselator is a simplified model of biochemical oscillation, a phenomenon that underlies much of the dynamic behaviour found in many cellular systems. For example, the regulation of enzyme activity produces metabolic oscillations, circadian rhythms originate from the regulation of gene expression, and oscillations in intracellular calcium levels are responsible for the control of cell receptor activity, which in turn is responsible for intercellular signalling (Goldbeter 2002). Therefore, identifying the key features and dynamics in these types of molecular processes is important for understanding system behaviour and also for possible regulatory control of biological systems.

Throughout the systems biology literature, the most common approach to representing these molecular interactions and signalling pathways is by ordinary/partial differential equations (Levchenko et al. 2000; Chen et al. 2004; Markevich et al. 2007). Such equations describe the concentration levels of the individual molecular species in the pathway over time. In control engineering, this is commonly known as white-box modelling, as the models have been derived from chemical rate equations of the underlying biological process to provide a complete picture of the system at any time. Such models are perfectly feasible when the number of molecular species in the pathway is relatively small (such as in the cases investigated here). However, in other biological systems the number of species interactions can become incredibly large, resulting in the model becoming too complex to analyse, or even intractable to solve. The work described here therefore takes a different approach by adopting simplified black-box identification of these biological systems using a linear-in-the-parameters model. This class of nonlinear model comprises a linear combination of model terms or basis functions, which are functions of past system states of interest, and has been used to model a wide range of nonlinear dynamic systems in the literature. Some examples include the polynomial nonlinear AutoRegressive model with eXogenous inputs (polynomial NARX), neurofuzzy networks, and radial basis function (RBF) networks (Chen et al. 1989; Haber and Unbehauen 1990; Sjöberg et al. 1995; Li et al. 2005, 2006; Peng et al. 2006). It has been shown that linear-in-the-parameters models have broad approximation capabilities, and they have been widely used in the modelling and control of complex nonlinear engineering systems (Chen et al. 1989; Harris et al. 2002; Zhu and Billings 1996; Li et al. 2004; Huang et al. 2005; Hunt et al. 1992).

When building a linear-in-the-parameters model, a major problem is that a very large pool of candidate model terms has to be considered initially (Mao and Billings 1997; Li et al. 2005; Haber and Unbehauen 1990), from which a useful and simplified model is then generated according to the principle of parsimony (Ljung 1987; Söderström and Stoica 1989): selecting the smallest possible model that explains the data. In the linear regression field, this problem is referred to as subset selection (Draper and Smith 1981; Hastie et al. 2001; Lawson and Hanson 1974; Miller 1990; Li et al. 2006). However, in modelling nonlinear dynamic systems, the size of the term pool can be so huge (Mao and Billings 1997) that selecting an optimal subset is computationally prohibitive. For example, Mao and Billings (1997) pointed out that an exhaustive search for the optimal model with 20 possible model terms involves 2.43 × 10¹⁸ search paths, the so-called curse of dimensionality.

Among the various subset selection approaches, forward methods are among the most effective for model building when a very large term pool has to be considered. In particular, the orthogonal least squares (OLS) method (Chen et al. 1989; Chen and Wigger 1995; Zhu and Billings 1996), which performs forward stepwise model selection using modified Gram–Schmidt (MGS) orthogonalization, is the most popular. In forward model selection, significant terms are selected one-by-one, and the net decrease in the cost function due to each newly selected term can be computed without explicitly solving the least-squares problem. The computational complexity is thus significantly reduced and the dimensionality problem can be effectively relieved. To further improve the computational efficiency and numerical stability, other fast algorithms have been proposed (Li et al. 2005; Chen and Wigger 1995; Korenberg 1988).

Despite the great efficiency of forward stepwise methods in model selection, their major disadvantage is that the model obtained is not optimal (Sherstinsky and Picard 1996). To overcome this problem, the orthogonal estimation algorithm has been augmented with genetic search procedures to search for the optimal model (Mao and Billings 1997). However, it is well known that genetic algorithms suffer from slow and premature convergence (Andre et al. 2001; Peng et al. 2004). Given that the search for the optimal model is a mixed-integer problem in which numerous local minima exist, there is no guarantee that the global optimum can be produced in practice through a genetic search. Moreover, the computational complexity is usually extremely high and is difficult to analyse owing to the stochastic sampling nature of genetic search.

In this paper, an iterative subset selection approach is used to identify the nonlinear dynamics of the molecular interactions that underlie many biological systems. The model terms are selected and refined within one analytic framework, leading to improved model compactness over forward subset selection methods. It will be shown that the proposed method can capture the inherent dynamics of these systems using only sparse input–output data of the system states, with data sets of varying size. It will also be demonstrated that the method is sufficiently accurate, even in the presence of system noise, to offer a simple alternative to the more computationally expensive white-box approach.

This paper is organised as follows. The next section describes the main method used to select the optimal model structure. Following that, the two biological systems to be investigated are introduced. The iterative subset selection method is then applied to modelling simulations of these molecular processes using a polynomial NARX structure. Finally some conclusions are drawn.

The modelling method

The method applied to modelling the biological systems in this paper is the polynomial NARX model. This type of model uses regression on system input–output data to create a model structure and has been applied to many types of conventional nonlinear systems throughout the control engineering literature. The ability of these models to approximate any nonlinear function to arbitrary accuracy is well known (Ljung 1987). They provide a mapping from input states to system output, where the internal structure of the target system is usually not considered. These are relatively simple linear-in-the-parameters models, in which the output at any time is a linear combination of (possibly nonlinear) functions of previous input/output states of the system. For readers less familiar with this type of approach, the following subsection provides a brief introduction to the technique.

Introduction to polynomial NARX models

A general nonlinear dynamic system can be represented as:

$$ \begin{aligned} y(t)&=f(y(t-1), \ldots,y(t-n_y),u(t-1), \ldots,u(t-n_u)) + \epsilon(t)\\ &=f({\mathbf{x}}(t))+\epsilon (t) \end{aligned} $$
(1)

where the output of the system y(t) at any time is a function of previous output and input states plus some unknown noise variation \(\epsilon(t)\); \(n_u\) and \(n_y\) are the maximal input/output lags, \({\mathbf{x}}(t)\) is the model ‘input’ vector, and f(·) is some unknown (usually nonlinear) function.

Now suppose the systems to be investigated are represented by a polynomial NARX model, which is a linear-in-the-parameters model of the form:

$$ y(t)=\sum\limits_{i=1}^M \theta_i\varphi_i({\mathbf{x}}(t))+\epsilon(t) $$
(2)

where \(\varphi_i(\cdot)\), i = 1,…,M, are the M candidate model terms that make up the regression matrix, and \(\theta_i\) are the corresponding model parameters to be estimated.

The regression matrix φ is constructed from a polynomial expansion of previous input and output states of the target system. The main steps taken to construct it are as follows:

  1. First perturb the target system to obtain a set of input–output data evenly sampled over a period of time.

  2. Now taking the input u(t) and output y(t) vectors of N samples each, create new data vectors by delaying u(t) and y(t) by a number of time points to create the model input vector x(t). So for example a system lag of 3 would create a model input vector of:

     $$ {\mathbf{x}}(t)=\{y(t-1),y(t-2),y(t-3),u(t-1),u(t-2),u(t-3)\} $$
     (3)

  3. Next perform a polynomial expansion of the model input vector x(t) to create the full regression matrix φ (this construction is sketched in the code below). So for a polynomial expansion of 3, φ would be an N × M matrix containing M = 14 column vectors of linear and nonlinear candidate terms of up to 2nd order.
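As a concrete illustration of steps 2 and 3, the sketch below (Python/NumPy; the function name and the use of `itertools` are our own choices rather than part of the method) builds the lagged model input vectors and their polynomial expansion:

```python
import numpy as np
from itertools import combinations_with_replacement

def build_regression_matrix(u, y, lag=3, degree=2):
    """Assemble the candidate regression matrix phi for a polynomial NARX
    model from an input series u and output series y (1-D arrays, length N)."""
    N = len(y)
    # Step 2: model input vector x(t) = [y(t-1..lag), u(t-1..lag)]
    X = np.array([[y[t - i] for i in range(1, lag + 1)] +
                  [u[t - i] for i in range(1, lag + 1)]
                  for t in range(lag, N)])
    target = y[lag:]                       # y(t) aligned with each row of X

    # Step 3: every monomial of the lagged states up to the given degree
    columns, names = [], []
    for d in range(1, degree + 1):
        for combo in combinations_with_replacement(range(X.shape[1]), d):
            columns.append(np.prod(X[:, combo], axis=1))
            names.append(combo)            # indices of the states multiplied
    Phi = np.column_stack(columns)         # candidate pool, one term per column
    return Phi, target, names
```

The columns of `Phi` are then the candidate terms φ_i over which the subset selection operates.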

Now the problem is to select the best n regressor terms \({\mathbf{p}}_1,\ldots,{\mathbf{p}}_n \in \{\varphi_1,\ldots,\varphi_M\}\) so that the sum squared error (SSE) between the target system and model output is minimised:

$$ {\mathop{\min}\limits_{\theta_i,{\mathbf{p}}_i}} \sum\limits_{t=1}^N \left(y(t)-\sum\limits_{i=1}^n {\mathbf{p}}_i(t)\, \theta_i \right)^2 $$
(4)

Through minimising the cost function, the model parameters are also estimated and the significance of each term in the regression matrix towards the true system can be established. Terms that are unrelated to the true system will be found to have an insignificant contribution to minimising the cost function and hence, the most important regressor terms can be selected to be included in the model.

Obviously, when building a model, both the order of the polynomial expansion and the number of delays selected for the input vector will affect the performance. Increasing these parameters makes it more likely that the subset selection algorithm will converge upon the optimal model; however, it also rapidly enlarges the candidate pool, and as M grows the computational complexity of finding the solution becomes prohibitive.

Implementation example

To illustrate the basic concept proposed in the paper, consider the following true system, which is unknown to the modeller:

$$ \begin{aligned} y(t)&=-1.7y(t-1)-0.8y(t-2)\\ &+u(t-1)+0.8u(t-2)+\epsilon(t) \end{aligned} $$
(5)

Now, if a NARX model is created with five delays on the model input vector with a polynomial expansion of order 2, the full model can be constructed as:

$$ \begin{aligned} y(t)=\{&y(t-1),\ldots,y(t-5),u(t-1),\ldots,u(t-5),\\ &y^{2}(t-1),\ldots,y(t-1)\,u(t-5)\}\,\theta + \epsilon(t) \end{aligned} $$
(6)

Now comparing this to the true system shows that only linear terms are required in this case, so ideally the model subset selection algorithm will only select these terms when performing the regression, while ignoring the insignificant nonlinear terms.
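For instance, such a data set might be generated by simulating (5) under a random excitation; a minimal sketch, in which the input distribution and noise level are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

N = 400
u = rng.uniform(-1.0, 1.0, N)         # persistently exciting random input
e = 0.01 * rng.standard_normal(N)     # small observation noise (assumed level)
y = np.zeros(N)

for t in range(2, N):
    # True system (5), unknown to the modeller
    y[t] = (-1.7 * y[t - 1] - 0.8 * y[t - 2]
            + u[t - 1] + 0.8 * u[t - 2] + e[t])
```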

However, suppose a set of observations (samples) has been obtained from the true system, based on which a run of the forward selection algorithm might have selected the following four terms:

$$ \begin{aligned} y(t)=&-1.655y(t-1)-0.225y(t-3)\\ &+0.95u(t-1) + 0.68u(t-2)*y(t-1) \end{aligned} $$
(7)

Comparing this with the true system shows that only two of the most significant terms have been selected, even though the model may still be able to give a reasonable approximation of the system.

Now if instead we perform the forward and backward subset selection algorithm proposed in this paper, the terms selected are:

$$ \begin{aligned} y(t)=&-1.689y(t-1)-0.775y(t-2)\\ &+0.998u(t-1)+0.830u(t-2) \end{aligned} $$
(8)

This algorithm has selected the most significant model terms and has therefore converged upon the optimal model structure, resulting in greater model transparency and improved modelling performance.

The two-stage algorithm

The two-stage identification algorithm used to perform the subset selection is briefly described in the following subsections. A more detailed formulation can be found in the Appendix.

Forward subset selection

This section briefly outlines the first stage of the identification method where the algorithm uses forward selection to generate an initial model. The model terms are chosen one-by-one from a pool of candidates so that each time the cost function is reduced by the maximum amount. This procedure is repeated until k model terms have been selected, where k is determined by the model structure selection criterion.

To begin with, consider a general nonlinear dynamic system (Chen et al. 1989; Li et al. 2005, 2006)

$$ \begin{aligned} y(t)&=f(y(t-1),\ldots,y(t-n_y),u(t-1),\ldots,u(t-n_u))\\ &=f({\mathbf{x}}(t)) \end{aligned} $$
(9)

where u(t) and y(t) are the system input and output at sample time instant t, n u and n y are the corresponding maximal lags, x(t) represents the model ‘input’ vector, and f(·) is some unknown nonlinear function.

Now suppose in this case a polynomial NARX model is used to represent system (9), then

$$ y(t)=\sum\limits_{i=1}^M {\theta _i \varphi _i ({\mathbf{x}}(t))}+ \varepsilon (t) $$
(10)

where φ i (·), i = 1,…, M are candidate basis functions and ɛ(t) is the model residual. If a sequence of N data samples {x(t), y(t)}, t = 1,…, N is to be used for model identification, Eq. (10) can be rewritten as:

$$ {\mathbf{y}}=\varvec{\Upphi}\varvec{\Uptheta}+\varvec{\Upxi} $$
(11)

where \(\varvec{\Upphi}=[\varphi_1,\ldots,\varphi_M]\in\Re^{N\times M}\) with \(\varphi_i=[\varphi_i({\mathbf{x}}(1)),\ldots, \varphi_i({\mathbf{x}}(N))]^{\rm T}\in\Re^N\) for i = 1,…, M, \({\mathbf{y}}^T=[y(1),\ldots,y(N)]\in\Re^N, \varvec{\Uptheta} =[\theta_1,\ldots,\theta_M]^T\in\Re^M\), and \(\varvec{\Upxi}^T=[\varepsilon(t_1),\ldots,\varepsilon(t_N)]\in\Re^N\) .

The model selection aims to select, say, k regressor terms, denoted \({\mathbf{p}}_1,\ldots,{\mathbf{p}}_k\), from all the candidates \({\varphi}_i(\cdot), i=1,\ldots,M\) (usually M ≫ k), resulting in a linear-in-the-parameters model

$$ {\mathbf{y}}={\mathbf{P}}_k\varvec{\Uptheta}_k+{\mathbf{e}} $$
(12)

which best fits the data samples such that the sum squared-error (SSE) is minimised where

$$ \begin{aligned} J({\mathbf{P}}_k)&=\mathop{min}\limits_{{\varvec{\Upphi}_k\in\varvec{\Upphi}, \varvec{\Uptheta}_k}\in\Re^k}\{{\mathbf{e}}^{\rm T}{\mathbf{e}}\}\\ &=\mathop{min}\limits_{\varvec{\Upphi}_k\in\varvec{\Upphi},\varvec{\Uptheta}_k \in \Re^k}\{({\mathbf{y}}-\varvec{\Upphi}_k \varvec{\widehat{\Uptheta}}_{\mathbf k})^{\rm T}({\mathbf{y}}-\varvec{\Upphi}_k\varvec{\widehat{\Uptheta}}_{\mathbf k})\} \end{aligned} $$
(13)

Here \(\varvec{\Upphi}_k\) is an N × k matrix comprising k columns from \(\varvec{\Upphi}\), \(\varvec{\widehat{\Uptheta}}_k\) denotes the corresponding regression coefficient vector, and the selected regression matrix is

$$ {\mathbf{P}}_k=[{\mathbf{p}}_1,\ldots,{\mathbf{p}}_k] $$
(14)

If P k is of full column-rank, the least-squares estimate of the regression coefficients in (12) is given by

$$ \varvec{\widehat{\Uptheta}}_k=({\mathbf{P}}_k^T{\mathbf{P}}_k)^{-1}{\mathbf{P}}_k^T {\mathbf{y}} $$
(15)

Having selected k model terms, suppose that one more is added into the model, with corresponding regressor term \({\mathbf{p}}_{k+1}\). The net reduction in the cost function due to adding this term is given by

$$ \Updelta J_{k+1}({\mathbf{p}}_{k+1})=J({\mathbf{P}}_k)-J({\mathbf{P}}_{k +1}) $$
(16)

Evaluating the contribution of all remaining terms requires some redefinitions:

$$ \begin{array}{l} \Upphi=[{\mathbf{P}}_k, {\mathbf{C}}_{M-k}]\\ {\mathbf{C}}_{M-k}=[\phi_{k+1},\cdots, \phi_M]\\ \end{array} $$
(17)

Now clearly the first k regressors in \(\varvec{\Upphi}\) (i.e. \({\mathbf{P}}_k\)) correspond to the k selected terms, while the remaining M − k terms \({\mathbf{C}}_{M-k}=[\phi_{k+1},\cdots,\phi_M]\) make up the candidate pool.

Using (16), the contribution of every remaining candidate term in \(\varvec{\Upphi}\) can now be calculated, and the term from \({\mathbf{C}}_{M-k}\) giving the maximum contribution is selected as the (k + 1)th model term. That is, if the index j of the next most significant term is given by

$$ j=\hbox{arg}\mathop{\hbox{max}}\limits_{k < i\le M}\{\Updelta J_{k + 1} (\phi_i)\} $$
(18)

then \(\phi_j\) is selected as the (k + 1)th model term and re-labelled as \({\mathbf{p}}_{k+1}=\phi_j\). The regression matrix of the selected model becomes \({\mathbf{P}}_{k+1}=[{\mathbf{P}}_k,\;{\mathbf{p}}_{k+1}]\), while the candidate pool is reduced in size to \({\mathbf{C}}_{M-k-1}\). The remaining candidates in \({\mathbf{C}}_{M-k-1}\) are re-indexed as \(\phi_{k+2},\cdots,\phi_M\). Finally, the full regression matrix becomes \(\varvec{\Upphi}=[{\mathbf{P}}_{k+1},\;{\mathbf{C}}_{M-k-1}]\).

This forward selection is repeated until the desired number of model terms (k) has been reached, the cost function has been reduced to a given level, or a stop criterion has been satisfied, such as Akaike’s information criterion (AIC) (Akaike 1974) or the minimum description length (MDL) (Gustafsson and Hjalmarsson 1995). Once the initial model has been constructed, it can be refined using a backward selection approach to replace insignificant model terms in the original structure.
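Before turning to the second stage, the first stage can be summarised in code. The sketch below (plain NumPy) recomputes a full least-squares fit for each candidate rather than using the fast MGS/orthogonal updates of the OLS literature, and uses one common form of the MDL as the stop rule; the names and the `max_terms` safeguard are our own:

```python
import numpy as np

def sse(P, y):
    """Sum squared error of the least-squares fit y ~ P @ theta."""
    theta, *_ = np.linalg.lstsq(P, y, rcond=None)
    r = y - P @ theta
    return float(r @ r)

def mdl(sse_val, N, k):
    # Minimum description length criterion (one common form)
    return N * np.log(sse_val / N) + k * np.log(N)

def forward_select(Phi, y, max_terms=20):
    """Stage 1: pick columns of Phi one-by-one, each time choosing the
    candidate giving the largest net reduction in the SSE (Eq. 16),
    stopping when the MDL no longer decreases."""
    N, M = Phi.shape
    selected, best_crit = [], np.inf
    while len(selected) < max_terms:
        remaining = [j for j in range(M) if j not in selected]
        if not remaining:
            break
        # Contribution of each remaining candidate (Eq. 18)
        scores = [(sse(Phi[:, selected + [j]], y), j) for j in remaining]
        new_sse, j = min(scores)
        crit = mdl(new_sse, N, len(selected) + 1)
        if crit >= best_crit:
            break                      # MDL stopped improving
        selected.append(j)
        best_crit = crit
    theta, *_ = np.linalg.lstsq(Phi[:, selected], y, rcond=None)
    return selected, theta
```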

Backward model refinement

Each iteration of the forward selection algorithm described above adds one new term to the model, chosen as the one that produces the maximum reduction in the cost function. However, there is usually some correlation between the regressor terms, so terms selected later may affect the contribution of previously selected ones. In other words, a previously selected model term that once provided a large contribution to reducing the cost can suddenly become insignificant after a new term is introduced. This inefficiency of forward subset selection methods has been explored in (Sherstinsky and Picard 1996). To overcome it, a second stage is introduced whereby all the previously selected model terms are reviewed and the model is refined: any insignificant terms are removed and/or replaced until an optimal model is achieved for a given selection criterion.

Assume an initial model with n regressor terms has been generated using forward selection, and suppose a term \({\mathbf{p}}_i\), 1 ≤ i ≤ n, is to be reviewed. Its contribution to the cost (SSE) reduction, \(\Updelta J_n({\mathbf{p}}_i)\), needs to be compared with that of the candidate in the pool offering the largest contribution to cost reduction. Denoting the maximum candidate contribution as \(\Updelta J_n({\phi}_j)\), the significance of a model term \({\mathbf{p}}_i\) can be checked by identifying the maximum contribution over all candidates from

$$ \Updelta J_n (\phi_j)=\hbox{max}\{\Updelta J_n (\phi_s),s=n+1,\ldots,M\} $$
(19)

If \(\Updelta J_n({\phi}_j) > \Updelta J_n({\mathbf{p}}_i)\), then \({\mathbf{p}}_i\) is said to be insignificant and is replaced with \({\phi}_j\) as the new regressor term, while \({\mathbf{p}}_i\) is returned to the candidate pool, taking the position of \({\phi}_j\). Such an exchange of model terms further reduces the SSE by \(\Updelta J_n({\phi}_j)-\Updelta J_n({\mathbf{p}}_i)\), which means that the model compactness is further improved and an optimal model structure can be obtained.
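A naive version of this exchange step, continuing the forward-selection sketch above (the `sse` helper is repeated so the fragment stands alone; again full least-squares fits are recomputed instead of the efficient orthogonal updates):

```python
import numpy as np

def sse(P, y):
    theta, *_ = np.linalg.lstsq(P, y, rcond=None)
    r = y - P @ theta
    return float(r @ r)

def backward_refine(Phi, y, selected):
    """Stage 2: review each selected term p_i; if some candidate phi_j from
    the pool reduces the SSE more than p_i does (Eq. 19), swap them.
    Repeat until a full pass makes no exchange."""
    selected = list(selected)
    changed = True
    while changed:
        changed = False
        pool = [j for j in range(Phi.shape[1]) if j not in selected]
        for pos, i in enumerate(list(selected)):
            others = selected[:pos] + selected[pos + 1:]
            base = sse(Phi[:, others], y)          # model without term i
            # Net contribution of each term when added to the other n-1 terms
            gain_i = base - sse(Phi[:, others + [i]], y)
            gains = [(base - sse(Phi[:, others + [j]], y), j) for j in pool]
            best_gain, j = max(gains)
            if best_gain > gain_i:                 # phi_j is more significant
                selected[pos] = j                  # exchange p_i and phi_j
                pool[pool.index(j)] = i            # p_i returns to the pool
                changed = True
    return selected
```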

The experimental results

The following sections describe the steps taken to identify two simulated biological systems using the method proposed in the previous section. The two systems investigated here are the MAPK signalling pathway and the Brusselator. In each case a brief introduction to the system is given, along with a description of the modelling process. Finally, the modelling results obtained using the two-stage algorithm are compared with the conventional forward selection approach.

The MAPK cascade

The MAPK cascade is an important intracellular signalling pathway that is involved in producing many different cellular responses, including cell growth and proliferation (Kholodenko 2000). As such, it is an important pathway that can even be implicated in cancer development when its normal signalling process malfunctions. The pathway describes the response of a cell when it detects the binding of extracellular signalling molecules to receptor proteins at the surface of the cell membrane. The binding process results in conformational changes on the part of the receptor that is below the membrane surface, which in turn triggers the activation of a cascade of intracellular signalling proteins. This is a three-tiered cascade where the kinase at each level is activated through dual phosphorylation at two amino acid sites by the activated kinase of the previous level (see Fig. 1). At the end of the cascade, the terminal signalling protein activates target proteins which alter the behaviour of the cell, for example, by regulating the expression of certain genes, by altering cell shape (by cytoskeletal proteins) or by changing cell metabolism (Alberts et al. 2002).

Fig. 1

Kinetic pathway diagram of the MAPK cascade. The single and dual phosphorylation of each molecule is represented by the addition of a ‘-P’ and ‘-PP’ respectively to the name of the kinase, where MAPK-PP represents the output activated form of the kinase. Ras (or MKKKK) is the input protein that triggers the activation of the kinase at the top level of the cascade

Simulation of the MAPK cascade

To create a black-box model of the MAPK cascade, a set of input–output data is required to perform model estimation and validation. A simulation of the signalling pathway was performed to generate a sufficiently large data set. The mathematical model used for the simulation is based on one derived in (Kholodenko 2000), which includes the addition of negative feedback. This is an 8th-order state model with a single input and a single output (SISO). The model uses Michaelis–Menten enzyme kinetics to derive chemical rate equations for each of the pathway connections in the cascade. The rate equations are given in Tables 1 and 2. After setting the initial concentrations of each species and the rate constants, the equations can be solved for a particular time series.

Table 1 Kinetic rate equations for the concentrations of each of the eight types of molecule found in the MAPK cascade (Kholodenko 2000)
Table 2 Rate equations and parameters for each of the 10 reactions in the MAPK pathway diagram (Fig. 1)
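The full 8-state model is specified by Tables 1 and 2; to illustrate how such a simulation is set up, the sketch below solves a single phosphorylation–dephosphorylation cycle with Michaelis–Menten kinetics using purely illustrative constants (these are not the Kholodenko parameter values):

```python
import numpy as np
from scipy.integrate import solve_ivp

# One kinase cycle with Michaelis-Menten kinetics (illustrative constants,
# not the Tables 1-2 values): K is phosphorylated under an upstream stimulus
# and dephosphorylated by a phosphatase.
V1, K1 = 2.5, 10.0      # activation: max rate and Michaelis constant
V2, K2 = 0.25, 8.0      # deactivation
u = 1.0                 # upstream stimulus (e.g. Ras), held constant here

def rates(t, s):
    K_inactive, K_active = s
    v_act = u * V1 * K_inactive / (K1 + K_inactive)
    v_deact = V2 * K_active / (K2 + K_active)
    return [v_deact - v_act, v_act - v_deact]

sol = solve_ivp(rates, (0.0, 100.0), [100.0, 0.0],
                t_eval=np.linspace(0.0, 100.0, 800))
# sol.y[1] is the activated-kinase time course used as output data
```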

Identification of the MAPK cascade

A data set of 800 samples was generated from the simulation of the MAPK signalling cascade. In order to simulate the effects of measurement noise, a signal of uniformly distributed random noise was generated for each time point and added to the data, giving a signal-to-noise ratio of 30 dB. Finally, the data was normalised to the range 0–1; the corresponding statistical measures for this set can be seen in Table 3.

Table 3 Statistics of the input–output data sets used for training and validation
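A sketch of this preprocessing step (the exact noise generator used here is not specified beyond its distribution and level, so the scaling below is one reasonable reading of a 30 dB signal-to-noise ratio):

```python
import numpy as np

rng = np.random.default_rng(1)

def add_noise_and_normalise(x, snr_db=30.0):
    """Add zero-mean uniform noise scaled to the requested signal-to-noise
    ratio, then min-max normalise the result to the range 0-1."""
    signal_power = np.mean(x ** 2)
    noise = rng.uniform(-1.0, 1.0, len(x))
    noise *= np.sqrt(signal_power / (10 ** (snr_db / 10) * np.mean(noise ** 2)))
    noisy = x + noise
    return (noisy - noisy.min()) / (noisy.max() - noisy.min())
```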

Ideally, when performing this type of regression modelling, a large data set (typically 1,000–2,000 samples) is used to ensure that the model captures the entire range of possible dynamics of the system. However, when dealing with biological systems, the amount of data available using current experimental techniques is much smaller than this: a typical differential equation model in the systems biology literature is fitted to a set of around 30–50 data points. This could be a potential stumbling block for applying the proposed two-stage algorithm to model biological systems. However, provided the derived model performs well when validated on previously unseen data, it can be considered sufficiently accurate. To investigate the effect that data size has on performance, models were derived using subsets of the original 800 samples, beginning with 30 samples and gradually increasing this up to 400 samples.

In each case a polynomial Nonlinear AutoRegressive model with eXogenous inputs (NARX), with polynomial order up to 3, was used to construct the regression model. The model input variables Ras (u(t)) and MAPK-PP (y(t)), with delays of up to 3 time steps each, were used to construct the full model set, resulting in a candidate pool of 285 terms. First the forward selection procedure was performed (using the MDL as the stop criterion) to select a subset of terms from the pool and estimate the corresponding model parameters. The obtained model structure was then validated on a new set of 400 data points not provided to the algorithm during estimation. The process was repeated for each set using the proposed two-stage identification algorithm, performing both forward and backward subset selection in each case. As mentioned in the previous section, the forward approach is not optimal; the two-stage method should therefore obtain a more accurate model. To compare the performances, the results of training and validation for both methods on each data set are listed in Table 4.

Table 4 MAPK training and validation results with mean squared error (MSE) between the model and target output given for different sized data sets

From Table 4, it is clear that the proposed two-stage method outperformed the conventional forward selection method in terms of modelling accuracy. As expected, the performance also improves, particularly under validation, as the amount of data available to the algorithm increases.

To get an indication of the ability of this method to approximate the MAPK system, Figs. 2–11 display the model output superimposed on the target output during the estimation and validation stages. As can be seen in Fig. 2, the polynomial NARX model can easily be fitted to the data when only 30 samples are available. Unfortunately, this model then performs quite poorly when validated on new unseen data (Fig. 7). As the number of samples used at the estimation stage increases (Figs. 2–6), the performance of the models under validation also improves (Figs. 7–11). In fact, even using only 100 samples for estimation (Fig. 4), the validation performance reaches an acceptable level (Fig. 9) and the NARX model can approximate the MAPK pathway to a sufficient degree of accuracy.

Fig. 2

MAPK model estimation using only 30 data points

Fig. 3

MAPK model estimation using only 50 data points

Fig. 4

MAPK model estimation using 100 data points

Fig. 5

MAPK model estimation using 200 data points

Fig. 6

MAPK model estimation using 400 data points

Fig. 7

MAPK model validation using only 30 data points

Fig. 8

MAPK model validation using only 50 data points

Fig. 9

MAPK model validation using 100 data points

Fig. 10

MAPK model validation using 200 data points

Fig. 11

MAPK model validation using 400 data points

Taking the models generated using only 100 samples as an example, the model structures and parameters derived from both methods are given in Tables 5 and 6. Using the MDL as the stop criterion, the forward subset selection procedure produced a model structure containing only eight terms out of the entire pool of 285 candidates. When the proposed two-stage forward and backward selection method was used, a new optimal subset of eight terms was selected instead. The different subsets of terms and parameters obtained by the two approaches can be compared in Tables 5 and 6. It is clear from the tables that these model structures are very simple, consisting only of a linear combination of eight (linear/nonlinear) terms and associated parameters. As stated in previous sections, such models are much simpler than their differential equation counterparts and offer a potential solution to the problem of solving complex high-dimensional systems containing a large number of variables.

Table 5 MAPK model structure obtained from forward selection
Table 6 MAPK model structure obtained from two-stage, forward and backward subset selection
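Using such a model is correspondingly simple. The sketch below runs a free simulation of a small NARX model of this general shape; the three terms shown are placeholders for illustration, not the actual entries of Tables 5 and 6:

```python
import numpy as np

# Hypothetical 3-term model for illustration; the real MAPK models have
# eight terms (Tables 5 and 6) but are used in exactly the same way.
terms = [
    (0.92,  lambda y, u, t: y[t - 1]),
    (0.41,  lambda y, u, t: u[t - 1]),
    (-0.15, lambda y, u, t: y[t - 1] * u[t - 2]),
]

def free_run(u, y_init, N):
    """Simulate the identified NARX model forward in time, feeding its own
    past predictions back in as the lagged output states."""
    y = np.zeros(N)
    y[: len(y_init)] = y_init            # seed with measured initial values
    for t in range(len(y_init), N):
        y[t] = sum(theta * phi(y, u, t) for theta, phi in terms)
    return y

u = np.full(800, 0.5)                    # hypothetical input trajectory
y_sim = free_run(u, y_init=[0.1, 0.1], N=800)
```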

The Brusselator

The second example describes the black-box identification of a biochemical oscillator model known as the Brusselator (Karafyllis et al. 1997). Biochemical oscillations are the underlying basis for much of the dynamic behaviour found in many cellular systems. Many biological processes that exhibit oscillatory behaviour are fundamental to life itself. A typical example of this is the cell cycle, where cell growth and division are controlled by oscillations in the levels of certain proteins and therefore by mitotic oscillations (Tyson 1991; Novak and Tyson 1997; Chen et al. 2004). Therefore, identifying the key features and dynamics in these biochemical oscillations is important for understanding the underlying dynamical behaviour and for possible regulatory control of these biological systems.

Simulation of the Brusselator

As with the previous example, a simulation of the Brusselator was performed to generate a set of input–output data for model estimation and validation. The model used for the simulation is based on the four biochemical reaction equations given below:

$$ \begin{aligned} \hbox{A} &\longrightarrow \hbox{X} \\ 2\hbox{X}+\hbox{Y} &\longrightarrow 3\hbox{X} \\ \hbox{B}+\hbox{X} &\longrightarrow \hbox{Y}+\hbox{C}\\ \hbox{X} &\longrightarrow \hbox{D}\\ \end{aligned} $$

This is a 6th-order state model with 2 inputs and 2 outputs. The inputs are the concentrations of molecular species A and B, and the outputs are the oscillatory species of interest, X and Y. The model uses simple mass-action kinetics to derive the chemical rate equations for each of the reactions taking place in the model. From this, the rate equations for the oscillatory species of interest are derived for the Brusselator model as:

$$ \frac{\hbox{d}X}{\hbox{d}t}=k_{1}A-k_{3}BX+k_{2}X^{2}Y-k_{4}X $$
(20)
$$ \frac{\hbox{d}Y}{\hbox{d}t}=k_{3}BX-k_{2}X^{2}Y $$
(21)

where X and Y are the outputs, A and B are the input species variables, and \(k_1, k_2, k_3\) and \(k_4\) are the rate constants. After setting the initial concentrations to A = 0.5, B = X = Y = 3.0 and C = D = 0.0, and the rate constants to \(k_1 = k_2 = k_3 = k_4 = 1\), the differential equations can be solved to generate a particular time series.
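Since (20), (21), the initial concentrations and the rate constants are all given, the simulation is straightforward to reproduce; a minimal SciPy sketch, in which the integration horizon and sampling grid are our assumptions:

```python
import numpy as np
from scipy.integrate import solve_ivp

k1 = k2 = k3 = k4 = 1.0
A, B = 0.5, 3.0                          # input species held constant

def brusselator(t, s):
    # Rate equations (20) and (21)
    X, Y = s
    dX = k1 * A - k3 * B * X + k2 * X**2 * Y - k4 * X
    dY = k3 * B * X - k2 * X**2 * Y
    return [dX, dY]

# Initial conditions from the text: X = Y = 3.0
sol = solve_ivp(brusselator, (0.0, 80.0), [3.0, 3.0],
                t_eval=np.linspace(0.0, 80.0, 800))
X_series, Y_series = sol.y               # oscillatory time series for modelling
```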

Identification of the Brusselator

From the above simulation, a data set of 800 samples was again generated to be used for model estimation and validation. As before, a uniformly distributed random noise signal was added to the data and then the sample values were normalised to within the range 0–1. Statistical measures from this new data are given in Table 7.

Table 7 Statistics of the input–output data sets used for training and validation

This time a polynomial NARX model of order 3, with inputs X(t−1), Y(t−1), A(t−1) and B(t−1), was used to construct the full model set, resulting in a candidate pool of 454 terms. The forward subset selection procedure was performed first, this time with the AIC as the stop criterion. For the case of modelling X(t) as the system output, 12 terms were selected from the entire candidate pool. The process was then repeated using the iterative forward and backward subset method. The different subsets of terms and parameters obtained by the two methods can be compared in Tables 8 and 9.

Table 8 Brusselator model structure for concentration of X obtained from forward selection
Table 9 Brusselator model structure for concentration of X obtained from two-stage, forward and backward subset selection

The modelling results produced by the two methods for training and validation (on different sized data sets of 30–400 samples) are listed in Table 10. Figures 12–16 show the variation in X(t) over time during the estimation stage, whereas Figs. 17–21 show this variation when validating the model on previously unseen data. These results again illustrate that the two-stage method outperforms the conventional forward approach in terms of modelling accuracy, as predicted. The figures also show that the model begins to reach a sufficient level of accuracy under validation when training has taken place on a data set of at least 100 samples.

Table 10 Brusselator training and validation results with mean squared error (MSE) between the model and target output given for different sized data sets
Fig. 12

Brusselator model estimation using only 30 data points

Fig. 13

Brusselator model estimation using only 50 data points

Fig. 14

Brusselator model estimation using 100 data points

Fig. 15

Brusselator model estimation using 200 data points

Fig. 16

Brusselator model estimation using 400 data points

Fig. 17

Brusselator model validation using only 30 data points

Fig. 18

Brusselator model validation using only 50 data points

Fig. 19

Brusselator model validation using 100 data points

Fig. 20

Brusselator model validation using 200 data points

Fig. 21

Brusselator model validation using 400 data points

Discussion

The work described in this paper has investigated the black-box identification of two well-known nonlinear molecular interaction pathways that have traditionally been modelled using white-box methods. A two-stage approach has been used to obtain an optimal nonlinear model effectively and efficiently, where the model terms are selected and refined using a forward and backward subset selection algorithm. The simulation experiments carried out to model the Brusselator and the MAPK signalling pathway have confirmed the efficacy of the proposed algorithm. One of the main contributions of this paper has been to show that, instead of the white-box modelling approaches which have been widely used in systems biology research, black-box methods offer an alternative for capturing the essential behaviour and dynamics of biological processes using a simplified model structure. This enables the identification and analysis of large-scale biological systems using a relatively small set of simple models, on the basis of which the design of control strategies may become possible. Future work will include using physically related basis functions to build nonlinear models from the underlying biological system, improving model transparency and interpretability.