Biopharmaceutical manufacturing involves multiple process steps that can be challenging to model using first principles. Often, operating conditions are studied in bench-scale experiments and then fixed to specific values during full-scale operations. This procedure limits the opportunity to tune process variables to correct for the effects of disturbances. Process models have the potential to increase the flexibility and controllability of biomanufacturing processes. This work proposes a statistical modeling methodology to predict the outputs of biopharmaceutical operations. The methodology addresses two challenging characteristics typical of data collected in the biopharmaceutical industry: limited data availability and data heterogeneity. Motivated by the final aim of control, regularization methods, specifically the elastic net, are combined with sampling techniques similar to the bootstrap to develop mathematical models that use only a small number of input variables. These techniques are of particular interest because of their ability to perform model selection and estimation simultaneously.
Process modeling techniques can be grouped into two broad categories: first-principles and data-based. This work focuses on data-based modeling, which is more often applied in biopharmaceutical manufacturing facilities. Data-based models have been applied to cell culture characterization [1] [2] [3], quality control [4] [5], process monitoring [3] [6] [7] [8], and downstream operations [3]. A drawback of current data-based methods applied in the biopharmaceutical industry is that the models that are produced are not easily interpretable because they rely on subspaces that do not have direct physical meaning.
A successful biopharmaceutical model was defined as achieving three goals: (1) model accuracy, (2) model simplicity, and (3) model interpretability. These goals must be met using only a small amount of heterogeneous data, because data from biopharmaceutical manufacturing are typically both heterogeneous and relatively limited.
One way to achieve these goals is to identify the input variables in the process that have the largest effects on the output variables. Regularization methods are suited to such problems because they handle input selection and model estimation simultaneously [9]. A particular regularization method, the elastic net [10], was chosen because of its ability to handle data with more measured variables than observations. The elastic net estimates the model parameters by solving the optimization problem

$$\min_{\beta_{0},\,\beta}\; \frac{1}{2N}\sum_{i=1}^{N}\left(y_{i}-\beta_{0}-x_{i}^{T}\beta\right)^{2}+\lambda P_{\alpha}(\beta), \qquad P_{\alpha}(\beta)=\frac{1-\alpha}{2}\lVert\beta\rVert_{2}^{2}+\alpha\lVert\beta\rVert_{1},$$

where N is the number of experiments, y_{i} is the ith scalar response, x_{i} is the p-dimensional data vector at observation i, λ is a nonnegative regularization parameter, β_{0} is a scalar parameter, β is a p-dimensional vector of model parameters, and α is on the interval (0,1]. Using this basis, a five-step methodology, referred to as the elastic net with Monte Carlo sampling (ENwMC), is proposed.
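As a minimal sketch of how an objective of this kind can be minimized, the following pure-Python code evaluates an elastic net objective and solves it by cyclic coordinate descent with soft-thresholding. It assumes the commonly used scaling (squared error divided by 2N, with penalty λ[(1−α)/2·‖β‖₂² + α‖β‖₁]); the function names, the fixed number of sweeps, and the toy data are illustrative assumptions and are not taken from the original work.

```python
def soft_threshold(z, t):
    """Shrink z toward zero by t; values within [-t, t] become exactly 0."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def predict(b0, beta, x):
    """Linear prediction b0 + x . beta for one observation x."""
    return b0 + sum(xj * bj for xj, bj in zip(x, beta))


def elastic_net_objective(X, y, b0, beta, lam, alpha):
    """Assumed form: RSS/(2N) + lam*((1-alpha)/2*||beta||_2^2 + alpha*||beta||_1)."""
    n = len(y)
    rss = sum((yi - predict(b0, beta, xi)) ** 2 for xi, yi in zip(X, y))
    l1 = sum(abs(b) for b in beta)
    l2 = sum(b * b for b in beta)
    return rss / (2 * n) + lam * ((1 - alpha) / 2 * l2 + alpha * l1)


def fit_elastic_net(X, y, lam, alpha, n_sweeps=500):
    """Cyclic coordinate descent; the intercept b0 is not penalized."""
    n, p = len(y), len(X[0])
    b0, beta = sum(y) / n, [0.0] * p
    for _ in range(n_sweeps):
        for j in range(p):
            rho = z = 0.0
            for xi, yi in zip(X, y):
                # Residual with the contribution of variable j added back in.
                partial = yi - predict(b0, beta, xi) + xi[j] * beta[j]
                rho += xi[j] * partial
                z += xi[j] ** 2
            denom = z / n + lam * (1 - alpha)
            beta[j] = soft_threshold(rho / n, lam * alpha) / denom if denom > 0 else 0.0
        # Exact minimizer of the objective in b0 for the current beta.
        b0 = sum(yi - predict(0.0, beta, xi) for xi, yi in zip(X, y)) / n
    return b0, beta


# Toy data (hypothetical): the response depends on the first input only.
X = [[1.0, 0.0], [2.0, 1.0], [3.0, 0.0], [4.0, 1.0], [5.0, 0.0], [6.0, 1.0]]
y = [2.0 * row[0] + 1.0 for row in X]

b0, beta = fit_elastic_net(X, y, lam=0.001, alpha=0.9)
print("light penalty:", b0, beta)
b0_big, beta_big = fit_elastic_net(X, y, lam=100.0, alpha=0.9)
print("heavy penalty:", b0_big, beta_big)
```

With a light penalty the fit stays close to ordinary least squares, while a sufficiently heavy penalty sets every coefficient exactly to zero; this is the mechanism that lets the elastic net perform variable selection and estimation together.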
The first step in ENwMC is an application of the elastic net, using leave-one-out cross validation to choose the value of α. In leave-one-out cross validation, all but one of the experimental observations are used to fit the model, and the remaining observation is used to calculate the prediction error. This step is repeated with each observation held out in turn, and the errors are averaged. The procedure is performed for many combinations of the regularization parameters α and λ, where the ranges of α and λ are chosen to capture the convex behavior of the error. Because α is the weighting between the ℓ_{2}- and ℓ_{1}-norm penalties and the goal is a sparse model, a value of α close to 1 is preferable. Therefore, α is chosen based on a tradeoff between model dimensionality and prediction error. In some cases this choice is trivial, because a higher value of α also leads to a more accurate model.
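The grid search over α and λ can be sketched as follows. To keep the example short and self-contained, it uses a single input variable, for which the elastic net solution (under the assumed scaling of squared error divided by 2N) has a closed form; the function names, grid values, and toy data are illustrative assumptions.

```python
def fit_enet_1d(x, y, lam, alpha):
    """Closed-form elastic net for one input with an unpenalized intercept."""
    n = len(y)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    rho = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / n
    z = sum((xi - x_bar) ** 2 for xi in x) / n
    # Soft-threshold the covariance, then apply the ridge-type shrinkage.
    shrunk = max(abs(rho) - lam * alpha, 0.0) * (1.0 if rho >= 0 else -1.0)
    denom = z + lam * (1 - alpha)
    beta = shrunk / denom if denom > 0 else 0.0
    return y_bar - beta * x_bar, beta


def loo_cv_error(x, y, lam, alpha):
    """Mean squared prediction error over all leave-one-out splits."""
    n = len(y)
    total = 0.0
    for i in range(n):
        x_fit, y_fit = x[:i] + x[i + 1:], y[:i] + y[i + 1:]
        b0, beta = fit_enet_1d(x_fit, y_fit, lam, alpha)
        total += (y[i] - (b0 + beta * x[i])) ** 2
    return total / n


def grid_search(x, y, alphas, lams):
    """Return {(alpha, lam): leave-one-out error} for every grid point."""
    return {(a, l): loo_cv_error(x, y, l, a) for a in alphas for l in lams}


# Toy data (hypothetical): roughly y = 2x + 1 with small perturbations.
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
y = [1.1, 2.9, 5.2, 6.8, 9.1, 11.0, 12.8, 15.1]
alphas = [0.1, 0.5, 0.9, 1.0]
lams = [0.01, 0.1, 1.0, 10.0]

errors = grid_search(x, y, alphas, lams)
for a in alphas:
    best_lam = min(lams, key=lambda l: errors[(a, l)])
    print(f"alpha={a}: best lambda={best_lam}, LOO error={errors[(a, best_lam)]:.4f}")
```

Tabulating the best error for each α in this way makes the tradeoff explicit: the analyst can pick the largest α whose error is still acceptable.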
Once the value of α is fixed, a test for over-fitting is performed using k-fold cross validation. Using Monte Carlo samples [11], the data are partitioned into a validation set containing a fraction 1/k of the data and a calibration set containing the rest. The elastic net, with α fixed, is then fit to the calibration set, and the input variables in the model with the minimum validation error are recorded. This step is repeated many times to converge to the distribution of models over the possible calibration and validation sets. The frequency with which each variable is selected is then calculated. In further analysis, only the variables that were selected above a threshold frequency are considered.
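A minimal sketch of this resampling step is given below. It draws random calibration/validation splits, fits an elastic net over a small grid of λ values on each calibration set, keeps the model with the lowest validation error, and counts how often each input has a nonzero coefficient. The solver, the λ grid, the number of rounds, the 0.8 threshold, and the toy data are all illustrative assumptions rather than the settings used in this work.

```python
import random


def soft_threshold(z, t):
    """Shrink z toward zero by t; values within [-t, t] become exactly 0."""
    return z - t if z > t else z + t if z < -t else 0.0


def predict(b0, beta, x):
    return b0 + sum(xj * bj for xj, bj in zip(x, beta))


def fit_elastic_net(X, y, lam, alpha, n_sweeps=50):
    """Coordinate descent for RSS/(2N) + lam*((1-alpha)/2*||b||_2^2 + alpha*||b||_1)."""
    n, p = len(y), len(X[0])
    b0, beta = sum(y) / n, [0.0] * p
    for _ in range(n_sweeps):
        for j in range(p):
            rho = sum(xi[j] * (yi - predict(b0, beta, xi) + xi[j] * beta[j])
                      for xi, yi in zip(X, y)) / n
            denom = sum(xi[j] ** 2 for xi in X) / n + lam * (1 - alpha)
            beta[j] = soft_threshold(rho, lam * alpha) / denom if denom > 0 else 0.0
        b0 = sum(yi - predict(0.0, beta, xi) for xi, yi in zip(X, y)) / n
    return b0, beta


def selection_frequency(X, y, alpha, lams, k=4, n_rounds=50, seed=0):
    """Fraction of random splits in which each input has a nonzero coefficient."""
    rng = random.Random(seed)
    n, p = len(y), len(X[0])
    n_val = max(1, n // k)
    counts = [0] * p
    for _ in range(n_rounds):
        idx = list(range(n))
        rng.shuffle(idx)
        val, cal = idx[:n_val], idx[n_val:]
        X_cal, y_cal = [X[i] for i in cal], [y[i] for i in cal]
        best_err, best_beta = None, None
        for lam in lams:
            b0, beta = fit_elastic_net(X_cal, y_cal, lam, alpha)
            err = sum((y[i] - predict(b0, beta, X[i])) ** 2 for i in val) / n_val
            if best_err is None or err < best_err:
                best_err, best_beta = err, beta
        for j, b in enumerate(best_beta):
            if b != 0.0:
                counts[j] += 1
    return [c / n_rounds for c in counts]


# Toy data (hypothetical): only the first of three inputs drives the response.
X = [[i - 5.5, 1.0 if i % 2 == 0 else -1.0, float(i % 3 - 1)] for i in range(12)]
y = [3.0 * row[0] + 0.1 * ((7 * i) % 5 - 2) for i, row in enumerate(X)]

freq = selection_frequency(X, y, alpha=0.9, lams=[0.01, 0.1, 0.5, 1.0])
threshold = 0.8  # hypothetical cutoff on the selection frequency
kept = [j for j, f in enumerate(freq) if f >= threshold]
print("selection frequencies:", freq)
print("inputs kept:", kept)
```

Inputs that are selected in nearly every split are robust to the particular choice of calibration data, which is the property the frequency threshold is meant to capture.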
The subset of selected variables is considered for inclusion in a model using best subset selection. The error of all possible ordinary least squares models of size m ≤ p, where p is now the number of variables retained by the threshold, is calculated. A model from this set is then selected based on the tradeoff between increasing dimensionality and decreasing error. This tradeoff is easily visualized by plotting the prediction error against the model dimension to create a Pareto curve. Plots of this type often exhibit an “elbow,” which corresponds to the model dimensionality that best compromises between model size and prediction error. The result of this step is the final model.
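The best subset search and the resulting error-versus-size curve can be sketched as below. The example fits an ordinary least squares model with an intercept to every subset of the retained inputs and keeps the lowest-error subset of each size. Measuring error as the residual sum of squares on the fitting data, solving the normal equations directly, and the toy data are simplifying assumptions; a cross-validated prediction error could be substituted.

```python
from itertools import combinations


def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting (A assumed nonsingular)."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x


def ols_sse(X, y, cols):
    """Residual sum of squares of an OLS fit (with intercept) on the given columns."""
    Z = [[1.0] + [row[c] for c in cols] for row in X]
    q = len(Z[0])
    A = [[sum(z[i] * z[j] for z in Z) for j in range(q)] for i in range(q)]
    b = [sum(z[i] * yi for z, yi in zip(Z, y)) for i in range(q)]
    coef = solve_linear(A, b)
    return sum((yi - sum(c * zi for c, zi in zip(coef, z))) ** 2 for z, yi in zip(Z, y))


def best_subsets(X, y):
    """For each model size m, return the lowest-error subset as {m: (columns, sse)}."""
    p = len(X[0])
    curve = {}
    for m in range(1, p + 1):
        candidates = ((cols, ols_sse(X, y, cols)) for cols in combinations(range(p), m))
        curve[m] = min(candidates, key=lambda item: item[1])
    return curve


# Toy data (hypothetical): the response depends on inputs 0 and 2 only.
X = [[i - 4.5, 1.0 if i % 2 == 0 else -1.0, float((3 * i) % 7 - 3)] for i in range(10)]
y = [3.0 * r[0] - 2.0 * r[2] + 0.05 * ((7 * i) % 5 - 2) for i, r in enumerate(X)]

curve = best_subsets(X, y)
for m, (cols, sse) in curve.items():
    print(f"size {m}: inputs {cols}, SSE = {sse:.4f}")
```

In this toy case the error drops sharply from one input to two and barely changes with a third, so the elbow of the curve sits at a model of size two.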
Figure 1: Simplified flowsheet of the antibody production process. The bioreactor volume was 2000 L, and the bioreactor was operated in fed-batch mode. The column loadings were typical of an antibody purification process [12].
The developed methodology is evaluated on an antibody manufacturing dataset (see Figure 1) and compared with two multivariate analysis techniques that are well known in the (bio)pharmaceutical field: principal component regression (PCR) and partial least squares (PLS). In a majority of cases, the elastic net technique outperformed PCR and PLS in terms of error and variance (see Table 1). Averaged over all of the output variables that were considered, the sum-of-squared errors decreased 27% and the variance decreased 48% using a regularized model as compared to the latent variable models. The regularized models have the added benefit of being easily interpreted in terms of the process variables.
Table 1: Comparisons of scaled error and variance for PCR, PLS, and ENwMC modeling techniques. The bold number marks the model with the best performance for each variable.
| Unit Operation | Output Variable | Error (PCR) | Error (PLS) | Error (ENwMC) | Prediction variance (PCR) | Prediction variance (PLS) | Prediction variance (ENwMC) |
|---|---|---|---|---|---|---|---|
| Bioreactor | G0 Product Quality | 3.79 | 4.25 | **3.41** | 0.146 | 0.148 | **0.087** |
| | Final Titer | 5.38 | **3.35** | 5.40 | 0.281 | 0.287 | **0.178** |
| | DNA | 7.58 | 6.77 | **5.20** | 0.209 | **0.201** | 0.223 |
| | HCP | 4.30 | 2.85 | **1.67** | 0.258 | 0.210 | **0.150** |
| Protein A Column | DNA | 4.26 | 4.33 | **2.71** | 0.151 | 0.143 | **0.095** |
| | HCP | 4.71 | **1.60** | 1.92 | 0.268 | 0.202 | **0.080** |
| | Total Impurity | 9.22 | 7.98 | **2.40** | 0.286 | 0.256 | **0.164** |
| | HMW | 2.08 | 2.54 | **1.11** | 0.117 | 0.092 | **0.045** |
| Cation Exchange Column | HCP | **1.57** | 1.99 | 1.96 | 0.226 | 0.132 | **0.083** |
| | Total Impurity | 7.78 | **5.73** | 7.18 | 0.323 | 0.348 | **0.226** |
| | HMW | 1.73 | 1.45 | **0.32** | 0.058 | 0.063 | **0.010** |
| Anion Exchange Column | HCP | 2.63 | 2.59 | **1.20** | 0.189 | 0.140 | **0.048** |
| | Total Impurity | 4.65 | **1.56** | 2.48 | 0.228 | 0.227 | **0.115** |
| | HMW | 0.54 | 0.24 | **0.23** | 0.067 | 0.050 | **0.007** |
References
[1] S. M. Mercier, B. Diepenbroek, M. C. F. Dalm and R. H. Wijffels, "Multivariate data analysis as a PAT tool for early bioprocess development data," Journal of Biotechnology, vol. 167, pp. 262-270, 2013.
[2] A. Kirdar, J. Conner, J. Baclaski and A. S. Rathore, "Application of multivariate analysis toward biotech processes: Case study of a cell-culture unit operation," Biotechnology Progress, vol. 23, no. 1, pp. 61-67, 2007.
[3] A. S. Rathore, N. Bhushan and S. Hadpe, "Chemometrics applications in biotech processes: A review," Biotechnology Progress, vol. 27, no. 2, pp. 307-315, 2011.
[4] Y. Roggo, P. Chalus, L. Maurer, C. Lema-Martinez, A. Edmond and N. Jent, "A review of near infrared spectroscopy and chemometrics in pharmaceutical technologies," Journal of Pharmaceutical and Biomedical Analysis, vol. 44, no. 3, pp. 683-700, 2007.
[5] Z. Chen, D. Lovett and J. Morris, "Process analytical technologies and real time process control a review of some spectroscopic issues and challenges," Journal of Process Control, vol. 21, no. 10, pp. 1467-1482, 2011.
[6] E. Read, J. Park, R. Shah, B. S. Riley, K. A. Brorson and A. S. Rathore, "Process analytical technology (PAT) for biopharmaceutical products: Part I concepts and applications," Biotechnology and Bioengineering, vol. 104, no. 2, pp. 276-284, 2010.
[7] E. Read, R. Shah, B. S. Riley, J. T. Park, K. A. Brorson and A. S. Rathore, "Process analytical technology (PAT) for biopharmaceutical products: Part II concepts and applications," Biotechnology and Bioengineering, vol. 105, no. 2, pp. 285-295, 2010.
[8] D. Bonné, M. A. Alvarez and S. B. Jorgensen, "Data driven modeling for monitoring and control of industrial fed-batch cultivations," Industrial & Engineering Chemistry Research, vol. 53, pp. 7365-7381, 2013.
[9] S. Pampuri, A. Schirru, G. Fazio and G. De Nicolao, "Multilevel lasso applied to virtual metrology in semiconductor manufacturing," in 2011 IEEE International Conference on Automation Science and Engineering, Trieste, 2011.
[10] H. Zou and T. Hastie, "Regularization and variable selection via the Elastic Net," Journal of the Royal Statistical Society, Series B (Statistical Methodology), vol. 67, no. 2, pp. 301-320, 2005.
[11] N. Metropolis and S. Ulam, "The Monte Carlo method," Journal of the American Statistical Association, vol. 44, no. 247, pp. 335-341, 1949.
[12] A. Shukla and J. Thommes, "Recent advances in large-scale production of monoclonal antibodies and related proteins," Trends in Biotechnology, vol. 28, no. 5, pp. 253-261, 2010.