Variable Selection and Grouping for Large-scale Data-driven Modelling


  • Esko K. Juuso



variable selection and grouping, data analysis, intelligent methods, data-driven modelling


For large-scale systems, the number of possible variable combinations becomes very large. Variable grouping means finding feasible groups of variables for modelling. Systems can be divided into subsystems, but even then the number of available variables is often too high for data-based methods. Interactive variable selection and grouping, in which the performance of alternative models is compared, is a good solution if there are not too many variables. This paper describes the possibilities for variable selection in large-scale industrial systems. It classifies variable selection and grouping into four categories: knowledge-based grouping, grouping with data analysis, decomposition, and model-based grouping and selection. The data analysis part consists of correlation analysis and the handling of high-dimensional data with principal components. These originally linear methodologies were extended to nonlinear systems by using the nonlinear scaling approach. Decomposition can be realised with various clustering methods or by learning with case-based reasoning. Multimodel systems are handled with fuzzy set systems. Numerous studies based on linear multivariate statistical modelling have been reported in the literature. The methodologies have been tested in several applications: bioprocesses, continuous brewing, condition monitoring, web break sensitivity analysis and wastewater treatment. Industrial process data, a pilot system and a test rig were used in the analysis. Uncertainty handling is part of the analysis method: uncertainty is represented with degrees of membership.
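
The data analysis category can be illustrated with a minimal sketch of the linear baseline: variables are grouped by thresholding the absolute correlation matrix, and the dimensionality is inspected through the variance explained by the principal components. This is not the paper's implementation. The function names, the threshold of 0.8 and the synthetic data set are assumptions made for illustration only, and the nonlinear scaling step mentioned in the abstract is not included.

```python
import numpy as np


def correlation_groups(X, threshold=0.8):
    """Group columns of X by absolute Pearson correlation.

    Groups are the connected components of the graph in which two
    variables are linked when |r| >= threshold (illustrative choice).
    """
    corr = np.abs(np.corrcoef(X, rowvar=False))
    unvisited = set(range(corr.shape[0]))
    groups = []
    while unvisited:
        seed = min(unvisited)
        unvisited.remove(seed)
        stack, group = [seed], [seed]
        while stack:
            i = stack.pop()
            linked = [j for j in sorted(unvisited) if corr[i, j] >= threshold]
            for j in linked:
                unvisited.remove(j)
                group.append(j)
                stack.append(j)
        groups.append(sorted(group))
    return groups


def pca_explained_variance(X):
    """Fraction of total variance explained by each principal component."""
    # Standardise so that PCA operates on the correlation structure.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    singular_values = np.linalg.svd(Z, compute_uv=False)
    variances = singular_values ** 2
    return variances / variances.sum()


# Hypothetical data: two latent signals drive five measured variables.
rng = np.random.default_rng(0)
n_samples = 500
a = rng.normal(size=n_samples)
b = rng.normal(size=n_samples)


def noise():
    return 0.1 * rng.normal(size=n_samples)


X = np.column_stack([
    a + noise(), 2 * a + noise(), -a + noise(),  # driven by signal a
    b + noise(), 0.5 * b + noise(),              # driven by signal b
])

groups = correlation_groups(X, threshold=0.8)
ratios = pca_explained_variance(X)
print("variable groups:", groups)
print("explained variance ratios:", ratios.round(3))
```

In this sketch the variables driven by the same latent signal end up in the same group, and the first two principal components carry most of the variance. For real process data the threshold would need to be chosen case by case, and strongly nonlinear dependencies would not be detected by a plain correlation coefficient.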

