Multi-Agent Automated Feature Engineering for High-Dimensional Big Data

Himant Goyal, Prabhav Rathi, Sheetal Tatiya

Keywords:

Multi-agent systems, automated feature engineering, high-dimensional data, distributed computing, AutoML, reinforcement learning

Abstract

Feature engineering remains one of the most critical yet time-consuming bottlenecks in building effective machine learning pipelines, especially in high-dimensional big data environments where the feature space is vast, noisy, and often poorly understood. Manual feature engineering demands significant domain expertise, is difficult to scale, and frequently fails to uncover complex, non-linear relationships hidden within the data. This paper proposes a Multi-Agent Framework for Automated Feature Engineering (MAFE) designed to address these challenges through intelligent automation, specialization, and inter-agent coordination.

The framework deploys a population of autonomous agents, each assigned a specialized role in the feature transformation pipeline: feature generator, feature selector, redundancy eliminator, or performance evaluator. Agents interact through a competitive-collaborative mechanism: they compete to propose the most predictive feature subsets and collaborate by sharing high-value transformations through a shared knowledge pool. A master orchestrator agent governs agent interactions, resolves conflicts, and enforces computational constraints so that the system remains efficient and scalable across large datasets.
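The interaction between generator agents, the shared knowledge pool, and the orchestrator can be illustrated with a minimal sketch. Every name below (`KnowledgePool`, `GeneratorAgent`, `Orchestrator`, `budget`) is hypothetical and does not come from the MAFE implementation, and absolute Pearson correlation with the target stands in for the performance evaluator agents, whose actual scoring is not specified here.

```python
from dataclasses import dataclass, field
from statistics import mean


def pearson(xs, ys):
    """Plain Pearson correlation; returns 0.0 when either input is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5 if vx > 0 and vy > 0 else 0.0


@dataclass
class KnowledgePool:
    """Shared store of scored features that every agent can read and write."""
    scores: dict = field(default_factory=dict)   # feature name -> score
    columns: dict = field(default_factory=dict)  # feature name -> values

    def publish(self, name, values, score):
        self.scores[name] = score
        self.columns[name] = values


class GeneratorAgent:
    """Proposes one transformed feature per raw column (hypothetical role)."""

    def __init__(self, label, transform):
        self.label, self.transform = label, transform

    def propose(self, raw):
        return {f"{self.label}({col})": [self.transform(v) for v in vals]
                for col, vals in raw.items()}


class Orchestrator:
    """Collects proposals, scores them, and enforces a feature budget."""

    def __init__(self, agents, budget):
        self.agents, self.budget = agents, budget

    def run(self, raw, target):
        pool = KnowledgePool()
        for agent in self.agents:
            for name, values in agent.propose(raw).items():
                # Assumed stand-in evaluator: |correlation| with the target.
                pool.publish(name, values, abs(pearson(values, target)))
        # Agents compete: only the top-scoring features fit the budget.
        ranked = sorted(pool.scores, key=pool.scores.get, reverse=True)
        return ranked[: self.budget], pool


raw = {"x": [1.0, 2.0, 3.0, 4.0, 5.0]}
target = [1.0, 4.0, 9.0, 16.0, 25.0]  # toy target: y = x**2
agents = [GeneratorAgent("square", lambda v: v * v),
          GeneratorAgent("negate", lambda v: -v),
          GeneratorAgent("identity", lambda v: v)]
selected, pool = Orchestrator(agents, budget=1).run(raw, target)
```

In this toy run, all three proposals land in the pool, but only the squared feature survives the budget of one, since it matches the quadratic target exactly. A fuller system would replace the correlation score with a cross-validated model metric and let agents read the pool before proposing again.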

Each agent is driven by a reinforcement learning policy that iteratively refines its transformation strategy using reward signals derived from downstream model performance metrics such as AUC, F1-score, and cross-validation accuracy. The framework integrates graph-based feature dependency modeling to detect and eliminate multicollinearity, while a meta-learning module accelerates convergence by transferring knowledge from previously solved feature engineering tasks. Distributed computing support via Apache Spark enables the framework to handle datasets with millions of rows and thousands of features without significant performance degradation.
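One simple way to picture graph-based redundancy elimination is a correlation graph: features are nodes, an edge links two features whose absolute Pearson correlation exceeds a threshold, and one representative is kept per connected component. This is a sketch under assumptions, not the dependency model used by MAFE. The function names, the 0.95 threshold, the `relevance` scores, and the connected-component rule are all illustrative choices.

```python
from statistics import mean


def pearson(xs, ys):
    """Plain Pearson correlation; returns 0.0 when either input is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5 if vx > 0 and vy > 0 else 0.0


def correlation_graph(features, threshold):
    """Adjacency sets linking features whose |correlation| >= threshold."""
    names = list(features)
    graph = {name: set() for name in names}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if abs(pearson(features[a], features[b])) >= threshold:
                graph[a].add(b)
                graph[b].add(a)
    return graph


def prune_redundant(features, relevance, threshold=0.95):
    """Keep the most relevant feature from each connected component."""
    graph = correlation_graph(features, threshold)
    seen, kept = set(), []
    for start in features:
        if start in seen:
            continue
        # Depth-first search collects one group of mutually redundant features.
        component, stack = [], [start]
        seen.add(start)
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbour in graph[node] - seen:
                seen.add(neighbour)
                stack.append(neighbour)
        kept.append(max(component, key=lambda name: relevance[name]))
    return kept


features = {
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "a_scaled": [2.0, 4.0, 6.0, 8.0, 10.5],  # nearly collinear with "a"
    "b": [5.0, 1.0, 4.0, 2.0, 3.0],
}
relevance = {"a": 0.6, "a_scaled": 0.7, "b": 0.3}  # assumed evaluator scores
kept = prune_redundant(features, relevance)
```

Here `a` and `a_scaled` fall into the same component, so only the one with the higher assumed relevance is retained, while `b` is kept because it is only weakly correlated with the others. Grouping by connected components is transitive, so it can merge features that are not directly correlated; a stricter variant would require pairwise correlation within each group.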

Empirical evaluations across diverse benchmark datasets, including financial, genomic, and IoT domains, demonstrate that MAFE consistently outperforms both manual feature engineering approaches and existing AutoML baselines in predictive accuracy, feature interpretability, and computational efficiency. This work contributes to the AutoML landscape a robust, adaptive, and production-ready solution to one of data science's most persistent challenges.
