An Explainable Ensemble Learning Framework for Accurate Fake News Detection Using TF-IDF Features
Keywords:
fake news detection; gradient boosting; XGBoost; LightGBM; CatBoost; TF-IDF; natural language processing; LIME; misinformation
Abstract
The rapid proliferation of misinformation on digital platforms has become a critical societal challenge. This paper presents a comparative study of four gradient boosting ensemble classifiers for automated fake news detection: Gradient Boosting Classifier (GBC), XGBoost, LightGBM, and CatBoost. The publicly available Kaggle Fake and Real News Dataset serves as the benchmark. A systematic text preprocessing pipeline is applied, followed by Term Frequency-Inverse Document Frequency (TF-IDF) feature extraction over unigrams and bigrams. All four models are trained on an 80/20 stratified train/test split and evaluated using Accuracy, Precision, Recall, F1-Score, ROC-AUC, and RMSE. LightGBM achieves the highest F1-Score (0.9959) and a ROC-AUC of 0.9996, followed closely by Gradient Boosting (F1 = 0.9958), XGBoost (F1 = 0.9956), and CatBoost (F1 = 0.9955). LIME (Local Interpretable Model-Agnostic Explanations) is applied to the best model to identify the textual features that most influence its predictions. An error analysis based on misclassification counts further supports the robustness of all four classifiers. The results indicate that TF-IDF-based gradient boosting ensembles provide highly accurate, efficient, and interpretable solutions for real-world fake news detection.
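To make the feature extraction and evaluation steps concrete, the following is a minimal standard-library sketch, not the authors' implementation. It assumes whitespace tokenization, raw term counts, and a common smoothed IDF variant, ln((1 + N) / (1 + df)) + 1, without vector normalization. The function names `ngrams`, `tfidf`, and `binary_metrics` are hypothetical, and the RMSE here is computed on hard 0/1 predictions, which may differ from how the paper computes it.

```python
import math
from collections import Counter


def ngrams(text, n_max=2):
    """Lowercase, split on whitespace, and return unigrams plus bigrams."""
    tokens = text.lower().split()
    grams = []
    for n in range(1, n_max + 1):
        grams += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return grams


def tfidf(docs):
    """Return one {term: weight} dict per document (assumed smoothed IDF)."""
    term_lists = [ngrams(d) for d in docs]
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for terms in term_lists for term in set(terms))
    vectors = []
    for terms in term_lists:
        tf = Counter(terms)  # raw term counts within this document
        vectors.append({
            t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1.0)
            for t, c in tf.items()
        })
    return vectors


def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, F1, and RMSE for hard 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    rmse = math.sqrt(sum((t - p) ** 2 for t, p in pairs) / len(pairs))
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "rmse": rmse,
    }
```

With hard 0/1 predictions, the squared error of each sample is either 0 or 1, so the RMSE in this sketch equals the square root of the error rate (1 minus accuracy). It therefore carries the same information as accuracy unless it is computed on predicted probabilities instead.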
License

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.




