Applied Sciences
Open Access
Data-Driven Retention Targeting: A Holistic Analytics Framework Spanning Prediction, Causality, and Fairness
Nidhi Singh, Senior Data Analyst, State of Alabama, AL, USA
Abstract
Employee attrition entails significant costs in hiring, lost productivity, and lost know-how, which motivates machine learning research on attrition prediction. Nevertheless, most existing work reports a single discrimination statistic on a single split of the synthetic IBM HR Analytics attrition dataset, offers little decision guidance, and rarely addresses cost, interpretability, dynamics over time, robustness to new data, and fairness together. In contrast, this paper proposes a holistic, decision-oriented framework and a new targeting policy that contribute to attrition analysis along six dimensions: (i) a cost-sensitive stacked ensemble (LightGBM, CatBoost, logistic regression) evaluated with repeated cross-validation, confidence intervals, and an expected net savings metric rooted in retention economics, where calibration proves indispensable for any cost-sensitive application; (ii) post hoc explainability based on SHAP (SHapley Additive exPlanations) together with DiCE (Diverse Counterfactual Explanations) counterfactual recourse; (iii) survival analysis (Kaplan-Meier estimator, Cox proportional hazards model) applied to a time-to-event turnover dataset, which also serves as a second classification target; (iv) uplift modeling using three kinds of meta-learners (S-, T-, and X-learners); (v) a Fairness-Aware Cost-Sensitive Retention Targeting policy, FACS-RT, which integrates uplift, cost, and fairness optimization in one algorithm and constructs a value-fairness Pareto frontier; and (vi) leave-one-department-out resampling and an audit against a group fairness criterion. On the IBM dataset (n = 1,470), our approach yields a mean cross-validated AUC of 0.83 (95% confidence interval 0.79–0.89), statistically indistinguishable from a strong logistic regression baseline (paired-bootstrap test, p = 0.37), and isotonic calibration brings the expected calibration error (ECE) down to 0.04. On the turnover dataset (n = 1,129), our method achieves an AUC of 0.72 and a Cox concordance of 0.66. Risk-based and uplift-based rankings agree only partially (27%). FACS-RT retains 83% of the maximum expected value while reducing the gender disparity in selection rates by 92% (0.053 → 0.004).
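As a reading aid for the fairness figures above, the following minimal Python sketch shows one common way to compute a selection-rate disparity between groups and the relative reduction between two disparity values. It is an illustration under stated assumptions, not the paper's implementation: the function names `selection_rate_gap` and `relative_reduction` and the toy data are hypothetical, and the abstract does not specify exactly how the disparity was measured. Only the values 0.053 and 0.004 come from the abstract.

```python
import numpy as np


def selection_rate_gap(selected, group):
    """Largest difference in selection rate between any two groups.

    Assumption: disparity is measured as a demographic-parity style gap.
    """
    selected = np.asarray(selected, dtype=float)
    group = np.asarray(group)
    # Selection rate = share of each group's members chosen for the intervention.
    rates = [selected[group == g].mean() for g in np.unique(group)]
    return max(rates) - min(rates)


def relative_reduction(before, after):
    """Fractional decrease from `before` to `after`."""
    return (before - after) / before


# Hypothetical toy data: 1 = targeted for a retention intervention.
selected = [1, 0, 1, 0, 1, 0, 0, 0]
group = ["F", "F", "F", "F", "M", "M", "M", "M"]
gap = selection_rate_gap(selected, group)  # F rate 0.50, M rate 0.25

# Disparity values reported in the abstract.
reduction = relative_reduction(0.053, 0.004)
print(f"toy gap = {gap:.2f}, reported reduction = {reduction:.1%}")
```

With the reported values, (0.053 − 0.004) / 0.053 is roughly 0.92, which is consistent with the stated 92% decrease.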
Keywords
Explainable AI, SHAP, Counterfactual explanation, Survival data analysis, Uplift modeling, Cost-sensitive learning, Algorithmic fairness
References
Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32. DOI: 10.1023/A:1010933404324
Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 785–794). DOI: 10.1145/2939672.2939785 · arXiv: 1603.02754
Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q., & Liu, T.-Y. (2017). LightGBM: A highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems (Vol. 30, pp. 3146–3154). URL: proceedings.neurips.cc/paper/2017/hash/6449f44a102fde848669bdd9eb6b76fa-Abstract.html
Prokhorenkova, L., Gusev, G., Vorobev, A., Dorogush, A. V., & Gulin, A. (2018). CatBoost: Unbiased boosting with categorical features. In Advances in Neural Information Processing Systems (Vol. 31, pp. 6638–6648). URL: proceedings.neurips.cc/paper/2018/hash/14491b756b3a51daac41c24863285549-Abstract.html · arXiv: 1706.09516
Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16, 321–357. DOI: 10.1613/jair.953
Lundberg, S. M., & Lee, S.-I. (2017). A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems (Vol. 30, pp. 4765–4774). URL: proceedings.neurips.cc/paper/2017/hash/8a20a8621978632d76c43dfd28b67767-Abstract.html · arXiv: 1705.07874
Mothilal, R. K., Sharma, A., & Tan, C. (2020). Explaining machine learning classifiers through diverse counterfactual explanations. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency (pp. 607–617). DOI: 10.1145/3351095.3372850
Kaplan, E. L., & Meier, P. (1958). Nonparametric estimation from incomplete observations. Journal of the American Statistical Association, 53(282), 457–481. DOI: 10.1080/01621459.1958.10501452
Cox, D. R. (1972). Regression models and life-tables. Journal of the Royal Statistical Society: Series B, 34(2), 187–202. DOI: 10.1111/j.2517-6161.1972.tb00899.x
Künzel, S. R., Sekhon, J. S., Bickel, P. J., & Yu, B. (2019). Metalearners for estimating heterogeneous treatment effects using machine learning. Proceedings of the National Academy of Sciences, 116(10), 4156–4165. DOI: 10.1073/pnas.1804597116
Barocas, S., & Selbst, A. D. (2016). Big data's disparate impact. California Law Review, 104(3), 671–732. DOI: 10.15779/Z38BG31 · SSRN: ssrn.com/abstract=2477899
Hardt, M., Price, E., & Srebro, N. (2016). Equality of opportunity in supervised learning. In Advances in Neural Information Processing Systems (Vol. 29, pp. 3315–3323). URL: proceedings.neurips.cc/paper/2016/hash/9d2682367c3935defcb1f9e247a97c0d-Abstract.html · arXiv: 1610.02413
Bird, S., Dudík, M., Edgar, R., Horn, B., Lutz, R., Milan, V., Sameki, M., Wallach, H., & Walker, K. (2020). Fairlearn: A toolkit for assessing and improving fairness in AI (Microsoft Technical Report MSR-TR-2020-32). Microsoft Research. URL: microsoft.com/en-us/research/publication/fairlearn-a-toolkit-for-assessing-and-improving-fairness-in-ai/
Pedregosa, F., Varoquaux, G., Gramfort, A., et al. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830. URL: jmlr.org/papers/v12/pedregosa11a.html
Davidson-Pilon, C. (2019). lifelines: Survival analysis in Python. Journal of Open Source Software, 4(40), 1317. DOI: 10.21105/joss.01317
Ribeiro, M. T., Singh, S., & Guestrin, C. (2016). "Why should I trust you?" Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 1135–1144). DOI: 10.1145/2939672.2939778 · arXiv: 1602.04938
Wolpert, D. H. (1992). Stacked generalization. Neural Networks, 5(2), 241–259. DOI: 10.1016/S0893-6080(05)80023-1
Elkan, C. (2001). The foundations of cost-sensitive learning. In Proceedings of the 17th International Joint Conference on Artificial Intelligence (pp. 973–978). URL: dl.acm.org/doi/10.5555/1642194.1642224 · PDF: cseweb.ucsd.edu/~elkan/rescale.pdf
Allen, D. G., Bryant, P. C., & Vardaman, J. M. (2010). Retaining talent: Replacing misconceptions with evidence-based strategies. Academy of Management Perspectives, 24(2), 48–64. DOI: 10.5465/amp.24.2.48 · JSTOR: jstor.org/stable/25682398
Athey, S., & Imbens, G. W. (2016). Recursive partitioning for heterogeneous causal effects. Proceedings of the National Academy of Sciences, 113(27), 7353–7360. DOI: 10.1073/pnas.1510489113 · arXiv: 1504.01132
Harrell, F. E., Califf, R. M., Pryor, D. B., Lee, K. L., & Rosati, R. A. (1982). Evaluating the yield of medical tests. JAMA, 247(18), 2543–2546. DOI: 10.1001/jama.1982.03320430047030
IBM. (2017). IBM HR Analytics Employee Attrition & Performance dataset. Kaggle. URL: kaggle.com/datasets/pavansubhasht/ibm-hr-analytics-attrition-dataset
Copyright License
Copyright (c) 2024 Nidhi Singh

This work is licensed under a Creative Commons Attribution 4.0 International License.
Authors retain the copyright of their manuscripts, and all Open Access articles are disseminated under the terms of the Creative Commons Attribution License 4.0 (CC-BY), which licenses unrestricted use, distribution, and reproduction in any medium, provided that the original work is appropriately cited. The use of general descriptive names, trade names, trademarks, and so forth in this publication, even if not specifically identified, does not imply that these names are not protected by the relevant laws and regulations.
