Building Scalable ETL Pipelines for HR Data
Chetan Urkudkar, Senior Staff Software Development Engineer, LiveRamp Inc., San Ramon, California, USA
Abstract
This article presents the development and experimental validation of scalable ETL pipelines for HR data, aimed at bridging the gap between the volume of heterogeneous workforce events and the capabilities of traditional nightly processes. The relevance of the study stems from the rapid growth of the HR technology market, which reached USD 40.45 billion in 2024 and is forecast to double by 2032 at a 9.2% CAGR, and from the fragmentation of corporate systems, which leads to incomplete, inconsistent, and delayed data when analyzing turnover metrics and the effectiveness of talent-development programs. The work has three goals: to formalize requirements for Extraction, Transformation, Loading, Scalability, and Observability; to design a containerized architecture based on Kubernetes, Apache Airflow, Spark, and Flink-CDC; and to ensure low latency, exactly-once semantics, and linear scaling up to 32 worker pods with an efficiency η of 0.78 or greater. The novelty of the work lies in the first formal model that integrates adaptive API-request throttling with idempotent SCD-attribute transformations for a hybrid Iceberg/Snowflake storage layer, together with a complete observability system using Prometheus and OpenTelemetry with real-time alerts. An experimental evaluation on a private Kubernetes cluster under a load of up to 10⁸ records per day demonstrated end-to-end latency ≤ 15 min in batch mode, a reduction of p95 latency to 48 s in near-real-time mode, throughput of up to 18.7k records/min with linear worker scaling (η = 0.82), and full lineage-graph traceability in compliance with GDPR. The main conclusions confirm that the proposed architecture provides reliable and reproducible HR-data integration with minimal latency and predictable cost, paving the way for practical deployment in large enterprises. This article will be helpful to data engineers, cloud-architecture designers, and project managers working on HR analytics automation.
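The abstract reports a scaling efficiency η without stating its formula. As a minimal sketch, assuming η is the usual parallel efficiency (observed speedup divided by the number of workers), the check against the 0.78 target could look like the following. The function name and the throughput figures are hypothetical and are not taken from the article's measurements.

```python
def scaling_efficiency(throughput_1: float, throughput_n: float, workers: int) -> float:
    """Parallel efficiency: observed speedup divided by the ideal speedup.

    Assumption: eta = (throughput_n / throughput_1) / workers. The article
    may define eta differently.
    """
    if workers < 1 or throughput_1 <= 0:
        raise ValueError("workers must be >= 1 and throughput_1 must be positive")
    return (throughput_n / throughput_1) / workers


# Hypothetical measurements in records/min, for illustration only.
baseline = 700.0      # one worker pod
measured = 18_000.0   # 32 worker pods
eta = scaling_efficiency(baseline, measured, 32)

# 0.78 is the efficiency target stated in the abstract.
meets_target = eta >= 0.78
print(f"eta = {eta:.2f}, meets target: {meets_target}")
```

Under this definition, η = 1.0 means perfectly linear scaling, and lower values indicate coordination or I/O overhead as pods are added.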
Keywords
ETL pipeline, HR data, scalability, observability, Iceberg, Snowflake, Kubernetes, Airflow
Copyright License
Copyright (c) 2025 Chetan Urkudkar

This work is licensed under a Creative Commons Attribution 4.0 International License.