Articles | Open Access | DOI: https://doi.org/10.37547/tajet/Volume07Issue06-09

Building Scalable ETL Pipelines for HR Data

Chetan Urkudkar, Senior Staff Software Development Engineer, LiveRamp Inc., San Ramon, California, USA

Abstract

This article presents the development and experimental validation of scalable ETL pipelines for HR data, aimed at bridging the gap between the volume of heterogeneous workforce events and the capabilities of traditional nightly processes. The study is motivated by the growth of the HR technology market to USD 40.45 billion in 2024, with a forecast doubling by 2032 at a 9.2% CAGR, and by the fragmentation of corporate systems, which leads to incomplete, inconsistent, and delayed data for turnover metrics and for the analysis of talent-development program effectiveness. The work aims at formalizing requirements for Extraction, Transformation, Loading, Scalability, and Observability; at designing a containerized architecture based on Kubernetes, Apache Airflow, Spark, and Flink-CDC; and at ensuring low latency, exactly-once semantics, and linear scaling up to 32 worker pods with an efficiency η of at least 0.78. The novelty of the work lies in the first formal model that integrates adaptive API-request throttling with idempotent SCD-attribute transformations for a hybrid Iceberg/Snowflake storage layer, together with a complete observability system using Prometheus and OpenTelemetry with real-time alerts. An experimental evaluation on a private Kubernetes cluster under loads of up to 10⁸ records per day demonstrated end-to-end latency of at most 15 min in batch mode, a p95 latency reduced to 48 s in near-real-time mode, throughput of up to 18.7k records/min with linear worker scaling (η = 0.82), and full lineage-graph traceability in compliance with GDPR. The main conclusions confirm that the proposed architecture provides reliable and reproducible HR-data integration with low latency and predictable cost, paving the way for practical deployment in large enterprises. This article will be useful to data engineers, cloud architects, and project managers working on HR analytics automation.
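The abstract states a scaling-efficiency target (η of at least 0.78) without defining η. The sketch below assumes the common definition of parallel efficiency, namely measured speedup divided by ideal linear speedup; the article's own formula may differ. The function names and the sample throughput figures are hypothetical and are not taken from the article.

```python
def scaling_efficiency(base_throughput: float, base_workers: int,
                       scaled_throughput: float, scaled_workers: int) -> float:
    """Parallel efficiency: measured speedup divided by ideal (linear) speedup.

    Assumed definition; the article does not spell out how it computes η.
    """
    if min(base_throughput, base_workers, scaled_throughput, scaled_workers) <= 0:
        raise ValueError("all inputs must be positive")
    speedup = scaled_throughput / base_throughput      # how much faster we got
    ideal_speedup = scaled_workers / base_workers      # how much faster linear scaling would be
    return speedup / ideal_speedup


def meets_target(eta: float, target: float = 0.78) -> bool:
    """Check η against the target stated in the abstract (0.78 by default)."""
    return eta >= target


if __name__ == "__main__":
    # Hypothetical measurements in records/min, for illustration only.
    eta = scaling_efficiency(2_400.0, 4, 15_360.0, 32)
    print(f"eta = {eta:.2f}, meets target: {meets_target(eta)}")
```

Under this definition, η = 1.0 corresponds to perfectly linear scaling, so a reported η of 0.82 at 32 worker pods would mean the cluster delivers roughly 82% of the throughput that ideal linear scaling would predict.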

Keywords

ETL pipeline, HR data, scalability, observability, Iceberg, Snowflake, Kubernetes, Airflow

How to Cite

Chetan Urkudkar. (2025). Building Scalable ETL Pipelines for HR Data. The American Journal of Engineering and Technology, 7(06), 88–95. https://doi.org/10.37547/tajet/Volume07Issue06-09