From Data Entry to ML Inference: End-to-End Pipelines for Duplicate Detection
Srinivasarao Daruna, Senior Software Dev Engineer, McLean, Virginia, USA
Abstract
This study systematically presents the fundamental principles for constructing end-to-end pipelines for duplicate detection using machine learning methods. The objective is to analyze and formalize an architectural schema that unifies stream processing, adaptive ML mechanisms, and scalable cloud components. The methodology rests on a review of existing entity resolution approaches and on the design of an integrated architecture informed by the key concepts of patent US11995054B2. The technological core comprises Apache Kafka for stream orchestration, Apache Spark for distributed processing, Amazon SageMaker for model development and management, and NoSQL stores for flexible, scalable persistence of intermediate and final data. The result is a fault-tolerant, horizontally scalable architecture intended for near-real-time operation. Its central mechanism is a machine learning system with a continuous feedback loop, in which user verdicts on ambiguous duplicate cases drive dynamic retraining and improve detection quality. The findings offer practical value for data architects, machine learning engineers, and researchers who manage data quality in high-throughput analytical systems.
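The feedback loop described in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and is not taken from the study or the patent: the threshold values, the `similarity` stand-in (a plain string ratio in place of a trained model), the `route` function, and the `FeedbackBuffer` class are all hypothetical. The sketch only shows the general shape of the idea: confident cases are decided automatically, ambiguous cases go to a user, and the user's verdicts accumulate as labeled examples that could trigger retraining.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher

# Hypothetical thresholds; the source does not specify any values.
AUTO_DUPLICATE = 0.90
AUTO_DISTINCT = 0.50


def similarity(a: str, b: str) -> float:
    """Stand-in scorer: normalized string similarity in [0, 1].

    A real pipeline would call a trained model here instead.
    """
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def route(score: float) -> str:
    """Three-way decision: auto-accept, auto-reject, or send to a user."""
    if score >= AUTO_DUPLICATE:
        return "duplicate"
    if score < AUTO_DISTINCT:
        return "distinct"
    return "review"


@dataclass
class FeedbackBuffer:
    """Collects user verdicts on ambiguous pairs as labeled examples."""

    retrain_every: int = 2  # hypothetical batch size for retraining
    examples: list = field(default_factory=list)

    def add_verdict(self, a: str, b: str, is_duplicate: bool) -> bool:
        """Store one verdict; return True when a retraining run is due."""
        self.examples.append((a, b, is_duplicate))
        return len(self.examples) % self.retrain_every == 0
```

In this sketch, a caller would score each incoming record against candidate matches, act on the `"duplicate"` and `"distinct"` outcomes directly, and pass `"review"` pairs to a user whose answer is recorded with `add_verdict`. How the retraining itself is performed is left out, since the abstract does not describe it.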
Keywords
duplicate detection, machine learning, end-to-end pipeline, Apache Spark, Apache Kafka, Amazon SageMaker, data quality, entity resolution, stream processing, MLOps
References
Coughlin, T. (2018). 175 zettabytes by 2025. Forbes. Retrieved from: https://www.forbes.com/sites/tomcoughlin/2018/11/27/175-zettabytes-by-2025/ (date of access: 10.06.2025).
Adapa, C. S. R. (2025). Building a Standout Portfolio in Master Data Management (MDM) and Data Engineering. International Research Journal of Modernization in Engineering Technology and Science, 7(3), 8082-8099.
Peng, J., et al. (2024). RLclean: An unsupervised integrated data cleaning framework based on deep reinforcement learning. Information Sciences, 682. https://doi.org/10.1016/j.ins.2024.121281.
Jehangir, B., Radhakrishnan, S., & Agarwal, R. (2023). A survey on Named Entity Recognition — datasets, tools, and methodologies. Natural Language Processing Journal, 3, 1-12. https://doi.org/10.1016/j.nlp.2023.100017.
Li, X., et al. (2025). Contextual semantics graph attention network model for entity resolution. Scientific Reports, 15, 1-16. https://doi.org/10.1038/s41598-025-11932-9.
Espinoza, J. L., & Dupont, C. L. (2022). VEBA: a modular end-to-end suite for in silico recovery, clustering, and analysis of prokaryotic, microeukaryotic, and viral genomes from metagenomes. BMC Bioinformatics, 23.
Lopez-Lopez, E., Pardo, X. M., & Regueiro, C. V. (2022). Incremental Learning from Low-labelled Stream Data in Open-Set Video Face Recognition. Pattern Recognition, 131, 1-12. https://doi.org/10.1016/j.patcog.2022.108885.
Zhu, J., Huang, C., & De Meo, P. (2023). DFMKE: A dual fusion multi-modal knowledge graph embedding framework for entity alignment. Information Fusion, 90, 111-119. https://doi.org/10.1016/j.inffus.2022.09.012.
Liu, B., et al. (2024). PRTA: Joint extraction of medical nested entities and overlapping relation via parameter sharing progressive recognition and targeted assignment decoding scheme. Computers in Biology and Medicine, 176. https://doi.org/10.1016/j.compbiomed.2024.108539.
Daruna, S., Bantanur, V. S., & Lee, M. Machine-learning based data entry duplication detection and mitigation and methods thereof. Retrieved from: https://patents.google.com/patent/US11995054B2/en (date of access: 20.06.2025).
Copyright License
Copyright (c) 2025 Srinivasarao Daruna

This work is licensed under a Creative Commons Attribution 4.0 International License.
Authors retain the copyright of their manuscripts, and all Open Access articles are disseminated under the terms of the Creative Commons Attribution License 4.0 (CC-BY), which licenses unrestricted use, distribution, and reproduction in any medium, provided that the original work is appropriately cited. The use of general descriptive names, trade names, trademarks, and so forth in this publication, even if not specifically identified, does not imply that these names are not protected by the relevant laws and regulations.