Applied Sciences | Open Access | DOI: https://doi.org/10.37547/tajas/Volume07Issue10-05

From Data Entry to ML Inference: End-to-End Pipelines for Duplicate Detection

Srinivasarao Daruna, Senior Software Dev Engineer, McLean, Virginia, USA

Abstract

This study sets out the principles for building end-to-end duplicate detection pipelines based on machine learning. Its objective is to analyze and formalize an architecture that unifies stream processing, adaptive ML mechanisms, and scalable cloud components. The methodology combines a review of existing entity resolution approaches with the design of an integrated architecture informed by the key concepts of patent US11995054B2. The technology stack comprises Apache Kafka for stream orchestration, Apache Spark for distributed processing, Amazon SageMaker for model development and management, and NoSQL stores for flexible, scalable persistence of intermediate and final data. The result is a fault-tolerant, horizontally scalable architecture intended for near-real-time operation. Its central mechanism is a machine learning system with a continuous feedback loop, in which user verdicts on ambiguous duplicate cases drive retraining and improve detection quality. The findings are of practical value to data architects, machine learning engineers, and researchers concerned with data quality management in high-throughput analytical systems.
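The feedback loop described in the abstract can be illustrated with a minimal sketch. It is not the article's or the patent's implementation: the class `FeedbackLoop`, its thresholds, and the threshold-based "retraining" are hypothetical, and a plain string-similarity score stands in for the ML model. Kafka, Spark, SageMaker, and the NoSQL store are deliberately left out so that the example stays self-contained.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher


@dataclass
class FeedbackLoop:
    # Hypothetical thresholds; the source does not specify any values.
    lower: float = 0.6   # below this score a pair is treated as distinct
    upper: float = 0.9   # at or above this score a pair is a duplicate
    verdicts: list = field(default_factory=list)  # (a, b, is_duplicate)

    def score(self, a: str, b: str) -> float:
        # Stand-in for the ML model: string similarity in [0, 1].
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    def triage(self, a: str, b: str) -> str:
        s = self.score(a, b)
        if s >= self.upper:
            return "duplicate"
        if s < self.lower:
            return "distinct"
        return "needs_review"  # ambiguous case, routed to a user

    def record_verdict(self, a: str, b: str, is_duplicate: bool) -> None:
        # Store the user's decision on an ambiguous pair.
        self.verdicts.append((a, b, is_duplicate))

    def retrain(self) -> None:
        # Toy "retraining": lower the duplicate threshold to the weakest
        # score that users confirmed as a duplicate, never below `lower`.
        confirmed = [self.score(a, b) for a, b, dup in self.verdicts if dup]
        if confirmed:
            self.upper = max(self.lower, min(self.upper, min(confirmed)))


loop = FeedbackLoop()
first = loop.triage("acme corp", "acme corporation")   # ambiguous at first
loop.record_verdict("acme corp", "acme corporation", True)
loop.retrain()
second = loop.triage("acme corp", "acme corporation")  # resolved automatically
```

In this sketch, the pair is first routed to review and is classified automatically once the user's verdict has been folded back in. A production system would retrain an actual model on the accumulated verdicts instead of moving a threshold.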

Keywords

duplicate detection, machine learning, end-to-end pipeline, Apache Spark, Apache Kafka, Amazon SageMaker, data quality, entity resolution, stream processing, MLOps

References

Coughlin, T. (2018). 175 zettabytes by 2025. Forbes. Retrieved from: https://www.forbes.com/sites/tomcoughlin/2018/11/27/175-zettabytes-by-2025/ (date of access: 10.06.2025).

Adapa, C. S. R. (2025). Building a Standout Portfolio in Master Data Management (MDM) and Data Engineering. International Research Journal of Modernization in Engineering Technology and Science, 7 (3), 8082-8099.

Peng, J., et al. (2024). RLclean: An unsupervised integrated data cleaning framework based on deep reinforcement learning. Information Sciences, 682. https://doi.org/10.1016/j.ins.2024.121281.

Jehangir, B., Radhakrishnan, S., & Agarwal, R. (2023). A survey on Named Entity Recognition — datasets, tools, and methodologies. Natural Language Processing Journal, 3, 1-12. https://doi.org/10.1016/j.nlp.2023.100017.

Li, X., et al. (2025). Contextual semantics graph attention network model for entity resolution. Scientific Reports, 15, 1-16. https://doi.org/10.1038/s41598-025-11932-9.

Espinoza, J. L., & Dupont, C. L. (2022). VEBA: a modular end-to-end suite for in silico recovery, clustering, and analysis of prokaryotic, microeukaryotic, and viral genomes from metagenomes. BMC Bioinformatics, 23.

Lopez-Lopez, E., Pardo, X. M., & Regueiro, C. V. (2022). Incremental Learning from Low-labelled Stream Data in Open-Set Video Face Recognition. Pattern Recognition, 131, 1-12. https://doi.org/10.1016/j.patcog.2022.108885.

Zhu, J., Huang, C., & De Meo, P. (2023). DFMKE: A dual fusion multi-modal knowledge graph embedding framework for entity alignment. Information Fusion, 90, 111-119. https://doi.org/10.1016/j.inffus.2022.09.012.

Liu, B., et al. (2024). PRTA: Joint extraction of medical nested entities and overlapping relation via parameter sharing progressive recognition and targeted assignment decoding scheme. Computers in Biology and Medicine, 176. https://doi.org/10.1016/j.compbiomed.2024.108539.

Daruna, S., Bantanur, V. S., & Lee, M. Machine-learning based data entry duplication detection and mitigation and methods thereof. Retrieved from: https://patents.google.com/patent/US11995054B2/en (date of access: 20.06.2025).

How to Cite

Srinivasarao Daruna. (2025). From Data Entry to ML Inference: End-to-End Pipelines for Duplicate Detection. The American Journal of Applied Sciences, 7(10), 52–59. https://doi.org/10.37547/tajas/Volume07Issue10-05