AUTOMATION OF TEXT DATA PROCESSING USING NLP
Yaroslav Starukhin, Senior Data Scientist, McKinsey & Company, Boston, USA
Vladimir Diukarev, Head of Data Analytics, Anti-Fraud Department, Sberbank, Moscow, Russian Federation
Abstract
This study aims to develop an automated system for processing scientific texts using advanced NLP techniques. The methodology integrates classical NLP methods with deep learning approaches, employing SciBERT for text classification, LDA for topic modeling, and a modified TextRank algorithm for keyword extraction. Results demonstrate high accuracy in document classification (F1-score of 0.92), effective topic identification, and precise keyword extraction. The developed web interface showcases the system's practical applicability. This research contributes to the field by presenting a comprehensive solution for scientific text analysis, combining state-of-the-art language models with established NLP techniques. The study's novelty lies in its tailored approach to scientific literature, addressing the unique challenges of domain-specific language and complex content structure in academic texts.
Keywords
Natural language processing, scientific text analysis, topic modeling
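To make the keyword-extraction step named in the abstract concrete, the following is a minimal sketch of baseline TextRank in plain Python. It is not the modified algorithm the study describes, whose changes are not specified here. The function name `textrank_keywords`, the stopword list, the window size, and the damping factor are all illustrative assumptions.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (an assumption; a real system would use a fuller one).
STOPWORDS = {
    "the", "of", "and", "for", "with", "this", "that", "are", "was", "were",
    "from", "into", "its", "has", "have", "using", "use", "can", "not",
}


def textrank_keywords(text, window=2, damping=0.85, iterations=50, top_k=5):
    """Rank candidate keywords with baseline TextRank over a co-occurrence graph."""
    # Simplification: stopwords are dropped before windowing, so the window
    # spans the filtered token sequence rather than the original text.
    tokens = [
        t for t in re.findall(r"[a-z]+", text.lower())
        if len(t) > 2 and t not in STOPWORDS
    ]

    # Undirected graph: two words are linked if they co-occur within `window` tokens.
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + 1 + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)

    if not neighbors:
        return []

    # PageRank-style iteration: each word shares its score equally among its neighbors.
    scores = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        scores = {
            word: (1 - damping)
            + damping * sum(scores[n] / len(neighbors[n]) for n in neighbors[word])
            for word in neighbors
        }

    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_k]


sample = (
    "Topic modeling and keyword extraction support automated analysis of "
    "scientific text. Keyword extraction ranks words in scientific text by "
    "their position in a co-occurrence graph."
)
print(textrank_keywords(sample))
```

A pure-Python graph keeps the sketch free of dependencies. A production system would likely add part-of-speech filtering and merge adjacent keywords into phrases, as the original TextRank formulation does.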
Copyright License
Copyright (c) 2024 Yaroslav Starukhin, Vladimir Diukarev

This work is licensed under a Creative Commons Attribution 4.0 International License.
Authors retain the copyright of their manuscripts, and all Open Access articles are disseminated under the terms of the Creative Commons Attribution License 4.0 (CC-BY), which licenses unrestricted use, distribution, and reproduction in any medium, provided that the original work is appropriately cited. The use of general descriptive names, trade names, trademarks, and so forth in this publication, even if not specifically identified, does not imply that these names are not protected by the relevant laws and regulations.