Articles | Open Access | DOI: https://doi.org/10.37547/tajet/Volume06Issue07-04

AUTOMATION OF TEXT DATA PROCESSING USING NLP

Yaroslav Starukhin, Senior Data Scientist, McKinsey & Company, Boston, USA
Vladimir Diukarev, Head of Data Analytics, Anti-Fraud Department, Sberbank, Moscow, Russian Federation

Abstract

This study develops an automated system for processing scientific texts using advanced NLP techniques. The methodology combines classical NLP methods with deep learning: SciBERT for text classification, LDA for topic modeling, and a modified TextRank algorithm for keyword extraction. The results show high accuracy in document classification (F1-score of 0.92), effective topic identification, and precise keyword extraction. A web interface built on the system demonstrates its practical applicability. The research contributes a comprehensive solution for scientific text analysis that pairs state-of-the-art language models with established NLP techniques. Its novelty lies in an approach tailored to scientific literature, which addresses the domain-specific language and complex content structure of academic texts.
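To make the keyword-extraction step more concrete, the following is a minimal sketch of the basic TextRank idea: words become graph nodes, co-occurrence within a small window creates edges, and a PageRank-style iteration scores the nodes. It is not the modified TextRank the abstract refers to, whose details are not given here. The stopword list, the window size, the damping factor, and all function names are assumptions chosen for illustration.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (an assumption; real systems use larger lists).
STOPWORDS = {"the", "a", "an", "of", "for", "and", "with", "in", "to", "is", "are", "on", "by"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def build_graph(tokens, window=3):
    """Link distinct words that fall inside the same sliding window of `window` tokens."""
    graph = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + window]:
            if other != word:
                graph[word].add(other)
                graph[other].add(word)
    return graph


def textrank(graph, damping=0.85, iterations=50):
    """Run a fixed number of PageRank-style updates on the undirected word graph."""
    scores = {node: 1.0 for node in graph}
    for _ in range(iterations):
        scores = {
            node: (1 - damping)
            + damping * sum(scores[nb] / len(graph[nb]) for nb in graph[node])
            for node in graph
        }
    return scores


def extract_keywords(text, top_k=5):
    """Return the `top_k` highest-scoring words, ties broken alphabetically."""
    scores = textrank(build_graph(tokenize(text)))
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_k]
```

A call such as `extract_keywords(abstract_text, top_k=10)` returns a ranked list of single-word candidates. A production variant would typically add part-of-speech filtering and merge adjacent keywords into phrases; those steps are omitted here.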

Keywords

Natural language processing, scientific text analysis, topic modeling

How to Cite

Yaroslav Starukhin, & Vladimir Diukarev. (2024). AUTOMATION OF TEXT DATA PROCESSING USING NLP. The American Journal of Engineering and Technology, 6(07), 24–39. https://doi.org/10.37547/tajet/Volume06Issue07-04