STAGES OF FORMING AND DIGITIZING THE AUTHOR CORPUS
Keywords:
corpus linguistics, author corpus, digitization, metadata, tokenization, normalization, concordance, linguostatistics, idiolect, stylometry, Uzbek language, digital humanities, lexical analysis, indexing, Nusratulla Jumaxo‘ja
Abstract
This study examines the stages of forming and digitizing the Nusratulla Jumaxo‘ja author corpus within the framework of modern corpus linguistics and digital humanities. The research focuses on the scientific and methodological principles of corpus creation, including source collection, metadata development, text normalization, tokenization, indexing, concordance generation, and statistical analysis. Particular attention is paid to the challenges of digitizing Uzbek-language texts, especially issues related to OCR accuracy, Unicode standardization, and apostrophe encoding in Uzbek Latin script. The study demonstrates that the author corpus is not merely an electronic archive, but a multilayered linguistic platform designed for linguostatistical, stylistic, and semantic analysis. Through concordance and frequency-based analysis, the corpus enables the identification of the author’s idiolect, dominant lexical units, and discursive strategies. The integration of metadata and etymological modules further enhances the analytical capabilities of the system. The research concludes that the Nusratulla Jumaxo‘ja author corpus serves as an important digital resource for Uzbek linguistics, stylometry, lexicography, and corpus-based literary studies, while also offering a methodological model for the development of future author corpora in Uzbek corpus linguistics.
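The normalization, tokenization, frequency, and concordance stages named above can be illustrated with a minimal Python sketch. It is not the pipeline used for the corpus itself. It assumes that apostrophe-like marks after o or g stand for the letters o‘ and g‘ and are unified to U+02BB, and that other word-internal marks are the tutuq belgisi and are unified to U+02BC. The function names, the list of variants, and the sample sentence are hypothetical and chosen only for illustration. The rule is a heuristic: a closing quotation mark directly after a word ending in o or g would be misread as part of the letter.

```python
import re
import unicodedata
from collections import Counter

# Apostrophe-like characters often found in digitized Uzbek Latin text
# (an assumed list, not taken from the study).
APOS = "'`\u2018\u2019\u02BC\u00B4"
OKINA = "\u02BB"  # target code point for the letters o‘ and g‘ (assumption)
TUTUQ = "\u02BC"  # target code point for the tutuq belgisi (assumption)


def normalize(text):
    """Unify Unicode composition and apostrophe encoding."""
    text = unicodedata.normalize("NFC", text)
    # Any apostrophe-like mark directly after o or g becomes U+02BB.
    text = re.sub("(?<=[oOgG])[" + APOS + "]", OKINA, text)
    # Remaining marks between two word characters become U+02BC.
    text = re.sub(r"(?<=\w)[" + APOS + r"](?=\w)", TUTUQ, text)
    return text


def tokenize(text):
    """Lowercase word tokens; U+02BB and U+02BC stay inside the token."""
    return re.findall(r"\w+", normalize(text).lower())


def concordance(tokens, keyword, window=3):
    """Return (left context, keyword, right context) for every hit."""
    key = normalize(keyword).lower()
    hits = []
    for i, tok in enumerate(tokens):
        if tok == key:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, tok, right))
    return hits


# Made-up sample with three different apostrophe encodings.
sample = "O'zbek tili. O`zbek adabiyoti. Ma\u2019no va so\u2018z."
tokens = tokenize(sample)
frequencies = Counter(tokens)
kwic = concordance(tokens, "o'zbek", window=1)
```

Because both spellings of the first word collapse to one form after normalization, they are counted as a single lexical unit in `frequencies`, and `kwic` lists both occurrences with their neighbouring tokens.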





