Orthography Standardization in Arabic Dialects
Normalizace pravopisu v arabských dialektech
diploma thesis (DEFENDED)

Permanent link
http://hdl.handle.net/20.500.11956/147949
Identifiers
Study Information System: 235817
Collections
- Kvalifikační práce (Theses) [11322]
Author
Advisor
Referee
Straňák, Pavel
Faculty / Institute
Faculty of Mathematics and Physics
Discipline
Computational Linguistics
Department
Institute of Formal and Applied Linguistics
Date of defense
8. 9. 2021
Publisher
Univerzita Karlova, Matematicko-fyzikální fakulta
Language
English
Grade
Excellent
Keywords (Czech)
kontrola pravopisu|automatické opravy|arabština|dialekt
Keywords (English)
spell checking|automatic corrections|Arabic|dialect
Abstract
Christian Cayralat, Charles University
Spontaneous orthography in Arabic dialects poses one of the biggest obstacles in the way of Dialectal Arabic NLP applications. As the Arab world enjoys a wide array of these widely spoken and recently written, non-standard, low-resource varieties, this thesis presents a detailed account of this relatively overlooked phenomenon. It sets out to show that continuously creating additional noise-free, manually standardized corpora of Dialectal Arabic does not free us from the shackles of non-standard (spontaneous) orthography. Because real-world data will most often come in a noisy format, it also investigates ways to ease the amount of noise in textual data. As a proof of concept, we restrict ourselves to one of the dialectal varieties, namely, Lebanese Arabic. It also strives to gain a better understanding of the nature of the noise and its distribution. All of this is done by leveraging various spelling correction and morphological tagging neural architectures in a multi-task setting, and by annotating a Lebanese Arabic corpus for spontaneous orthography standardization, and morphological segmentation and tagging, among other features. Additionally, a detailed taxonomy of spelling inconsistencies for...