Orthography Standardization in Arabic Dialects
Normalizace pravopisu v arabských dialektech
diplomová práce (OBHÁJENO)
Zobrazit/ otevřít
Trvalý odkaz
http://hdl.handle.net/20.500.11956/147949Identifikátory
SIS: 235817
Kolekce
- Kvalifikační práce [10678]
Autor
Vedoucí práce
Oponent práce
Straňák, Pavel
Fakulta / součást
Matematicko-fyzikální fakulta
Obor
Matematická lingvistika
Katedra / ústav / klinika
Ústav formální a aplikované lingvistiky
Datum obhajoby
8. 9. 2021
Nakladatel
Univerzita Karlova, Matematicko-fyzikální fakultaJazyk
Angličtina
Známka
Výborně
Klíčová slova (česky)
kontrola pravopisu|automatické opravy|arabština|dialektKlíčová slova (anglicky)
spell checking|automatic corrections|Arabic|dialectOrthography Standardization in Arabic Dialects Abstract Christian Cayralat1 1 Charles University Spontaneous orthography in Arabic dialects poses one of the biggest ob- stacles in the way of Dialectal Arabic NLP applications. As the Arab world enjoys a wide array of these widely spoken and recently written, non-standard, low-resource varieties, this thesis presents a detailed account of this relatively overlooked phenomenon. It sets out to show that continuously creating addi- tional noise-free, manually standardized corpora of Dialectal Arabic does not free us from the shackles of non-standard (spontaneous) orthography. Because real-world data will most often come in a noisy format, it also investigates ways to ease the amount of noise in textual data. As a proof of concept, we restrict ourselves to one of the dialectal varieties, namely, Lebanese Arabic. It also strives to gain a better understanding of the nature of the noise and its distri- bution. All of this is done by leveraging various spelling correction and morpho- logical tagging neural architectures in a multi-task setting, and by annotating a Lebanese Arabic corpus for spontaneous orthography standardization, and morphological segmentation and tagging, among other features. Additionally, a detailed taxonomy of spelling inconsistencies for...