Orthography Standardization in Arabic Dialects

Cayralat, Christian

Normalizace pravopisu v arabských dialektech

dc.contributor.advisor	Zeman, Daniel
dc.creator	Cayralat, Christian
dc.date.accessioned	2022-04-06T11:06:29Z
dc.date.available	2022-04-06T11:06:29Z
dc.date.issued	2021
dc.identifier.uri	http://hdl.handle.net/20.500.11956/147949
dc.description.abstract	Orthography Standardization in Arabic Dialects Abstract Christian Cayralat1 1 Charles University Spontaneous orthography in Arabic dialects poses one of the biggest ob- stacles in the way of Dialectal Arabic NLP applications. As the Arab world enjoys a wide array of these widely spoken and recently written, non-standard, low-resource varieties, this thesis presents a detailed account of this relatively overlooked phenomenon. It sets out to show that continuously creating addi- tional noise-free, manually standardized corpora of Dialectal Arabic does not free us from the shackles of non-standard (spontaneous) orthography. Because real-world data will most often come in a noisy format, it also investigates ways to ease the amount of noise in textual data. As a proof of concept, we restrict ourselves to one of the dialectal varieties, namely, Lebanese Arabic. It also strives to gain a better understanding of the nature of the noise and its distri- bution. All of this is done by leveraging various spelling correction and morpho- logical tagging neural architectures in a multi-task setting, and by annotating a Lebanese Arabic corpus for spontaneous orthography standardization, and morphological segmentation and tagging, among other features. Additionally, a detailed taxonomy of spelling inconsistencies for...	en_US
dc.language	English	cs_CZ
dc.language.iso	en_US
dc.publisher	Univerzita Karlova, Matematicko-fyzikální fakulta	cs_CZ
dc.subject	spell checking\|automatic corrections\|Arabic\|dialect	en_US
dc.subject	kontrola pravopisu\|automatické opravy\|arabština\|dialekt	cs_CZ
dc.title	Orthography Standardization in Arabic Dialects	en_US
dc.type	diplomová práce	cs_CZ
dcterms.created	2021
dcterms.dateAccepted	2021-09-08
dc.description.department	Institute of Formal and Applied Linguistics	en_US
dc.description.department	Ústav formální a aplikované lingvistiky	cs_CZ
dc.description.faculty	Matematicko-fyzikální fakulta	cs_CZ
dc.description.faculty	Faculty of Mathematics and Physics	en_US
dc.identifier.repId	235817
dc.title.translated	Normalizace pravopisu v arabských dialektech	cs_CZ
dc.contributor.referee	Straňák, Pavel
thesis.degree.name	Mgr.
thesis.degree.level	navazující magisterské	cs_CZ
thesis.degree.discipline	Matematická lingvistika	cs_CZ
thesis.degree.discipline	Computational Linguistics	en_US
thesis.degree.program	Computer Science	en_US
thesis.degree.program	Informatika	cs_CZ
uk.thesis.type	diplomová práce	cs_CZ
uk.taxonomy.organization-cs	Matematicko-fyzikální fakulta::Ústav formální a aplikované lingvistiky	cs_CZ
uk.taxonomy.organization-en	Faculty of Mathematics and Physics::Institute of Formal and Applied Linguistics	en_US
uk.faculty-name.cs	Matematicko-fyzikální fakulta	cs_CZ
uk.faculty-name.en	Faculty of Mathematics and Physics	en_US
uk.faculty-abbr.cs	MFF	cs_CZ
uk.degree-discipline.cs	Matematická lingvistika	cs_CZ
uk.degree-discipline.en	Computational Linguistics	en_US
uk.degree-program.cs	Informatika	cs_CZ
uk.degree-program.en	Computer Science	en_US
thesis.grade.cs	Výborně	cs_CZ
thesis.grade.en	Excellent	en_US
uk.abstract.en	Orthography Standardization in Arabic Dialects Abstract Christian Cayralat1 1 Charles University Spontaneous orthography in Arabic dialects poses one of the biggest ob- stacles in the way of Dialectal Arabic NLP applications. As the Arab world enjoys a wide array of these widely spoken and recently written, non-standard, low-resource varieties, this thesis presents a detailed account of this relatively overlooked phenomenon. It sets out to show that continuously creating addi- tional noise-free, manually standardized corpora of Dialectal Arabic does not free us from the shackles of non-standard (spontaneous) orthography. Because real-world data will most often come in a noisy format, it also investigates ways to ease the amount of noise in textual data. As a proof of concept, we restrict ourselves to one of the dialectal varieties, namely, Lebanese Arabic. It also strives to gain a better understanding of the nature of the noise and its distri- bution. All of this is done by leveraging various spelling correction and morpho- logical tagging neural architectures in a multi-task setting, and by annotating a Lebanese Arabic corpus for spontaneous orthography standardization, and morphological segmentation and tagging, among other features. Additionally, a detailed taxonomy of spelling inconsistencies for...	en_US
uk.file-availability	V
uk.grantor	Univerzita Karlova, Matematicko-fyzikální fakulta, Ústav formální a aplikované lingvistiky	cs_CZ
thesis.grade.code	1
uk.publication-place	Praha	cs_CZ
uk.thesis.defenceStatus	O

Soubory tohoto záznamu

Název:: 120400525.pdf
Velikost:: 4.661Mb
Formát:: application/pdf
Popis:: Text práce

Zobrazit/otevřít

Název:: 120400189.pdf
Velikost:: 62.29Kb
Formát:: application/pdf
Popis:: Abstrakt (anglicky)

Zobrazit/otevřít

Název:: 120403107.pdf
Velikost:: 47.20Kb
Formát:: application/pdf
Popis:: Posudek vedoucího

Zobrazit/otevřít

Název:: 120402974.pdf
Velikost:: 97.14Kb
Formát:: application/pdf
Popis:: Posudek oponenta

Zobrazit/otevřít

Název:: 120404166.pdf
Velikost:: 347.7Kb
Formát:: application/pdf
Popis:: Záznam o průběhu obhajoby

Zobrazit/otevřít

Tento záznam se objevuje v následujících sbírkách

Kvalifikační práce [10691]
Theses

Zobrazit minimální záznam