Harmonisation of Language Resources for Word-Formation of Multiple Languages

Kyjánek, Lukáš

Harmonizace jazykových zdrojů zachycujících slovotvorbu různých jazyků

dc.contributor.advisor	Ševčíková, Magda
dc.creator	Kyjánek, Lukáš
dc.date.accessioned	2020-07-14T09:57:28Z
dc.date.available	2020-07-14T09:57:28Z
dc.date.issued	2020
dc.identifier.uri	http://hdl.handle.net/20.500.11956/118513
dc.description.abstract	In the field of Natural Language Processing, word-formation is under-resourced compared to inflectional morphology. Moreover, the existing resources capturing word-formation differ in many aspects. This thesis aims to review existing language resources for word-formation across languages and to unify them to a common data structure and file format. Basic notions of word-formation are followed by a review of existing language resources and their comparison in both quantitative and qualitative aspects. In the core part of the thesis, the harmonisation process is presented. Design decisions on the unification procedure are presented, and the selection of the resources to unify is described. The resources are unified to the rooted tree data structure and stored in a lexeme-based file format, which is already used in DeriNet 2.0. The procedure applies a supervised machine learning model and the Maximum Spanning Tree algorithm. While the model scores word-formation relations, the MST algorithm uses the scores for identifying the rooted tree structure in each word-formation family. The resulting collection of harmonised resources covering 20 European languages was published under the title 'Universal Derivations' (UDer).	en_US
dc.description.abstract	V oblasti počítačového zpracování přirozeného jazyka není slovotvorba v porovnání s (flektivní) morfologií dostatečně pokryta jazykovými zdroji. Již existující zdroje zachycující slovotvorbu se navíc liší v mnoha aspektech. V rámci této diplomové práce jsou popsány jak existující jazykové zdroje zachycující slovotvorbu napříč jazyky, tak sjednocení (harmonizace) jejich datových struktur a souborových formátů. První dvě kapitoly uvádí základní pojmy z oblasti slovotvorby a zároveň detailní přehled a kvantitativní i kvalitativní srovnání existujících jazykových zdrojů slovotvorby. Jádro diplomové práce tvoří popis harmonizačního procesu a jeho aplikace na vybrané zdroje. Jsou představena nejen kritéria výběru, ale také základní rozhodnutí týkající se harmonizačního procesu. Výsledné harmonizované zdroje reprezentují příbuzná slova jako zakořeněné stromy uložené ve sloupcovém souborovém formátu. Tato datová struktura a souborový formát aktuálně používá DeriNet 2.0. Navržená harmonizační procedura využívá řízené strojové učení a algoritmus hledající kostru v orientovaném grafu. Natrénovaný strojový model přiřazuje skóre každému slovotvornému vztahu a zmíněný algoritmus následně na jejich základě nalezne v každé slovotvorné rodině kostru orientovaného grafu, tj. strukturu zakořeněného stromu. Výsledná kolekce...	cs_CZ
dc.language	English	cs_CZ
dc.language.iso	en_US
dc.publisher	Univerzita Karlova, Matematicko-fyzikální fakulta	cs_CZ
dc.subject	language resource	en_US
dc.subject	lexical resource	en_US
dc.subject	word-formation	en_US
dc.subject	derivation	en_US
dc.subject	harmonisation	en_US
dc.subject	natural languages	en_US
dc.subject	natural language processing	en_US
dc.subject	jazykový zdroj	cs_CZ
dc.subject	lexikální zdroj	cs_CZ
dc.subject	slovotvorba	cs_CZ
dc.subject	derivace	cs_CZ
dc.subject	harmonizace	cs_CZ
dc.subject	přirozené jazyky	cs_CZ
dc.subject	počítačové zpracování jazyka	cs_CZ
dc.title	Harmonisation of Language Resources for Word-Formation of Multiple Languages	en_US
dc.type	diplomová práce	cs_CZ
dcterms.created	2020
dcterms.dateAccepted	2020-06-23
dc.description.department	Institute of Formal and Applied Linguistics	en_US
dc.description.department	Ústav formální a aplikované lingvistiky	cs_CZ
dc.description.faculty	Matematicko-fyzikální fakulta	cs_CZ
dc.description.faculty	Faculty of Mathematics and Physics	en_US
dc.identifier.repId	211324
dc.title.translated	Harmonizace jazykových zdrojů zachycujících slovotvorbu různých jazyků	cs_CZ
dc.contributor.referee	Zeman, Daniel
thesis.degree.name	Mgr.
thesis.degree.level	navazující magisterské	cs_CZ
thesis.degree.discipline	Matematická lingvistika	cs_CZ
thesis.degree.discipline	Computational Linguistics	en_US
thesis.degree.program	Computer Science	en_US
thesis.degree.program	Informatika	cs_CZ
uk.thesis.type	diplomová práce	cs_CZ
uk.taxonomy.organization-cs	Matematicko-fyzikální fakulta::Ústav formální a aplikované lingvistiky	cs_CZ
uk.taxonomy.organization-en	Faculty of Mathematics and Physics::Institute of Formal and Applied Linguistics	en_US
uk.faculty-name.cs	Matematicko-fyzikální fakulta	cs_CZ
uk.faculty-name.en	Faculty of Mathematics and Physics	en_US
uk.faculty-abbr.cs	MFF	cs_CZ
uk.degree-discipline.cs	Matematická lingvistika	cs_CZ
uk.degree-discipline.en	Computational Linguistics	en_US
uk.degree-program.cs	Informatika	cs_CZ
uk.degree-program.en	Computer Science	en_US
thesis.grade.cs	Výborně	cs_CZ
thesis.grade.en	Excellent	en_US
uk.abstract.cs	V oblasti počítačového zpracování přirozeného jazyka není slovotvorba v porovnání s (flektivní) morfologií dostatečně pokryta jazykovými zdroji. Již existující zdroje zachycující slovotvorbu se navíc liší v mnoha aspektech. V rámci této diplomové práce jsou popsány jak existující jazykové zdroje zachycující slovotvorbu napříč jazyky, tak sjednocení (harmonizace) jejich datových struktur a souborových formátů. První dvě kapitoly uvádí základní pojmy z oblasti slovotvorby a zároveň detailní přehled a kvantitativní i kvalitativní srovnání existujících jazykových zdrojů slovotvorby. Jádro diplomové práce tvoří popis harmonizačního procesu a jeho aplikace na vybrané zdroje. Jsou představena nejen kritéria výběru, ale také základní rozhodnutí týkající se harmonizačního procesu. Výsledné harmonizované zdroje reprezentují příbuzná slova jako zakořeněné stromy uložené ve sloupcovém souborovém formátu. Tato datová struktura a souborový formát aktuálně používá DeriNet 2.0. Navržená harmonizační procedura využívá řízené strojové učení a algoritmus hledající kostru v orientovaném grafu. Natrénovaný strojový model přiřazuje skóre každému slovotvornému vztahu a zmíněný algoritmus následně na jejich základě nalezne v každé slovotvorné rodině kostru orientovaného grafu, tj. strukturu zakořeněného stromu. Výsledná kolekce...	cs_CZ
uk.abstract.en	In the field of Natural Language Processing, word-formation is under-resourced compared to inflectional morphology. Moreover, the existing resources capturing word-formation differ in many aspects. This thesis aims to review existing language resources for word-formation across languages and to unify them to a common data structure and file format. Basic notions of word-formation are followed by a review of existing language resources and their comparison in both quantitative and qualitative aspects. In the core part of the thesis, the harmonisation process is presented. Design decisions on the unification procedure are presented, and the selection of the resources to unify is described. The resources are unified to the rooted tree data structure and stored in a lexeme-based file format, which is already used in DeriNet 2.0. The procedure applies a supervised machine learning model and the Maximum Spanning Tree algorithm. While the model scores word-formation relations, the MST algorithm uses the scores for identifying the rooted tree structure in each word-formation family. The resulting collection of harmonised resources covering 20 European languages was published under the title 'Universal Derivations' (UDer).	en_US
uk.file-availability	V
uk.grantor	Univerzita Karlova, Matematicko-fyzikální fakulta, Ústav formální a aplikované lingvistiky	cs_CZ
thesis.grade.code	1
uk.publication-place	Praha	cs_CZ
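
The abstract describes a two-step procedure: a model scores candidate word-formation relations, and a spanning-tree search then picks one rooted tree per word-formation family. The sketch below illustrates only the second step on a toy family. It is not the thesis implementation: it uses exhaustive search over parent assignments instead of an efficient maximum spanning tree algorithm for directed graphs (such as Chu-Liu/Edmonds), it tries every lexeme as a possible root, and the function names, the example lexemes, and the scores are all hypothetical.

```python
from itertools import product


def _reaches_root(parent_of, root):
    """Check that following parent links from every lexeme ends at the root."""
    for node in parent_of:
        seen = set()
        while node != root:
            if node in seen:  # a cycle that never reaches the root
                return False
            seen.add(node)
            node = parent_of[node]
    return True


def best_rooted_tree(lexemes, scores):
    """Return (total_score, root, parent_of) for the highest-scoring rooted tree.

    lexemes: lexemes of one word-formation family.
    scores:  dict mapping (parent, child) to a score; stands in for the
             output of a relation-scoring model (hypothetical values).
    Exhaustive search, so only suitable for very small families.
    """
    best = None
    for root in lexemes:
        children = [w for w in lexemes if w != root]
        # For each non-root lexeme, list the parents that have a scored edge.
        options = [[p for p in lexemes if (p, c) in scores] for c in children]
        for choice in product(*options):
            parent_of = dict(zip(children, choice))
            if not _reaches_root(parent_of, root):
                continue
            total = sum(scores[(p, c)] for c, p in parent_of.items())
            if best is None or total > best[0]:
                best = (total, root, parent_of)
    return best


# Toy family with made-up scores for candidate (parent, child) relations.
family = ["teach", "teacher", "teaching"]
scores = {
    ("teach", "teacher"): 0.9,
    ("teach", "teaching"): 0.8,
    ("teacher", "teaching"): 0.3,
    ("teaching", "teacher"): 0.2,
    ("teacher", "teach"): 0.1,
}

total, root, parent_of = best_rooted_tree(family, scores)
print(root, parent_of)
```

On this toy input the search selects "teach" as the root with both other lexemes attached directly to it. For realistic family sizes, a polynomial-time algorithm would be needed in place of the enumeration.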