Zpracování češtiny s využitím kontextualizované reprezentace

Vysušilová, Petra

Czech NLP with Contextualized Embeddings

diploma thesis (DEFENDED)

View/Open

Record of the defense proceedings (347.4Kb)

Permanent link

http://hdl.handle.net/20.500.11956/147648

Identifiers

Study Information System: 223946

Referee

Hajič, Jan

Faculty / Institute

Faculty of Mathematics and Physics

Discipline

Artificial Intelligence

Department

Institute of Formal and Applied Linguistics

Date of defense

2. 9. 2021

Publisher

Univerzita Karlova, Matematicko-fyzikální fakulta

Language

Czech

Grade

Excellent

Keywords (Czech)

čeština|zpracování přirozeného jazyka|kontextualizované slovní reprezentace|BERT

Keywords (English)

Czech|natural language processing|contextualized word embeddings|BERT

S rostoucím objemem dat, zejména nestrukturovaného textu, roste důležitost zpracování přirozeného jazyka. Nejmodernějšími technologiemi posledních let jsou neuronové sítě. Tato práce aplikuje nejúspěšnější metody, jmenovitě Bidirectional Encoder Representations from Transformers (BERT), na tři české úlohy zpracování přirozeného jazyka: lematizaci, morfologické značkování a analýzu sentimentu. Použili jsme BERTa s jednoduchou klasifikační hlavou na tři české datasety pro analýzu sentimentu (mall, facebook a csfd) a dosáhli jsme state-of-the-art výsledků. Také jsme prozkoumali několik možných postupů trénování pro úlohy značkování a lematizace a pomocí fine-tuningu jsme získali nové state-of-the-art výsledky na Pražském závislostním korpusu v obou úlohách. Konkrétně jsme dosáhli přesnosti 98.57% pro značkování, 99.00% pro lematizaci a 98.19% pro společné ohodnocení. Nejlepší modely pro všechny úlohy jsou veřejně dostupné.

Abstract (English)

With the increasing amount of digital data in the form of unstructured text, the importance of natural language processing (NLP) increases. The most successful technologies of recent years are deep neural networks. This work applies state-of-the-art methods, namely transfer learning with Bidirectional Encoder Representations from Transformers (BERT), to three Czech NLP tasks: part-of-speech tagging, lemmatization, and sentiment analysis. We applied a BERT model with a simple classification head to three Czech sentiment datasets (mall, facebook, and csfd) and achieved state-of-the-art results. We also explored several possible architectures for tagging and lemmatization and obtained new state-of-the-art results in both tasks with a fine-tuning approach on data from the Prague Dependency Treebank. Specifically, we achieved an accuracy of 98.57% for tagging, 99.00% for lemmatization, and a joint accuracy of 98.19% for both tasks. The best models for all tasks are publicly available.
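
The abstract reports three figures for the same test data: tagging accuracy, lemmatization accuracy, and a joint accuracy of both tasks. The sketch below shows one common way such metrics relate to each other. It assumes that joint accuracy counts a token as correct only when both its tag and its lemma match the gold annotation; the record does not spell this out. The function name `evaluate` and the toy tokens are hypothetical and are not taken from the thesis.

```python
from typing import Dict, Sequence, Tuple

# One token's annotation as (morphological tag, lemma). The tags used in the
# example below are toy labels, not the tag set of the Prague Dependency Treebank.
Token = Tuple[str, str]


def evaluate(gold: Sequence[Token], pred: Sequence[Token]) -> Dict[str, float]:
    """Return per-token tag, lemma, and joint accuracy.

    Assumption: a token is jointly correct only if tag AND lemma both match.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and predicted sequences must have the same length")
    if not gold:
        raise ValueError("cannot evaluate an empty sequence")

    n = len(gold)
    tag_ok = sum(g[0] == p[0] for g, p in zip(gold, pred))
    lemma_ok = sum(g[1] == p[1] for g, p in zip(gold, pred))
    joint_ok = sum(g == p for g, p in zip(gold, pred))
    return {"tag": tag_ok / n, "lemma": lemma_ok / n, "joint": joint_ok / n}


if __name__ == "__main__":
    gold = [("NOUN", "pes"), ("VERB", "běžet"), ("ADJ", "malý"), ("NOUN", "dům")]
    # One tag error (second token) and one lemma error (third token).
    pred = [("NOUN", "pes"), ("NOUN", "běžet"), ("ADJ", "malá"), ("NOUN", "dům")]
    print(evaluate(gold, pred))
```

Under this assumption, joint accuracy can never exceed the lower of the two single-task accuracies, which is consistent with the reported 98.19% lying below both 98.57% and 99.00%.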
