Automatická detekcia fake-news v slovenských textoch

Romanský, Patrik

Automatic detection of fake-news on Slovak texts
Automatická detekce fake-news na slovenských textech

dc.contributor.advisor	Mareček, David
dc.creator	Romanský, Patrik
dc.date.accessioned	2023-11-06T16:31:57Z
dc.date.available	2023-11-06T16:31:57Z
dc.date.issued	2023
dc.identifier.uri	http://hdl.handle.net/20.500.11956/184051
dc.description.abstract	Fake news is a problem in recent years. This study focuses on detecting fake news written in the Slovak language using text classification methods. It is unique because it is the first to conduct such a comprehensive set of experiments on Slovak. During the study, a balanced dataset was created, and over 80 experiments were conducted to find the optimal classifier for the problem. Pre-trained transformer-based language models, including BERT, mBERT, RoBERTA, XLM-RoBERTa, and SlovakBERT, were used in the initial step of the study, and their performance was compared against other machine learning methods using standard metrics. The models were fine-tuned with LIAR and COVID19 FN, English-language datasets, to test the impact of fake news topics and language transfer properties. SlovakBERT combined with training exclusively on Slovak datasets achieved the best results with an (acc = 0.9610). This study can contribute to the development of tools to automatically detect fake news in Slovak, aiding in the fight against the spread of false information. 1	en_US
dc.description.abstract	Šírenie fake-news je dlhodobým problémom, ale v posledných rokoch sa stáva ešte výraznejším. Preto sme v tejto práci analyzovali problém ich automatickej detekcie ako úlohu klasifikácie textu. Práca sa od iných, jej podobných štúdií, odlišuje primárne v tom, že sa zameriava na slovenčinu, kde doposiaľ nebola vykonaná takáto rozsiahla sada experi- mentov. Počas testov sme vytvorili vybalansovaný dataset. Vykonali sme taktiež viac ako 80 experimentov s cieľom nájsť optimálny klasifikátor pre riešenie tohto problému. Ako prvý sme použili predtrénované jazykové modely typu Transformer (BERT, mBERT, Ro- BERTA, XLM-RoBERTa a SlovakBERT) a pomocou štandardných metrík sme porovnali ich výkonnosť s inými metódami strojového učenia. Pre fine-tuning sme použili aj ang- lické datasety LIAR a COVID19 FN, na ktorých sme otestovali vplyv témy fake-news a prenos vlastnosti medzi jazykmi. Najlepšie výsledky dosiahol SlovakBERT v kombiná- cii s tréningom na výlučne slovenskom datasete (acc = 0, 9610). 1	cs_CZ
dc.language	Slovenčina	cs_CZ
dc.language.iso	sk_SK
dc.publisher	Univerzita Karlova, Matematicko-fyzikální fakulta	cs_CZ
dc.subject	fake-news\|hoax	cs_CZ
dc.subject	fake-news\|hoax	en_US
dc.title	Automatická detekcia fake-news v slovenských textoch	sk_SK
dc.type	diplomová práce	cs_CZ
dcterms.created	2023
dcterms.dateAccepted	2023-09-05
dc.description.department	Ústav formální a aplikované lingvistiky	cs_CZ
dc.description.department	Institute of Formal and Applied Linguistics	en_US
dc.description.faculty	Matematicko-fyzikální fakulta	cs_CZ
dc.description.faculty	Faculty of Mathematics and Physics	en_US
dc.identifier.repId	260320
dc.title.translated	Automatic detection of fake-news on Slovak texts	en_US
dc.title.translated	Automatická detekce fake-news na slovenských textech	cs_CZ
dc.contributor.referee	Novák, Michal
thesis.degree.name	Mgr.
thesis.degree.level	navazující magisterské	cs_CZ
thesis.degree.discipline	Informatika - Umělá inteligence	cs_CZ
thesis.degree.discipline	Computer Science - Artificial Intelligence	en_US
thesis.degree.program	Informatika - Umělá inteligence	cs_CZ
thesis.degree.program	Computer Science - Artificial Intelligence	en_US
uk.thesis.type	diplomová práce	cs_CZ
uk.taxonomy.organization-cs	Matematicko-fyzikální fakulta::Ústav formální a aplikované lingvistiky	cs_CZ
uk.taxonomy.organization-en	Faculty of Mathematics and Physics::Institute of Formal and Applied Linguistics	en_US
uk.faculty-name.cs	Matematicko-fyzikální fakulta	cs_CZ
uk.faculty-name.en	Faculty of Mathematics and Physics	en_US
uk.faculty-abbr.cs	MFF	cs_CZ
uk.degree-discipline.cs	Informatika - Umělá inteligence	cs_CZ
uk.degree-discipline.en	Computer Science - Artificial Intelligence	en_US
uk.degree-program.cs	Informatika - Umělá inteligence	cs_CZ
uk.degree-program.en	Computer Science - Artificial Intelligence	en_US
thesis.grade.cs	Dobře	cs_CZ
thesis.grade.en	Good	en_US
uk.abstract.cs	Šírenie fake-news je dlhodobým problémom, ale v posledných rokoch sa stáva ešte výraznejším. Preto sme v tejto práci analyzovali problém ich automatickej detekcie ako úlohu klasifikácie textu. Práca sa od iných, jej podobných štúdií, odlišuje primárne v tom, že sa zameriava na slovenčinu, kde doposiaľ nebola vykonaná takáto rozsiahla sada experi- mentov. Počas testov sme vytvorili vybalansovaný dataset. Vykonali sme taktiež viac ako 80 experimentov s cieľom nájsť optimálny klasifikátor pre riešenie tohto problému. Ako prvý sme použili predtrénované jazykové modely typu Transformer (BERT, mBERT, Ro- BERTA, XLM-RoBERTa a SlovakBERT) a pomocou štandardných metrík sme porovnali ich výkonnosť s inými metódami strojového učenia. Pre fine-tuning sme použili aj ang- lické datasety LIAR a COVID19 FN, na ktorých sme otestovali vplyv témy fake-news a prenos vlastnosti medzi jazykmi. Najlepšie výsledky dosiahol SlovakBERT v kombiná- cii s tréningom na výlučne slovenskom datasete (acc = 0, 9610). 1	cs_CZ
uk.abstract.en	Fake news is a problem in recent years. This study focuses on detecting fake news written in the Slovak language using text classification methods. It is unique because it is the first to conduct such a comprehensive set of experiments on Slovak. During the study, a balanced dataset was created, and over 80 experiments were conducted to find the optimal classifier for the problem. Pre-trained transformer-based language models, including BERT, mBERT, RoBERTA, XLM-RoBERTa, and SlovakBERT, were used in the initial step of the study, and their performance was compared against other machine learning methods using standard metrics. The models were fine-tuned with LIAR and COVID19 FN, English-language datasets, to test the impact of fake news topics and language transfer properties. SlovakBERT combined with training exclusively on Slovak datasets achieved the best results with an (acc = 0.9610). This study can contribute to the development of tools to automatically detect fake news in Slovak, aiding in the fight against the spread of false information. 1	en_US
uk.file-availability	V
uk.grantor	Univerzita Karlova, Matematicko-fyzikální fakulta, Ústav formální a aplikované lingvistiky	cs_CZ
thesis.grade.code	3
uk.publication-place	Praha	cs_CZ
uk.thesis.defenceStatus	O