Genres classification by means of machine learning

Bílek, Jan

Klasifikace žánrů pomocí strojového učení

dc.contributor.advisor	Neruda, Roman
dc.creator	Bílek, Jan
dc.date.accessioned	2018-10-04T12:43:29Z
dc.date.available	2018-10-04T12:43:29Z
dc.date.issued	2018
dc.identifier.uri	http://hdl.handle.net/20.500.11956/101890
dc.description.abstract	In this thesis, we compare the bag of words approach with doc2vec document embeddings on the task of classifying book genres. We create 3 datasets with different text lengths by extracting short snippets from books in the Project Gutenberg repository. Each dataset comprises more than 200000 documents and 14 different genres. For 3200-character documents, we achieve an F1-score of 0.862 when stacking models trained on both bag of words and doc2vec representations. We also explore the relationships between documents, genres and words using similarity metrics on their vector representations and report typical words for each genre. As part of the thesis, we also present an online webapp for book genre classification.	en_US
dc.description.abstract	V této práci porovnáváme bag of words a doc2vec přístup k problému klasifikace literárních žánrů. Na základě textů knih z repozitáře Projektu Gutenberg vytváříme tři datasety různých délek. Každý z nich obsahuje přes 200000 dokumentů a 14 různých žánrů. Na souboru dokumentů s délkou 3200 znaků dosahujeme kombinací modelů bag of words a doc2vec reprezentace F1-skóre 0.862. V práci dále zkoumáme vztahy mezi knihami, žánry a slovy na základě podobnosti jejich vektorové reprezentace a uvádíme typická slova pro každý žánr. Součástí práce je webová aplikace na klasifikaci žánrů.	cs_CZ
dc.language	English	cs_CZ
dc.language.iso	en_US
dc.publisher	Univerzita Karlova, Matematicko-fyzikální fakulta	cs_CZ
dc.subject	Machine learning	en_US
dc.subject	natural language processing	en_US
dc.subject	genre classification	en_US
dc.subject	word embeddings	en_US
dc.subject	paragraph vector	en_US
dc.subject	Strojové učení	cs_CZ
dc.subject	zpracování přirozeného jazyka	cs_CZ
dc.subject	klasifikace žánrů	cs_CZ
dc.subject	vnoření slov	cs_CZ
dc.subject	paragraph vector	cs_CZ
dc.title	Genres classification by means of machine learning	en_US
dc.type	diplomová práce	cs_CZ
dcterms.created	2018
dcterms.dateAccepted	2018-09-13
dc.description.department	Katedra teoretické informatiky a matematické logiky	cs_CZ
dc.description.department	Department of Theoretical Computer Science and Mathematical Logic	en_US
dc.description.faculty	Faculty of Mathematics and Physics	en_US
dc.description.faculty	Matematicko-fyzikální fakulta	cs_CZ
dc.identifier.repId	202143
dc.title.translated	Klasifikace žánrů pomocí strojového učení	cs_CZ
dc.contributor.referee	Vomlelová, Marta
thesis.degree.name	Mgr.
thesis.degree.level	navazující magisterské	cs_CZ
thesis.degree.discipline	Artificial Intelligence	en_US
thesis.degree.discipline	Umělá inteligence	cs_CZ
thesis.degree.program	Informatika	cs_CZ
thesis.degree.program	Computer Science	en_US
uk.thesis.type	diplomová práce	cs_CZ
uk.taxonomy.organization-cs	Matematicko-fyzikální fakulta::Katedra teoretické informatiky a matematické logiky	cs_CZ
uk.taxonomy.organization-en	Faculty of Mathematics and Physics::Department of Theoretical Computer Science and Mathematical Logic	en_US
uk.faculty-name.cs	Matematicko-fyzikální fakulta	cs_CZ
uk.faculty-name.en	Faculty of Mathematics and Physics	en_US
uk.faculty-abbr.cs	MFF	cs_CZ
uk.degree-discipline.cs	Umělá inteligence	cs_CZ
uk.degree-discipline.en	Artificial Intelligence	en_US
uk.degree-program.cs	Informatika	cs_CZ
uk.degree-program.en	Computer Science	en_US
thesis.grade.cs	Výborně	cs_CZ
thesis.grade.en	Excellent	en_US
uk.abstract.cs	V této práci porovnáváme bag of words a doc2vec přístup k problému klasifikace literárních žánrů. Na základě textů knih z repozitáře Projektu Gutenberg vytváříme tři datasety různých délek. Každý z nich obsahuje přes 200000 dokumentů a 14 různých žánrů. Na souboru dokumentů s délkou 3200 znaků dosahujeme kombinací modelů bag of words a doc2vec reprezentace F1-skóre 0.862. V práci dále zkoumáme vztahy mezi knihami, žánry a slovy na základě podobnosti jejich vektorové reprezentace a uvádíme typická slova pro každý žánr. Součástí práce je webová aplikace na klasifikaci žánrů.	cs_CZ
uk.abstract.en	In this thesis, we compare the bag of words approach with doc2vec document embeddings on the task of classifying book genres. We create 3 datasets with different text lengths by extracting short snippets from books in the Project Gutenberg repository. Each dataset comprises more than 200000 documents and 14 different genres. For 3200-character documents, we achieve an F1-score of 0.862 when stacking models trained on both bag of words and doc2vec representations. We also explore the relationships between documents, genres and words using similarity metrics on their vector representations and report typical words for each genre. As part of the thesis, we also present an online webapp for book genre classification.	en_US
uk.file-availability	V
uk.publication.place	Praha	cs_CZ
uk.grantor	Univerzita Karlova, Matematicko-fyzikální fakulta, Katedra teoretické informatiky a matematické logiky	cs_CZ
thesis.grade.code	1