Semi-supervised Learning from Unfavorably Distributed Data

Sochor, Matěj

Semi-supervised učení z nepříznivě distribuovaných dat

dc.contributor.advisor	Pilát, Martin
dc.creator	Sochor, Matěj
dc.date.accessioned	2020-07-29T10:01:48Z
dc.date.available	2020-07-29T10:01:48Z
dc.date.issued	2020
dc.identifier.uri	http://hdl.handle.net/20.500.11956/119538
dc.description.abstract	Semi-supervised learning (SSL) is a branch of machine learning that uses unlabeled data samples alongside labeled ones, aiming to reduce the need for labeled data and thus make machine learning practical even when labeling large amounts of data would be too costly. Despite its rapid development in recent years, several issues remain to be solved before it can be broadly deployed in practice. One of them is class distribution mismatch, which arises when the unlabeled data contains samples that do not belong to any of the classes present in the labeled data. This confuses the training and can even yield a classifier that performs worse than one trained on the available data in a purely supervised fashion. We designed a filtering method called Unfavorable Data Filtering (UDF), which extracts important features from the data and then uses a similarity-based filter to remove the irrelevant data according to those features. The filtering happens before any SSL training takes place, making UDF usable with any SSL algorithm. To judge its effectiveness, we performed many experiments, mainly on the CIFAR-10 dataset. We found that UDF is capable of significantly improving the resulting accuracy when compared to not filtering the data, identified basic guidelines...	en_US
dc.description.abstract	Semi-supervised učení je technika strojového učení snažící se využít nejen označkovaná data (data pro která známe požadované výstupy), ale i neoznačkovaná data (data pro která požadované výstupy neznáme) s cílem snížit požadavky na množství označkovaných dat a tím umožnit použití strojového učení i v případech kdy je označkování velkého množství dat příliš náročné. I přes svůj rychlý vývoj v posledních letech stále trpí problémy které brání jeho širokému využití v praxi. Jedním z těchto problémů je nesoulad distribucí tříd. Ten vzniká, když neoznačkovaná data obsahují vzorky které nepatří do žádné ze tříd označkovaných dat. To může zmást učení klasifikátoru do takové míry, že je ve výsledku horší než kdyby neoznačkovaná data vůbec nebyla využita. Tato diplomová práce navrhuje metodu nazvanou Unfavorable Data Filtering (UDF), která nejprve z dat extrahuje důležité příznaky a pak se na jejich základě pomocí filtru založeného na podobnosti datových vzorků snaží vyřadit nerelevantní data z trénovacích dat. Díky tomu, že je UDF použita před semi-supervised učením je možné ji použít s libovolnou učící metodou. Pro zjištění jak efektivní UDF je jsme provedli mnoho experimentů, převážně na datasetu zvaném CIFAR-10. Pomocí těchto experimentů jsme zjistili, že filtrování pomocí UDF je opravdu schopno výrazně...	cs_CZ
dc.language	English	cs_CZ
dc.language.iso	en_US
dc.publisher	Univerzita Karlova, Matematicko-fyzikální fakulta	cs_CZ
dc.subject	Semi-supervised Learning	en_US
dc.subject	Deep Learning	en_US
dc.subject	Unbalanced distribution	en_US
dc.subject	Semi-supervised učení	cs_CZ
dc.subject	Hluboké učení	cs_CZ
dc.subject	Nevyvážená distribuce	cs_CZ
dc.title	Semi-supervised Learning from Unfavorably Distributed Data	en_US
dc.type	diplomová práce	cs_CZ
dcterms.created	2020
dcterms.dateAccepted	2020-07-08
dc.description.department	Department of Theoretical Computer Science and Mathematical Logic	en_US
dc.description.department	Katedra teoretické informatiky a matematické logiky	cs_CZ
dc.description.faculty	Matematicko-fyzikální fakulta	cs_CZ
dc.description.faculty	Faculty of Mathematics and Physics	en_US
dc.identifier.repId	222808
dc.title.translated	Semi-supervised učení z nepříznivě distribuovaných dat	cs_CZ
dc.contributor.referee	Mrázová, Iveta
thesis.degree.name	Mgr.
thesis.degree.level	navazující magisterské	cs_CZ
thesis.degree.discipline	Umělá inteligence	cs_CZ
thesis.degree.discipline	Artificial Intelligence	en_US
thesis.degree.program	Computer Science	en_US
thesis.degree.program	Informatika	cs_CZ
uk.thesis.type	diplomová práce	cs_CZ
uk.taxonomy.organization-cs	Matematicko-fyzikální fakulta::Katedra teoretické informatiky a matematické logiky	cs_CZ
uk.taxonomy.organization-en	Faculty of Mathematics and Physics::Department of Theoretical Computer Science and Mathematical Logic	en_US
uk.faculty-name.cs	Matematicko-fyzikální fakulta	cs_CZ
uk.faculty-name.en	Faculty of Mathematics and Physics	en_US
uk.faculty-abbr.cs	MFF	cs_CZ
uk.degree-discipline.cs	Umělá inteligence	cs_CZ
uk.degree-discipline.en	Artificial Intelligence	en_US
uk.degree-program.cs	Informatika	cs_CZ
uk.degree-program.en	Computer Science	en_US
thesis.grade.cs	Výborně	cs_CZ
thesis.grade.en	Excellent	en_US
uk.abstract.cs	Semi-supervised učení je technika strojového učení snažící se využít nejen označkovaná data (data pro která známe požadované výstupy), ale i neoznačkovaná data (data pro která požadované výstupy neznáme) s cílem snížit požadavky na množství označkovaných dat a tím umožnit použití strojového učení i v případech kdy je označkování velkého množství dat příliš náročné. I přes svůj rychlý vývoj v posledních letech stále trpí problémy které brání jeho širokému využití v praxi. Jedním z těchto problémů je nesoulad distribucí tříd. Ten vzniká, když neoznačkovaná data obsahují vzorky které nepatří do žádné ze tříd označkovaných dat. To může zmást učení klasifikátoru do takové míry, že je ve výsledku horší než kdyby neoznačkovaná data vůbec nebyla využita. Tato diplomová práce navrhuje metodu nazvanou Unfavorable Data Filtering (UDF), která nejprve z dat extrahuje důležité příznaky a pak se na jejich základě pomocí filtru založeného na podobnosti datových vzorků snaží vyřadit nerelevantní data z trénovacích dat. Díky tomu, že je UDF použita před semi-supervised učením je možné ji použít s libovolnou učící metodou. Pro zjištění jak efektivní UDF je jsme provedli mnoho experimentů, převážně na datasetu zvaném CIFAR-10. Pomocí těchto experimentů jsme zjistili, že filtrování pomocí UDF je opravdu schopno výrazně...	cs_CZ
uk.abstract.en	Semi-supervised learning (SSL) is a branch of machine learning that uses unlabeled data samples alongside labeled ones, aiming to reduce the need for labeled data and thus make machine learning practical even when labeling large amounts of data would be too costly. Despite its rapid development in recent years, several issues remain to be solved before it can be broadly deployed in practice. One of them is class distribution mismatch, which arises when the unlabeled data contains samples that do not belong to any of the classes present in the labeled data. This confuses the training and can even yield a classifier that performs worse than one trained on the available data in a purely supervised fashion. We designed a filtering method called Unfavorable Data Filtering (UDF), which extracts important features from the data and then uses a similarity-based filter to remove the irrelevant data according to those features. The filtering happens before any SSL training takes place, making UDF usable with any SSL algorithm. To judge its effectiveness, we performed many experiments, mainly on the CIFAR-10 dataset. We found that UDF is capable of significantly improving the resulting accuracy when compared to not filtering the data, identified basic guidelines...	en_US
uk.file-availability	V
uk.grantor	Univerzita Karlova, Matematicko-fyzikální fakulta, Katedra teoretické informatiky a matematické logiky	cs_CZ
thesis.grade.code	1
uk.publication-place	Praha	cs_CZ