Document embedding using Transformers

Burian, David

Embedování dokumentů pomocí Transformerů

dc.contributor.advisor	Libovický, Jindřich
dc.creator	Burian, David
dc.date.accessioned	2024-07-08T09:15:57Z
dc.date.available	2024-07-08T09:15:57Z
dc.date.issued	2024
dc.identifier.uri	http://hdl.handle.net/20.500.11956/190630
dc.description.abstract	We develop a method to train a document embedding model with an unlabeled dataset and low computational resources. Using teacher-student training, we distill SBERT's capacity to capture text structure and Paragraph Vector's ability to encode extended context into the resulting embedding model. We test our method on Longformer, a Transformer model with sparse attention that can process up to 4096 tokens. We explore several loss functions for the distillation of knowledge from the two teachers (SBERT and Paragraph Vector) to our student model (Longformer). Through our experiments, we show that despite SBERT's short maximum context, its distillation is more critical to the student's performance. However, the student model can benefit from both teachers. Our method improves Longformer's performance on eight downstream tasks, including citation prediction, plagiarism detection, and similarity search. Our method shows exceptional performance when little finetuning data is available, where the trained student model outperforms both teacher models. By showing consistent performance of differently configured student models, we demonstrate our method's robustness to various changes and suggest areas for future work.	en_US
dc.description.abstract	V této práci představujeme metodu strojového učení modelů embedující dokumenty, která není náročná na výpočetní zdroje ani nevyžaduje anotovaná trénovací data. S přístupem učitele a studenta, distilujeme kapacitu SBERTa zaznamenat strukturu textu a schopnost Paragraph Vektoru zpracovat dlouhé dokumenty do našeho výsledného embedovacího modelu. Naši metodu testujeme na Longformeru, Transformeru s řídkou attention vrstvou, který je schopný zpracovat dokumenty dlouhé až 4096 tokenů. Prozkoumáme několik ztrátových funkcí, které nutí studenta (Longformera) napodobovat výstupy obou učitelů (SBERTa a Paragraph Vektoru). V experimentech ukazujeme, že i přes omezený kontext SBERTa, je distilace jeho výstupů pro výkon studenta zásadnější. Nicméně student dokáže získat prospěch z obou učitelů. Naše metoda vylepšuje výsledek Longformera na osmi úlohách, které zahrnují predikci citace, detekci plagiarismu i vyhledávání na základě podobnosti dokumentů. Naše metoda se navíc ukazuje jako obzvláště účinná v situacích s málo dotrénovávacími daty, kde námi natrénovaný student překoná i oba učitele. Podobným výkonem odlišně natrénovaných studentů ukazujeme, že naše metoda je robustní vůči různým změnám, a navrhujeme možné oblasti budoucího výzkumu.	cs_CZ
dc.language	English	cs_CZ
dc.language.iso	en_US
dc.publisher	Univerzita Karlova, Matematicko-fyzikální fakulta	cs_CZ
dc.subject	document embedding|knowledge distillation|SBERT|Paragraph Vector|Longformer	en_US
dc.subject	embedding dokumentů|destilování znalostí|SBERT|Paragraph Vector|Longformer	cs_CZ
dc.title	Document embedding using Transformers	en_US
dc.type	diplomová práce	cs_CZ
dcterms.created	2024
dcterms.dateAccepted	2024-06-10
dc.description.department	Institute of Formal and Applied Linguistics	en_US
dc.description.department	Ústav formální a aplikované lingvistiky	cs_CZ
dc.description.faculty	Matematicko-fyzikální fakulta	cs_CZ
dc.description.faculty	Faculty of Mathematics and Physics	en_US
dc.identifier.repId	250786
dc.title.translated	Embedování dokumentů pomocí Transformerů	cs_CZ
dc.contributor.referee	Variš, Dušan
thesis.degree.name	Mgr.
thesis.degree.level	navazující magisterské	cs_CZ
thesis.degree.discipline	Informatika - Umělá inteligence	cs_CZ
thesis.degree.discipline	Computer Science - Artificial Intelligence	en_US
thesis.degree.program	Computer Science - Artificial Intelligence	en_US
thesis.degree.program	Informatika - Umělá inteligence	cs_CZ
uk.thesis.type	diplomová práce	cs_CZ
uk.taxonomy.organization-cs	Matematicko-fyzikální fakulta::Ústav formální a aplikované lingvistiky	cs_CZ
uk.taxonomy.organization-en	Faculty of Mathematics and Physics::Institute of Formal and Applied Linguistics	en_US
uk.faculty-name.cs	Matematicko-fyzikální fakulta	cs_CZ
uk.faculty-name.en	Faculty of Mathematics and Physics	en_US
uk.faculty-abbr.cs	MFF	cs_CZ
uk.degree-discipline.cs	Informatika - Umělá inteligence	cs_CZ
uk.degree-discipline.en	Computer Science - Artificial Intelligence	en_US
uk.degree-program.cs	Informatika - Umělá inteligence	cs_CZ
uk.degree-program.en	Computer Science - Artificial Intelligence	en_US
thesis.grade.cs	Výborně	cs_CZ
thesis.grade.en	Excellent	en_US
uk.abstract.cs	V této práci představujeme metodu strojového učení modelů embedující dokumenty, která není náročná na výpočetní zdroje ani nevyžaduje anotovaná trénovací data. S přístupem učitele a studenta, distilujeme kapacitu SBERTa zaznamenat strukturu textu a schopnost Paragraph Vektoru zpracovat dlouhé dokumenty do našeho výsledného embedovacího modelu. Naši metodu testujeme na Longformeru, Transformeru s řídkou attention vrstvou, který je schopný zpracovat dokumenty dlouhé až 4096 tokenů. Prozkoumáme několik ztrátových funkcí, které nutí studenta (Longformera) napodobovat výstupy obou učitelů (SBERTa a Paragraph Vektoru). V experimentech ukazujeme, že i přes omezený kontext SBERTa, je distilace jeho výstupů pro výkon studenta zásadnější. Nicméně student dokáže získat prospěch z obou učitelů. Naše metoda vylepšuje výsledek Longformera na osmi úlohách, které zahrnují predikci citace, detekci plagiarismu i vyhledávání na základě podobnosti dokumentů. Naše metoda se navíc ukazuje jako obzvláště účinná v situacích s málo dotrénovávacími daty, kde námi natrénovaný student překoná i oba učitele. Podobným výkonem odlišně natrénovaných studentů ukazujeme, že naše metoda je robustní vůči různým změnám, a navrhujeme možné oblasti budoucího výzkumu.	cs_CZ
uk.abstract.en	We develop a method to train a document embedding model with an unlabeled dataset and low computational resources. Using teacher-student training, we distill SBERT's capacity to capture text structure and Paragraph Vector's ability to encode extended context into the resulting embedding model. We test our method on Longformer, a Transformer model with sparse attention that can process up to 4096 tokens. We explore several loss functions for the distillation of knowledge from the two teachers (SBERT and Paragraph Vector) to our student model (Longformer). Through our experiments, we show that despite SBERT's short maximum context, its distillation is more critical to the student's performance. However, the student model can benefit from both teachers. Our method improves Longformer's performance on eight downstream tasks, including citation prediction, plagiarism detection, and similarity search. Our method shows exceptional performance when little finetuning data is available, where the trained student model outperforms both teacher models. By showing consistent performance of differently configured student models, we demonstrate our method's robustness to various changes and suggest areas for future work.	en_US
uk.file-availability	V
uk.grantor	Univerzita Karlova, Matematicko-fyzikální fakulta, Ústav formální a aplikované lingvistiky	cs_CZ
thesis.grade.code	1
uk.publication-place	Praha	cs_CZ
uk.thesis.defenceStatus	O