Vyhledávání v nesegmentované mluvené řeči

Češka, Pavel

Unsegmented speech retrieval

dc.contributor.advisor	Pecina, Pavel
dc.creator	Češka, Pavel
dc.date.accessioned	2017-04-12T09:53:39Z
dc.date.available	2017-04-12T09:53:39Z
dc.date.issued	2008
dc.identifier.uri	http://hdl.handle.net/20.500.11956/17229
dc.description.abstract	V této práci vyhledávám relevantní pasáže v nahrávkách českých svědků holocaustu z projektu MALACH. Zvukové záznamy těchto nahrávek jsou zpracovány systémem pro automatické rozpoznání řeči a přepisy z těchto systémů jsou lemmatizovány a opatřeny morfologickými tagy. V práci představuji skript, který z těchto dat generuje parametrizovatelné kolekce dokumentů. Problém vyhledávání informací v nesegmentované mluvené řeči poté přeformuluji na problém vyhledávání v těchto kolekcích dokumentů. V práci popisuji několik desítek experimentů zkoumajících vliv různých vyhledávacích technik na výsledky vyhledávání na těchto datech. Jedná se zejména o vliv normalizace slovních forem (lemmatizace), volby vyhledávacího modelu (TFIDF modelu, Okapi modelu a Indri modelu), obohacení dotazu o slepou zpětnou vazbu, odstranění nevýznamových slov podle frekvence či podle slovního druhu. Důraz je kladen také na různé hodnoty parametrů délky a přesahu generovaných dokumentů. Zjištěné poznatky jsou v závěru práce ověřeny na testovacích datech. Přepisy výpovědí ani témata pro vyhledávání nejsou z právních důvodů součástí této práce.	cs_CZ
dc.description.abstract	In this work I search interviews with Czech Holocaust witnesses from the MALACH project to find the relevant passages of their testimonies. The audio recordings of these interviews are processed by an automatic speech recognition system, and the resulting transcripts are lemmatized and annotated with morphological tags. I present a script that generates parametrizable collections of documents from this preprocessed data. The task of unsegmented speech retrieval is then reformulated as information retrieval over these document collections. I describe several dozen experiments that examine how different retrieval techniques affect retrieval results on this data. In particular, I study the effect of morphological normalization (lemmatization), the choice of retrieval model (TF-IDF, Okapi and Indri), query expansion with blind relevance feedback, and stopword removal based on term frequency or part of speech. I also put emphasis on different values of the length and overlap parameters of the generated documents. The findings are verified on test data at the end of the work. For legal reasons, the audio recordings, the automatic speech recognition outputs and the retrieval topics are not part of this work.	en_US
dc.language	Čeština	cs_CZ
dc.language.iso	cs_CZ
dc.publisher	Univerzita Karlova, Matematicko-fyzikální fakulta	cs_CZ
dc.title	Vyhledávání v nesegmentované mluvené řeči	cs_CZ
dc.type	diplomová práce	cs_CZ
dcterms.created	2008
dcterms.dateAccepted	2008-09-08
dc.description.department	Ústav formální a aplikované lingvistiky	cs_CZ
dc.description.department	Institute of Formal and Applied Linguistics	en_US
dc.description.faculty	Faculty of Mathematics and Physics	en_US
dc.description.faculty	Matematicko-fyzikální fakulta	cs_CZ
dc.identifier.repId	48476
dc.title.translated	Unsegmented speech retrieval	en_US
dc.contributor.referee	Peterek, Nino
dc.identifier.aleph	001099840
thesis.degree.name	Mgr.
thesis.degree.level	navazující magisterské	cs_CZ
thesis.degree.discipline	Matematická lingvistika	cs_CZ
thesis.degree.discipline	Computational Linguistics	en_US
thesis.degree.program	Informatika	cs_CZ
thesis.degree.program	Computer Science	en_US
uk.thesis.type	diplomová práce	cs_CZ
uk.taxonomy.organization-cs	Matematicko-fyzikální fakulta::Ústav formální a aplikované lingvistiky	cs_CZ
uk.taxonomy.organization-en	Faculty of Mathematics and Physics::Institute of Formal and Applied Linguistics	en_US
uk.faculty-name.cs	Matematicko-fyzikální fakulta	cs_CZ
uk.faculty-name.en	Faculty of Mathematics and Physics	en_US
uk.faculty-abbr.cs	MFF	cs_CZ
uk.degree-discipline.cs	Matematická lingvistika	cs_CZ
uk.degree-discipline.en	Computational Linguistics	en_US
uk.degree-program.cs	Informatika	cs_CZ
uk.degree-program.en	Computer Science	en_US
thesis.grade.cs	Výborně	cs_CZ
thesis.grade.en	Excellent	en_US
uk.abstract.cs	V této práci vyhledávám relevantní pasáže v nahrávkách českých svědků holocaustu z projektu MALACH. Zvukové záznamy těchto nahrávek jsou zpracovány systémem pro automatické rozpoznání řeči a přepisy z těchto systémů jsou lemmatizovány a opatřeny morfologickými tagy. V práci představuji skript, který z těchto dat generuje parametrizovatelné kolekce dokumentů. Problém vyhledávání informací v nesegmentované mluvené řeči poté přeformuluji na problém vyhledávání v těchto kolekcích dokumentů. V práci popisuji několik desítek experimentů zkoumajících vliv různých vyhledávacích technik na výsledky vyhledávání na těchto datech. Jedná se zejména o vliv normalizace slovních forem (lemmatizace), volby vyhledávacího modelu (TFIDF modelu, Okapi modelu a Indri modelu), obohacení dotazu o slepou zpětnou vazbu, odstranění nevýznamových slov podle frekvence či podle slovního druhu. Důraz je kladen také na různé hodnoty parametrů délky a přesahu generovaných dokumentů. Zjištěné poznatky jsou v závěru práce ověřeny na testovacích datech. Přepisy výpovědí ani témata pro vyhledávání nejsou z právních důvodů součástí této práce.	cs_CZ
uk.abstract.en	In this work I search interviews with Czech Holocaust witnesses from the MALACH project to find the relevant passages of their testimonies. The audio recordings of these interviews are processed by an automatic speech recognition system, and the resulting transcripts are lemmatized and annotated with morphological tags. I present a script that generates parametrizable collections of documents from this preprocessed data. The task of unsegmented speech retrieval is then reformulated as information retrieval over these document collections. I describe several dozen experiments that examine how different retrieval techniques affect retrieval results on this data. In particular, I study the effect of morphological normalization (lemmatization), the choice of retrieval model (TF-IDF, Okapi and Indri), query expansion with blind relevance feedback, and stopword removal based on term frequency or part of speech. I also put emphasis on different values of the length and overlap parameters of the generated documents. The findings are verified on test data at the end of the work. For legal reasons, the audio recordings, the automatic speech recognition outputs and the retrieval topics are not part of this work.	en_US
uk.publication.place	Praha	cs_CZ
uk.grantor	Univerzita Karlova, Matematicko-fyzikální fakulta, Ústav formální a aplikované lingvistiky	cs_CZ
dc.identifier.lisID	990010998400106986