Velký mnohojazyčný korpus

Majliš, Martin

dc.contributor.advisor	Žabokrtský, Zdeněk
dc.creator	Majliš, Martin
dc.date.accessioned	2017-05-08T13:57:49Z
dc.date.available	2017-05-08T13:57:49Z
dc.date.issued	2011
dc.identifier.uri	http://hdl.handle.net/20.500.11956/49625
dc.description.abstract	V této diplomové práci je popsán webový korpus W2C. Tento korpus obsahuje 97 jazyků a pro každý z nich alespoň 10 milionů slov. Celková velikost je 10,5 miliardy slov. Aby bylo možné takovýto korpus vytvořit, bylo nutné vyřešit celou řadu dílčích problémů. Na začátku musel být sestaven korpus z Wikipedie se 122 jazyky, na kterém byl natrénován rozpoznávač jazyků. Pro stahování webových stránek byl implementován distribuovaný systém, který využíval 35 počítačů. Ze stažených dat byly odstraněny duplicity. Vytvořené korpusy byly vzájemně porovnány pomocí různých statistik, jako jsou průměrná délka slov a vět, podmíněná entropie a podmíněná perplexita.	cs_CZ
dc.description.abstract	This thesis introduces the W2C Corpus, which contains 97 languages with more than 10 million words for each of them and a total size of 10.5 billion words. The corpus was built by crawling the Internet. This work describes the methods and tools used for its construction. The complete process consisted of building an initial corpus from Wikipedia, developing a language recognizer for 122 languages, implementing a distributed system for crawling and parsing web pages, and finally removing duplicates. A comparative analysis of the Wikipedia and web texts is provided at the end of the thesis. The analysis is based on basic statistics such as average word and sentence length, conditional entropy, and perplexity.	en_US
dc.language	English	cs_CZ
dc.language.iso	en_US
dc.publisher	Univerzita Karlova, Matematicko-fyzikální fakulta	cs_CZ
dc.subject	jazykový korpus	cs_CZ
dc.subject	distribuované zpracování	cs_CZ
dc.subject	language corpus	en_US
dc.subject	distributed processing	en_US
dc.title	Velký mnohojazyčný korpus	en_US
dc.type	diplomová práce	cs_CZ
dcterms.created	2011
dcterms.dateAccepted	2011-09-06
dc.description.department	Institute of Formal and Applied Linguistics	en_US
dc.description.department	Ústav formální a aplikované lingvistiky	cs_CZ
dc.description.faculty	Faculty of Mathematics and Physics	en_US
dc.description.faculty	Matematicko-fyzikální fakulta	cs_CZ
dc.identifier.repId	106396
dc.title.translated	Velký mnohojazyčný korpus	cs_CZ
dc.contributor.referee	Spousta, Miroslav
dc.identifier.aleph	001384473
thesis.degree.name	Mgr.
thesis.degree.level	navazující magisterské	cs_CZ
thesis.degree.discipline	Computational Linguistics	en_US
thesis.degree.discipline	Matematická lingvistika	cs_CZ
thesis.degree.program	Computer Science	en_US
thesis.degree.program	Informatika	cs_CZ
uk.thesis.type	diplomová práce	cs_CZ
uk.taxonomy.organization-cs	Matematicko-fyzikální fakulta::Ústav formální a aplikované lingvistiky	cs_CZ
uk.taxonomy.organization-en	Faculty of Mathematics and Physics::Institute of Formal and Applied Linguistics	en_US
uk.faculty-name.cs	Matematicko-fyzikální fakulta	cs_CZ
uk.faculty-name.en	Faculty of Mathematics and Physics	en_US
uk.faculty-abbr.cs	MFF	cs_CZ
uk.degree-discipline.cs	Matematická lingvistika	cs_CZ
uk.degree-discipline.en	Computational Linguistics	en_US
uk.degree-program.cs	Informatika	cs_CZ
uk.degree-program.en	Computer Science	en_US
thesis.grade.cs	Velmi dobře	cs_CZ
thesis.grade.en	Very good	en_US
uk.abstract.cs	V této diplomové práci je popsán webový korpus W2C. Tento korpus obsahuje 97 jazyků a pro každý z nich alespoň 10 milionů slov. Celková velikost je 10,5 miliardy slov. Aby bylo možné takovýto korpus vytvořit, bylo nutné vyřešit celou řadu dílčích problémů. Na začátku musel být sestaven korpus z Wikipedie se 122 jazyky, na kterém byl natrénován rozpoznávač jazyků. Pro stahování webových stránek byl implementován distribuovaný systém, který využíval 35 počítačů. Ze stažených dat byly odstraněny duplicity. Vytvořené korpusy byly vzájemně porovnány pomocí různých statistik, jako jsou průměrná délka slov a vět, podmíněná entropie a podmíněná perplexita.	cs_CZ
uk.abstract.en	This thesis introduces the W2C Corpus, which contains 97 languages with more than 10 million words for each of them and a total size of 10.5 billion words. The corpus was built by crawling the Internet. This work describes the methods and tools used for its construction. The complete process consisted of building an initial corpus from Wikipedia, developing a language recognizer for 122 languages, implementing a distributed system for crawling and parsing web pages, and finally removing duplicates. A comparative analysis of the Wikipedia and web texts is provided at the end of the thesis. The analysis is based on basic statistics such as average word and sentence length, conditional entropy, and perplexity.	en_US
uk.publication.place	Praha	cs_CZ
uk.grantor	Univerzita Karlova, Matematicko-fyzikální fakulta, Ústav formální a aplikované lingvistiky	cs_CZ
dc.identifier.lisID	990013844730106986