Oborová klasifikace textu

Čech, Josef

Branch text classification

dc.contributor.advisor	Raab, Jan
dc.creator	Čech, Josef
dc.date.accessioned	2017-04-20T21:16:24Z
dc.date.available	2017-04-20T21:16:24Z
dc.date.issued	2010
dc.identifier.uri	http://hdl.handle.net/20.500.11956/28777
dc.description.abstract	Práce se zabývá porovnáváním textu a jeho kategorizací. Kategorie, které je program schopen určit, získává v módu učení. Porovnává několik možných algoritmů, které lze využít ke kategorizaci textu. Jde především o Bayesovský model, klasifikaci pomocí neuronových sítí a vektorový model. V praktické části je implementován vektorový model, který využívá kosinovou míru podobnosti. Extrakce termínů vychází z Luhnovy myšlenky o významovosti slov. Jako hlavní zdroj vah pro kosinovou míru podobnosti je využívána metoda tf×idf s penalizacemi.	cs_CZ
dc.description.abstract	This thesis deals with text categorization. The first part describes several algorithms for categorizing documents: the Bayesian model, categorization with neural networks, and the vector model. The practical part focuses on the vector model. The vector model is based on comparing two vectors, one representing a pattern and the other a query. In our case, the first vector corresponds to a category and the second to the document. The coordinates of each vector are the weights of individual words in the document or in the branch, depending on which vector is considered. Several measures can be used for the comparison, such as the Dice coefficient, the Jaccard coefficient, or cosine similarity; this thesis uses cosine similarity. The weights are computed from the frequency of the term in the document and the number of documents that contain the term. Relevant terms are selected following Luhn's ideas about the significance of words.	en_US
dc.language	Čeština	cs_CZ
dc.language.iso	cs_CZ
dc.publisher	Univerzita Karlova, Matematicko-fyzikální fakulta	cs_CZ
dc.title	Oborová klasifikace textu	cs_CZ
dc.type	bakalářská práce	cs_CZ
dcterms.created	2010
dcterms.dateAccepted	2010-06-21
dc.description.department	Institute of Formal and Applied Linguistics	en_US
dc.description.department	Ústav formální a aplikované lingvistiky	cs_CZ
dc.description.faculty	Faculty of Mathematics and Physics	en_US
dc.description.faculty	Matematicko-fyzikální fakulta	cs_CZ
dc.identifier.repId	65697
dc.title.translated	Branch text classification	en_US
dc.contributor.referee	Spousta, Miroslav
dc.identifier.aleph	001381687
thesis.degree.name	Bc.
thesis.degree.level	bakalářské	cs_CZ
thesis.degree.discipline	Programování	cs_CZ
thesis.degree.discipline	Programming	en_US
thesis.degree.program	Informatika	cs_CZ
thesis.degree.program	Computer Science	en_US
uk.thesis.type	bakalářská práce	cs_CZ
uk.taxonomy.organization-cs	Matematicko-fyzikální fakulta::Ústav formální a aplikované lingvistiky	cs_CZ
uk.taxonomy.organization-en	Faculty of Mathematics and Physics::Institute of Formal and Applied Linguistics	en_US
uk.faculty-name.cs	Matematicko-fyzikální fakulta	cs_CZ
uk.faculty-name.en	Faculty of Mathematics and Physics	en_US
uk.faculty-abbr.cs	MFF	cs_CZ
uk.degree-discipline.cs	Programování	cs_CZ
uk.degree-discipline.en	Programming	en_US
uk.degree-program.cs	Informatika	cs_CZ
uk.degree-program.en	Computer Science	en_US
thesis.grade.cs	Dobře	cs_CZ
thesis.grade.en	Good	en_US
uk.abstract.cs	Práce se zabývá porovnáváním textu a jeho kategorizací. Kategorie, které je program schopen určit, získává v módu učení. Porovnává několik možných algoritmů, které lze využít ke kategorizaci textu. Jde především o Bayesovský model, klasifikaci pomocí neuronových sítí a vektorový model. V praktické části je implementován vektorový model, který využívá kosinovou míru podobnosti. Extrakce termínů vychází z Luhnovy myšlenky o významovosti slov. Jako hlavní zdroj vah pro kosinovou míru podobnosti je využívána metoda tf×idf s penalizacemi.	cs_CZ
uk.abstract.en	This thesis deals with text categorization. The first part describes several algorithms for categorizing documents: the Bayesian model, categorization with neural networks, and the vector model. The practical part focuses on the vector model. The vector model is based on comparing two vectors, one representing a pattern and the other a query. In our case, the first vector corresponds to a category and the second to the document. The coordinates of each vector are the weights of individual words in the document or in the branch, depending on which vector is considered. Several measures can be used for the comparison, such as the Dice coefficient, the Jaccard coefficient, or cosine similarity; this thesis uses cosine similarity. The weights are computed from the frequency of the term in the document and the number of documents that contain the term. Relevant terms are selected following Luhn's ideas about the significance of words.	en_US
uk.publication.place	Praha	cs_CZ
uk.grantor	Univerzita Karlova, Matematicko-fyzikální fakulta, Ústav formální a aplikované lingvistiky	cs_CZ
dc.identifier.lisID	990013816870106986