Na cestě k lemmatizaci staročeských textů: data, software, aplikace

Synková, Pavlína; Lehečka, Boris; Svoboda, Ondřej

Towards the lemmatization of Old Czech texts: data, software, applications

dc.contributor.author	Synková, Pavlína
dc.contributor.author	Lehečka, Boris
dc.contributor.author	Svoboda, Ondřej
dc.date.accessioned	2018-11-28T15:02:26Z
dc.date.available	2018-11-28T15:02:26Z
dc.date.issued	2018
dc.identifier.issn	2336-6702
dc.identifier.uri	http://hdl.handle.net/20.500.11956/103953
dc.publisher	Univerzita Karlova, Filozofická fakulta	cs_CZ
dc.rights.uri	http://creativecommons.org/licenses/by-nc-nd/2.0/
dc.source	Studie z aplikované lingvistiky - Studies in Applied Linguistics, 2018, 9, Special Issue, 66-84	cs_CZ
dc.source.uri	https://studiezaplikovanelingvistiky.ff.cuni.cz
dc.subject	common nouns	cs_CZ
dc.subject	NLP software and applications	cs_CZ
dc.subject	Old Czech	cs_CZ
dc.subject	tagging	cs_CZ
dc.subject	XML	cs_CZ
dc.subject	apelativa	cs_CZ
dc.subject	lemmatizace	cs_CZ
dc.subject	NLP software a aplikace	cs_CZ
dc.subject	stará čeština	cs_CZ
dc.subject	tagování	cs_CZ
dc.subject	XML	cs_CZ
dc.title	Na cestě k lemmatizaci staročeských textů: data, software, aplikace	cs_CZ
dc.title.alternative	Towards the lemmatization of Old Czech texts: data, software, applications	cs_CZ
dc.type	Research article (Vědecký článek)	cs_CZ
uk.abstract.en	This paper presents a description of Old Czech common nouns that was developed for, and is used in, a tool for tagging and lemmatizing common nouns in transcribed digital editions of Old Czech texts. The description has four parts: the first gives an overview of the endings of all declension types (approx. 100 declension patterns); the second analyses alternations in the morphological base that accompany declension (approx. 120 types of alternations); the third deals with formal changes connected mainly with the historical development of the language (approx. 100 formal changes); and the fourth contains a list of lemmas extracted from modern dictionaries of Old Czech (approx. 29,000 lemmas). The paper also introduces the software developed and used for this purpose: i) a tool that makes it possible a) to generate word forms and then search for multiple word forms in the texts at once, and b) to create lists of word forms filtered by the character sequences at their ends; ii) a tool for assigning a declension pattern to a lemma; and iii) a tool for working with large databases. Finally, the paper describes two applications built on the description of Old Czech common nouns: i) a database of Old Czech common noun declension patterns linked to Old Czech dictionaries and the Old Czech text bank, and ii) a tool for generating word forms, which is used for lemmatizing and tagging Old Czech texts.	cs_CZ
uk.internal-type	uk_publication
dc.description.startPage	66
dc.description.endPage	84
dcterms.isPartOf.name	Studie z aplikované lingvistiky - Studies in Applied Linguistics	cs_CZ
dcterms.isPartOf.journalYear	2018
dcterms.isPartOf.journalVolume	9
dcterms.isPartOf.journalIssue	Special Issue

Files in this item

Name: Pavlina_Synkova_—_Boris_Leheck ...
Size: 1.043Mb
Format: application/pdf


This item appears in the following collections

Special Issue 2018 [9]


Except where otherwise noted, this item's license is http://creativecommons.org/licenses/by-nc-nd/2.0/