Quantitative delimitation of the core of a language
dc.contributor.author: Cvrček, Václav
dc.date.accessioned: 2018-05-28T11:04:45Z
dc.date.available: 2018-05-28T11:04:45Z
dc.date.issued: 2014
dc.identifier.issn: 2336-6591
dc.identifier.uri: http://hdl.handle.net/20.500.11956/96790
dc.description.abstract: Hapax legomena, i.e. word or lemma types that occur in a corpus only once, are usually overlooked in language description. They cannot be used systematically in the vast majority of analyses because they provide no basis for generalization. The overall number of hapaxes, however, can serve as an indicator of the lexical periphery of the language system. This paper suggests that the ratio of the number of hapaxes to the number of all types, tracked as the corpus grows (hapax-type ratio, HTR), can be used to delimit the lexical core of a language. Previous research (Fengxiang 2010) has shown that the HTR curve for English has the shape of a pipe or chibouque, which means that the rates at which new hapaxes and new types emerge while a corpus is being built differ before and after the corpus reaches a certain size. In a hypothetical small corpus (a few sentences), the hapax-type ratio equals one, since each word type is also a hapax. As texts are added to the corpus (up to a few million words), the ratio decreases from its maximal value (= 1) to a local minimum: the number of new words, including hapaxes, keeps increasing, but most added tokens are new instances of words already present in the corpus. Beyond this turning point, extending the corpus increases the ratio, because the number of hapaxes grows faster than the number of non-hapaxes (i.e. types with a frequency higher than one). This empirical finding, tested on corpora of Czech and English, brings us closer to an exact determination of the range of the core lexicon. From it we can deduce the approximate size of a corpus sufficient for compiling a dictionary that covers the core lexicon. [en_US]
dc.format: pdf
dc.publisher: Univerzita Karlova, Filozofická fakulta [cs_CZ]
dc.rights.uri: https://creativecommons.org/licenses/by-nc-nd/2.0/
dc.source: Časopis pro moderní filologii (Journal for Modern Philology), 2014, 96, 1, 9-26
dc.source.uri: https://casopispromodernifilologii.ff.cuni.cz
dc.subject: korpus [cs_CZ]
dc.subject: kvantitativní lingvistika [cs_CZ]
dc.subject: hapax legomenon [cs_CZ]
dc.subject: lexikon [cs_CZ]
dc.subject: token-type poměr [cs_CZ]
dc.subject: corpus [en_US]
dc.subject: quantitative linguistics [en_US]
dc.subject: hapax legomenon [en_US]
dc.subject: lexicon [en_US]
dc.subject: token-type ratio [en_US]
dc.title: Kvantitativní určení lexikálního jádra jazyka [cs_CZ]
dc.type: Vědecký článek [cs_CZ]
dc.type: Research Article [en_US]
dcterms.accessRights: openAccess
dc.title.translated: Quantitative delimitation of the core of a language [en_US]
uk.internal-type: uk_publication
dc.description.startPage: 9
dc.description.endPage: 26
dcterms.isPartOf.name: Časopis pro moderní filologii (Journal for Modern Philology) [cs_CZ]
dcterms.isPartOf.journalYear: 2014
dcterms.isPartOf.journalVolume: 2014
dcterms.isPartOf.journalIssue: 1
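
The hapax-type ratio described in the abstract (number of hapaxes divided by number of types, tracked as the corpus grows) can be illustrated with a minimal sketch. This is not the author's implementation: the function names `hapax_type_ratio_curve` and `turning_point` are hypothetical, the input is assumed to be an already tokenized (or lemmatized) sequence, and the turning point is taken simply as the lowest sampled HTR value, which may differ from how the paper locates the local minimum.

```python
from collections import Counter


def hapax_type_ratio_curve(tokens, step=1):
    """Return (corpus_size, HTR) pairs, sampled every `step` tokens.

    HTR = number of types seen exactly once / number of all types so far.
    """
    counts = Counter()
    hapaxes = 0
    curve = []
    for size, token in enumerate(tokens, start=1):
        counts[token] += 1
        if counts[token] == 1:
            hapaxes += 1  # a new type enters the corpus as a hapax
        elif counts[token] == 2:
            hapaxes -= 1  # second occurrence: the type stops being a hapax
        if size % step == 0:
            curve.append((size, hapaxes / len(counts)))
    return curve


def turning_point(curve):
    """Return the (corpus_size, HTR) pair with the lowest sampled HTR.

    Assumption: the minimum over the sampled curve stands in for the
    local minimum discussed in the abstract.
    """
    return min(curve, key=lambda point: point[1])


if __name__ == "__main__":
    toy_corpus = "a b c a b d".split()
    curve = hapax_type_ratio_curve(toy_corpus)
    print(curve)
    print(turning_point(curve))
```

On a real corpus, `step` would typically be set to thousands of tokens so that the curve stays small; the toy input above only shows the mechanics (the ratio starts at 1, drops as words repeat, and rises again when a new hapax arrives).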


Except where otherwise noted, this item's license is described as https://creativecommons.org/licenses/by-nc-nd/2.0/

© 2017 Univerzita Karlova, Ústřední knihovna, Ovocný trh 560/5, 116 36 Praha 1; email: admin-repozitar [at] cuni.cz

Each constituent part of Charles University is responsible for adherence to all provisions of the copyright law.

Notice: Any retrieved information shall not be used for any commercial purposes or claimed as the result of study, scientific work, or any other creative activity of any person other than the author.
