n-gram
An n-gram is a sequence of n adjacent symbols in a particular order. The symbols may be n adjacent letters (including punctuation marks and blanks), syllables, or, rarely, whole words found in a language dataset; or adjacent phonemes extracted from a speech-recording dataset, or adjacent base pairs extracted from a genome. They are typically collected from a text corpus or speech corpus. When Latin numerical prefixes are used, an n-gram of size 1 is called a "unigram", one of size 2 a "bigram" (or, less commonly, a "digram"), and so on. For larger sizes, English cardinal numbers are used instead, giving "four-gram", "five-gram", etc. Similarly, computational biology uses Greek numerical prefixes such as "monomer", "dimer", "trimer", "tetramer", "pentamer", etc., or English cardinal numbers, "one-mer", "two-mer", "three-mer", etc., for polymers or oligomers of a known size, called k-mers. When the items are words, n-grams may also be called shingles.[1]
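For illustration, the sliding-window extraction described above can be sketched in a few lines of Python (the `ngrams` helper below is hypothetical, not from any particular library); the same idea applies whether the symbols are characters, words, phonemes, or base pairs:

```python
def ngrams(symbols, n):
    """Return all n-grams: tuples of n adjacent symbols, via a sliding window."""
    return [tuple(symbols[i:i + n]) for i in range(len(symbols) - n + 1)]

# Character-level bigrams (letters, including blanks):
print(ngrams("to be", 2))   # [('t', 'o'), ('o', ' '), (' ', 'b'), ('b', 'e')]

# Word-level trigrams (shingles):
print(ngrams("to be or not to be".split(), 3))
# [('to', 'be', 'or'), ('be', 'or', 'not'), ('or', 'not', 'to'), ('not', 'to', 'be')]
```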
In the context of natural language processing (NLP), the use of n-grams allows bag-of-words models to capture information such as word order, which would not be possible in the traditional bag-of-words setting.
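As a minimal sketch of that point (standard library only; `bag_of_ngrams` is a made-up helper): two sentences with the same words in different orders have identical unigram bags, but their bigram bags differ, so bigram features retain local word order:

```python
from collections import Counter

def bag_of_ngrams(text, n):
    """Count the word n-grams of a sentence; n >= 2 keeps local word order."""
    words = text.split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

# Identical unigram (bag-of-words) representations ...
print(bag_of_ngrams("dog bites man", 1) == bag_of_ngrams("man bites dog", 1))  # True
# ... but distinct bigram representations, which encode the word order.
print(bag_of_ngrams("dog bites man", 2) == bag_of_ngrams("man bites dog", 2))  # False
```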
Examples
Shannon (1951)[2] discussed n-gram models of English, generating text by drawing symbols at random according to their n-gram statistics (a sketch of the sampling idea follows these examples):
- 3-gram character model (random draw based on the probabilities of each trigram): in no ist lat whey cratict froure birs grocid pondenome of demonstures of the retagin is regiactiona of cre
- 2-gram word model (random draw of words taking into account their transition probabilities): the head and in frontal attack on an english writer that the character of this point is therefore another method for the letters that the time of who ever told the problem for an unexpected
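The samples above were generated by drawing each next symbol at random, in proportion to how often it followed the preceding context in a training text. A minimal Python sketch of that sampling idea for a character 3-gram model (an illustration of the principle, not Shannon's original procedure; the function name and toy training string are made up):

```python
import random
from collections import defaultdict

def sample_trigram_text(training_text, length, seed=0):
    """Generate text from a character 3-gram model: each next character is
    drawn in proportion to how often it followed the previous two characters
    in the training text."""
    rng = random.Random(seed)
    followers = defaultdict(list)  # two-character context -> observed next characters
    for i in range(len(training_text) - 2):
        followers[training_text[i:i + 2]].append(training_text[i + 2])
    out = training_text[:2]
    while len(out) < length and followers[out[-2:]]:
        out += rng.choice(followers[out[-2:]])  # repeats in the list encode the trigram probabilities
    return out

print(sample_trigram_text("the theory of the thing in the theatre", 30))
```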
| Field | Unit | Sample sequence | 1-gram sequence | 2-gram sequence | 3-gram sequence |
|---|---|---|---|---|---|
| Vernacular name | | | unigram | bigram | trigram |
| Order of resulting Markov model | | | 0 | 1 | 2 |
| Protein sequencing | amino acid | ... Cys-Gly-Leu-Ser-Trp ... | ..., Cys, Gly, Leu, Ser, Trp, ... | ..., Cys-Gly, Gly-Leu, Leu-Ser, Ser-Trp, ... | ..., Cys-Gly-Leu, Gly-Leu-Ser, Leu-Ser-Trp, ... |
| DNA sequencing | base pair | ...AGCTTCGA... | ..., A, G, C, T, T, C, G, A, ... | ..., AG, GC, CT, TT, TC, CG, GA, ... | ..., AGC, GCT, CTT, TTC, TCG, CGA, ... |
| Language model | character | ...to_be_or_not_to_be... | ..., t, o, _, b, e, _, o, r, _, n, o, t, _, t, o, _, b, e, ... | ..., to, o_, _b, be, e_, _o, or, r_, _n, no, ot, t_, _t, to, o_, _b, be, ... | ..., to_, o_b, _be, be_, e_o, _or, or_, r_n, _no, not, ot_, t_t, _to, to_, o_b, _be, ... |
| Word n-gram language model | word | ... to be or not to be ... | ..., to, be, or, not, to, be, ... | ..., to be, be or, or not, not to, to be, ... | ..., to be or, be or not, or not to, not to be, ... |
The table above shows several example sequences and the corresponding 1-gram, 2-gram, and 3-gram sequences.
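The table's rows can be reproduced with the same sliding window; for example, the DNA-sequencing row (a sketch; the `kmers` name echoes the k-mer terminology from the lead and is not from any particular library):

```python
def kmers(sequence, k):
    """Return the length-k substrings (k-mers) of a sequence."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

print(kmers("AGCTTCGA", 2))  # ['AG', 'GC', 'CT', 'TT', 'TC', 'CG', 'GA']
print(kmers("AGCTTCGA", 3))  # ['AGC', 'GCT', 'CTT', 'TTC', 'TCG', 'CGA']
```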
Here are further examples: word-level 3-grams and 4-grams, with the number of times each appeared, from the Google n-gram corpus.[3] A sketch of how such counts are tallied follows the lists.
3-grams
- ceramics collectables collectibles (55)
- ceramics collectables fine (130)
- ceramics collected by (52)
- ceramics collectible pottery (50)
- ceramics collectibles cooking (45)
4-grams
- serve as the incoming (92)
- serve as the incubator (99)
- serve as the independent (794)
- serve as the index (223)
- serve as the indication (72)
- serve as the indicator (120)
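Counts like those above come from tallying every n-gram occurrence in a corpus. A minimal sketch over a toy corpus (the `count_ngrams` helper is illustrative, not the pipeline behind the Google corpus):

```python
from collections import Counter

def count_ngrams(words, n):
    """Tally how many times each word n-gram occurs in a list of words."""
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

corpus = "to be or not to be that is the question".split()
print(count_ngrams(corpus, 2).most_common(2))
# (('to', 'be'), 2) heads the list; every other bigram occurs once.
```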
References
- ^ Broder, Andrei Z.; Glassman, Steven C.; Manasse, Mark S.; Zweig, Geoffrey (1997). "Syntactic clustering of the web". Computer Networks and ISDN Systems 29 (8): 1157–1166. doi:10.1016/s0169-7552(97)00031-7.
- ^ Shannon, Claude E. (1951). "The redundancy of English". Cybernetics; Transactions of the 7th Conference. New York: Josiah Macy, Jr. Foundation.
- ^ "All Our N-gram are Belong to You". Google Research Blog (2006). Archived from the original on 17 October 2006. Retrieved 16 December 2011.
Further reading
- Manning, Christopher D.; Schütze, Hinrich (1999). Foundations of Statistical Natural Language Processing. MIT Press. ISBN 0-262-13360-1.
- White, Owen; Dunning, Ted; Sutton, Granger; Adams, Mark; Venter, J. Craig; Fields, Chris (1993). "A quality control algorithm for DNA sequencing projects". Nucleic Acids Research 21 (16): 3829–3838. doi:10.1093/nar/21.16.3829. PMC 309901. PMID 8367301.
- Damerau, Frederick J. (1971). Markov Models and Linguistic Theory. The Hague: Mouton.
- Figueroa, Alejandro; Atkinson, John (2012). "Contextual Language Models For Ranking Answers To Natural Language Definition Questions". Computational Intelligence 28 (4): 528–548. doi:10.1111/j.1467-8640.2012.00426.x.
- Brocardo, Marcelo Luiz; Traore, Issa; Saad, Sherif; Woungang, Isaac (2013). Authorship Verification for Short Messages Using Stylometry. IEEE International Conference on Computer, Information and Telecommunication Systems (CITS).
External links
- Ngram Extractor: Gives the weight of n-grams based on their frequency.
- Google's Google Books n-gram viewer and Web n-grams database (September 2006)
- STATOPERATOR N-grams Project: Weighted n-gram viewer for every domain in the Alexa Top 1M
- 1,000,000 most frequent 2,3,4,5-grams from the 425 million word Corpus of Contemporary American English
- Peachnote's music ngram viewer
- Stochastic Language Models (n-Gram) Specification (W3C)
- Michael Collins's notes on n-Gram Language Models
- OpenRefine: Clustering In Depth