Package org.apache.lucene.analysis.cjk
Analyzer for Chinese, Japanese, and Korean, which indexes bigrams. This analyzer generates bigram
terms, which are overlapping groups of two adjacent Han, Hiragana, Katakana, or Hangul
characters.
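As a rough illustration of what "overlapping groups of two adjacent characters" means, the sketch below builds bigrams in plain Java. It is not Lucene's implementation: the class `CjkBigramSketch` and its methods are hypothetical names introduced here, script detection relies on `Character.UnicodeScript`, and the sketch simply drops non-CJK text and isolated CJK characters without trying to match how the real filter treats them.

```java
import java.lang.Character.UnicodeScript;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Lucene.
class CjkBigramSketch {

    // True for the four scripts named in the package description.
    static boolean isCjk(int codePoint) {
        UnicodeScript script = UnicodeScript.of(codePoint);
        return script == UnicodeScript.HAN
            || script == UnicodeScript.HIRAGANA
            || script == UnicodeScript.KATAKANA
            || script == UnicodeScript.HANGUL;
    }

    // Emits every overlapping pair of adjacent CJK code points.
    // Assumption: a non-CJK code point ends the current run and is skipped.
    static List<String> bigrams(String text) {
        List<String> out = new ArrayList<>();
        int prev = -1; // previous CJK code point, or -1 if there is none
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (!isCjk(cp)) {
                prev = -1;
                continue;
            }
            if (prev != -1) {
                out.add(new String(Character.toChars(prev))
                      + new String(Character.toChars(cp)));
            }
            prev = cp;
        }
        return out;
    }

    public static void main(String[] args) {
        // prints 我是-是中-中国-国人
        System.out.println(String.join("-", bigrams("我是中国人")));
    }
}
```

Each character except the first and last of a run appears in two bigrams, which is why a query for a two-character word can match without any dictionary-based segmentation.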
Three analyzers are provided for Chinese, each of which treats Chinese text in a different way.
- ChineseAnalyzer (in the analyzers/cn package): Indexes unigrams (individual Chinese characters) as tokens.
- CJKAnalyzer (in this package): Indexes bigrams (overlapping groups of two adjacent Chinese characters) as tokens.
- SmartChineseAnalyzer (in the analyzers/smartcn package): Indexes words (attempts to segment Chinese text into words) as tokens.
Example phrase: 我是中国人
- ChineseAnalyzer: 我-是-中-国-人
- CJKAnalyzer: 我是-是中-中国-国人
- SmartChineseAnalyzer: 我-是-中国-人
Class Summary
- CJKAnalyzer: An Analyzer that tokenizes text with StandardTokenizer, normalizes content with CJKWidthFilter, folds case with LowerCaseFilter, forms bigrams of CJK with CJKBigramFilter, and filters stopwords with StopFilter.
- CJKBigramFilter: Forms bigrams of CJK terms that are generated from StandardTokenizer or ICUTokenizer.
- CJKBigramFilterFactory: Factory for CJKBigramFilter.
- CJKWidthCharFilter: A CharFilter that normalizes CJK width differences by folding fullwidth ASCII variants into the equivalent Basic Latin and halfwidth Katakana variants into the equivalent kana.
- CJKWidthCharFilterFactory: Factory for CJKWidthCharFilter.
- CJKWidthFilter: A TokenFilter that normalizes CJK width differences by folding fullwidth ASCII variants into the equivalent Basic Latin and halfwidth Katakana variants into the equivalent kana.
- CJKWidthFilterFactory: Factory for CJKWidthFilter.
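
The width folding described for CJKWidthFilter and CJKWidthCharFilter can be approximated with the JDK alone. The sketch below is an assumption-laden stand-in, not the Lucene classes: `WidthFoldSketch` and `foldWidth` are hypothetical names, and it uses Unicode NFKC normalization from `java.text.Normalizer`, which covers both foldings named above but also rewrites other compatibility characters that these filters may leave untouched.

```java
import java.text.Normalizer;

// Hypothetical helper for illustration only; not part of Lucene.
class WidthFoldSketch {

    // Approximates width folding with NFKC normalization.
    // Assumption: NFKC is broader than the two foldings in the class summary,
    // so this is a superset of that behavior, not an exact equivalent.
    static String foldWidth(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // Fullwidth ASCII variants fold to Basic Latin: prints Lucene
        System.out.println(foldWidth("Ｌｕｃｅｎｅ"));
        // Halfwidth Katakana variants fold to regular kana: prints カタカナ
        System.out.println(foldWidth("ｶﾀｶﾅ"));
    }
}
```

Folding widths before bigrams are formed means that fullwidth and halfwidth spellings of the same text produce the same terms at index and query time.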