NAME
    Lingua::JA::NormalizeText - Text Normalizer

SYNOPSIS
      use Lingua::JA::NormalizeText;
      use utf8;

      # Options are applied in order; a code reference adds a custom step.
      my @options = ( qw/nfkc decode_entities/, \&dearinsu_to_desu );
      my $normalizer = Lingua::JA::NormalizeText->new(@options);

      print $normalizer->normalize('鳥が㌧㌦でありんす&hearts;');
      # -> 鳥がトンドルです♥

      sub dearinsu_to_desu {
          my $text = shift;
          $text =~ s/でありんす/です/g;

          return $text;
      }

      # or

      use Lingua::JA::NormalizeText qw/old2new_kanji/;
      use utf8;

      print old2new_kanji('惡の華');
      # -> 悪の華

DESCRIPTION
    Lingua::JA::NormalizeText normalizes text.

METHODS
  new(@options)
    Creates a new Lingua::JA::NormalizeText instance.

    The following options are available:

      OPTION                 SAMPLE INPUT               OUTPUT FOR SAMPLE INPUT
      ---------------------  -------------------------  -----------------------
      lc                     DdD                        ddd
      uc                     DdD                        DDD
      nfkc                   ㌦                         ドル (length: 2)
      nfkd                   ㌦                         ドル (length: 3)
      nfc
      nfd
      decode_entities        &hearts;                   ♥
      strip_html             <em>あ</em>                あ
      alnum_z2h              ＡＢＣ１２３               ABC123
      alnum_h2z              ABC123                     ＡＢＣ１２３
      space_z2h
      space_h2z
      katakana_z2h           ハァハァ                   ﾊｧﾊｧ
      katakana_h2z           ｽｰﾊｰｽｰﾊｰ                   スーハースーハー
      katakana2hiragana      パンツ                     ぱんつ
      hiragana2katakana      ぱんつ                     パンツ
      wave2tilde             〜, 〰                     ～
      tilde2wave             ～                         〜
      wavetilde2long         〜, 〰, ～                 ー
      wave2long              〜, 〰                     ー
      tilde2long             ～                         ー
      fullminus2long         －                         ー
      dashes2long            —                          ー
      drawing_lines2long     ─                          ー
      unify_long_repeats     ヴァーーー                 ヴァー
      nl2space               (LF)(CR)(CRLF)             (space)(space)(space)
      unify_nl               (LF)(CR)(CRLF)             \n\n\n
      unify_long_spaces      あ(space)(space)あ         あ(space)あ
      unify_whitespaces      \x{00A0}                   (space)
      trim                   (space)あ(space)あ(space)  あ(space)あ
      ltrim                  (space)あ(space)           あ(space)
      rtrim                  ああ(space)(space)         ああ
      old2new_kana           ゐヰゑヱヸヹ               いイえエイ゙エ゙
      old2new_kanji          亞逸鬭                     亜逸闘
      tab2space              (tab)(tab)                 (space)(space)
      remove_controls        あ\x{0000}あ               ああ
      remove_spaces          (space)あ(space)あ(space)  ああ
      dakuon_normalize       さ\x{3099}                 ざ
      handakuon_normalize    は\x{309A}                 ぱ
      all_dakuon_normalize   さ\x{3099}は\x{309A}       ざぱ
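
    The length difference in the nfkc and nfkd rows of the table can be
    checked without this module. The following is a minimal sketch that
    assumes the nfkc and nfkd options behave like the NFKC and NFKD
    functions of Unicode::Normalize (listed under SEE ALSO); it is not
    taken from this module's implementation.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFKC NFKD);

binmode STDOUT, ':encoding(UTF-8)';

# U+3326 SQUARE DORU, the sample input of the nfkc and nfkd rows.
my $input = "\x{3326}";

# Assumption: the nfkc/nfkd options match Unicode::Normalize's functions.
my $nfkc = NFKC($input);   # composed form:   U+30C9 U+30EB
my $nfkd = NFKD($input);   # decomposed form: U+30C8 U+3099 U+30EB

printf "nfkc: %s (length: %d)\n", $nfkc, length $nfkc;   # length: 2
printf "nfkd: %s (length: %d)\n", $nfkd, length $nfkd;   # length: 3

# List the code points of the decomposed form to show the combining mark.
print join( ' ', map { sprintf 'U+%04X', ord } split //, $nfkd ), "\n";
```

    The two results look alike when printed, but the nfkd form keeps the
    voiced sound mark as a separate combining character (U+3099), which
    accounts for the extra character.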

    Options are applied in the order in which they appear in @options:
    the first element is applied first and the last element is applied
    last.

    External functions can also be added as code references (see the
    dearinsu_to_desu function in the SYNOPSIS section).

  normalize($text)
    Normalizes $text and returns the result.

OPTIONS
  dashes2long
    Note that this option does not convert hyphens into the long sound
    mark (U+30FC).

  drawing_lines2long
    This option converts drawing lines that look similar to the long
    sound mark (U+30FC).

  unify_long_spaces
    Note that this option unifies only SPACE (U+0020) and IDEOGRAPHIC
    SPACE (U+3000).

  remove_controls
    Note that this option does not remove the following characters:

      CHARACTER TABULATION
      LINE FEED
      CARRIAGE RETURN

  remove_spaces
    Note that this option removes only SPACE (U+0020) and IDEOGRAPHIC
    SPACE (U+3000).

  unify_whitespaces
    This option converts the following characters into SPACE (U+0020):

      LINE TABULATION
      FORM FEED
      NEXT LINE
      NO-BREAK SPACE
      OGHAM SPACE MARK
      MONGOLIAN VOWEL SEPARATOR
      EN QUAD
      EM QUAD
      EN SPACE
      EM SPACE
      THREE-PER-EM SPACE
      FOUR-PER-EM SPACE
      SIX-PER-EM SPACE
      FIGURE SPACE
      PUNCTUATION SPACE
      THIN SPACE
      HAIR SPACE
      LINE SEPARATOR
      PARAGRAPH SEPARATOR
      NARROW NO-BREAK SPACE
      MEDIUM MATHEMATICAL SPACE

    Note that this option does not convert the following characters:

      CHARACTER TABULATION
      LINE FEED
      CARRIAGE RETURN
      IDEOGRAPHIC SPACE

AUTHOR
    pawa <pawapawa@cpan.org>

SEE ALSO
    新旧字体表 (table of old and new kanji forms):
    <http://www.asahi-net.or.jp/~ax2s-kmtn/ref/old_chara.html>

    Lingua::JA::Regular::Unicode

    Lingua::JA::Dakuon

    Lingua::JA::Moji

    Unicode::Normalize

    HTML::Entities

    HTML::Scrubber

LICENSE
    This library is free software; you can redistribute it and/or modify
    it under the same terms as Perl itself.