NAME
    Lingua::JA::NormalizeText - Text Normalizer

SYNOPSIS
      use Lingua::JA::NormalizeText;
      use utf8;

      my @options = ( qw/nfkc decode_entities/, \&dearinsu_to_desu );
      my $normalizer = Lingua::JA::NormalizeText->new(@options);

      print $normalizer->normalize('鳥が㌧㌦でありんす&hearts;');
      # -> 鳥がトンドルです♥

      sub dearinsu_to_desu
      {
          my $text = shift;
          $text =~ s/でありんす/です/g;

          return $text;
      }

    # or

      use Lingua::JA::NormalizeText qw/old2new_kanji/;
      use utf8;

      print old2new_kanji('惡の華');
      # -> 悪の華

DESCRIPTION
    Lingua::JA::NormalizeText normalizes Japanese text by applying a
    configurable sequence of normalization options.

METHODS
  new(@options)
    Creates a new Lingua::JA::NormalizeText instance.

    The following options are available:

      OPTION                 SAMPLE INPUT           OUTPUT FOR SAMPLE INPUT
      ---------------------  ---------------------  -----------------------
      lc                     DdD                    ddd
      uc                     DdD                    DDD
      nfkc                   ㌦                     ドル (length: 2)
      nfkd                   ㌦                     ドル (length: 3)
      nfc
      nfd
      decode_entities        &hearts;               ♥
      strip_html             <em>あ</em>            あ
      alnum_z2h              ＡＢＣ１２３           ABC123
      alnum_h2z              ABC123                 ＡＢＣ１２３
      space_z2h
      space_h2z
      katakana_z2h           ハァハァ               ﾊｧﾊｧ
      katakana_h2z           ｽｰﾊｰｽｰﾊｰ               スーハースーハー
      katakana2hiragana      パンツ                 ぱんつ
      hiragana2katakana      ぱんつ                 パンツ
      wave2tilde             〜, 〰                 ～
      tilde2wave             ～                     〜
      wavetilde2long         〜, 〰, ～             ー
      wave2long              〜, 〰                 ー
      tilde2long             ～                     ー
      fullminus2long         －                     ー
      dashes2long            —                      ー
      drawing_lines2long     ─                      ー
      unify_long_repeats     ヴァーーー             ヴァー
      nl2space               (LF)(CR)(CRLF)         (space)(space)(space)
      unify_nl               (LF)(CR)(CRLF)         \n\n\n
      unify_long_spaces      あ(space)(space)あ     あ(space)あ
      unify_whitespaces      \x{00A0}               (space)
      trim                   (space)あ(space)あ(space)  あ(space)あ
      ltrim                  (space)あ(space)       あ(space)
      rtrim                  ああ(space)(space)     ああ
      old2new_kana           ゐヰゑヱヸヹ           いイえエイ゙エ゙
      old2new_kanji          亞逸鬭                 亜逸闘
      tab2space              (tab)(tab)             (space)(space)
      remove_controls        あ\x{0000}あ           ああ
      remove_spaces          (space)あ(space)あ(space)  ああ
      dakuon_normalize       さ\x{3099}             ざ
      handakuon_normalize    は\x{309A}             ぱ
      all_dakuon_normalize   さ\x{3099}は\x{309A}   ざぱ
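
    The lengths shown for the nfkc and nfkd rows can be checked directly
    with the core module Unicode::Normalize (listed in SEE ALSO). This is
    a minimal sketch that calls Unicode::Normalize itself; it assumes,
    without confirmation from this document, that the nfkc and nfkd
    options behave like NFKC() and NFKD(). The variable names are
    illustrative only.

```perl
use strict;
use warnings;
use utf8;
use Unicode::Normalize qw(NFKC NFKD);

my $square_doru = "\x{3326}";           # U+3326 SQUARE DORU
my $composed    = NFKC($square_doru);   # U+30C9 U+30EB
my $decomposed  = NFKD($square_doru);   # U+30C8 U+3099 U+30EB

binmode STDOUT, ':encoding(UTF-8)';
printf "nfkc: %s (length: %d)\n", $composed,   length $composed;
printf "nfkd: %s (length: %d)\n", $decomposed, length $decomposed;
```

    The decomposed form is one character longer because the voiced sound
    mark (U+3099) is kept as a separate combining character.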

    Options are applied in the order in which they appear in @options: the
    first element is applied first and the last element is applied last.

    You can also pass references to your own functions as options. (See the
    dearinsu_to_desu function in the SYNOPSIS section.)
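
    Because each step receives the output of the previous one, the order
    of @options can change the result. The following is a minimal sketch
    of that idea using only core Perl. It is not the module's
    implementation: apply_in_order, $nfkc and $doru_to_usd are
    hypothetical names introduced for illustration.

```perl
use strict;
use warnings;
use utf8;
use Unicode::Normalize qw(NFKC);

# Hypothetical helper: run each code reference over the text, in list order.
sub apply_in_order {
    my ($text, @steps) = @_;
    $text = $_->($text) for @steps;
    return $text;
}

# Two illustrative steps: a built-in style step and a custom one.
my $nfkc        = sub { return NFKC($_[0]) };
my $doru_to_usd = sub { my $t = shift; $t =~ s/ドル/USD/g; return $t };

# NFKC runs first, so the custom step sees "ドル" and rewrites it.
my $nfkc_first   = apply_in_order("\x{3326}", $nfkc, $doru_to_usd);

# The custom step runs first, finds nothing to replace, then NFKC expands the square.
my $custom_first = apply_in_order("\x{3326}", $doru_to_usd, $nfkc);

binmode STDOUT, ':encoding(UTF-8)';
print "$nfkc_first\n";     # USD
print "$custom_first\n";   # ドル
```

    When a custom function depends on text being in a particular form, it
    is usually safest to list it after the options that produce that form.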

  normalize($text)
    Normalizes $text by applying the options given to new() and returns the
    normalized text.

OPTIONS
  dashes2long
    Note that this option does not convert hyphens into the long vowel mark
    (U+30FC).

  drawing_lines2long
    This option converts drawing lines (such as U+2500) that look similar
    to the long vowel mark (U+30FC).

  unify_long_spaces
    Note that this option unifies only SPACE(U+0020) and IDEOGRAPHIC
    SPACE(U+3000).

  remove_controls
    Note that this option does not remove the following characters:

      CHARACTER TABULATION
      LINE FEED
      CARRIAGE RETURN
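
    The behaviour described above can be approximated with a single
    substitution. This sketch assumes that "control characters" means the
    Unicode general category Cc, which this document does not state;
    remove_controls_sketch is a hypothetical stand-in, not the module's
    code.

```perl
use strict;
use warnings;
use utf8;

# Hypothetical stand-in for remove_controls.
# Assumption: a control character is anything matching \p{Cc}.
sub remove_controls_sketch {
    my $text = shift;
    # Delete every Cc character unless it is TAB, LF, or CR.
    $text =~ s/(?![\t\n\r])\p{Cc}//g;
    return $text;
}

my $cleaned = remove_controls_sketch("あ\x{0000}あ\t\x{007F}\n");
# $cleaned now holds the two hiragana, the TAB, and the LF; NUL and DEL are gone.
```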

  remove_spaces
    Note that this option removes only SPACE(U+0020) and IDEOGRAPHIC
    SPACE(U+3000).

  unify_whitespaces
    This option converts the following characters into SPACE(U+0020).

      LINE TABULATION
      FORM FEED
      NEXT LINE
      NO-BREAK SPACE
      OGHAM SPACE MARK
      MONGOLIAN VOWEL SEPARATOR
      EN QUAD
      EM QUAD
      EN SPACE
      EM SPACE
      THREE-PER-EM SPACE
      FOUR-PER-EM SPACE
      SIX-PER-EM SPACE
      FIGURE SPACE
      PUNCTUATION SPACE
      THIN SPACE
      HAIR SPACE
      LINE SEPARATOR
      PARAGRAPH SEPARATOR
      NARROW NO-BREAK SPACE
      MEDIUM MATHEMATICAL SPACE

    Note that this does not convert the following characters:

      CHARACTER TABULATION
      LINE FEED
      CARRIAGE RETURN
      IDEOGRAPHIC SPACE

AUTHOR
    pawa <pawapawa@cpan.org>

SEE ALSO
    新旧字体表 (table of old and new kanji forms): <http://www.asahi-net.or.jp/~ax2s-kmtn/ref/old_chara.html>

    Lingua::JA::Regular::Unicode

    Lingua::JA::Dakuon

    Lingua::JA::Moji

    Unicode::Normalize

    HTML::Entities

    HTML::Scrubber

LICENSE
    This library is free software; you can redistribute it and/or modify it
    under the same terms as Perl itself.