NAME
    Lingua::JA::NormalizeText - All-in-One Japanese text normalizer

SYNOPSIS
      use Lingua::JA::NormalizeText;
      use utf8;

      # Options are applied in order; a code reference can be mixed in.
      my @options = ( qw/nfkc decode_entities/, \&dearinsu_to_desu );
      my $normalizer = Lingua::JA::NormalizeText->new(@options);

      my $text = $normalizer->normalize('鳥が㌧㌦でありんす&hearts;');
      # => '鳥がトンドルです♥'

      # User-defined normalizer: takes the text and returns the new text.
      sub dearinsu_to_desu
      {
          my $text = shift;
          $text =~ s/でありんす/です/g;

          return $text;
      }

    # or

      use Lingua::JA::NormalizeText qw/old2new_kanji/;
      use utf8;

      my $text = old2new_kanji('惡の華');
      # => '悪の華'

DESCRIPTION
    All-in-One Japanese text normalizer.

METHODS
  new(@options)
    Creates a new Lingua::JA::NormalizeText instance.

    The following options are available:

      OPTION                 SAMPLE INPUT                  OUTPUT FOR SAMPLE INPUT
      ---------------------  ----------------------------  -----------------------
      lc                     DdD                           ddd
      uc                     DdD                           DDD
      nfkc                   ｶﾞ                            ガ (U+30AC)
      nfkd                   ｶﾞ                            ガ (U+30AB, U+3099)
      nfc                    ド                            ド (U+30C9)
      nfd                    ド                            ド (U+30C8, U+3099)
      decode_entities        &hearts;                      ♥
      strip_html             <em>あ</em>                   あ
      alnum_z2h              ＡＢＣ１２３                  ABC123
      alnum_h2z              ABC123                        ＡＢＣ１２３
      space_z2h              \x{3000}                      \x{0020}
      space_h2z              \x{0020}                      \x{3000}
      katakana_z2h           ハァハァ                      ﾊｧﾊｧ
      katakana_h2z           ｽｰﾊｰｽｰﾊｰ                      スーハースーハー
      katakana2hiragana      パンツ                        ぱんつ
      hiragana2katakana      ぱんつ                        パンツ
      wave2tilde             〜, 〰                        ～
      tilde2wave             ～                            〜
      wavetilde2long         〜, 〰, ～                    ー
      wave2long              〜, 〰                        ー
      tilde2long             ～                            ー
      fullminus2long         －                            ー
      dashes2long            —                             ー
      drawing_lines2long     ─                             ー
      unify_long_repeats     ヴァーーー                    ヴァー
      nl2space               (LF)(CR)(CRLF)                (space)(space)(space)
      unify_nl               (LF)(CR)(CRLF)                \n\n\n
      unify_long_spaces      あ(space)(space)あ            あ(space)あ
      unify_whitespaces      \x{00A0}                      (space)
      trim                   (space)あ(space)あ(space)     あ(space)あ
      ltrim                  (space)あ(space)              あ(space)
      rtrim                  ああ(space)(space)            ああ
      old2new_kana           ゐヰゑヱヸヹ                  いイえエイ゙エ゙
      old2new_kanji          亞逸鬭                        亜逸闘
      tab2space              (tab)(tab)                    (space)(space)
      remove_controls        あ\x{0000}あ                  ああ
      remove_spaces          \x{0020}あ\x{3000}あ\x{0020}  ああ
      dakuon_normalize       さ\x{3099}                    ざ (U+3056)
      handakuon_normalize    は\x{309A}                    ぱ (U+3071)
      all_dakuon_normalize   さ\x{3099}は\x{309A}          ざぱ (U+3056, U+3071)
      square2katakana        ㌢                            センチ
      circled2kana           ㋙㋛㋑㋟㋑                    コシイタイ
      circled2kanji          ㊩㊫㊚㊒㊖                    医学男有財

    The options are applied in the order in which they appear in @options:
    the first element is applied first, and the last element is applied
    last.

    You can also pass references to your own functions. (See the
    dearinsu_to_desu function in the SYNOPSIS section.)

  normalize($text)
    Normalizes $text.

OPTIONS
  lc, uc
    These options are the same as CORE::lc and CORE::uc.

  nfkc, nfkd, nfc, nfd
    See Unicode::Normalize.

  decode_entities
    See HTML::Entities.

  strip_html
    Strips the HTML tags from the given text.

  alnum_z2h, alnum_h2z
    Converts English letters, digits and symbols between ZENKAKU
    (full-width) and HANKAKU (half-width).

    ZENKAKU:

      ＇［ｖｏ，～４ｃ９Ｆｕ＿ＭＧＴＷＰｑ￣｠ＶｉＩｒ：ＺＸ］ｌ＞
      ｝￦！｜ｘ６％ｔ＾８ｅＤＫ５ｊ－￠ｈ１｛Ｕ２ＮＨ＆０＃Ｏｎ￢
      ＠｟ｆ３ＱａｐＪ￥？Ａｗ＼＄＂ＢｍＣ７；￤＝ｙ＋ｇＹＲｂＬｋ
      ）Ｓ｀Ｅ（￡＊．ｚｓ／＜ｄ

    HANKAKU:

      '[vo,~4c9Fu_MGTWPq¯⦆ViIr:ZX]l>
      }₩!|x6%t^8eDK5j-¢h1{U2NH&0#On¬
      @⦅f3QapJ¥?Aw\$"BmC7;¦=y+gYRbLk
      )S`E(£*.zs/<d

  space_z2h, space_h2z
    SPACE (U+0020) <-> IDEOGRAPHIC SPACE (U+3000)

  katakana_z2h, katakana_h2z
    Converts katakana between ZENKAKU and HANKAKU.
    See Lingua::JA::Regular::Unicode.

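    As a rough illustration of what the alnum_z2h and space_z2h entries
    above describe, the following core-Perl sketch folds the contiguous
    FULLWIDTH ASCII block and the ideographic space onto their half-width
    counterparts. It is not the module's implementation: the helper names
    alnum_z2h_sketch and space_z2h_sketch are hypothetical, and the sketch
    deliberately ignores the extra symbols in the ZENKAKU table that lie
    outside U+FF01 - U+FF5E.

```perl
use strict;
use warnings;

# Hypothetical helper: fold FULLWIDTH ASCII variants (U+FF01 - U+FF5E)
# onto their ASCII counterparts (U+0021 - U+007E). Symbols outside this
# block are left untouched by this sketch.
sub alnum_z2h_sketch {
    my $text = shift;
    $text =~ tr/\x{FF01}-\x{FF5E}/\x{0021}-\x{007E}/;
    return $text;
}

# Hypothetical helper: IDEOGRAPHIC SPACE (U+3000) -> SPACE (U+0020).
sub space_z2h_sketch {
    my $text = shift;
    $text =~ tr/\x{3000}/\x{0020}/;
    return $text;
}

# Fullwidth "ABC", an ideographic space, then fullwidth "123".
my $zenkaku = "\x{FF21}\x{FF22}\x{FF23}\x{3000}\x{FF11}\x{FF12}\x{FF13}";
my $hankaku = space_z2h_sketch(alnum_z2h_sketch($zenkaku));
```

    With the module itself, the same effect would presumably come from
    passing alnum_z2h and space_z2h to new(), in that order.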
  hiragana2katakana
    INPUT:

      ぷゔにむていでべゞゐふとおりげそづよはつざしゃのっねひぃたょ
      けまれびやがぽぬぺくぞぱごをへずかぴゅゎあきゖぇどだろもえわ
      んぶぜめなちばぢるすぁゕぼらぉゝぐほさゑぎみせじこぅゆう

    OUTPUT FOR INPUT:

      プヴニムテイデベヾヰフトオリゲソヅヨハツザシャノッネヒィタョ
      ケマレビヤガポヌペクゾパゴヲヘズカピュヮアキヶェドダロモエワ
      ンブゼメナチバヂルスァヵボラォヽグホサヱギミセジコゥユウ

  katakana2hiragana
    INPUT:

      ﾘボズシｷｭﾙﾈグネキェヱテクニトロドェコヽチガヘトゥダヤレ
      ﾆチソノｿｻパヨｧﾉﾊゴゲｫヮモヰルヲムｱﾃゼポフハャサッラ
      マアィョウオオクメユゥヂギメウナススラセザブフヘｺｶペカｲヾ
      エワヴンタャホョヨツゾバプモセムケリデミミホケイヒッユツマヵ
      タレピジシヌビヅヌィンエァォヶナヲュヤロﾋベワ

    OUTPUT FOR INPUT:

      りぼずしきゅるねぐねきぇゑてくにとろどぇこゝちがへとぅだやれ
      にちそのそさぱよぁのはごげぉゎもゐるをむあてぜぽふはゃさっら
      まあぃょうおおくめゆぅぢぎめうなすすらせざぶふへこかぺかいゞ
      えわゔんたゃほょよつぞばぷもせむけりでみみほけいひっゆつまゕ
      たれぴじしぬびづぬぃんえぁぉゖなをゅやろひべわ

  wave2tilde
    Converts WAVE DASH (U+301C) and WAVY DASH (U+3030) into tilde (U+FF5E).

  tilde2wave
    Converts tilde (U+FF5E) into WAVE DASH (U+301C).

  wavetilde2long
    Converts WAVE DASH (U+301C), WAVY DASH (U+3030) and tilde (U+FF5E)
    into long (U+30FC).

  wave2long
    Converts WAVE DASH (U+301C) and WAVY DASH (U+3030) into long (U+30FC).

  tilde2long
    Converts tilde (U+FF5E) into long (U+30FC).

  fullminus2long
    Converts FULLWIDTH HYPHEN-MINUS (U+FF0D) into long (U+30FC).

  dashes2long
    Converts the following characters into long (U+30FC).

      U+2012 FIGURE DASH
      U+2013 EN DASH
      U+2014 EM DASH
      U+2015 HORIZONTAL BAR

    Note that this option does not convert hyphens into long.

  drawing_lines2long
    Converts the following characters into long (U+30FC).

      U+2500 BOX DRAWINGS LIGHT HORIZONTAL
      U+2501 BOX DRAWINGS HEAVY HORIZONTAL
      U+254C BOX DRAWINGS LIGHT DOUBLE DASH HORIZONTAL
      U+254D BOX DRAWINGS HEAVY DOUBLE DASH HORIZONTAL
      U+2574 BOX DRAWINGS LIGHT LEFT
      U+2576 BOX DRAWINGS LIGHT RIGHT
      U+2578 BOX DRAWINGS HEAVY LEFT
      U+257A BOX DRAWINGS HEAVY RIGHT

  unify_long_repeats
    Collapses a run of repeated long (U+30FC) into a single long.

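    The long-mark options above can be approximated with core Perl alone.
    The sketch below is not the module's implementation; the helper names
    dashes2long_sketch, unify_long_repeats_sketch and apply_in_order are
    hypothetical and exist only for illustration. It also shows why the
    order of the elements of @options matters: unifying repeats before
    converting dashes leaves a doubled long behind.

```perl
use strict;
use warnings;

# Hypothetical helper: map the four dash characters listed under
# dashes2long (U+2012 - U+2015) onto long (U+30FC).
sub dashes2long_sketch {
    my $text = shift;
    $text =~ tr/\x{2012}\x{2013}\x{2014}\x{2015}/\x{30FC}/;
    return $text;
}

# Hypothetical helper: collapse a run of two or more long marks into one.
sub unify_long_repeats_sketch {
    my $text = shift;
    $text =~ s/\x{30FC}{2,}/\x{30FC}/g;
    return $text;
}

# Apply code references first to last, like the elements of @options.
sub apply_in_order {
    my ($text, @steps) = @_;
    $text = $_->($text) for @steps;
    return $text;
}

# KATAKANA VU, small A, long, EM DASH
my $input = "\x{30F4}\x{30A1}\x{30FC}\x{2014}";

# Dashes first, then unify: the two long marks collapse into one.
my $converted_first = apply_in_order(
    $input, \&dashes2long_sketch, \&unify_long_repeats_sketch);

# Unify first, then dashes: a doubled long mark remains.
my $unified_first = apply_in_order(
    $input, \&unify_long_repeats_sketch, \&dashes2long_sketch);
```

    In practice this suggests listing the *2long options before
    unify_long_repeats when both are passed to new().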
  nl2space
    Converts new lines (LF, CR, CRLF) into SPACE (U+0020).

  unify_nl
    Converts every new line (LF, CR, CRLF) into \n.

  unify_long_spaces
    Unifies runs of consecutive spaces (U+0020 and U+3000).

  unify_whitespaces
    Converts the following characters into SPACE (U+0020).

      U+000B LINE TABULATION
      U+000C FORM FEED
      U+0085 NEXT LINE
      U+00A0 NO-BREAK SPACE
      U+1680 OGHAM SPACE MARK
      U+180E MONGOLIAN VOWEL SEPARATOR
      U+2000 EN QUAD
      U+2001 EM QUAD
      U+2002 EN SPACE
      U+2003 EM SPACE
      U+2004 THREE-PER-EM SPACE
      U+2005 FOUR-PER-EM SPACE
      U+2006 SIX-PER-EM SPACE
      U+2007 FIGURE SPACE
      U+2008 PUNCTUATION SPACE
      U+2009 THIN SPACE
      U+200A HAIR SPACE
      U+2028 LINE SEPARATOR
      U+2029 PARAGRAPH SEPARATOR
      U+202F NARROW NO-BREAK SPACE
      U+205F MEDIUM MATHEMATICAL SPACE

    Note that this option does not convert the following characters:

      U+0009 CHARACTER TABULATION
      U+000A LINE FEED
      U+000D CARRIAGE RETURN
      U+3000 IDEOGRAPHIC SPACE

  trim
    Removes leading and trailing whitespace.

  ltrim
    Removes only leading whitespace.

  rtrim
    Removes only trailing whitespace.

  old2new_kana
      INPUT  OUTPUT FOR INPUT
      -----  --------------------
      ゐ     い
      ヰ     イ
      ゑ     え
      ヱ     エ
      ヸ     イ゙ (U+30A4, U+3099)
      ヹ     エ゙ (U+30A8, U+3099)

  old2new_kanji
    INPUT:

      亞惡壓圍爲醫壹逸稻飮隱營榮衞驛謁圓緣艷鹽奧應橫歐毆黃溫穩假價
      禍畫會壞悔懷海繪慨槪擴殼覺學嶽樂喝渴褐勸卷寬歡漢罐觀關陷顏器
      既歸氣祈龜僞戲犧舊據擧虛峽挾狹鄕響曉勤謹區驅勳薰徑惠揭溪經繼
      莖螢輕鷄藝擊缺儉劍圈檢權獻硏縣險顯驗嚴效廣恆鑛號國穀黑濟碎齋
      劑櫻册殺雜參慘棧蠶贊殘祉絲視齒兒辭濕實舍寫煮社者釋壽收臭從澁
      獸縱祝肅處暑緖署諸敍奬將涉燒祥稱證乘剩壤孃條淨狀疊讓釀囑觸寢
      愼眞神盡圖粹醉隨髓數樞瀨聲靜齊攝竊節專戰淺潛纖踐錢禪曾祖僧雙
      壯層搜插巢爭瘦總莊裝騷增憎臟藏贈卽屬續墮體對帶滯臺瀧擇澤單嘆
      擔膽團彈斷癡遲晝蟲鑄著廳徵懲聽敕鎭塚遞鐵轉點傳都黨盜燈當鬭德
      獨讀突屆繩難貳惱腦霸廢拜梅賣麥發髮拔繁晚蠻卑碑祕濱賓頻敏甁侮
      福拂佛倂塀竝變邊勉辨瓣辯舖步穗寶襃豐墨沒飜每萬滿免麵默餠戾彌
      藥譯豫餘與譽搖樣謠來賴亂欄覽隆龍虜兩獵綠壘淚類勵禮隸靈齡曆歷
      戀練鍊爐勞廊朗樓郞錄灣堯巖晉槇渚猪琢瑤祐祿禎穰聰遙

    OUTPUT FOR INPUT:

      亜悪圧囲為医壱逸稲飲隠営栄衛駅謁円縁艶塩奥応横欧殴黄温穏仮価
      禍画会壊悔懐海絵慨概拡殻覚学岳楽喝渇褐勧巻寛歓漢缶観関陥顔器
      既帰気祈亀偽戯犠旧拠挙虚峡挟狭郷響暁勤謹区駆勲薫径恵掲渓経継
      茎蛍軽鶏芸撃欠倹剣圏検権献研県険顕験厳効広恒鉱号国穀黒済砕斎
      剤桜冊殺雑参惨桟蚕賛残祉糸視歯児辞湿実舎写煮社者釈寿収臭従渋
      獣縦祝粛処暑緒署諸叙奨将渉焼祥称証乗剰壌嬢条浄状畳譲醸嘱触寝
      慎真神尽図粋酔随髄数枢瀬声静斉摂窃節専戦浅潜繊践銭禅曽祖僧双
      壮層捜挿巣争痩総荘装騒増憎臓蔵贈即属続堕体対帯滞台滝択沢単嘆
      担胆団弾断痴遅昼虫鋳著庁徴懲聴勅鎮塚逓鉄転点伝都党盗灯当闘徳
      独読突届縄難弐悩脳覇廃拝梅売麦発髪抜繁晩蛮卑碑秘浜賓頻敏瓶侮
      福払仏併塀並変辺勉弁弁弁舗歩穂宝褒豊墨没翻毎万満免麺黙餅戻弥
      薬訳予余与誉揺様謡来頼乱欄覧隆竜虜両猟緑塁涙類励礼隷霊齢暦歴
      恋練錬炉労廊朗楼郎録湾尭巌晋槙渚猪琢瑶祐禄禎穣聡遥

  tab2space
    Converts CHARACTER TABULATION (U+0009) into SPACE (U+0020).

  remove_controls
    Removes the following characters:

      U+0000 - U+0008
      U+000B
      U+000C
      U+000E - U+001F
      U+007F - U+009F

    Note that this option does not remove the following characters:

      U+0009 CHARACTER TABULATION
      U+000A LINE FEED
      U+000D CARRIAGE RETURN

  remove_spaces
    Removes SPACE (U+0020) and IDEOGRAPHIC SPACE (U+3000).

  dakuon_normalize, handakuon_normalize, all_dakuon_normalize
    See Lingua::JA::Dakuon.

  square2katakana, circled2kana, circled2kanji
    See Lingua::JA::Moji.

AUTHOR
    pawa <pawapawa@cpan.org>

SEE ALSO
    新旧字体表: <http://www.asahi-net.or.jp/~ax2s-kmtn/ref/old_chara.html>

    Lingua::JA::Regular::Unicode

    Lingua::JA::Dakuon

    Lingua::JA::Moji

    Unicode::Normalize

    Unicode::Number

    HTML::Entities

    HTML::Scrubber

LICENSE
    This library is free software; you can redistribute it and/or modify it
    under the same terms as Perl itself.