Article 11449 of comp.lang.perl:
Path: feenix.metronet.com!news.ecn.bgu.edu!usenet.ins.cwru.edu!howland.reston.ans.net!pipex!sunic!trane.uninett.no!nntp.uio.no!hbf
From: h.b.furuseth@usit.uio.no
Newsgroups: comp.lang.perl
Subject: Re: Redefining \w and \b possible?
Date: 11 Mar 1994 20:24:18 GMT
Organization: University of Oslo, Norway
Lines: 86
Message-ID:
References: <1994Mar9.125522.20435@nntp.nta.no> <1994Mar10.014444.4803@netlabs.com>
NNTP-Posting-Host: durin.uio.no
In-reply-to: lwall@netlabs.com's message of Thu, 10 Mar 1994 01:44:44 GMT

In article <1994Mar10.014444.4803@netlabs.com> lwall@netlabs.com (Larry Wall) writes:

> In Perl 5 you'll just have to do &POSIX::setlocale.  In Perl 4 you'd
> have to sneak a setlocale into main() somewhere.  But \b and \w are
> defined in terms of isalpha and isdigit, so it oughta work.

Setlocale looks nice for some applications.  Problem is, the foreigner
who wrote our locales didn't agree that the Norwegian characters can be
represented as 7-bit "[\]{|}".  I can't find any documentation on how
the user can define such a locale, and I suspect there is no reasonably
portable way.

The solution seems to be to add user-defined character classes and
translation tables to the Perl 6 wish list...

In article <1994Mar9.125522.20435@nntp.nta.no> stein@hal.nta.no (Stein Kulseth) writes:

> If not, how can I write a search pattern that will match Norwegian
> word boundaries at either end and anywhere within a string?

Sorry.  Rewrite your code so you don't need the delimiters.  Prepend and
append a blank to your strings before matching, or split out the words
and call functions on them.

This is close to what I'm going to use.  It translates both iso8859-1
and 7-bit Norwegian chars.  Since this will run inside a 2-level loop
over a 5000-line input file, I'd be very grateful for any hints about
how to speed the thing up and still keep the usage simple enough for
perl novices to modify it.

# These can be used both in a tr/.../ (inside an eval) and in s/.../.
$upChars     = 'A-\135\300-\326\330-\336';          # upcase chars
$toDownChars = 'a-\175\340-\366\370-\376';          # ..tr'ed to downcase
$downChars   = 'a-\175\340-\366\370-\376\337\377';  # downcase chars
$toUpChars   = 'A-\135\300-\326\330-\336\337\377';  # ..tr'ed to upcase
$norwChars   = $upChars . $downChars;               # letters
$wordChars   = $norwChars . "0-9";                  # alphanumerics
# Using \135 instead of ] so s/$upChars/../ won't be confused.

$arg1 = '$_[$[]';                       # The argument
eval "
    # Convert (and modify) the args to Norwegian upper/lowercase
    sub upCase   { $arg1 =~ tr/$downChars/$toUpChars/; $arg1; }
    sub downCase { $arg1 =~ tr/$upChars/$toDownChars/; $arg1; }

    # 1. alphanum in string -> uppercase, rest -> lowercase
    sub Capitalize {
        $arg1 =~ tr/$upChars/$toDownChars/;
        $arg1 =~ s/[$wordChars]/&upCase(\$&)/eo;
        $arg1;
    }

    # 1. alphanum in each word -> uppercase, rest -> lowercase
    sub Casify {
        $arg1 =~ s/([$wordChars])([$wordChars]*)/
                   &upCase(\$1) . &downCase(\$2)/geo;
        $arg1;
    }
";

# Example usage -- convert names to correct case

# In names, these words should be in lowercase
%nameTrans = ('Af',  'af',  'Av',  'av',  'De',  'de',  'Jr',  'jr',
              'Den', 'den', 'Der', 'der', 'Van', 'van', 'Von', 'von');

sub convNamePart {
    local($_) = &downCase(shift);
    s/[$wordChars]/&upCase($&)/eo;
    $nameTrans{$_} || do {
        s/^Mc(.)/'Mc' . &upCase($1)/eo;         # Mcneill -> McNeill
        $_;
    }
}

sub convName {
    $_[$[] =~ s/[$wordChars]+/&convNamePart($&)/geo;
    $_[$[];
}

while (<>) {
    print &convName($_);
}
--
Hallvard
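
For readers following Stein's question, here is one way the "prepend and
append a blank" advice above might look in practice.  This is only a
sketch and not code from the post: it assumes Perl 5 syntax (my, use
strict), the helper name has_word is made up for illustration, and the
character ranges are simply copied from $upChars and $downChars above.

```perl
#!/usr/bin/perl
# Illustrative sketch only (Perl 5 syntax); has_word() is a hypothetical
# helper, not part of the original post.
use strict;
use warnings;

# Character ranges copied from the post: ASCII letters plus the 7-bit
# Norwegian "[\]{|}" and the ISO 8859-1 letters, written as octal escapes
# so that ] cannot end the character class early.
my $upChars   = 'A-\135\300-\326\330-\336';
my $downChars = 'a-\175\340-\366\370-\376\337\377';
my $wordChars = $upChars . $downChars . '0-9';

# Returns 1 if $word occurs in $string with no word character on either
# side, 0 otherwise.  Padding with blanks means the start and end of the
# string need no special-casing, so \b is never used.
sub has_word {
    my ($string, $word) = @_;
    my $padded = " $string ";
    return $padded =~ /[^$wordChars]\Q$word\E[^$wordChars]/ ? 1 : 0;
}

print has_word('bl{b{r og fl|te', 'bl{b{r') ? "found\n" : "not found\n";
print has_word('bl{b{rsyltet|y',  'bl{b{r') ? "found\n" : "not found\n";
```

The \Q...\E around $word keeps characters such as { and | in a 7-bit
Norwegian word from being read as regex metacharacters.  The negated
class costs one extra character of matching on each side, which is the
price of not having a redefinable \b.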