Article 8979 of comp.lang.perl: Xref: feenix.metronet.com sci.lang:5114 comp.lang.perl:8979 comp.programming:4409 Newsgroups: sci.lang,comp.lang.perl,comp.programming Path: feenix.metronet.com!news.ecn.bgu.edu!mp.cs.niu.edu!vixen.cso.uiuc.edu!uwm.edu!math.ohio-state.edu!darwin.sura.net!europa.eng.gtefsd.com!library.ucla.edu!agate!boulder!wraeththu.cs.colorado.edu!tchrist From: tchrist@wraeththu.cs.colorado.edu (Tom Christiansen) Subject: soundex questions Message-ID: Sender: news@Colorado.EDU (USENET News System) Organization: University of Colorado, Boulder Date: Fri, 17 Dec 1993 17:22:49 GMT Lines: 166 I have been thinking of soundex matching, except I know nearly nothing about it. I do have a function someone posted once that attempts to deal it. I'm not sure whether the algorithm here is "right" or not. The algorithm is: # return the Soundex value of a string using the following rules: # # 1) remove W and H # 2) remove all vowels except in the first position (A E I O U Y) # 3) recode characters per table: # A E I O U Y 0 # B F P V 1 # C G J K Q S X Z 2 # D T 3 # L 4 # M N 5 # R 6 # # 4) if two adjacent digits are now identical, remove one # 5) truncate to six digits or pad out the result with zeroes to # make six digits # 6) replace the first digit with the first character from the # original word Here's what it shows for some sample misspellings S52350 sunstem S52350 sunstom Z52350 zonstem Z52350 zonstum S52365 sonsterm S52365 sonstrom S52365 sonstromb S52365 sonstromm S52365 sunstorm S52365 sunstromb S52365 sunstrum Z52365 zonstorm S53236 sondstrom S53236 sondstrum S53236 soundstorm S53236 soundstromboner S53236 sundstrom Have you ever played with soundex? What might one do with these? Well, you should be able to look up hits that are close to you numerically and suggest them as possible alternatives. It would take a different database format of course, but that's ok. The problem is that it's not too smart. Some questions/issues: 1. What does it only produce with a six-characters return key? 2. Why doesn't it collapse the initial character as well (S and Z, P and B, etc). 3. Some of consonant clusters could stand being munged up a bit, like -mb -nd, etc. 5. Maybe vowels should have their own series? 1. Y -> I W -> U 2. Collapse duplicates 3. Score remaining vowel clusters into two or three sets, based on open/closed: O U OU EU UE AU EAU I E EI IE AE EA AI The problem with A is "father", "cat", "cake". I'd say more often it's with the latter set than the former. I don't think "coil" and "cowl" should be so close, either. 6. The liquids (L's and R's) in seem too significant, R's perhaps more than L's. "order" and "odor" are closer than it things. 7. What about all the digraphs? Do you dare think about them or not? TH SH CH GN KN PH GH all come to mind. The problem is that they all can give false readings in medial positions, as in "cathair" versus "catheter". Perhaps only in initial and final positions? Some should know about leading silent letter and throw them out (GN KN), others maps into single letter (PH => F), whereas others just go with whatever series they would normally go in, e.g. TH would be in the "D T" series, SH would be in with the S's, etc. Hm... I guess that's why they throw the H's out? But I don't like this: C30000 cot C23000 caught That might not be able to be done right, since then you have to discern "draught" is closer to "raft" than it is to "route", which is itself closer to "drought". Ug. Code follows for people wanting to sample it. #!/usr/bin/perl while (<>) { chop; print &soundex($_), "\t", $_, "\n"; } # soundex.pl # by George Armhold 3/22/92 # improvements by Marc Arnold # return the Soundex value of a string using the following rules: # # 1) remove W and H # 2) remove all vowels except in the first position (A E I O U Y) # 3) recode characters per table: # A E I O U Y 0 # B F P V 1 # C G J K Q S X Z 2 # D T 3 # L 4 # M N 5 # R 6 # # 4) if two adjacent digits are now identical, remove one # 5) truncate to six digits or pad out the result with zeroes to # make six digits # 6) replace the first digit with the first character from the # original word sub soundex { # takes a string as an argument, and returns its soundex value local($pattern) = @_; # upper-case the pattern to normalize matches $pattern =~ tr/a-z/A-Z/; # remove all but alphanumerics, and H,W $pattern =~ tr/A-GI-VX-Z0-9//cd; # remove all vowels after 1st letter ## substr($pattern, 1, length($pattern)) =~ tr/AEIOUY//d; # save first char local($first) = substr($pattern, 0, 1); # replaces letters with numbers and squish identical numbers $pattern =~ tr/BFPVCGJKQSXZDTLMNR0-9/1111222222223345560-9/ds; # remove all vowels after 1st letter substr($pattern, 1, length($pattern)) =~ tr/AEIOUY//d; # replace first letter substr($pattern, 0, 1) = $first; # pad on zeroes if necessary and truncate substr($pattern."000000", 0, 6); } 1; # because this is a require'd file -- Tom Christiansen tchrist@cs.colorado.edu "Will Hack Perl for Fine Food and Fun" Boulder Colorado 303-444-3212 Article 9018 of comp.lang.perl: Xref: feenix.metronet.com sci.lang:5128 comp.lang.perl:9018 comp.programming:4420 Newsgroups: sci.lang,comp.lang.perl,comp.programming Path: feenix.metronet.com!news.utdallas.edu!corpgate!bnrgate!bnr.co.uk!pipex!howland.reston.ans.net!spool.mu.edu!sgiblab!sgigate.sgi.com!olivea!pagesat!news.cerf.net!netlabs!lwall From: lwall@netlabs.com (Larry Wall) Subject: Re: soundex questions Message-ID: <1993Dec19.041550.4229@netlabs.com> Sender: news@netlabs.com Nntp-Posting-Host: scalpel.netlabs.com Organization: NetLabs, Inc. References: Date: Sun, 19 Dec 1993 04:15:50 GMT Lines: 69 In article jfh@netcom.com (Jack Hamilton) writes: : tchrist@wraeththu.cs.colorado.edu (Tom Christiansen) wrote: : : Well, here we were talking about you on the train just the other day, and : Bang, you post about a subject I'm interested in. (I decided you ought to : look like Larry Wall and Larry Wall ought to look like you, by the way.) Not unless you think a cute bald viking looks like a Honda mechanic. :-) : >I have been thinking of soundex matching, except I know nearly nothing : >about it. I do have a function someone posted once that attempts to deal : >it. I'm not sure whether the algorithm here is "right" or not. : : I don't think there is a "right" algorithm, although the one in Knuth is : probably the "standard" algorithm. It's hard to claim that any algorithm is "right" for a problem in fuzzy logic. The basic problems with soundex is that it's trying to solve a number of problems at once, and getting about half of the way there. There are several sources of error in the process. Misperception of spoken sounds. Mistranscription of perceived sounds to writing. Inadequacy of writing to convey spoken distinctions. Quantization boundary effects of the algorithm. Ideally, the computer should be taking the actual spoken sounds and computing the distance in "speech" space to all potential matches (I'll let the linguists argue about whether it should be etic or emic (not to be confused with emetic :-), and if emic, how you handle dialectic differences while doing phoneme recognition), then displaying the list in increasing order of linguistic distance. The soundex algorithm has a rather crude notion of distance: it only distinguishes "short" from "long", just like area codes in the phone system (no pun intended) back in the days when you could get charged long distance for calling someone across the street. Even if you limit yourself to processing written text (this is, after all, cross-posted to comp.lang.perl), you could probably do much better with an approximate matching algorithm that tried not to throw so much information away at the outset, but kept a better notion of linguistic distance. One thing the soundex system does do pretty good at is regularizing the dimensionality of the linguistic space. Perhaps if each "chunk" of soundex data that currently turns into a byte could instead be turned into a location in a small space of its own, then a larger space could be constructed of all the smaller spaces. The question then becomes how many different kinds of small spaces you need. Minimally, a vowel cluster space and a consonent cluster space, but you could differentiate word initial and word final, or use alternate spaces depending on surrounding choices. The phonologist in me is starting to go nuts. How many megabytes am I allowed to use? : Soundex attempts to map the sound of a name to the spelling of a name, and : how words are pronounced depends on a lot of different things. It : certainly depends on the language (the standard algorithm wouldn't work very : well for French, for example) and on the regional and personal speech : patterns of the speaker. Proper names tend to preserve complicated : spellings with simplified pronunciations - think of Chomondeley-Magdalen : (which I've probably misspelled) or Leichester Square. The constuction of the overall space from the small spaces could probably make some guesses about this sort of thing. The prototypical pronunciation of a given name could be stored in a dictionary, and distances compared with that. At some point it becomes more efficient to simply ask, "How do you spell that?" [lERiy ual] lwall@netlabs.com Article 9021 of comp.lang.perl: Xref: feenix.metronet.com sci.lang:5129 comp.lang.perl:9021 comp.programming:4422 Path: feenix.metronet.com!news.ecn.bgu.edu!usenet.ins.cwru.edu!howland.reston.ans.net!vixen.cso.uiuc.edu!moe.ksu.ksu.edu!cbs.ksu.ksu.edu!news From: Steve Davis Newsgroups: sci.lang,comp.lang.perl,comp.programming Subject: Re: soundex questions Date: 19 Dec 1993 06:13:30 -0600 Organization: Kansas State University Lines: 72 Sender: strat@cbs.ksu.ksu.edu (Steve Davis) Message-ID: <2f1ghaINN4dd@cbs.ksu.ksu.edu> References: Reply-To: strat@cis.ksu.edu (Steve Davis) NNTP-Posting-Host: cbs.ksu.ksu.edu tchrist@wraeththu.cs.colorado.edu (Tom Christiansen) writes: :Have you ever played with soundex? What might one do with these? I've played with a couple variations in writing a spelling checker. The algorithm I wound up using is this: (in perl even!) # "ABCDEFGHIJKLMNOPQRSTUVWXYZ" # ".123.12..22455.12623.1.2.2" for (0..255) { push(@map, "."); } $map[ord('B')] = 1; $map[ord('C')] = 2; $map[ord('D')] = 3; $map[ord('F')] = 1; $map[ord('G')] = 2; $map[ord('J')] = 2; $map[ord('K')] = 2; $map[ord('L')] = 4; $map[ord('M')] = 5; $map[ord('N')] = 5; $map[ord('P')] = 1; $map[ord('Q')] = 2; $map[ord('R')] = 6; $map[ord('S')] = 2; $map[ord('T')] = 3; $map[ord('V')] = 1; $map[ord('X')] = 2; $map[ord('Z')] = 2; sub soundex { y/a-z/A-Z/; # Map to upper case. @chars = unpack("C*",$_); # Split up by characters. $first = pack("C", shift(@chars)); # Save first letter. $_ = join("",@map[@chars]); # Map to [.0-9] (see above) tr/0-9//s; # Remove adjacent duplicates. s/\.//go; # Remove the placeholder. $_ = $_ ? $_ : "0"; # "0" for no pattern. $soundex = "$first$_"; return $soundex; } :The problem is that it's not too smart. Well, it was never intended to be. For the purposes of a spelling checker, you might want to remove the last consonant sound ('s, 'ing, and so on) and try looking for root words. Generally this means lopping off the last digit of the return value. :1. What does it only produce with a six-characters return key? They don't all return six characters. The one I found on the net returns either four or 'L0' where L is th efirst letter. :2. Why doesn't it collapse the initial character as well (S and Z, P : and B, etc). The obvious example of soundex usage is looking up a persons name out of an enormous database. Is that name "smith" or "smythe"? After running it through soundex, it generally doesn't matter. But in this case, there would be an incredible number of hits just on the common last name. Keeping the first letter intact helps keep the match hits low and throw out unreasonable matches. :3. Some of consonant clusters could stand being munged up a bit, like : -mb -nd, etc. Yes, sadly. [ Some more questions and points deleted. ] I agree with most of them. Please, write your own! :-) -- Steve Davis (strat@cis.ksu.edu) Kansas State University A billing computer that cheats is not a great public relations ploy. Article 11895 of comp.lang.perl: Path: feenix.metronet.com!news.utdallas.edu!convex!cs.utexas.edu!howland.reston.ans.net!gatech!udel!MathWorks.Com!noc.near.net!newshost.meiko.com!not-for-mail From: mike@meiko.com (Mike Stok) Newsgroups: comp.lang.perl Subject: Soundex (again :-) Date: 23 Mar 1994 19:44:35 -0500 Organization: Meiko Scientific, Inc., MA Lines: 272 Message-ID: <2mqnpj$qk4@hibbert.meiko.com> NNTP-Posting-Host: hibbert.meiko.com Thanks to Rich Pinder for finding a little bug in my soundex code I posted a while back. This showed up when he compared it with the output from Oracle's soundex function, and were caused by leading characters which were different but shared the same soundex code. Here's a fixed shar file... Mike #!/bin/sh # This is a shell archive (produced by shar 3.49) # To extract the files from this archive, save it to a file, remove # everything above the "!/bin/sh" line above, and type "sh file_name". # # made 03/24/1994 00:35 UTC by Mike.Stok@meiko.concord.ma.us # Source directory /tmp_mnt/develop/sw/misc/mike/soundex # # existing files will NOT be overwritten unless -c is specified # # This shar contains: # length mode name # ------ ---------- ------------------------------------------ # 1677 -r--r--r-- soundex.pl # 2408 -r-xr-xr-x soundex.t # # ============= soundex.pl ============== if test -f 'soundex.pl' -a X"$1" != X"-c"; then echo 'x - skipping soundex.pl (File already exists)' else echo 'x - extracting soundex.pl (Text)' sed 's/^X//' << 'SHAR_EOF' > 'soundex.pl' && package soundex; X ;# $Id: soundex.pl,v 1.2 1994/03/24 00:30:27 mike Exp $ ;# ;# Implementation of soundex algorithm as described by Knuth in volume ;# 3 of The Art of Computer Programming, with ideas stolen from Ian ;# Phillips . ;# ;# Mike Stok , 2 March 1994. ;# ;# Knuth's test cases are: ;# ;# Euler, Ellery -> E460 ;# Gauss, Ghosh -> G200 ;# Hilbert, Heilbronn -> H416 ;# Knuth, Kant -> K530 ;# Lloyd, Ladd -> L300 ;# Lukasiewicz, Lissajous -> L222 ;# ;# $Log: soundex.pl,v $ ;# Revision 1.2 1994/03/24 00:30:27 mike ;# Subtle bug (any excuse :-) spotted by Rich Pinder ;# in the way I handles leasing characters which were different but had ;# the same soundex code. This showed up comparing it with Oracle's ;# soundex output. ;# ;# Revision 1.1 1994/03/02 13:01:30 mike ;# Initial revision ;# ;# ;############################################################################## X ;# $soundex'noCode is used to indicate a string doesn't have a soundex ;# code, I like undef other people may want to set it to 'Z000'. X $noCode = undef; X ;# main'soundex ;# ;# usage: ;# ;# @codes = &main'soundex (@wordList); ;# $code = &main'soundex ($word); ;# ;# This strenuously avoids $[ X sub main'soundex { X local (@s, $f, $fc, $_) = @_; X X foreach (@s) X { X tr/a-z/A-Z/; X tr/A-Z//cd; X X if ($_ eq '') X { X $_ = $noCode; X } X else X { X ($f) = /^(.)/; X tr/AEHIOUWYBFPVCGJKQSXZDTLMNR/00000000111122222222334556/; X ($fc) = /^(.)/; X s/^$fc+//; X tr///cs; X tr/0//d; X $_ = $f . $_ . '000'; X s/^(.{4}).*/$1/; X } X } X X wantarray ? @s : shift @s; } X 1; SHAR_EOF chmod 0444 soundex.pl || echo 'restore of soundex.pl failed' Wc_c="`wc -c < 'soundex.pl'`" test 1677 -eq "$Wc_c" || echo 'soundex.pl: original size 1677, current size' "$Wc_c" fi # ============= soundex.t ============== if test -f 'soundex.t' -a X"$1" != X"-c"; then echo 'x - skipping soundex.t (File already exists)' else echo 'x - extracting soundex.t (Text)' sed 's/^X//' << 'SHAR_EOF' > 'soundex.t' && #!./perl ;# ;# $Id: soundex.t,v 1.2 1994/03/24 00:30:27 mike Exp $ ;# ;# test module for soundex.pl ;# ;# $Log: soundex.t,v $ ;# Revision 1.2 1994/03/24 00:30:27 mike ;# Subtle bug (any excuse :-) spotted by Rich Pinder ;# in the way I handles leasing characters which were different but had ;# the same soundex code. This showed up comparing it with Oracle's ;# soundex output. ;# ;# Revision 1.1 1994/03/02 13:03:02 mike ;# Initial revision ;# ;# X require '../lib/soundex.pl'; X $test = 0; print "1..13\n"; X while () { X chop; X next if /^\s*;?#/; X next if /^\s*$/; X X ++$test; X $bad = 0; X X if (/^eval\s+/) X { X ($try = $_) =~ s/^eval\s+//; X X eval ($try); X if ($@) X { X $bad++; X print "not ok $test\n"; X print "# eval '$try' returned $@"; X } X } X elsif (/^\(/) X { X ($in, $out) = split (':'); X X $try = "\@expect = $out; \@got = &soundex $in;"; X eval ($try); X X if (@expect != @got) X { X $bad++; X print "not ok $test\n"; X print "# expected ", scalar @expect, " results, got ", scalar @got, "\n"; X print "# expected (", join (', ', @expect), X ") got (", join (', ', @got), ")\n"; X } X else X { X while (@got) X { X $expect = shift @expect; X $got = shift @got; X X if ($expect ne $got) X { X $bad++; X print "not ok $test\n"; X print "# expected $expect, got $got\n"; X } X } X } X } X else X { X ($in, $out) = split (':'); X X $try = "\$expect = $out; \$got = &soundex ($in);"; X eval ($try); X X if ($expect ne $got) X { X $bad++; X print "not ok $test\n"; X print "# expected $expect, got $got\n"; X } X } X X print "ok $test\n" unless $bad; } X __END__ # # 1..6 # # Knuth's test cases, scalar in, scalar out # 'Euler':'E460' 'Gauss':'G200' 'Hilbert':'H416' 'Knuth':'K530' 'Lloyd':'L300' 'Lukasiewicz':'L222' # # 7..8 # # check default bad code # '2 + 2 = 4':undef undef:undef # # 9 # # check array in, array out # ('Ellery', 'Ghosh', 'Heilbronn', 'Kant', 'Ladd', 'Lissajous'):('E460', 'G200', 'H416', 'K530', 'L300', 'L222') # # 10 # # check array with explicit undef # ('Mike', undef, 'Stok'):('M200', undef, 'S320') # # 11..12 # # check setting $soundex'noCode # eval $soundex'noCode = 'Z000'; ('Mike', undef, 'Stok'):('M200', 'Z000', 'S320') # # 13 # # a subtle difference between me & oracle, spotted by Rich Pinder # # CZARKOWSKA:C622 SHAR_EOF chmod 0555 soundex.t || echo 'restore of soundex.t failed' Wc_c="`wc -c < 'soundex.t'`" test 2408 -eq "$Wc_c" || echo 'soundex.t: original size 2408, current size' "$Wc_c" fi exit 0 -- The "usual disclaimers" apply. | Meiko Mike Stok | 130C Baker Ave. Ext Mike.Stok@meiko.concord.ma.us | Concord, MA 01742 Meiko tel: (508) 371 0088 |