Article 8670 of comp.lang.perl:
Xref: feenix.metronet.com comp.lang.perl:8670
Path: feenix.metronet.com!news.utdallas.edu!hermes.chpc.utexas.edu!cs.utexas.edu!howland.reston.ans.net!newsserver.jvnc.net!yale.edu!yale!zip.eecs.umich.edu!not-for-mail
From: seraphim@umcc.umcc.umich.edu (Henry Hardy)
Newsgroups: comp.lang.perl
Subject: Zipfian perl script -- help to improve
Date: 6 Dec 1993 14:28:12 -0500
Organization: none
Lines: 22
Message-ID: <2e014c$6un@umcc.umcc.umich.edu>
NNTP-Posting-Host: umcc.umcc.umich.edu
Summary: program to count number of occurrences of words in a text
Keywords: word rank order frequency analysis script Zipf linguistics cryptography Perl sed awk

Here's a one line script I wrote to do word rank-order frequency
analysis on a text (messages from sci.physics). I am doing a "Zipfian
analysis" after the work of George Zipf.

The script takes a file called infile and writes each word with its
number of occurrences to outfile, in ascending order of count
(alphabetical within each count). Here is the script:

    perl -ne 'print map("$_\n", split);' infile | sort | uniq -c | sort > outfile

Now, since I have never used perl before, I need a bit of guidance to
improve this thing.

1) Downcase all capital letters to minuscule (i.e. 'uncapitalize'
   words, acronyms, etc.), so that 'The' and 'the' etc. will be
   collapsed together.

2) Break words on all non-alphanumeric characters, not just on
   whitespace.

If someone can come up with an elegant way of doing these (or even a
non-elegant way) in sed, awk, or perl, please respond.

thanks!

--HH.
seraphim@umcc.umich.edu