Mixed Case Conversion

This section contains notes on how to write a program that takes a file in all upper case and converts it into a plausible mixed case file. The program acts as a filter, with usage looking like:

	mixed.case.pl < old.file > new.file


I want to preserve the line structure and other whitespace in the file. To support this, write a routine which returns tokens from the file. The tokens will be:

  • Whitespace -- Any amount of whitespace, which the higher level will simply pass on to the output file.
  • Word -- A bunch of characters which look like a word.
  • Punctuation -- A character which has special significance. Right now, that is only a period.

The routine calling the token getter will make every word lower case. It will then make the first letter upper case if the word is the first one in the file, or if a period has been seen since the previous word.

A further scheme to make this a little better is to recognize certain words, and map them in a special fashion.


The token getting routine is central to this program. This will be a routine which looks at the remainder of the current line, and does something different based upon the contents.

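	# Make sure there is a line to work with; read the next one if not.
	if (!defined($OldLine) || length($OldLine) == 0)
	    {
	    $OldLine = <>;
	    return "" if !defined($OldLine);    # End of input
	    # The trailing newline stays on the line, so it is returned
	    # as whitespace in its original position.
	    }

	$val = $OldLine;
	if ($OldLine =~ /^\s/)
	    {
	    $val =~ s/\S.*//s;      # Delete from the first non-space
	    }
	elsif ($OldLine =~ /^\w/)
	    {
	    $val =~ s/\W.*//s;      # Delete from the first non-word character
	    }
	else
	    {
	    $val = substr($OldLine, 0, 1);      # Any other single character
	    }

	# Remove the token from the front of the line and hand it back.
	$OldLine = substr($OldLine, length($val));
	return $val;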

That looks rather ugly, and there are undoubtedly ways to clean it up. But the basic idea is that we ensure that we have a line to work with, and if we don't, we read the next line.

Once we have a line, we look to see whether it starts with a run of whitespace or a run of word characters. If so, we package that run up and return it. Otherwise we take just a single character and return that. Note that it isn't strictly necessary to package up the run of whitespace; we could have let it go through a character at a time. However, it is necessary to package up the word characters, so that we can do special case substitution based upon words in the main program.
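
As one possible cleanup, the three cases can be folded into a single regular expression. The following is only a sketch and not the code in mixed.case.pl: the helper name nextToken is made up for illustration, and it works on a string passed by reference rather than on lines read from the input, so it can be tried without a file.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of mixed.case.pl): remove and return the
# next token from the string referenced by $ref, or "" when it is empty.
sub nextToken
    {
    my ($ref) = @_;
    # A run of whitespace, a run of word characters, or any one character.
    # The /s flag lets "." match a newline as well.
    return $1 if $$ref =~ s/^(\s+|\w+|.)//s;
    return "";
    }

my $line = "HELLO,  WORLD.\n";
my @tokens;
while ((my $token = nextToken(\$line)) ne "")
    {
    push @tokens, $token;
    }
# @tokens now holds: "HELLO", ",", "  ", "WORLD", ".", "\n"
```

The order of the alternatives matters only in that the single-character case comes last, so it is used just when neither kind of run matches.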


The next phase is looking at the main program which calls the getToken routine we defined above. This would look something like:

        $EndOfSentence = 1;
        while (($token = getToken()) ne "")
            {
            if ($token eq ".")
                {
                $EndOfSentence = 1;
                }
            elsif ($token =~ /^\w/)
                {
                $token = lc($token);
                if ($EndOfSentence)
                    {
                    $token = ucfirst($token);
                    }
                $EndOfSentence = 0;
                }
            print $token;
            }

This goes through the tokens, and notes when a "." goes by so that the next word can be capitalized. If the token is a word token, it is converted to lower case with the lc function. If the first letter should be upper case because of a previous period, we convert it back with ucfirst. In any case, once we have written a word, we turn the EndOfSentence flag off.


The last point in this program is taking care of special cases. For instance, the town of Corvallis should be capitalized regardless of where it occurs in a sentence. Other words, such as the time zone "PST" should be in all capital letters.

This is handled by creating an associative array with the exceptions.

        $Special{"portland"}    = "Portland";
        $Special{"corvallis"}   = "Corvallis";
        $Special{"pst"}         = "PST";
        $Special{"pdt"}         = "PDT";

Then the main loop needs to be modified to check whether the lower case token is a key in the Special array, and to use the stored value if it is. If it is not, the normal conversions apply.
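
A minimal sketch of that modified loop follows. It is an illustration under stated assumptions, not the code in mixed.case.pl: the wrapper name mixedCase is made up, and it tokenizes a string with one regular expression as a stand-in for getToken so that the example runs on its own.

```perl
use strict;
use warnings;

my %Special = (
    "portland"  => "Portland",
    "corvallis" => "Corvallis",
    "pst"       => "PST",
    "pdt"       => "PDT",
    );

# Hypothetical wrapper: converts one string instead of reading tokens
# from getToken.
sub mixedCase
    {
    my ($text) = @_;
    my $EndOfSentence = 1;
    my $result = "";
    # Stand-in for getToken: whitespace runs, word runs, single characters.
    my @tokens = $text =~ /(\s+|\w+|.)/gs;
    foreach my $token (@tokens)
        {
        if ($token eq ".")
            {
            $EndOfSentence = 1;
            }
        elsif ($token =~ /^\w/)
            {
            $token = lc($token);
            if (exists $Special{$token})
                {
                $token = $Special{$token};      # The exception list wins
                }
            elsif ($EndOfSentence)
                {
                $token = ucfirst($token);
                }
            $EndOfSentence = 0;
            }
        $result .= $token;
        }
    return $result;
    }

print mixedCase("THE MEETING IS IN CORVALLIS AT 9 PST. BRING LUNCH.\n");
# prints "The meeting is in Corvallis at 9 PST. Bring lunch.\n"
```

The lookup is done on the lower case form of the word, so the keys of the array must all be lower case.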


To see the final program, look at mixed.case.pl.



Last modified 27 May 2006
Dave Regan
http://www.peak.org/~regan/