Mixed Case Conversion
This section contains notes on how to write a program that
takes a file in all upper case and converts it into
a plausible mixed case file.
This will act as a filter, with usage looking like:
mixed.case.pl < old.file > new.file
I want to preserve the line structure and other whitespace in
the file. To support this, write a routine which returns tokens
from the file. The tokens will be:
- Whitespace -- Any amount of whitespace, which the higher
level will simply pass on to the output file.
- Word -- A run of word characters (letters, digits, or underscore).
- Punctuation -- Any other single character. Right now, the only one
with special significance is a period.
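As a rough illustration of the three token classes, the following sketch splits a single string with one regular expression, rather than working a line at a time. The name tokenize is made up for this example and is not taken from mixed.case.pl.

```perl
use strict;
use warnings;

# Hypothetical helper: split a string into whitespace runs, word runs,
# and single other characters (punctuation).
sub tokenize
{
    my ($text) = @_;
    # In list context, /g returns every captured match in order.
    # The /s flag lets "." match a newline as well.
    return $text =~ /(\s+|\w+|.)/gs;
}

my @tokens = tokenize("HELLO, WORLD.  BYE");
print join("|", @tokens), "\n";    # prints HELLO|,| |WORLD|.|  |BYE
```

Joining the tokens back together with an empty separator reproduces the input exactly, which is what lets the filter preserve line structure and spacing.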
The routine calling the token getter will make every word lower case.
It will then make the first letter upper case if the word is the first
in the file or if the previous non-whitespace token was a period.
A further scheme to make this a little better is to recognize
certain words, and map them in a special fashion.
The token getting routine is central to this program.
This will be a routine which looks at the remainder of the
current line, and does something different based upon
the contents.
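sub getToken
{
    # Read the next line once the current one is used up.
    if (!defined($OldLine) || length($OldLine) == 0)
    {
        $OldLine = <>;
        return "" unless defined($OldLine); # End of input
    }
    # The newline stays on the line, so it comes back as part of a
    # whitespace token and is printed where it originally appeared.
    my $val = $OldLine;
    if ($OldLine =~ /^\s/)
    {
        $val =~ s/\S.*//s; # Delete from the first non-space
    }
    elsif ($OldLine =~ /^\w/)
    {
        $val =~ s/\W.*//s; # Delete from the first non-word character
    }
    else
    {
        $val = substr($OldLine, 0, 1);
    }
    $OldLine = substr($OldLine, length($val));
    return $val;
}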
That looks rather ugly, and there are undoubtedly ways to clean it up.
But the basic idea is that we ensure that we have a line to work
with, and if we don't, we read the next line.
Once we have a line, we look to see if it starts with a run of
white space or a run of word characters. If so, we package
that run up and return it. Otherwise we take just a single character
and return that. Note that it isn't strictly necessary to package
up the run of white space; we could have let it go through
a character at a time. However, we do need to package up the
word characters, so that we can do special case substitution
based upon words in the main program.
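One possible cleanup, offered only as a sketch: a single substitution can strip the leading token off the line and capture it at the same time. The version below also takes the filehandle as a parameter so it can be tried on an in-memory string. The name makeTokenReader and the closure arrangement are inventions for this example, not what mixed.case.pl does.

```perl
use strict;
use warnings;

# Hypothetical helper: build a token reader around any filehandle.
sub makeTokenReader
{
    my ($fh) = @_;
    my $line = "";
    return sub {
        if ($line eq "")
        {
            $line = <$fh>;
            if (!defined $line)
            {
                $line = "";
                return "";          # end of input
            }
        }
        # Remove and return the leading run of whitespace, the leading
        # run of word characters, or else one single character.
        $line =~ s/^(\s+|\w+|.)//s;
        return $1;
    };
}

# Demonstration on an in-memory file.
my $data = "GOOD DAY.\nBYE\n";
open(my $in, "<", \$data) or die "cannot open string: $!";
my $next = makeTokenReader($in);
my @seen;
while ((my $tok = $next->()) ne "")
{
    push @seen, $tok;
}
print scalar(@seen), " tokens\n";   # prints 7 tokens
```

Because the newline is left on the line, it arrives as an ordinary whitespace token, so the caller needs no special handling for line ends.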
The next phase is looking at the main program which calls
the getToken routine we defined above. This would look
something like:
$EndOfSentence = 1;
while (($token = getToken()) ne "")
{
    if ($token eq ".")
    {
        $EndOfSentence = 1;
    }
    elsif ($token =~ /^\w/)
    {
        $token = lc($token);
        if ($EndOfSentence)
        {
            $token = ucfirst($token);
        }
        $EndOfSentence = 0;
    }
    print $token;
}
This goes through the tokens, and notes when a "." goes through
so that the next word can be capitalized. If the token is a word token,
it is converted to lower case with the lc function.
If the word follows a period, or is the first word in the file,
the ucfirst function makes its first letter upper case again.
In either case, once we write a word, we turn the EndOfSentence flag off.
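To check the flag logic without reading a file, the same loop can be run over a hand-written token list. This is only a test harness under that assumption. The name capitalizeTokens is hypothetical and does not appear in the program.

```perl
use strict;
use warnings;

# Hypothetical helper: apply the sentence-start rule to a list of
# tokens and return the converted text as one string.
sub capitalizeTokens
{
    my @tokens = @_;
    my $endOfSentence = 1;      # the first word starts a sentence
    my $out = "";
    foreach my $token (@tokens)
    {
        if ($token eq ".")
        {
            $endOfSentence = 1;
        }
        elsif ($token =~ /^\w/)
        {
            $token = lc($token);
            $token = ucfirst($token) if $endOfSentence;
            $endOfSentence = 0;
        }
        # Whitespace and other punctuation leave the flag untouched.
        $out .= $token;
    }
    return $out;
}

print capitalizeTokens("THE", " ", "END", ".", "\n", "NEW", " ", "START", ".", "\n");
# prints:
# The end.
# New start.
```

The newline between the period and NEW does not clear the flag, which is why a sentence that starts on a new line is still capitalized.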
The last point in this program is taking care of special cases.
For instance, the town of Corvallis should be capitalized regardless
of where it occurs in a sentence. Other words, such as the time zone
"PST" should be in all capital letters.
This is handled by creating an associative array (a hash) of exceptions,
keyed by the lower case form of each word.
$Special{"portland"} = "Portland";
$Special{"corvallis"} = "Corvallis";
$Special{"pst"} = "PST";
$Special{"pdt"} = "PDT";
Then the main loop needs to be modified to check to see if
the token is one of the special tokens, and use the specified
value if it exists. If it doesn't exist, follow the normal
conversions.
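A minimal sketch of that check, pulled out into a helper so it can be tried on its own. The name convertWord and its two-argument interface are assumptions made for this example, and the final program may arrange things differently.

```perl
use strict;
use warnings;

my %Special = (
    "portland"  => "Portland",
    "corvallis" => "Corvallis",
    "pst"       => "PST",
    "pdt"       => "PDT",
);

# Hypothetical helper: convert one word token. The second argument
# says whether the word starts a sentence.
sub convertWord
{
    my ($word, $endOfSentence) = @_;
    my $lower = lc($word);
    # A special word always uses its listed spelling.
    return $Special{$lower} if exists $Special{$lower};
    # Otherwise follow the normal conversions.
    return $endOfSentence ? ucfirst($lower) : $lower;
}

my @words = ("MEET", "IN", "CORVALLIS", "AT", "NOON", "PST");
my @out;
for my $i (0 .. $#words)
{
    push @out, convertWord($words[$i], $i == 0);
}
print join(" ", @out), "\n";    # prints Meet in Corvallis at noon PST
```

In the main loop, the word branch would call something like this in place of the lc and ucfirst lines, and still clear the EndOfSentence flag afterwards.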
To see the final program, look at
mixed.case.pl.