uppercasing a string

Erwin Kalvelagen

May 16, 2000, 7:00:00 AM

What is a good way of uppercasing a character (or string)?
I use now:

c -------------------------------------------------------
      character function charupcase(c)
c -------------------------------------------------------
c
c if c is a lowercase letter, return its uppercase
c otherwise return c
c
c Works for ASCII and EBCDIC (needed???)
c

      implicit none

      character c
      character uc

      integer i

      character*26 LOW, UPP
      data LOW /'abcdefghijklmnopqrstuvwxyz'/,
     $     UPP /'ABCDEFGHIJKLMNOPQRSTUVWXYZ'/

      uc = c
      if (c .ge. 'a' .and. c .le. 'z') then
         i = index(low,c)
         if (i.gt.0) uc = upp(i:i)
      endif

      charupcase = uc

      return
      end

I copied this from some other piece of (otherwise excellent) software,
but the call to INDEX worries me. Is this a problem performance-wise?
If so, do you know any good alternatives? It needs to be portable.

Thanks, Erwin

--
Erwin Kalvelagen
GAMS Development Corp.
1217 Potomac St. NW
Washington DC 20007
phone: (202) 342 0180
fax: (202) 342 0181
e-mail:ekalv...@gams.com

Richard Maine

May 16, 2000, 7:00:00 AM

"Erwin Kalvelagen" <er...@gams.com> writes:

> What is a good way of uppercasing a character (or string)?

...

> data LOW /'abcdefghijklmnopqrstuvwxyz'/,
> $ UPP /'ABCDEFGHIJKLMNOPQRSTUVWXYZ'/

This part is "good", in that it's portable: no assumptions about
collating sequence. You do have one assumption built in here,
namely that this is the complete list of characters. If you end
up applying this to character sets with accented letters as separate
characters, then you'll miss them. But as long as that's ok for
your application, then fine.

> if (c .ge. 'a' .and. c .le. 'z') then
> i = index(low,c)
> if (i.gt.0) uc = upp(i:i)
> endif

> I copied this from some other piece of (otherwise excellent) software,

> but the call to INDEX worries me. Is this a problem performance-wise?
> If so, do you know any good alternatives? It needs to be portable.

If performance is your top priority, then an "obvious" thing to do is
to build a lookup table based on the collating sequence. You can do
that in a pretty portable manner. But I don't have time to throw
together a code sample right now.
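
A minimal sketch of such a lookup table, written here in free-form Fortran 90. The module name upcase_table, the function name charupcase_tab, and the assumption that the native character set has codes 0..255 are illustrative and not from the thread. The table is indexed by ICHAR, so it follows the native collating sequence, and only the 26 letters listed in LOW are remapped.

```fortran
module upcase_table
  implicit none
  private
  public :: charupcase_tab

  character(len=26), parameter :: low = 'abcdefghijklmnopqrstuvwxyz'
  character(len=26), parameter :: upp = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'

  ! Assumption: the native character set has codes 0..255.
  character(len=1), save :: table(0:255)
  logical, save :: ready = .false.

contains

  subroutine build_table()
    integer :: k
    ! Start from the identity mapping in the native collating sequence.
    do k = 0, 255
      table(k) = char(k)
    end do
    ! Overwrite the 26 lowercase slots with their uppercase partners.
    do k = 1, 26
      table(ichar(low(k:k))) = upp(k:k)
    end do
    ready = .true.
  end subroutine build_table

  function charupcase_tab(c) result(uc)
    character(len=1), intent(in) :: c
    character(len=1) :: uc
    ! Build the table once, on first use.
    if (.not. ready) call build_table()
    uc = table(ichar(c))
  end function charupcase_tab

end module upcase_table
```

After the first call, each conversion is a single array reference; no position in the collating sequence is assumed for any letter.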

Or another alternative is to use iachar/achar, which always use the
ASCII collating sequence, regardless of whether or not that's the native
character set. Examples of that have been posted here before.
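
A short sketch of the ASCII-based approach, using IACHAR together with ACHAR (the ASCII counterpart of CHAR). The module and function names are hypothetical. It relies only on the fact that, in ASCII, the lowercase letters occupy a contiguous range of codes and each uppercase letter sits a fixed distance below its lowercase partner.

```fortran
module upcase_iachar
  implicit none
  private
  public :: upcase_ascii

  ! Distance between a lowercase letter and its uppercase partner in ASCII.
  integer, parameter :: shift = iachar('a') - iachar('A')

contains

  function upcase_ascii(c) result(uc)
    character(len=1), intent(in) :: c
    character(len=1) :: uc
    integer :: code

    ! IACHAR returns the ASCII code even if the native set is not ASCII.
    code = iachar(c)
    if (code >= iachar('a') .and. code <= iachar('z')) then
      uc = achar(code - shift)
    else
      uc = c
    end if
  end function upcase_ascii

end module upcase_iachar
```

Because the range test is made on ASCII codes rather than on the native collating sequence, it does not let non-letter characters through.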

In my applications, I usually don't have to worry a lot about the
performance of this kind of operation. Unless I mess it up *REALLY*
badly, it's not going to be measurable in the context of an overall
application. I suppose exceptions could exist.

Note, by the way that

> if (c .ge. 'a' .and. c .le. 'z') then

does *NOT* guarantee that you have a lower case letter. There may
be non-letter characters mixed in with the letters; indeed EBCDIC
(which this code claims to be ok for) has that feature.

I see that the code does later have a test for whether index returned
zero, which will take care of those cases, but it would be real easy
to think that the test for 0 was superfluous. It isn't; don't take
it out.

--
Richard Maine
ma...@altair.dfrc.nasa.gov

Paul van Delst

May 16, 2000, 7:00:00 AM

Erwin Kalvelagen wrote:
>
> What is a good way of uppercasing a character (or string)?

> I use now:
>
> c -------------------------------------------------------
> character function charupcase(c)
> c -------------------------------------------------------
> c
> c if c is a lowercase letter, return its uppercase
> c otherwise return c
> c
> c Works for ASCII and EBCDIC (needed???)
> c
>
> implicit none
>
> character c
> character uc
>
> integer i
>
> character*26 LOW, UPP

> data LOW /'abcdefghijklmnopqrstuvwxyz'/,
> $ UPP /'ABCDEFGHIJKLMNOPQRSTUVWXYZ'/
>

> uc = c

> if (c .ge. 'a' .and. c .le. 'z') then
> i = index(low,c)
> if (i.gt.0) uc = upp(i:i)
> endif
>

> charupcase = uc
>
> return
> end
>

> I copied this from some other piece of (otherwise excellent) software,
> but the call to INDEX worries me. Is this a problem performance-wise?
> If so, do you know any good alternatives? It needs to be portable.

I think you can get rid of the

if (c .ge. 'a' .and. c .le. 'z')

condition since you have the post-INDEX call
if (i.gt.0)
check.

Regarding performance of the INDEX, I can't say. Probably differs
between platforms though. The INDEX intrinsic has been around for a
while though (in f77 or extensions thereof) so I'd be pretty surprised
if it was a complete dog wrt performance.

I do it in pretty much the same way, except that my IDL background wanted
a "convert the whole string" type of routine. The method was
unceremoniously lifted from Cooper Redwine's "Upgrading to Fortran90"
book, but the coding is mine (i.e. don't direct style criticism to
him :o):

MODULE string_processing
USE type_kinds
IMPLICIT NONE
PRIVATE
CHARACTER( LEN = 128 ), PARAMETER :: &
  rcs_Id = '$Id: string_processing.f90,v 1.1 1999/10/18 20:49:15 paulv Exp $'
CHARACTER( LEN = 26 ), PARAMETER :: &
  lower_case = 'abcdefghijklmnopqrstuvwxyz', &
  upper_case = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
PUBLIC :: strupcase, strlowcase, strcompress

CONTAINS

!-------------------------------------------------------------------------------
! -- Subroutine to convert strings to UPPER CASE --
!-------------------------------------------------------------------------------

SUBROUTINE strupcase( input_string, output_string )

! -- Arguments
CHARACTER( LEN = * ), INTENT( IN ) :: input_string
CHARACTER( LEN = * ), INTENT( OUT ) :: output_string

! -- Local variables
INTEGER( Long ) :: i, n, position

! -- Copy input string
output_string = input_string

! -- Get length of string
n = LEN( output_string )

! -- Loop over string elements
DO i = 1, n

! -- Find location of letter in lower case constant string
position = INDEX( lower_case, output_string( i:i ) )

! -- If current substring is a lower case letter, make it upper case
IF ( position /= 0 ) &
output_string( i:i ) = upper_case( position:position )

END DO

END SUBROUTINE strupcase

etc.....

END MODULE string_processing

--
Paul van Delst Ph: (301) 763-8000 x7274
CIMSS @ NOAA/NCEP Fax: (301) 763-8545
Rm.202, 5200 Auth Rd. Email: pvan...@ncep.noaa.gov
Camp Springs MD 20746

James Giles

May 16, 2000, 7:00:00 AM

Erwin Kalvelagen wrote in message <8fs6vi$91h$1...@bob.news.rcn.net>...

> implicit none
>
> character c
> character uc
>
> integer i
>
> character*26 LOW, UPP
> data LOW /'abcdefghijklmnopqrstuvwxyz'/,
> $ UPP /'ABCDEFGHIJKLMNOPQRSTUVWXYZ'/
>
>
> uc = c
> if (c .ge. 'a' .and. c .le. 'z') then

This test is superfluous given what follows. It's
also not portable since the letters are not required
(by the language standard) to be contiguous in the
collating sequence.

> i = index(low,c)

The returned value from INDEX will be nonzero only if
the argument C is a lowercase letter. If it is anything else,
it won't be in the string LOW and INDEX will return
zero.

> if (i.gt.0) uc = upp(i:i)

Replace .GT. with .NE. and remove the whole block
IF. That will give a completely portable conversion
routine.
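
For reference, a sketch of the routine with both changes applied (range test removed, .GT. replaced by a not-equal test), written in free-form Fortran as an illustration rather than as anyone's posted code. It makes no assumption about where the letters sit in the collating sequence; it assumes only that the 26 letters listed are the ones to convert.

```fortran
character function charupcase(c)
  implicit none
  character, intent(in) :: c
  integer :: i
  character(len=26), parameter :: low = 'abcdefghijklmnopqrstuvwxyz'
  character(len=26), parameter :: upp = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'

  ! Default: return the character unchanged.
  charupcase = c

  ! INDEX returns 0 unless c is one of the 26 lowercase letters.
  i = index(low, c)
  if (i /= 0) charupcase = upp(i:i)
end function charupcase
```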

> endif
>
> charupcase = uc
>
> return
> end
>
>
>I copied this from some other piece of (otherwise excellent) software,
>but the call to INDEX worries me. Is this a problem performance-wise?
>If so, do you know any good alternatives? It needs to be portable.

INDEX is not too bad on most modern hardware, which often
has single or special instructions to scan a string for a single
character. Faster would be the following (not portable if your
program uses characters that aren't in the ASCII set):

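      function upcase(c)
      character*1 upcase, c
c     IACHAR returns 0..127 for ASCII characters, so the table
c     is dimensioned from 0 rather than from 1.
      character*1 upper(0:127)
c     ... DATA to set UPPER to the ASCII sequence, but with
c     ... all lowercase replaced with the corresponding uppercase.
      upcase = upper(iachar(c))
      return
      end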

Hypothetically, a good compiler could use a single instruction
with the lookup table on many hardware platforms (XLAT on
PCs, for example).

--
J. Giles

Dick Hendrickson

May 16, 2000, 7:00:00 AM

James Giles wrote:
>
> Erwin Kalvelagen wrote in message <8fs6vi$91h$1...@bob.news.rcn.net>...
> > implicit none
> >
> > character c
> > character uc
> >
> > integer i
> >
> > character*26 LOW, UPP
> > data LOW /'abcdefghijklmnopqrstuvwxyz'/,
> > $ UPP /'ABCDEFGHIJKLMNOPQRSTUVWXYZ'/
> >
> >
> > uc = c
> > if (c .ge. 'a' .and. c .le. 'z') then
>
> This test is superfluous given what follows. It's
> also not portable since the letters are not required
> (by the language standard) to be contiguous in the
> collating sequence.

Several people have commented on this, but they all are sort of
wrong. The standard does require that 'a' be less than 'b'...
less than 'z'. Given that the questioner later asks about speed,
this is a reasonable way to eliminate all non-lower-case
characters on ASCII machines and many of them on other machines.
It's a crude test with a more accurate test for false positives.
Not a bad way to do things; unless he knows, for example, that 90%
of the characters are alphabetic. Especially since INDEX is likely
to scan all 26 characters and be relatively slow compared to the
simple IF.

Dick Hendrickson

Donald Arseneau

May 16, 2000, 7:00:00 AM

"Erwin Kalvelagen" <er...@gams.com> writes:

> data LOW /'abcdefghijklmnopqrstuvwxyz'/,
> $ UPP /'ABCDEFGHIJKLMNOPQRSTUVWXYZ'/

> but the call to INDEX worries me. Is this a problem performance-wise?

For pure speed, a look-up table. If non-ascii may be a problem
(I assume not because there are no accented or `foreign' letters)
then it is possible to choose between various tables.

But for a simple boost to index(), how about sorting the alphabet
by usage frequency:

data LOW /'esta ... vzj'/,
$ UPP /'ESTA ... VZJ'/

(approximately)
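
A sketch of the frequency-ordered idea. The ordering used below is a commonly quoted approximate English letter-frequency order, chosen here purely for illustration; it is an assumption, not the ordering elided above, and the best order depends on the text being processed. The module and function names are also hypothetical. The two strings must list the letters in the same order so that position i in one matches position i in the other.

```fortran
module upcase_freq
  implicit none
  private
  public :: charupcase_freq

  ! Assumption: illustrative English frequency order, most common first.
  character(len=26), parameter :: low = 'etaoinshrdlucmfwypvbgkqjxz'
  character(len=26), parameter :: upp = 'ETAOINSHRDLUCMFWYPVBGKQJXZ'

contains

  function charupcase_freq(c) result(uc)
    character(len=1), intent(in) :: c
    character(len=1) :: uc
    integer :: i

    uc = c
    ! Common letters are found near the start of the string.
    i = index(low, c)
    if (i /= 0) uc = upp(i:i)
  end function charupcase_freq

end module upcase_freq
```

Any gain depends on INDEX scanning from the left and stopping at the first match, and characters that are not lowercase letters still cost a full scan of all 26 positions.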

Donald Arseneau as...@triumf.ca