
handling large character arrays


anal...@hotmail.com

Apr 21, 2010, 6:22:06 PM
I need to keep in memory an array of around 60000 character variables,
each element of which can have a max length of 4000 bytes. But if you
add up the lengths of all the actual data values, it is only 1/8 of
60000*4000.

What would be the cleanest way to store this data to take advantage of
this fact?

Thanks.

Dick Hendrickson

Apr 21, 2010, 8:46:14 PM
Seriously, how about character array(60000)*(4000)? Unless I
seriously messed up the math, that's only 240,000,000 bytes. A
quarter giga-byte isn't much now. My laptop has 4 GB and didn't
cost much less than $2000. Brute force is often your friend.

Dick Hendrickson

Gordon Sande

Apr 21, 2010, 8:47:48 PM
On 2010-04-21 19:22:06 -0300, "anal...@hotmail.com"
<anal...@hotmail.com> said:

For a symbol table where the average length is short but a few symbols
can be much longer I have used pointers (an integer valued subscript)
into a storage pool. You can either be like Pascal and have a length
associated with each pointer or like C and use a sentinel to end each
string (C uses a nul but ASCII does have ETX for end-of-text). Depends
on how important it is to easily know the length. Just another example
of "make sure you know all the operations to be done before you decide
on the data structure". The pointers, often called headers, will be
integers but the storage pool will be characters. I tend to use an array
of characters of length one but one might use a longer character and
find substrings. The issue is converting between a character of length
n and an array of n characters of length one.
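
A minimal sketch of such a pool, assuming the Pascal-style variant (a start index and a length per entry) and an array of length-one characters. The names pool, start, length, store and fetch, and the tiny sizes, are made up for illustration; store and fetch do the conversion between a character of length n and n characters of length one.

```fortran
program pool_demo
  implicit none
  integer, parameter :: pool_size = 64, max_strings = 4   ! hypothetical sizes
  character(len=1) :: pool(pool_size)                     ! the storage pool
  integer :: start(max_strings), length(max_strings)      ! the "headers"
  integer :: n_strings, next_free

  n_strings = 0
  next_free = 1
  call store('alpha')
  call store('')                    ! zero-length entries are fine
  call store('a longer symbol')

  print '(3a)', '[', fetch(3), ']'
  print '(a,i0)', 'length of entry 2 = ', length(2)

contains

  ! Append s to the pool, one character per pool element.
  subroutine store(s)
    character(len=*), intent(in) :: s
    integer :: k
    if (n_strings >= max_strings .or. next_free + len(s) - 1 > pool_size) then
      stop 'pool exhausted'
    end if
    n_strings = n_strings + 1
    start(n_strings) = next_free
    length(n_strings) = len(s)
    do k = 1, len(s)
      pool(next_free + k - 1) = s(k:k)
    end do
    next_free = next_free + len(s)
  end subroutine store

  ! Rebuild entry i as an ordinary character string of the stored length.
  function fetch(i) result(s)
    integer, intent(in) :: i
    character(len=length(i)) :: s
    integer :: k
    do k = 1, length(i)
      s(k:k) = pool(start(i) + k - 1)
    end do
  end function fetch

end program pool_demo
```

Keeping the length in the header makes "how long is entry i" a lookup; the sentinel variant would save the length array at the cost of a scan.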

The next layer out is to decide whether this is just a storage pool
where you know which pointer to use or is some sort of searchable
data structure which has to determine which pointer does whatever job
is expected of it. But this is not what you asked about.

Storage is so easy and cheap that sometimes I don't bother any more.

Your mileage may vary, as is the usual weaseling out.


e p chandler

Apr 21, 2010, 8:57:38 PM

<anal...@hotmail.com> wrote in message
news:24f1b2ab-86b6-4518...@w16g2000vbf.googlegroups.com...

I'll skip the usual rhetorical why?, what_for? and what's wrong with a
database?

How about an array each of whose elements is a derived type. One component
is an integer that specifies the string's length. The second is an
allocatable array of single characters.

---- start text ----
module my_mod
  implicit none

  type node
    integer :: len
    character, allocatable :: char(:)
  end type
end module my_mod

program my_prog
  use my_mod
  implicit none
  integer,parameter :: buff_size = 80, node_max = 5
  character(buff_size) :: in_buff
  integer :: curr_node,curr_len,curr_pos,num_nodes
  type(node) :: str(node_max)

  num_nodes = 0
  do curr_node = 1,node_max
    read '(a)',in_buff
    curr_len = len(trim(in_buff))
    if (curr_len == 0) exit
    str(curr_node)%len = curr_len
    allocate(str(curr_node)%char(curr_len))
    do curr_pos = 1,curr_len
      str(curr_node)%char(curr_pos) = in_buff(curr_pos:curr_pos)
    end do
    num_nodes = num_nodes + 1
  end do

  do curr_node = 1,num_nodes
    print *,curr_node,str(curr_node)%len,'|',str(curr_node)%char,'|'
  end do

end program my_prog
---- end text ----

Of course then you have the overhead of converting to and from or
inter-operating with normal string variables, etc.

---- e


dpb

Apr 21, 2010, 8:59:32 PM

My initial thought w/o the crystal ball working to address Gordon's
points is Dick's response (which is where Gordon also sorta' ends up...
:) ).

Unless this is on a constrained machine or there are other large
requirements, I'd tend to think that brute force, being the easiest,
may well also be best here.

--

Jim Xia

Apr 21, 2010, 9:20:16 PM

> I'll skip the usual rhetorical why?, what_for? and what's wrong with a
> database?
>
> How about an array each of whose elements is a derived type. One component
> is an integer that specifies the string's length. The second is an
> allocatable array of single characters.
>
> ---- start text ----
> module my_mod
> implicit none
>
> type node
>   integer :: len
>   character, allocatable :: char(:)
> end type
> end module my_mod
>


How about this

type string
character(:), allocatable :: str
end type


You don't need the len variable. len(node%str) will be that value.


Then you need an array of type string:

type(string) :: my_memory(60000)

This method saves you space only if the majority of the string lengths
are far less than 4000 bytes. Otherwise the Fortran descriptor will
take a toll on your total memory usage.

Cheers,

Jim

Jim Xia

Apr 21, 2010, 9:32:52 PM
OK, for the sake of the completeness, let's see my version of my_prog:


module my_mod
  implicit none

  type string
    character(:), allocatable :: str
  end type

end module my_mod


program my_prog
  use my_mod
  implicit none

  integer,parameter :: max_buff_size = 4000
  integer,parameter :: array_size = 60000
  character(max_buff_size) :: in_buff

  type(string) :: the_memory(array_size)
  integer :: i, whatEverUnit   ! whatEverUnit is a placeholder: connect it to an open file first

  do i = 1, array_size
    read (whatEverUnit, '(a)') in_buff
    the_memory(i)%str = trim(in_buff)   ! allocate-on-assignment (f2003)
  end do

end program my_prog

Cheers,

Jim

Richard Maine

Apr 21, 2010, 9:35:45 PM
e p chandler <ep...@juno.com> wrote:

> <anal...@hotmail.com> wrote in message
> news:24f1b2ab-86b6-4518...@w16g2000vbf.googlegroups.com...
> >I need to keep in memory an array of around 60000 character variables,
> > each element of which can have a max length of 4000 bytes. But if you
> > add up the lengths of all the actual data values, it is only 1/8 of
> > 60000*4000.
> >
> > What would be the cleanest way to store this data to take advantage of
> > this fact?

> How about an array each of whose elements is a derived type. One component


> is an integer that specifies the string's length. The second is an
> allocatable array of single characters.

Why the redundancy of the separate component for the string length? The
allocatable already stores the length, so you are just replicating that
by also making the length a separate component.
...


> curr_len = len(trim(in_buff))
> if (curr_len == 0) exit

My nose for potential bugs detects two of them here. This assumes two
things. Both assumptions might be correct, but I tend to think it good
to make such assumptions explicit. Otherwise, you are likely to find
that the user of your code doesn't share your assumptions, which asks
for bugs.

1. It assumes that trailing blanks are insignificant. Maybe they are,
maybe not. Nothing in the problem statement specifically said they were.
If I were actually writing real code in response to such a problem
statement, I'd ask before just assuming this to be so. And I'd document
the assumption in case things changed in the future (or in case the
person who told me that the blanks were insignificant was wrong.)

2. It assumes that a length of 0 is invalid. Again, nothing specifically
said that. Note that it is perfectly legit in f90 and later to have a
zero-length string. (It wasn't in f77).
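
One way to make both assumptions explicit, as a minimal sketch rather than a fix to the posted program: let end-of-input (detected with iostat) terminate the loop, so that an empty line can count as a valid zero-length item, and document the trailing-blank decision in a comment. Reading from standard input and the 80-character buffer are assumptions carried over from the test program.

```fortran
program read_lines
  implicit none
  integer, parameter :: buff_size = 80
  character(len=buff_size) :: in_buff
  integer :: ios, n_lines, n_blank

  n_lines = 0
  n_blank = 0
  do
    read (*, '(a)', iostat=ios) in_buff
    if (ios /= 0) exit            ! end of file (or a read error) ends the loop
    n_lines = n_lines + 1
    ! Documented assumption: trailing blanks are NOT significant,
    ! so len_trim gives the length that would be stored.
    ! An empty line is treated as data (a zero-length string), not as a terminator.
    if (len_trim(in_buff) == 0) n_blank = n_blank + 1
  end do

  print '(a,i0,a,i0)', 'lines read: ', n_lines, ', zero-length: ', n_blank
end program read_lines
```

If trailing blanks turn out to matter, a formatted read into a fixed buffer cannot tell them from padding, so the record length would have to come from somewhere else (for example non-advancing input with size=).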

Note as an aside, that this uses f2003 features. One could do it in
f90/f95 with pointers to arrays of character*1, though that's not nearly
as nice. I might be tempted to use a long character string as a storage
pool (as in Gordon's reply) instead of the pointer to character array
thing. That does assume that the data item lengths don't change after
their initial assignment; otherwise managing the space in the storage
pool becomes much more bother.
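
For comparison, a rough sketch of the f95 pointer variant mentioned above, assuming one pointer array of character*1 per item. The type and variable names are invented for the example.

```fortran
program f95_pointer_demo
  implicit none

  type node
    character(len=1), pointer :: chars(:) => null()
  end type node

  type(node) :: str(2)
  character(len=20) :: buf
  integer :: i, k, n

  ! Copy a trimmed string into a freshly allocated pointer array.
  buf = 'hello'
  n = len_trim(buf)
  allocate (str(1)%chars(n))
  do k = 1, n
    str(1)%chars(k) = buf(k:k)
  end do

  allocate (str(2)%chars(0))      ! a zero-length entry is legal

  do i = 1, size(str)
    print '(i0,a,i0)', i, ': size = ', size(str(i)%chars)
    print '(20a1)', str(i)%chars
  end do

  ! Pointer components are not freed automatically, unlike allocatables.
  do i = 1, size(str)
    deallocate (str(i)%chars)
  end do
end program f95_pointer_demo
```

The explicit deallocate loop is part of why this is less pleasant than allocatable components.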

--
Richard Maine | Good judgment comes from experience;
email: last name at domain . net | experience comes from bad judgment.
domain: summertriangle | -- Mark Twain

e p chandler

Apr 21, 2010, 9:37:40 PM

"Jim Xia" <jim...@hotmail.com> wrote in message

> module my_mod
> implicit none
>
> type node
> integer :: len
> character, allocatable :: char(:)
> end type
> end module my_mod
>

How about this

type string
character(:), allocatable :: str
end type

You don't need the len variable. len(node%str) will be that value.

---> len() returns a value of 1. Do you mean size()?


robert....@oracle.com

Apr 21, 2010, 9:47:43 PM
On Apr 21, 3:22 pm, "analys...@hotmail.com" <analys...@hotmail.com>
wrote:

How is the array to be used? What operations will be
performed on it?

Robert Corbett

Jim Xia

Apr 21, 2010, 9:49:58 PM


Sorry, I meant len(string%str). I renamed the type name and component
name after I typed in the sentence -- forgot to correct them.

character(:), allocatable :: str is a scalar of character with
deferred length. You can automatically allocate this string by
assignment -- see my full example.


Cheers,

Jim

Richard Maine

Apr 21, 2010, 9:51:23 PM
e p chandler <ep...@juno.com> wrote:

No, he means len. If len returns other than the correct value, report
the bug to your vendor. Allocatable character length being one of the
f2003 features last on the implementation list, it could well be that
your vendor has such a bug... but it would be a bug.

Size is quite irrelevant. Your compiler shouldn't even allow
size(node%str) in Jim's example, as it isn't an array. If your compiler
allows that, there's another bug to report (and a more surprising one).

Your example uses an array of characters (I missed that at first), but
Jim's uses a character string instead. They are different.

Note also the handy way that Jim's example (in his subsequent followup)
nicely uses the f2003 allocate-on-assignment feature. I think it nicely
illustrates the cleanliness you can get from using allocatable character
length and that feature.
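
A small sketch of the distinction, assuming a compiler that implements f2003 deferred-length character components and allocate-on-assignment. The three sample values are arbitrary.

```fortran
program len_demo
  implicit none

  type string
    character(len=:), allocatable :: str
  end type string

  type(string) :: s(3)
  integer :: i

  s(1)%str = 'abc'            ! allocated with length 3 by the assignment
  s(2)%str = ''               ! zero length, but still allocated
  s(3)%str = 'hello world'

  do i = 1, size(s)           ! size applies to the array s, not to the scalar string
    print '(i0,a,i0)', i, ': len = ', len(s(i)%str)
  end do

  s(1)%str = 'a different length'   ! reallocated to the new length automatically
  print '(a,i0)', 'after reassignment: len = ', len(s(1)%str)
end program len_demo
```

Writing size(s(1)%str) here should be rejected at compile time, since the component is a scalar.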

e p chandler

Apr 21, 2010, 10:04:06 PM

"Richard Maine" <nos...@see.signature> wrote in message
news:1jhb9j0.jfrt8o1unhkhsN%nos...@see.signature...

>e p chandler <ep...@juno.com> wrote:
>
>> <anal...@hotmail.com> wrote in message
>> news:24f1b2ab-86b6-4518...@w16g2000vbf.googlegroups.com...
>> >I need to keep in memory an array of around 60000 character variables,
>> > each element of which can have a max length of 4000 bytes. But if you
>> > add up the lengths of all the actual data values, it is only 1/8 of
>> > 60000*4000.
>> >
>> > What would be the cleanest way to store this data to take advantage of
>> > this fact?
>
>> How about an array each of whose elements is a derived type. One
>> component
>> is an integer that specifies the string's length. The second is an
>> allocatable array of single characters.
>
> Why the redundancy of the separate component for the string length? The
> allocatable already stores the length, so you are just replicating that
> by also making the length a separate component.
> ...

Probably because I was thinking about (non-standard) Pascal. [smile] I see
now that I can use size() instead.

>> curr_len = len(trim(in_buff))
>> if (curr_len == 0) exit
>
> My nose for potential bugs detects two of them here. This assumes two
> things. Both assumptions might be correct, but I tend to think it good
> to make such assumptions explicit. Otherwise, you are likely to find
> that the user of your code doesn't share your assumptions, which asks
> for bugs.

> 1. It assumes that trailing blanks are insignificant. Maybe they are,
> maybe not. Nothing in the problem statement specifically said they were.
> If I were actually writing real code in response to such a problem
> statement, I'd ask before just assuming this to be so. And I'd document
> the assumption in case things changed in the future (or in case the
> person who told me that the blanks were insignificant was wrong.)

Just a convenient way of setting the length to something.

> 2. It assumes that a length of 0 is invalid. Again, nothing specifically
> said that. Note that it is perfectly legit in f90 and later to have a
> zero-length string. (It wasn't in f77).

Just a convenient way of exiting early from my test program.

OK. But please distinguish the mechanics of my test program and its
assumptions from what it was trying to demonstrate. I do think that I would
make a production program more robust!

--- e


e p chandler

Apr 21, 2010, 10:26:48 PM

"Jim Xia" <jim...@hotmail.com> wrote in message
news:36f14b04-a900-4508...@q23g2000yqd.googlegroups.com...

Nice. Sorry I don't have a compiler that handles full F2003. (g95 and
gfortran don't).

e p chandler

Apr 21, 2010, 10:33:37 PM

"Richard Maine" <nos...@see.signature> wrote in message
news:1jhbaje.16a5g3wqyqr7wN%nos...@see.signature...

>e p chandler <ep...@juno.com> wrote:
>
>> "Jim Xia" <jim...@hotmail.com> wrote in message
>>
>> > module my_mod
>> > implicit none
>> >
>> > type node
>> > integer :: len
>> > character, allocatable :: char(:)
>> > end type
>> > end module my_mod
>> >
>>
>> How about this
>>
>> type string
>> character(:), allocatable :: str
>> end type
>>
>> You don't need the len variable. len(node%str) will be that value.
>>
>> ---> len() returns a value of 1. Do you mean size()?
>
> No, he means len. If len returns other than the correct value, report
> the bug to your vendor. Allocatable character length being one of the
> f2003 features last on the implementation list, it could well be that
> your vendor has such a bug... but it would be a bug.
>
> Size is quite irrelevant. Your compiler shouldn't even allow
> size(node%str) in Jim's example, as it isn't an array. If your compiler
> allows that, there's another bug to report (and a more surprising one).
>
> Your example uses an array of characters (I missed that at first), but
> Jim's uses a character string instead. They are different.

Yes. I only saw his more complete program after reading your reply.
My test program was difficult enough, for me.

> Note also the handy way that Jim's example (in his subsequent followup)
> nicely uses the f2003 allocate-on-assignment feature. I think it nicely
> illustrates the cleanliness you can get from using allocatable character
> length and that feature.

Nice. I wish I had something that would compile it. (g95 and gfortran
don't).

Richard Maine

Apr 21, 2010, 11:15:12 PM
e p chandler <ep...@juno.com> wrote:

> "Richard Maine" <nos...@see.signature> wrote in message
> news:1jhb9j0.jfrt8o1unhkhsN%nos...@see.signature...

> > My nose for potential bugs detects two of them here. This assumes two


> > things. Both assumptions might be correct, but I tend to think it good
> > to make such assumptions explicit. Otherwise, you are likely to find
> > that the user of your code doesn't share your assumptions, which asks
> > for bugs.

...


> OK. But please distinguish the mechanics of my test program and its
> assumptions from what it was trying to demonstrate. I do think that I would
> make a production program more robust!

I understand. I might even have done something similar myself in order
to demonstrate the idea you were after. I just thought it an opportunity
to make a separate point about carefully considering assumptions (at
least in real code). You might well not need the lesson, but enough
people do that I thought it worthwhile to take the opportunity.

Dave Allured

Apr 22, 2010, 12:21:44 AM

Something else to consider, an old fashioned yet simple approach if you
do not need to dynamically edit the set of strings:

integer, parameter :: n_strings = 60000
integer, parameter :: max_len = 4000
integer, parameter :: buf_size = n_strings * (max_len / 8)
character(buf_size) :: strs             ! the storage pool
character(max_len) :: in                ! input line buffer
integer i, p, infile                    ! infile: unit number of an already opened file
integer p1(n_strings), p2(n_strings)    ! first and last position of each string in strs

p = 1                                   ! next free position in strs

do i = 1, n_strings
  read (infile, '(a)') in
  p1(i) = p
  p = p + len_trim (in)
  p2(i) = p - 1
  strs(p1(i):p2(i)) = in                ! the assignment truncates in to its trimmed length
end do

i = 1234
write (*,'(a,i0,a,a)') 'string #', i, ' = ', strs(p1(i):p2(i))

The buffer is easily accessed and searched. Inserting, removing, and
reordering are problematical. Like others said, it depends on what you
need to do with the strings.

--Dave

Richard Maine

Apr 22, 2010, 1:30:49 AM
Dave Allured <nos...@nospom.com> wrote:

> anal...@hotmail.com wrote:
> >
> > I need to keep in memory an array of around 60000 character variables,
> > each element of which can have a max length of 4000 bytes. But if you
> > add up the lengths of all the actual data values, it is only 1/8 of
> > 60000*4000.
>

> Something else to consider, an old fashioned yet simple approach if you
> do not need to dynamically edit the set of strings:

[elided]

Yes. That's the character string used as a storage pool approach that
Gordon and I mentioned, though neither of us provided sample code.

anal...@hotmail.com

Apr 22, 2010, 6:18:31 AM
On Apr 22, 12:21 am, Dave Allured <nos...@nospom.com> wrote:


At this point, all I want to be able to do are things like "find
the nth element", "give me all the elements that contain a particular
substring", "print out all zero length elements" etc.

Inserts etc. are handled outside the Fortran and are not an issue.
With this method, in the event I wanted to sort the elements, it
should be easy to do it through an array of pointers that would
contain the sort order.
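
A rough sketch of that idea, assuming the single-buffer layout with start and end positions (strs, p1, p2) and a tiny made-up data set. Only the integer array order is rearranged; the buffer is never moved.

```fortran
program sort_order_demo
  implicit none
  integer, parameter :: n = 4
  ! Made-up pool holding 'pear', 'apple', 'fig', 'banana' back to back.
  character(len=*), parameter :: strs = 'pearapplefigbanana'
  integer :: p1(n) = (/ 1, 5, 10, 13 /)
  integer :: p2(n) = (/ 4, 9, 12, 18 /)
  integer :: order(n), i, j, key

  do i = 1, n
    order(i) = i
  end do

  ! Insertion sort of the index array, comparing the substrings it refers to.
  do i = 2, n
    key = order(i)
    j = i - 1
    do while (j >= 1)
      if (.not. llt(strs(p1(key):p2(key)), strs(p1(order(j)):p2(order(j))))) exit
      order(j + 1) = order(j)
      j = j - 1
    end do
    order(j + 1) = key
  end do

  do i = 1, n
    print '(a)', strs(p1(order(i)):p2(order(i)))
  end do
end program sort_order_demo
```

Insertion sort keeps the example short; for 60000 elements a faster method (merge sort or quicksort on the same index array) would probably be worth the extra code.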

Thank you and thanks to all the other responders (whose suggestions
are a bit too advanced for me - but would be a good tutorial when I
have the time).

And yes - 240 Meg of memory is an issue, I have had the program fail
more than once for "Image size too large".

Gordon Sande

Apr 22, 2010, 9:08:14 AM
On 2010-04-22 07:18:31 -0300, "anal...@hotmail.com"
<anal...@hotmail.com> said:

Listing the desired operations is very good. The tough one here is
finding elements with a given substring. The suggestions that have
been given have basically been directed at symbol tables which
will match whole entries. There are fancy schemes for searching
large strings for specified substrings. Searching a bunch of smaller
strings may well have different tradeoffs. It is the stuff that one
can find in some advanced algorithms books. I believe that the folks who
do the Oxford dictionary (I am sure there are others but I do not follow
such things) have multiple papers on their advanced methods. Us mere
mortals have to trust the various "grep" libraries for the few times
the text manipulation problems go beyond symbol tables.

Louis Krupp

Apr 22, 2010, 11:06:34 AM
anal...@hotmail.com wrote:
<snip>

> And yes - 240 Meg of memory is an issue, I have had the program fail
> more than once for "Image size too large".

Wild guess: The problem might be just that -- the program image size --
and not strictly memory usage. Try making the array allocatable and see
if that helps.
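
Something along these lines, as a minimal sketch that assumes the brute-force fixed-length layout and uses an invented array name. The storage then comes from the heap at run time instead of being part of the program image.

```fortran
program alloc_demo
  implicit none
  integer, parameter :: n_strings = 60000, max_len = 4000
  character(len=max_len), allocatable :: array(:)
  integer :: stat

  ! About 240,000,000 bytes, requested at run time.
  allocate (array(n_strings), stat=stat)
  if (stat /= 0) stop 'allocation failed'

  array(1) = 'first element'
  print '(a)', trim(array(1))

  deallocate (array)
end program alloc_demo
```

Checking stat gives a clean message if the system cannot supply the memory.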

Louis

robin

Apr 24, 2010, 7:55:44 PM
"Gordon Sande" <Gordon...@EastLink.ca> wrote in message news:2010042210081316807-GordonSande@EastLinkca...

| Listing the desired operations is very good. The tough one here is
| finding elements with a given substring.

With the linear model suggested, this is trivial using INDEX.
The pointers identify the individual strings, and a separate search
is carried out for each string.
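
A minimal sketch of that search, assuming the buffer-plus-positions layout (strs, p1, p2) with a tiny made-up data set and a hypothetical search string pattern:

```fortran
program find_demo
  implicit none
  integer, parameter :: n = 4
  ! Made-up pool holding 'pear', 'apple', 'fig', 'banana' back to back.
  character(len=*), parameter :: strs = 'pearapplefigbanana'
  integer :: p1(n) = (/ 1, 5, 10, 13 /)
  integer :: p2(n) = (/ 4, 9, 12, 18 /)
  character(len=*), parameter :: pattern = 'an'
  integer :: i

  ! One INDEX call per element, restricted to that element's substring.
  do i = 1, n
    if (index(strs(p1(i):p2(i)), pattern) > 0) then
      print '(i0,2a)', i, ': ', strs(p1(i):p2(i))
    end if
  end do
end program find_demo
```

Searching each substring separately, rather than the whole buffer at once, avoids false matches that straddle the boundary between two adjacent elements.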
