Database memory usage

Rui

unread,

Apr 26, 2010, 7:31:46 AM4/26/10

to

I'm new to databases.
I want to put a big list in a database, to later work on it.
It all worked fine, using the default hsqldb that comes in
Mathematica, but I couldn't insert more than few thousand elements at
a time, and then Link problems or java heap problems.
Finally, I couldn't even open the connection. Java runs out of memory.
Memory usage of javaw process is less than 300MB when it runs out.
In any case, I thought that the databases were, or could be, stored in
the disk, so they can really be big and be a real big advantage over
the list. How can that be done?
If the question is too newbie and you feel I need to read a tutorial
on something, hehe, feel free to send me a link to a good one, hehe

Rui Rojo

David Bailey

unread,

Apr 27, 2010, 4:04:13 AM4/27/10

to

If you just want to write a lot of data away in chunks and re-read it
later, there is no need to use a database at all:

str=OpenWrite["Myfile.dat"];

Write[str,list1]
..
..
..
Write[str,listN]
Close[str]

You could also write the data in binary form if you prefer.

However, from what you say, I imagine there is some sort of memory leak
in the database handling process - you should report this as a bug to
Wolfram. In the meantime, if you really need to write a database, you
will need to use J/Link and possibly some actual Java code to do the job
correctly.

David Bailey
http://www.dbaileyconsultancy.co.uk

Vince Virgilio

unread,

Apr 27, 2010, 4:05:39 AM4/27/10

to

On Apr 26, 7:31 am, Rui <rui....@gmail.com> wrote:
> I'm new to databases.
> I want to put a big list in a database, to later work on it.
> It all worked fine, using the default hsqldb that comes in
> Mathematica, but I couldn't insert more than few thousand elements at
> a time, and then Link problems or java heap problems.
> Finally, I couldn't even open the connection. Java runs out of memory.
> Memory usage of javaw process is less than 300MB when it runs out.

Rui,

For what it's worth: "Me too", but a long time ago, with a different
kind of database.

I noticed similar failures when I was trying to export very large
Microsoft Excel workbooks (XLS), with multiple worksheets. There
appear to be some hardwired assumptions or limits in Mathematica's
general export mechanism; and obviously the limit interacts with Java,
if not totally determined by it.

Vince Virgilio

David Reiss

unread,

Apr 27, 2010, 4:06:00 AM4/27/10

to

Hi Rui,

There are two parts to the answer to your question. The first one has
to do directly with the Java issue. DatabaseLink uses Java as an
intermediary to communicate with the database. Because of this any
memory restrictions on the Java process will limit the amount of data
that you can move between Mathematica and the database at any one
time. One way around this that is scalable is to bring the data into
or send the data to the database in chunks so that the Java heap size
is not exceeded. A simple approach to this is to give your database
table an auto-indexing column and to use the values in that column to
select the portions of the data sequentially (when getting data from
the database). And of course when sending data to the database only
send it in smaller sets of data. The general rule though is that if
you try to send a vary large amount of data to a database all at one
time (or retrieve a very large amount) you will run in to problems
eventually with the Java memory allocation.

Another possible issue with hsqldb that you may run into is that (I
believe) hsqldb is an entirely memory resident database: in contrast
to mySQL and others. So you may run into issues with that baed on the
size of the database tables and the size of your memory. This is
independent of the java issue.

A non-scalable way to address the Java issue can be to reset the
default heap size that Java uses on your computer (this is independent
of amy Mathematica settings). Here is a way to do this, setting the
Jave heap size to 1 Gigabyte. But you must do this befor eloading
DatabaseLink:

Needs["JLink`"];

InstallJava[];

ReinstallJava[CommandLine -> "java", JVMArguments -> "-Xmx1g"];

LoadJavaClass["java.lang.Runtime"];

Print["The maximum memory for Java is currently
"<>ToString[java`lang`Runtime`getRuntime[][maxMemory[]]]]

And here is a link that tells you a bit about how to specify the java
heap settings:

http://javahowto.blogspot.com/2006/06/6-common-errors-in-setting-java-heap.html

But remember that this only resets the Java heap size.... if you try
to pull too much data from the database via DatabaseLink, you will
ultimately exceed it....

Hope this helps,

David
http://scientificarts.com/worklife

Albert Retey

unread,

Apr 27, 2010, 4:06:11 AM4/27/10

to

Hi,

> I'm new to databases.
> I want to put a big list in a database, to later work on it.
> It all worked fine, using the default hsqldb that comes in
> Mathematica, but I couldn't insert more than few thousand elements at
> a time, and then Link problems or java heap problems.
> Finally, I couldn't even open the connection. Java runs out of memory.
> Memory usage of javaw process is less than 300MB when it runs out.

the java vm runs with a fixed size of memory that it is allowed to use,
I think that something between 200 and 300 MB is the default so that is
what you see it runs out of. This setting can be adjusted, and I think
there has been an explanation of how to do it in a post some time ago.

> In any case, I thought that the databases were, or could be, stored in
> the disk, so they can really be big and be a real big advantage over
> the list. How can that be done?

As far as I remember the hsql default is to store tables in memory if
possible and writes just a log to disk from which it can regenerate the
data in memory. There is a special way to define tables to always be
written to disk. I don't exactly remember the details and don't know
whether that could help with your problem but you should find details in
the documentation for hsql.

I think if you try to store the huge list in one go (and read it back
that way) you would be better of to use a binary format to store the
list in one file. You could try the Dump (.mx) or Wolfram Data Exchange
(wdx) formats, which are both documented. In my opinion using a database
would only make sense if your data has some kind of structure and you
would want to work only on a subset of the huge list at a time that
would be easy to be extracted from the total data with a SQLSelect.

> If the question is too newbie and you feel I need to read a tutorial
> on something, hehe, feel free to send me a link to a good one, hehe

Maybe you want to explain some more details about your huge list, what
you want to do with it and why you need to store it on disk...

hth,

albert

Rui

unread,

Apr 27, 2010, 7:41:23 AM4/27/10

to

My immediate need is to store a lot of data away and later be able to
retreive only the parts I want without loading everything. I know that
that can be done with streams, but I figure I would end up doing a
mini database program myself with files and streams, hehe.
My main objective, however, is to keep on learning, and streams is
also something I've yet to play around with, so I'll probably give it
a try.

Right now I have lots of English literature that I've converted in
txt, created tables and statistics on number of appearances of each
word. More than 70 million words total, 5 million sentences that I
wanna be able to fetch as example quotes of all words it contains. So,
for example, I want to be able to query things like:
* The 1000;;1040 most used words and the number of times they appeared
* A random example of the list of examples of the word "rant"
* Other stuff that I may add in the near future

David Reiss!
Very useful. I had seen a code around to increase the java heap size
but it didn't work well. Yours, like a charm.
That still left me with the second problem, that the db was stored in
memory.
Like Albert said, I found in the documenation of the hsqldb that if I
created the table in raw sql with "CREATE CACHED TABLE....." then it
stores it in disk.
And it worked :D
So, adding the java heap thing and the cached thing solved the issued.
2 other minor issues came up because of that:
* it takes aaaages to connect to the database... It isn't using up
more than 200MB, but it still takes minutes to connect
* can barely do anything in batches, its very prone to errors even in
small batches.
Anyway, both of these I can live with. However, I downloaded MySQL to
see if it works better. I know nooothing about settings, but I was
able to do the minimum to have it apparently working
It is taking loooooonger than hsqldb to insert the stuff in, but it's
fast to connect, it's not using loots of memory (and I didn't have to
tell it if its in memory or in disk), and the batch thing is working
(for small batches).
I'll be using both until my feelings chose between the two

Thanks a lot.

Vince Virgilio

unread,

Apr 28, 2010, 1:57:38 AM4/28/10

to

On Apr 27, 4:06 am, David Reiss <dbr...@gmail.com> wrote:

SNIP

> the database). And of course when sending data to the database only
> send it in smaller sets of data. The general rule though is that if
> you try to send a vary large amount of data to a database all at one
> time (or retrieve a very large amount) you will run in to problems
> eventually with the Java memory allocation.

SNIP

Unfortunately, that undermines the high abstraction advertised by
Mathematica.

Mathematica should hide this.

I think it doesn't because of the mentality "everything is an
expression". Seems that Mathematica's greedy evaluation leaves no
breathing room for more careful consumption of large expressions (bad
manners). I've noticed similar tendencies in Workbench's syntax
parser. [*]

Vince Virgilio

[*] And I'm not convinced the cost/benefit ratio is right. Vim's
lightning fast handling of superficially similar effects is very
persuasive.

David Bailey

unread,

Apr 29, 2010, 2:51:29 AM4/29/10

to

Rui wrote:
> On Apr 27, 5:04 am, David Bailey <d...@removedbailey.co.uk> wrote:
>> Rui wrote:

>
> My immediate need is to store a lot of data away and later be able to
> retreive only the parts I want without loading everything. I know that
> that can be done with streams, but I figure I would end up doing a
> mini database program myself with files and streams, hehe.
> My main objective, however, is to keep on learning, and streams is
> also something I've yet to play around with, so I'll probably give it
> a try.
>
> Right now I have lots of English literature that I've converted in
> txt, created tables and statistics on number of appearances of each
> word. More than 70 million words total, 5 million sentences that I
> wanna be able to fetch as example quotes of all words it contains. So,
> for example, I want to be able to query things like:
> * The 1000;;1040 most used words and the number of times they appeared
> * A random example of the list of examples of the word "rant"
> * Other stuff that I may add in the near future

Have you established that you can perform the queries that you need
using the database? I think I'd be inclined to take a small portion of
your data, put it in the database and devise the queries that you would
need.

My inclination would be to start by trying to do some of those things
inside Mathematica. For example, you could have a list of
{{"word1",usage},{"word2",usage},......}

Sorted on usage, that would answer your usage queries efficiently.

Mathematica is also rather good at doing things with very long strings:

In[1]:= sss = StringJoin @@ ConstantArray["x", 200000];

In[2]:= StringLength[sss]

Out[2]= 200000

In[3]:= sss1 = sss <> "fred";

In[5]:= StringPosition[sss1, "fred"] // Timing

Out[5]= {0.015, {{200001, 200004}}}

Thus an in-memory solution to your problem is probably feasible,
particularly if you can run on 64-bit Mathematica.

David Bailey
http://www.dbaileyconsultancy.co.uk