Storage limits of sparsemapcontent


Zach A. Thomas

May 29, 2011, 3:33:43 PM
to sakai-kernel List
Hi. When we migrated the pilot at NYU to sparsemapcontent, some pages were lost with "encoded string too long" errors. When I went digging a little deeper, I found that the DataOutputStream writeUTF method used by StringType.java has a limit of 65,535 bytes (2^16 - 1) of encoded data per call. You can actually write more than this by splitting the data into smaller chunks and making multiple calls to writeUTF.

I went looking online for discussion of this problem. Here's how netbeans.org solved it: http://hg.netbeans.org/main/rev/6d07994bc971
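
The chunking idea above can be sketched roughly as follows. This is a minimal illustration, not the NetBeans patch and not the actual StringType.java code; the class ChunkedUtf and its methods are hypothetical names. It assumes a chunk-count prefix is acceptable, which changes the stored format and so would not read data written by plain writeUTF.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UTFDataFormatException;
import java.io.UncheckedIOException;

// Hypothetical helper for illustration only.
final class ChunkedUtf {
    // writeUTF rejects strings whose modified UTF-8 encoding exceeds 65535 bytes.
    // Each Java char encodes to at most 3 bytes, so 21845 chars always fit in one call.
    static final int MAX_CHARS_PER_CHUNK = 65535 / 3;

    static void writeLongUTF(DataOutputStream out, String s) throws IOException {
        int chunks = (s.length() + MAX_CHARS_PER_CHUNK - 1) / MAX_CHARS_PER_CHUNK;
        out.writeInt(chunks); // prefix so the reader knows how many writeUTF records follow
        for (int i = 0; i < chunks; i++) {
            int start = i * MAX_CHARS_PER_CHUNK;
            int end = Math.min(s.length(), start + MAX_CHARS_PER_CHUNK);
            out.writeUTF(s.substring(start, end));
        }
    }

    static String readLongUTF(DataInputStream in) throws IOException {
        int chunks = in.readInt();
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < chunks; i++) {
            sb.append(in.readUTF());
        }
        return sb.toString();
    }

    // Returns true when a single writeUTF call rejects the string as too long.
    static boolean plainWriteUTFFails(String s) {
        try (DataOutputStream out = new DataOutputStream(new ByteArrayOutputStream())) {
            out.writeUTF(s);
            return false;
        } catch (UTFDataFormatException e) {
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Writes and re-reads a string in memory using the chunked format.
    static String roundTrip(String s) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            writeLongUTF(out, s);
            out.flush();
            return readLongUTF(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String big = new String(new char[100_000]).replace('\0', 'a'); // 100,000 bytes encoded
        System.out.println("single writeUTF fails: " + plainWriteUTFFails(big));
        System.out.println("chunked round trip ok: " + roundTrip(big).equals(big));
    }
}
```

Splitting on char boundaries is safe here because writeUTF encodes each char separately, so a surrogate pair that straddles two chunks is still reassembled on read.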

Locally, I have tried this same fix on StringType.java, and it seems to work fine, but then I found out that blob columns in MySQL are also limited to 65,535 bytes (2^16 - 1)! The combined storage for all the properties on a node must be below this limit. So I modified the MySQL DDL to use mediumblob (up to 16 MB). This limitation doesn't surface on Oracle, where a blob can be up to 8 terabytes (wow).

The question for this list is whether we should take the netbeans approach and allow Strings over 64K bytes in the database, or somehow marshal/unmarshal these larger values to the filesystem?

In NYU's case, the properties which are this large are always sakai:pagecontent, which stores arbitrary HTML for pages. It's easy to imagine 64K byte and larger pages.

thanks,
Zach


Ian Boston

May 31, 2011, 5:54:56 AM
to sakai-...@googlegroups.com
Zach,
The intention was that the properties of a Content Item would never be
greater than 64K, since that would mean streaming significant amounts
of data in and out of Java objects. If Content Items are becoming
greater than 64K, then we should address that by using file bodies
which stream correctly rather than allowing unlimited property sizes.

The Sparse ContentManagerImpl is not sophisticated enough to allow
arbitrary property sizes up to TB in size without any overhead. That
was a positive decision, made to avoid lots of complexity. I still
think that was the right decision.

Why are you getting more than 64K in a ContentItem's properties?
That's a *big* object to be cached in memory; if there were millions
of them it would have a big impact on memory usage.
Ian


Zach Thomas

May 31, 2011, 11:51:02 AM
to Sakai Nakamura
It's sakai:pagecontent, which contains the HTML for any given group
page. They can get quite large.

Zach


Ian Boston

May 31, 2011, 12:03:20 PM
to sakai-...@googlegroups.com
Over 64K, they really should be a file.
Under 64K, they should be a property.

64K is a very large HTML page; I have a feeling you can fit Hamlet
into that provided you don't go wild on markup.

Ian

D. Stuart Freeman

May 31, 2011, 12:06:14 PM
to sakai-...@googlegroups.com
On Tue, May 31, 2011 at 05:03:20PM +0100, Ian Boston wrote:
> Over 64K they really should be a file.
> Under 64K, they should be a property
>
> 64K is a very large HTML page, I have a feeling you can fit Hamlet
> into that provided you don't go wild on markup.

I had to check: http://www.gutenberg.org/ebooks/1524
;)

--
D. Stuart Freeman
Georgia Institute of Technology


Chris Tweney

May 31, 2011, 12:17:24 PM
to sakai-...@googlegroups.com
IMHO the ContentManager should be the one to decide whether it should
store something in a file or a property. If you put that logic into the
calling code, then the caller needs to know a lot about underlying
storage mechanisms, and we'll have duplicated size checks scattered all
over the app.

-chris
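
A rough sketch of what such a centralized decision could look like, under heavy assumptions: SizeAwareStore, its methods, and the property key "example:title" are hypothetical names, and the in-memory maps merely stand in for the property table and for streamed file bodies. It does not address the driver, caching, or indexing concerns of a real implementation.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration: callers use put/get and never see where the value lives.
final class SizeAwareStore {
    // Assumed threshold. A real check would have to measure the size in whatever
    // encoding the property store actually writes (writeUTF uses modified UTF-8,
    // which can be larger than standard UTF-8 for some characters).
    static final int INLINE_LIMIT_BYTES = 65535;

    private final Map<String, String> inlineProperties = new HashMap<>();
    private final Map<String, byte[]> fileBodies = new HashMap<>(); // stand-in for files on disk

    void put(String key, String value) {
        byte[] encoded = value.getBytes(StandardCharsets.UTF_8);
        if (encoded.length <= INLINE_LIMIT_BYTES) {
            fileBodies.remove(key); // value may have shrunk since the last write
            inlineProperties.put(key, value);
        } else {
            inlineProperties.remove(key); // value may have grown since the last write
            fileBodies.put(key, encoded);
        }
    }

    String get(String key) {
        String inline = inlineProperties.get(key);
        if (inline != null) {
            return inline;
        }
        byte[] body = fileBodies.get(key);
        return body == null ? null : new String(body, StandardCharsets.UTF_8);
    }

    boolean isStoredAsFile(String key) {
        return fileBodies.containsKey(key);
    }

    public static void main(String[] args) {
        SizeAwareStore store = new SizeAwareStore();
        String bigPage = new String(new char[100_000]).replace('\0', 'a');
        store.put("example:title", "Group page");  // hypothetical small property
        store.put("sakai:pagecontent", bigPage);   // large HTML body
        System.out.println(store.isStoredAsFile("example:title"));
        System.out.println(store.isStoredAsFile("sakai:pagecontent"));
        System.out.println(store.get("sakai:pagecontent").equals(bigPage));
    }
}
```

The point of the shape is that the size check lives in exactly one place; calling code only ever calls put and get.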

Ian Boston

May 31, 2011, 12:30:36 PM
to sakai-...@googlegroups.com
That would be great, however, to do so would make the driver code
horribly complex, which is why the restriction is there. If you have a
look in the guts of Jackrabbit you get an idea just how expensive this
can be. I have to assume that the Jackrabbit team really do know what
they are doing, and have found the most elegant solution in this area.
They put it right at the bottom of their stack in the Bundle
Persistence manager that intelligently blocks up properties. Earlier
drivers in Jackrabbit imposed a similar 64K limit. One other thing to
note is that IIRC Jackrabbit used its schema to help it make those
decisions.

I don't think we have the resource to do this at the lower levels and
make it work... it's quite a large rewrite of the insert and get
methods in the drivers.

Ian

Chris Tweney

May 31, 2011, 1:19:07 PM
to sakai-...@googlegroups.com
Call me crazy here, but I think it's better to have that expensive,
complicated logic centralized in one low-level place than to have it
duplicated, with various levels of skill and correctness, across several
dozen different client components. If we don't do it in the storage
engine, then we're going to do it over and over again at the application
level. Or, we won't do it, and we'll have a bunch of bug reports that
come in from the real world when properties get above 64K.

64K is actually quite small for a real-world web page. Consider that
many users will create pages by pasting in from MS Word, where just a
couple of text pages can easily reach that size and larger.

-chris

Alan Marks

May 31, 2011, 4:19:27 PM
to sakai-...@googlegroups.com
A brief aside from the implementation details:

There is no question that real-world use will run into this limit and that long-term we need to find a way to let users save larger content. I created a 14-page Word doc, then copied the text and pasted it into TinyMCE, resulting in a 500 error. Thirteen pages worked. This is probably a pretty common scenario.

That said, you could make a case that it would be supporting bad design to allow such very long pages, but I could be accused of rationalizing. 

At any rate, because we're past feature-freeze and in ship-mode, the leads talked about this today and decided it would be too large and destabilizing to fix now. We're going to provide better messaging to the user, so that they can know when they've hit this limit. Sometimes you have to make tradeoffs to ship. This is one of those times. I've created the following Jiras:


Alan Marks
Sakai OAE Project Director
skype: skramnala

Ian Boston

Jun 1, 2011, 5:07:28 AM
to sakai-...@googlegroups.com
After the leads meeting I thought about this and I think it may be
possible to create a new data type that handles large values:
LongString.

This will have to be off by default since it may have a bad impact all
over the place.
On write, if a String is over a limit it will be written as a
LongString, which will be a reference to a file on disk. Once that
happens, it will be ignored for all sparse indexing, although it might
still be OK for Solr indexing.
When it's read it will come out as a LongString and, provided it's not
referenced anywhere in the Nakamura code base, it will make it all the
way out to JSON.

If it is referenced anywhere in the Nakamura code base it will cause a
ClassCastException (since a LongString can't be cast to a String and
String is final). That will randomly break random things and it's quite
likely that those breakages will be masked by other error handling,
which is why, if I can get this to work at all, it will be off by
default, turned on at your peril.

Also, once a big string is in the DB, the only way to convert it back
to a String will be to delete the property and re-create it. I haven't
tried to write this patch yet and I may have missed something in my
thought process that makes it impossible. The ClassCastException is the
real blocker, caused, in part, by abandoning the original design that
used coercion of data types rather than direct class casts.

Ian
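
The failure mode described above can be illustrated with a small sketch. No patch exists at this point in the thread, so everything here is an assumption: the LongString class shape, the tiny LIMIT_CHARS threshold, and the file path are all hypothetical and chosen only to show why a direct cast breaks while type-checked access does not.

```java
import java.util.HashMap;
import java.util.Map;

final class LongStringDemo {
    // Hypothetical stand-in for the proposed type: it holds a reference to a
    // file on disk rather than the text itself.
    static final class LongString {
        private final String location;

        LongString(String location) {
            this.location = location;
        }

        String getLocation() {
            return location;
        }
    }

    static final int LIMIT_CHARS = 100; // deliberately tiny limit for the demo

    // On write: values over the limit become a LongString reference.
    static Object toStored(String value, String fileLocation) {
        return value.length() > LIMIT_CHARS ? new LongString(fileLocation) : value;
    }

    // Mimics calling code that assumes every text property is a String.
    // Throws ClassCastException when the stored value is a LongString.
    static int lengthAssumingString(Map<String, Object> props, String key) {
        return ((String) props.get(key)).length();
    }

    // Type-checked access: inspects the runtime type instead of casting blindly.
    static String describe(Object stored) {
        if (stored instanceof String) {
            return (String) stored;
        }
        if (stored instanceof LongString) {
            return "stored on disk at " + ((LongString) stored).getLocation();
        }
        return String.valueOf(stored);
    }

    public static void main(String[] args) {
        Map<String, Object> props = new HashMap<>();
        String bigPage = new String(new char[500]).replace('\0', 'a');
        props.put("sakai:pagecontent", toStored(bigPage, "/hypothetical/path/page.html"));

        System.out.println(describe(props.get("sakai:pagecontent")));
        try {
            lengthAssumingString(props, "sakai:pagecontent");
        } catch (ClassCastException e) {
            System.out.println("direct cast failed: " + e.getClass().getSimpleName());
        }
    }
}
```

Any code path shaped like lengthAssumingString would start failing once a property crosses the limit, which is the reason given for keeping such a feature off by default.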
