Hi. When we migrated the pilot at NYU to sparsemapcontent, some pages were lost with "encoded string too long" errors. When I dug a little deeper, I found that the DataOutputStream writeUTF method used by StringType.java is limited to 65535 bytes (2^16 - 1) of encoded data per call. You can write more than this by splitting the data into smaller chunks and making multiple calls to writeUTF.
I went looking online for discussion of this problem. Here's how netbeans.org solved it: http://hg.netbeans.org/main/rev/6d07994bc971
Locally, I have tried this same fix on StringType.java, and it seems to work fine, but then I found out that blob columns in MySQL are also limited to 65535 bytes! The combined storage for all the properties on a node must be below this limit, so I modified the MySQL DDL to use mediumblob (up to 16M bytes). This limitation doesn't surface on Oracle, where a blob can be up to 8 terabytes (wow).
The question for this list is whether we should take the netbeans approach and allow Strings over 64K bytes in the database, or somehow marshal/unmarshal these larger values to the filesystem?
In NYU's case, the properties which are this large are always sakai:pagecontent, which stores arbitrary HTML for pages. It's easy to imagine 64K byte and larger pages.
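A minimal sketch of the chunking idea described above. It is not the actual StringType.java patch or the NetBeans change; the class and method names (`ChunkedUtf`, `writeLongUTF`, `readLongUTF`, `roundTrip`) and the chunk-count prefix are assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical helper; not the actual StringType.java or NetBeans code. */
final class ChunkedUtf {
    // writeUTF accepts at most 65535 encoded bytes per call, and one char needs
    // at most 3 bytes in modified UTF-8, so 21845 chars always fit in one call.
    static final int MAX_CHARS_PER_CHUNK = 65535 / 3;

    /** Writes a chunk count followed by one writeUTF call per chunk. */
    static void writeLongUTF(DataOutput out, String s) throws IOException {
        int chunks = (s.length() + MAX_CHARS_PER_CHUNK - 1) / MAX_CHARS_PER_CHUNK;
        out.writeInt(chunks);
        for (int start = 0; start < s.length(); start += MAX_CHARS_PER_CHUNK) {
            int end = Math.min(s.length(), start + MAX_CHARS_PER_CHUNK);
            out.writeUTF(s.substring(start, end));
        }
    }

    /** Reads the chunk count, then concatenates that many readUTF results. */
    static String readLongUTF(DataInput in) throws IOException {
        int chunks = in.readInt();
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < chunks; i++) {
            sb.append(in.readUTF());
        }
        return sb.toString();
    }

    /** Serializes and deserializes in memory, only to exercise the sketch. */
    static String roundTrip(String s) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            writeLongUTF(new DataOutputStream(bytes), s);
            byte[] data = bytes.toByteArray();
            return readLongUTF(new DataInputStream(new ByteArrayInputStream(data)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Note that the chunk-count prefix changes the serialized layout, so data already written with a single writeUTF call would not be readable by `readLongUTF` without a migration or a format marker.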
Zach,
The intention was that the properties of a Content Item would never be greater than 64K, since that would mean streaming significant amounts of data in and out of Java objects. If Content Items are becoming greater than 64K, then we should address that by using file bodies, which stream correctly, rather than allowing unlimited property sizes.
The Sparse ContentManagerImpl is not sophisticated enough to allow arbitrary property sizes up to TB in size without any overhead. That was a positive decision, made to avoid lots of complexity. I still think that was the right decision.
Why are you getting more than 64K in a ContentItem's properties? That's a *big* object to be cached in memory; if there were millions of them it would have a big impact on memory usage.
Ian
On 29 May 2011 20:33, Zach A. Thomas <z...@aeroplanesoftware.com> wrote:
> Hi. When we migrated the pilot at NYU to sparsemapcontent, some pages were lost with "encoded string too long" errors. When I went digging a little deeper, I found that the DataOutputStream writeUTF method used by StringType.java has a limit of 2^16 bytes per call. You can actually write more than this by splitting the data into smaller chunks and making multiple calls to writeUTF.
> I went looking online for discussion of this problem. Here's how netbeans.org solved it: http://hg.netbeans.org/main/rev/6d07994bc971
> Locally, I have tried this same fix on StringType.java, and it seems to work fine, but then I found out that blob columns in MySQL are also limited to 2^16 bytes! The combined storage for all the properties on a node must be below this limit. So I modified the MySQL ddl to use mediumblob (up to 16M bytes). This limitation doesn't surface on Oracle, where a blob can be up to 8 terabytes (wow).
> The question for this list is whether we should take the netbeans approach and allow Strings over 64K bytes in the database, or somehow marshal/unmarshal these larger values to the filesystem?
> In NYU's case, the properties which are this large are always sakai:pagecontent, which stores arbitrary HTML for pages. It's easy to imagine 64K byte and larger pages.
> thanks,
> Zach
> --
> You received this message because you are subscribed to the Google Groups "Sakai Nakamura" group.
> To post to this group, send email to sakai-kernel@googlegroups.com.
> To unsubscribe from this group, send email to sakai-kernel+unsubscribe@googlegroups.com.
> For more options, visit this group at http://groups.google.com/group/sakai-kernel?hl=en.
It's sakai:pagecontent, which contains the HTML for any given group page. They can get quite large.
Zach
On May 31, 4:54 am, Ian Boston <i...@tfd.co.uk> wrote:
> Zach,
> The intention was the the properties of a Content Item would never be greater than 64K, since that would mean streaming significant amounts of data in and out of Java objects. If Content Items are becoming greater than 64K, then we should address that by using file bodies which stream correctly rather than allowing unlimited property sizes.
> The Sparse ContentManagerImpl is not sophisticated enough to allow arbitarty property sizes upto TB in size without any overhead. That was a positive decision, made to avoid lots of complexity. I still think that was the right decision.
> Why are you getting more than 64K in a ContentItems properties? That's a *big* object to be cached in memory, if there were millions of them it would have a big impact on memory usage.
> Ian
-- D. Stuart Freeman Georgia Institute of Technology
IMHO the ContentManager should be the one to decide whether it should store something in a file or a property. If you put that logic into the calling code, then the caller needs to know a lot about underlying storage mechanisms, and we'll have duplicated size checks scattered all over the app.
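One way to picture the size decision living in a single place is sketched below. This is an illustration only, not the Sparse ContentManager API: `ContentStore`, `SizeRoutingWriter`, `RecordingStore`, and the 64K threshold constant are all hypothetical names and values chosen for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical storage seam; NOT the real Sparse ContentManager interface. */
interface ContentStore {
    void setProperty(String path, String name, String value);
    void writeBody(String path, String name, InputStream body);
}

/** Keeps the "property or file body?" decision in one place for all callers. */
final class SizeRoutingWriter {
    // Illustrative cutoff. A real one would also have to account for the
    // combined serialized size of every property on the node.
    static final int PROPERTY_LIMIT_BYTES = 64 * 1024;

    private final ContentStore store;

    SizeRoutingWriter(ContentStore store) {
        this.store = store;
    }

    /** Returns true when the value was stored as a streamed body. */
    boolean put(String path, String name, String value) {
        byte[] utf8 = value.getBytes(StandardCharsets.UTF_8);
        if (utf8.length < PROPERTY_LIMIT_BYTES) {
            store.setProperty(path, name, value);
            return false;
        }
        store.writeBody(path, name, new ByteArrayInputStream(utf8));
        return true;
    }
}

/** In-memory stand-in used only to exercise the sketch. */
final class RecordingStore implements ContentStore {
    final Map<String, String> properties = new HashMap<>();
    final Map<String, Integer> bodySizes = new HashMap<>();

    public void setProperty(String path, String name, String value) {
        properties.put(path + "@" + name, value);
    }

    public void writeBody(String path, String name, InputStream body) {
        try {
            byte[] buffer = new byte[8192];
            int total = 0;
            int read;
            while ((read = body.read(buffer)) != -1) {
                total += read; // count bytes without keeping them
            }
            bodySizes.put(path + "@" + name, total);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The sketch still builds the whole value in memory before routing it, so it only centralizes the check; it does not address the streaming and caching cost raised earlier in the thread.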
> On Tue, May 31, 2011 at 05:03:20PM +0100, Ian Boston wrote:
>> Over 64K they really should be a file.
>> Under 64K, they should be a property
>> 64K is a very large HTML page, I have a feeling you can fit Hamlet into that provided you dont go wild on markup.
> I had to check: http://www.gutenberg.org/ebooks/1524
> ;)
>> Ian
>> On 31 May 2011 16:51, Zach Thomas<zach.tho...@gmail.com> wrote:
>>> It's sakai:pagecontent, which contains the HTML for any given group page. They can get quite large.
>>> Zach
That would be great; however, doing so would make the driver code horribly complex, which is why the restriction is there. If you look in the guts of Jackrabbit you get an idea of just how expensive this can be. I have to assume that the Jackrabbit team really do know what they are doing and have found the most elegant solution in this area. They put it right at the bottom of their stack, in the Bundle Persistence manager, which intelligently blocks up properties. Earlier drivers in Jackrabbit imposed a similar 64K limit. One other thing to note is that, IIRC, Jackrabbit used its schema to help it make those decisions.
I don't think we have the resources to do this at the lower levels and make it work. It's quite a large rewrite of the insert and get methods in the drivers.
Ian
On 31 May 2011 17:17, Chris Tweney <ch...@media.berkeley.edu> wrote:
> IMHO the ContentManager should be the one to decide whether it should store something in a file or a property. If you put that logic into the calling code, then the caller needs to know a lot about underlying storage mechanisms, and we'll have duplicated size checks scattered all over the app.
> -chris
> On 5/31/11 9:06 AM, D. Stuart Freeman wrote:
>> On Tue, May 31, 2011 at 05:03:20PM +0100, Ian Boston wrote:
>>> Over 64K they really should be a file.
>>> Under 64K, they should be a property
>>> 64K is a very large HTML page, I have a feeling you can fit Hamlet into that provided you dont go wild on markup.
Call me crazy here, but I think it's better to have that expensive, complicated logic centralized in one low-level place than to have it duplicated, with various levels of skill and correctness, across several dozen different client components. If we don't do it in the storage engine, then we're going to do it over and over again at the application level. Or, we won't do it, and we'll have a bunch of bug reports that come in from the real world when properties get above 64K.
64K is actually quite small for a real-world web page. Consider that many users will create pages by pasting in from MS Word, where just a couple of text pages can easily reach that size and larger.
There is no question that real-world use will run into this limit and that, long-term, we need to find a way to let users save larger content. I created a 14-page Word doc, then copied the text and pasted it into TinyMCE, resulting in a 500 error. Thirteen pages worked. This is probably a pretty common scenario.
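A failure like this could in principle be caught before the write, with a check that measures the value the way writeUTF does and reports the limit in plain terms. The sketch below assumes the limit in play is the 65535-byte writeUTF limit on a single value; the class name `PropertySizeCheck`, its methods, and the message wording are hypothetical, and the combined size of all properties on a node is not considered.

```java
/** Hypothetical pre-save check; not part of Nakamura or sparsemapcontent. */
final class PropertySizeCheck {
    // Largest encoded length that DataOutputStream.writeUTF accepts per call.
    static final int WRITE_UTF_LIMIT = 65535;

    /** Length in bytes of s in the modified UTF-8 encoding used by writeUTF. */
    static long modifiedUtf8Length(String s) {
        long length = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c >= 0x0001 && c <= 0x007F) {
                length += 1; // plain ASCII except NUL
            } else if (c > 0x07FF) {
                length += 3; // includes each half of a surrogate pair
            } else {
                length += 2; // NUL and U+0080..U+07FF
            }
        }
        return length;
    }

    /** Returns null when the content fits, otherwise a user-facing message. */
    static String checkPageContent(String html) {
        long length = modifiedUtf8Length(html);
        if (length <= WRITE_UTF_LIMIT) {
            return null;
        }
        return "This page is too large to save: " + length
                + " bytes, and the limit is " + WRITE_UTF_LIMIT + " bytes.";
    }
}
```

A caller would run `checkPageContent` on the submitted HTML and show the returned message instead of attempting the save when it is non-null.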
That said, you could make a case that it would be supporting bad design to allow such very long pages, but I could be accused of rationalizing.
At any rate, because we're past feature-freeze and in ship-mode, the leads talked about this today and decided it would be too large and destabilizing to fix now. We're going to provide better messaging to the user, so that they can know when they've hit this limit. Sometimes you have to make tradeoffs to ship. This is one of those times. I've created the following Jiras:
> On 5/31/11 9:30 AM, Ian Boston wrote:
>> That would be great, however, to do so would make the driver code horribly complex, which is why the restriction is there. If you have a look in the guts of Jackrabbit you get an idea just how expensive this can be. I have to assume that the Jackrabbit team really do know what they are doing, and have found the most elegant solution in this area. They put it right at the bottom of their stack in the Bundle Persistence manager that intelligently blocks up properties. Earlier divers in Jackrabbit imposed a similar 64K limit. One other thing to note is that IIRC Jackrabbit used its schema to help it make those decisions.
>> I dont think we have the resource to do this at the lower levels and make it work.... its quite a large re-write of the insert and get methods in the drivers.
>> Ian
>> On 31 May 2011 17:17, Chris Tweney<ch...@media.berkeley.edu> wrote:
>>> IMHO the ContentManager should be the one to decide whether it should store something in a file or a property. If you put that logic into the calling code, then the caller needs to know a lot about underlying storage mechanisms, and we'll have duplicated size checks scattered all over the app.
>>> -chris
>>> On 5/31/11 9:06 AM, D. Stuart Freeman wrote:
>>>> On Tue, May 31, 2011 at 05:03:20PM +0100, Ian Boston wrote:
>>>>> Over 64K, they really should be a file.
>>>>> Under 64K, they should be a property.
>>>>> 64K is a very large HTML page. I have a feeling you can fit Hamlet into that, provided you don't go wild on markup.
>>>>> On 31 May 2011 16:51, Zach Thomas<zach.tho...@gmail.com> wrote:
>>>>>> It's sakai:pagecontent, which contains the HTML for any given group page. They can get quite large.
>>>>>> Zach
> --
> You received this message because you are subscribed to the Google Groups "Sakai Nakamura" group.
> To post to this group, send email to sakai-kernel@googlegroups.com.
> To unsubscribe from this group, send email to sakai-kernel+unsubscribe@googlegroups.com.
> For more options, visit this group at http://groups.google.com/group/sakai-kernel?hl=en.
--
Alan Marks Sakai OAE Project Director skype: skramnala
After the leads meeting I thought about this, and I think it may be possible to create a new data type that handles large strings: LongString.
This will have to be off by default, since it may have a bad impact all over the place. On write, if a String is over a limit it will be written as a LongString, which will be a reference to a file on disk. Once that happens, it will be ignored for all sparse indexing, although it might still be OK for Solr indexing. When it's read it will come out as a LongString and, provided it's not referenced anywhere in the Nakamura code base, it will make it all the way out to JSON.
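To make the write path described above concrete, here is a minimal sketch, not the actual patch (none had been written at this point). Everything in it is an assumption for illustration: the class names LongStringSketch and LongString, the helper names, the file naming, and the 65535-byte threshold, which is borrowed from the DataOutputStream.writeUTF limit. It measures standard UTF-8, which can differ slightly from the modified UTF-8 that writeUTF emits.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.UUID;

final class LongStringSketch {
    // Assumed threshold: the largest encoded length a single writeUTF call accepts.
    static final int MAX_INLINE_BYTES = 65535;

    /** Hypothetical type: holds only a reference to the file that carries the text. */
    static final class LongString {
        final Path location;

        LongString(Path location) {
            this.location = location;
        }

        /** Reads the full text back from disk. */
        String read() {
            try {
                return new String(Files.readAllBytes(location), StandardCharsets.UTF_8);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }

    /** Returns the String itself when it fits inline, otherwise a LongString reference. */
    static Object toStoredValue(String value, Path storeDir) {
        byte[] utf8 = value.getBytes(StandardCharsets.UTF_8);
        if (utf8.length <= MAX_INLINE_BYTES) {
            return value; // small enough to stay an ordinary property
        }
        try {
            Path file = storeDir.resolve(UUID.randomUUID() + ".txt");
            Files.write(file, utf8);
            return new LongString(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Creates a throwaway directory to stand in for the on-disk store. */
    static Path newTempStore() {
        try {
            return Files.createTempDirectory("longstring");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path dir = newTempStore();
        String big = new String(new char[70000]).replace('\0', 'x'); // 70,000 ASCII characters
        System.out.println(toStoredValue("short page", dir).getClass().getSimpleName());
        System.out.println(toStoredValue(big, dir).getClass().getSimpleName());
    }
}
```

Because the decision is made on the encoded byte length rather than String.length(), a page with many non-ASCII characters would cross the threshold at fewer characters than an ASCII-only page.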
If it is referenced anywhere in the Nakamura code base it will cause a ClassCastException (since a LongString can't be cast to a String, and String is final). That will break random things, and it's quite likely that those breakages will be masked by other error handling, which is why, if I can get this to work at all, it will be off by default, turned on at your peril.
Also, once a big string is in the DB, the only way to convert it back to a String will be to delete the property and re-create it. I haven't tried to write this patch yet, and I may have missed something in my thought process that makes it impossible. The ClassCastException is the real blocker, caused in part by abandoning the original design, which used coercion of data types rather than direct class casts.
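The difference between a direct cast and coercion can be shown with a small sketch. This is illustrative only: CoercionSketch, its in-memory LongString stand-in, and the asString helper are hypothetical names, not part of Nakamura or sparsemapcontent. It shows why code that does `(String) value` breaks once a property comes back as a different type, and how a single coercion helper would absorb the change.

```java
import java.util.HashMap;
import java.util.Map;

final class CoercionSketch {
    /** Stand-in for the proposed type; here the text is simply held in memory. */
    static final class LongString {
        private final String text;

        LongString(String text) {
            this.text = text;
        }

        String load() {
            return text;
        }
    }

    /** Coercion: callers ask for a String and never see the storage type. */
    static String asString(Object value) {
        if (value == null) {
            return null;
        }
        if (value instanceof String) {
            return (String) value;
        }
        if (value instanceof LongString) {
            return ((LongString) value).load();
        }
        return String.valueOf(value);
    }

    /** Direct cast: reports whether (String) value throws ClassCastException. */
    static boolean directCastFails(Object value) {
        try {
            String ignored = (String) value;
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        Map<String, Object> props = new HashMap<>();
        props.put("sakai:pagecontent", new LongString("<p>a very large page</p>"));

        Object raw = props.get("sakai:pagecontent");
        System.out.println("direct cast fails: " + directCastFails(raw));
        System.out.println("coerced value: " + asString(raw));
    }
}
```

With coercion in one place, callers would not need to know which properties had crossed the size limit; with direct casts, every call site is a potential failure point.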
Ian
On 31 May 2011 21:19, Alan Marks <alanma...@sakaifoundation.org> wrote:
> A brief aside from the implementation details:
> There is no question that real-world use will run into this limit and that long-term we need to find a way to let users save larger content. I created a 14 page Word doc, then copied the text and pasted it into TinyMCE, resulting in a 500 error. Thirteen pages worked. This is probably a pretty common scenario.
> That said, you could make a case that it would be supporting bad design to allow such very long pages, but I could be accused of rationalizing.
> At any rate, because we're past feature-freeze and in ship-mode, the leads talked about this today and decided it would be too large and destabilizing to fix now. We're going to provide better messaging to the user, so that they can know when they've hit this limit. Sometimes you have to make tradeoffs to ship. This is one of those times. I've created the following Jiras:
> https://jira.sakaiproject.org/browse/KERN-1919
> https://jira.sakaiproject.org/browse/SAKIII-3162
> https://jira.sakaiproject.org/browse/KERN-1920
> On Tue, May 31, 2011 at 10:19 AM, Chris Tweney <ch...@media.berkeley.edu> wrote:
>> Call me crazy here, but I think it's better to have that expensive, complicated logic centralized in one low-level place than to have it duplicated, with various levels of skill and correctness, across several dozen different client components. If we don't do it in the storage engine, then we're going to do it over and over again at the application level. Or, we won't do it, and we'll have a bunch of bug reports that come in from the real world when properties get above 64K.
>> 64K is actually quite small for a real-world web page. Consider that many users will create pages by pasting in from MS Word, where just a couple of text pages can easily reach that size and larger.
>> -chris