New record too long


Alexei Vinidiktov

Mar 21, 2009, 3:01:36 AM3/21/09
to web2py Web Framework
Hello,

I'm just beginning to learn web2py. I've bought the web2py manual and
am reading Chapter 1.

I've defined a model through the admin interface:

db = SQLDB('sqlite://storage.db')
db.define_table('contacts',
    SQLField('name', 'string', length=20),
    SQLField('phone', 'string', length=12))

When I go to the admin interface to add some records, I can add names
that are written with Latin characters just fine, but when I try to
enter a name written with Cyrillic characters, I get an error that
says that the name is too long, although it is not.

For example, if I enter the name Олег Зимний, which is 11 characters
long, I get that error.

If I enter a short name such as Олег, the record is added fine.

The maximum length is set to 20 in the table definition and names with
Latin characters whose length is up to 20 characters can be added
fine.

Is it a web2py bug? If it is, can it be easily fixed?

--
Alexei Vinidiktov

Yarko Tymciurak

Mar 21, 2009, 4:06:30 AM3/21/09
to web...@googlegroups.com
Hi Alexei -

web2py uses UTF-8 internally; this means Cyrillic characters will encode as 2 bytes per character (have a look at http://en.wikipedia.org/wiki/UTF-8#Rationale_behind_UTF-8.27s_design).


I copy/pasted "Олег Зимний" from your note into a development copy (sqlite) of the PyCon2009 conference server...
>>> s=db(db.contacts.id>0).select()
>>> s[0].name
'\xd0\x9e\xd0\xbb\xd0\xb5\xd0\xb3 \xd0\x97\xd0\xb8\xd0\xbc\xd0\xbd\xd0\xb8\xd0\xb9'

As you can see - 2-bytes per character ...
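The character/byte distinction can be checked directly. A minimal sketch, written in modern Python 3 syntax for clarity (the sessions in this thread are Python 2, where a plain string is a byte string):

```python
# In Python 3, str is Unicode, so len() counts characters;
# encoding to UTF-8 reveals the storage cost in bytes.
name = "Олег Зимний"
print(len(name))                   # 11 characters
print(len(name.encode("utf-8")))   # 21 bytes: 10 Cyrillic chars * 2 + 1 ASCII space
```

This is why an 11-character Cyrillic name trips a length=20 limit that is measured in bytes.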

SQLField defaults are shown on p. 138 - 'string' with length=32 is the default. Try that and see if it works for you.

Hope that helps.

Regards,
Yarko

2009/3/21 Alexei Vinidiktov <alexei.v...@gmail.com>

Alexei Vinidiktov

Mar 21, 2009, 5:28:43 AM3/21/09
to web...@googlegroups.com
Hi Yarko,

Thanks for your help.

I've tried setting the name field length to 32, and it worked fine
with a name such as Олег Зимний.

That was to be expected, though.

The question is, in what units should the field length be measured -
bytes or characters?

I think it should be measured in characters, because you never know
how many bytes a string with international characters will take. I
understand it may not be possible, so I'd like to know what the
practical advice is. Should I assign a string field double the number
of bytes of the longest name (or other information stored in the
field)? For instance, if I want a string field to hold a maximum of
20 characters, I should set it to 40 units (bytes). Is that correct?

I think this approach is error prone, because one can forget to do so
every time one adds a string field to a db definition.

On 21 March 2009 at 15:06, Yarko Tymciurak <yar...@gmail.com> wrote:
--
Alexei Vinidiktov

Yarko Tymciurak

Mar 21, 2009, 6:15:45 AM3/21/09
to web...@googlegroups.com
Hi Alexei -

Since UTF-8 is variable length, and data is cheap, you can be generous. When a field is too short, you invariably have at least an unhappy customer by some measure.

You saw your test name was not 22 bytes, but 21 ... so 2x is mildly conservative. You should be OK with something like:

# in Python 2 with a UTF-8 source file, len('л') == 2 (bytes per Cyrillic char)
charlen = lambda n: n * len('л')

db.define_table('mytable',
    ...
    SQLField('something', length=charlen(32)),
    ...
)


This may be a good pattern to use regardless...  what do you think?

AchipA

Mar 21, 2009, 7:01:38 AM3/21/09
to web2py Web Framework
Characters vs bytes is possible (see unicode objects in Python), but
characters are problematic in databases (think record sizes, index
structures, collation, etc.). That's why most databases either 'cheat'
by using byte counts in some places or suffer from a feature/
performance standpoint. Also, there might be encodings that do not
have a predefined maximum number of bytes per character, so you cannot
predict the number of required bytes (a special case, I admit, but
once you go down the multibyte-char path it's all or nothing).

These are also the reasons why a lot of people with large databases
prefer single-byte encodings *inside* the database. So, for example,
if you deal with Russian, you could use code page 1251 on the table
level (note that you can still talk to the database in Unicode; it's
just a question of storage!). The important thing is to have the data
in the correct format in the DB and avoid any conversions at all if
possible (leave it to the database or the browser).
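To illustrate the single-byte-storage point, a small sketch (modern Python 3 syntax; cp1251 is the Windows Cyrillic code page):

```python
# The same Russian text costs one byte per character in a Cyrillic
# single-byte code page, versus two bytes per character in UTF-8.
text = "Олег Зимний"
as_cp1251 = text.encode("cp1251")
as_utf8 = text.encode("utf-8")
print(len(as_cp1251))   # 11 bytes - one per character
print(len(as_utf8))     # 21 bytes
assert as_cp1251.decode("cp1251") == text   # round-trips losslessly
```

The trade-off, of course, is that a single-byte code page can only hold the scripts it was designed for, which is exactly why it won't work for a multilingual application.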

On Mar 21, 10:28 am, Alexei Vinidiktov <alexei.vinidik...@gmail.com>
wrote:
> [...]

Alexei Vinidiktov

Mar 21, 2009, 8:42:01 AM3/21/09
to web...@googlegroups.com
The thing is, the project I'm intending to use web2py for is a web
application for language learners, and I need to be sure that as many
languages as possible are treated correctly by the application.

So I don't think it would be safe to use a Russian character for
calculating the length of a field, as in charlen = lambda n:
n*len('л').

I'm not an advanced Python programmer, so if I'm wrong, please correct me.

On 21 March 2009 at 17:15, Yarko Tymciurak <yar...@gmail.com> wrote:
--
Alexei Vinidiktov

Alexei Vinidiktov

Mar 21, 2009, 8:59:56 AM3/21/09
to web...@googlegroups.com
Unfortunately, due to the nature of the web application I'm planning
on using web2py for, I can't use a single-byte encoding for the
database or most tables.

The tables are going to store strings in many different languages of the world.

I was hoping that web2py could transparently communicate with
databases that are UTF8 encoded and that I would be able to do
operations on strings retrieved from databases without thinking about
their encodings.

Does web2py retrieve strings from databases as unicode Python objects
or as byte strings? I assume that it's the latter and that the byte
strings are UTF-8 encoded. Is that so?

I'll have to look into that much more closely.

On 21 March 2009 at 18:01, AchipA <attila...@gmail.com> wrote:
--
Alexei Vinidiktov

AchipA

Mar 21, 2009, 10:37:57 AM3/21/09
to web2py Web Framework
>I was hoping that web2py could transparently communicate with
>databases that are UTF8 encoded and that I would be able to do
>operations on strings retrieved from databases without thinking about
>their encodings.

That is the goal. It will never be 100%, as it is somewhat database/
version dependent. As you yourself write, it uses UTF-8 encoded
strings (which is the Python 2.x norm, and this won't change to unicode
objects at least until web2py support for Python 3.0 arrives) and uses
UTF-8 data in the database. That being said, a quick glance at the
IS_LENGTH validator shows that it might not be using len() entirely
correctly; I think Massimo should take a look at it.

>>> len('a')
1
>>> len('á')
2
>>> len(u'á')
1
>>> len('á'.decode('utf-8'))
1


On Mar 21, 1:59 pm, Alexei Vinidiktov <alexei.vinidik...@gmail.com>
wrote:
> [...]

AchipA

Mar 21, 2009, 10:48:02 AM3/21/09
to web2py Web Framework
Oops, wrong copy/paste; here is the correct one:

In [1]: validator = IS_LENGTH(1)

In [2]: validator('a')
Out[2]: ('a', None)

In [3]: validator('aa')
Out[3]: ('aa', 'too long!')

In [4]: validator('á')
Out[4]: ('\xc3\xa1', 'too long!')

In [5]: validator('á'.decode('utf-8'))
Out[5]: (u'\xe1', None)

In [6]: validator('ж'.decode('utf-8'))
Out[6]: (u'\u0436', None)

I say Alexei found a bug :)
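If the validator is indeed measuring bytes, a character-aware variant might look something like the sketch below (modern Python 3 syntax; IS_CHAR_LENGTH is a hypothetical name for illustration, not web2py's actual API):

```python
class IS_CHAR_LENGTH(object):
    """Hypothetical validator sketch: limits length in *characters*,
    not bytes, by decoding UTF-8 byte strings before measuring."""

    def __init__(self, size, error_message="too long!"):
        self.size = size
        self.error_message = error_message

    def __call__(self, value):
        # Decode byte strings so len() counts characters, not bytes.
        text = value.decode("utf-8") if isinstance(value, bytes) else value
        if len(text) <= self.size:
            return (value, None)
        return (value, self.error_message)

validator = IS_CHAR_LENGTH(1)
print(validator("á"))                  # one character: accepted
print(validator("á".encode("utf-8")))  # 2 bytes but 1 character: also accepted
```

The key point is only where the length is measured; the (value, error) return convention mirrors the sessions above.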

Alexei Vinidiktov

Mar 21, 2009, 11:23:34 AM3/21/09
to web...@googlegroups.com
AchipA, thanks for taking the time to investigate the issue!

2009/3/21 AchipA <attila...@gmail.com>:


> [...]


--
Alexei Vinidiktov

Yarko Tymciurak

Mar 21, 2009, 2:45:48 PM3/21/09
to web...@googlegroups.com
2009/3/21 Alexei Vinidiktov <alexei.v...@gmail.com>


[...]


From the link I sent (http://en.wikipedia.org/wiki/UTF-8#Rationale_behind_UTF-8.27s_design): 3 bytes cover the "basic multilingual plane", which contains all characters in common use; four bytes are needed only for characters that are rarely used in practice.

I think you can probably start with the 3-byte assumption; most times that will be more than you need, so statistically the rarely used characters either will not come into play at all, or will fit regardless. You can collect data (my guess is that *3 will be too much anyway).
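The per-character byte costs are easy to verify (a sketch in modern Python 3 syntax):

```python
# UTF-8 byte cost per character: 1 for ASCII, 2 for Cyrillic (and most
# other alphabetic scripts), 3 for the rest of the Basic Multilingual
# Plane (e.g. CJK), and 4 only beyond the BMP.
for ch in ("a", "л", "中", "😀"):
    print(repr(ch), len(ch.encode("utf-8")))
# 'a' 1, 'л' 2, '中' 3, '😀' 4
```

So length = 3 * characters is a safe byte budget for any BMP text, which matches the recommendation above.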

Looking forward to hearing more about this interesting project!

Regards,
Yarko 

Yarko Tymciurak

Mar 21, 2009, 2:47:50 PM3/21/09
to web...@googlegroups.com
2009/3/21 Alexei Vinidiktov <alexei.v...@gmail.com>


[...]

I think you can count on everything being handled UTF-8 encoded (but remember, we are using other people's code). Massimo, who designed all this, will be able to comment better.

Yarko Tymciurak

Mar 21, 2009, 2:51:15 PM3/21/09
to web...@googlegroups.com
2009/3/21 AchipA <attila...@gmail.com>


>I was hoping that web2py could transparently communicate with
>databases that are UTF8 encoded and that I would be able to do
>operations on strings retrieved from databases without thinking about
>their encodings.

That is the goal. It will never be 100% as it is somewhat dabase/
version dependant. As you yourself write, it uses utf-8 encoded
strings (which is the python 2.x norm and this won't change to unicode
objects at least until web2py support for Python 3.0 arrives) and uses
utf8 data in the database. That being told, a quick glance at the
IS_LENGTH validator shows that it might not be entirely correctly
using len(), I think Massimo should take a look at it.

agree... 
 

>>> len('a')
1
>>> len('á')
2
>>> len(u'á')
1
>>> len('á'.decode('utf-8'))
1

thanks...
 
-y

Yarko Tymciurak

Mar 21, 2009, 2:52:59 PM3/21/09
to web...@googlegroups.com


2009/3/21 Alexei Vinidiktov <alexei.v...@gmail.com>
:-)
 




Alexei Vinidiktov

Mar 22, 2009, 7:34:54 AM3/22/09
to web...@googlegroups.com
On 22 March 2009 at 1:45, Yarko Tymciurak <yar...@gmail.com> wrote:

> [...]

Thanks for your input, Yarko. I've read the articles you mentioned and
I understand UTF8 better now. You are right about the 3 byte
assumption. It's a pretty safe bet for my purposes.

I hope the project I'm working on will shape up in the coming
months, and that I'll have news to share about the progress.

Anyway, as I'm only beginning to work with web2py, I'm going to have
quite a few questions to ask.

[...]

--
Alexei Vinidiktov

Yarko Tymciurak

Mar 23, 2009, 1:40:59 AM3/23/09
to web...@googlegroups.com
2009/3/21 AchipA <attila...@gmail.com>


[...]

Yes - it would look like you are right....  I'll let Massimo digest this better after PyCon...

- Yarko 