I would like to add a new type of sequence in Tryton that will generate random
but unique number.
A simple solution will be to generate random number and compare it with
already generated number (stored in a table). But there is the birthday
paradox [1] that reduces the speed of this method.
An other way will be to store in a table every possible value (if the length
of the number is fixed) and pick/remove one randomly. But here the generation
of all values can be slow.
Is there anybody that knows a better way?
[1] http://en.wikipedia.org/wiki/Birthday_paradox
--
Cédric Krier
B2CK SPRL
Rue de Rotterdam, 4
4000 Liège
Belgium
Tel: +32 472 54 46 59
Email: cedric...@b2ck.com
Jabber: cedric...@b2ck.com
Website: http://www.b2ck.com/
twitter: http://twitter.com/cedrickrier
identi.ca: http://identi.ca/cedrickrier
I'm not really sure why this feature would be useful, but I saw
something similar in OpenBravoPOS, the id on their tables are characters
instead of integers and they use UUID values as the IDs.
I don't really like that solution, specially for the fact that the key
size is bigger and thus, the database size would grow a lot with many
rows, but maybe it would help you in this use case...
Cheers.
C�dric Krier escribi�:
> Hi,
>
> I would like to add a new type of sequence in Tryton that will generate random
> but unique number.
> A simple solution will be to generate random number and compare it with
> already generated number (stored in a table). But there is the birthday
> paradox [1] that reduces the speed of this method.
>
> An other way will be to store in a table every possible value (if the length
> of the number is fixed) and pick/remove one randomly. But here the generation
> of all values can be slow.
>
> Is there anybody that knows a better way?
>
>
> [1] http://en.wikipedia.org/wiki/Birthday_paradox
>
>
--
Carlos Perell� Mar�n
[P+] SERVICIOS PROFESIONALES
http://www.pemas.es
mailto:car...@pemas.es || mailto:car...@gnome.org
Este mensaje y los ficheros anexos son confidenciales. Los mismos
contienen informaci�n reservada de la empresa que no puede ser
difundida. Si usted ha recibido este correo por error, tenga la
amabilidad de eliminarlo de su sistema y avisar al remitente mediante
reenv�o a su direcci�n electr�nica; no deber� copiar el mensaje ni
divulgar su contenido a ninguna persona.
Su direcci�n de correo electr�nico junto a sus datos personales forman
parte de un fichero titularidad de PEMAS Servicios Profesionales, S.L.
cuya finalidad es la de mantener el contacto con Ud. De acuerdo con la
Ley Org�nica 15/1999, usted puede ejercitar sus derechos de acceso,
rectificaci�n, cancelaci�n y, en su caso, oposici�n enviando una
solicitud por escrito, acompa�ada de una fotocopia de su DNI dirigida a
PEMAS Servicios Profesionales, S.L. C/ Santos Justo y Pastor, 72 - 5,
C.P. 46022 de Valencia (Espa�a).
I don't want to use UUID because as you said they are too big.
The useful of this kind of sequence (only number) is to not have a leak of
information. Like the number of generated document or if there is a hole
between two documents etc.
How I see the functionnality:
- adding a selection field on ir.sequence with ('ordered', 'random')
- hide number_next, number_increment and padding if random is selected
- add and show random_size
So user could choose between ordered or random sequence for document.
But for now, the main issue is the generation.
> I would like to add a new type of sequence in Tryton that will generate random
> but unique number.
For what is it needed?
> A simple solution will be to generate random number and compare it with
> already generated number (stored in a table). But there is the birthday
> paradox [1] that reduces the speed of this method.
Birthday paradox is strongly related to the limited quantity of available
numbers.
I don't know, for what purpose it is needed. Does it have to be unique for
the machine, for the database, for the server, for universe?
Perhaps it would be possible to use something like the Message-ID for news, that
also has to be unique. Combining a unique part with a random part could save a
lot of overhead.
BTW: As soon as we will have Tryton work as a mail server it would be
desirable to have a (unique) domain anyway.
> An other way will be to store in a table every possible value (if the length
> of the number is fixed) and pick/remove one randomly. But here the generation
> of all values can be slow.
Seems to me like a huge table which is a lot of overhead. And it won't be
random, if you know numbers taken already from this table.
--
Mathias Behrle
MBSolutions
Gilgenmatten 10 A
D-79114 Freiburg
Tel: +49(761)471023
Fax: +49(761)4770816
http://mbsolutions.selfip.biz
UStIdNr: DE 142009020
PGP/GnuPG key availabable from any keyserver, ID: 0x89BCA161
We can use UUID like it is already done in calendar modules.
And normally, Tryton will not need to generate Message-ID as it is not
intended to be a MTA.
>
> > An other way will be to store in a table every possible value (if the length
> > of the number is fixed) and pick/remove one randomly. But here the generation
> > of all values can be slow.
>
> Seems to me like a huge table which is a lot of overhead. And it won't be
> random, if you know numbers taken already from this table.
The randomness is for external people as I explain on previous email.
> I don't know, for what purpose it is needed. Does it have to be
> unique for the machine, for the database, for the server, for
> universe?
>
> Perhaps it would be possible to use something like the
> Message-ID for news, that also has to be unique. Combining a unique
> part with a random part could save a lot of overhead.
Depending on "how unique" and "how random" the data has to be, I agree
with Mathias suggestion.
Some possible algorithms:
* sequence + date + time + domain
* sequence + random.random() + date + time + domain
* sha1(sequence + random.random() + date + time + domain)
Where "sequence" is a ir.sequence with (hidden) increment of 1.
SHA-1 is still considered collision-free (AFIAK) thus this would give
unique values. (Otherwise another member of the sha-family shoulkd do well.)
The hash-value can be represented as
- hexdigest (100% overhead)
- base64 (33% overhead)
- base85 (25% overhead - http://en.wikipedia.org/wiki/Ascii85)
- baseE91 (14% overhead http://base91.sourceforge.net/)
You may get other interesting answers when asking on comp.lang.python :-)
--
Schönen Gruß - Regards
Hartmut Goebel
Dipl.-Informatiker (univ.), CISSP, CSSLP
Goebel Consult
Spezialist für IT-Sicherheit in komplexen Umgebungen
http://www.goebel-consult.de
Monatliche Kolumne: http://www.cissp-gefluester.de/
Goebel Consult mit Mitglied bei http://www.7-it.de
> The randomness is for external people as I explain on previous email.
For this case a pre- or postfixing the date should not be a problem. So
if using real random values, one could store them, keep them for a day
or two and then clear the old values.
Another possibility would be to shorten the value:
* date + md5(sequence + random)[:16]
The problem is not to check the uniqueness, we will use postgresql for that
with an index.
The problem is the scallability of the solution, yours doesn't scale better
then my first one. It will just need a lot of sequence request per day.
But a timestamp is perhaps the solution with a wait if there is a collision.
The sequence will have a timestamp precision field and a last timestamp used.
> The problem is not to check the uniqueness, we will use postgresql
> for that with an index. The problem is the scallability of the
> solution, yours doesn't scale better then my first one. It will just
> need a lot of sequence request per day.
I suggest asking at comp.lang.python. There are a lot of wise guys
hanging around there, so you should get some more useful ideas there.
(If you are unable to post there, I'll forward your answer.)
> But a timestamp is perhaps the solution with a wait if there is a
> collision. The sequence will have a timestamp precision field and a
> last timestamp used.
Please rethink whether it is necessary to record the last timestamp. If
precisstion is small enough, there time will change between every call.
An addition could be to include a (smaller) sequence counter, too. Thus
an attacker would have to guess both the counter and the correct timestamp.
(So to summarise my suggestion: extend the current sequence by a
timestamp as you described and use both in conjunction.)
If the precision is configurable, we must check it.
>
> An addition could be to include a (smaller) sequence counter, too. Thus
> an attacker would have to guess both the counter and the correct timestamp.
Don't understand what an attacker can do ?
About the scalability of this approach :
Let's say we pick an integer. On my 32bit laptop, sys.maxint gives me:
2147483647, which is roughly 2*10**9.
So this means that the odds of generating a number already in the
sequence is 1 in 10 when 2*10**8 (200 million) numbers have been
generated. One in ten is not bad this means that the the creation of
the next item in the sequence is on average 10% slower than if there
was no "collision".
To give an idea of how much "time" is needed to generate a sequence
with 2*10**8 numbers: It takes 5479 years if 100 numbers are generated
each day (2*10**8 / (100*365) gives 5479).
So scalability may become a problem if one want to generate more than
10,000 number per day over 50 years (10,000 number a day reach the 10%
slowdown after 54 years) or more than 100,000 per day over 5 years.
The only use case I would see for such numbers is a big cable/water/
electricity/phone company that want to generate statements for their
customers.
But if we want to scale more at the expense of being a bit less random
would be to use a variant of the "store in a table every possible
value" solution:
- Generate a randomized table with the number from 1 to 10k.
- Consume table
- Generate a table with number from 10k+1 to 20k.
- ...
Another variant is to use two such "possible values tables": one that
gives which intervals are still available (1to 10k, 10k+1 to 20k, etc)
and the other giving the available values in the currently active
interval.
Bertrand
May be I could help letting you figure the problem.
> Hi,
>
> I would like to add a new type of sequence in Tryton that will generate random
> but unique number.
> A simple solution will be to generate random number and compare it with
> already generated number (stored in a table). But there is the birthday
> paradox [1] that reduces the speed of this method.
>
> An other way will be to store in a table every possible value (if the length
> of the number is fixed) and pick/remove one randomly. But here the generation
> of all values can be slow.
>
> Is there anybody that knows a better way?
>
Compare it with a box of numbered ball. The diference with the two ways as you propose here, is:
1) The first algorith is as having another box with infinite numbers (limited in a range), and you pickup from it box a ball and them look for a place in a numbered site, if the site is busy, them you repeat the previous step.
2) The second algorith is as having previously full the array, and them pickup the number from it.
Well clearly, the first have the problem of the colitions, that increase as the site is filling, and the second have the problem
of the space that could be very high if the range is big.
As you can see the problem, is not the algorith, instead the range. As an example, if the range are the integer between 1 to 100, the problema as you have pick up 50% of the number in the first method is that the next ball have a chance of 50% of be an elegible ball, and the probabilitie will decrease very fast as you pick up more balls. In this case, I preffer the second algorith.
But if you have a range of 1 to 10^32, the problem is that if you use the second method your computer could don't have enouf me
mory to do that.
I suggest, that you use the both methods, as a function of the size of the memory of the computer.
>
> [1] http://en.wikipedia.org/wiki/Birthday_paradox
>
As I understand the problem, the paradox is part of the problem as you explain, is more similar to the firts algorith, but have nothing in relation to the second algorith.
I suggest that you use any of the random generator that have python, my personal experience is that make a new one could be as usefull as make a new language, that is not the problem.
Regards.
--
Juan Fernando Jaramillo
Gte Tecnología
MIG Internacional
www.miginternacional.com
I propose to add a selection on ir.sequence that defines a type:
- incremental (current one and default)
- timestamp numeric
- timestamp hexanumeric
With x more fields for timestamp configuration:
- precision (the precision in number of decimal after second)
- offset (offset to remove since epoch)
- last (hidden) (last sequence generated to prevent collision)