> Can you please provide explanation why do you think that "unicharset_extractor.exe produces wrong and uncomplete files"?
Because this is definitely wrong:
90
NULL 0 NULL 0
A 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0 # A [41 ]A
B 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0 # B [42 ]A
C 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0 # C [43 ]A
D 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0 # D [44 ]A
E 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0 # E [45 ]A
F 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0 # F [46 ]A
G 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0 # G [47 ]A
H 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0 # H [48 ]A
I 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0 # I [49 ]A
J 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0 # J [4a ]A
K 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0 # K [4b ]A
L 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0 # L [4c ]A
M 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0 # M [4d ]A
N 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0 # N [4e ]A
O 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0 # O [4f ]A
P 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0 # P [50 ]A
Q 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0 # Q [51 ]A
R 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0 # R [52 ]A
S 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0 # S [53 ]A
T 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0 # T [54 ]A
U 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0 # U [55 ]A
V 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0 # V [56 ]A
W 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0 # W [57 ]A
X 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0 # X [58 ]A
Y 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0 # Y [59 ]A
Z 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0 # Z [5a ]A
a 3 0,255,0,255,0,32767,0,32767,0,32767 NULL 1 0 0 # a [61 ]a
b 3 0,255,0,255,0,32767,0,32767,0,32767 NULL 2 0 0 # b [62 ]a
c 3 0,255,0,255,0,32767,0,32767,0,32767 NULL 3 0 0 # c [63 ]a
d 3 0,255,0,255,0,32767,0,32767,0,32767 NULL 4 0 0 # d [64 ]a
1.)
The "other_case" column should contain the ID of the other-case letter.
The lowercase letters point correctly to their uppercase letters.
But the uppercase letters all have a value of -1, which is wrong.
They should hold the ID of the corresponding lowercase letter.
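
A rough sketch of how the missing other_case values could be patched by script instead of by hand. It assumes the column layout seen in the listing above (character, properties, metrics, script, other_case, ...), that a line's ID is its position after the count line, and that Python's `str.swapcase()` gives the right partner, which only holds for simple letters. The function name `fix_other_case` is made up for this example.

```python
def fix_other_case(entries):
    """Fill in other_case (assumed to be the 5th field) where it is -1.

    `entries` are the unicharset lines WITHOUT the leading count line,
    so a line's position in the list is taken as its unichar ID.
    """
    ids = {line.split(" ", 1)[0]: i for i, line in enumerate(entries)}
    fixed = []
    for line in entries:
        fields = line.split(" ")
        if len(fields) > 4 and fields[4] == "-1":
            partner = fields[0].swapcase()
            # Only patch letters whose case partner is really in the set.
            if partner != fields[0] and partner in ids:
                fields[4] = str(ids[partner])
        fixed.append(" ".join(fields))
    return fixed


entries = [
    "NULL 0 NULL 0",
    "A 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0 # A [41 ]A",
    "a 3 0,255,0,255,0,32767,0,32767,0,32767 NULL 1 0 0 # a [61 ]a",
]
for line in fix_other_case(entries):
    print(line)
```

In this three-entry sample the "A" line ends up with other_case 2, the ID of "a". Entries without a case partner in the set are left untouched.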
2.)
The script name is always NULL.
It should be Latin or Common.
3.)
All the min/max values are missing.
Every line only has the placeholder values 0, 255 or 32767.
That is 10 columns without real data!
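
To find the affected lines automatically, something like the following sketch could be used. The ten field names are how I understand the unicharset man page and should be treated as an assumption, as should the helper names `parse_metrics` and `has_placeholder_metrics`. The placeholder string is simply the value that appears on every line of the extractor output above.

```python
# Names of the ten values in the third column (assumed, see lead-in).
METRIC_NAMES = (
    "min_bottom", "max_bottom", "min_top", "max_top",
    "min_width", "max_width", "min_bearing", "max_bearing",
    "min_advance", "max_advance",
)
# The value that unicharset_extractor wrote on every line above.
PLACEHOLDER = "0,255,0,255,0,32767,0,32767,0,32767"


def parse_metrics(field):
    """Turn the comma-separated third column into a name -> value dict."""
    values = [int(v) for v in field.split(",")]
    if len(values) != len(METRIC_NAMES):
        raise ValueError(f"expected 10 values, got {len(values)}")
    return dict(zip(METRIC_NAMES, values))


def has_placeholder_metrics(line):
    """True if the third column holds only the placeholder ranges."""
    fields = line.split(" ")
    return len(fields) > 2 and fields[2] == PLACEHOLDER


bad = "A 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0 # A [41 ]A"
good = "A 5 52,68,216,255,100,216,0,17,98,231 Latin 2 0 15 A # A [41 ]A"
print(has_placeholder_metrics(bad), has_placeholder_metrics(good))
print(parse_metrics(good.split(" ")[2]))
```

This only detects the problem. The real values come from training data, so they cannot be reconstructed from the unicharset file alone.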
4.)
The last column, "normed_form", is missing.
The '#' starts a comment.
But when this unicharset is read, the '#' is misinterpreted as the "normed_form".
This column should mostly contain the same letter as the first column.
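
A minimal sketch of how the missing column could be added, assuming a complete entry has 8 space-separated fields before the " # " comment and that copying the first column is an acceptable normed_form. That is only a guess that fits simple letters, and the function name `add_normed_form` is hypothetical. It does not handle an entry whose own character is '#'.

```python
def add_normed_form(line):
    """Append the first column as normed_form if that column is missing."""
    # Split off the trailing comment; sep is "" if there is no comment.
    data, sep, comment = line.partition(" # ")
    fields = data.split(" ")
    # 7 fields means everything up to "mirror" is there but normed_form is not.
    if len(fields) == 7:
        fields.append(fields[0])
    return " ".join(fields) + sep + comment


broken = "A 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0 # A [41 ]A"
print(add_normed_form(broken))
```

Lines that already have 8 fields, and the short "NULL 0 NULL 0" line, pass through unchanged.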
Here you see a unicharset extracted from a traineddata file with all columns filled correctly:
A 5 52,68,216,255,100,216,0,17,98,231 Latin 2 0 15 A # A [41 ]A
B 5 62,68,216,255,91,227,0,27,106,227 Latin 23 0 102 B # B [42 ]A
etc.
a 3 58,65,186,200,85,164,0,26,97,185 Latin 15 0 2 a # a [61 ]a
b 3 58,64,216,255,87,180,0,25,100,200 Latin 102 0 23 b # b [62 ]a
Result:
The unicharset_extractor tool is very buggy.
I have to edit everything by hand.
So my question remains:
Where do I find detailed documentation of the unicharset file format?