Unrecognized characters in the traineddata model

roberty...@gmail.com

Aug 15, 2017, 1:47:10 AM

to tesseract-ocr

Hello,

I have extracted all the characters and their ID numbers from chi_sim.traineddata and stored them in a text file. An excerpt follows:

0
1    Joined
2    |Broken|0|1
3    S
4    D
5    F
6    8
7    7
8    0
9    K
10    O
11    U
12    H
13    E
14    I
15    4
16    5
17    1
18    9
19    &
20    C
21    W
22    N
23    _
24    P
25    M
26    T
27    V
28    R
29    L
30    A
31    Y
32    2
33    J
34    B
35    G
36    3
37    6
38    Z
39    X
40    Q
41    '
42    +
43    -
44    .
45    #
46    e
47    v
48    a
49    m
50    i
51    z
52    o
53    l
54    s
55    h
56    n
57    d
58    g
59    y
60    u
61    王
62    汝
63    敏
64    邹
65    立
66    健
67    熊
...
...
4013    扔
4014    嗨
4015    髋
4016    「
4017    [
4018    』
4019    瀵
4020    〕
4021    掺
4022    |"|0|2
4023    |"|1|2
4024    rn
4025    |m|0|2
4026    |m|1|2
4027    in
4028    cl
4029    |d|0|2
4030    |d|1|2
4031    rm
4032    |rm|0|2
4033    |rm|1|2
4034    nn
4035    |nn|0|2
4036    |nn|1|2
4037    ri
4038    |n|0|2
4039    |n|1|2
4040    |h|0|2
4041    |h|1|2
4042    |u|0|2
4043    |u|1|2
4044    |m|0|3
4045    |m|1|3
4046    |m|2|3
4047    |H|0|2
4048    |H|1|2
4049    |H|0|3
4050    |H|1|3
4051    |H|2|3
4052    |w|0|2
4053    |w|1|2
4054    |W|0|2
4055    |W|1|2
4056    fi
4057    |k|0|2
4058    |k|1|2
4059    ki
4060    |ki|0|2
4061    |ki|1|2
4062    |in|0|2
4063    |in|1|2
4064    tl
4065    th
...

I can recognize most of the characters, such as the Han characters and the Latin alphabet. But I cannot make sense of some entries, such as 'Joined' and '|Broken|0|1' at the top of the file, and |"|0|2 and |m|0|2 near the end.
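To make the different kinds of entries easier to see, here is a minimal sketch that sorts the lines of such a dump into rough groups. It only looks at the shape of each entry. The function names and the labels are my own (hypothetical), and reading the pipe-delimited entries as "text, index, count" is an assumption taken from the pattern in the list above, not something confirmed in this thread.

```python
import re

# Assumed shape of the pipe-delimited entries: |text|index|count
FRAGMENT_RE = re.compile(r"\|(.+)\|(\d+)\|(\d+)")

# The two named entries seen at the top of the dump.
SPECIAL = {"Joined", "|Broken|0|1"}


def parse_line(line):
    """Split one dump line into (id, entry); the entry may be empty."""
    parts = line.split(None, 1)
    entry = parts[1].strip() if len(parts) > 1 else ""
    return int(parts[0]), entry


def classify(entry):
    """Return a coarse, shape-based label for one entry."""
    if entry == "":
        return "empty"
    if entry in SPECIAL:
        return "special"
    if FRAGMENT_RE.fullmatch(entry):
        return "fragment"
    if len(entry) > 1:
        return "multi-char"
    return "single-char"


if __name__ == "__main__":
    sample = ["0", "1    Joined", "2    |Broken|0|1", "61    王",
              "4024    rn", "4025    |m|0|2", "4060    |ki|0|2"]
    for line in sample:
        uid, entry = parse_line(line)
        print(uid, repr(entry), classify(entry))
```

Running this over the whole text file should show how many entries fall into each group, which narrows the question down to the "special", "fragment" and "multi-char" ones.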

Can you explain what these entries mean? For example:
4059    ki
4060    |ki|0|2
4061    |ki|1|2
4062    |in|0|2
4063    |in|1|2
and so on

Thanks a lot.

roberty...@gmail.com

Aug 17, 2017, 9:45:11 PM

to tesseract-ocr

I have debugged the code and found that the special characters 'Joined' and '|Broken|0|1' are added while the unicharset file is generated.

But what is the function of these characters? Can anyone tell me at which stage of the training process they play a role? I can't find it. Thanks a lot.

As for the other special characters, such as 'cl', '|d|0|2', and '|d|1|2', what is their function? Are they added in the combine_lang_model stage?
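One thing that can be checked directly from the dump is whether the pipe-delimited entries always come in complete sets. The sketch below groups them by their text and last number and reports any set with a missing index. It assumes the second field is a zero-based index and the third is the number of pieces, which is only inferred from runs such as |m|0|3, |m|1|3, |m|2|3 in the list. The function names are hypothetical.

```python
import re
from collections import defaultdict

# Assumed layout: |text|index|count, with index running from 0 to count-1.
FRAGMENT_RE = re.compile(r"\|(.+)\|(\d+)\|(\d+)")


def group_fragments(entries):
    """Map (text, count) to the set of indices seen for that pair."""
    groups = defaultdict(set)
    for entry in entries:
        m = FRAGMENT_RE.fullmatch(entry)
        if m:
            text, index, count = m.group(1), int(m.group(2)), int(m.group(3))
            groups[(text, count)].add(index)
    return dict(groups)


def missing_indices(groups):
    """Return, for each incomplete set, the indices that are absent."""
    result = {}
    for (text, count), seen in groups.items():
        absent = sorted(set(range(count)) - seen)
        if absent:
            result[(text, count)] = absent
    return result


if __name__ == "__main__":
    entries = ["cl", "|d|0|2", "|d|1|2", "|m|0|3", "|m|1|3", "|m|2|3"]
    groups = group_fragments(entries)
    print(groups)
    print(missing_indices(groups))
```

If every set turns out to be complete, that would at least be consistent with the index/count reading, though it does not say at which training stage the entries are added.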

Can you help me?

Sincere thanks.

On Tuesday, August 15, 2017 at 1:47:10 PM UTC+8, roberty...@gmail.com wrote:

roberty...@gmail.com

Aug 17, 2017, 9:54:41 PM

to tesseract-ocr

Any other information about these special characters would also help me. If you know anything about them, please leave a reply.

Thanks.

On Friday, August 18, 2017 at 9:45:11 AM UTC+8, roberty...@gmail.com wrote:
