অংশগ্রহণকারী = horizontal line
vertical line
অং
শ
গ্র
হণ
কা
রী
The following Python script does it for you:
#!/usr/bin/python
#-*- coding:utf8 -*-
import sys, codecs
word = unicode(sys.argv[1],'utf-8')
for num in range(0,len(word)):
    print word[num].encode('utf-8')
Here is how to execute it on the command line:
./split.py অংশগ্রহণকারী
--
Regards,
Debayan Banerjee
I am assuming you are not interested in a syllable-wise breakup.
--
Regards,
Debayan Banerjee
I thought your input would be a word, not individual letters.
What exactly are you trying to do?
--
Regards,
Debayan Banerjee
Ah! The script works even for this word, but not exactly as you need.
You need each consonant separated out together with its overlapping
vowel signs. I have to think about it a little bit.
--
Regards,
Debayan Banerjee
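
One way to keep a consonant together with its overlapping vowel signs is to group characters by their Unicode properties instead of printing one code point per line. The sketch below is only an illustration in Python 3, not the script from this thread: the function name split_clusters is made up, and it assumes that combining marks (categories Mn, Mc, Me) belong to the preceding character and that a virama (canonical combining class 9 in the scripts discussed here) binds the following consonant into a conjunct. Joiner characters and script-specific exceptions are not handled.

```python
import unicodedata

def split_clusters(text):
    """Split text into base characters, each with its attached signs."""
    clusters = []
    after_virama = False  # True when the previous character was a virama
    for ch in text:
        category = unicodedata.category(ch)
        # Mn/Mc/Me are combining marks: vowel signs, anusvara, virama, ...
        is_mark = category in ("Mn", "Mc", "Me")
        # A letter right after a virama is treated as part of the conjunct.
        joins_conjunct = after_virama and category.startswith("L")
        if clusters and (is_mark or joins_conjunct):
            clusters[-1] += ch
        else:
            clusters.append(ch)
        # Assumption: the virama is the character with combining class 9.
        after_virama = unicodedata.combining(ch) == 9
    return clusters

if __name__ == "__main__":
    for cluster in split_clusters("ಕನ್ನಡ"):
        print(cluster)
```

Spaces come out as clusters of their own, so a caller that wants one word at a time would split on whitespace first.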
>> I forgot to tell you that the Python program needs an input file, i.e.
>> an open() of text.txt or text.doc has to be added.
>> I don't know how to add that to the script since I am not familiar with Python.
>> Just now I was able to run the script in CMD.exe on MS Windows; using the
>> string "testing" (English) works well. But when I tried to paste Kannada or
>> any Indic script, it appears as ????? in cmd.exe and also in output.txt.
I think this problem has more to do with Unicode support on the WinXP
command line than with the script.
I am not an expert in WinXP as such, but I think you can start cmd.exe
with "/U" as an option for Unicode.
I am sure someone here will be able to help you with the Unicode input.
--
Sincere regards,
Pranava Swaroop
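#!/usr/bin/python
#-*- coding:utf8 -*-
import sys, codecs
# Open the file as UTF-8 so each item is a Unicode character, not a byte (Python 2)
f = codecs.open(sys.argv[1], encoding='utf-8')
lines = f.readlines()
for words in lines:
    # strip() drops the trailing newline so it is not printed as a "letter"
    for letters in words.strip():
        print letters.encode('utf-8')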
Save the above script as split.py.
Then run chmod +x split.py.
Then run it like this: ./split.py data.txt, where data.txt is your text file.
--
Regards,
Debayan Banerjee
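
For readers on Python 3, the same file-based idea can be sketched as below. This is an assumption-laden illustration rather than the script posted above: the function name split_file and the second (output file) argument are made up, and both files are assumed to be UTF-8. Writing the result to a file with an explicit encoding, instead of printing it, may sidestep the ????? that a Windows console shows for Indic text.

```python
import sys

def split_file(in_path, out_path):
    """Write every non-whitespace character of in_path on its own line."""
    with open(in_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            for ch in line:
                if not ch.isspace():
                    dst.write(ch + "\n")

if __name__ == "__main__":
    if len(sys.argv) == 3:
        split_file(sys.argv[1], sys.argv[2])
    else:
        print("usage: split3.py INPUT.txt OUTPUT.txt")
```

The file name split3.py in the usage message is only a placeholder. Like the original script, this splits by code point, so vowel signs end up on their own lines.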
Dear Sriranga,
Here is a tool, which is a simple modified version of one of our
tesseract post-processors.
http://lime-ocr.googlecode.com/files/h2v-splitter.zip
I just uploaded it in case it is of any use for your work. If you need
any modifications for your use, please let me know.
The source and a simple help text are included in the zip.
Regards
Nishad
On 1/14/2010 11:18 AM, 74yrs old wrote:
> Debayan,
> Any progress? I may kindly be updated.
> With regards,
> -sriranga(77yrsold0
>
> On Tue, Dec 8, 2009 at 4:01 PM, Debayan Banerjee <deba...@gmail.com
> <mailto:deba...@gmail.com>> wrote:
>
>
> 2009/12/8 74yrs old <withbl...@gmail.com
> <mailto:withbl...@gmail.com>>:
> > Debayan,
> > I am giving example of word below:
> > word = ಕನ್ನಡದಲ್ಲಿ ಮಾತಾಡಿ
>
> Ah! The script works for even this word but not exactly. You need
> consonants+overlapping vowels separated out. I have to think about it
> a little bit.
>
>
> --
> Regards,
> Debayan Banerjee
>
>
Dear Sriranga,
I couldn't respond in time because I'm traveling. Here's an updated
version.
http://lime-ocr.googlecode.com/files/h2v-split-2.zip
Please go through the provided readme: the command-line parameters have
changed, and a single tool can now split text into words or
characters. The cleaner has more options, such as removing punctuation.
I'm sorry to say that I couldn't spend enough time to test it properly.
I'll update and fix it (if there are any errors) once I'm back.
For the tesseract trainer GUI and other tesseract issues, please seek
help from the ninjas of the indic-ocr community. I myself have never tried
this GUI.
If your development is heading in a hopeful direction, then I humbly request
you to use tesseract-ocr version 3. It gives better results for
Malayalam (according to one of our clients, for typewritten documents
it's near 100%, whereas in v2 it's hardly 85%). Structurally all Dravidian
languages are the same, so I think for Kannada you can move ahead with
tesseract-ocr 3. (I'm not so sure about Hindi and other Devanagari
scripts.)
Please feel free to get back to me if you need anything.
Regards
Nishad
Dear Sriranga,
I don't have any Kannada language files generated for tesseract 3.
I'm sorry about lime-ocr. I guess you used the Python script called
lime-ocr.py from trunk. Please don't worry about the errors. That's
only the source, not the distribution. I kept the source in trunk to
ensure it stays strictly in the open domain.
I have already mailed my guys in the office to create a package. They will
be sending it soon. Some of our clients are using lime-ocr in production
environments and it gives significant performance. I never get enough
time to create proper documentation for end users or a proper
setup package; that is the main reason for its absence as a single
package. Lime OCR is nothing but a GUI. In all of our projects we keep
tesseract untouched and use some pre-processing and
post-processing tools to produce the expected output.
Sorry about 'ninja', it's just jargon. I mean the most experienced and
adventurous people in the indic-ocr group. Here is a hint about the word:
http://en.wikipedia.org/wiki/Ninja
Text Split is not exactly the tool you require. You need some
trainable, pattern-based tool for character splitting. But for word
splitting you can use it. For example, it can't split out a letter along
with its vowel combination. I guess if we create some patterns (with some
pain, of course) we can use them for many Indian languages. Transliteration
can easily be implemented for these training files, and the job becomes handy.
I'll try to experiment with Kannada once I'm back.
Please keep me posted.
Regards
Nishad
Dear Sriranga,
Here comes your solution
For split_word_20012010102036.txt
h2v-split.exe abarahanews5.txt word 1
(Just word split with new-lines and tabs removed)
For split_word_20012010102221.txt
h2v-split.exe abarahanews5.txt word 2
(Word split with new-lines, tabs and punctuations removed)
For split_word_20012010102336.txt
h2v-split.exe A-kan-utf8.txt word 1
(Only word split with new-lines and tabs removed)
I'm attaching a Kannada word list that may be useful for you. It has
around 60000 words.
All the best, and please keep me posted.
I'll be busy until 10 pm, but please don't hesitate to write.
Wishes & Regards
Nishad
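
The two word modes described above can be imitated in a few lines of Python 3. This is a guess at the behaviour based only on the descriptions in this message, not the h2v-split source: the function name split_words and the remove_punctuation flag (standing in for modes 1 and 2) are hypothetical.

```python
import unicodedata

def split_words(text, remove_punctuation=False):
    """Return the words of text as a list, one word per item."""
    if remove_punctuation:
        # Replace every Unicode punctuation character (categories P*) with a space.
        text = "".join(
            " " if unicodedata.category(ch).startswith("P") else ch
            for ch in text
        )
    # str.split() with no arguments splits on any run of spaces, tabs, or newlines.
    return text.split()
```

Writing "\n".join(split_words(text)) to a UTF-8 file would give one word per line. Note that replacing punctuation with spaces also breaks hyphenated words apart, which may or may not match what the real tool does.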
Dear Sriranga,
Lime OCR is a single installer; it doesn't require Python. I'll upload
whatever version is available (which I'm expecting by lunch break) soon.
Please don't waste your time on that Python source. Give me until the
lunch break.
Regards
Nishad
Dear Sriranga,
First of all, Lime OCR is only a GUI for tesseract.
tesseract-bin-win-r319.7z contains the tesseract 3 binaries for Windows. I
uploaded our production build in case somebody needs it. It's
built from pure tesseract code without any modification.
It seems that there is some confusion regarding Lime OCR and tesseract. We
use the tesseract engine extensively for some of our projects related to
digital archiving. For some of our clients we optimized Tesseract-GUI,
and we were forced to switch to Windows since some scanner drivers are
missing on Linux. Finally, when it was functioning perfectly, we decided to
give it back to the community.
The name Lime-OCR has been around since the beginning of 2008. At that time
we were using a Qt application, which has now been partially replaced with
this open source Python script.
I couldn't find enough time to clean up Lime-OCR for general use,
because there are many complicated scripts which are only useful for
certain purposes, such as post-processors and image pre-enhancers for some
typewriters and the like. Beyond that, we need to provide a
useful help resource in order to make it digestible for ordinary users.
We haven't even finished testing it on different Windows versions.
Anyway, your attempt to use the script was an inspiration. So here
we deliver our Lime-OCR to the general public for the first time.
There are many known issues, such as some GTK bugs and certain logical errors.
There are no help files, and the setup is not properly prepared to handle
updates. After all, we don't know whether it'll work or not.
If you find any issues, please post them in the issues section of the project
site rather than by mail. Once I'm back in the office, I'll assign somebody to
maintain the project.
Please remember, it's not yet complete software. I'm just posting it as a
pre-alpha.
http://lime-ocr.googlecode.com/files/LimeOCR-PreAlpha-241-Installer.exe
About the trainer GUI: I don't think it is working as it is supposed to. Use
http://code.google.com/p/jtesseract/ to create a small training data
set with a maximum of 10 words, and OCR that same image. Repeat this step
with training data generated by the Trainer GUI, and compare the results.
combine.exe has some bugs. Try to train tesseract 3 with simple sample
data and see if there is any improvement.
Regards
Nishad TR
NB: Lime-OCR only supports tesseract 3, but if you wish you can replace
tesseract.exe and tessdata with version 2 and replace langlist.exe with
the one provided in the v2 folder of the install location.
I'll be online until 9 PM, so please get back to me if needed.
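
The comparison suggested above (same image, two sets of training data) is easier to judge with a number than by eye. The following Python 3 sketch is one possible way to do that, assuming you have typed the correct text of the test image yourself. The function names and labels are made up, and the score is difflib's similarity ratio, which is a rough stand-in for OCR accuracy rather than a standard metric.

```python
import difflib

def similarity(expected, recognized):
    """Return a 0..1 similarity ratio between the correct text and OCR output."""
    matcher = difflib.SequenceMatcher(None, expected, recognized, autojunk=False)
    return matcher.ratio()

def compare(ground_truth, outputs):
    """Score each labelled OCR output against the ground truth, best first."""
    scores = {label: similarity(ground_truth, text)
              for label, text in outputs.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

For example, compare(truth, {"jtesseract": text_a, "trainer-gui": text_b}) would list whichever training data produced output closer to the typed text first. The strings must be read from the OCR output files as UTF-8.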
Dear Sriranga,
Please send me a screenshot of the GUI without the language selection
option. The installed directory is where you installed Lime-OCR.
It may be C:\Program Files\Lime-OCR
You can find a replacement program in the v2 folder. (But I never tried that.)
Regards
Nishad
Dear Sriranga,
First uninstall Lime-OCR, then delete your install directory, i.e.
"C:\Program Files\LIME OCR".
After that, simply open CMD (it will start at a prompt in your profile folder),
then enter this command:
del .lime-ocrrc
Install a fresh Lime OCR, then try to start it.
If it still fails, please get back to me.
Regards
Nishad
Dear Indu,
There is a combine executable, probably with a lot of bugs, and the
.traineddata it generated was not useful.
But you can try it; I only have the Windows executable.
You can download the source and build a combine binary on Linux.
Regards
Nishad