Non-UTF8 comments in source code files


Josh Hansen

Jan 28, 2010, 2:30:15 PM
to ocropus
I noticed that some source files, most notably
ocr-voronoi/voronoi-pageseg.cc, contain a large number of code comments
that are not in UTF-8 or any other character set I can identify. As
these comments might prove useful for newbies trying to understand the
code (read: me), I wonder if someone could tell me how those comments
are encoded.

Thank you very much for any assistance
- Josh

Thomas Breuel

Feb 4, 2010, 3:33:03 AM
to ocr...@googlegroups.com
That code was donated by Koichi Kise; I assume the comments are in Japanese.

Tom


Dave Sampson

Feb 13, 2010, 8:47:52 AM
to ocr...@googlegroups.com
Hey Folks,

I am wondering if ocropus can handle three use cases.

1. A user has handwritten notes on lined paper and wants to digitize
them into editable text. The user scans the notes into an acceptable
format and runs ocropus on the generated file. The result is a txt
file of the interpreted characters.

How might a training process look for this use case?


2. A user has collected a drawer full of receipts. The user wishes to
digitize these to gather useful information for a database or
spreadsheet application. The user scans the receipts, which vary in
size, length and printed font.

Ocropus is used to generate a tab-delimited text file with the
following layout for a receipt:

* Header: name of location, address and phone number
* Body: item description and price as two columns (tab- or
character-delimited)
* Taxes: any applicable taxes or surcharges
* Final value: usually printed in larger characters


3. A user has a series of paper bills or PDF bills and wishes to get an
editable version where data can be extracted. Ocropus is pointed towards
the PDF or scanned version of the bill. For example a cell phone / cable
bill may include the following layout elements:
* Summary: Summary of the whole bill and amount due including phone
usage and cable expenses.
* Phone summary: Summary of phone usage for all phones on the bill
* Detailed phone: Breakdown of phone usage by user (let's say 3 users).
This will include phone number called, duration, start / stop times,
cost of call, etc.
* Cable charges: explanation of charges including base package and
special purchases such as movies.

These are the three main use cases I am looking at to help
individuals better track their financial habits. I am interested in
whether anyone has achieved any of the above, and how. If not, is it
theoretically possible, and how might one approach the problem? And
finally, if I will never achieve the above with ocropus, where might I
continue my hunt for an open source solution?


Thanks all.

Josh Hansen

Feb 23, 2010, 2:25:32 AM
to ocropus
I converted the file to UTF-8 and included translations from Google
Translate. Since I don't yet have a way of publishing my Mercurial
branch, the new version of the file can be accessed at
http://joshhansen.net/files/cpp/voronoi-pageseg.cc

Thanks!
- Josh
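
The thread does not say which encoding the original comments used or how
the conversion was done. For anyone facing the same problem, here is a
minimal sketch of one possible approach, assuming the comments are in a
common legacy Japanese encoding. The candidate list (EUC-JP, Shift_JIS)
is a guess, and the function name decode_with_candidates is
hypothetical. A successful decode is not proof of the right encoding, so
the output should still be checked by eye.

```python
from typing import Sequence, Tuple

# Assumption: the file is UTF-8 or one of two common legacy Japanese
# encodings. The thread never names the actual encoding.
CANDIDATE_ENCODINGS = ("utf-8", "euc_jp", "shift_jis")


def decode_with_candidates(
    raw: bytes, candidates: Sequence[str] = CANDIDATE_ENCODINGS
) -> Tuple[str, str]:
    """Return (text, encoding_name) for the first candidate that decodes
    the whole input without error."""
    for name in candidates:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue  # this candidate cannot represent the bytes; try the next
    raise ValueError("none of the candidate encodings decoded the input")


if __name__ == "__main__":
    import sys

    # Usage: python convert.py input.cc output.cc
    src, dst = sys.argv[1], sys.argv[2]
    with open(src, "rb") as f:
        text, used = decode_with_candidates(f.read())
    with open(dst, "w", encoding="utf-8", newline="") as f:
        f.write(text)
    print(f"decoded as {used}, wrote UTF-8 to {dst}")
```

The order of the candidates matters: strict UTF-8 is tried first because
legacy-encoded Japanese text rarely happens to be valid UTF-8, while the
reverse mistake is more likely.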

Bob Gustafson

Mar 8, 2010, 9:08:33 AM
to ocr...@googlegroups.com
Good job!

It would be good if your translations could be merged into the Mercurial repository...

Thomas Breuel

Mar 8, 2010, 9:33:12 AM
to ocr...@googlegroups.com
I'll pull in the submitted patches over the next couple of weeks.
Sorry that it's taking so long.

Tom
