Ocropus Mobile, OpenFST for custom application, Training resources

Nathan K

Jul 13, 2011, 4:58:05 PM
to ocropus
Hi All,
Thanks for all the great work in producing Ocropus! I'm hoping it can
help me digitize some content on mobile devices.

#Q1 - Ocropus Engine on iPhone and/or Android
I'm particularly interested in hearing members' opinions about the best
way to get Ocropus running on iOS and/or Android. I'm largely
interested in using a trained 'language' model to make predictions on
the device. Training will occur offline and off device. I know a few
people have got Tesseract running on mobile devices, but I'm unable to
find much information on doing the same with Ocropus. I'd greatly
appreciate a slice of your collective wisdom in an effort to avoid
wasting days taking the wrong path.

Would it be easier to just prototype the algorithm using the scripts,
then grab the specific C++ code of interest and include it directly in
my application? Or is it best to compile Ocropus as a static/dynamic library?

#Q2 - OpenFST for custom application? - Please bear with me, this is a
new tool to me.
Also, I'm interested in using the probabilistic language model.
Currently I'm inexperienced with the particular library included. In
my application, the text I'm processing is found on an invoice and
follows the structure:

<Company name> <Item Name> <Size> \t\t <price>
...
...

Would training an application-specific language model in OpenFST be
able to improve results for this application? There is no field
divider in the document, so I'm looking for a method to automatically
restrict symbol probabilities and make corrections when the highest
ranking returned OCR symbol does not fit its context.
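
To make the kind of correction meant here concrete, below is a minimal sketch in plain Python rather than OpenFST. Everything in it is an assumption for illustration: the per-position candidate lists and their probabilities are made up, the price pattern (digits, a dot, two digits) is a guess at what a price field looks like, and the names `step` and `decode` are hypothetical. It runs a small Viterbi search that keeps only symbol sequences a hand-written acceptor allows. With OpenFST the same idea would normally be expressed by composing the OCR lattice with a grammar transducer and taking the shortest path.

```python
import math

# Hypothetical OCR output for a price field: for each character position,
# a list of (symbol, probability) candidates, best first.
candidates = [
    [("S", 0.6), ("5", 0.4)],
    [(".", 0.9), (",", 0.1)],
    [("O", 0.55), ("0", 0.45)],
    [("0", 0.8), ("o", 0.2)],
]

# Tiny hand-written acceptor for prices of the form DIGITS "." DIGIT DIGIT.
# States: 0 = start, 1 = integer part, 2 = after ".", 3 = one decimal, 4 = accept.
ACCEPTING = {4}


def step(state, symbol):
    """Return the next state, or None if the symbol is not allowed here."""
    if symbol.isdigit():
        return {0: 1, 1: 1, 2: 3, 3: 4}.get(state)
    if symbol == ".":
        return 2 if state == 1 else None
    return None


def decode(candidates):
    """Viterbi search: the lowest-cost symbol sequence the acceptor allows."""
    # best maps state -> (cost, text); cost is the negative log probability.
    best = {0: (0.0, "")}
    for position in candidates:
        nxt = {}
        for state, (cost, text) in best.items():
            for symbol, prob in position:
                target = step(state, symbol)
                if target is None:
                    continue
                new_cost = cost - math.log(prob)
                if target not in nxt or new_cost < nxt[target][0]:
                    nxt[target] = (new_cost, text + symbol)
        best = nxt
    finals = [best[s] for s in ACCEPTING if s in best]
    return min(finals)[1] if finals else None


# Top-ranked symbol at each position, with no context applied.
greedy = "".join(position[0][0] for position in candidates)
print(greedy)
# Best sequence that also fits the assumed price pattern.
print(decode(candidates))
```

The acceptor here is deliberately rigid. A trained model would attach weights to the transitions instead of simply forbidding symbols, so that an unusual but valid string is penalised rather than rejected.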

#Q3 - Training Resources
I've found some references to a training course. However, I have been
unable to log in when accessing it at https://sites.google.com/a/iupr.com/ocropus-course/about

Does anyone know how I might gain access to any extra documentation to
shorten my learning curve with Ocropus?

Note: I'm still unable to get the python bindings to work cleanly on
OSX. If anyone has achieved this I'd appreciate hearing from you as to
the steps involved. Cheers.

Thomas L. Packer

Jul 13, 2011, 11:32:23 PM
to ocr...@googlegroups.com
Hello Nathan

I'm interested in hearing more about your project, since it is
closely related to my consulting work: information extraction from
OCR'ed document images, including phone camera images. I won't say not to
use Ocropus, but I would like to understand the application better to know
why you might want to use Ocropus (since it can be a headache to use).

Thomas L. Packer
~~~~~~~~~~~~~~~~~~~~

--
You received this message because you are subscribed to the Google Groups
"ocropus" group.
To post to this group, send email to ocr...@googlegroups.com.
To unsubscribe from this group, send email to
ocropus+u...@googlegroups.com.
For more options, visit this group at
http://groups.google.com/group/ocropus?hl=en.

Jason Culverhouse

Jul 13, 2011, 11:51:15 PM
to ocr...@googlegroups.com
Nathan,

On Jul 13, 2011, at 1:58 PM, Nathan K wrote:

> Note: I'm still unable to get the python bindings to work cleanly on
> OSX. If anyone has achieved this I'd appreciate hearing from you as to
> the steps involved. Cheers.

I actually got most of them to work, and I added code reviews to the repository for some of the fixes. I'm using the versions of openfst and ocropus from MacPorts (they may be out of date relative to the source repository, but I added them to MacPorts to save someone the trouble in the future). No criticism here, but everything downstream has an "include the world" effect and was a little too much for me to add as a Portfile. It's easier just to check them out and build from source. I've included the basic patches that I applied to the hg checkouts below; they might work for you or at least point you in the right direction.

Jason

(ocropuspython)jason@rainmaker:~/play/ocropuspython/src/ocropy$ hg diff

diff -r 6703502a95a9 ocrolib/fgen.py
--- a/ocrolib/fgen.py Tue Apr 26 16:17:44 2011 +0200
+++ b/ocrolib/fgen.py Wed Jul 13 20:36:34 2011 -0700
@@ -116,8 +116,7 @@
pcr.show_layout(layout)

data = surface.get_data()
- data = bytearray(data)
- a = array(data,'B')
+ a = frombuffer(data,'B')
a.shape = (h,w,4)
a = a[:th,:tw,:3]
a = a[:,:,::-1]
diff -r 6703502a95a9 ocrolib/utils.py
--- a/ocrolib/utils.py Tue Apr 26 16:17:44 2011 +0200
+++ b/ocrolib/utils.py Wed Jul 13 20:36:34 2011 -0700
@@ -40,6 +40,11 @@

 def number_of_processors():
     try:
+        import multiprocessing
+        return multiprocessing.cpu_count()
+    except:
+        pass
+    try:
         return int(os.popen("cat /proc/cpuinfo | grep 'processor.*:' | wc -l").read())
     except:
         return 1
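
For reference, the `utils.py` change above can be sketched as a standalone function. This is an illustration written for current Python, not the code in the repository: it narrows the bare `except:` clauses to specific exceptions and reads `/proc/cpuinfo` directly instead of shelling out, both of which are assumptions about what a cleaned-up version might look like. The motivation for the patch is presumably that `/proc/cpuinfo` does not exist on OS X, so the original shell pipeline cannot report a useful count there.

```python
import os


def number_of_processors():
    """Return the CPU count, trying the portable route first."""
    try:
        import multiprocessing
        return multiprocessing.cpu_count()
    except (ImportError, NotImplementedError):
        # cpu_count() raises NotImplementedError when it cannot tell.
        pass
    try:
        # Linux-only fallback: count "processor" entries in /proc/cpuinfo.
        with open("/proc/cpuinfo") as f:
            count = sum(1 for line in f if line.startswith("processor"))
        return count or 1
    except (IOError, OSError):
        # No /proc filesystem (for example on OS X): assume one processor.
        return 1


print(number_of_processors())
```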

(ocropuspython)jason@rainmaker:~/play/ocropuspython/src/ocroswig$ hg diff
diff -r 404e31a7d3af iulib.i
--- a/iulib.i Fri Feb 25 01:11:50 2011 +0100
+++ b/iulib.i Wed Jul 13 20:40:42 2011 -0700
@@ -12,7 +12,6 @@
%include "cstring.i"
#endif
%{
-#include <malloc.h>
#include <string.h>
#include <colib/narray.h>
#include <colib/narray-util.h>
diff -r 404e31a7d3af ocropus.i
--- a/ocropus.i Fri Feb 25 01:11:50 2011 +0100
+++ b/ocropus.i Wed Jul 13 20:40:42 2011 -0700
@@ -9,7 +9,6 @@
%include "typemaps.i"
%include "cstring.i"
%{
-#include <malloc.h>
#include <string.h>
#include <colib/checks.h>
#include <colib/narray.h>
diff -r 404e31a7d3af setup.py
--- a/setup.py Fri Feb 25 01:11:50 2011 +0100
+++ b/setup.py Wed Jul 13 20:40:42 2011 -0700
@@ -10,7 +10,7 @@

baselibs = ['tiff','png','jpeg','SDL','SDL_gfx','SDL_image','m']

-include_dirs = ['/usr/local/include'] + get_numpy_include_dirs()
+include_dirs = ['/opt/local/include'] + get_numpy_include_dirs()
swig_opts = ["-c++"] + ["-I" + d for d in include_dirs]

iulib = Extension('_iulib',

(ocropuspython)jason@rainmaker:~/play/ocropuspython/src/pyopenfst$ hg diff
diff -r cfdc18d21ed6 openfst_properties.i
--- a/openfst_properties.i Tue Mar 01 11:14:59 2011 +0100
+++ b/openfst_properties.i Wed Jul 13 20:41:55 2011 -0700
@@ -65,9 +65,7 @@
uint64 ConcatProperties(uint64 inprops1, uint64 inprops2,
bool delayed = false);
uint64 DeterminizeProperties(uint64 inprops, bool has_subsequential_label);
-uint64 DifferenceProperties(uint64 inprops1, uint64 inprops2);
uint64 FactorWeightProperties(uint64 inprops);
-uint64 IntersectProperties(uint64 inprops1, uint64 inprops2);
uint64 InvertProperties(uint64 inprops);
uint64 ProjectProperties(uint64 inprops, bool project_input);
uint64 RelabelProperties(uint64 inprops);
@@ -78,15 +76,8 @@
uint64 ReverseProperties(uint64 inprops);
uint64 ReweightProperties(uint64 inprops);
uint64 RmEpsilonProperties(uint64 inprops, bool delayed = false);
+uint64 ShortestPathProperties(uint64 props);
uint64 SynchronizeProperties(uint64 inprops);
uint64 UnionProperties(uint64 inprops1, uint64 inprops2, bool delayed = false);

-%inline %{
-const char *PropertyBitName(int bit)
-{
- if (bit > 63)
- return NULL;
- else
- return PropertyNames[bit];
-}
-%}
+extern const char *PropertyNames[];
diff -r cfdc18d21ed6 openfst_symtab.i
--- a/openfst_symtab.i Tue Mar 01 11:14:59 2011 +0100
+++ b/openfst_symtab.i Wed Jul 13 20:41:55 2011 -0700
@@ -4,40 +4,41 @@
"Symbol table class, map input/output symbol IDs to and from strings.") SymbolTable;
struct SymbolTable {
%feature("docstring", "Create a new symbol table with identifying name.");
- SymbolTable(std::string const & name);
+ SymbolTable(const std::string& name);
%feature("docstring", "Add a symbol to the symbol table, optionally with a specific\n"
"numeric ID. Returns the numeric ID assigned to this symbol (which may\n"
"already exist).");
- long long AddSymbol(std::string const & name, long long id);
- long long AddSymbol(std::string const & name);
+ uint64 AddSymbol(const std::string& symbol, uint64 key);
+ uint64 AddSymbol(const std::string& symbol);
%feature("docstring", "Merge the contents of another symbol table into this one.");
- void AddTable(SymbolTable const & symtab);
+ void AddTable(const SymbolTable& table);
%feature("docstring", "Returns the identifying name of this symbol table.");
- std::string const & Name() const;
+ const std::string& Name() const;
std::string CheckSum() const;
+ std::string LabeledCheckSum() const;
%feature("docstring", "Return a copy of this symbol table.");
SymbolTable *Copy() const;
%feature("docstring", "Read entries from a text file.");
- static SymbolTable* ReadText(std::string const & filename,
+ static SymbolTable* ReadText(const std::string& filename,
bool allow_negative = false);
%feature("docstring", "Write entries to a text file.");
- bool WriteText(std::string const & filename) const;
+ bool WriteText(const std::string& filename) const;
%feature("docstring", "Read entries from a binary file.");
- static SymbolTable *Read(std::string const & filename);
+ static SymbolTable* Read(const std::string& filename);
%feature("docstring", "Write entries to a binary file.");
- bool Write(std::string const & filename) const;
+ bool Write(const std::string& filename) const;
%feature("docstring",
"Look up a symbol or numeric ID in the table. If called with a string,\n"
"returns the ID for that string or -1 if not found. If called with an\n"
"integer, returns the string for that ID, or the empty string if not found.");
- std::string Find(long long id) const;
- long long Find(std::string const & name);
+ std::string Find(uint64 key) const;
+ uint64 Find(const std::string& symbol) const;
%feature("docstring",
"Returns the next automatically-assigned symbol ID.");
- long long AvailableKey(void) const;
+ uint64 AvailableKey(void) const;
%feature("docstring",
"Returns the number of unique symbols in this table.");
- unsigned long NumSymbols(void) const;
+ size_t NumSymbols(void) const;

%extend {
%pythoncode %{
@@ -54,10 +55,10 @@
"for symbol, id in symtab:\n"
" print \"symbol %s has id %d\" % (symbol, id)\n") SymbolTableIterator;
struct SymbolTableIterator {
- SymbolTableIterator(SymbolTable const & symtab);
+ SymbolTableIterator(const SymbolTable& table);
bool Done(void);
- const char * Symbol(void);
- long long Value(void);
+ std::string Symbol(void);
+ uint64 Value(void);
void Next(void);
void Reset(void);
};
diff -r cfdc18d21ed6 openfst_templates.i
--- a/openfst_templates.i Tue Mar 01 11:14:59 2011 +0100
+++ b/openfst_templates.i Wed Jul 13 20:41:55 2011 -0700
@@ -337,23 +337,32 @@
template<class M> class RhoMatcher {
public:
typedef typename M::FST FST;
- RhoMatcher(FST const &fst, MatchType match_type,
- int rho_label=kNoLabel, bool rewrite_both=false);
+ RhoMatcher(const FST &fst,
+ MatchType match_type,
+ int rho_label = kNoLabel,
+ MatcherRewriteMode rewrite_mode = MATCHER_REWRITE_AUTO,
+ M *matcher = 0);
};

template<class M> class SigmaMatcher {
public:
typedef typename M::FST FST;
- SigmaMatcher(FST const &fst, MatchType match_type,
- int sigma_label=kNoLabel, bool rewrite_both=false);
+ SigmaMatcher(const FST &fst,
+ MatchType match_type,
+ int sigma_label = kNoLabel,
+ MatcherRewriteMode rewrite_mode = MATCHER_REWRITE_AUTO,
+ M *matcher = 0);
};

template<class M> class PhiMatcher {
public:
typedef typename M::FST FST;
- PhiMatcher(FST const &fst, MatchType match_type,
- int phi_label=kNoLabel, bool phi_loop=true,
- bool rewrite_both=false);
+ PhiMatcher(const FST &fst,
+ MatchType match_type,
+ int phi_label = kNoLabel,
+ bool phi_loop = true,
+ MatcherRewriteMode rewrite_mode = MATCHER_REWRITE_AUTO,
+ M *matcher = 0);
};

/* Compose options. */

Nathan K

Jul 15, 2011, 1:24:27 AM
to ocr...@googlegroups.com
Thanks for that input, Jason. It was very helpful. I've run into a few
more bugs (posted as a new thread); if you have a moment, I'd
appreciate your input. Thanks so much for the MacPorts versions. They
worked a treat. Did you have any issues using OpenFST 1.2.7? I'm
getting an error about not being able to find headers.


--
Email: nathank [at] noshly.com (professional)
Email: its [at] madteckhead.com (personal)
Website: http://www.madteckhead.com


Jason Culverhouse

Jul 15, 2011, 1:50:40 PM
to ocr...@googlegroups.com
Nathan,

On Jul 14, 2011, at 10:24 PM, Nathan K wrote:

> Thanks for that input, Jason. It was very helpful. I've run into a few
> more bugs (posted as a new thread); if you have a moment, I'd
> appreciate your input. Thanks so much for the MacPorts versions. They
> worked a treat. Did you have any issues using OpenFST 1.2.7? I'm
> getting an error about not being able to find headers.
>

Something like this might help:

cat ~/.pydistutils.cfg

[build_ext]
include-dirs=/opt/local/include
library-dirs = /opt/local/lib
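
A quick way to check that the directories named in such a file actually exist is sketched below. It is only an illustration: the helper name `build_ext_dirs` is hypothetical, the config text is embedded as a string rather than read from `~/.pydistutils.cfg`, and it uses the Python 3 `configparser` module. It relies on distutils splitting multiple directories on `os.pathsep`.

```python
import configparser
import os

# Same settings as the ~/.pydistutils.cfg shown above.
SAMPLE = """
[build_ext]
include-dirs=/opt/local/include
library-dirs = /opt/local/lib
"""


def build_ext_dirs(text):
    """Return the include and library directories from a distutils-style config."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    dirs = {}
    for key in ("include-dirs", "library-dirs"):
        value = parser.get("build_ext", key, fallback="")
        # distutils separates multiple directories with os.pathsep (":" on OS X).
        dirs[key] = [d for d in value.split(os.pathsep) if d]
    return dirs


for key, paths in build_ext_dirs(SAMPLE).items():
    for path in paths:
        status = "found" if os.path.isdir(path) else "missing"
        print(key, path, status)
```

If a directory is reported as missing, the header search will fail no matter what the config says, which would point to the OpenFST install location rather than to the build settings.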

Jason

Tom

Oct 9, 2011, 10:30:53 AM
to ocr...@googlegroups.com
Sorry for only getting around to responding now.

OCRopus is not suitable for use on either Android or iOS because much of it is written in Python (with NumPy), and neither platform supports them.

Bill Janssen

Nov 5, 2011, 3:47:22 PM
to ocropus
There are now applications in the iOS app store for the iPhone which
include the Python interpreter -- see MathPy, for instance. So, it's
now possible to use Python in iPhone apps.