Tesseract for Phishing detection

Jack D

May 25, 2016, 3:04:56 PM
to tesseract-ocr
Tesseract noob here.

I'm looking for a way to detect whether, for example, "paiipal" is visually similar to "paypal". Is there a way I can perform OCR on an image containing "paiipal" and get the probability that it is "paypal"? I could also run the same check on just "ii" and see whether it is read as "y".

I'm also wondering whether I can train Tesseract to work with a dictionary containing the brands that I am trying to protect from phishing attacks. Is that feasible, and is it a good idea?

Thanks a bunch.

Bojidar Stanchev

Jun 8, 2016, 4:56:42 AM
to tesseract-ocr
Tesseract is mostly used to recognize text from images.

From what I understand you want to protect yourself from phishing.
A very good way to do that is to familiarize yourself with the Levenshtein distance algorithm.
It's very simple - it calculates how many single-character changes (insertions, removals, or replacements) you need to make to a string to get to the desired string.
For example, if you compare paiipal to paypal it will give you a distance of 2 - replace one "i" with "y" and remove the other. (It is 3 if you only allow removals and insertions: remove 2 letters and add 1.)
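
To make the example concrete, here is a minimal sketch of the distance calculation in Python. It uses the standard definition, where replacing a character counts as a single edit, so "paiipal" versus "paypal" comes out as 2; counting a replacement as a removal plus an insertion would give 3 instead. The function name `levenshtein` is just chosen for illustration and is not from any particular library.

```python
def levenshtein(a, b):
    """Return the Levenshtein distance between strings a and b.

    Insertions, deletions, and substitutions each cost 1.
    Only two rows of the dynamic-programming table are kept in memory.
    """
    prev = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca from a
                curr[j - 1] + 1,     # insert cb into a
                prev[j - 1] + cost,  # substitute ca with cb (or keep it)
            ))
        prev = curr
    return prev[-1]


print(levenshtein("paiipal", "paypal"))  # 2
```

In practice you would compare each incoming name against your list of protected brands and flag anything whose distance is small but not zero.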

Why am I suggesting this? Because your problem has already been solved in a slightly different setting: the corporate world.
Sometimes a bad employee in a company would try to replace the company name on a document with a near-identical name, for example one with 2 letters swapped.
Small alterations like this are hard for a human to notice, as you pointed out, but very easy for a machine to detect.

I hope this helps. If not, maybe I did not fully understand your intentions, and you would have to clarify why you need to use Tesseract so I can help you further.

Jack D

Jul 6, 2016, 3:41:20 PM
to tesseract-ocr
I'm aware of edit distance and the Bitap algorithm's use in phishing detection, but the challenge is that a low edit distance between two words doesn't guarantee visual similarity. For example, paypal and laypas are only two edits apart, yet they are easy to tell apart by eye.

So, I've been considering converting the target text into an image, maybe applying a filter or two, and then running OCR to get all possible words along with their likelihoods. Is that possible with Tesseract?
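
A rough sketch of that render-then-OCR idea in Python, assuming the third-party `pytesseract` and Pillow packages and a local Tesseract install. The function names, image size, and text position are made up for illustration. Note that `image_to_data` reports only Tesseract's best guess for each word with a confidence from 0 to 100; it does not give the probability of a specific alternative such as "paypal", so this covers only part of what is asked above.

```python
def words_with_confidence(data):
    """Pair each recognized word with its confidence, skipping non-word rows.

    `data` is the dict returned by pytesseract.image_to_data(...,
    output_type=Output.DICT), which has parallel "text" and "conf" lists.
    """
    pairs = []
    for text, conf in zip(data["text"], data["conf"]):
        conf = float(conf)  # some pytesseract versions return strings
        if text.strip() and conf >= 0:  # conf is -1 for layout-only rows
            pairs.append((text.strip(), conf))
    return pairs


def ocr_rendered_word(word, size=(400, 100)):
    """Render `word` as black text on a white image and OCR it (hypothetical helper)."""
    # Imported lazily so the helper above works without these packages.
    from PIL import Image, ImageDraw, ImageFont
    import pytesseract

    image = Image.new("RGB", size, "white")
    # Assumption: the default bitmap font is small; a larger TrueType font
    # (and any filters) would likely be needed for realistic results.
    font = ImageFont.load_default()
    ImageDraw.Draw(image).text((20, 40), word, fill="black", font=font)
    # --psm 8 asks Tesseract to treat the image as a single word.
    data = pytesseract.image_to_data(
        image, config="--psm 8", output_type=pytesseract.Output.DICT
    )
    return words_with_confidence(data)
```

Calling `ocr_rendered_word("paiipal")` would return a list of `(word, confidence)` pairs; if the top result comes back as a protected brand name, that is a hint the string is visually confusable with it.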

ramalin...@excelbizsolutions.com

Jul 7, 2016, 4:36:35 AM
to tesseract-ocr
eclipse inside not showing project because of import problem
