Tesseract Security Inquiry

Daniel F

Feb 23, 2023, 1:48:04 AM

to tesseract-dev

Good Evening,

I am with the U.S. Senate Cybersecurity Department, evaluating Tesseract for use on the U.S. Senate Network. I have a couple of questions. Could someone please answer the following?

1. Does the GitHub project supply checksum for executables of the software for users to verify prior to installing?

2. Does the GitHub project employ any integrity checking mechanism, development pipelines, or security testing during code development or code commits?

In order to meet our SLA, I would like to request a response within 2 business days.

Any insight would be greatly appreciated. Thank you for your time.

Ger Hobbelt

Feb 28, 2023, 12:44:51 AM

to tesser...@googlegroups.com

L.S.

Before I answer, I would strongly advise you to educate your chain of command on communicating with open source in general, related expectation management and delivery.

You are not communicating with your average mid- to large-scale American business. Open Source projects (such as tesseract) are fundamentally volunteer-driven. We're not talking NGOs, we're talking volunteer labour as in *unpaid labour*. (There are some exceptions out there, but then we're talking about *sponsorships*, possibly by companies or individuals. Nevertheless, the below still applies to you to the full extent.)

Next to that, you're talking to an international gathering. No matter how nicely you phrased your request, in open source (*unpaid labour* by volunteers) your communication is highly offensive. Let me try to put that in your American perspective:

Consider interacting with open source as being invited to an (international) barbeque. Open Source comes with an unwritten rule of good behaviour: quid pro quo. For the barbeque, quid pro quo means you might bring your family, but you also bring something for the roasted hog to be dipped in, or something munchable or drinkable that goes nicely with it. Going there empty-handed is, I am sure, regarded as inconsiderate, at the least, in your place as much as in mine.

With open source, the unwritten expectation equivalent of that dip or munchies is you showing you've done your research / due diligence and sharing information.

Now let's analyze your communication.

I said "highly offensive". I'll translate this for you, so you can easily comprehend as an American:

> In order to meet our SLA, I would like to request a response within 2 business days.

You just called for the housenigger.

Yes, you did. And I am not your serf. Nor is anyone else out there.

quid pro quo: If you (or rather your chain of command) had done your research, you would know already that demanding responses within 2 business days in open source will skew your yield negatively to the extreme: return rates will be extremely low, and you can expect some quite heated responses from several directions, if you get any at all.

I suggest 2 to 4 business weeks as rather more sensible. Putting deadlines on *unpaid* volunteer effort of any kind is a bit of an uncouth approach anyway, ethically, particularly on a first encounter.

On a personal note: I had my accountant on the horn this morning. He had some sad news (taxes) and the meager good news did not include him informing me that the contract + SLA with the U.S. Senate has passed his scrutiny, been signed and verified to his satisfaction, so I am cleared to start and register my billable hours on your project X. You getting me?

Consider this free education. Hoping I'm talking to a quick learner.

quid pro quo: bringing your dip to the barbeque: you might want to tell us why you're in such a hurry. And a simple 'because I need it' will be treated as the petulant child's answer that it is. An explanation of why the hurry includes information on what went wrong on your end that left you restricted by this über-needy response time. If you can't share, keep that barbeque in mind: your best bet is to bring something else that is valued by the gathering. (Maybe you might want to ask what they like first?) When you provide such why-the-hurry information, your chances of finding a kind soul willing to help you out in time anyway will significantly increase: unpaid volunteer work may not be billed in dollars, but it sure as hell is billed in another hard coin: your sensibility and courtesy.

> 1. Does the GitHub project supply checksum for executables of the software for users to verify prior to installing?

quid pro quo: here you clearly show you have not done any of the preliminary research that is expected of a sensible individual or organization: if you had checked the website, GitHub and elsewhere, e.g. Releases · tesseract-ocr/tesseract (github.com), you would already know that tesseract is published in source code form only. For non-technical readers, that means it is not yet executable, but requires further (compile, link and package) work by others, which must then happen outside the project. One such individual is Mr. Weil from Mannheim University (Germany, Europe), who has been kindly supplying the world with pre-built, ready-to-run tesseract *executables*. See e.g.: Home · UB-Mannheim/tesseract Wiki (github.com)

Depending on your precise usage patterns of tesseract, intended or actual, you might want to inquire with him, but don't forget: bring some dip, salad or bottle equivalent that he might like. After all, it does not say there Mr. Weil is *paid* to do this extra public service work: his efforts are a gift from Mr. Weil and possibly the Mannheim University.

Second to that, your network may carry applications ("executables") which incorporate tesseract as a component, a.k.a. a library. It might behoove you to find out who the producers of those executables are and inquire there as well, to obtain your accountability surface area overview.

I haven't checked whether Mr. Weil publishes SHA256 or similar hashes alongside his binary installers. I haven't seen security hashes for the individual installed executables (that's unusual anyway; mostly you encounter them for install packages only) but he might be willing to produce and publish those from now on, when asked nicely. You'll have to ask him.
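For whoever ends up doing the actual check: a minimal Python sketch of verifying a downloaded installer against a published hash. It assumes the publisher provides a SHA-256 hex digest next to the download, which, as said, has not been verified for the UB Mannheim installers. The function names are made up for illustration.

```python
import hashlib
import hmac
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path: Path, published_hex: str) -> bool:
    """Compare the local digest with a published hex digest.

    Surrounding whitespace and letter case in the published value are ignored.
    """
    expected = published_hex.strip().lower()
    return hmac.compare_digest(sha256_of_file(path), expected)
```

A matching digest only shows that the file you have is the file the publisher hashed. It says nothing about how that file was built, so it complements, rather than replaces, asking the packager about his build process.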

Remember: you're at a barbeque gathering: I may know Mr. Weil, but we're not related. Certainly not in any business/dependency kind of way. As they say: try diplomacy; bring some dip. YMMV.

> 2. Does the GitHub project employ any integrity checking mechanism, development pipelines, or security testing during code development or code commits?

quid pro quo: again, you clearly show you have not done any of the preliminary research that is expected of a sensible individual or organization. You could have easily shown you've done your end of the bargain and listed your preliminary due diligence results. I.e. had you phrased your question like this, you would have had a better chance of getting anywhere:

[Start]

Is it correct that the tesseract GitHub project employs the following for these subjects (and please fill in any gaps you find):

A. integrity checking mechanisms:

- all source code is contained and managed in git and hence is subject to the integrity checking facilities of git: everything in the repository is content-hashed using SHA1 and is non-modifiable by design, once committed. (See also `git push --force` for an addendum to this: content MAY SEEM to be changed afterwards, but it ALWAYS impacts the commit trees, so everyone involved notices the git repo 'forced update'.)

+ The Git Commit Hash - Mike Street - Lead Developer and CTO (mikestreety.co.uk)

+ How is the Git hash calculated? - Stack Overflow

- second, the git system is fault-tolerant by design as everyone who works on/with the github repository has a mandatory local mirror (a.k.a. "git clone").

+ If GitHub were, hypothetically, to carry a *corrupted* repository at some point, this would impact new developers (users) as they would not be able to 'git clone', but all active maintainers and contributors will have pre-existing local mirrors, hence such a fatal situation can be dealt with by the maintainers in conjunction with GitHub to re-establish an uncorrupted repository at GitHub originating from the maintainer(s). (To date, I have not personally heard of such an occurrence, but git has reckoned with this scenario by design: this is one of the reasons git keeps local mirrors. -- Another reason is that local mirrors provide the ability to continue working off-line / off-net.)

+ The same or a simpler (& faster) recovery scenario applies when the GitHub repository turns out to be infested with adversarial source code: upon detection, git and GitHub enable a swift and complete recovery & removal. (See 'git tree rewriting / filtering' for some suggestions; one of the maintainers can do this and then 'git push --force' to remove the undesirables after the fact -- other developers' local git mirrors will update on the next sync, a.k.a. 'git fetch / git pull'.)

A.1 risks and countermeasures*

By using git, the entire git tesseract development history is permanently available, every commit constant (i.e. non-modifiable).

The only risk here is the maintainer(s): when they decide to damage or nuke the repository (it has happened very rarely, but there is an occurrence of this in recent years, in 2016: How one developer just broke Node, Babel and thousands of projects in 11 lines of JavaScript • The Register), the general point of contact (GitHub git repo) is damaged/inaccessible. However, anyone with a previous local git mirror, i.e. any maintainer or developer who previously cloned the repo to work on it, look at it or use it within their own projects, can create a new repository on GitHub or elsewhere, containing the entire project history up to the moment they last synced. This is a.k.a. 'git fork'. This behaviour is commonplace and well documented throughout, and is considered easy to do. Some open source members even encourage this 'forking' process as SOP.

*) item A.1 added for clarification as these particular processes are not known or usual in business applications delivered by business enterprises.
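To make the content-hashing claim in item A concrete, here is a minimal Python sketch of how git derives the ID of a file ("blob") object in a SHA-1 repository: it hashes a short header plus the raw content. The function name is made up for illustration. Trees and commits are hashed the same way over their own serialized form, which is why changing any file changes every ID above it.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the object ID git assigns to a blob in a SHA-1 repository.

    git hashes the header "blob <size in bytes>" followed by a NUL byte,
    then the raw file content.
    """
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# Same value as `echo 'hello world' | git hash-object --stdin`
print(git_blob_id(b"hello world\n"))  # 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
```

Because the ID is derived from the content, a clone can recompute and cross-check it locally ('git fsck' does this across the whole object store).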

B. development pipelines

tesseract software (any and all source codes, documentation, etc. kept in the GitHub repository) can only be changed (updated, etc.) by the currently active maintainer(s) and repo-access-granted contributor(s), all of whom have to be registered at GitHub. Only administrator-privileged maintainer(s) may add new repo-write-access-granted collaborators, which is usually a (very) small set of individuals, who are (often) meant to serve as administrative colleagues. Most of the participants interact via 'github pull requests', all of which invariably are vetted by the registered maintainers, for only the (tiny) latter group has write/modify access rights to the repository.

The individuals with write/modify access rights are an unidentified subset of Contributors to tesseract-ocr/tesseract (github.com) -- which may be inferred by looking at the names of the people listed as committer for merge commits, but for administrative purposes I need this list to be more specific and verifiable.

[To obtain the precise set of active maintainers for any project/repository, the U.S. department is strongly advised to encourage Microsoft (owner of GitHub) to offer such a feature as part of 'GitHub Insights', for this is useful information that is currently hard/impossible to obtain with any certainty. Help the open source community to help you.]

Hence the query whether this list is complete or entirely correct (I don't expect it to be): individuals/accounts with merge commit rights for the tesseract repository today:

- Zdenko Podobný (maintainer)

- Stefan Weil

- Igor Pugin

- Amit D. ('amitdo')

- Shreeshrii?

- Ray Smith?
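A rough way to draw up a candidate list like the one above from a local clone is to count who is recorded as committer on merge commits. This is a minimal Python sketch, assuming git is installed and `repo_path` points at a clone; the function names are made up for illustration.

```python
import subprocess
from collections import Counter


def merge_committer_counts(log_output: str) -> Counter:
    """Count committer names, given one name per non-empty line."""
    names = (line.strip() for line in log_output.splitlines())
    return Counter(name for name in names if name)


def merge_committers(repo_path: str) -> Counter:
    """Ask git for the committer name of every merge commit in a local clone."""
    result = subprocess.run(
        ["git", "-C", repo_path, "log", "--merges", "--format=%cN"],
        capture_output=True,
        text=True,
        check=True,
    )
    return merge_committer_counts(result.stdout)
```

Treat the result as a hint only. Merges made through the GitHub web interface are typically recorded with "GitHub" as committer (the author field, `%aN`, may be more telling there), and squash or rebase merges leave no merge commit at all. It shows who has merged in the past, not who holds write access today.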

C. security testing during code development or code commits

AFAICT there's no notable activity with particular emphasis on *security testing* visible in the repository.

It does contain unit tests, and apparently quite an amount of testing is done at UB Mannheim, so application stability risks through memory corruption / heap leakage / etc. are being dealt with, but nothing mentions explicit security testing, adversarial inputs, or detection and countermeasures for them.

While tesseract is observable in production (e.g. archive.org shows usage), I haven't found any publications by those entities outside the project itself addressing this subject, so help and hints are most welcome.

Re attack surface area investigations: tesseract source code mentions the optional component `curl` (curl/curl on github.com), which would be network-accessing when tesseract is given a URI instead of a file to process via the tesseract CLI.

[Edit this part to match reality]

We, the U.S. Senate Network, intend to use (or are using) tesseract as an application component instead, i.e. any software we use uses the tesseract API. Any suggestions what I should look for / ask for / ask about, when inquiring with the developers/maintainers of those applications, re security concerns and security testing in particular?

Any corrections and additions to the above are much appreciated.

Yours truly,

Your Name

U.S. Senate Cybersecurity Department

[End]

As the info above is from a third party (me, with maybe 20 lines of code contributed to the project?), you might want to verify with the maintainer, Mr. Zdenko Podobný. I suggest you do recall the barbeque analogy.

quid pro quo: you could, for instance, provide a substantiated offer to publish your report here: some of us may be interested and may be swayed that way. "substantiated" meaning here: you have vetted this offer with your superiors so you don't blather and then come back with "sorry, but I can't do it, 'cause it got classified/restricted access." Recall the barbeque: you don't bring wine bottles that happen to be empty when the cork is popped. That'd be nasty.

One free lesson, served.

Met vriendelijke groeten / Best regards,

Ger Hobbelt

--------------------------------------------------
web: http://www.hobbelt.com/
http://www.hebbut.net/
mail: g...@hobbelt.com
mobile: +31-6-11 120 978
--------------------------------------------------


Senthil M.K.

Feb 28, 2023, 1:22:51 AM

to tesser...@googlegroups.com

You gotta be kidding me... Have you ever heard of NIST or the multiple agencies doing cyber defense for the federal government? They probably have much more customized images for using Tesseract, or something even better for CV and OCR.

Advertising your name and your role as well as what you are evaluating for is itself a security hole. Please refrain from posting your name/email address (thank god you are using gmail and not your .gov email address) or the federal agency you work for when you ask open-ended questions like these.

Look around. There are way too many federal agencies in DC with more than relevant expertise. You just got to ask.

Have a nice day kid.

Regards,

- Senthil.
