Hebrew Chapters (or Verses) By Word Frequency

Aaron Laws

Mar 3, 2024, 8:29:02 PM

to OpenScriptures Hebrew Bible

I'm contemplating creating an index of Biblical Hebrew chapters or verses based on word frequency. I want to create lists fitting descriptions like "All verses composed exclusively of roots that occur 500 or more times" or maybe "All chapters composed at least 90% of roots that occur 400 or more times".
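The two kinds of list described above can be sketched as follows. This is only a rough illustration under stated assumptions: the input is a hypothetical dict mapping `(book, chapter, verse)` to a list of root identifiers, frequencies are counted over that same input, and "90%" is read as a share of word tokens rather than of distinct roots. The function names are made up for the example.

```python
from collections import Counter


def root_frequencies(verses):
    """Count how often each root occurs across all verses."""
    return Counter(root for roots in verses.values() for root in roots)


def verses_with_only_frequent_roots(verses, min_count):
    """Verses in which every root occurs at least min_count times overall."""
    freq = root_frequencies(verses)
    return [
        ref
        for ref, roots in verses.items()
        if roots and all(freq[root] >= min_count for root in roots)
    ]


def chapters_mostly_frequent(verses, min_count, min_share):
    """Chapters whose share of word tokens with frequent roots is >= min_share."""
    freq = root_frequencies(verses)
    # Group the per-verse root lists by (book, chapter).
    chapters = {}
    for (book, chapter, _verse), roots in verses.items():
        chapters.setdefault((book, chapter), []).extend(roots)
    selected = []
    for ref, roots in chapters.items():
        if not roots:
            continue
        frequent = sum(freq[root] >= min_count for root in roots)
        if frequent / len(roots) >= min_share:
            selected.append(ref)
    return selected


# Toy data with placeholder root names (not real lemmas).
toy = {
    ("Gen", 1, 1): ["a", "b", "a"],
    ("Gen", 1, 2): ["a", "c"],
    ("Gen", 2, 1): ["b", "b", "d"],
}
print(verses_with_only_frequent_roots(toy, min_count=3))
print(chapters_mostly_frequent(toy, min_count=3, min_share=0.75))
```

With real data the thresholds would be 500 and 400, and the dict would come from whichever parsed text is chosen. Note that chapter grouping depends on the edition's chapter boundaries.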

I'm a software developer, so I have no trouble writing the algorithm, but I need data. I'm looking at https://github.com/openscriptures/morphhb/tree/master/wlc which seems like it has all the information I need unless I'm mistaken. It looks like, for instance, lemma 1121 is "ben" (son) and all its different inflections, which is perfect.
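As a sketch of how the lemma numbers might be pulled out of those files, assuming they are OSIS XML in which each word is a `<w>` element whose `lemma` attribute holds slash-separated parts (prefix codes as letters, the lemma itself as a number, sometimes followed by a disambiguating letter). The `SAMPLE` fragment below is hand-written to imitate that layout and is not copied from the repository, so the assumption should be checked against the real files.

```python
import xml.etree.ElementTree as ET

# OSIS namespace, in ElementTree's "{uri}tag" notation.
OSIS_NS = "{http://www.bibletechnologies.net/2003/OSIS/namespace}"

# Hand-written illustrative fragment (assumed layout, not real file content).
SAMPLE = """<osis xmlns="http://www.bibletechnologies.net/2003/OSIS/namespace">
  <osisText>
    <div type="book" osisID="Gen">
      <chapter osisID="Gen.1">
        <verse osisID="Gen.1.1">
          <w lemma="b/7225">word1</w>
          <w lemma="1254 a">word2</w>
          <w lemma="430">word3</w>
        </verse>
      </chapter>
    </div>
  </osisText>
</osis>"""


def content_lemmas(lemma_attr):
    """Keep the numeric parts of a lemma attribute, dropping prefix codes.

    "b/7225" -> ["7225"]; "1254 a" -> ["1254 a"] (the letter is kept so that
    lemmas sharing a number stay distinct).
    """
    return [part.strip() for part in lemma_attr.split("/") if part.strip()[:1].isdigit()]


def verse_lemmas(osis_xml):
    """Map each verse's osisID to the list of lemmas of its words."""
    root = ET.fromstring(osis_xml)
    result = {}
    for verse in root.iter(OSIS_NS + "verse"):
        lemmas = []
        for word in verse.iter(OSIS_NS + "w"):
            lemmas.extend(content_lemmas(word.get("lemma", "")))
        result[verse.get("osisID")] = lemmas
    return result


print(verse_lemmas(SAMPLE))
```

For a file on disk, `ET.parse(path).getroot()` would replace `ET.fromstring`. Whether to count prefixes (prepositions, the article, the conjunction) as words of their own is a separate decision that this sketch sidesteps by ignoring them.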

The current post exists to ask a few questions:

1. Has this been done before so that I'm duplicating other work?

2. Is that directory of that git repository a good source of data, or does a better source exist? I see that the latest commit was a few years ago, and I'm not sure what that means.

3. Does my proposed algorithm, as far as you can make it out, sound correct?

Thank you for your time!

Yours,

Aaron Laws

Avraham Ben Emanuel

Mar 4, 2024, 3:15:17 AM

to Aaron Laws, OpenScriptures Hebrew Bible

The Bible hasn't changed much over the last few thousand years, but the WLC text is maintained at https://tanach.us/, and the latest update is Build 27.1 (19 Oct 2023), so I would take it from there.

Please note that chapters aren't identical in different editions, so basing anything on them can be problematic.

Word frequency needs to deal with the roots of words. See https://search.dicta.org.il/ for a modern search engine that deals with this.

Avi

--

Avraham Ben Emanuel
אברהם בן עמנואל
https://www.facebook.com/son.of.emanuel

Aaron Laws

Mar 4, 2024, 7:43:06 AM

to OpenScriptures Hebrew Bible

Thank you for your response!

On Monday, March 4, 2024 at 3:15:17 AM UTC-5 avraham.b...@gmail.com wrote:

> The Bible hasn't changed much over the last few thousand years

Certainly. I'm not looking for the Bible text itself so much as for an annotated Bible, and reliable, free digital annotations have changed significantly in recent years.

> but the WLC text is maintained at https://tanach.us/, and the latest update is Build 27.1 (19 Oct 2023), so I would take it from there.
> Please note that chapters aren't identical in different editions, so basing anything on them can be problematic.
> Word frequency needs to deal with the roots of words. See https://search.dicta.org.il/ for a modern search engine that deals with this.

Precisely. I need a "parsed" copy of the Tanach so that I can calculate frequency based on root.

Daniel Owens

Mar 4, 2024, 12:51:03 PM

to openscri...@googlegroups.com

Aaron,

To answer #2, I think that's the correct repository. It's been a few
years since we finished work on parsing, and since it is a volunteer,
crowd-sourced project, probably not a great deal has been done since then.

It sounds like your idea to create an index is dynamic, like you might
ask various different questions and seek answers in the data? I think in
terms of how people have used open data, your idea may be new.

Daniel

Dirk Roorda

Jun 28, 2024, 10:23:54 AM

to OpenScriptures Hebrew Bible

Hi Aaron,

I stumbled on your question here, and I can point you to a data source with which you can achieve what you want:

https://github.com/ETCBC/bhsa

The data contains the text of the BHS, with linguistic features. One of those features is "freq_lex", which gives, for each word, the frequency of the underlying lexeme in the BHS.

With a bit of programming you can identify the lexeme base of each chapter in the Hebrew Bible, and also the ratio of frequent to infrequent lexemes.

The data model of all this data is Text-Fabric. Text-Fabric is also a Python library that knows this model and lets you compute things and display text fragments together with the features. To get started, consult https://nbviewer.org/github/ETCBC/bhsa/blob/master/tutorial/start.ipynb
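A rough sketch of the per-chapter ratio mentioned above, written so that it runs without the data: the function takes plain callables, and the commented lines show how those callables might be supplied by Text-Fabric. The function name, its parameters, and the threshold are assumptions made for the example; the Text-Fabric calls are the ones I understand the tutorial to use, so treat them as unverified until checked there.

```python
def share_by_chapter(chapter_nodes, words_of, freq_of, label_of, min_count):
    """For each chapter, the share of words whose lexeme frequency is >= min_count."""
    shares = {}
    for chapter in chapter_nodes:
        freqs = [freq_of(word) for word in words_of(chapter)]
        frequent = sum(f >= min_count for f in freqs)
        shares[label_of(chapter)] = frequent / len(freqs) if freqs else 0.0
    return shares


# Possible hookup with Text-Fabric (commented out because it downloads data):
#
# from tf.app import use
# A = use("ETCBC/bhsa", hoist=globals())
# shares = share_by_chapter(
#     F.otype.s("chapter"),            # all chapter nodes
#     lambda c: L.d(c, otype="word"),  # word nodes inside a chapter
#     F.freq_lex.v,                    # lexeme frequency of a word node
#     T.sectionFromNode,               # section label for a chapter node
#     min_count=400,
# )

# Stand-in data so the sketch can be run on its own.
words = {1: [10, 11, 12, 13], 2: [14, 15]}
freqs = {10: 500, 11: 450, 12: 3, 13: 400, 14: 1, 15: 2}
demo = share_by_chapter([1, 2], words.get, freqs.get, lambda c: ("Gen", c), 400)
print(demo)
```

Filtering `shares` for values of at least 0.9 would then give a list of chapters in the sense of the original question, counted by word tokens.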

If you are interested, I can also invite you to https://etcbc-vu.slack.com where you can discuss issues with fellow researchers and programmers.

Best regards,

Dirk

---

Dirk Roorda

https://pure.knaw.nl/portal/en/persons/dirk-roorda

Aaron Laws

Jun 28, 2024, 11:06:26 AM

to Dirk Roorda, OpenScriptures Hebrew Bible

On Fri, Jun 28, 2024 at 10:23 AM 'Dirk Roorda' via OpenScriptures Hebrew Bible <openscri...@googlegroups.com> wrote:

> Hi Aaron,
>
> I stumbled on your question here, and I can point you to a data source with which you can achieve what you want:
> https://github.com/ETCBC/bhsa
> The data contains the text of the BHS, with linguistic features. One of those features is "freq_lex", which gives, for each word, the frequency of the underlying lexeme in the BHS.
> With a bit of programming you can identify the lexeme base of each chapter in the Hebrew Bible, and also the ratio of frequent to infrequent lexemes.
>
> The data model of all this data is Text-Fabric. Text-Fabric is also a Python library that knows this model and lets you compute things and display text fragments together with the features. To get started, consult https://nbviewer.org/github/ETCBC/bhsa/blob/master/tutorial/start.ipynb
>
> If you are interested, I can also invite you to https://etcbc-vu.slack.com where you can discuss issues with fellow researchers and programmers.

Thank you for your informative response and invitation!

Yours,

Aaron
