Search bug: “B12” not found unless using HTML (B<sub>12</sub>)


Диана Горобец

Mar 22, 2026, 5:41:12 AM
to OpenAlex Community

I’ve come across what appears to be a systematic issue with search indexing and normalization on your platform, specifically related to chemical notation rendered with HTML tags (e.g. subscript formatting).

Problem summary (reproducible case):
Some records that contain “B12” in titles or abstracts are not discoverable via a standard query like b12 or B12, but are only found when searching using the HTML-formatted version B<sub>12</sub>.

For example:

DOI: 10.1002/mbo3.1199
OpenAlex ID: W3165737450

A similar issue occurs with another record:
“Fed-Batch Fermentation for Propionic, Acetic and Lactic Acid Production”
DOI: 10.13005/ojc/310174
OpenAlex ID: W1536233415

In this case, “B12” appears in the abstract, but searching for b12 does not retrieve the record.

Additional search behavior issues:

  • Searching for B<sub>12</sub> returns many irrelevant results (all entries containing <sub>12</sub> regardless of context).

  • Searching with quotes, e.g. "B<sub>12</sub>", returns no results at all.

  • Case differences (b12 vs B12) also seem to affect recall.
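To make the cases above easier to re-run, here is a minimal sketch that builds one request URL per query variant. It assumes the works endpoint at https://api.openalex.org/works accepts a `search` query parameter. The helper name `search_url` is hypothetical, and the script only prints URLs rather than calling the API.

```python
from urllib.parse import urlencode

# Assumption: the works endpoint takes a `search` query parameter.
BASE = "https://api.openalex.org/works"


def search_url(query: str) -> str:
    """Return a percent-encoded works search URL for one query string."""
    return f"{BASE}?{urlencode({'search': query})}"


# The four variants described in this report.
variants = ["b12", "B12", "B<sub>12</sub>", '"B<sub>12</sub>"']

for variant in variants:
    print(search_url(variant))
```

Opening each URL and checking whether W3165737450 and W1536233415 appear in the results should show which variants miss the records.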

Root cause (assumption):
It appears that HTML markup (e.g. <sub>) is being indexed literally rather than normalized into plain text, and that query normalization is not aligned with how the data is stored.
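To illustrate the assumption, here is a small sketch of what can happen when markup is tokenized literally. The tokenizer is a hypothetical stand-in, not OpenAlex's actual analyzer: it lowercases the text and splits on every non-alphanumeric character, so it does not model the case sensitivity noted above.

```python
import re


def naive_tokens(text: str) -> list[str]:
    """Hypothetical analyzer: lowercase, then split on non-alphanumerics."""
    return [t for t in re.split(r"[^0-9a-z]+", text.lower()) if t]


stored = naive_tokens("vitamin B<sub>12</sub> production")
other = naive_tokens("C<sub>12</sub> fatty acids")

# The plain query yields a single token that was never indexed.
print("b12" in stored)  # False

# The HTML query breaks into pieces that the stored title also contains.
html_query = naive_tokens("B<sub>12</sub>")
print(all(t in stored for t in html_query))  # True

# Unrelated records share the 'sub' and '12' tokens as well.
print({"sub", "12"} <= set(other))  # True
```

Under this assumption, "b12" never matches because the index holds "b" and "12" as separate tokens, while any text containing <sub>12</sub> shares most tokens with the HTML query, which would explain the irrelevant results.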

Suggested improvements:

  1. Strip or normalize HTML tags (like <sub>) during indexing.
    Example: B<sub>12</sub> → B12

  2. Apply case normalization (e.g. lowercase everything) for both indexed data and queries.

  3. Normalize user queries the same way (so b12, B12, and B<sub>12</sub> all resolve to the same token).

  4. Ideally maintain both:

    • original formatted text (for display)

    • normalized text (for search/indexing)

It would also be important to apply this consistently across all chemical notations and similar patterns (not only vitamin B12), including subscripts, superscripts, and case variations.
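The suggestions above could be sketched roughly as follows. This is only an illustration under my own assumptions. The names `normalize_for_search` and `index_record` are hypothetical, and a regex is a simplification of real HTML handling. The same function is applied to both stored text and user queries, and the original text is kept for display.

```python
import re

# Matches inline tags such as <sub>, </sub>, <i>, <scp>.
# Requires a letter after "<" so that text like "a < b" is left alone.
TAG_RE = re.compile(r"</?[A-Za-z][^>]*>")


def normalize_for_search(text: str) -> str:
    """Drop inline tags, collapse whitespace, and lowercase."""
    without_tags = TAG_RE.sub("", text)
    return re.sub(r"\s+", " ", without_tags).strip().lower()


def index_record(title: str) -> dict[str, str]:
    """Keep the formatted title for display and a normalized copy for search."""
    return {"display": title, "search": normalize_for_search(title)}


record = index_record("Industrial vitamin <scp>B<sub>12</sub></scp> production")

# All three query spellings normalize to the same token.
for query in ["b12", "B12", "B<sub>12</sub>"]:
    print(normalize_for_search(query) in record["search"])
```

Because the tags are removed without inserting a space, B<sub>12</sub> becomes "b12" rather than "b 12", and nested markup such as <scp> collapses the same way.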


Диана Горобец

Mar 22, 2026, 6:15:16 AM
to OpenAlex Community
Here is another affected case:
NOT FOUND:   Works search | OpenAlex
FOUND:   Works search | OpenAlex

The affected record is api.openalex.org/w1493538882, whose title is stored with nested markup:
Establishment of beet molasses as the fermentation substrate for industrial vitamin <scp>B<sub>12</sub></scp> production by <i>Pseudomonas denitrificans</i>

This needs to be fixed too.
