Phrase search inconsistency in abstracts (token order matches, but full phrase not found)


Диана Горобец

Mar 22, 2026, 3:11:39 PM
to OpenAlex Community
I’ve identified another search inconsistency in OpenAlex, this time related to phrase matching in abstracts reconstructed from the inverted index.

Example record:
https://openalex.org/works/W4396242287

---

### Case 1: "B 12" not matching despite correct token positions

From the API:
https://api.openalex.org/w4396242287

In the `abstract_inverted_index` we can see:
"B":[17,...]
"12":[18,...]

So the tokens sit at consecutive positions (17 and 18) and should reconstruct as:
"B 12"
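
For anyone who wants to repeat the reconstruction, here is a minimal Python sketch of how I read the inverted index. It assumes the index is a plain mapping from token to a list of positions. The function name `reconstruct_abstract` and the toy data are mine and purely illustrative; only the positions 17 and 18 come from the record above, and the neighbouring token is made up.

```python
def reconstruct_abstract(inverted_index):
    """Rebuild text from a {token: [positions]} mapping by sorting on position."""
    positioned = []
    for token, positions in inverted_index.items():
        for pos in positions:
            positioned.append((pos, token))
    positioned.sort()  # order tokens by their position in the abstract
    return " ".join(token for _, token in positioned)


# Toy index (hypothetical, NOT the full real record).
toy_index = {"vitamin": [16], "B": [17], "12": [18]}
print(reconstruct_abstract(toy_index))  # vitamin B 12
```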

However:

❌ This query does NOT return the article:
https://openalex.org/works?page=1&filter=title_and_abstract.search:(%22b+12%22+OR+%22b+12%22+OR+%22b%E2%80%8912%22+OR+%22b%E2%80%8B12%22+OR+%22b-12%22+OR+%22b%2B12%22+OR+%22In+this+study,+we+investigated+vitamin+B+12%22)&include_xpac=true&group_by=&id=gsr1Uksfvc79GNfa9GN754&sort=relevance_score:desc&search.title_and_abstract=Evaluation+of+antioxidant+activity+and+fermentation+properties+of+potential+probiotic+strain+Lactiplantibacillus+plantarum+HY7720+in+plant-based+materials

✅ But if I truncate the last phrase by removing the trailing "12", the article is returned:
https://openalex.org/works?page=1&filter=title_and_abstract.search:(%22b+12%22+OR+%22b+12%22+OR+%22b%E2%80%8912%22+OR+%22b%E2%80%8B12%22+OR+%22b-12%22+OR+%22b%2B12%22+OR+%22In+this+study,+we+investigated+vitamin+B+%22)&include_xpac=true&group_by=&id=gsr1Uksfvc79GNfa9GN754&sort=relevance_score:desc&search.title_and_abstract=Evaluation+of+antioxidant+activity+and+fermentation+properties+of+potential+probiotic+strain+Lactiplantibacillus+plantarum+HY7720+in+plant-based+materials

This suggests that even though "B" and "12" are adjacent in the inverted index, the phrase "B 12" is not being matched correctly.
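
To make this easier to reproduce without the long UI links, a single-phrase query can be built against the API. The sketch below only constructs the URL and does not send a request. It assumes the documented `title_and_abstract.search` filter accepts a double-quoted phrase; the helper name `phrase_filter_url` is hypothetical.

```python
from urllib.parse import urlencode

API_BASE = "https://api.openalex.org/works"


def phrase_filter_url(phrase):
    """Build a works query that searches title and abstract for one quoted phrase."""
    # Assumption: wrapping the phrase in double quotes requests phrase matching.
    filter_value = f'title_and_abstract.search:"{phrase}"'
    return API_BASE + "?" + urlencode({"filter": filter_value})


print(phrase_filter_url("b 12"))
```

Testing one phrase per request, instead of several alternatives joined with OR, makes it clearer which phrase is responsible for a hit or a miss.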

---

### Case 2: Phrase breaks on normal words (not just numbers)

Another example from the same article:

❌ This exact phrase does NOT return the article:
"Lactiplantibacillus plantarum HY7720 was screened"

https://openalex.org/works?page=1&filter=title_and_abstract.search:(%22b+12%22+OR+%22b+12%22+OR+%22b%E2%80%8912%22+OR+%22b%E2%80%8B12%22+OR+%22b-12%22+OR+%22b%2B12%22+OR+%22Lactiplantibacillus+plantarum+HY7720+was+screened%22)&include_xpac=true&group_by=&id=gsr1Uksfvc79GNfa9GN754&sort=relevance_score:desc&search.title_and_abstract=Evaluation+of+antioxidant+activity+and+fermentation+properties+of+potential+probiotic+strain+Lactiplantibacillus+plantarum+HY7720+in+plant-based+materials

✅ But removing the last word makes it work:
"Lactiplantibacillus plantarum HY7720 was "

https://openalex.org/works?page=1&filter=title_and_abstract.search:(%22b+12%22+OR+%22b+12%22+OR+%22b%E2%80%8912%22+OR+%22b%E2%80%8B12%22+OR+%22b-12%22+OR+%22b%2B12%22+OR+%22Lactiplantibacillus+plantarum+HY7720+was+%22)&include_xpac=true&group_by=&id=gsr1Uksfvc79GNfa9GN754&sort=relevance_score:desc&search.title_and_abstract=Evaluation+of+antioxidant+activity+and+fermentation+properties+of+potential+probiotic+strain+Lactiplantibacillus+plantarum+HY7720+in+plant-based+materials

---

### Interpretation

This does not appear to be just a spacing issue.

Instead, it looks like a deeper inconsistency between:
- token positions in `abstract_inverted_index`
- and how phrase queries are actually evaluated

Even when:
- tokens are adjacent
- order is correct
- and no unusual whitespace is present

…the full phrase still fails to match.
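
The adjacency check I am describing can be sketched as follows. This is only my own check against the inverted index, not a description of how OpenAlex evaluates phrase queries internally. The function name `phrase_start_positions` and the toy data are hypothetical, and tokens are compared exactly (case and punctuation included).

```python
def phrase_start_positions(inverted_index, tokens):
    """Return sorted positions where `tokens` occur consecutively in the index."""
    if not tokens:
        return []
    # One set of positions per phrase token; missing tokens give an empty set.
    position_sets = [set(inverted_index.get(tok, ())) for tok in tokens]
    return sorted(
        start
        for start in position_sets[0]
        if all(
            start + offset in position_sets[offset]
            for offset in range(1, len(tokens))
        )
    )


# Toy index (hypothetical neighbours around the real positions 17 and 18).
toy_index = {"vitamin": [16], "B": [17], "12": [18]}
print(phrase_start_positions(toy_index, ["B", "12"]))  # [17]
print(phrase_start_positions(toy_index, ["12", "B"]))  # []
```

If a check like this finds the phrase in the index but the quoted search does not return the record, the mismatch is on the query-evaluation side and not in the stored positions.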

---

### Conclusion

This suggests a broader issue with phrase search reliability in OpenAlex, where:
- exact phrase queries do not consistently match reconstructed abstract text
- behavior is inconsistent even for simple word sequences

This is especially problematic for:
- scientific terms (like "B 12")
- and general phrase-based search workflows

---

It would be helpful to understand:
- how phrase queries are evaluated internally
- and whether this behavior is expected or a bug

Happy to provide more examples if needed.