How to overcome 8192 character limit in text indexing for ElasticSearch

Jon Ku

Feb 14, 2024, 4:40:48 PM

to dotCMS User Group

According to the documentation page at https://www.dotcms.com/docs/latest/how-content-is-mapped-to-elasticsearch

"Only the first 8192 characters of Raw fields are indexed, and thus sorting is only performed based on the first 8192 characters for these fields."

It appears that all text, textarea and WYSIWYG fields are Raw, in which case our large blog articles will not be fully indexed. Full indexing is a business requirement for us, and it worked fine when we used Solr (similar to Elastic).

I can see that one advantage of this rule is to make indexing a faster process, and it also keeps the index to a smaller size. We currently have some 70,000 documents and are able to generate an incremental index hourly, on relatively standard hardware (single server for CMS and index generation, with search instance in the cloud).

Does anyone know where this value is controlled, or will we need to set up a custom instance of Elastic and/or a manual indexing process?

Any hints are welcome, thanks.

- Jon

Jon Ku

Feb 23, 2024, 7:47:37 AM

to dotCMS User Group

SOLVED

I am told that the 8192-character limit applies only to the Raw data used for sorting; the entire document is still indexed for search.
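
For anyone who finds this later, here is a rough Python sketch of how I understand the distinction. It is only a toy model, not dotCMS or Elasticsearch code: the function name, the dictionaries and the whitespace tokenizer are all made up for illustration, and only the 8192 figure comes from the docs quoted above.

```python
RAW_LIMIT = 8192  # character limit quoted in the dotCMS documentation


def index_document(doc_id, text, inverted_index, sort_keys):
    """Toy model (hypothetical): full text for search, truncated raw value for sorting."""
    # Search side: every token of the whole body goes into the inverted index.
    for token in text.lower().split():
        inverted_index.setdefault(token, set()).add(doc_id)
    # Sort side: only the first RAW_LIMIT characters are kept as the raw value.
    sort_keys[doc_id] = text[:RAW_LIMIT]


inverted_index = {}
sort_keys = {}

# A long article whose distinctive word sits well past character 8192.
long_body = "intro " * 2000 + "needle"
index_document("blog-1", long_body, inverted_index, sort_keys)

# The late word is still searchable because the full body was tokenized.
print("blog-1" in inverted_index.get("needle", set()))
# The raw value used for sorting was cut off before the late word.
print("needle" in sort_keys["blog-1"])
```

If the model is right, a term deep inside a long article is still found by search, and only the sort order ignores anything beyond the first 8192 characters.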