[ARTICLE] A more efficient way to process split strings

Finnian Reilly

Aug 14, 2017, 10:32:53 AM

to Eiffel Users

Introduction

One of the most commonly used string routines is {READABLE_STRING_GENERAL}.split. This routine is fine for most purposes, but it is inefficient if you need to process very many or very large strings, because it creates a separate string instance for each split part. In most practical cases a separate instance is not needed: all that is required is a data conversion or a test for a value. The split parts then have to be reclaimed by the garbage collector, and that is where the inefficiency comes from.
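For comparison, here is a minimal sketch of the conventional approach using only EiffelBase. The class name SPLIT_BASELINE and the sample data are made up for illustration. Each part returned by `split' is a newly created STRING that is used once for a conversion and then discarded.

```eiffel
class
	SPLIT_BASELINE

create
	make

feature {NONE} -- Initialization

	make
			-- Sum comma-separated integers using the standard `split'.
		local
			line: STRING
			parts: LIST [STRING]
			sum: INTEGER
		do
			line := "10,20,30"
				-- `split' creates one new STRING per part.
			parts := line.split (',')
			from parts.start until parts.after loop
					-- Each part is only needed for this conversion.
				if parts.item.is_integer then
					sum := sum + parts.item.to_integer
				end
				parts.forth
			end
			print (sum)
			print ("%N")
		end

end
```

With three parts the cost is negligible, but in a loop over millions of lines every part becomes garbage to be collected.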

Efficient String Splitting

The string class EL_ZSTRING (alias ZSTRING) from Eiffel-Loop has a number of agent-oriented routines for processing split strings efficiently. For those who don't know it: ZSTRING is an alternative to STRING_32 with a memory footprint similar to that of STRING_8. Think of it as a compressed STRING_32. There is a slight performance penalty, and for some routines it actually performs faster than STRING_32.

The ZSTRING split-string routines are: do_with_splits, for_all_split and there_exists_split. Each of these routines calls back the agent argument with exactly one instance of ZSTRING, which is reused for every split part. They all take a string delimiter of type READABLE_STRING_GENERAL as an argument. This is more useful than a character delimiter because you can, for example, use ", " (a comma followed by a space). Here are a number of usage examples:
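The sketch below is not Eiffel-Loop code and makes no claim about how ZSTRING implements these routines. It only illustrates the callback idea described above using STRING from EiffelBase: one buffer is created up front and refilled for each split part before the agent is called. The class name SPLIT_CALLBACK_SKETCH and the routines `do_with_parts' and `print_part' are hypothetical names chosen for this illustration.

```eiffel
class
	SPLIT_CALLBACK_SKETCH

create
	make

feature {NONE} -- Initialization

	make
		do
			do_with_parts ("one, two, three", ", ", agent print_part)
		end

feature -- Basic operations

	do_with_parts (str, delimiter: STRING; action: PROCEDURE [TUPLE [STRING]])
			-- Call `action' once for each part of `str' separated by `delimiter',
			-- passing the same reusable buffer on every call.
		require
			delimiter_not_empty: not delimiter.is_empty
		local
			buffer: STRING
			lower, upper, found: INTEGER
		do
				-- The only string created, however many parts there are.
			create buffer.make (str.count)
			from lower := 1 until lower > str.count + 1 loop
				found := str.substring_index (delimiter, lower)
				if found = 0 then
						-- No further delimiter: last part runs to the end.
					upper := str.count
				else
					upper := found - 1
				end
				buffer.wipe_out
				buffer.append_substring (str, lower, upper)
				action.call ([buffer])
				lower := upper + delimiter.count + 1
			end
		end

feature {NONE} -- Implementation

	print_part (part: STRING)
			-- Print `part' on its own line.
		do
			print (part)
			print ("%N")
		end

end
```

Because the same buffer is passed on every call, an agent that needs to keep a part beyond the callback has to copy it, for example with `part.twin'.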

These routines make use of the class EL_SPLIT_ZSTRING_LIST, which can also be used directly. In a "from start until after" loop the split part is referenced as `item'. Again, there is only ever one instance of the item string.

All of these routines rely on an enabling routine, split_intervals, which returns a list of type EL_SEQUENTIAL_INTERVALS. This class is an efficient way of storing a list of integer intervals as an array of type ARRAY [INTEGER_64]. You can iterate over this structure using the standard "from start until after" loop. Inside the loop the split string's start index and end index are referenced as `item_lower' and `item_upper', and its character count as `item_count'. You can also use split_intervals directly to efficiently process split strings, as in this example:
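The following sketch illustrates the interval idea with EiffelBase only. It does not use EL_SEQUENTIAL_INTERVALS or split_intervals. As a simplifying assumption it stores each interval as two consecutive INTEGER entries (lower, then upper) in an ARRAYED_LIST rather than in an ARRAY [INTEGER_64]. The class name SPLIT_INTERVALS_SKETCH and the routine `split_bounds' are hypothetical.

```eiffel
class
	SPLIT_INTERVALS_SKETCH

create
	make

feature {NONE} -- Initialization

	make
			-- Print the bounds and character count of each split part.
		local
			str: STRING
			bounds: ARRAYED_LIST [INTEGER]
			i, lower, upper: INTEGER
		do
			str := "alpha, beta, gamma"
			bounds := split_bounds (str, ", ")
			from i := 1 until i > bounds.count loop
				lower := bounds.i_th (i)
				upper := bounds.i_th (i + 1)
				print (lower.out + ".." + upper.out + " count: " + (upper - lower + 1).out + "%N")
				i := i + 2
			end
		end

feature -- Access

	split_bounds (str, delimiter: STRING): ARRAYED_LIST [INTEGER]
			-- Lower and upper index of each part of `str' separated by
			-- `delimiter', stored as consecutive pairs.
			-- No substrings are created.
		require
			delimiter_not_empty: not delimiter.is_empty
		local
			lower, upper, found: INTEGER
		do
			create Result.make (8)
			from lower := 1 until lower > str.count + 1 loop
				found := str.substring_index (delimiter, lower)
				if found = 0 then
					upper := str.count
				else
					upper := found - 1
				end
				Result.extend (lower)
				Result.extend (upper)
				lower := upper + delimiter.count + 1
			end
		end

end
```

For the sample string this prints the intervals 1..5, 8..11 and 14..18 with counts 5, 4 and 5. Working from the indices alone means a substring only needs to be created for the parts that are actually of interest.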