Getting rid of the Java-ism of compression was quite fun. We
tried to push it to its limits. So the tokenizer, written 100%
in Prolog, now preserves '\n', '\r' and '\r\n'. But then when we, for
example, generate HTML, we have to replace the line terminator
by </div></div>. So how to do this correctly, without falling back to an
atom_split/2 with separator '\n'?
It's a little bit tricky: for example, an input such as 'abc\r'
should have a line count of 2. So anything that views '\r' as
padding will go wrong. The following works fine:
/**
 * sys_split_lines(L, I, O):
 * The predicate succeeds in L with the lines of the
 * input codes I, leaving the remaining codes in O.
 */
% sys_split_lines(-List, +List, -List)
sys_split_lines([A|L]) -->
   sys_split_line(X), {atom_codes(A,X)},
   sys_split_more(L).

% sys_split_more(-List, +List, -List)
sys_split_more([A|L]) --> sys_convert_sep, !,
   sys_split_line(X), {atom_codes(A,X)},
   sys_split_more(L).
sys_split_more([]) --> [].

% sys_split_line(-List, +List, -List)
sys_split_line([X|L]) --> \+ sys_convert_sep, [X], !,
   sys_split_line(L).
sys_split_line([]) --> [].
The above uses DCG (\+)/1 (% 7.14.11), which is banned (sic!) by Scryer Prolog.
The line separators are kind of pluggable. They are currently defined as follows,
but can be an arbitrary set of arbitrarily long code combinations:
% sys_convert_sep(+List, -List)
sys_convert_sep --> [0'\r, 0'\n].
sys_convert_sep --> [0'\n].
sys_convert_sep --> [0'\r].
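
As a usage sketch, the splitter can be driven through phrase/2. The block below repeats the clauses from above so that it loads on its own. split_atom_lines/2 is a hypothetical wrapper that is not part of the original code, and the sketch assumes a Prolog system whose DCG translation accepts (\+)/1, such as SWI-Prolog.

```prolog
% Clauses repeated from the post so the example loads stand-alone.
sys_split_lines([A|L]) -->
   sys_split_line(X), {atom_codes(A,X)},
   sys_split_more(L).

sys_split_more([A|L]) --> sys_convert_sep, !,
   sys_split_line(X), {atom_codes(A,X)},
   sys_split_more(L).
sys_split_more([]) --> [].

sys_split_line([X|L]) --> \+ sys_convert_sep, [X], !,
   sys_split_line(L).
sys_split_line([]) --> [].

% CRLF is listed first, so the cut in sys_split_more//1
% commits to it before a lone CR can match.
sys_convert_sep --> [0'\r, 0'\n].
sys_convert_sep --> [0'\n].
sys_convert_sep --> [0'\r].

% split_atom_lines(+Atom, -Lines)
% Hypothetical helper (not in the original post): converts the
% atom to codes and runs the grammar over the whole code list.
split_atom_lines(Atom, Lines) :-
   atom_codes(Atom, Codes),
   phrase(sys_split_lines(Lines), Codes).

% ?- split_atom_lines('abc\r', L).
% L = [abc, ''].
% ?- split_atom_lines('a\r\nb\nc', L).
% L = [a, b, c].
```

The first query shows the edge case mentioned earlier: the trailing '\r' terminates 'abc' and opens a second, empty line, so the result has two elements.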
BTW: I think SWI-Prolog already implements some of the ideas,
like encoding switching, which we do not have a demonstrator
for yet. But it's a little bit weak and stubborn concerning line
terminators, and refuses to support CRLF.