splitting strings (swi-prolog)

Dustin Kick

unread,

Mar 2, 2008, 6:43:13 PM3/2/08

to

I'm trying to write a predicate that takes a string as input, and outputs
a list of strings delimited by spaces (for now).

i.e.
?- split_string("this is a test",Out).

Out=["this","is","a","test"]

I'm using swipl, and I've been tracing it, and see that it does what I
want on the way into the functions, but on the way out, it throws away the
strings I was trying to collect. Can anyone school me on this?

split_string(String,Collected_strings):-
string_to_list(String,Charlist),
char_code(' ',Space),
collect_strings(Charlist,Space,Collected_strings).

collect_strings([],_,[]):-!.
collect_strings(Charlist,Last,[String|Collected_strings]):-
collect_chars(Charlist,Nextlist,Last,Collected_chars),
string_to_list(String,Collected_chars),
collect_strings(Nextlist,Last,Collected_strings).

collect_chars([32|Charlist],Charlist,Last,[]):-
Last\==32,!.
collect_chars([32|Charlist],_,32,Collected_chars):-
collect_chars(Charlist,_,32,Collected_chars),!.
collect_chars([Code|Charlist],Nextlist,_,[Code|Collected_chars]):-
Code\==32,
Last1=Code,
collect_chars(Charlist,Nextlist,Last1,Collected_chars),!.

--

Dustin Kick
http://homepage.mac.com/mac_vieuxnez

Nick Wedd

unread,

Mar 3, 2008, 9:02:10 AM3/3/08

to

In message <mac_vieuxnez-0...@192.168.1.199>, Dustin Kick
<mac_vi...@mac.com> writes

Your specification is nice and clear, but I can't follow your code.

If I wrote this it would include a line something like
append( FirstString, [32|Rest], ListOfChars )

Nick
--
Nick Wedd ni...@maproom.co.uk

Dustin Kick

unread,

Mar 3, 2008, 9:30:51 AM3/3/08

to

In article <RGwW8DZi...@maproom.demon.co.uk>, Nick Wedd
<ni...@maproom.co.uk> wrote:

I think I saw that solution, it would yield:
?-split_string("test this",Out).

Out=["test"," "," "," ","this"]

Well, the collect_chars does that part well (I forgot to mention I'm
munching spaces, so I don't get strings returned with a couple of spaces
at the end (or front)), and as I trace it, it show the strings getting
collected in Collected_strings, while moving deeper into the call, but it
throws them away again on the way out.

fodor...@gmail.com

unread,

Mar 3, 2008, 1:25:12 PM3/3/08

to

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%
% split/3
% split(+OldString,+Pattern,-ListStrings)
%
% Examples: string:split("a b c"," ",ListStrings).
% ListStrings = ["a","b","c"]
split(OldString,Pattern,ListStrings):-
split(OldString,Pattern,[],ListStrings).
% split(+OldString,+Pattern,+PartialStart,-ListStrings).
split([],_Pattern,[],[]):-
!.
split([],_Pattern,PartialStart,[PartialStart]):-
!.
split(OldString,Pattern,[],[RestStrings]):-
startsWith(OldString,Pattern,Rest),
!,
split(Rest,Pattern,[],RestStrings).
split(OldString,Pattern,PartialStart,[PartialStart|RestStrings]):-
startsWith(OldString,Pattern,Rest),
!,
split(Rest,Pattern,[],RestStrings).
split([H|T],Pattern,PartialStart,RestStrings):-
!,
append(PartialStart,[H],PartialStartTemp),
split(T,Pattern,PartialStartTemp,RestStrings).

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%
% startsWith/3
% startsWith(OldString,Pattern,Rest)
%
% Examples: startsWith(" aaa"," ",Rest)
% Rest = "aaa"
startsWith(OldString,[],OldString):-
!.
startsWith([H|TOldString],[H|T],Rest):-
!,
startsWith(TOldString,T,Rest).

Dustin Kick

unread,

Mar 3, 2008, 4:17:27 PM3/3/08

to

That's still not munching the spaces for me. It yields:

?-split("this is a test"," ", Out).

Out=["this","is","a"," test"]

I really think the code I'm working on is on the right track for what I
want to do, but I can't figure out why it throws the strings out of the
lists on the way up from the depths I've been tracing it, and at the
bottom, after having parsed all the characters in a list, the
Collected_strings at the top will be ["this","is","a",_GXXX|_GYYY], and "test" will
be stored in String, then on the way out of each call it throws all the
strings in the list away one by one. hmmm... I guess I know where to
look, but can't seem to find it. (probably a head smacker ("Can't
believe I didn't get that!" sort of thing).

--

Dustin Kick
http://homepage.mac.com/mac_vieuxnez

Dustin Kick

unread,

Mar 3, 2008, 4:42:16 PM3/3/08

to

I forgot to put in the boundary condition for the collect_chars,
thinking that the condition for collect_strings could do it by itself... or
rather, not thinking of it at all. anyway, thanks for help rendered, this
was my final solution, if anyone wants to see it. (or not)

split_string(String,Collected_strings):-
string_to_list(String,Charlist),
char_code(' ',Space),
collect_strings(Charlist,Space,Collected_strings).

collect_strings([],_,[]):-!.
collect_strings(Charlist,Last,[String|Collected_strings]):-
collect_chars(Charlist,Nextlist,Last,Collected_chars),
string_to_list(String,Collected_chars),
collect_strings(Nextlist,Last,Collected_strings).

collect_chars([],_,_,[]):-!.

collect_chars([32|Charlist],Charlist,Last,[]):-
Last\==32,!.
collect_chars([32|Charlist],_,32,Collected_chars):-
collect_chars(Charlist,_,32,Collected_chars),!.
collect_chars([Code|Charlist],Nextlist,_,[Code|Collected_chars]):-
Code\==32,
Last1=Code,
collect_chars(Charlist,Nextlist,Last1,Collected_chars),!.

Dustin Kick

unread,

Mar 3, 2008, 4:59:51 PM3/3/08

to

I had to change one thing to make the munching work, this is functional
just the way I wanted:

split_string(String,Collected_strings):-
string_to_list(String,Charlist),
char_code(' ',Space),
collect_strings(Charlist,Space,Collected_strings).

collect_strings([],_,[]):-!.
collect_strings(Charlist,Last,[String|Collected_strings]):-
collect_chars(Charlist,Nextlist,Last,Collected_chars),
string_to_list(String,Collected_chars),
collect_strings(Nextlist,Last,Collected_strings).

collect_chars([],_,_,[]):-!.
collect_chars([32|Charlist],Charlist,Last,[]):-
Last\==32,!.

collect_chars([32|Charlist],Nextlist,32,Collected_chars):- % <----I
wasn't passing the Nextlist through before. collect_chars(Charlist,Nextlist,32,Collected_chars),!.

collect_chars([Code|Charlist],Nextlist,_,[Code|Collected_chars]):-
Code\==32,
Last1=Code,
collect_chars(Charlist,Nextlist,Last1,Collected_chars),!.

Carlo

unread,

Mar 4, 2008, 5:15:13 PM3/4/08

to

"Dustin Kick" <mac_vi...@mac.com> ha scritto nel messaggio
news:mac_vieuxnez-0...@192.168.1.199...

> I'm trying to write a predicate that takes a string as input, and outputs
> a list of strings delimited by spaces (for now).
>
> i.e.
> ?- split_string("this is a test",Out).
>
> Out=["this","is","a","test"]
>
> I'm using swipl, and I've been tracing it, and see that it does what I
> want on the way into the functions, but on the way out, it throws away the
> strings I was trying to collect. Can anyone school me on this?

Usually, input analysis in Prolog is better done using DCG (Definite Clause
Grammars), a standard extension that enable very expressive parsing
handling. It's also efficient. Using SWI-Prolog, search DCG in the help:
then you'll find a concise (but complete) example.
Using that, i propose:

% driver: should be the only public member of your module split_string
%
split_string(S, L) :- phrase(split_str(L), S).

% scan a list of words separed by spaces
%
split_str([H|T]) --> blanks, inwords(H), blanks, split_str(T).
split_str([]) --> [].

% a word is a sequence of (at least one!) not blanks
%
inwords([C|Cs]) --> [C], { ok(C) }, inwords(Cs).
inwords([C]) --> [C], { ok(C) }. %bug: inwords([]) --> [].

% skip blanks (test and lose...)
%
blanks --> [C], { ko(C) }, blanks.
blanks --> [].

ok(C) :- \+ ko(C).
ko(C) :- code_type(C, space).

Bye Carlo

Markus Triska

unread,

Mar 4, 2008, 5:05:47 PM3/4/08

to

Dustin Kick<mac_vi...@mac.com> writes:

> I had to change one thing to make the munching work, this is
> functional just the way I wanted

Consider DCGs for convenience - for example:

string_tokens(Cs, Ts) :- phrase(tokens(Cs, []), Ts).

tokens([], Ts) --> token(Ts).
tokens([C|Cs], Ts) -->
( { C == 0' } -> token(Ts), tokens(Cs, [])
; tokens(Cs, [C|Ts])
).

token([]) --> [].
token([T|Ts]) --> { reverse([T|Ts], Token) }, [Token].

Yielding:

?- string_tokens("this is a test ", ["this", "is", "a", "test"]).
%@ true.

--
comp.lang.prolog FAQ: http://www.logic.at/prolog/faq/

Carlo Capelli

unread,

Mar 5, 2008, 2:14:40 AM3/5/08

to

Markus does it better! I didn't see his post, i was debugging mine...
And shurely there is still a bug, failing to parse a blanks only string...

"Carlo" <cc...@tin.it> ha scritto nel messaggio
news:RRjzj.259973$%k.38...@twister2.libero.it...

Stephan Lukits

unread,

Mar 5, 2008, 6:54:44 AM3/5/08

to

Dustin Kick schrieb:

> I'm trying to write a predicate that takes a string as input, and outputs
> a list of strings delimited by spaces (for now).
>

How about:
A List is a tokenization of a charact sequence separated by a
separator string if every token is (ordered) within the sequence
followed by the separator but the last token.

%tokenized(sting, token list, separator).

tokenized([], [[]], _).

tokenized([C|Cs], [[C|TCs]|Ts], [S|Ss]) :-
C \= S,
tokenized(Cs, [TCs|Ts], [S|Ss]), !.

tokenized([C|Cs], [[]|Ts], [C|Ss]) :-
separated([C|Cs], Ts, [C|Ss], [C|Ss]).

%separated(string, token list, separator, separator).

separated([C|Cs], Ts, [], TempSs) :-
tokenized([C|Cs], Ts, TempSs).

separated([S|Cs], Ts, [S|Ss], TempSs) :-
separated(Cs, Ts, Ss, TempSs).

?- tokenized("test ; string", TokenList, " ; "),
maplist(name, TextList, TokenList).

TokenList = [[116, 101, 115, 116], [115, 116, 114, 105, 110, 103]],
TextList = [test, string]

?- tokenized(String, ["test", "string"], " ; "), name(Text, String).

String = [116, 101, 115, 116, 32, 59, 32, 115, 116|...],
Text = 'test ; string'

Regards
Stephan

Duncan Patton

unread,

Mar 6, 2008, 8:48:34 AM3/6/08

to

On Tue, 04 Mar 2008 23:05:47 +0100
Markus Triska <tri...@logic.at> wrote:

> Dustin Kick<mac_vi...@mac.com> writes:
>
> > I had to change one thing to make the munching work, this is
> > functional just the way I wanted
>
> Consider DCGs for convenience - for example:
>
> string_tokens(Cs, Ts) :- phrase(tokens(Cs, []), Ts).
>
> tokens([], Ts) --> token(Ts).
> tokens([C|Cs], Ts) -->
> ( { C == 0' } -> token(Ts), tokens(Cs, [])
> ; tokens(Cs, [C|Ts])
> ).
>
> token([]) --> [].
> token([T|Ts]) --> { reverse([T|Ts], Token) }, [Token].
>
> Yielding:
>
> ?- string_tokens("this is a test ", ["this", "is", "a", "test"]).
> %@ true.

string_tokens(Cs, StpS, Ts) :- phrase(tokens(Cs, StpS, []), Ts).

tokens([], _, Ts) --> token(Ts).
tokens([C|Cs], StpS, Ts) -->
% ( { C == 0' } -> token(Ts), tokens(Cs, StpS, [])
( { memberchk(C,StpS) } -> token(Ts), tokens(Cs, StpS, [])
; tokens(Cs, StpS, [C|Ts])
).

token([]) --> [].
token([T|Ts]) --> { reverse([T|Ts], Token) }, [Token].

Slight mods ...

Dhu

Dustin Kick

unread,

Mar 6, 2008, 8:00:46 PM3/6/08

to

Thanks to those who have recommended definite clause grammars, I need to
read about them, but it sound like it may have been what I had been
looking for. I didn't find anything searching for tokenization, splitting
strings, mapping characters or anything else I would have thought of.
Definite Clause Grammar, of course, it just makes sense.
If anyone has any ideas how to work difference lists into this, which
I'm hoping will make it more efficient, and give me a chance to put
difference lists into practice, I'd appreciate them.

Duncan Patton

unread,

Mar 6, 2008, 9:56:45 PM3/6/08

to

On Tue, 04 Mar 2008 23:05:47 +0100
Markus Triska <tri...@logic.at> wrote:

> Dustin Kick<mac_vi...@mac.com> writes:
>
> > I had to change one thing to make the munching work, this is
> > functional just the way I wanted
>
> Consider DCGs for convenience - for example:
>
> string_tokens(Cs, Ts) :- phrase(tokens(Cs, []), Ts).
>
> tokens([], Ts) --> token(Ts).
> tokens([C|Cs], Ts) -->

Just as a matter of interest, what's this C == 0' notation?
Why does 0' evaluate to 32 (space)?

Dhu

fodor...@gmail.com

unread,

Mar 7, 2008, 9:36:28 AM3/7/08

to

On Mar 6, 8:00 pm, Dustin Kick<mac_vieux...@mac.com> wrote:
> Thanks to those who have recommended definite clause grammars, I need to
> read about them, but it sound like it may have been what I had been
> looking for. I didn't find anything searching for tokenization, splitting
> strings, mapping characters or anything else I would have thought of.

> Definite Clause Grammar, of course, it just makes sense.
> If anyone has any ideas how to work difference lists into this, which

I think that in most prolog systems, DCG's get translated into Prolog
code with different lists, see:
http://xsb.sourceforge.net/manual1/node155.html

A DCG rule such as:
p(X) -> q(X).
will be translated (expanded) into:
p(X, Li, Lo) :- q(X, Li, Lo).

> I'm hoping will make it more efficient, and give me a chance to put
> difference lists into practice, I'd appreciate them.
> --
>
> Dustin Kickhttp://homepage.mac.com/mac_vieuxnez

DCG

Carlo Capelli

unread,

Mar 7, 2008, 10:46:21 AM3/7/08

to

"Dustin Kick" <mac_vi...@mac.com> ha scritto nel messaggio

news:2t0Aj.110$zE5...@newsfe02.lga...

Many years ago, i wrote an interpreter, and the DCG via this code:

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Definite Clause Grammar translator
% from Clocksin, Mellish
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

:- op(251, fx, { ).
:- op(250, xf, } ).
:- op(255, xfx, -->).

translate((P0 --> Q0), (P :- Q)) :-
left_hand_side(P0, S0, S, P),
right_hand_side(Q0, S0, S, Q1),
flatten(Q1, Q), !.

left_hand_side((NT, Ts), S0, _S, P) :- !,
nonvar(NT),
islist(Ts),
tag(NT, S0, S1, P),
append(Ts, S0, S1).
left_hand_side(NT, S0, S, P) :-
nonvar(NT),
tag(NT, S0, S, P).

right_hand_side((X1, X2), S0, S, P) :- !,
right_hand_side(X1, S0, S1, P1),
right_hand_side(X2, S1, S, P2),
and(P1, P2, P).
right_hand_side((X1 ; X2), S0, S, (P1 ; P2)) :-
or(X1, S0, S, P1),
or(X2, S0, S, P2).
right_hand_side({P}, S, S, P) :- !.
right_hand_side(!, S, S, !) :- !.
right_hand_side(Ts, S0, S, true) :-
islist(Ts),
!, append(Ts, S, S0).
right_hand_side(X, S0, S, P) :-
tag(X, S0, S, P).

or(X, S0, S, P) :-
right_hand_side(X, S0a, S, Pa),
( var(S0a), S0a = S, !, S0 = S0a, ! = Pa;
P = (S0 = S0a, Pa) ).

tag(X, S0, S, P) :-
X =.. [F | A],
append(A, [S0, S], AX),
P =.. [F | AX].

and(true, P, P) :- !.
and(P, true, P) :- !.
and(P, Q, (P, Q)).

flatten(A, A) :-
var(A), !.
flatten((A, B), C) :- !,
flatten1(A, C, R),
flatten(B, R).
flatten(A, A).

flatten1(A, (A, R), R) :-
var(A), !.
flatten1((A, B), C, R) :- !,
flatten1(A, C, R1),
flatten1(B, R1, R).
flatten1(A, (A, R), R).

islist([]) :- !.
islist([_|_]).

append([A|B], C, [A|D]) :- append(B, C, D).
append([], X, X).

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% convert DCG rules to clauses
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
d2c :-
clause((H-->T),true),
translate((H-->T), Clause),
assert(Clause),
display(Clause), nl,
fail.
d2c.

It's not so simple..., and indeed some time after i read inSterling-Shapiro
'The Art of Prolog' a simpler approach, maybe matched in SICTus
implementation.

Bye Carlo

>
> Dustin Kick
> http://homepage.mac.com/mac_vieuxnez
>

Dustin Kick

unread,

Mar 7, 2008, 12:08:59 PM3/7/08

to

I just got around to testing your solution, and it works nicely, just as
you said it would, not that I doubted, but I don't understand the code,
yet. Is there a goal I can run DCGs through to see the expanded code?

--

Dustin Kick
http://homepage.mac.com/mac_vieuxnez

Markus Triska

unread,

Mar 10, 2008, 4:30:07 PM3/10/08

to

Dustin Kick<mac_vi...@mac.com> writes:

> Is there a goal I can run DCGs through to see the expanded code?

Use clause/2 to access its term representation. Also try listing/[01]:

?- listing(tokens).
%@ tokens([], A, B, C) :-
%@ token(A, B, C).
%@ tokens([A|E], C, B, G) :-
%@ ( A==32,
%@ D=B
%@ -> token(C, D, F),
%@ tokens(E, [], F, G)
%@ ; tokens(E, [A|C], B, G)
%@ ).

bart demoen

unread,

Mar 10, 2008, 5:26:16 PM3/10/08

to

On Mon, 10 Mar 2008 21:30:07 +0100, Markus Triska wrote:

> Dustin Kick<mac_vi...@mac.com> writes:
>
>> Is there a goal I can run DCGs through to see the expanded code?
>
> Use clause/2 to access its term representation. Also try listing/[01]:
>
> ?- listing(tokens).
> %@ tokens([], A, B, C) :-
> %@ token(A, B, C).
> %@ tokens([A|E], C, B, G) :-
> %@ ( A==32,
> %@ D=B
> %@ -> token(C, D, F),
> %@ tokens(E, [], F, G)
> %@ ; tokens(E, [A|C], B, G)
> %@ ).
> %@ true.

There are two things I can't grok:

1) the %@ : when I do ?- listing(tokens). those weird symbols don't show
up. We are using the same SWI, or not ?

2) why is there 32 in the output, while the original program had 0' ?
is this unavoidable, an SWI bug or an ISO Prolog inconsistency ?

Cheers

Bart Demoen

Jan Wielemaker

unread,

Mar 11, 2008, 5:39:56 AM3/11/08

to

On 2008-03-10, bart demoen <b...@cs.kuleuven.be> wrote:
> On Mon, 10 Mar 2008 21:30:07 +0100, Markus Triska wrote:
>
>> Dustin Kick<mac_vi...@mac.com> writes:
>>
>>> Is there a goal I can run DCGs through to see the expanded code?
>>
>> Use clause/2 to access its term representation. Also try listing/[01]:
>>
>> ?- listing(tokens).
>> %@ tokens([], A, B, C) :-
>> %@ token(A, B, C).
>> %@ tokens([A|E], C, B, G) :-
>> %@ ( A==32,
>> %@ D=B
>> %@ -> token(C, D, F),
>> %@ tokens(E, [], F, G)
>> %@ ; tokens(E, [A|C], B, G)
>> %@ ).
>> %@ true.
>
> There are two things I can't grok:
>
> 1) the %@ : when I do ?- listing(tokens). those weird symbols don't show
> up. We are using the same SWI, or not ?

I leave that to Markus

> 2) why is there 32 in the output, while the original program had 0' ?
> is this unavoidable, an SWI bug or an ISO Prolog inconsistency ?

You know the answer: as it stands in ISO, it is unavoidable. The
tokeniser must translate 0' into the character code of the space. In
general that is even undefined but SWI-Prolog is internally Unicode,
so it is defined as 32, regardless of the locale. characters codes
however are no special type and therefore cannot be distinguished from
integers. I'm not sure whether ISO would allow for a subtype of
integer that represents character codes. Possibly.

Same for [32] and " ", etc. To a certain extend this can be remedied
using ?- set_prolog_flag(double_quotes, chars). It doesn't fix all
issues though, and a global flag that introduces such big
incompatibilities causes more troubles than it solves. I never touch
that flag for any real programming task.

I once raised a similar issues about [] == [ ] == [/*empty list*/] == '[]'
It is fine for the first three to be equal, but I still have doubts on the
latter. Same for {}, though this causes less confusing on practice.

I don't think there is an easy fix to these issues without introducing
serious compatibility issues.

Cheers --- Jan

Markus Triska

unread,

Mar 11, 2008, 12:35:29 PM3/11/08

to

bart demoen <b...@cs.kuleuven.be> writes:

> 1) the %@ : when I do ?- listing(tokens).

I'm using the default value of ediprolog-prefix ("%@ "), and this is
what gets inserted when I evaluate a query. More information:

http://www.logic.at/prolog/ediprolog/ediprolog.html

An idiosyncratic symbol combination is used to make inserted output
automatically detectable. C-3 F10 flushes answers that were previously
inserted in the buffer, assuming ediprolog-dwim is bound to F10.

Carlo Capelli

unread,

Mar 12, 2008, 5:35:51 AM3/12/08

to

I post the modified code. See the difference in split_str.

% driver: should be the only public member of your module split_string
%
split_string(S, L) :- phrase(split_str(L), S).

% scan a list of words separed by spaces
%

split_str([H|T]) --> blanks, inwords(H), blanks, !, split_str(T).
split_str([], _, _).

% a word is a sequence of (at least one!) not blanks
%
inwords([C|Cs]) --> [C], { ok(C) }, inwords(Cs).
inwords([C]) --> [C], { ok(C) }.

% skip blanks (test and lose...)

%
blanks --> [C], { ko(C) }, blanks.
blanks --> [].

ok(C) :- \+ ko(C).
ko(C) :- code_type(C, space).

Bye Carlo

"Carlo Capelli" <carlo....@rdbos.it> ha scritto nel messaggio
news:iOrzj.6359$q53....@tornado.fastwebnet.it...

% Correction
split_str([H|T]) --> blanks, inwords(H), blanks, !, split_str(T).
split_str([], _, _).