a little parsing challenge ☺


Xah Lee

Jul 17, 2011, 3:47:42 AM
2011-07-16

folks, this one will be an interesting one.

the problem is to write a script that can check a dir of text files
(and all subdirs) and reports if a file has any mismatched matching
brackets.

• The files will be utf-8 encoded (unix style line ending).

• If a file has mismatched matching pairs, the script will display the
file name, and the line number and column number of the first
instance where a mismatched bracket occurs (or just the char number
instead, as in emacs's “point”).

• the matching pairs are all single unicode chars. They are these and
nothing else: () {} [] “” ‹› «» 【】 〈〉 《》 「」 『』
Note that ‘single curly quote’ is not considered a matching pair here.

• Your script must be standalone: it must not use parser tools, but it
may call libraries that are part of the standard distribution of your lang.

Here's an example of mismatched brackets: ([)], (“[[”), ((, 】, etc. (And
yes, the brackets may be nested. There is usually text between these
chars.)

I'll be writing an emacs lisp solution and post it in 2 days. I welcome
other lang implementations, in particular perl, python, php, ruby,
tcl, lua, Haskell, OCaml. I'll also be able to eval Common Lisp
(clisp), Scheme (scsh), and Java. Other langs such as Clojure,
Scala, C, C++, or any others are all welcome, but I won't be able to
eval them. A javascript implementation would be very interesting too, but
please indicate which one and where to install the command-line version.

I hope you'll find this an interesting “challenge”. This is a parsing
problem. I haven't studied parsers except some Wikipedia reading, so
my solution will probably be naive. I hope to see and learn from your
solutions too.

I hope you'll participate. Just post your solution here. Thanks.

Xah

Raymond Hettinger

Jul 17, 2011, 5:48:42 AM
On Jul 17, 12:47 am, Xah Lee <xah...@gmail.com> wrote:
> i hope you'll participate. Just post solution here. Thanks.

http://pastebin.com/7hU20NNL


Raymond

Chris Angelico

Jul 17, 2011, 7:34:41 AM
to pytho...@python.org
On Sun, Jul 17, 2011 at 5:47 PM, Xah Lee <xah...@gmail.com> wrote:
> the problem is to write a script that can check a dir of text files
> (and all subdirs) and reports if a file has any mismatched matching
> brackets.

I wonder whether it would be possible to code the whole thing as a single
regular expression... I'm pretty sure it could be done as a sed or awk
script, but I'm insufficiently expert in either to do the job.

ChrisA

rusi

Jul 17, 2011, 7:52:25 AM

Not possible: see http://en.wikipedia.org/wiki/Pumping_lemma_for_regular_languages#Use_of_lemma

Informally stated: regexes can't count.
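rusi's point can be made concrete with a small illustration (mine, not from the thread): any classical regex hard-codes a maximum nesting depth, while a simple counter, the smallest pushdown automaton, handles any depth.

```python
import re

# A classical (non-recursive) regex can only match nesting up to a
# fixed, hard-coded depth. This pattern handles at most two levels:
depth2 = re.compile(r'^(\((\([^()]*\))*\))*$')

assert depth2.match('(())')
assert not depth2.match('((()))')  # depth 3 defeats the fixed pattern

# A counter -- the simplest pushdown automaton -- handles any depth:
def balanced(s):
    level = 0
    for c in s:
        if c == '(':
            level += 1
        elif c == ')':
            level -= 1
            if level < 0:
                return False  # closer with nothing open
    return level == 0

assert balanced('((()))')
assert not balanced('(()')
```

A pattern for depth 3 would need another hand-written layer of nesting, and so on without end; that is exactly what the pumping lemma formalizes.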

Robert Klemme

Jul 17, 2011, 9:20:25 AM

Ruby solution: https://gist.github.com/1087583

Kind regards

robert

mhenn

Jul 17, 2011, 9:55:25 AM

I actually don't know Ruby that well, but it looks like your program
recognizes "[(])" as correct although it is not, because you translate
"[(])" to "(())" (which is indeed balanced, but no longer correctly
resembles the input).
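The failure mode mhenn describes can be shown in a few lines of Python (an illustration of the point, not Robert's actual Ruby code): counting each bracket type independently calls "[(])" balanced, while a stack, which tracks order, rejects it.

```python
s = '[(])'

# Per-type counts all balance, so a counting checker is fooled:
assert s.count('(') == s.count(')')
assert s.count('[') == s.count(']')

# A stack tracks the *order* of openers, so the interleaving is caught:
pairs = {')': '(', ']': '['}
stack, ok = [], True
for c in s:
    if c in pairs.values():
        stack.append(c)          # opener: remember it
    elif c in pairs:
        if stack and stack[-1] == pairs[c]:
            stack.pop()          # closer matches most recent opener
        else:
            ok = False           # ']' arrived while '(' was on top
            break
assert not ok
```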


Thomas 'PointedEars' Lahn

Jul 17, 2011, 10:15:46 AM
Chris Angelico wrote:

Did you notice the excessive crosspost? Please do not feed the troll.

In the classical sense it is not possible, as classical regular expressions
have no concept of recursion. Indeed, matching brackets are a textbook
example for a non-regular¹ context-free language L = {a^n b^n; n > 0} that
can only be recognized by a pushdown automaton (PDA). (Finite automata
"cannot count".)

However, it is possible (and done) to use classical regular expressions or
non-recursive Regular Expressions (note the different character case) to
match tokens more efficiently with a PDA implementation. This is commonly
called a parser. (Programming languages tend to be specified in terms of a
context-free grammar – they tend to be context-free languages –, which is
why a parser is a part of a compiler or interpreter. See for example
Python.²)

It is possible, with Perl-compatible Regular Expressions (PCRE), provided
that you have enough memory, to use such an extended Regular Expression (not
to be confused with EREs³)⁴:

\((([^()]*|(?R))*)\)

However, even Python 3.2 does not support those expressions (although it
supports some other PCRE patterns, like named subexpressions)⁵, neither do
standard and forked versions of sed(1) (BREs, EREs, using an NFA) nor awk
(EREs, using a DFA or NFA). [That is not to say it would not be possible
with Python, or sed or awk (both of which are off-topic here), but that more
than a Regular Expression would be required.]

On this subject, I recommend reading the preview chapters of the second and
third editions, respectively, of Jeffrey E. F. Friedl's "Mastering Regular
Expressions", which are available online for free at O'Reilly.com⁶.

HTH.

______
¹ because it can be proved that the pumping lemma for regular languages
does not apply to it; see also
<http://en.wikipedia.org/wiki/Chomsky_hierarchy> pp.
² <http://docs.python.org/reference/>
³ <http://en.wikipedia.org/wiki/Regular_expression>
⁴ Cf. <http://php.net/manual/en/regexp.reference.recursive.php>
⁵ <http://docs.python.org/library/re.html>
⁶ <http://oreilly.com/catalog/regex/chapter/ch04.html>,
<http://oreilly.com/catalog/9780596528126/preview#preview>
--
PointedEars

Bitte keine Kopien per E-Mail. / Please do not Cc: me.

gene heskett

Jul 17, 2011, 10:26:59 AM
to pytho...@python.org
On Sunday, July 17, 2011 10:12:27 AM Xah Lee did opine:

This is a very old solution, some of it nearly 30 years old. It's
written in C; the trick would be doing it in python, I guess.
======================begin cntx.c=======================
/*^k^s
.ds2
.hb
.fb^k^s^b Cntx.c, page #^k^s^b
*****************************************************************
* *
* CC (C Checker) *
* *
* C Source Brackets, Parenthesis, brace, *
* quote and comment Checker *
* *
* T. Jennings -- Sometime in 1983 *
* Slightly tweaked and renamed MINILINT *
* KAB Aug 84 *
* Ported to OS9 and further tweaked *
* Brian Paquette March 91 *
* Brackets, single, double quote counters added *
* & renamed Cntx 04/09/91 *
* by Gene Heskett *
* *
* some additional code for ignoring "syntax" chars inside of *
* double quoted strings added 3/21/93 by Gene Heskett *
* same for single quoted stuffs 7/10/93 by Gene Heskett *
* And long lines handling ability added too. *
* Adding tab ignorers and a counter to tally how many 11/21/93 *
****************************************************************/
#define OS9 /* used for nested comment handling*/
/* comment out for non OS9/6809*/

#include <stdio.h>
#include <ctype.h>
#include <string.h>
#define FALSE 0
#define TRUE 1

#ifdef OS9
#define NO " No "
#define YES " Yes "
char *cmnt;
#endif

/* Very crude but very effective C source debugger. Counts the numbers of
matching braces, parentheses and comments, and displays them at the left
edge of the screen. The best way to see what it does is to do it; try

cntx -v cntx.c

Properly handles parens and braces inside comments; they are ignored.
Also ignores single quotes if doubles are odd number, so singles
can be used in a printf string for punctuation. IF you see the doubles
are odd at line end (the printout tally), all bets are OFF!
Enter cntx on the command line by itself for a usage note.
*/

main(argc,argv)
int argc;
char *argv[];
{
FILE *f;
char c,secnd_c,lastc;
int bracket,parens,braces,squote,dquote,comments;
int perr,bkerr,brerr,sqerr,dqerr;
int verbose,okay;
int filearg = 0;
int line, col, tabc;

if ((argc < 2)||(argc > 3)) getout(0);
if (argc == 3)
{
verbose = TRUE; /* already tested for -v switch */
if((argv[1][0] == '-') && (toupper(argv[1][1]) == 'V'))
filearg = 2; /*file name pointed to by argv[2] */
if((argv[2][0] == '-') && (toupper(argv[2][1]) == 'V'))
filearg = 1;
if(!filearg) getout(192);
}
else
{
verbose = FALSE;
filearg = 1;
}
if ((f = fopen(argv[filearg],"r")) == NULL)
{
fprintf(stderr,"Cntx: can't open '%s'\n",argv[1]);
getout(216);
}
bracket= braces= parens= comments= squote= dquote= 0;
perr= bkerr= brerr= sqerr= dqerr= 0;
line= col= tabc= 0;
secnd_c= lastc= '\0';

while ((c = getc(f)) != EOF)
{
while(c==0x09) /* ignore, but tally the count */
{
tabc+=1;
c=getc(f);
}

/* print running tally if in verbose mode and at beginning of line*/
/* OS9 version prints status of whether or not one is in a comment rather*/
/* than a count, as the Microware C compiler does not nest comments*/

if ((col == 0) && verbose )
{
#ifdef OS9
if (comments)
cmnt = YES;
else cmnt = NO;
printf("%d: [%d] {%d} (%d) \'%d\' \"%d\" /*%s*/ tabcnt=%d\n\n",
line,bracket,braces,parens,squote,dquote,cmnt,tabc);
#else
printf("%d: [%d] {%d} (%d) \'%d\' \"%d\" /*%d*/\n\n",
line,bracket,braces,parens,squote,dquote,comments);
#endif
}

/* additions to help tally squote & dquote errors at line end,
squotes and dquotes should match if we don't count those squotes
present when dquotes are odd number as in inside a printf or
puts statement. Also if they are part of an escape sequence,
don't count */

if (col == 0 && (squote % 2) ) ++sqerr;
if (col == 0 && (dquote % 2) ) ++dqerr;
if (col == 0 && bracket ) ++bkerr;

/* now clears the error to get back in step */
if (col == 0) squote=dquote=0;

/* Don't count parens and braces that are inside comments. This of course
assumes that comments are properly matched; in any case, that will be the
first thing to look for. */

if (comments <= 0)
{ /* 3/20/93, 7/10/93 taking sensitivity out of quoted stuffs */
/* here, do ++dquote if its not a char constant like this
*/
if ( c == '"' ) ++dquote; /* a little simpler */

if ( !(dquote & 1) ) /* was the && of those */
{
if (c == '{' ) ++braces;
if (c == '(' ) ++parens;
if (lastc != '\'' && secnd_c == '[' && c != '\'' ) ++bracket;
/* here, skip squotes in a "text string's" */
if ( secnd_c != '\\' && c== '\'' && !(dquote) ) ++squote;
if ( lastc == '\\' && secnd_c == '\\' && c == '\'' ) ++squote;
if (c == '}' ) --braces;
if (c == ')' ) --parens;
if (lastc != '\'' && secnd_c == ']' && c != '\'' ) --bracket;
}
}

/* Now do comments. This properly handles nested comments;
whether or not the compiler does is your responsibility */

#ifdef OS9

/* The Microware C compiler for OS9 does NOT nest comments. */
/* The comment-close-mark (asterisk-backslash) will terminate */
/* (see K & R) a comment no matter how many '/*' come before it*/

if ((c == '/') && (secnd_c == '*'))
comments = 0;
if ((c == '*') && (secnd_c == '/') && (comments == 0))
++comments;
#else
if ( (c == '/' ) && (secnd_c == '*' ) ) --comments;
if ( (c == '*' ) && (secnd_c == '/' ) ) ++comments;
#endif
++col;
if (c == '\n' && secnd_c != '\\' )
{ /* non-escaped newline == New Line */
col= 0; /* set column 0 */
++line;
}
if (verbose)
putchar(c); /* display text */
lastc= secnd_c; /* update last char */
secnd_c= c;
}
if (verbose)
{
#ifdef OS9
if (comments)
cmnt = YES;
else cmnt = NO;
printf("EOF: [%d] {%d} (%d) \'%d\' \"%d\" /*%s*/\n",
bracket,braces,parens,squote,dquote,cmnt);
#else
printf("EOF: [%d] {%d} (%d) \'%d\' \"%d\" /*%d*/\n",
bracket,braces,parens,squote,dquote,comments);
#endif
}
okay = TRUE;
if (bracket||bkerr) puts("Unbalanced brackets\n"), okay = FALSE;
if (braces) puts("Unbalanced braces\n"),okay = FALSE;
if (parens) puts("Unbalanced parentheses\n"),okay = FALSE;
if (sqerr||(squote%2)) puts("Unmatched single quotes\n"),okay=FALSE;
if (dqerr||(dquote%2)) puts("Unmatched double quotes\n"),okay=FALSE;
if (comments) puts("Unbalanced comments\n"),okay = FALSE;
if (okay) puts("No errors found\n");
}
getout(errex)
int errex;
{
fprintf(stderr,"Usage: Cntx [-v] <filename> [-v]\n");
fprintf(stderr," -v = verbose mode \n");
exit(errex);
}
=====================end cntx.c====================
=================begin cntx.hlp====================
This "Cntx" is based rather loosely on the previously uploaded
file called MINILINT, in that if you use the -v option, it will
show you the file and its report on a line by line basis as
MINILINT did. Cntx however will also check for use and misuse of
more of the usual "C" punctuation. It's smart enough to ignore an
"escaped" character, or those buried in a text string inside a
printf("[[{{'etc"); statement. The basic organization is from
"MINILINT", but much expanded in checking scope. It still is NOT
a "lint" which is why I didn't call it that, but it has turned out
to be awfully handy. Ported to the Amiga, it found some stuff in
the code I was feeding DICE that I had totally missed, and which
was not being properly reported by DICE either, the errors it was
spitting out made no sense whatsoever. I had somehow lost a
terminating "}" in one of the PRINTFORM files in the translation
to a C that required proto statements. Cntx found it, even if
Dillon's Integrated C Environment's "dcc" didn't. But it still isn't
a "lint", not yet.

Usage: cntx [-v] filename

Without the -v, it rapidly scans the whole source file and gives
only a final report of "no errors found" or "mismatched brackets",
etc.

Added 3/20/93 MEH: One more conditional test now causes it to skip
thru any parens, braces or brackets found within a double quoted
string such as the format string for printf. As the tally needs
to be reset at the start of a new line to maintain the error
checking phasing in case there is an error, the total double
quote count for the whole file is no longer kept. Only the error
tally now shows at the end of a file scan. So to see the line
with the error, you must use the -v option, preferably on a
pausing screen so that one screen full of data can be seen at a
time. I liked the totals myself, but this does work better. Now
Edition 4.

Added 7/10/93 MEH: Essentially the same as the above paragraph but
for stuff inside a pair of single quotes, so now *any* character
can be single quoted without being an error. Now Edition 5.
===================end cntx.hlp=====================

Sometimes this list is hilarious with its re-inventions of the wheel. :)

The above code isn't the final version. I had it running on linux too, but
one of Fedora's (or W.D.'s pissy drives') infamous crashes caused that
version to come up missing, forgotten when I was doing the salvage
operation to a new hard drive.

Cheers, gene
--
"There are four boxes to be used in defense of liberty:
soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
A holding company is a thing where you hand an accomplice the goods while
the policeman searches you.

Thomas Jollans

Jul 17, 2011, 11:31:33 AM

I thought I'd have some fun with multi-processing:

https://gist.github.com/1087682

Thomas Boell

Jul 17, 2011, 11:49:11 AM

I'm new to Python. I think I'd have done it in a similar way (in any
language). Your use of openers/closers looks nice though. In the
initialization of openers, I guess you implicitly create a kind of
hash, right? Then the 'in' operator checks for the keys. That is elegant
because you have the openers and closers right next to each other, not
in separate lists.

But why do you enumerate with start=1? Shouldn't you start with index 0?


Robert Klemme

Jul 17, 2011, 12:01:32 PM

Right you are. The optimization breaks the logic. Good catch!

Kind regards

robert

Robert Klemme

Jul 17, 2011, 12:54:59 PM

Turns out with a little possessiveness I can fix my original approach
which has the added benefit of not needing three passes through the file
(the two #tr's are obsolete now).

https://gist.github.com/1087583

Cheers

robert

Raymond Hettinger

Jul 17, 2011, 3:18:11 PM
On Jul 17, 7:15 am, Thomas 'PointedEars' Lahn <PointedE...@web.de>
wrote:

> Did you notice the excessive crosspost?  Please do not feed the troll.

IMO, this was a legitimate cross post since it is for a multi-language
programming challenge and everyone can learn from comparing the
results.


Raymond

Raymond Hettinger

Jul 17, 2011, 3:16:13 PM
On Jul 17, 8:49 am, Thomas Boell <tbo...@domain.invalid> wrote:
> But why do you enumerate with start=1? Shouldn't you start with index 0?

The problem specification says that the char number should match
the emacs goto-char function, which is indexed from one, not from
zero. This is testable by taking the output of the program and
running it through emacs to see that the cursor moves exactly to
the location of the mismatched delimiter.
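Raymond's pastebin isn't quoted in the thread, but the two idioms under discussion can be sketched (a guess at the approach, not his actual code):

```python
# A dict mapping closers to openers: `in` tests look at the keys, so
# membership doubles as "is this a closing bracket?":
pairs = {')': '(', ']': '[', '}': '{'}
assert ')' in pairs               # keys are the closers
assert '(' not in pairs           # openers live in the values
assert '(' in pairs.values()

# enumerate(..., start=1) yields 1-based char numbers, matching the
# Emacs goto-char convention Raymond mentions:
text = '([)]'
numbered = list(enumerate(text, start=1))
assert numbered[0] == (1, '(')    # first char is char number 1
assert numbered[2] == (3, ')')    # the mismatched ')' is at char 3
```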


Raymond

Thomas 'PointedEars' Lahn

Jul 17, 2011, 4:16:33 PM
Raymond Hettinger wrote:

> Thomas 'PointedEars' Lahn wrote:
>> Did you notice the excessive crosspost? Please do not feed the troll.
>
> IMO, this was a legitimate cross post since it is for a multi-language
> programming challenge and everyone can learn from comparing the
> results.

Even if so (which I seriously doubt, see also my sig), you cannot reasonably
deny that "Xah Lee" is a well-known Usenet troll, and that this "challenge"
is nothing more than yet another sophisticated attempt at trolling. Please
do not feed.


PointedEars
--
No article in the world is relevant to more than a small handful of
groups. If WW III is announced, it will be announced in
net.announce.important. -- Peter da Silva, bofh.cabal, "Usenet II rules"

Thomas Jollans

Jul 17, 2011, 4:57:46 PM
to pytho...@python.org
On 07/17/2011 10:16 PM, Thomas 'PointedEars' Lahn wrote:
> Raymond Hettinger wrote:
>
>> Thomas 'PointedEars' Lahn wrote:
>>> Did you notice the excessive crosspost? Please do not feed the troll.
>>
>> IMO, this was a legitimate cross post since it is for a multi-language
>> programming challenge and everyone can learn from comparing the
>> results.
>
> Even if so (which I seriously doubt, see also my sig), you cannot reasonably
> deny that "Xah Lee" is a well-known Usenet troll, and that this "challenge"
> is nothing more than yet another sophisticated attempt at trolling. Please
> do not feed.

You know what you're doing? You're feeding the troll.

Yes, I know Xah Lee. Yes, he is known for trolling. So what? That alone
does not mean that every single thing he posts has to be bad. I'm all
with Raymond here.

There's nothing wrong with this post. This is the one time when it's
okay to feed the troll: reinforce good behaviour.

Thomas 'PointedEars' Lahn

Jul 17, 2011, 5:43:48 PM
Thomas 'PointedEars' Lahn wrote:

> It is possible [to parse the parentheses language], with Perl-compatible
> Regular Expressions (PCRE), provided that you have enough memory, to use
> such an extended Regular Expression (not to be confused with EREs³)⁴:
>
> \((([^()]*|(?R))*)\)
>
> However, even Python 3.2 does not support those expressions (although it
> supports some other PCRE patterns, like named subexpressions)⁵, neither do
> standard and forked versions of sed(1) (BREs, EREs, using an NFA) nor awk
> (EREs, using a DFA or NFA). [That is not to say it would not be possible
> with Python, or sed or awk (both of which are off-topic here), but that
> more than a Regular Expression would be required.]

Supplemental: Further research shows that the Python `re' module is not
going to implement (PCRE) recursive Regular Expressions. The maintainer,
Matthew Barnett, argued (on 2009-03-24) that such things are better
left to modules such as `pyparsing' [1][2].

However, FWIW, here is the Python port of the start of a language parser
originally written in (and for) ECMAScript:

import re

def matchingBraces(s):
    level = 0

    for match in re.finditer(r'[{}]', s):
        paren = match.group(0)

        if paren == "{":
            level += 1
        else:
            if level == 0:
                return False
            level -= 1

    return level == 0

As you can see, the theoretically necessary PDA¹ implementation can be
simplified to a braces counter with range checks by iterative use of a
Regular Expression. Extensions to meet the "challenge" are left as an
exercise to the reader.
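For instance, the part of the "challenge" that asks *where* the first mismatch occurs only needs the counter to remember the positions of open braces (a sketch of that exercise, not from the original post):

```python
import re

def first_brace_mismatch(s):
    """Return the index of the first mismatched brace, or None."""
    opens = []  # indices of currently unclosed '{'
    for match in re.finditer(r'[{}]', s):
        if match.group(0) == '{':
            opens.append(match.start())
        elif opens:
            opens.pop()              # '}' closes the most recent '{'
        else:
            return match.start()     # '}' with nothing open
    return opens[0] if opens else None  # leftover '{', if any

assert first_brace_mismatch('{a{b}c}') is None
assert first_brace_mismatch('{a}b}') == 4
assert first_brace_mismatch('{{a}') == 0
```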

It has also occurred to me that because parentheses (`(', `)') and brackets
(`[', `]') have special meaning in Regular Expressions (grouping and
character classes, respectively), you could escape all other special
characters in a text and use the RE evaluator itself to find out whether
they are properly nested, having it evaluate one large RE. But I have not
tested this idea yet. (Obviously it cannot be used to satisfy the
"challenge"'s condition that braces – `{', `}' – and other parenthesis-like
characters are to be considered as well, and that the parenthesis-like
characters are to be printed.)

______
¹ Pushdown automaton

References:
[1] <http://bugs.python.org/issue694374>
[2] <http://pyparsing.wikispaces.com/>

Message has been deleted

Rouslan Korneychuk

Jul 18, 2011, 3:09:38 AM
I don't know why, but I just had to try it (even though I don't usually
use Perl and had to look up a lot of stuff). I came up with this:

/(?|
(\()(?&matched)([\}\]”›»】〉》」』]|$) |
(\{)(?&matched)([\)\]”›»】〉》」』]|$) |
(\[)(?&matched)([\)\}”›»】〉》」』]|$) |
(“)(?&matched)([\)\}\]›»】〉》」』]|$) |
(‹)(?&matched)([\)\}\]”»】〉》」』]|$) |
(«)(?&matched)([\)\}\]”›】〉》」』]|$) |
(【)(?&matched)([\)\}\]”›»〉》」』]|$) |
(〈)(?&matched)([\)\}\]”›»】》」』]|$) |
(《)(?&matched)([\)\}\]”›»】〉」』]|$) |
(「)(?&matched)([\)\}\]”›»】〉》』]|$) |
(『)(?&matched)([\)\}\]”›»】〉》」]|$))
(?(DEFINE)(?<matched>(?:
\((?&matched)\) |
\{(?&matched)\} |
\[(?&matched)\] |
“(?&matched)” |
‹(?&matched)› |
«(?&matched)» |
【(?&matched)】 |
〈(?&matched)〉 |
《(?&matched)》 |
「(?&matched)」 |
『(?&matched)』 |
[^\(\{\[“‹«【〈《「『\)\}\]”›»】〉》」』]++)*+))
/sx;

If the pattern matches, there is a mismatched bracket. $1 is set to the
mismatched opening bracket. $-[1] is its location. $2 is the mismatched
closing bracket or '' if the bracket was never closed. $-[2] is set to
the location of the closing bracket or the end of the string if the
bracket wasn't closed.


I didn't write all that manually; it was generated with this:

my @open  = ('\(','\{','\[','“','‹','«','【','〈','《','「','『');
my @close = ('\)','\}','\]','”','›','»','】','〉','》','」','』');

'(?|'
. join('|', map {
      '(' . $open[$_] . ')(?&matched)(['
      . join('', @close[0 .. ($_-1), ($_+1) .. $#close])
      . ']|$)'
  } (0 .. $#open))
. ')(?(DEFINE)(?<matched>(?:'
. join('|', map { $open[$_] . '(?&matched)' . $close[$_] } (0 .. $#open))
. '|[^' . join('', @open, @close) . ']++)*+))'

Stefan Behnel

Jul 18, 2011, 3:24:21 AM
to pytho...@python.org
Rouslan Korneychuk, 18.07.2011 09:09:


That's solid Perl. Both the code generator and the generated code are
unreadable. Well done!

Stefan

Rouslan Korneychuk

Jul 18, 2011, 4:04:56 AM
On 07/18/2011 03:24 AM, Stefan Behnel wrote:
> That's solid Perl. Both the code generator and the generated code are
> unreadable. Well done!
>
> Stefan
>

Why, thank you.

Xah Lee

Jul 18, 2011, 10:39:28 AM

On Jul 17, 12:47 am, Xah Lee <xah...@gmail.com> wrote:
> 2011-07-16
>
> folks, this one will be interesting one.
>
> the problem is to write a script that can check a dir of text files
> (and all subdirs) and reports if a file has any mismatched matching
> brackets.
> …

Ok, here's my solution (pasted at bottom). I haven't tried to make it
elegant or terse yet, seeing that many are already much more elegant
than anything I could do with my code.

My solution basically uses a stack. (I think all of us are doing
something similar.) Here's the steps:

• Go thru the file char by char, and find a bracket char.
• Check if the one on the stack is a matching opening char. If so,
remove it. Else, push the current char onto the stack.
• Repeat the above till end of file.
• If the stack is not empty, then the file has mismatched brackets.
Report it.
• Do the above on all files.
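The steps above translate almost line for line into Python (a minimal sketch of the same algorithm; unlike the elisp below, it stops at the first mismatch instead of collecting the whole stack):

```python
pairs = {')': '(', '}': '{', ']': '[', '”': '“', '›': '‹', '»': '«',
         '】': '【', '〉': '〈', '》': '《', '」': '「', '』': '『'}
openers = set(pairs.values())

def first_mismatch(text):
    """Return (char_number, char) of the first mismatch, or None."""
    stack = []  # (1-based position, opening char), like the elisp vectors
    for pos, c in enumerate(text, start=1):
        if c in openers:
            stack.append((pos, c))      # opening char: push it
        elif c in pairs:
            if stack and stack[-1][1] == pairs[c]:
                stack.pop()             # matching opener on top: remove
            else:
                return (pos, c)         # closer with no matching opener
    return stack[0] if stack else None  # leftover opener means mismatch

assert first_mismatch('(good {nested} “quotes”)') is None
assert first_mismatch('([)]') == (3, ')')
assert first_mismatch('((') == (1, '(')
```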

Many elegant solutions. Raymond Hettinger was very quick, posting a
solution only an hour or so after I posted. Many others are very
short, very nice. Thank you all for writing them. I haven't studied
them yet. I'll run them all and post a summary in 2 days. (I have a
few thousand files to run this test thru, many of them with
mismatched brackets, so I have good data to test with.)

PS we still lack perl, Scheme, tcl, and lua versions. These wouldn't
be hard and would be interesting to read. If you are picking up one
of these langs, this would be a good exercise. Haskell too. I would
particularly like to see a javascript version run from the command
line. Maybe somebody can put this exercise to the Google folks ...
they are like the js gods.

Also, now that we have this home-brewed code, how would a parser
expert do it? Is it possible to make it even simpler by using some
parser tools? (I have no idea what lex and yacc do, or their modern
incarnations.) I've also been thinking about whether this can be done
with Parsing Expression Grammar. That would make the code semantics
really elegant (as opposed to home-cooked stack logic).

Xah

;; -*- coding: utf-8 -*-
;; 2011-07-15, Xah Lee
;; go thru a file, check if all brackets are properly matched.
;; e.g. good: (…{…}… “…”…)
;; bad: ( [)]
;; bad: ( ( )

(setq inputDir "~/web/xahlee_org/p/") ; must end in slash

(defvar matchPairs '()
  "An alist. For each pair, the car is the opening char, cdr is the closing char.")

(setq matchPairs
      '(("(" . ")")
        ("{" . "}")
        ("[" . "]")
        ("“" . "”")
        ("‹" . "›")
        ("«" . "»")
        ("【" . "】")
        ("〈" . "〉")
        ("《" . "》")
        ("「" . "」")
        ("『" . "』")))

(defvar searchRegex "" "Regex string of all pairs to search.")
(setq searchRegex "")
(mapc
 (lambda (mypair)
   (setq searchRegex
         (concat searchRegex
                 (regexp-quote (car mypair)) "|"
                 (regexp-quote (cdr mypair)) "|")))
 matchPairs)

(setq searchRegex (replace-regexp-in-string "|$" "" searchRegex t t)) ; remove the ending “|”
(setq searchRegex (replace-regexp-in-string "|" "\\|" searchRegex t t)) ; change | to \\| for regex “or” operation

(defun my-process-file (fpath)
  "Process the file at fullpath FPATH ..."
  (let (myBuffer (ii 0) myStack ξchar ξpos)
    (setq myStack '()) ; each element is a vector [char position]
    (setq ξchar "")

    (setq myBuffer (get-buffer-create " myTemp"))
    (set-buffer myBuffer)
    (insert-file-contents fpath nil nil nil t)

    (goto-char 1)
    (while (search-forward-regexp searchRegex nil t)
      (setq ξpos (point))
      (setq ξchar (buffer-substring-no-properties ξpos (- ξpos 1)))
      ;; (princ (format "-----------------------------\nfound char: %s\n" ξchar))

      (let ((isClosingCharQ nil) (matchedOpeningChar nil))
        (setq isClosingCharQ (rassoc ξchar matchPairs))
        (when isClosingCharQ
          (setq matchedOpeningChar (car isClosingCharQ)))
        ;; (princ (format "isClosingCharQ is: %s\n" isClosingCharQ))
        ;; (princ (format "matchedOpeningChar is: %s\n" matchedOpeningChar))

        (if (and (car myStack)          ; not empty
                 (equal (elt (car myStack) 0) matchedOpeningChar))
            (progn
              ;; (princ (format "matched this bottom item on stack: %s\n" (car myStack)))
              (setq myStack (cdr myStack)))
          (progn
            ;; (princ (format "did not match this bottom item on stack: %s\n" (car myStack)))
            (setq myStack (cons (vector ξchar ξpos) myStack)))))
      ;; (princ "current stack: ")
      ;; (princ myStack)
      ;; (terpri)
      )

    (when (not (equal myStack nil))
      (princ "Error file: ")
      (princ fpath)
      (print (car myStack)))
    (kill-buffer myBuffer)))

;; (require 'find-lisp)

(let (outputBuffer)
  (setq outputBuffer "*xah match pair output*")
  (with-output-to-temp-buffer outputBuffer
    (mapc 'my-process-file (find-lisp-find-files inputDir "\\.html$"))
    (princ "Done deal!")))

Thomas 'PointedEars' Lahn

Jul 18, 2011, 12:46:06 PM
Rouslan Korneychuk wrote:

> I don't know why, but I just had to try it (even though I don't usually
> use Perl and had to look up a lot of stuff). I came up with this:

I don't know why … you replied to my posting/e-mail (but quoted nothing from
it, much less referred to its content), and posted a lot of Perl code in a
Python newsgroup/on a Python mailing list.

Billy Mays

Jul 18, 2011, 1:12:08 PM
On 07/17/2011 03:47 AM, Xah Lee wrote:
> 2011-07-16

I gave it a shot. It doesn't do any of the Unicode delims, because
let's face it, Unicode is for goobers.


import sys, os

pairs = {'}': '{', ')': '(', ']': '[', '"': '"', "'": "'", '>': '<'}
valid = set(v for pair in pairs.items() for v in pair)

for dirpath, dirnames, filenames in os.walk(sys.argv[1]):
    for name in filenames:
        stack = [' ']
        with open(os.path.join(dirpath, name), 'rb') as f:
            chars = (c for line in f for c in line if c in valid)
            for c in chars:
                if c in pairs and stack[-1] == pairs[c]:
                    stack.pop()
                else:
                    stack.append(c)
        print ("Good" if len(stack) == 1 else "Bad") + ': %s' % name

--
Bill

Ian Kelly

Jul 18, 2011, 2:10:31 PM
to Python
On Mon, Jul 18, 2011 at 11:12 AM, Billy Mays
<81282ed9a88799d21e77...@myhashismyemail.com> wrote:
> I gave it a shot.  It doesn't do any of the Unicode delims, because let's
> face it, Unicode is for goobers.

Uh, okay...

Your script also misses the requirement of outputting the index or row
and column of the first mismatched bracket.

Rouslan Korneychuk

Jul 18, 2011, 2:14:08 PM
On 07/18/2011 12:46 PM, Thomas 'PointedEars' Lahn wrote:
> Rouslan Korneychuk wrote:
>
>> I don't know why, but I just had to try it (even though I don't usually
>> use Perl and had to look up a lot of stuff). I came up with this:
>
> I don't know why … you replied to my posting/e-mail (but quoted nothing from
> it, much less referred to its content), and posted a lot of Perl code in a
> Python newsgroup/on a Python mailing list.
>

Well, when I said I had to try *it*, I was referring to using a Perl
compatible regular expression, which you brought up. I guess I should
have quoted that part. As for what I posted, the crux of it was a single
regular expression. The Perl code at the bottom was just to point out
that I didn't type that monstrosity out manually. I was going to put
that part in brackets but there were already so many.

s...@netherlands.com

Jul 18, 2011, 3:34:45 PM
On Sun, 17 Jul 2011 00:47:42 -0700 (PDT), Xah Lee <xah...@gmail.com> wrote:

>2011-07-16
>
>folks, this one will be interesting one.
>
>the problem is to write a script that can check a dir of text files
>(and all subdirs) and reports if a file has any mismatched matching
>brackets.
>

[snip]


>i hope you'll participate. Just post solution here. Thanks.
>

I have to hunt for a job so I'm not writing a solution for you.
Here is a thin regex framework that may get you started.

-sln

---------------------

use strict;
use warnings;

my @samples = qw(
A98(y[(np)r]x)tp[kk]a.exeb
A98(y[(np)r]x)tp[kk]a}.exeb
A98(‹ynprx)tpk›ka.mpeg
‹A98(ynprx)tpk›ka
“A9«8(yn«pr{{[g[x].}*()+}»)tpkka».”
“A9«8(yn«pr{{[g[x].]}*()+}»)tpkka».”
“A9«8(yn«pr»)tpkka».”
“A9«8(yn«pr»)»”t(()){}[a[b[d]{}]pkka.]“«‹“**^”{[()]}›»”
“A9«8(yn«pr»)”t(()){}[a[b[d]{}]pkka.]“«‹“**^”{[()]}›»”
);

my $regex = qr/

^ (?&FileName) $

(?(DEFINE)

(?<Delim>
\( (?&Content) \)
| \{ (?&Content) \}
| \[ (?&Content) \]
| \“ (?&Content) \”
| \‹ (?&Content) \›
| \« (?&Content) \»
# add more here ..
)

(?<Content>
(?: (?> [^(){}\[\]“”‹›«»]+ ) # add more here ..
| (?&Delim)
)*
)

(?<FileName>
(?&Content)
)
)
/x;


for (@samples)
{
    print "$_ - ";
    if ( /$regex/ ) {
        print "passed \n";
    }
    else {
        print "failed \n";
    }
}

__END__

Output:

A98(y[(np)r]x)tp[kk]a.exeb - passed
A98(y[(np)r]x)tp[kk]a}.exeb - failed
A98(‹ynprx)tpk›ka.mpeg - failed
‹A98(ynprx)tpk›ka - passed
“A9«8(yn«pr{{[g[x].}*()+}»)tpkka».” - failed
“A9«8(yn«pr{{[g[x].]}*()+}»)tpkka».” - passed
“A9«8(yn«pr»)tpkka».” - passed
“A9«8(yn«pr»)»”t(()){}[a[b[d]{}]pkka.]“«‹“**^”{[()]}›»” - passed
“A9«8(yn«pr»)”t(()){}[a[b[d]{}]pkka.]“«‹“**^”{[()]}›»” - failed


Thomas 'PointedEars' Lahn

Jul 18, 2011, 5:59:06 PM
Ian Kelly wrote:

> Billy Mays wrote:
>> I gave it a shot. It doesn't do any of the Unicode delims, because let's
>> face it, Unicode is for goobers.
>
> Uh, okay...
>
> Your script also misses the requirement of outputting the index or row
> and column of the first mismatched bracket.

Thanks to Python's expressiveness, this can be easily remedied (see below).

I also do not follow Billy's comment about Unicode. Unicode and the fact
that Python supports it *natively* cannot be appreciated enough in a
globalized world.

However, I have learned a lot about being pythonic from his posting (take
those generator expressions, for example!), and the idea of looking at the
top of a stack for reference is a really good one. Thank you, Billy!

Here is my improvement of his code, which should fill the mentioned gaps.
I have also reversed the order in the report line as I think it is more
natural this way. I have tested the code superficially with a directory
containing a single text file. Watch for word-wrap:

# encoding: utf-8
'''
Created on 2011-07-18

@author: Thomas 'PointedEars' Lahn <Point...@web.de>, based on an idea of
Billy Mays <81282ed9a88799d21e77...@myhashismyemail.com>
in <news:j01ph6$knt$1...@speranza.aioe.org>
'''
import sys, os

pairs = {u'}': u'{', u')': u'(', u']': u'[',
         u'”': u'“', u'›': u'‹', u'»': u'«',
         u'】': u'【', u'〉': u'〈', u'》': u'《',
         u'」': u'「', u'』': u'『'}
valid = set(v for pair in pairs.items() for v in pair)

if __name__ == '__main__':
    for dirpath, dirnames, filenames in os.walk(sys.argv[1]):
        for name in filenames:
            stack = [' ']

            # you can use chardet etc. instead
            encoding = 'utf-8'

            with open(os.path.join(dirpath, name), 'r') as f:
                reported = False
                chars = ((c, line_no, col)
                         for line_no, line in enumerate(f)
                         for col, c in enumerate(line.decode(encoding))
                         if c in valid)
                for c, line_no, col in chars:
                    if c in pairs:
                        if stack[-1] == pairs[c]:
                            stack.pop()
                        else:
                            if not reported:
                                first_bad = (c, line_no + 1, col + 1)
                                reported = True
                    else:
                        stack.append(c)

            print '%s: %s' % (name, ("good" if len(stack) == 1
                                     else "bad '%s' at %s:%s" % first_bad))

Steven D'Aprano

Jul 18, 2011, 7:56:21 PM7/18/11
Billy Mays wrote:

> On 07/17/2011 03:47 AM, Xah Lee wrote:
>> 2011-07-16
>
> I gave it a shot. It doesn't do any of the Unicode delims, because
> let's face it, Unicode is for goobers.

Goobers... that would be one of those new-fangled slang terms that the young
kids today use to mean its opposite, like "bad", "wicked" and "sick",
correct?

I mention it only because some people might mistakenly interpret your words
as a childish and feeble insult against the 98% of the world who want or
need more than the 127 characters of ASCII, rather than understand you
meant it as a sign of the utmost respect for the richness and diversity of
human beings and their languages, cultures, maths and sciences.


--
Steven


Billy Mays

Jul 18, 2011, 10:07:44 PM7/18/11

TL;DR version: international character sets are a problem, and Unicode
is not the answer to that problem.

As long as I have used python (which I admit has only been 3 years)
Unicode has never appeared to be implemented correctly. I'm probably
repeating old arguments here, but whatever.

Unicode is a mess.  When someone says ASCII, you know that they can only
mean characters 0-127.  When someone says Unicode, do they mean real
Unicode (and is it 2 byte or 4 byte?) or UTF-32 or UTF-16 or UTF-8?
When using the 'u' datatype with the array module, the docs don't even
tell you if it's 2 bytes wide or 4 bytes.  Which is it?  I'm sure that
all of these can be figured out, but the problem is now I have to
ask every one of these questions whenever I want to use strings.

Secondly, Python doesn't do Unicode exception handling correctly. (but I
suspect that its a broader problem with languages) A good example of
this is with UTF-8 where there are invalid code points ( such as 0xC0,
0xC1, 0xF5, 0xF6, 0xF7, 0xF8, ..., 0xFF, but you already knew that, as
well as everyone else who wants to use strings for some reason).

When embedding Python in a long running application where user input is
received, it is very easy to make a mistake which brings down the whole
program.  If any user string isn't properly try/excepted, a user could
craft a malformed string which a UTF-8 decoder would choke on.  Using
ASCII (or whatever 8 bit encoding) doesn't have these problems since all
codepoints are valid.

Another (this must have been a good laugh amongst the UniDevs) 'feature'
of unicode is the zero width space (UTF-8 code point 0xE2 0x80 0x8B).
Any string can masquerade as any other string by placing a few of these
in a string.  Any word filters you might have are now defeated by some
cheesy Unicode nonsense character.  Can you just check for these
characters and strip them out?  Yes.  Should you have to?  I would say no.

Does it get better? Of course! international character sets used for
domain name encoding use yet a different scheme (Punycode). Are the
following two domain names the same: tést.com , xn--tst-bma.com ? Who
knows!
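[For what it's worth, the standard library can already answer that one:
Python ships an 'idna' codec that converts between the Unicode and
Punycode (ACE) forms of a domain name. A minimal sketch, Python 3
syntax:]

```python
# Round-trip a non-ASCII domain name through the stdlib 'idna' codec.
ascii_form = 'tést.com'.encode('idna')
assert ascii_form == b'xn--tst-bma.com'        # the two names are the same
assert ascii_form.decode('idna') == 'tést.com'
```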

I suppose I can gloss over the pains of using Unicode in C with every
string needing to be an LPS since 0x00 is now a valid code point in
UTF-8 (0x0000 for 2 byte Unicode) or suffer the O(n) look up time to do
strlen or concatenation operations.

Can it get even better?  Yep.  We also now need to have a Byte Order
Mark (BOM) to determine the endianness of our characters.  Are they
little endian or big endian?  (or perhaps one of the two possible middle
endian encodings?)  Who knows?  String processing with unicode is
unpleasant to say the least.  I suppose that's what we get when
things are designed by committee.

But Hey! The great thing about standards is that there are so many to
choose from.

--
Bill


rusi

Jul 18, 2011, 10:50:21 PM7/18/11

Thanks for writing that
Every time I try to understand unicode and remain stuck I come to the
conclusion that I must be an imbecile.
Seeing others (probably more intelligent than yours truly) stuck too
gives me some solace!

[And I am writing this from India where there are dozens of languages,
almost as many scripts and everyone speaks and writes at least a
couple of non-european ones]

MRAB

Jul 18, 2011, 11:08:42 PM7/18/11
to pytho...@python.org
That's down to whether it's a narrow or wide Python build. There's a
PEP suggesting a fix for that (PEP 393).

> Secondly, Python doesn't do Unicode exception handling correctly. (but I
> suspect that its a broader problem with languages) A good example of
> this is with UTF-8 where there are invalid code points ( such as 0xC0,
> 0xC1, 0xF5, 0xF6, 0xF7, 0xF8, ..., 0xFF, but you already knew that, as
> well as everyone else who wants to use strings for some reason).
>

Those aren't codepoints, those are invalid bytes for the UTF-8 encoding.

> When embedding Python in a long running application where user input is
> received, it is very easy to make mistake which bring down the whole
> program. If any user string isn't properly try/excepted, a user could
> craft a malformed string which a UTF-8 decoder would choke on. Using
> ASCII (or whatever 8 bit encoding) doesn't have these problems since all
> codepoints are valid.
>

What if you give an application an invalid JPEG, PNG or other image
file? Does that mean that image formats are bad too?

> Another (this must have been a good laugh amongst the UniDevs) 'feature'
> of unicode is the zero width space (UTF-8 code point 0xE2 0x80 0x8B).
> Any string can masquerade as any other string by placing few of these in
> a string. Any word filters you might have are now defeated by some
> cheesy Unicode nonsense character. Can you just just check for these
> characters and strip them out? Yes. Should you have to? I would say no.
>
> Does it get better? Of course! international character sets used for
> domain name encoding use yet a different scheme (Punycode). Are the
> following two domain names the same: tést.com , xn--tst-bma.com ? Who
> knows!
>
> I suppose I can gloss over the pains of using Unicode in C with every
> string needing to be an LPS since 0x00 is now a valid code point in
> UTF-8 (0x0000 for 2 byte Unicode) or suffer the O(n) look up time to do
> strlen or concatenation operations.
>

0x00 is also a valid ASCII code, but C doesn't let you use it!

There's also "Modified UTF-8", in which U+0000 is encoded as 2 bytes,
so that zero-byte can be used as a terminator. You can't do that in
ASCII! :-)

> Can it get even better? Yep. We also now need to have a Byte order Mark
> (BOM) to determine the endianness of our characters. Are they little
> endian or big endian? (or perhaps one of the two possible middle endian
> encodings?) Who knows? String processing with unicode is unpleasant to
> say the least. I suppose that's what we get when we things are designed
> by committee.
>

Proper UTF-8 doesn't have a BOM.

The rule (in Python, at least) is to decode on input and encode on
output. You don't have to worry about endianness when processing
Unicode strings internally; they're just a series of codepoints.
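A quick sketch of that rule (Python 3 syntax; the sample bytes are just
illustrative):

```python
raw = b'caf\xc3\xa9 \xe2\x80\x8b!'  # bytes as they might arrive from a file or socket

text = raw.decode('utf-8')          # decode once, at the input boundary
# Internally the string is a sequence of code points, not bytes:
assert len(text) == 7               # 'c' 'a' 'f' 'é' ' ' ZWS '!'
assert text[5] == '\u200b'          # the zero width space is one code point

out = text.encode('utf-8')          # encode once, at the output boundary
assert out == raw
```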

Steven D'Aprano

Jul 18, 2011, 11:11:05 PM7/18/11
rusi wrote:

> Every time I try to understand unicode and remain stuck I come to the
> conclusion that I must be an imbecile.

http://www.joelonsoftware.com/articles/Unicode.html


--
Steven

Benjamin Kaplan

Jul 18, 2011, 11:54:31 PM7/18/11
to pytho...@python.org
On Mon, Jul 18, 2011 at 7:07 PM, Billy Mays <no...@nohow.com> wrote:
>
> On 7/18/2011 7:56 PM, Steven D'Aprano wrote:
>>
>> Billy Mays wrote:
>>
>>> On 07/17/2011 03:47 AM, Xah Lee wrote:
>>>>
>>>> 2011-07-16
>>>
>>> I gave it a shot.  It doesn't do any of the Unicode delims, because
>>> let's face it, Unicode is for goobers.
>>
>> Goobers... that would be one of those new-fangled slang terms that the young
>> kids today use to mean its opposite, like "bad", "wicked" and "sick",
>> correct?
>>
>> I mention it only because some people might mistakenly interpret your words
>> as a childish and feeble insult against the 98% of the world who want or
>> need more than the 127 characters of ASCII, rather than understand you
>> meant it as a sign of the utmost respect for the richness and diversity of
>> human beings and their languages, cultures, maths and sciences.
>>
>>
>
> TL;DR version: international character sets are a problem, and Unicode is not the answer to that problem).
>
> As long as I have used python (which I admit has only been 3 years) Unicode has never appeared to be implemented correctly.  I'm probably repeating old arguments here, but whatever.
>
> Unicode is a mess.  When someone says ASCII, you know that they can only mean characters 0-127.  When someone says Unicode, do the mean real Unicode (and is it 2 byte or 4 byte?) or UTF-32 or UTF-16 or UTF-8? When using the 'u' datatype with the array module, the docs don't even tell you if its 2 bytes wide or 4 bytes.  Which is it?  I'm sure that all the of these can be figured out, but the problem is now I have to ask every one of these questions whenever I want to use strings.
>

It doesn't matter. When you use the unicode data type in Python, you
get to treat it as a sequence of characters, not a sequence of bytes.
The fact that it's stored internally as UCS-2 or UCS-4 is irrelevant.

>
> Secondly, Python doesn't do Unicode exception handling correctly. (but I suspect that its a broader problem with languages) A good example of this is with UTF-8 where there are invalid code points ( such as 0xC0, 0xC1, 0xF5, 0xF6, 0xF7, 0xF8, ..., 0xFF, but you already knew that, as well as everyone else who wants to use strings for some reason).
>

A Unicode code point is of the form U+XXXX. 0xC0 is not a Unicode code
point, it is a byte. It happens to be an invalid byte using the UTF-8
byte encoding (which is not Unicode, it's a byte string). The Unicode
code point U+00C0 is perfectly valid- it's a LATIN CAPITAL LETTER A
WITH GRAVE.
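That distinction is mechanical to check (a sketch in Python 3 syntax):

```python
# U+00C0 is a valid code point; its UTF-8 encoding is the two bytes C3 80.
a_grave = '\u00c0'                  # LATIN CAPITAL LETTER A WITH GRAVE
assert a_grave.encode('utf-8') == b'\xc3\x80'

# The lone *byte* 0xC0, by contrast, can never occur in well-formed UTF-8:
try:
    b'\xc0'.decode('utf-8')
except UnicodeDecodeError:
    decodable = False               # expected: 0xC0 is an invalid start byte
else:
    decodable = True
assert not decodable
```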

>
> When embedding Python in a long running application where user input is received, it is very easy to make mistake which bring down the whole program.  If any user string isn't properly try/excepted, a user could craft a malformed string which a UTF-8 decoder would choke on.  Using ASCII (or whatever 8 bit encoding) doesn't have these problems since all codepoints are valid.
>

UTF-8 != Unicode. UTF-8 is one of several byte encodings capable of
representing every character in the Unicode spec, but it is not
Unicode. If you have a Unicode string, it is not a sequence of byes,
it is a sequence of characters. If you want a sequence of bytes, use a
byte string. If you are attempting to interpret a sequence of bytes as
a sequence of text, you're doing it wrong. There's a reason we have
both text and binary modes for opening files- yes, there is a
difference between them.

> Another (this must have been a good laugh amongst the UniDevs) 'feature' of unicode is the zero width space (UTF-8 code point 0xE2 0x80 0x8B). Any string can masquerade as any other string by placing  few of these in a string.  Any word filters you might have are now defeated by some cheesy Unicode nonsense character.  Can you just just check for these characters and strip them out?  Yes.  Should you have to?  I would say no.
>
> Does it get better?  Of course! international character sets used for domain name encoding use yet a different scheme (Punycode).  Are the following two domain names the same: tést.com , xn--tst-bma.com ?  Who knows!
>
> I suppose I can gloss over the pains of using Unicode in C with every string needing to be an LPS since 0x00 is now a valid code point in UTF-8 (0x0000 for 2 byte Unicode) or suffer the O(n) look up time to do strlen or concatenation operations.
>

That is using UTF-8 in C. Which, again, is not the same thing as Unicode.

> Can it get even better?  Yep.  We also now need to have a Byte order Mark (BOM) to determine the endianness of our characters.  Are they little endian or big endian?  (or perhaps one of the two possible middle endian encodings?)  Who knows?  String processing with unicode is unpleasant to say the least.  I suppose that's what we get when we things are designed by committee.
>

And that is UTF-16 and UTF-32. Again, those are byte encodings. They
are not Unicode. When you use a library capable of handling Unicode,
you never see those- you just have a string with characters in it.

> But Hey!  The great thing about standards is that there are so many to choose from.
>
> --
> Bill
>

Steven D'Aprano

Jul 19, 2011, 12:30:20 AM7/19/11
Billy Mays wrote:

> TL;DR version: international character sets are a problem, and Unicode
> is not the answer to that problem).

Shorter version: FUD.

Yes, having a rich and varied character set requires work. Yes, the Unicode
standard itself, and any interface to it (including Python's) are imperfect
(like anything created by fallible humans). But your post is a long and
tedious list of FUD with not one bit of useful advice.

I'm not going to go through the whole post -- life is too short. But here
are two especially egregious examples showing that you have some fundamental
misapprehensions about what Unicode actually is:

> Python doesn't do Unicode exception handling correctly. (but I
> suspect that its a broader problem with languages) A good example of
> this is with UTF-8 where there are invalid code points ( such as 0xC0,
> 0xC1, 0xF5, 0xF6, 0xF7, 0xF8, ..., 0xFF, but you already knew that, as
> well as everyone else who wants to use strings for some reason).

and then later:

> Another (this must have been a good laugh amongst the UniDevs) 'feature'
> of unicode is the zero width space (UTF-8 code point 0xE2 0x80 0x8B).


This is confused. Unicode text has code points, text which has been encoded
is nothing but bytes and not code points. "UTF-8 code point" does not even
mean anything.

The zero width space has code point U+200B. The bytes you get depend on
which encoding you want:

>>> zws = u'\N{Zero Width Space}'
>>> zws
u'\u200b'
>>> zws.encode('utf-8')
'\xe2\x80\x8b'
>>> zws.encode('utf-16')
'\xff\xfe\x0b '

But regardless of which bytes it is encoded into, ZWS always has just a
single code point: U+200B.

You say "A good example of this is with UTF-8 where there are invalid code
points ( such as 0xC0, 0xC1, 0xF5, 0xF6, 0xF7, 0xF8, ..., 0xFF" but I don't
even understand why you think this is a problem with Unicode.

0xC0 is not a code point, it is a byte. Not all combinations of bytes are
legal in all files. If you have byte 0xC0 in a file, it cannot be an ASCII
file: there is no ASCII character represented by byte 0xC0, because hex
0xC0 = 192, which is larger than 127.

Likewise, if you have a 0xC0 byte in a file, it cannot be UTF-8. It is as
simple as that. Trying to treat it as UTF-8 will give an error, just as
trying to view a mp3 file as if it were a jpeg will give an error. Why you
imagine this is a problem for Unicode is beyond me.

--
Steven

rusi

Jul 19, 2011, 12:59:29 AM7/19/11
On Jul 19, 8:11 am, Steven D'Aprano <steve

Yes I've read that and understood a little bit more thanks to it.
But for the points raised in this thread this one from Joel is more
relevant:

http://www.joelonsoftware.com/articles/LeakyAbstractions.html

Some evidences of leakiness:
code point vs character vs byte
encoding and decoding
UTF-x and UCS-y

Very important and necessary distinctions? Maybe... But I did not need
them when my world was built of the 127 bricks of ASCII.

My latest brush with unicode was when I tried to port construct to
python3. http://construct.wikispaces.com/

If unicode 'just works' you should be able to do it in a jiffy?
[And if you did I would be glad to be proved wrong :-) ]

Chris Angelico

Jul 19, 2011, 1:36:33 AM7/19/11
to pytho...@python.org
On Tue, Jul 19, 2011 at 2:59 PM, rusi <rusto...@gmail.com> wrote:
> Some evidences of leakiness:
> code point vs character vs byte
> encoding and decoding
> UTF-x and UCS-y
>
> Very important and necessary distinctions? Maybe... But I did not need
> them when my world was built of the 127 bricks of ASCII.

Codepoint vs byte is NOT an abstraction. Unicode consists of
characters, where each character is represented by a number called its
codepoint. Since computers work with bytes, we need a way of encoding
those characters into bytes. It's no different from encoding a piece
of music in bytes, and having it come out as 0x90 0x64 0x40. Are those
bytes an abstraction of the note? No. They're an encoding of a MIDI
message that requests that the note be struck. The note itself is an
abstraction, if you like; but the bytes to create that note could be
delivered in a variety of other ways.

A Python Unicode string, whether it's Python 2's 'unicode' or Python
3's 'str', is a sequence of characters. Since those characters are
stored in memory, they must be encoded somehow, but that's not our
problem. We need only care about encoding when we save those
characters to disk, transmit them across the network, or in some other
way need to store them as bytes. Otherwise, there is no abstraction,
and no leak.

Chris Angelico

Thomas 'PointedEars' Lahn

Jul 19, 2011, 2:09:09 AM7/19/11
Thomas 'PointedEars' Lahn wrote:

> with open(os.path.join(dirpath, name), 'r') as f:

SHOULD be

with open(os.path.join(dirpath, name), 'rb') as f:

(as in the original), else some code units might not be read properly.

Xah Lee

Jul 19, 2011, 4:58:15 AM7/19/11

might check out my take

〈Xah's Unicode Tutorial〉
http://xahlee.org/Periodic_dosage_dir/unicode.html

especially good for emacs users.

if you grew up with english, unicode might seem complex or difficult
due to unfamiliarity.

but for asian people, when you don't have alphabets, it's kinda strange
to think that a byte is a char. The notion simply doesn't exist and is
impossible to establish. There were many encodings for chinese before
unicode. Even today, unicode isn't used in taiwan or china. Taiwan
uses big5, china uses GB18030, which contains all chars of unicode.

~8 years ago i thought that it'd be great if china adopted unicode
sometime in the future... so that we all just have one charset to
deal with. But that's never gonna happen. On the contrary, am thinking
now there's the possibility that the world adopts GB18030 someday. lol
if you go to alexa.com for traffic ranking, a good percentage of the
top few are chinese these days. more and more as i observed since mid
2000s.

by the way, here's what these matching pairs are used for.

‹french quote›
«french quote»

the 〈〉 《》 are chinese brackets used for book titles etc. (CD, TV
program, show title, etc.)
the 「」 『』 are traditional chinese quotes, like english's ‘single
curly’, “double curly”
the 【】 〖〗 〔〕 and a few others are variant brackets, similar to english's
() {} [].

Xah

Xah Lee

Jul 19, 2011, 12:54:38 PM7/19/11
On Sunday, July 17, 2011 2:48:42 AM UTC-7, Raymond Hettinger wrote:
> On Jul 17, 12:47 am, Xah Lee <xah...@gmail.com> wrote:
> > i hope you'll participate. Just post solution here. Thanks.
>
> http://pastebin.com/7hU20NNL

just installed py3.
there seems to be a bug.
in this file

http://xahlee.org/p/time_machine/tm-ch04.html

there's a mismatched double curly quote. at position 28319.

the python code above doesn't seem to spot it?

here's the elisp script output when run on that dir:

Error file: c:/Users/h3/web/xahlee_org/p/time_machine/tm-ch04.html
["“" 28319]
Done deal!

Xah

Xah Lee

Jul 19, 2011, 1:14:42 PM7/19/11

On Jul 18, 10:12 am, Billy Mays
<81282ed9a88799d21e77957df2d84bd6514d9...@myhashismyemail.com> wrote:

as Ian Kelly mentioned, your script fails because it doesn't report the
position or line/column number of the first mismatched bracket. This is
a rather significant part of this small problem. Avoiding unicode also
lessens the value of this exercise, because handling unicode in python
isn't trivial, at least with respect to this small exercise.

I added other unicode brackets to your list of brackets, but it seems
your code still fails to catch a file that has mismatched curly quotes.
(e.g. http://xahlee.org/p/time_machine/tm-ch04.html )

LOL Billy.

Xah

Xah Lee

Jul 19, 2011, 1:32:22 PM7/19/11

On Jul 18, 2:59 pm, Thomas 'PointedEars' Lahn <PointedE...@web.de>
wrote:

> Ian Kelly wrote:
> > Billy Mays wrote:
> >> I gave it a shot.  It doesn't do any of the Unicode delims, because let's
> >> face it, Unicode is for goobers.
>
> > Uh, okay...
>
> > Your script also misses the requirement of outputting the index or row
> > and column of the first mismatched bracket.
>
> Thanks to Python's expressiveness, this can be easily remedied (see below).  
>
> I also do not follow Billy's comment about Unicode.  Unicode and the fact
> that Python supports it *natively* cannot be appreciated enough in a
> globalized world.
>
> However, I have learned a lot about being pythonic from his posting (take
> those generator expressions, for example!), and the idea of looking at the
> top of a stack for reference is a really good one.  Thank you, Billy!
>
> Here is my improvement of his code, which should fill the mentioned gaps.
> I have also reversed the order in the report line as I think it is more
> natural this way.  I have tested the code superficially with a directory
> containing a single text file.  Watch for word-wrap:
>
> # encoding: utf-8
> '''
> Created on 2011-07-18
>
> @author: Thomas 'PointedEars' Lahn <PointedE...@web.de>, based on an idea of
> Billy Mays <81282ed9a88799d21e77957df2d84bd6514d9...@myhashismyemail.com>

Thanks for the fix.
Though, it seems still wrong.

On the file http://xahlee.org/p/time_machine/tm-ch04.html

there is a mismatched curly double quote at 28319.

the script reports:
tm-ch04.html: bad ')' at 68:2

that doesn't seem right. Line 68 is empty. There's no opening or
closing round bracket anywhere close. Nearest are lines 11 and 127.

Maybe Billy Mays's algorithm is wrong.

Xah (fairly discouraged now, after running 3 python scripts all
failed)

Billy Mays

Jul 19, 2011, 1:33:40 PM7/19/11
On 07/19/2011 01:14 PM, Xah Lee wrote:
> I added other unicode brackets to your list of brackets, but it seems
> your code still fail to catch a file that has mismatched curly quotes.
> (e.g.http://xahlee.org/p/time_machine/tm-ch04.html )
>
> LOL Billy.
>
> Xah

I suspect it's due to the file being opened in 'rb' mode. Also, in the
dictionary of characters at the top, the closing token is the key,
while the opening one is the value. Not sure if that's obvious.

Also, returning the position of the first mismatched pair is somewhat
ambiguous. File systems store files as streams of octets (mine do
anyways) rather than as characters. When you ask for the position of
the first mismatched pair, do you mean the position as per
file.tell() or do you mean the nth character in the utf-8 stream?

Also, you may have answered this earlier but I'll ask again anyways: You
ask for the first mismatched pair, Are you referring to the inner most
mismatched, or the outermost? For example, suppose you have this file:

foo[(])bar

Would the "(" be the first mismatched character or would the "]"?

--
Bill

Xah Lee

Jul 19, 2011, 1:49:09 PM7/19/11
On Jul 17, 8:31 am, Thomas Jollans <t...@jollybox.de> wrote:

> On Jul 17, 9:47 am,XahLee <xah...@gmail.com> wrote:
>
>
>
>
>
>
>
>
>
> > 2011-07-16
>
> > folks, this one will be interesting one.
>
> > the problem is to write a script that can check a dir of text files
> > (and all subdirs) and reports if a file has any mismatched matching
> > brackets.
>
> > • The files will be utf-8 encoded (unix style line ending).
>
> > • If a file has mismatched matching-pairs, the script will display the
> > file name, and the  line number and column number of the first
> > instance where a mismatched bracket occures. (or, just the char number
> > instead (as in emacs's “point”))
>
> > • the matching pairs are all single unicode chars. They are these and
> > nothing else: () {} [] “” ‹› «» 【】 〈〉 《》 「」 『』
> > Note that ‘single curly quote’ is not consider matching pair here.
>
> > • You script must be standalone. Must not be using some parser tools.
> > But can call lib that's part of standard distribution in your lang.
>
> > Here's a example of mismatched bracket: ([)], (“[[”), ((, 】etc. (and
> > yes, the brackets may be nested. There are usually text between these
> > chars.)
>
> > I'll be writing a emacs lisp solution and post in 2 days. Ι welcome
> > other lang implementations. In particular, perl, python, php, ruby,
> > tcl, lua, Haskell, Ocaml. I'll also be able to eval common lisp
> > (clisp) and Scheme lisp (scsh), Java. Other lang such as Clojure,
> > Scala, C, C++, or any others, are all welcome, but i won't be able to
> > eval it. javascript implementation will be very interesting too, but
> > please indicate which and where to install the command line version.
>
> > I hope you'll find this a interesting “challenge”. This is a parsing
> > problem. I haven't studied parsers except some Wikipedia reading, so
> > my solution will probably be naive. I hope to see and learn from your
> > solution too.

>
> > i hope you'll participate. Just post solution here. Thanks.
>
> I thought I'd have some fun with multi-processing:
>
> https://gist.github.com/1087682

hi Thomas. I ran the program, all cpu went to max (i have a quad), but
after i think 3 minutes nothing happened, so i killed it.

is there something special one should know to run the script?

I'm using Python 3.2.1 on Windows 7.

Xah

Thomas Jollans

Jul 19, 2011, 2:07:14 PM7/19/11
to pytho...@python.org


That script doesn't check that the balance is zero at the end of file.

Patch:

--- ../xah-raymond-old.py 2011-07-19 20:05:13.000000000 +0200
+++ ../xah-raymond.py 2011-07-19 20:03:14.000000000 +0200
@@ -16,6 +16,8 @@
         elif c in closers:
             if not stack or c != stack.pop():
                 return i
+    if stack:
+        return i
     return -1
 
 def scan(directory, encoding='utf-8'):
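[Raymond's full script lives only on pastebin; the following is a
self-contained sketch of a check_balance with the leftover-stack check
folded in. The bracket table is abbreviated and the details are a
reconstruction, not his exact code:]

```python
pairs = {')': '(', ']': '[', '}': '{', '”': '“', '›': '‹', '»': '«'}
openers = set(pairs.values())
closers = set(pairs)

def check_balance(text):
    """Return the index of the first problem character, or -1 if balanced."""
    stack = []  # indices of brackets opened but not yet closed
    for i, c in enumerate(text):
        if c in openers:
            stack.append(i)
        elif c in closers:
            if not stack or text[stack.pop()] != pairs[c]:
                return i        # closer with no matching opener
    if stack:
        return stack[0]         # the fix: an opener that is never closed
    return -1
```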

Ian Kelly

Jul 19, 2011, 2:08:06 PM7/19/11
to Xah Lee, pytho...@python.org
On Tue, Jul 19, 2011 at 10:54 AM, Xah Lee <xah...@gmail.com> wrote:
> On Sunday, July 17, 2011 2:48:42 AM UTC-7, Raymond Hettinger wrote:
>> On Jul 17, 12:47 am, Xah Lee <xah...@gmail.com> wrote:
>> > i hope you'll participate. Just post solution here. Thanks.
>>
>> http://pastebin.com/7hU20NNL
>
> just installed py3.
> there seems to be a bug.
> in this file
>
> http://xahlee.org/p/time_machine/tm-ch04.html
>
> there's a mismatched double curly quote. at position 28319.
>
> the python code above doesn't seem to spot it?

It would appear that Raymond forgot to check that the stack is empty
at the end of the check_balance function. It's an easy enough thing
to fix.

Xah Lee

Jul 19, 2011, 2:12:42 PM7/19/11
On Jul 19, 10:33 am, Billy Mays
<81282ed9a88799d21e77957df2d84bd6514d9...@myhashismyemail.com> wrote:

yes i haven't been precise. Thanks for bringing it up.

thinking about it now, i think it's a bit hard to define precisely. My
elisp code actually reports the “)”, so it's wrong too. LOL

Xah

Thomas Jollans

Jul 19, 2011, 2:14:53 PM7/19/11
to pytho...@python.org

Well, it overdoes the multi-processing “a little”. Checking each
character in a separate process might have been overkill.

Here's a sane version:

https://gist.github.com/1087682/2240a0834463d490c29ed0f794ad15128849ff8e

old, crazy version:
https://gist.github.com/1087682/6841c3875f7e88c23e0a053ac0d0f0565d8713e2

Thomas Jollans

Jul 19, 2011, 2:17:28 PM7/19/11
to pytho...@python.org
Oh, by the way:

On 19/07/11 19:49, Xah Lee wrote:

> I ran the program, all cpu went max

Mission accomplished.

Terry Reedy

Jul 19, 2011, 3:09:41 PM7/19/11
to pytho...@python.org
On 7/19/2011 2:12 PM, Xah Lee wrote:

>> Also, you may have answered this earlier but I'll ask again anyways: You
>> ask for the first mismatched pair, Are you referring to the inner most
>> mismatched, or the outermost? For example, suppose you have this file:
>>
>> foo[(])bar
>>
>> Would the "(" be the first mismatched character or would the "]"?
>
> yes i haven't been precise. Thanks for brining it up.
>
> thinking about it now, i think it's a bit hard to define precisely.

Then it is hard to code precisely.

> My elisp code actually reports the “)”, so it's wrong too. LOL

This sort of exercise should start with a series of test cases, starting
with the simplest.

testpairs = (
    ('', True),    # or whatever you want the OK response to be
    ('a', True),
    ('abdsdfdsdff', True),

    ('()', True),  # and so on for each pair of fences
    ('(', False),  # or exact error output wanted
    (')', False),  # and so on

    ('([)]', ???),
    ...
    )

The above could be generated programmatically from the set of pairs that
should be the input to the program, so that the pairs are not hardcoded
into the logic.
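[Generating the per-pair cases from the pair table, as suggested, might
be sketched like this; the fences list is abbreviated and the names are
illustrative, not from any posted solution:]

```python
# Abbreviated pair table; the real one would hold all of Xah's bracket pairs.
fences = [('(', ')'), ('[', ']'), ('“', '”')]

testpairs = [('', True), ('a', True), ('abdsdfdsdff', True)]
for opener, closer in fences:
    testpairs.append((opener + closer, True))           # balanced pair
    testpairs.append((opener, False))                   # dangling opener
    testpairs.append((closer, False))                   # dangling closer
    testpairs.append((opener + 'text' + closer, True))  # pair with content
```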

--
Terry Jan Reedy


Robert Klemme

Jul 20, 2011, 2:23:09 AM7/20/11
On 18.07.2011 16:39, Xah Lee wrote:
>
> On Jul 17, 12:47 am, Xah Lee<xah...@gmail.com> wrote:
>> 2011-07-16
>>
>> folks, this one will be interesting one.
>>
>> the problem is to write a script that can check a dir of text files
>> (and all subdirs) and reports if a file has any mismatched matching
>> brackets.
>> …
>
> Ok, here's my solution (pasted at bottom). I haven't tried to make it
> elegant or terse, yet, seeing that many are already much more elegant than
> i could possibly do with my code.
>
> my solution basically use a stack. (i think all of us are doing
> similar) Here's the steps:
>
> • Go thru the file char by char, find a bracket char.
> • check if the one on stack is a matching opening char. If so remove
> it. Else, push the current onto the stack.
> • Repeat the above till end of file.
> • If the stack is not empty, then the file got mismatched brackets.
> Report it.
> • Do the above on all files.
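[The quoted steps translate almost line for line into a sketch — a
single-string version, with directory walking and line/column reporting
left out and the bracket table abbreviated; the names are mine, not from
any posted solution:]

```python
pairs = {')': '(', ']': '[', '}': '{', '”': '“', '»': '«'}
brackets = set(pairs) | set(pairs.values())

def first_mismatch(text):
    """Return (char, index) of the first leftover bracket, or None."""
    stack = []  # (char, index) of brackets not yet cancelled
    for i, c in enumerate(text):
        if c not in brackets:
            continue                       # go through char by char
        if stack and c in pairs and stack[-1][0] == pairs[c]:
            stack.pop()                    # matching opener on top: remove it
        else:
            stack.append((c, i))           # else push the current char
    return stack[0] if stack else None     # non-empty stack = mismatch
```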

Small correction: my solution works differently (although internally the
regexp engine will roughly do the same). So, my approach summarized

- traverse a directory tree
- for each found item of type "file"
- read the whole content
- throw it at a regexp which is anchored at the beginning
and does the recursive parsing
- report file if the match is shorter than the file

Note: a special feature for recursive matching is used, which Perl's regexp
engine likely supports as well but many others don't.

Cheers

robert

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

jmfauth

Jul 20, 2011, 2:29:56 AM
to
On 19 Jul, 21:09, Terry Reedy <tjre...@udel.edu> wrote:
> On 7/19/2011 2:12 PM, Xah Lee wrote:
>
> >> Also, you may have answered this earlier but I'll ask again anyways: You
> >> ask for the first mismatched pair, Are you referring to the inner most
> >> mismatched, or the outermost?  For example, suppose you have this file:
>
> >> foo[(])bar
>
> >> Would the "(" be the first mismatched character or would the "]"?
>
> > yes i haven't been precise. Thanks for brining it up.
>
> > thinking about it now, i think it's a bit hard to define precisely.
>
> Then it is hard to code precisely.
>

Not really. The trick is to count the different openers/closers
separately. That is what I am doing to check balanced brackets in
chemical formulas. The rules are however not the same as in math.

Interestingly, I stumbled on this "problem" myself. enumerate() is very
nice for parsing a string from left to right.

>>> for i, c in enumerate('abcd'):
... print i, c
...
0 a
1 b
2 c
3 d
>>>

But, if I want to parse a string from right to left,
what's the trick?
The best I found so far:

>>> s = 'abcd'
>>> for i, c in enumerate(reversed(s)):
... print len(s) - 1 - i, c
...
3 d
2 c
1 b
0 a
>>>

Mark Tarver

Jul 20, 2011, 1:43:42 AM
to
On Jul 17, 8:47 am, Xah Lee <xah...@gmail.com> wrote:
> 2011-07-16
>
> folks, this one will be interesting one.
>
> the problem is to write a script that can check a dir of text files
> (and all subdirs) and reports if a file has any mismatched matching
> brackets.
>
> • The files will be utf-8 encoded (unix style line ending).
>
> • If a file has mismatched matching-pairs, the script will display the
> file name, and the  line number and column number of the first
> instance where a mismatched bracket occures. (or, just the char number
> instead (as in emacs's “point”))
>
> • the matching pairs are all single unicode chars. They are these and
> nothing else: () {} [] “” ‹› «» 【】 〈〉 《》 「」 『』
> Note that ‘single curly quote’ is not consider matching pair here.
>
> • You script must be standalone. Must not be using some parser tools.
> But can call lib that's part of standard distribution in your lang.
>
> Here's a example of mismatched bracket: ([)], (“[[”), ((, 】etc. (and
> yes, the brackets may be nested. There are usually text between these
> chars.)
>
> I'll be writing a emacs lisp solution and post in 2 days. Ι welcome
> other lang implementations. In particular, perl, python, php, ruby,
> tcl, lua, Haskell, Ocaml. I'll also be able to eval common lisp
> (clisp) and Scheme lisp (scsh), Java. Other lang such as Clojure,
> Scala, C, C++, or any others, are all welcome, but i won't be able to
> eval it. javascript implementation will be very interesting too, but
> please indicate which and where to install the command line version.
>
> I hope you'll find this a interesting “challenge”. This is a parsing
> problem. I haven't studied parsers except some Wikipedia reading, so
> my solution will probably be naive. I hope to see and learn from your
> solution too.
>
> i hope you'll participate. Just post solution here. Thanks.
>
>  Xah

Parsing technology based on BNF enables an elegant solution. First
take a basic bracket balancing program which parenthesises the
contents of the input. e.g. in Shen-YACC

(defcc <br>
  "(" <br> ")" <br$> := [<br> | <br$>];
  <item> <br>;
  <e> := [];)

(defcc <br$>
  <br>;)

(defcc <item>
  -*- := (if (element? -*- ["(" ")"]) (fail) [-*-]);)

Given (compile <br> ["(" 1 2 3 ")" 4]) the program produces [[1 2 3]
4]. When this program is used to parse the input, whatever residue is
left indicates where the parse has failed. In Shen-YACC

(define tellme
Stuff -> (let Br (<br> (@p Stuff []))
Residue (fst Br)
(if (empty? Residue)
(snd Br)
(error "parse failure at position ~A~%"
(- (length Stuff) (length Residue))))))

e.g.

(tellme ["(" 1 2 3 ")" "(" 4])
parse failure at position 5

(tellme ["(" 1 2 3 ")" "(" ")" 4])
[[1 2 3] [] 4]

The extension of this program to the case described is fairly simple.
Qi-YACC is very similar.
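The residue idea also transfers outside Shen-YACC. Here is a minimal recursive-descent sketch in Python (all names are mine; parentheses only, using the same position convention as Mark's example):

```python
def parse_br(tokens, i=0):
    """Consume balanced "(" ")" groups starting at index i.
    Return the index where parsing stopped (the start of the residue)."""
    while i < len(tokens):
        if tokens[i] == '(':
            j = parse_br(tokens, i + 1)       # parse the group body
            if j < len(tokens) and tokens[j] == ')':
                i = j + 1                     # consume the matching closer
            else:
                return i                      # unclosed "(": residue starts here
        elif tokens[i] == ')':
            return i                          # stray ")": stop; caller decides
        else:
            i += 1                            # ordinary item, skip it
    return i

def tellme(tokens):
    """Report 'ok' or the position where the parse failed."""
    stop = parse_br(tokens)
    if stop == len(tokens):
        return 'ok'
    return 'parse failure at position %d' % stop
```

With this sketch, tellme(['(', 1, 2, 3, ')', '(', 4]) reports a failure at position 5, matching the Shen-YACC example above.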

Nice problem.

I do not have further time to correspond right now.

Mark

Ian Kelly

Jul 20, 2011, 3:29:10 AM
to jmfauth, pytho...@python.org
On Wed, Jul 20, 2011 at 12:29 AM, jmfauth <wxjm...@gmail.com> wrote:
>> Then it is hard to code precisely.
>>
>
> Not really. The trick is to count the different opener/closer
> separately.
> That is what I am doing to check balanced brackets in
> chemical formulas. The rules are howerver not the same
> as in math.

I think the difficulty is not in the algorithm, but in adhering to the
desired output when it is ambiguously described.

> But, if I want to parse a string from right to left,
> what's the trick?
> The best I found so far:
>
>>>> s = 'abcd'
>>>> for i, c in enumerate(reversed(s)):
> ...     print len(s) - 1 - i, c

That violates DRY, since you have reversal logic in the iterator
algebra and then again in the loop body. I prefer to keep all such
logic in the iterator algebra, if possible. This is one possibility,
if you don't mind it building an intermediate list:

>>> for i, c in reversed(list(enumerate(s))):
...

Otherwise, here's another non-DRY solution:

>>> from itertools import izip
>>> for i, c in izip(reversed(xrange(len(s))), reversed(s)):
...

Unfortunately, this is one space where there just doesn't seem to be a
single obvious way to do it.

jmfauth

Jul 20, 2011, 3:54:22 AM
to
On 20 Jul, 09:29, Ian Kelly <ian.g.ke...@gmail.com> wrote:

> Otherwise, here's another non-DRY solution:
>
> >>> from itertools import izip
> >>> for i, c in izip(reversed(xrange(len(s))), reversed(s)):
>
> ...
>
> Unfortunately, this is one space where there just doesn't seem to be a
> single obvious way to do it.

Well, I see. Thanks.

There is still the old, brave solution I'm in fact using.

>>> s = 'abcd'
>>> for i in xrange(len(s)-1, -1, -1):
... print i, s[i]


...
3 d
2 c
1 b
0 a
>>>

---

DRY? acronym for ?

Steven D'Aprano

Jul 20, 2011, 4:18:42 AM
to
On Wed, 20 Jul 2011 05:54 pm jmfauth wrote:

> DRY? acronym for ?

I'd like to tell you, but I already told somebody else...

*grins*


http://en.wikipedia.org/wiki/Don't_repeat_yourself
http://c2.com/cgi/wiki?DontRepeatYourself


--
Steven

Xah Lee

Jul 20, 2011, 6:31:24 AM
to
i've just cleaned up my elisp code and wrote a short elisp tutorial.

Here:

〈Emacs Lisp: Batch Script to Validate Matching Brackets〉
http://xahlee.org/emacs/elisp_validate_matching_brackets.html

plain text version follows. Please let me know what you think.

am still working on going thru all code in other langs. Will get to
the ruby one, and that perl regex, and the other fixed python ones.
(possibly also the 2 common lisp codes but am not sure they are
runnable as is or just some non-working showoff. lol)

===============================================
Emacs Lisp: Batch Script to Validate Matching Brackets

Xah Lee, 2011-07-19

This page shows you how to write an elisp script that checks thousands
of files for mismatched brackets.

----------------------------------------------------------------
The Problem

------------------------------------------------
Summary

I have 5 thousand files containing many matching pairs. I want to
know if any of them contains mismatched brackets.

------------------------------------------------
Detail

The matching pairs includes these: () {} [] “” ‹› «» 〈〉 《》 【】 〖〗 「」
『』.

The program should be able to check all files in a dir, report any
file that has mismatched brackets, and indicate the line number or
position where a mismatch occurs.

For those curious, if you want to know what these brackets are, see:

• Syntax Design: Use of Unicode Matching Brackets as Specialized
Delimiters
• Intro to Chinese Punctuation with Computer Language Syntax
Perspectives

For other notes and conveniences about dealing with brackets in emacs,
see:

• Emacs: Defining Keys to Navigate Brackets
• “extend-selection” at A Text Editor Feature: Extend Selection by
Semantic Unit
• “select-text-in-quote” at Suggestions on Emacs's mark-word
Command

----------------------------------------------------------------
Solution

Here's outline of steps.

• Go thru the file char by char, find a bracket char.

• Check if the one on stack is a matching opening char. If so,
remove it. Else, push the current onto the stack.

• Repeat the above till no more bracket chars in the file.

• If the stack is not empty, then the file got mismatched
brackets. Report it.

• Do the above on all files.
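For comparison, the same outline can be sketched in Python before looking at the elisp (the dict names and `first_mismatch` are my own; like the elisp below, it reports the most recently pushed unmatched bracket):

```python
# Opener → closer, from the matching pairs above.
PAIRS = {'(': ')', '{': '}', '[': ']', '“': '”', '‹': '›', '«': '»',
         '【': '】', '〖': '〗', '〈': '〉', '《': '》', '「': '」', '『': '』'}
CLOSERS = {v: k for k, v in PAIRS.items()}    # closer → its opener

def first_mismatch(text):
    """Return (char, position) of an unmatched bracket, or None if balanced."""
    stack = []                                # entries are (char, position)
    for pos, ch in enumerate(text):
        if ch in PAIRS:                       # opener: push it
            stack.append((ch, pos))
        elif ch in CLOSERS:                   # closer: match or push
            if stack and stack[-1][0] == CLOSERS[ch]:
                stack.pop()                   # matches the opener on top
            else:
                stack.append((ch, pos))       # mismatch: keep it for the report
    return stack[-1] if stack else None       # last pushed unmatched char
```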

Here's some interesting use of lisp features to implement the above.

------------------------------------------------
Define Matching Pair Chars as “alist”

We begin by defining the chars we want to check, as an “association
list” (aka “alist”). Like this:

(setq matchPairs '(
("(" . ")")
("{" . "}")
("[" . "]")
("“" . "”")
("‹" . "›")
("«" . "»")
("【" . "】")
("〖" . "〗")
("〈" . "〉")
("《" . "》")
("「" . "」")
("『" . "』")
)
)

If you only want to check curly quotes, you can remove the other
elements above. This is convenient because some files necessarily have
mismatched pairs such as the parenthesis, since that char is used for
many non-bracketing purposes (e.g. the ASCII smiley).

An “alist” in lisp is basically a list of pairs (called key and value),
with the ability to search by key or by value. The first element of
a pair is called its key, the second element is its value. Each pair
is a “cons”, like this: (cons mykey myvalue), which can also be
written using this syntax: (mykey . myvalue) for easier reading.

The purpose of lisp's “alist” is similar to Python's dictionary or
Pretty Home Page's array. It is also similar to a hashmap, except that
an alist can have duplicate keys, can be searched by value, maintains
order, and is not intended for massive numbers of elements. Elisp has a
hashmap datatype if you need that. (See: Emacs Lisp Tutorial: Hash
Table.)

(info "(elisp) Association Lists")
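For readers coming from Python, an alist with elisp's rassoc-style value lookup can be imitated with a list of tuples (this Python helper is my own illustration, not part of the elisp code):

```python
# An alist as a list of (key, value) tuples; duplicates and order are kept.
match_pairs = [('(', ')'), ('{', '}'), ('[', ']'), ('“', '”')]

def rassoc(value, alist):
    """Like elisp's rassoc: return the first pair whose VALUE matches, else None."""
    for pair in alist:
        if pair[1] == value:
            return pair
    return None
```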

------------------------------------------------
Generate Regex String from alist

To search for a set of chars in emacs, we can read the buffer
char-by-char, or we can simply use “search-forward-regexp”. To use
that, first we need to generate a regex string from our matchPairs
alist.

First, we define/declare the string. Not a necessary step, but we do
it for clarity.

(setq searchRegex "")

Then we go thru the matchPairs alist. For each pair, we use “car” and
“cdr” to get the chars, and “concat” them onto the string. Like this:

(mapc
 (lambda (mypair)
   (setq searchRegex
         (concat searchRegex
                 (regexp-quote (car mypair)) "|"
                 (regexp-quote (cdr mypair)) "|")))
 matchPairs)

Then we remove the ending “|”.

(setq searchRegex (substring searchRegex 0 -1)) ; remove the ending “|”

Then, change | to \\|. In elisp regex, the | is literal; the “regex
or” is \|. Elisp does not have a special regex string syntax: it only
understands normal strings. So, to feed \| to the regex engine, you
need to escape the backslash, i.e. write \\| in your string. Here's
how we do it:

(setq searchRegex (replace-regexp-in-string "|" "\\|" searchRegex t t)) ; change | to \\| for regex “or” operation

You could shorten the above into just 2 lines by using \\| in the
“mapc” step instead of replacing | with \\| as an extra step.

See also: emacs regex tutorial.
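The same regex-building step is a one-liner in Python, shown here only for comparison (Python's re.escape plays the role of regexp-quote, and | needs no extra escaping inside a Python regex string):

```python
import re

match_pairs = [('(', ')'), ('[', ']'), ('“', '”'), ('«', '»')]

# Escape each bracket char and join with the regex "or" operator.
search_regex = '|'.join(re.escape(ch)
                        for pair in match_pairs
                        for ch in pair)

# The compiled pattern matches any single bracket char.
bracket_re = re.compile(search_regex)
```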

------------------------------------------------
Implement Stack Using Lisp List

The stack is done using lisp's list, e.g. '(1 2 3). The first element
of the list is the top of the stack. To push onto the stack:
(setq mystack (cons newitem mystack)). To pop an item off the stack:
(setq mystack (cdr mystack)). The stack begins as an empty list: '().

For each element in the stack, we need the char and also its position,
so that we can report the position if the file does have mismatched
pairs.

We use a vector as entries for the stack. Each entry is like this:
(vector char pos). (See: Emacs Lisp Tutorial: List & Vector.)

Here's how to check the current char against the pairs, push to the
stack, and pop from the stack.

; check if current char is a closing char and is in our match pairs alist.
; use “rassoc” to check the alist's set of “values”.
; it returns the first key/value pair found, or nil.
(rassoc char matchPairs)

; add to stack
(setq myStack (cons (vector char pos) myStack) )

; pop stack
(setq myStack (cdr myStack) )

------------------------------------------------
Complete Code

Here's the complete code.

;; -*- coding: utf-8 -*-
;; 2011-07-15
;; go thru a file, check if all brackets are properly matched.
;; e.g. good: (…{…}… “…”…)
;; bad: ( [)]
;; bad: ( ( )

(setq inputFile "xx_test_file.txt" ) ; a test file.
(setq inputDir "~/web/xahlee_org/") ; must end in slash

(defvar matchPairs '()
  "a alist. For each pair, the car is opening char, cdr is closing char.")
(setq matchPairs '(
                   ("(" . ")")
                   ("{" . "}")
                   ("[" . "]")
                   ("“" . "”")
                   ("‹" . "›")
                   ("«" . "»")
                   ("【" . "】")
                   ("〖" . "〗")
                   ("〈" . "〉")
                   ("《" . "》")
                   ("「" . "」")
                   ("『" . "』")
                   ))

(defvar searchRegex "" "regex string of all pairs to search.")
(setq searchRegex "")
(mapc
 (lambda (mypair)
   (setq searchRegex
         (concat searchRegex
                 (regexp-quote (car mypair)) "|"
                 (regexp-quote (cdr mypair)) "|")))
 matchPairs)

(setq searchRegex (replace-regexp-in-string "|$" "" searchRegex t t)) ; remove the ending “|”

(setq searchRegex (replace-regexp-in-string "|" "\\|" searchRegex t t)) ; change | to \\| for regex “or” operation

(defun my-process-file (fpath)
  "process the file at fullpath fpath ..."
  (let (myBuffer myStack ξchar ξpos)

    (setq myStack '() ) ; each element is a vector [char position]
    (setq ξchar "") ; the current char found

    (when t
      ;; (not (string-match "/xx" fpath)) ; in case you want to skip certain files

      (setq myBuffer (get-buffer-create " myTemp"))
      (set-buffer myBuffer)
      (insert-file-contents fpath nil nil nil t)

      (goto-char 1)
      (setq case-fold-search t)
      (while (search-forward-regexp searchRegex nil t)
        (setq ξpos (point) )
        (setq ξchar (buffer-substring-no-properties ξpos (- ξpos 1)) )

        ;; (princ (format "-----------------------------\nfound char: %s\n" ξchar) )

        (let ((isClosingCharQ nil) (matchedOpeningChar nil) )
          (setq isClosingCharQ (rassoc ξchar matchPairs))
          (when isClosingCharQ
            (setq matchedOpeningChar (car isClosingCharQ) ) )

          ;; (princ (format "isClosingCharQ is: %s\n" isClosingCharQ) )
          ;; (princ (format "matchedOpeningChar is: %s\n" matchedOpeningChar) )

          (if (and
               (car myStack) ; not empty
               (equal (elt (car myStack) 0) matchedOpeningChar ) )
              (progn
                ;; (princ (format "matched this bottom item on stack: %s\n" (car myStack)) )
                (setq myStack (cdr myStack) ) )
            (progn
              ;; (princ (format "did not match this bottom item on stack: %s\n" (car myStack)) )
              (setq myStack (cons (vector ξchar ξpos) myStack) ) ) ) )
        ;; (princ "current stack: " )
        ;; (princ myStack )
        ;; (terpri )
        )

      (when (not (equal myStack nil))
        (princ "Error file: ")
        (princ fpath)
        (print (car myStack) ) )
      (kill-buffer myBuffer)
      )))

(require 'find-lisp)

(let (outputBuffer)
  (setq outputBuffer "*xah match pair output*" )
  (with-output-to-temp-buffer outputBuffer
    ;; (my-process-file inputFile)
    (mapc 'my-process-file (find-lisp-find-files inputDir "\\.txt$"))
    (princ "Done deal!")
    ))

I added many comments and debug code for easy understanding. If you
are not familiar with the many elisp idioms such as opening files,
buffers, or printing to output, see: Emacs Lisp Idioms (for writing
interactive commands) ◇ Text Processing with Emacs Lisp Batch Style.

To run the code, simply open it in emacs. Edit the line at the top for
“inputDir”. Then call “eval-buffer”.

Here's a sample output:

Error file: c:/Users/h3/web/xahlee_org/p/time_machine/Hettie_Potter_orig.txt
[")" 3625]
Error file: c:/Users/h3/web/xahlee_org/p/time_machine/Hettie_Potter.txt
[")" 2338]
Error file: c:/Users/h3/web/xahlee_org/p/arabian_nights/xx/v1fn.txt
["”" 185795]
Done deal!

The weird ξ you see in my code is the Greek x (xi). I use unicode
chars in variable names for experimental purposes. You can just ignore
it. (See: Programing Style: Variable Naming: English Words Considered
Harmful.)

------------------------------------------------
Advantages of Emacs Lisp

Note that the great advantage of using elisp for text processing,
instead of {perl, python, ruby, …}, is that many things are taken care
of by the emacs environment.

I don't need to write code to declare a file's encoding (emacs detects
it automatically). No explicit file reading is involved: just open,
save, or move thru characters. No code is needed for safety backups
(the var “make-backup-files” controls that). You can easily open a
file at its path with a click or key press. I could add just 2 lines
so that clicking on the error char in the output jumps to its location
in the file.

Any elisp script you write inside emacs automatically becomes an
extension of emacs and can be used in an interactive way.

This problem is posted to a few comp.lang newsgroups as a fun
challenge. You can see several solutions in python, ruby, perl, common
lisp, at: a little parsing challenge ☺ (2011-07-17) @ Source
groups.google.com.

Xah

Uri Guttman

Jul 20, 2011, 12:31:05 PM