Account Options

  1. Sign in
The old Google Groups will be going away soon, but your browser is incompatible with the new version.
Google Groups Home
« Groups Home
C++ Question: switch / case vs. "chained" if elses for dealing with the ASCII character set
There are currently too many topics in this group that display first. To make this topic appear first, remove this option from another topic.
There was an error processing your request. Please try again.
flag
  17 messages - Collapse all  -  Translate all to Translated (View all originals)
The group you are posting to is a Usenet group. Messages posted to this group will make your email address visible to anyone on the Internet.
Your reply message has not been sent.
Your post was successful
 
From:
To:
Cc:
Followup To:
Add Cc | Add Followup-to | Edit Subject
Subject:
Validation:
For verification purposes please type the characters you see in the picture below or the numbers you hear by clicking the accessibility icon. Listen and type the numbers you hear
 
Randy Kramer  
View profile  
 More options Apr 28 2011, 9:36 pm
From: Randy Kramer <rhkra...@gmail.com>
Date: Thu, 28 Apr 2011 20:36:02 -0500
Local: Thurs, Apr 28 2011 9:36 pm
Subject: C++ Question: switch / case vs. "chained" if elses for dealing with the ASCII character set
In the lexer I'm working on, at one point I will have to examine every
character and process it in different ways depending on whether:

   * it is one of the special markup characters (*, _, =, /, -)
(slightly different logic for each, or at least, different values and
flags to be set)
   * it is punctuation that can appear at the front of a word
   * it is punctuation that can appear at the end of a word
   * it is an uppercase alphabetic character (all handled the same)
   * it is a lowercase alphabetic character (all handled the same)
   * it is a control character (all handled the same)
   * it is an extended ASCII character (all handled the same) (or this
might be "anything else")

I'm trying to decide whether it's best to use a long switch / case
statement with cases for, example, for every uppercase letter of the
alphabet that fall through to the logic for uppercase letters, or
whether I should have a chain of if else if else if statements.

Or something else.

I've looked on the web a little--I thought the if else would be more
efficient (but maybe I shouldn't care about efficiency so much yet) but
the sources I found say the switch case is even more efficient because
the compiler constructs a jump table (typically) and many compilers
don't do construct a jump table for a chain of if else if statements.

Somebody else recommended using a map (iiuc, they start out with a
typedef std::map<std::string,/*function pointer type*/> map_t;).  I
think that may be more learning than I want to try to do, and I'm not
sure I want to try to deal with function pointers, etc.

For readability, I think the switch / case statement will do pretty
well, which is probably the direction I'll head unless I get a strong
pointer in a different direction.

Randy Kramer


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Lex Trotman  
View profile  
 More options Apr 28 2011, 10:24 pm
From: Lex Trotman <ele...@gmail.com>
Date: Fri, 29 Apr 2011 12:24:25 +1000
Local: Thurs, Apr 28 2011 10:24 pm
Subject: Re: [scintilla] C++ Question: switch / case vs. "chained" if elses for dealing with the ASCII character set
Hi Randy,

Quick question before anything, does the language you are working on
*only* support ASCII characters??

Cheers
Lex

On 29 April 2011 11:36, Randy Kramer <rhkra...@gmail.com> wrote:


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Randy Kramer  
View profile  
 More options Apr 28 2011, 10:55 pm
From: Randy Kramer <rhkra...@gmail.com>
Date: Thu, 28 Apr 2011 21:55:58 -0500
Local: Thurs, Apr 28 2011 10:55 pm
Subject: Re: [scintilla] C++ Question: switch / case vs. "chained" if elses for dealing with the ASCII character set
Lex,

On Thursday 28 April 2011 10:24:25 pm Lex Trotman wrote:

> Quick question before anything, does the language you are working on
> *only* support ASCII characters??

No.  Maybe in some incarnations it will only support ASCII, but it
should also support UTF-8 and maybe multi-byte code pages.

But, at least for a long time, the situation should be much like Neil
says in a statement somewhere, anything other than ASCII will not be
syntactically significant with respect to the lexer and folder.

I'll elaborate a little more.  It is for a wiki markup language
(Foswiki / TWiki / askRhk), and the lexer must find and handle
WikiWords which are recognized as:

   * One or more uppercase letters, followed by
   * One or more lowercase letters, followed by
   * One or more uppercase letters, optionally followed by
   * Any alphanumeric characters (and maybe _)

I know a little about UTF-8 (maybe just enough to be a danger to
myself), but don't know if there are uppercase and lowercase letters in
the UTF-8 world beyond 7-bit ASCII.  I imagine if there are, then
someday, someone will want to recognize WikiWords using those
characters.  I think I hope I'm long gone by then, but you never
know. ;-)

Randy Kramer


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Lex Trotman  
View profile  
 More options Apr 29 2011, 12:04 am
From: Lex Trotman <ele...@gmail.com>
Date: Fri, 29 Apr 2011 14:04:36 +1000
Local: Fri, Apr 29 2011 12:04 am
Subject: Re: [scintilla] C++ Question: switch / case vs. "chained" if elses for dealing with the ASCII character set
On 29 April 2011 12:55, Randy Kramer <rhkra...@gmail.com> wrote:

> Lex,

> On Thursday 28 April 2011 10:24:25 pm Lex Trotman wrote:
>> Quick question before anything, does the language you are working on
>> *only* support ASCII characters??

> No.  Maybe in some incarnations it will only support ASCII, but it
> should also support UTF-8 and maybe multi-byte code pages.

Well you are going to have to know what the encoding of the bytes in
the Scintilla buffer is to convert to Unicode code points.

> But, at least for a long time, the situation should be much like Neil
> says in a statement somewhere, anything other than ASCII will not be
> syntactically significant with respect to the lexer and folder.

Neil is right ... for programming languages ... most only accept
non-ASCII in strings, but with a wiki you are working with natural
languages, *bold* to you or me but *gras* to Phillipe (I think).

And if Phillipe wishes to emphasize *façade* the C with cedilla is not
in the ASCII character set.

> I'll elaborate a little more.  It is for a wiki markup language
> (Foswiki / TWiki / askRhk), and the lexer must find and handle
> WikiWords which are recognized as:

>   * One or more uppercase letters, followed by
>   * One or more lowercase letters, followed by
>   * One or more uppercase letters, optionally followed by
>   * Any alphanumeric characters (and maybe _)

> I know a little about UTF-8 (maybe just enough to be a danger to
> myself), but don't know if there are uppercase and lowercase letters in
> the UTF-8 world beyond 7-bit ASCII.

BANG!!! Ow my foot...

There are thousands of letters outside the ASCII character set, like
the ç above.  And the problems occur even with the Latin languages
which the English character set is a subset of (French, German etc)
not to mention European languages like Greek, Russian etc where the
whole character set is outside the ASCII.  But at least if you can get
it to work for similar languages like French it should work for a
significant proportion of the rest.

And you thought it was going to be easy for a simple language like a
wiki.  Assuming that you know the encoding and so can get code points
not bytes out of the buffer there are several tools that can test
Letters, Upper/Lower case, Numerics etc.  One such is the Glib unichar
library, and for Scintilla with GTK it is already built in, but it can
be linked with apps without GTK.
But even that won't help you when the ç is not entered as a single
code point, but is a lower case c 0x63 followed by a cedilla combining
character 0x0327.

As you can probably guess I have been having those problems elsewhere
(basically for now I have treated all combining characters as letters,
its not exactly correct but good enough for my purposes).

It would be a very long switch since there are 2^21 Unicode code points :-)

Better to use if else if using the character class functions described
above.  The functions I believe use a mixture of tables and range
checks so they are reasonably efficient in both memory and time.

>> > Somebody else recommended using a map (iiuc, they start out with a
>> > typedef std::map<std::string,/*function pointer type*/> map_t;).  I
>> > think that may be more learning than I want to try to do, and I'm
>> > not sure I want to try to deal with function pointers, etc.

How are you going to fill the map with the 2^21 values, I don't think
it applies to what you are doing, bad advice or misunderstood :-(

>> > For readability, I think the switch / case statement will do pretty
>> > well, which is probably the direction I'll head unless I get a
>> > strong pointer in a different direction.

Consider this a strong pointer if you want to be considered friendly
by anyone who speaks non-English, at least names, locations talked
about etc may contain non-ASCII even if the main text is in English.

It makes such a significant difference to your code that I don't even
recommend writing it for ASCII for now and converting later.

Cheers
Lex


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Mike Lischke  
View profile  
 More options Apr 29 2011, 3:22 am
From: Mike Lischke <mike.lisc...@googlemail.com>
Date: Fri, 29 Apr 2011 09:22:17 +0200
Local: Fri, Apr 29 2011 3:22 am
Subject: Re: [scintilla] C++ Question: switch / case vs. "chained" if elses for dealing with the ASCII character set

> There are thousands of letters outside the ASCII character set, like
> the ç above.  And the problems occur even with the Latin languages
> which the English character set is a subset of (French, German etc)
> not to mention European languages like Greek, Russian etc where the
> whole character set is outside the ASCII.  But at least if you can get
> it to work for similar languages like French it should work for a
> significant proportion of the rest.

This all doesn't matter much for text parsing. Unicode (and utf-8 is just a transformation format of that, hence the name) is clustered into character classes. And in any decent unicode library exist functions to test if a certain character belongs to one of the classes (numbers, letters, punctuation etc.). We frequently use GLib to help us out here (http://developer.gnome.org/glib/2.28/glib-Unicode-Manipulation.html, Lex mentioned it already).

> But even that won't help you when the ç is not entered as a single
> code point, but is a lower case c 0x63 followed by a cedilla combining
> character 0x0327.

That's what normalization is for. You can transform any unicode string in one of different forms that give you a consistent way of describing all characters in a string (single code point or composition). See http://unicode.org/reports/tr15/, especially Table 1 for the 4 forms and the following figures which show what happens with a character depending on the chosen form.

>>>> I'm trying to decide whether it's best to use a long switch / case
>>>> statement with cases for, example, for every uppercase letter of
>>>> the alphabet that fall through to the logic for uppercase letters,
>>>> or whether I should have a chain of if else if else if statements.

Instead of using one big switch (be it a case or an if/else cascade) you might be better suited by using a recursive-decent parser, which is probably one of the fastest ways for parsing complex languages. It's like partitioning a set of input values into several groups and calling different functions for each group/partition to handle that. In these functions you further partition the input set until you can reach a final decision. This is similar to the commonly known solution to find one value out of 1024 by only asking 10 questions at most (via binary partition). In your case the partitioning criterion could be the character classes. I recommend not to try to hack together a temporary solution for ANSI only and hoping to extend it for any Unicode character later, but instead attack and solve it once and for all.

Mike
--
www.soft-gems.net
www.stv-lo.de
www.hibiki-daiko.de


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Lex Trotman  
View profile  
 More options Apr 29 2011, 3:45 am
From: Lex Trotman <ele...@gmail.com>
Date: Fri, 29 Apr 2011 17:45:48 +1000
Local: Fri, Apr 29 2011 3:45 am
Subject: Re: [scintilla] C++ Question: switch / case vs. "chained" if elses for dealing with the ASCII character set

> That's what normalization is for. You can transform any unicode string in one of different forms that give you a consistent way of describing all characters in a string (single code point or composition). See http://unicode.org/reports/tr15/, especially Table 1 for the 4 forms and the following figures which show what happens with a character depending on the chosen form.

Sadly that doesn't always compose to a single code point :-( see
characters made up of base plus more than one combining character,
still leaves combining characters in NFC and NFKC, I've been bitten.

>>>>> I'm trying to decide whether it's best to use a long switch / case
>>>>> statement with cases for, example, for every uppercase letter of
>>>>> the alphabet that fall through to the logic for uppercase letters,
>>>>> or whether I should have a chain of if else if else if statements.

> Instead of using one big switch (be it a case or an if/else cascade) you might be better suited by using a recursive-decent parser, which is probably one of the fastest ways for parsing complex languages.

I don't think that wiki markup would gain much from recursive descent
parsing, its very flat, in fact it would almost be parsed by regexes
if they had the Unicode classes (Like Perl extensions).  The very
similar Asciidoc language is parsed by Python regexes. Sadly using a
regex engine on the Scintilla buffer requires the gap to be closed
each time IIUC, which could get expensive.

> It's like partitioning a set of input values into several groups and calling different functions for each group/partition to handle that. In these functions you further partition the input set until you can reach a final decision. This is similar to the commonly known solution to find one value out of 1024 by only asking 10 questions at most (via binary partition). In your case the partitioning criterion could be the character classes.
>I recommend not to try to hack together a temporary solution for ANSI only and hoping to extend it for any Unicode character later, but instead attack and solve it once and for all.

Great minds think alike, or we've both been bitten :-)

Cheers
Lex


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Philippe Lhoste  
View profile  
 More options Apr 29 2011, 4:38 am
From: Philippe Lhoste <Phi...@GMX.net>
Date: Fri, 29 Apr 2011 10:38:01 +0200
Local: Fri, Apr 29 2011 4:38 am
Subject: Re: C++ Question: switch / case vs. "chained" if elses for dealing with the ASCII character set
On 29/04/2011 03:36, Randy Kramer wrote:

A long switch / case would be awful, breaking the DRY principle ("don't repeat yourself").
You can easily check if a character is an Ascii uppercase or lowercase char, Scintilla
might offer functions for that, or just check the numerical range. Idem for control chars.
Markup chars: sometime they are doubled or more, like === Title or '''Italics''' (made up,
I never recall the exact syntax, which varies between markups anyway).
Again, lexers can use helper function to test a small sequence of chars.
Finding if a punctuation is at the start or the end of a word can be (should be) done with
the finite state machine: you are inside a word, if you meet an upper char after lower
chars, change state. Idem if you find a punctuation char, etc.

About Unicode: I don't know if the markup languages you work on have special rules for
these chars. Indeed, traditional wikis take MixedCase sequences as link to other wiki
pages. But I am suspecting that most of them doesn't see something like
R publiqueArabe gypte like such link... I even guess that an accented char might split a
word in two, syntactically speaking!
Even worse, it can depend on an implementation of the wiki rules... (if there are several,
that's the case for Markdown, for example).

Modern wiki languages like Wikimedia one just don't rely anymore on this (which I find
rather ugly and can create empty links where none were intended, eg. with names like
JavaScript). They prefer to use special markup like [[Egypt]] or [[ gypt]], or
[[ gypte|R publique arabe d' gypte]] when we want to put free text with a page link.

Well, you know all this, all I wanted to say is that you probably can ignore Unicode chars
as special markup, as either the language is old and probably target only English, or it
is modern and has better methods to handle these chars.

--
Philippe Lhoste
--  (near) Paris -- France
--  http://Phi.Lho.free.fr
--  --  --  --  --  --  --  --  --  --  --  --  --  --


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Mike Lischke  
View profile  
 More options Apr 29 2011, 4:03 am
From: Mike Lischke <mike.lisc...@googlemail.com>
Date: Fri, 29 Apr 2011 10:03:13 +0200
Local: Fri, Apr 29 2011 4:03 am
Subject: Re: [scintilla] C++ Question: switch / case vs. "chained" if elses for dealing with the ASCII character set

>> That's what normalization is for. You can transform any unicode string in one of different forms that give you a consistent way of describing all characters in a string (single code point or composition). See http://unicode.org/reports/tr15/, especially Table 1 for the 4 forms and the following figures which show what happens with a character depending on the chosen form.

> Sadly that doesn't always compose to a single code point :-( see
> characters made up of base plus more than one combining character,
> still leaves combining characters in NFC and NFKC, I've been bitten.

That's true, but when choosing the right form the result should merely be a "base char"  plus some "decoration". I'm by far not a natural language expert, but I'd expect decisions made based on that "base char" alone should be fairly good (so the following char can be ignored if it belongs to a certain class). However with a RD parser it would be trivial to handle it properly. In a loop you will have to find some other ways to maintain a state stack or something like that.

> I don't think that wiki markup would gain much from recursive descent
> parsing, its very flat, in fact it would almost be parsed by regexes
> if they had the Unicode classes (Like Perl extensions).  The very
> similar Asciidoc language is parsed by Python regexes. Sadly using a
> regex engine on the Scintilla buffer requires the gap to be closed
> each time IIUC, which could get expensive.

I also used RE a while back for similar tasks but have found it to be much slower compared to RD parsers, extremely hard to debug and error prone. Trying to extend an already complex RE is a major task too. REs would be the last tool I'd choose.

>> It's like partitioning a set of input values into several groups and calling different functions for each group/partition to handle that. In these functions you further partition the input set until you can reach a final decision. This is similar to the commonly known solution to find one value out of 1024 by only asking 10 questions at most (via binary partition). In your case the partitioning criterion could be the character classes.

>> I recommend not to try to hack together a temporary solution for ANSI only and hoping to extend it for any Unicode character later, but instead attack and solve it once and for all.

> Great minds think alike, or we've both been bitten :-)

Oh yes :-)

Mike
--
www.soft-gems.net
www.stv-lo.de
www.hibiki-daiko.de


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Lex Trotman  
View profile  
 More options Apr 29 2011, 5:06 am
From: Lex Trotman <ele...@gmail.com>
Date: Fri, 29 Apr 2011 19:06:23 +1000
Local: Fri, Apr 29 2011 5:06 am
Subject: Re: [scintilla] Re: C++ Question: switch / case vs. "chained" if elses for dealing with the ASCII character set

> Well, you know all this, all I wanted to say is that you probably can ignore
> Unicode chars as special markup, as either the language is old and probably
> target only English, or it is modern and has better methods to handle these
> chars.

Well, there you go Randy, three pieces of different advice :-)

Serves you right for omitting that markup *can be around phrases* not
just words!!! That means they can have anything inside them.

Cheers
Lex


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Lex Trotman  
View profile  
 More options Apr 29 2011, 7:36 am
From: Lex Trotman <ele...@gmail.com>
Date: Fri, 29 Apr 2011 21:36:38 +1000
Local: Fri, Apr 29 2011 7:36 am
Subject: Re: [scintilla] Re: C++ Question: switch / case vs. "chained" if elses for dealing with the ASCII character set
On 29 April 2011 19:06, Lex Trotman <ele...@gmail.com> wrote:

>> Well, you know all this, all I wanted to say is that you probably can ignore
>> Unicode chars as special markup, as either the language is old and probably
>> target only English, or it is modern and has better methods to handle these
>> chars.

> Well, there you go Randy, three pieces of different advice :-)

> Serves you right for omitting that markup *can be around phrases* not
> just words!!! That means they can have anything inside them.

Randy,

Since you mentioned it elsewhere, I had a quick look at txt2tags and
markdown lexers, they would probably provide most of the
infrastructure you need, just change the characters for bold etc.  The
actual rule they use (which avoids all the palaver with word
characters) is *bold* is detected by the rule space then asterisk then
anything except the sequence (any character not a space then
asterisk).  Same rule for all the other inline markups.  Txt2tags
acknowledges its a copy of markdown so maybe you should do the same.

The problem will however arise with WikiWords which neither of those
languages support, and then the unicode character class solutions we
talked about before can apply.


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Randy Kramer  
View profile  
 More options Apr 29 2011, 11:08 am
From: Randy Kramer <rhkra...@gmail.com>
Date: Fri, 29 Apr 2011 10:08:26 -0500
Local: Fri, Apr 29 2011 11:08 am
Subject: Re: [scintilla] C++ Question: switch / case vs. "chained" if elses for dealing with the ASCII character set
Mike,

Thanks for the response!  I guess you're serious about the suggestion to
use a recursive descent parser.  ;-) Assuming you are, I'll need a lot
of help--I'm willing, and somewhat interested in looking into it, but I
tried it before and didn't get very far.

At one point in time during my sojourn into writing a lexer for
Scintilla (or maybe even back when I wrote some similar things for
Nedit and Kate), I did some level of investigation of pre-packaged
solutions (like lex and yacc), and then tried to understand recursive
descent parsers (which, iiuc, is not what lex or yacc are).

At the time, I had a lot of trouble with all kinds of things, acronyms
(like LALR??), the concept of a context free grammar, whether the
Foswiki / TWiki markup is context-free (not remembering for sure, I
think I eventually made an uneducated guess that it is not context
free), and whether a recursive descent parser for the language would or
would not (be guaranteed) to terminate.

Anyway, the point is, I'm willing to look into this a second (or third,
or fourth, or whatever time this is for me), but help would be
appreciated.

The first step I'm going to take is to re-read the Wikipedia article on
recursive descent parsers
(http://en.wikipedia.org/wiki/Recursive_descent_parser), and try to
understand the first example program there, written in C.  (Can you
imagine how well I did at that when I hardly understood C at all, and
I'm not sure I'm much better now.  Well, actually, I feel like I
understand 2000% more than I did a few months ago, but that might still
not be very much.)

Anyway, I'll appreciate any help you (or anyone else) wants to offer,
including pointers to possibly a simpler example of a recursive descent
parser, and or a reminder / help in deciding whether Foswiki markup is
context free or not, and understanding what that means to me.

(On the other hand, I may go read that Wikipedia article and understand
that example in minutes, but I sort of doubt it.)

regards,
Randy Kramer

On Friday 29 April 2011 03:22:17 am Mike Lischke wrote:


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Lex Trotman  
View profile  
 More options Apr 29 2011, 6:51 pm
From: Lex Trotman <ele...@gmail.com>
Date: Sat, 30 Apr 2011 08:51:41 +1000
Local: Fri, Apr 29 2011 6:51 pm
Subject: Re: [scintilla] C++ Question: switch / case vs. "chained" if elses for dealing with the ASCII character set
Hi Randy, Mike,

On 30 April 2011 01:08, Randy Kramer <rhkra...@gmail.com> wrote:

> Mike,

> Thanks for the response!  I guess you're serious about the suggestion to
> use a recursive descent parser.  ;-) Assuming you are, I'll need a lot
> of help--I'm willing, and somewhat interested in looking into it, but I
> tried it before and didn't get very far.

> At one point in time during my sojourn into writing a lexer for
> Scintilla (or maybe even back when I wrote some similar things for
> Nedit and Kate), I did some level of investigation of pre-packaged
> solutions (like lex and yacc), and then tried to understand recursive
> descent parsers (which, iiuc, is not what lex or yacc are).

Let me repeat the reply I gave back then.  Yacc etc, and recursive
descent etc parsers are *not* the correct solution for Scintilla
Lexers:

1. Scintilla lexers are not parsers, they only try to recognise enough
syntax to colourise and fold the code, they have no need to recognise
the language in detail.  This is very little more than the first level
lexical analysis phase of parsing, hence they are called lexers.

2. Scintilla does not parse a file from beginning to end, it can start
anywhere, since the parser methods above generate state as they parse,
they cannot start at an arbitrary point

3. Scintilla is an editor, much of the time the file contents are not
correct syntax, full parsers take a lot of effort to make them handle
errors and keep going.

These reasons are why even the C/C++/Java Lexers are not parsers.

And in the particular case of wiki languages, these types of parsers
are not appropriate because the language cannot be described by a
formal grammar, they have no structure they are just formatting
markup.

As I said elsewhere Randy, copy the markdown parser, change the
characters it detects for the various constructs, get a mostly working
lexer then think about WhackyWords.

Cheers
Lex


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Mike Lischke  
View profile  
 More options Apr 30 2011, 4:51 am
From: Mike Lischke <mike.lisc...@googlemail.com>
Date: Sat, 30 Apr 2011 10:51:49 +0200
Local: Sat, Apr 30 2011 4:51 am
Subject: Re: [scintilla] C++ Question: switch / case vs. "chained" if elses for dealing with the ASCII character set
Hey Randy,

> Thanks for the response!  I guess you're serious about the suggestion to
> use a recursive descent parser.  ;-) Assuming you are, I'll need a lot
> of help--I'm willing, and somewhat interested in looking into it, but I
> tried it before and didn't get very far.

Well, parsers/lexers (and their generation) is a wide area. My "bible" for this is the "dragon book" ("COMPILERS Principles, Techniques and Tools", 1986/87, Aho/Sethi/Ullmann). However, I'm not sure how far you wanna go with that when it comes to your actual problem. I haven't followed the entire thread, so I have to admit I don't know what the original problem was.

For syntax highlighting in Scintilla you usually only need a lexer to tokenize the input, not more, as there is usually a 1:1 relationship between a token type and a text color. There might be cases where the meaning (and therefor the color) of a keyword changes depending on the context (say, using "private" in a class declaration or as a variable name, just as example).

For text folding a bit more work is necessary but still you probably don't need a full parser, but just a technique to get the work done. For simpler (normal?) cases you just can work with a loop and a case switcher. I used that approach for the MySQL lexer in Scintilla, though I had to add some context tracking to solve ambiguities. My suggestion to use a recursive-decent parser was rather meant as an idea what technique to use instead of if/then cascades or a big switch.

> At one point in time during my sojourn into writing a lexer for
> Scintilla (or maybe even back when I wrote some similar things for
> Nedit and Kate), I did some level of investigation of pre-packaged
> solutions (like lex and yacc), and then tried to understand recursive
> descent parsers (which, iiuc, is not what lex or yacc are).

It's been a while since I last looked at the source code of lex and yacc, so I cannot say what type they use, however yacc *produces* table-based parsers (lex just generates the lexer, aka tokenizer). As an example for a parser generator that generates recursive-decent parsers I'd like to mention ANTLR.

> At the time, I had a lot of trouble with all kinds of things, acronyms
> (like LALR??), the concept of a context free grammar, whether the
> Foswiki / TWiki markup is context-free (not remembering for sure, I
> think I eventually made an uneducated guess that it is not context
> free), and whether a recursive descent parser for the language would or
> would not (be guaranteed) to terminate.

Is that what you are after, writing a parser for your wiki language? I ask because that is probably way off topic on this mailing list.

> The first step I'm going to take is to re-read the Wikipedia article on
> recursive descent parsers
> (http://en.wikipedia.org/wiki/Recursive_descent_parser), and try to
> understand the first example program there, written in C.  (Can you
> imagine how well I did at that when I hardly understood C at all, and
> I'm not sure I'm much better now.  Well, actually, I feel like I
> understand 2000% more than I did a few months ago, but that might still
> not be very much.)

:-) Well, I have seen many articles that used an abstract language to show the code (usually Pascal-like, as Pascal was once developed as language for teaching that kind of stuff), but I'm not sure that would help you better. The key point for an RD parser is quite simple:

read a token from the lexer -> test this against some condition -> call a function or maybe only a branch in a switch statement to handle this token (could be recursive, hence the name). In the context of Scintilla that could be an action like increasing the folding level. So that is not rocket science.

Not sure if that helps you but I have written a parser generator (based on lex/yacc) some years ago which is still available from my dev web site (first link in the sig). It comes with an editor for *.l and *.y files, so that might help you to get started with those grammars. I also have worked with ANTLR a while ago and still have a full parser grammar for MSVC RC files in one of my repositories (with all bells and whistles like macro support, unicode support, an evaluator etc. etc.). So just let me know if you want a copy.

Mike
--
www.soft-gems.net
www.stv-lo.de
www.hibiki-daiko.de


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Lex Trotman  
View profile  
 More options Apr 30 2011, 5:11 am
From: Lex Trotman <ele...@gmail.com>
Date: Sat, 30 Apr 2011 19:11:24 +1000
Local: Sat, Apr 30 2011 5:11 am
Subject: Re: [scintilla] C++ Question: switch / case vs. "chained" if elses for dealing with the ASCII character set
Hi Mike,

[...]

> I haven't followed the entire thread, so I have to admit I don't know what the original problem was.

Was assistance on writing a Scintilla Lexer for a wiki language, so
Randy is ok on this list AFAIK.

> For syntax highlighting in Scintilla you usually only need a lexer to tokenize the input, not more, as there is usually a 1:1 relationship between a token type and a text color. There might be cases where the meaning (and therefor the color) of a keyword changes depending on the context (say, using "private" in a class declaration or as a variable name, just as example).

Yep agree.

> For text folding a bit more work is necessary but still you probably don't need a full parser, but just a technique to get the work done. For simpler (normal?) cases you just can work with a loop and a case switcher. I used that approach for the MySQL lexer in Scintilla, though I had to add some context tracking to solve ambiguities. My suggestion to use a recursive-decent parser was rather meant as an idea what technique to use instead of if/then cascades or a big switch.

Yep, but then you didn't use it either because its overkill for the task :-)

> It's been a while since I last looked at the source code of lex and yacc, so I cannot say what type they use, however yacc *produces* table-based parsers (lex just generates the lexer, aka tokenizer). As an example for a parser generator that generates recursive-decent parsers I'd like to mention ANTLR.

The problem is that these don't like starting in the middle of the
file the way Scintilla lexers do, re-processing the whole file is a
bit wasteful.

> Is that what you are after, writing a parser for your wiki language? I ask because that is probably way off topic on this mailing list.

As I said above, he is in the right place AFAICT.

I'm sure you meant to be helpful Mike, but when you didn't know the
problem I think your advice has confused the situation.  The advice
above about how you wrote your lexer would have been much more helpful
and is in line with advice from other people, so Randy won't get so
confused :-)

Cheers
Lex


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Philippe Lhoste  
View profile  
 More options Apr 30 2011, 5:45 pm
From: Philippe Lhoste <Phi...@GMX.net>
Date: Sat, 30 Apr 2011 23:45:50 +0200
Local: Sat, Apr 30 2011 5:45 pm
Subject: Re: C++ Question: switch / case vs. "chained" if elses for dealing with the ASCII character set
On 30/04/2011 00:51, Lex Trotman wrote:

> 2. Scintilla does not parse a file from beginning to end, it can start
> anywhere, since the parser methods above generate state as they parse,
> they cannot start at an arbitrary point

To be clear: AFAIK, Scintilla always lexes the document from the start.
There is no other way to know in which state it is in the middle of the
file.
Perhaps you wanted to mean that a Scintilla lexer can resume lexing from
the middle of the document because it can look at the state just before,
keeping the information in memory.

--
Philippe Lhoste
--  (near) Paris -- France
--  http://Phi.Lho.free.fr
--  --  --  --  --  --  --  --  --  --  --  --  --  --


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Lex Trotman  
View profile  
 More options Apr 30 2011, 8:26 pm
From: Lex Trotman <ele...@gmail.com>
Date: Sun, 1 May 2011 10:26:24 +1000
Local: Sat, Apr 30 2011 8:26 pm
Subject: Re: [scintilla] Re: C++ Question: switch / case vs. "chained" if elses for dealing with the ASCII character set
On 1 May 2011 07:45, Philippe Lhoste <Phi...@gmx.net> wrote:

> On 30/04/2011 00:51, Lex Trotman wrote:

>> 2. Scintilla does not parse a file from beginning to end, it can start
>> anywhere, since the parser methods above generate state as they parse,
>> they cannot start at an arbitrary point

> To be clear: AFAIK, Scintilla always lexes the document from the start.
> There is no other way to know in which state it is in the middle of the
> file.
> Perhaps you wanted to mean that a Scintilla lexer can resume lexing from the
> middle of the document because it can look at the state just before, keeping
> the information in memory.

Yes, I was only thinking about the re-lexing as edits occur (after all
thats most of the time lexing will occur).  Scintilla tells the lexer
from where to re-lex from.  And the saved state used by most current
Scintilla lexers is minimal, often the style of the character before
is sufficient.  In particular they don't keep a recursive descent
stack at every character or every token of the file, which was the
real point. Recursive descent and Yacc parsers are interested in much
more detail and so have much more state, and so are much less
appropriate for restarting at an arbitrary point.

Cheers
Lex


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Randy Kramer  
View profile  
 More options May 2 2011, 9:27 am
From: Randy Kramer <rhkra...@gmail.com>
Date: Mon, 2 May 2011 08:27:49 -0500
Local: Mon, May 2 2011 9:27 am
Subject: Re: [scintilla] Re: C++ Question: switch / case vs. "chained" if elses for dealing with the ASCII character set
Thanks to everybody who has responded to my question!

I will probably have some followup questions at some point, but for now,
I sort of took the weekend off, with some "mild" reviewing of things
like lex, yacc, and recursive descent parsers.  I may spend a little
more time there, just for my own general learning.

It is often the case that when I have trouble with something, if I set
it aside for a while (like a year, 18 months, whatever ;-) I can often
come back to it and fairly quickly understand the things I had trouble
with on my previous reviews.

So, I seem to be doing better there, and want to spend a little more
time there before I get back to the lexer.

When I get back to the lexer, I'll review that library that somebody
mentioned for use with unicode / UTF-8 (among other things, I'll need
to confirm it works with Windows), and then decide whether to use it,
and whether to handle unicode / UTF-8 TWikiWords now, or make
provisions for the future, or what.

Based on what I decide there, I think I can decide whether to use a
switch / case or "chained" if elses.

Thanks!
Randy Kramer

On and before Saturday 30 April 2011 08:26:24 pm several people wrote
---< good stuff, which I've snipped for now >---


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
End of messages
« Back to Discussions « Newer topic     Older topic »