In the lexer I'm working on, at one point I will have to examine every character and process it in different ways depending on whether:
* it is one of the special markup characters (*, _, =, /, -) (slightly different logic for each, or at least, different values and flags to be set) * it is punctuation that can appear at the front of a word * it is punctuation that can appear at the end of a word * it is an uppercase alphabetic character (all handled the same) * it is a lowercase alphabetic character (all handled the same) * it is a control character (all handled the same) * it is an extended ASCII character (all handled the same) (or this might be "anything else")
I'm trying to decide whether it's best to use a long switch / case statement with cases for, example, for every uppercase letter of the alphabet that fall through to the logic for uppercase letters, or whether I should have a chain of if else if else if statements.
Or something else.
I've looked on the web a little--I thought the if else would be more efficient (but maybe I shouldn't care about efficiency so much yet) but the sources I found say the switch case is even more efficient because the compiler constructs a jump table (typically) and many compilers don't do construct a jump table for a chain of if else if statements.
Somebody else recommended using a map (iiuc, they start out with a typedef std::map<std::string,/*function pointer type*/> map_t;). I think that may be more learning than I want to try to do, and I'm not sure I want to try to deal with function pointers, etc.
For readability, I think the switch / case statement will do pretty well, which is probably the direction I'll head unless I get a strong pointer in a different direction.
> In the lexer I'm working on, at one point I will have to examine every > character and process it in different ways depending on whether:
> * it is one of the special markup characters (*, _, =, /, -) > (slightly different logic for each, or at least, different values and > flags to be set) > * it is punctuation that can appear at the front of a word > * it is punctuation that can appear at the end of a word > * it is an uppercase alphabetic character (all handled the same) > * it is a lowercase alphabetic character (all handled the same) > * it is a control character (all handled the same) > * it is an extended ASCII character (all handled the same) (or this > might be "anything else")
> I'm trying to decide whether it's best to use a long switch / case > statement with cases for, example, for every uppercase letter of the > alphabet that fall through to the logic for uppercase letters, or > whether I should have a chain of if else if else if statements.
> Or something else.
> I've looked on the web a little--I thought the if else would be more > efficient (but maybe I shouldn't care about efficiency so much yet) but > the sources I found say the switch case is even more efficient because > the compiler constructs a jump table (typically) and many compilers > don't do construct a jump table for a chain of if else if statements.
> Somebody else recommended using a map (iiuc, they start out with a > typedef std::map<std::string,/*function pointer type*/> map_t;). I > think that may be more learning than I want to try to do, and I'm not > sure I want to try to deal with function pointers, etc.
> For readability, I think the switch / case statement will do pretty > well, which is probably the direction I'll head unless I get a strong > pointer in a different direction.
> Randy Kramer
> -- > You received this message because you are subscribed to the Google Groups "scintilla-interest" group. > To post to this group, send email to scintilla-interest@googlegroups.com. > To unsubscribe from this group, send email to scintilla-interest+unsubscribe@googlegroups.com. > For more options, visit this group at http://groups.google.com/group/scintilla-interest?hl=en.
On Thursday 28 April 2011 10:24:25 pm Lex Trotman wrote:
> Quick question before anything, does the language you are working on > *only* support ASCII characters??
No. Maybe in some incarnations it will only support ASCII, but it should also support UTF-8 and maybe multi-byte code pages.
But, at least for a long time, the situation should be much like Neil says in a statement somewhere, anything other than ASCII will not be syntactically significant with respect to the lexer and folder.
I'll elaborate a little more. It is for a wiki markup language (Foswiki / TWiki / askRhk), and the lexer must find and handle WikiWords which are recognized as:
* One or more uppercase letters, followed by * One or more lowercase letters, followed by * One or more uppercase letters, optionally followed by * Any alphanumeric characters (and maybe _)
I know a little about UTF-8 (maybe just enough to be a danger to myself), but don't know if there are uppercase and lowercase letters in the UTF-8 world beyond 7-bit ASCII. I imagine if there are, then someday, someone will want to recognize WikiWords using those characters. I think I hope I'm long gone by then, but you never know. ;-)
> On 29 April 2011 11:36, Randy Kramer <rhkra...@gmail.com> wrote: > > In the lexer I'm working on, at one point I will have to examine > > every character and process it in different ways depending on > > whether:
> > * it is one of the special markup characters (*, _, =, /, -) > > (slightly different logic for each, or at least, different values > > and flags to be set) > > * it is punctuation that can appear at the front of a word > > * it is punctuation that can appear at the end of a word > > * it is an uppercase alphabetic character (all handled the same) > > * it is a lowercase alphabetic character (all handled the same) > > * it is a control character (all handled the same) > > * it is an extended ASCII character (all handled the same) (or > > this might be "anything else")
> > I'm trying to decide whether it's best to use a long switch / case > > statement with cases for, example, for every uppercase letter of > > the alphabet that fall through to the logic for uppercase letters, > > or whether I should have a chain of if else if else if statements.
> > Or something else.
> > I've looked on the web a little--I thought the if else would be > > more efficient (but maybe I shouldn't care about efficiency so much > > yet) but the sources I found say the switch case is even more > > efficient because the compiler constructs a jump table (typically) > > and many compilers don't do construct a jump table for a chain of > > if else if statements.
> > Somebody else recommended using a map (iiuc, they start out with a > > typedef std::map<std::string,/*function pointer type*/> map_t;). I > > think that may be more learning than I want to try to do, and I'm > > not sure I want to try to deal with function pointers, etc.
> > For readability, I think the switch / case statement will do pretty > > well, which is probably the direction I'll head unless I get a > > strong pointer in a different direction.
> > Randy Kramer
> > -- > > You received this message because you are subscribed to the Google > > Groups "scintilla-interest" group. To post to this group, send > > email to scintilla-interest@googlegroups.com. To unsubscribe from > > this group, send email to > > scintilla-interest+unsubscribe@googlegroups.com. For more options, > > visit this group at > > http://groups.google.com/group/scintilla-interest?hl=en.
On 29 April 2011 12:55, Randy Kramer <rhkra...@gmail.com> wrote:
> Lex,
> On Thursday 28 April 2011 10:24:25 pm Lex Trotman wrote: >> Quick question before anything, does the language you are working on >> *only* support ASCII characters??
> No. Maybe in some incarnations it will only support ASCII, but it > should also support UTF-8 and maybe multi-byte code pages.
Well you are going to have to know what the encoding of the bytes in the Scintilla buffer is to convert to Unicode code points.
> But, at least for a long time, the situation should be much like Neil > says in a statement somewhere, anything other than ASCII will not be > syntactically significant with respect to the lexer and folder.
Neil is right ... for programming languages ... most only accept non-ASCII in strings, but with a wiki you are working with natural languages, *bold* to you or me but *gras* to Phillipe (I think).
And if Phillipe wishes to emphasize *façade* the C with cedilla is not in the ASCII character set.
> I'll elaborate a little more. It is for a wiki markup language > (Foswiki / TWiki / askRhk), and the lexer must find and handle > WikiWords which are recognized as:
> * One or more uppercase letters, followed by > * One or more lowercase letters, followed by > * One or more uppercase letters, optionally followed by > * Any alphanumeric characters (and maybe _)
> I know a little about UTF-8 (maybe just enough to be a danger to > myself), but don't know if there are uppercase and lowercase letters in > the UTF-8 world beyond 7-bit ASCII.
BANG!!! Ow my foot...
There are thousands of letters outside the ASCII character set, like the ç above. And the problems occur even with the Latin languages which the English character set is a subset of (French, German etc) not to mention European languages like Greek, Russian etc where the whole character set is outside the ASCII. But at least if you can get it to work for similar languages like French it should work for a significant proportion of the rest.
And you thought it was going to be easy for a simple language like a wiki. Assuming that you know the encoding and so can get code points not bytes out of the buffer there are several tools that can test Letters, Upper/Lower case, Numerics etc. One such is the Glib unichar library, and for Scintilla with GTK it is already built in, but it can be linked with apps without GTK. But even that won't help you when the ç is not entered as a single code point, but is a lower case c 0x63 followed by a cedilla combining character 0x0327.
As you can probably guess I have been having those problems elsewhere (basically for now I have treated all combining characters as letters, its not exactly correct but good enough for my purposes).
> I imagine if there are, then > someday, someone will want to recognize WikiWords using those > characters. I think I hope I'm long gone by then, but you never > know. ;-)
> Randy Kramer
>> On 29 April 2011 11:36, Randy Kramer <rhkra...@gmail.com> wrote: >> > In the lexer I'm working on, at one point I will have to examine >> > every character and process it in different ways depending on >> > whether:
>> > * it is one of the special markup characters (*, _, =, /, -) >> > (slightly different logic for each, or at least, different values >> > and flags to be set) >> > * it is punctuation that can appear at the front of a word >> > * it is punctuation that can appear at the end of a word >> > * it is an uppercase alphabetic character (all handled the same) >> > * it is a lowercase alphabetic character (all handled the same) >> > * it is a control character (all handled the same) >> > * it is an extended ASCII character (all handled the same) (or >> > this might be "anything else")
>> > I'm trying to decide whether it's best to use a long switch / case >> > statement with cases for, example, for every uppercase letter of >> > the alphabet that fall through to the logic for uppercase letters, >> > or whether I should have a chain of if else if else if statements.
>> > Or something else.
>> > I've looked on the web a little--I thought the if else would be >> > more efficient (but maybe I shouldn't care about efficiency so much >> > yet) but the sources I found say the switch case is even more >> > efficient because the compiler constructs a jump table (typically) >> > and many compilers don't do construct a jump table for a chain of >> > if else if statements.
It would be a very long switch since there are 2^21 Unicode code points :-)
Better to use if else if using the character class functions described above. The functions I believe use a mixture of tables and range checks so they are reasonably efficient in both memory and time.
>> > Somebody else recommended using a map (iiuc, they start out with a >> > typedef std::map<std::string,/*function pointer type*/> map_t;). I >> > think that may be more learning than I want to try to do, and I'm >> > not sure I want to try to deal with function pointers, etc.
How are you going to fill the map with the 2^21 values, I don't think it applies to what you are doing, bad advice or misunderstood :-(
>> > For readability, I think the switch / case statement will do pretty >> > well, which is probably the direction I'll head unless I get a >> > strong pointer in a different direction.
Consider this a strong pointer if you want to be considered friendly by anyone who speaks non-English, at least names, locations talked about etc may contain non-ASCII even if the main text is in English.
It makes such a significant difference to your code that I don't even recommend writing it for ASCII for now and converting later.
> There are thousands of letters outside the ASCII character set, like > the ç above. And the problems occur even with the Latin languages > which the English character set is a subset of (French, German etc) > not to mention European languages like Greek, Russian etc where the > whole character set is outside the ASCII. But at least if you can get > it to work for similar languages like French it should work for a > significant proportion of the rest.
This all doesn't matter much for text parsing. Unicode (and utf-8 is just a transformation format of that, hence the name) is clustered into character classes. And in any decent unicode library exist functions to test if a certain character belongs to one of the classes (numbers, letters, punctuation etc.). We frequently use GLib to help us out here (http://developer.gnome.org/glib/2.28/glib-Unicode-Manipulation.html, Lex mentioned it already).
> But even that won't help you when the ç is not entered as a single > code point, but is a lower case c 0x63 followed by a cedilla combining > character 0x0327.
That's what normalization is for. You can transform any unicode string in one of different forms that give you a consistent way of describing all characters in a string (single code point or composition). See http://unicode.org/reports/tr15/, especially Table 1 for the 4 forms and the following figures which show what happens with a character depending on the chosen form.
>>>> I'm trying to decide whether it's best to use a long switch / case >>>> statement with cases for, example, for every uppercase letter of >>>> the alphabet that fall through to the logic for uppercase letters, >>>> or whether I should have a chain of if else if else if statements.
Instead of using one big switch (be it a case or an if/else cascade) you might be better suited by using a recursive-decent parser, which is probably one of the fastest ways for parsing complex languages. It's like partitioning a set of input values into several groups and calling different functions for each group/partition to handle that. In these functions you further partition the input set until you can reach a final decision. This is similar to the commonly known solution to find one value out of 1024 by only asking 10 questions at most (via binary partition). In your case the partitioning criterion could be the character classes. I recommend not to try to hack together a temporary solution for ANSI only and hoping to extend it for any Unicode character later, but instead attack and solve it once and for all.
> That's what normalization is for. You can transform any unicode string in one of different forms that give you a consistent way of describing all characters in a string (single code point or composition). See http://unicode.org/reports/tr15/, especially Table 1 for the 4 forms and the following figures which show what happens with a character depending on the chosen form.
Sadly that doesn't always compose to a single code point :-( see characters made up of base plus more than one combining character, still leaves combining characters in NFC and NFKC, I've been bitten.
>>>>> I'm trying to decide whether it's best to use a long switch / case >>>>> statement with cases for, example, for every uppercase letter of >>>>> the alphabet that fall through to the logic for uppercase letters, >>>>> or whether I should have a chain of if else if else if statements.
> Instead of using one big switch (be it a case or an if/else cascade) you might be better suited by using a recursive-decent parser, which is probably one of the fastest ways for parsing complex languages.
I don't think that wiki markup would gain much from recursive descent parsing, its very flat, in fact it would almost be parsed by regexes if they had the Unicode classes (Like Perl extensions). The very similar Asciidoc language is parsed by Python regexes. Sadly using a regex engine on the Scintilla buffer requires the gap to be closed each time IIUC, which could get expensive.
> It's like partitioning a set of input values into several groups and calling different functions for each group/partition to handle that. In these functions you further partition the input set until you can reach a final decision. This is similar to the commonly known solution to find one value out of 1024 by only asking 10 questions at most (via binary partition). In your case the partitioning criterion could be the character classes. >I recommend not to try to hack together a temporary solution for ANSI only and hoping to extend it for any Unicode character later, but instead attack and solve it once and for all.
Great minds think alike, or we've both been bitten :-)
> In the lexer I'm working on, at one point I will have to examine every > character and process it in different ways depending on whether:
> * it is one of the special markup characters (*, _, =, /, -) > (slightly different logic for each, or at least, different values and > flags to be set) > * it is punctuation that can appear at the front of a word > * it is punctuation that can appear at the end of a word > * it is an uppercase alphabetic character (all handled the same) > * it is a lowercase alphabetic character (all handled the same) > * it is a control character (all handled the same) > * it is an extended ASCII character (all handled the same) (or this > might be "anything else")
> I'm trying to decide whether it's best to use a long switch / case > statement with cases for, example, for every uppercase letter of the > alphabet that fall through to the logic for uppercase letters, or > whether I should have a chain of if else if else if statements.
A long switch / case would be awful, breaking the DRY principle ("don't repeat yourself"). You can easily check if a character is an Ascii uppercase or lowercase char, Scintilla might offer functions for that, or just check the numerical range. Idem for control chars. Markup chars: sometime they are doubled or more, like === Title or '''Italics''' (made up, I never recall the exact syntax, which varies between markups anyway). Again, lexers can use helper function to test a small sequence of chars. Finding if a punctuation is at the start or the end of a word can be (should be) done with the finite state machine: you are inside a word, if you meet an upper char after lower chars, change state. Idem if you find a punctuation char, etc.
About Unicode: I don't know if the markup languages you work on have special rules for these chars. Indeed, traditional wikis take MixedCase sequences as link to other wiki pages. But I am suspecting that most of them doesn't see something like R publiqueArabe gypte like such link... I even guess that an accented char might split a word in two, syntactically speaking! Even worse, it can depend on an implementation of the wiki rules... (if there are several, that's the case for Markdown, for example).
Modern wiki languages like Wikimedia one just don't rely anymore on this (which I find rather ugly and can create empty links where none were intended, eg. with names like JavaScript). They prefer to use special markup like [[Egypt]] or [[ gypt]], or [[ gypte|R publique arabe d' gypte]] when we want to put free text with a page link.
Well, you know all this, all I wanted to say is that you probably can ignore Unicode chars as special markup, as either the language is old and probably target only English, or it is modern and has better methods to handle these chars.
-- Philippe Lhoste -- (near) Paris -- France -- http://Phi.Lho.free.fr -- -- -- -- -- -- -- -- -- -- -- -- -- --
>> That's what normalization is for. You can transform any unicode string in one of different forms that give you a consistent way of describing all characters in a string (single code point or composition). See http://unicode.org/reports/tr15/, especially Table 1 for the 4 forms and the following figures which show what happens with a character depending on the chosen form.
> Sadly that doesn't always compose to a single code point :-( see > characters made up of base plus more than one combining character, > still leaves combining characters in NFC and NFKC, I've been bitten.
That's true, but when choosing the right form the result should merely be a "base char" plus some "decoration". I'm by far not a natural language expert, but I'd expect decisions made based on that "base char" alone should be fairly good (so the following char can be ignored if it belongs to a certain class). However with a RD parser it would be trivial to handle it properly. In a loop you will have to find some other ways to maintain a state stack or something like that.
> I don't think that wiki markup would gain much from recursive descent > parsing, its very flat, in fact it would almost be parsed by regexes > if they had the Unicode classes (Like Perl extensions). The very > similar Asciidoc language is parsed by Python regexes. Sadly using a > regex engine on the Scintilla buffer requires the gap to be closed > each time IIUC, which could get expensive.
I also used RE a while back for similar tasks but have found it to be much slower compared to RD parsers, extremely hard to debug and error prone. Trying to extend an already complex RE is a major task too. REs would be the last tool I'd choose.
>> It's like partitioning a set of input values into several groups and calling different functions for each group/partition to handle that. In these functions you further partition the input set until you can reach a final decision. This is similar to the commonly known solution to find one value out of 1024 by only asking 10 questions at most (via binary partition). In your case the partitioning criterion could be the character classes.
>> I recommend not to try to hack together a temporary solution for ANSI only and hoping to extend it for any Unicode character later, but instead attack and solve it once and for all.
> Great minds think alike, or we've both been bitten :-)
> Well, you know all this, all I wanted to say is that you probably can ignore > Unicode chars as special markup, as either the language is old and probably > target only English, or it is modern and has better methods to handle these > chars.
Well, there you go Randy, three pieces of different advice :-)
Serves you right for omitting that markup *can be around phrases* not just words!!! That means they can have anything inside them.
On 29 April 2011 19:06, Lex Trotman <ele...@gmail.com> wrote:
>> Well, you know all this, all I wanted to say is that you probably can ignore >> Unicode chars as special markup, as either the language is old and probably >> target only English, or it is modern and has better methods to handle these >> chars.
> Well, there you go Randy, three pieces of different advice :-)
> Serves you right for omitting that markup *can be around phrases* not > just words!!! That means they can have anything inside them.
Randy,
Since you mentioned it elsewhere, I had a quick look at txt2tags and markdown lexers, they would probably provide most of the infrastructure you need, just change the characters for bold etc. The actual rule they use (which avoids all the palaver with word characters) is *bold* is detected by the rule space then asterisk then anything except the sequence (any character not a space then asterisk). Same rule for all the other inline markups. Txt2tags acknowledges its a copy of markdown so maybe you should do the same.
The problem will however arise with WikiWords which neither of those languages support, and then the unicode character class solutions we talked about before can apply.
Thanks for the response! I guess you're serious about the suggestion to use a recursive descent parser. ;-) Assuming you are, I'll need a lot of help--I'm willing, and somewhat interested in looking into it, but I tried it before and didn't get very far.
At one point in time during my sojourn into writing a lexer for Scintilla (or maybe even back when I wrote some similar things for Nedit and Kate), I did some level of investigation of pre-packaged solutions (like lex and yacc), and then tried to understand recursive descent parsers (which, iiuc, is not what lex or yacc are).
At the time, I had a lot of trouble with all kinds of things, acronyms (like LALR??), the concept of a context free grammar, whether the Foswiki / TWiki markup is context-free (not remembering for sure, I think I eventually made an uneducated guess that it is not context free), and whether a recursive descent parser for the language would or would not (be guaranteed) to terminate.
Anyway, the point is, I'm willing to look into this a second (or third, or fourth, or whatever time this is for me), but help would be appreciated.
The first step I'm going to take is to re-read the Wikipedia article on recursive descent parsers (http://en.wikipedia.org/wiki/Recursive_descent_parser), and try to understand the first example program there, written in C. (Can you imagine how well I did at that when I hardly understood C at all, and I'm not sure I'm much better now. Well, actually, I feel like I understand 2000% more than I did a few months ago, but that might still not be very much.)
Anyway, I'll appreciate any help you (or anyone else) wants to offer, including pointers to possibly a simpler example of a recursive descent parser, and or a reminder / help in deciding whether Foswiki markup is context free or not, and understanding what that means to me.
(On the other hand, I may go read that Wikipedia article and understand that example in minutes, but I sort of doubt it.)
regards, Randy Kramer
On Friday 29 April 2011 03:22:17 am Mike Lischke wrote:
> Instead of using one big switch (be it a case or an if/else cascade) > you might be better suited by using a recursive-decent parser, which > is probably one of the fastest ways for parsing complex languages. > It's like partitioning a set of input values into several groups and > calling different functions for each group/partition to handle that. > In these functions you further partition the input set until you can > reach a final decision. This is similar to the commonly known > solution to find one value out of 1024 by only asking 10 questions at > most (via binary partition). In your case the partitioning criterion > could be the character classes. I recommend not to try to hack > together a temporary solution for ANSI only and hoping to extend it > for any Unicode character later, but instead attack and solve it once > and for all.
On 30 April 2011 01:08, Randy Kramer <rhkra...@gmail.com> wrote:
> Mike,
> Thanks for the response! I guess you're serious about the suggestion to > use a recursive descent parser. ;-) Assuming you are, I'll need a lot > of help--I'm willing, and somewhat interested in looking into it, but I > tried it before and didn't get very far.
> At one point in time during my sojourn into writing a lexer for > Scintilla (or maybe even back when I wrote some similar things for > Nedit and Kate), I did some level of investigation of pre-packaged > solutions (like lex and yacc), and then tried to understand recursive > descent parsers (which, iiuc, is not what lex or yacc are).
Let me repeat the reply I gave back then. Yacc etc, and recursive descent etc parsers are *not* the correct solution for Scintilla Lexers:
1. Scintilla lexers are not parsers, they only try to recognise enough syntax to colourise and fold the code, they have no need to recognise the language in detail. This is very little more than the first level lexical analysis phase of parsing, hence they are called lexers.
2. Scintilla does not parse a file from beginning to end, it can start anywhere, since the parser methods above generate state as they parse, they cannot start at an arbitrary point
3. Scintilla is an editor, much of the time the file contents are not correct syntax, full parsers take a lot of effort to make them handle errors and keep going.
These reasons are why even the C/C++/Java Lexers are not parsers.
And in the particular case of wiki languages, these types of parsers are not appropriate because the language cannot be described by a formal grammar, they have no structure they are just formatting markup.
As I said elsewhere Randy, copy the markdown parser, change the characters it detects for the various constructs, get a mostly working lexer then think about WhackyWords.
> Thanks for the response! I guess you're serious about the suggestion to > use a recursive descent parser. ;-) Assuming you are, I'll need a lot > of help--I'm willing, and somewhat interested in looking into it, but I > tried it before and didn't get very far.
Well, parsers/lexers (and their generation) is a wide area. My "bible" for this is the "dragon book" ("COMPILERS Principles, Techniques and Tools", 1986/87, Aho/Sethi/Ullmann). However, I'm not sure how far you wanna go with that when it comes to your actual problem. I haven't followed the entire thread, so I have to admit I don't know what the original problem was.
For syntax highlighting in Scintilla you usually only need a lexer to tokenize the input, not more, as there is usually a 1:1 relationship between a token type and a text color. There might be cases where the meaning (and therefor the color) of a keyword changes depending on the context (say, using "private" in a class declaration or as a variable name, just as example).
For text folding a bit more work is necessary but still you probably don't need a full parser, but just a technique to get the work done. For simpler (normal?) cases you just can work with a loop and a case switcher. I used that approach for the MySQL lexer in Scintilla, though I had to add some context tracking to solve ambiguities. My suggestion to use a recursive-decent parser was rather meant as an idea what technique to use instead of if/then cascades or a big switch.
> At one point in time during my sojourn into writing a lexer for > Scintilla (or maybe even back when I wrote some similar things for > Nedit and Kate), I did some level of investigation of pre-packaged > solutions (like lex and yacc), and then tried to understand recursive > descent parsers (which, iiuc, is not what lex or yacc are).
It's been a while since I last looked at the source code of lex and yacc, so I cannot say what type they use, however yacc *produces* table-based parsers (lex just generates the lexer, aka tokenizer). As an example for a parser generator that generates recursive-decent parsers I'd like to mention ANTLR.
> At the time, I had a lot of trouble with all kinds of things, acronyms > (like LALR??), the concept of a context free grammar, whether the > Foswiki / TWiki markup is context-free (not remembering for sure, I > think I eventually made an uneducated guess that it is not context > free), and whether a recursive descent parser for the language would or > would not (be guaranteed) to terminate.
Is that what you are after, writing a parser for your wiki language? I ask because that is probably way off topic on this mailing list.
> The first step I'm going to take is to re-read the Wikipedia article on > recursive descent parsers > (http://en.wikipedia.org/wiki/Recursive_descent_parser), and try to > understand the first example program there, written in C. (Can you > imagine how well I did at that when I hardly understood C at all, and > I'm not sure I'm much better now. Well, actually, I feel like I > understand 2000% more than I did a few months ago, but that might still > not be very much.)
:-) Well, I have seen many articles that used an abstract language to show the code (usually Pascal-like, as Pascal was once developed as language for teaching that kind of stuff), but I'm not sure that would help you better. The key point for an RD parser is quite simple:
read a token from the lexer -> test this against some condition -> call a function or maybe only a branch in a switch statement to handle this token (could be recursive, hence the name). In the context of Scintilla that could be an action like increasing the folding level. So that is not rocket science.
Not sure if that helps you but I have written a parser generator (based on lex/yacc) some years ago which is still available from my dev web site (first link in the sig). It comes with an editor for *.l and *.y files, so that might help you to get started with those grammars. I also have worked with ANTLR a while ago and still have a full parser grammar for MSVC RC files in one of my repositories (with all bells and whistles like macro support, unicode support, an evaluator etc. etc.). So just let me know if you want a copy.
> I haven't followed the entire thread, so I have to admit I don't know what the original problem was.
Was assistance on writing a Scintilla Lexer for a wiki language, so Randy is ok on this list AFAIK.
> For syntax highlighting in Scintilla you usually only need a lexer to tokenize the input, not more, as there is usually a 1:1 relationship between a token type and a text color. There might be cases where the meaning (and therefor the color) of a keyword changes depending on the context (say, using "private" in a class declaration or as a variable name, just as example).
Yep agree.
> For text folding a bit more work is necessary but still you probably don't need a full parser, but just a technique to get the work done. For simpler (normal?) cases you just can work with a loop and a case switcher. I used that approach for the MySQL lexer in Scintilla, though I had to add some context tracking to solve ambiguities. My suggestion to use a recursive-decent parser was rather meant as an idea what technique to use instead of if/then cascades or a big switch.
Yep, but then you didn't use it either because its overkill for the task :-)
> It's been a while since I last looked at the source code of lex and yacc, so I cannot say what type they use, however yacc *produces* table-based parsers (lex just generates the lexer, aka tokenizer). As an example for a parser generator that generates recursive-decent parsers I'd like to mention ANTLR.
The problem is that these don't like starting in the middle of the file the way Scintilla lexers do, re-processing the whole file is a bit wasteful.
> Is that what you are after, writing a parser for your wiki language? I ask because that is probably way off topic on this mailing list.
As I said above, he is in the right place AFAICT.
I'm sure you meant to be helpful Mike, but when you didn't know the problem I think your advice has confused the situation. The advice above about how you wrote your lexer would have been much more helpful and is in line with advice from other people, so Randy won't get so confused :-)
> 2. Scintilla does not parse a file from beginning to end, it can start > anywhere, since the parser methods above generate state as they parse, > they cannot start at an arbitrary point
To be clear: AFAIK, Scintilla always lexes the document from the start. There is no other way to know in which state it is in the middle of the file. Perhaps you wanted to mean that a Scintilla lexer can resume lexing from the middle of the document because it can look at the state just before, keeping the information in memory.
-- Philippe Lhoste -- (near) Paris -- France -- http://Phi.Lho.free.fr -- -- -- -- -- -- -- -- -- -- -- -- -- --
On 1 May 2011 07:45, Philippe Lhoste <Phi...@gmx.net> wrote:
> On 30/04/2011 00:51, Lex Trotman wrote:
>> 2. Scintilla does not parse a file from beginning to end, it can start >> anywhere, since the parser methods above generate state as they parse, >> they cannot start at an arbitrary point
> To be clear: AFAIK, Scintilla always lexes the document from the start. > There is no other way to know in which state it is in the middle of the > file. > Perhaps you wanted to mean that a Scintilla lexer can resume lexing from the > middle of the document because it can look at the state just before, keeping > the information in memory.
Yes, I was only thinking about the re-lexing as edits occur (after all thats most of the time lexing will occur). Scintilla tells the lexer from where to re-lex from. And the saved state used by most current Scintilla lexers is minimal, often the style of the character before is sufficient. In particular they don't keep a recursive descent stack at every character or every token of the file, which was the real point. Recursive descent and Yacc parsers are interested in much more detail and so have much more state, and so are much less appropriate for restarting at an arbitrary point.
> -- > Philippe Lhoste > -- (near) Paris -- France > -- http://Phi.Lho.free.fr > -- -- -- -- -- -- -- -- -- -- -- -- -- --
> -- > You received this message because you are subscribed to the Google Groups > "scintilla-interest" group. > To post to this group, send email to scintilla-interest@googlegroups.com. > To unsubscribe from this group, send email to > scintilla-interest+unsubscribe@googlegroups.com. > For more options, visit this group at > http://groups.google.com/group/scintilla-interest?hl=en.
Thanks to everybody who has responded to my question!
I will probably have some followup questions at some point, but for now, I sort of took the weekend off, with some "mild" reviewing of things like lex, yacc, and recursive descent parsers. I may spend a little more time there, just for my own general learning.
It is often the case that when I have trouble with something, if I set it aside for a while (like a year, 18 months, whatever ;-) I can often come back to it and fairly quickly understand the things I had trouble with on my previous reviews.
So, I seem to be doing better there, and want to spend a little more time there before I get back to the lexer.
When I get back to the lexer, I'll review that library that somebody mentioned for use with unicode / UTF-8 (among other things, I'll need to confirm it works with Windows), and then decide whether to use it, and whether to handle unicode / UTF-8 TWikiWords now, or make provisions for the future, or what.
Based on what I decide there, I think I can decide whether to use a switch / case or "chained" if elses.
Thanks! Randy Kramer
On and before Saturday 30 April 2011 08:26:24 pm several people wrote ---< good stuff, which I've snipped for now >---