Yeah so the branches are a mess.
Master is old and not in use. Just like you I learned everything at once (JFlex, Java etc.) so the first version of the parser was a total mess.
Brittle as all hell and I felt I didn't really have any control over it (fix one thing and another one broke). So I started a rewrite of the parser in a separate branch while also making releases in the master branch.
I ended up copying stuff over in some really stupid ways, so even though their history is much closer than implied by the git message they can't really be merged, since they have a lot of the same commits but with different SHAs. :(
Considering the state it's in I'd need to force push phpstorm-61 as master but I haven't done that since I'm not sure if that would totally break stuff for someone.
For help learning this stuff I had two primary sources: the tutorials you found (which are kind of old now) and the Handlebars plugin by dmarcotte (https://github.com/dmarcotte/idea-handlebars).
There's also the Lua plugin I mention in the readme (https://github.com/juzna/intellij-latte) that I learned some parser stuff from.
I had a look in the forums yesterday and the activity levels seem to be a lot lower than when I was working on the plugin, unfortunately.
During the time I spent on it you could often get answers from the JetBrains devs about stuff, which was a huge help.
The later versions of IntelliJ IDEA have a built-in decompiler so now you can actually read the source from PhpStorm, but before that all the PHP stuff was just wild guesses and trying to figure out how the Symfony2/Twig plugin does things.
Now you can at least read the source but figuring out how the more advanced stuff works is really hard and it was one of the reasons development kind of died off.
I do remember reading some JFlex tutorial about how that stuff works but can't remember where I found it. Actually it might have been this:
http://jflex.de/manual.html#Example
The basics of it are that you have different states starting with YYINITIAL.
YYINITIAL will match either <% or $ or {.
When it matches <% it takes two steps back (yypushback(2)) and then switches state to SS_BLOCK_START which starts at line 141.
Everything in that section can be matched when the lexer is in the SS_BLOCK_START state.
Since we moved everything back two steps it will now read <% again, but since it's in SS_BLOCK_START it will now match the token definition, which is SS_BLOCK_START= <% on line 66.
Reading this code again after about a year and a half I realize that one of the confusing things about this is that I sometimes use the same name for a state and for a token definition.
That's something I'd really change since it's confusing enough as it is.
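To make the pushback step a bit more concrete, here's a rough plain-Java sketch of the idea. It's not the generated JFlex lexer and not the plugin's code: the class PushbackSketch and its methods are made up for illustration, and only YYINITIAL, SS_BLOCK_START and the yypushback(2) idea come from the description above.

```java
// Hypothetical illustration of "match, push back, switch state".
// PushbackSketch is a made-up class, not JFlex output and not the plugin's code.
final class PushbackSketch {
    enum State { YYINITIAL, SS_BLOCK_START }

    private final String input;
    private int pos = 0;                      // index of the next unread character
    private State state = State.YYINITIAL;

    PushbackSketch(String input) {
        this.input = input;
    }

    State state() { return state; }
    int position() { return pos; }

    // Same idea as JFlex's yypushback(n): un-read the last n matched characters.
    private void pushback(int n) {
        pos -= n;
    }

    /** Runs one lexer step and returns a short description of what happened. */
    String step() {
        if (!input.startsWith("<%", pos)) {
            throw new IllegalStateException("this sketch only handles <% at position " + pos);
        }
        pos += 2;                             // both states begin by matching "<%"
        if (state == State.YYINITIAL) {
            pushback(2);                      // hand the two characters back
            state = State.SS_BLOCK_START;     // let the block state read them again
            return "switched to SS_BLOCK_START, no token";
        }
        return "token SS_BLOCK_START";        // token shares its name with the state
    }
}
```

Calling step() twice on "<% else %>" first switches state without moving the position, then consumes <% as a token, which is the two-pass behaviour described above.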
Anyways, what you need to do is threefold:
1. Write token definitions, which are either a somewhat limited form of regex or a literal string (lines 55-91).
2. Write state definitions for the lexer (lines 93-111).
3. Write the actual states (lines 113-302), and that's the tricky part.
The states are a stack that you can either push a state onto or pop a state off of returning you to the previous state.
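The stack idea can be sketched in a few lines of plain Java. This is only an illustration under my own assumptions: StateStackSketch, pushState and popState are hypothetical names, not JFlex built-ins and not necessarily what the plugin's lexer calls them.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical helper showing the push/pop idea for lexer states.
final class StateStackSketch {
    private final Deque<String> states = new ArrayDeque<>();

    StateStackSketch() {
        states.push("YYINITIAL");             // the lexer always starts here
    }

    /** The state the lexer is currently in (top of the stack). */
    String current() {
        return states.peek();
    }

    /** Enter a new state while remembering the one we came from. */
    void pushState(String next) {
        states.push(next);
    }

    /** Leave the current state and return to the previous one. */
    void popState() {
        if (states.size() == 1) {
            throw new IllegalStateException("already in YYINITIAL, nothing to pop");
        }
        states.pop();
    }
}
```

Pushing SS_BLOCK_START and then popping it leaves you back in YYINITIAL, and the guard in popState makes an unbalanced pop fail loudly instead of silently leaving the lexer in a strange state.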
An easy example would be <% else %>.
<% will first put you in the SS_BLOCK_START state.
From there it matches <% again, this time as SS_START_KEYWORD, and returns a token type for that.
Next it matches else as SS_ELSE_KEYWORD and returns a token type for that.
And last but not least it matches %> as SS_BLOCK_END and returns the token type for that.
SS_BLOCK_END also pops the current state (SS_BLOCK_START) off the stack and puts you back in YYINITIAL.
Please note that I excluded the whitespace matches.
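The whole <% else %> walk-through can be simulated by hand in plain Java. Again this is only a sketch under assumptions: ElseBlockWalkthrough and tokenize are made-up names, the token names are returned as plain strings, whitespace is skipped instead of being returned as a token, and the real lexer is generated by JFlex rather than written like this.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical hand-written simulation of the "<% else %>" example.
final class ElseBlockWalkthrough {

    /** Returns the token names in the order the sketch lexer would emit them. */
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Deque<String> states = new ArrayDeque<>();
        states.push("YYINITIAL");
        int pos = 0;

        while (pos < input.length()) {
            if (states.peek().equals("YYINITIAL")) {
                if (!input.startsWith("<%", pos)) {
                    throw new IllegalArgumentException("sketch only handles <% ... %>, stuck at " + pos);
                }
                // Matched "<%" and pushed it back, so pos stays put; only the state changes.
                states.push("SS_BLOCK_START");
            } else if (Character.isWhitespace(input.charAt(pos))) {
                pos++;                         // whitespace left out, as in the text
            } else if (input.startsWith("<%", pos)) {
                tokens.add("SS_START_KEYWORD");
                pos += 2;
            } else if (input.startsWith("else", pos)) {
                tokens.add("SS_ELSE_KEYWORD");
                pos += 4;
            } else if (input.startsWith("%>", pos)) {
                tokens.add("SS_BLOCK_END");
                pos += 2;
                states.pop();                  // back to YYINITIAL
            } else {
                throw new IllegalArgumentException("unexpected input at " + pos);
            }
        }

        if (!states.peek().equals("YYINITIAL")) {
            throw new IllegalStateException("lexer did not return to YYINITIAL");
        }
        return tokens;
    }
}
```

tokenize("<% else %>") gives [SS_START_KEYWORD, SS_ELSE_KEYWORD, SS_BLOCK_END]. An unterminated input such as "<% else" trips the final check, which is a cheap way to catch a state that never gets popped while you're still experimenting.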
If you for some reason write code that fails to return to the initial state then you're screwed and it'll crash the IDE running the plugin.
There's a plugin called PsiViewer that you can install, which lets you inspect the results.
In PsiViewer you'll see the whole token tree, which will help you see if you got the results you wanted/expected.
It also helps to find the names of tokens in PHP or YAML when you need to work with those.
This got really long and there's more to discuss with those who are interested.
Should we discuss the more complicated technical stuff somewhere else?