Remove trailing comments exercise

Csaba Gabor

Nov 4, 2009, 6:51:10 AM

I'm looking for a
function stripEndComments(code) {
// remove trailing comments and whitespace from
/* the end of code, which is presumed to be valid
// javascript */
... }

My previous post at
http://groups.google.com/group/comp.lang.javascript/browse_frm/thread/2aa9a60623eb5883/
may amount to more than just an exercise, so I am
slicing off part of it into an independent exercise
(and this one IS just an exercise).

Assume the use of the function
function checkSyntax(code) {
  // returns false if code is not syntactically OK
  // returns browser's (string) interpretation of the code if it's OK,
  // encapsulated in an anonymous function
  try {
    var f = new Function(code);
    return f.toString();
  } catch (err) { return false; } // syntax error
}
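
A minimal sketch of how checkSyntax behaves, assuming an engine where new Function parses its argument without running it (the two sample inputs are illustrative, not from the post):

```javascript
// checkSyntax restated from the post so the sketch runs on its own
function checkSyntax(code) {
  try {
    var f = new Function(code);   // parses only; throws on a syntax error
    return f.toString();
  } catch (err) { return false; }
}

// A trailing comment is harmless: the result is a non-empty string
var ok = checkSyntax("foo + bar // a comment");

// Two identifiers in a row on one line are a syntax error
var bad = checkSyntax("foo + bar x y");

console.log(typeof ok, bad);
```

Only the false vs. non-empty-string distinction is portable; the exact text returned by toString() differs between engines.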

Some examples:
foo + bar // two comments /* or one? *//
=> foo + bar

"Foo" + "bar" /* three */ // lines
// of comments /* should all be
/* stripped off *////
=> "Foo" + "bar"

For the rambunctious: remove trailing empty statements, too:
code = "baz/* junk */+borf; fubar ; /* more junk */ ; ;; ;"
=> baz/* junk */+borf; fubar

Csaba Gabor from Vienna

SAM

Nov 4, 2009, 7:11:21 AM

On 11/4/09 12:51 PM, Csaba Gabor wrote:

> I'm looking for a
> function stripEndComments(code) {
> // remove trailing comments and whitespace from
> /* the end of code, which is presumed to be valid
> // javascript */
> ... }

(...)

> For the rambunctious: remove trailing empty statements, too:
> code = "baz/* junk */+borf; fubar ; /* more junk */ ; ;; ;"
> => baz/* junk */+borf; fubar

I get,
Firefox.3 :
baz + borf;
fubar;
IE.5, 6 and 7 :

baz/* junk */+borf; fubar ; /* more junk */ ; ;; ;

not yet finished ?

--
sm

Csaba Gabor

Nov 4, 2009, 7:37:12 AM

On Nov 4, 1:11 pm, SAM <stephanemoriaux.NoAd...@wanadoo.fr.invalid>
wrote:
> On 11/4/09 12:51 PM, Csaba Gabor wrote:

Hi SAM, what you have shown is what FF/IE returns if
you put the mentioned strings into a function and then
do a .toString() on it. FF cleans all comments
whereas IE leaves them in.

However, in this exercise, I'd like to strip the TRAILING
comments only, in as browser-independent a fashion
as possible (without recasting the code string into
a different form). The part to the right of the
=> above indicates the string that the desired
function, stripEndComments, should return.
Therefore, you can use checkSyntax as a false vs.
nonempty-string check, but I don't think you'll find
the actual nonempty string return values useful for the
purposes of this exercise.

Richard Cornford

Nov 4, 2009, 7:37:17 AM

On Nov 4, 11:51 am, Csaba Gabor wrote:
> I'm looking for a
> function stripEndComments(code) {
> // remove trailing comments and whitespace from
> /* the end of code, which is presumed to be valid
> // javascript */
> ... }
>

> My previous post at ...

> may amount to more than just an exercise, so I am
> slicing off part of it into an independent exercise
> (and this one IS just an exercise).
>
> Assume the use of the function
> function checkSyntax(code) {
> // returns false if code is not syntactically OK
> // returns browser's (string) interpretation of the code if it's
> OK,
> // encapsulated in an anonymous function
> try {
> var f = new Function(code);
> return f.toString(); }
> catch (err) { return false; } } // syntax error
>
> Some examples:
> foo + bar // two comments /* or one? *//
> => foo + bar
>
> "Foo" + "bar" /* three */ // lines
> // of comments /* should all be
> /* stripped off *////
> => "Foo" + "bar"
>
> For the rambunctious: remove trailing empty statements, too:
> code = "baz/* junk */+borf; fubar ; /* more junk */ ; ;; ;"
> => baz/* junk */+borf; fubar

This problem includes the problem of not reacting to comment
delimiters whenever they appear in strings in the source code. For
example, stripping everything from the // to the end of the line in
the following would be disastrous:-

var prefixToIRI = {
'xsd':'http://www.w3.org/2001/XMLSchema',
'env':'http://schemas.xmlsoap.org/soap/envelope/',
'xsi':'http://www.w3.org/2001/XMLSchema-instance',
'xml':'http://www.w3.org/XML/1998/namespace',
'xmlns':'http://www.w3.org/2000/xmlns'
};

So for this task it seems necessary to identify the string literals
within the source, which is getting towards tokenising the source.
Tokenising the source was already implied in the task of verifying the
syntax of the code (along with identifying comments) so maybe this
stage should not be separated from the previous task if you genuinely
want all the comments removed.
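
A small sketch of the failure described above, assuming a naive stripper that deletes everything from the first // to the end of each line (naiveStripLineComments is a hypothetical name used only for illustration):

```javascript
// Hypothetical naive approach: no awareness of string literals
function naiveStripLineComments(code) {
  return code.replace(/\/\/.*$/gm, "");
}

var line = "'xsd':'http://www.w3.org/2001/XMLSchema',";
var stripped = naiveStripLineComments(line);

// The URL is cut off inside the string literal, leaving invalid source
console.log(stripped);
```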

Richard.

Csaba Gabor

Nov 4, 2009, 8:01:46 AM

On Nov 4, 1:37 pm, Richard Cornford <Rich...@litotes.demon.co.uk>
wrote:

I don't want to strip all comments, just those at the
very tail end of the code string (as the 3rd example suggests).
For example:
foo(); // comment1
bar(); // comment2
=>
foo(); // comment1
bar()

> var prefixToIRI = {
> 'xsd':'http://www.w3.org/2001/XMLSchema',
> 'env':'http://schemas.xmlsoap.org/soap/envelope/',
> 'xsi':'http://www.w3.org/2001/XMLSchema-instance',
> 'xml':'http://www.w3.org/XML/1998/namespace',
> 'xmlns':'http://www.w3.org/2000/xmlns'
>
> };
>
> So for this task it seems necessary to identify the string literals
> within the source, which is getting towards tokenising the source.

Hopefully, we can stay away from tokenising. If we do have to
enter the business of tokenising (in any substantive way) to
solve this problem, it would no longer be an exercise.
Perhaps it is better to use the browser's embedded parser to help out.

> Tokenising the source was already implied in the task of verifying the
> syntax of the code (along with identifying comments) so maybe this
> stage should not be separated from the previous task if you genuinely
> want all the comments removed.

Removing all the comments would seem to be a messier
problem (which I haven't thought about in this context).
I've done this (removed all comments) in the past for
PHP code, and it was around 60 lines of somewhat
intricate code (in parsing the original code string).
But I do not advocate such an approach for this exercise.

> Richard

Stevo

Nov 4, 2009, 8:23:41 AM

Csaba Gabor wrote:
> I'm looking for a
> function stripEndComments(code) {
> // remove trailing comments and whitespace from
> /* the end of code, which is presumed to be valid
> // javascript */
> ... }
>
>
> My previous post at
> http://groups.google.com/group/comp.lang.javascript/browse_frm/thread/2aa9a60623eb5883/
> may amount to more than just an exercise, so I am
> slicing off part of it into an independent exercise
> (and this one IS just an exercise).

Why are you talking about this as an exercise all the time? Is that your
way of getting people to write your code for you? Pretend it's just
an abstract exercise for fun?

SAM

Nov 4, 2009, 8:58:59 AM

On 11/4/09 1:37 PM, Csaba Gabor wrote:

> On Nov 4, 1:11 pm, SAM <stephanemoriaux.NoAd...@wanadoo.fr.invalid>
> wrote:

>> On 11/4/09 12:51 PM, Csaba Gabor wrote:

>>
>>> I'm looking for a
>>> function stripEndComments(code) {
>>> // remove trailing comments and whitespace from
>>> /* the end of code, which is presumed to be valid
>>> // javascript */
>>> ... }
>> (...)
>>> For the rambunctious: remove trailing empty statements, too:
>>> code = "baz/* junk */+borf; fubar ; /* more junk */ ; ;; ;"
>>> => baz/* junk */+borf; fubar
>> I get,
>> Firefox.3 :
>> baz + borf;
>> fubar;
>> IE.5, 6 and 7 :
>> baz/* junk */+borf; fubar ; /* more junk */ ; ;; ;
>>
>> not yet finished ?
>
> Hi SAM, what you have shown is what FF/IE returns if
> you put the mentioned strings into a function and then
> do a .toString() on it. FF cleans all comments
> whereas IE leaves them in.

Yes (the function checkSyntax() you've given).

> However, in this exercise, I'd like to strip the TRAILING
> comments only, in an as browser independent fashion
> as possible (without recasting the code string into
> a different form).

javascript:alert("baz/* junk */+borf; fubar ; /* more junk */ ; ;; ; ;;".replace(/\/[/\*][^\*]+\*\/|\s+|\s*;(?=\s*;)/g,''))

==> baz+borf;fubar;;

can't remove the last ';'

> The part to the right of the
> => above indicates the string that the desired
> function, stripEndComments, should return.
> Therefore, you can use checkSyntax as a false vs.
> nonempty-string check, but I don't think you'll find
> the actual nonempty string return values useful for the
> purposes of this exercise.

(not yet understood what is "the" purpose ... comments no ... but yes)

javascript:alert("baz/* junk */+borf; fubar ; /* more junk */ ; ;; ; ;;".replace(/\/[/\*][^\*]+\*\/(?=\s*;)|\s+|;(?=\s*;)/g,''))

==> baz/*junk*/+borf;fubar;;

--
sm

abozhilov

Nov 4, 2009, 12:59:39 PM

On Nov 4, 13:51, Csaba Gabor <dans...@gmail.com> wrote:
> I'm looking for a
> function stripEndComments(code) {
> // remove trailing comments and whitespace from
> /* the end of code, which is presumed to be valid
> // javascript */
> ... }

Something like this?

code.replace(/\s*;[\s;]*/g, ';\n').replace(/^\/(?:\/[^\n]+|\*[^\/*]*?\*\/)/gm, '');

Csaba Gabor

Nov 4, 2009, 3:06:33 PM

On Nov 4, 6:59 pm, abozhilov <fort...@gmail.com> wrote:

> On Nov 4, 13:51, Csaba Gabor <dans...@gmail.com> wrote:
>
> > I'm looking for a
> > function stripEndComments(code) {
> > // remove trailing comments and whitespace from
> > /* the end of code, which is presumed to be valid
> > // javascript */
> > ... }
>
> Something like this?
>
> code.replace(/\s*;[\s;]*/g, ';\n').replace(/^\/(?:\/[^\n]+|\*[^\/*]*?\*\/)/gm, '');

You might be able to figure out a way to do this
with regular expressions, but I'm thinking that
it will be VERY messy because you will have to
account for strings and regular expressions such as:
var code = "var messy='it was windy/*sunny*'+" and */cold/*"

The first part of your code fails on:
var code = "var semi=' ; ; ; '";

While the second replace fails on
var code = "var k=i + j /* // */";
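
The two counterexamples above can be checked with a short sketch. It restates checkSyntax so it runs on its own; the variable names are illustrative only:

```javascript
function checkSyntax(code) {
  try { return new Function(code).toString(); }
  catch (err) { return false; }
}

// First replace: semicolons inside a string literal are rewritten too
var semiCode = "var semi=' ; ; ; '";
var semiResult = semiCode.replace(/\s*;[\s;]*/g, ';\n');
// The string literal now contains a raw newline, so the code no longer parses
var semiStillValid = checkSyntax(semiResult);

// Both replaces: the second is anchored at line start (^),
// so a trailing comment later on the line is never touched
var kCode = "var k=i + j /* // */";
var kResult = kCode.replace(/\s*;[\s;]*/g, ';\n')
                   .replace(/^\/(?:\/[^\n]+|\*[^\/*]*?\*\/)/gm, '');

console.log(JSON.stringify(semiResult), semiStillValid, kResult === kCode);
```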

Thomas 'PointedEars' Lahn

Nov 4, 2009, 3:58:42 PM

Csaba Gabor wrote:

> abozhilov wrote:
>> Csaba Gabor wrote:
>> > // remove trailing comments and whitespace from
>> > /* the end of code, which is presumed to be valid
>> > // javascript */
>> > ... }
>>
>> Something like this?
>>
>> code.replace(/\s*;[\s;]*/g, ';\n').replace(/^\/(?:\/[^\n]+|\*[^\/*]*?\*\/)/gm, '');
>
> You might be able to figure out a way to do this
> with regular expressions, but I'm thinking that
> it will be VERY messy

How fortunate then that you don't know what you are talking about.
It is rather easy to do if you do it properly. For example:

code = code.replace(
/('(?:[^']|\\')*')|("(?:[^"]|\\")*")|(\/\/.*)|(\s+$)/gm,
function(m, p1, p2, p3, p4) {
return (p3 || p4) ? "" : m;
});

> because you will have to
> account for strings and regular expressions such as:
> var code = "var messy='it was windy/*sunny*'+" and */cold/*"

The concatenation here is rather pointless. Any tokenizer or parser will
see this equivalent to

var code = "var messy='it was windy/*sunny*' and */cold/*"

And

var messy='it was windy/*sunny*' and */cold/*

is not syntactically correct to begin with. Which also points out that
there is no regular expression here.

PointedEars
--
var bugRiddenCrashPronePieceOfJunk = (
navigator.userAgent.indexOf('MSIE 5') != -1
&& navigator.userAgent.indexOf('Mac') != -1
) // Plone, register_function.js:16

Thomas 'PointedEars' Lahn

Nov 4, 2009, 6:25:43 PM

Thomas 'PointedEars' Lahn wrote:

> Csaba Gabor wrote:
>> [...] you will have to account for strings and regular expressions such

>> as:
>> var code = "var messy='it was windy/*sunny*'+" and */cold/*"

^ ^ ^
> The concatenation here is rather pointless. [...]

In fact, there is no concatenation here because it ...

> is not syntactically correct to begin with. Which also points out that
> there is not Regular Expression here.

PointedEars
--
Anyone who slaps a 'this page is best viewed with Browser X' label on
a Web page appears to be yearning for the bad old days, before the Web,
when you had very little chance of reading a document written on another
computer, another word processor, or another network. -- Tim Berners-Lee

Dr J R Stockton

Nov 4, 2009, 6:29:35 PM

In comp.lang.javascript message <7766145b-786d-478a-8a6e-08f2e27826ba@l2
g2000yqd.googlegroups.com>, Wed, 4 Nov 2009 03:51:10, Csaba Gabor
<dan...@gmail.com> posted:

>I'm looking for a
>function stripEndComments(code) {
> // remove trailing comments and whitespace from
> /* the end of code, which is presumed to be valid
> // javascript */
> ... }

Whitespace is trivial.

You must recognise strings, and not count // or /* within them.
You must allow for RegExp literals such as /slash=\//.
Remove all /* ... */ comment; or only if last on one line?

--
(c) John Stockton, Surrey, UK. ?@merlyn.demon.co.uk Turnpike v6.05 MIME.
Web <URL:http://www.merlyn.demon.co.uk/> - FAQish topics, acronyms, & links.
Proper <= 4-line sig. separator as above, a line exactly "-- " (SonOfRFC1036)
Do not Mail News to me. Before a reply, quote with ">" or "> " (SonOfRFC1036)

Csaba Gabor

Nov 4, 2009, 8:31:27 PM

On Nov 4, 9:06 pm, Csaba Gabor <dans...@gmail.com> wrote:
> On Nov 4, 6:59 pm, abozhilov <fort...@gmail.com> wrote:
> > On Nov 4, 13:51, Csaba Gabor <dans...@gmail.com> wrote:
>
> > > I'm looking for a
> > > function stripEndComments(code) {
> > > // remove trailing comments and whitespace from
> > > /* the end of code, which is presumed to be valid
> > > // javascript */
> > > ... }
>

> You might be able to figure out a way to do this
> with regular expressions, but I'm thinking that
> it will be VERY messy because you will have to
> account for strings and regular expressions such as:

> var code = "var messy='it was windy/*sunny*'+" and */cold/*"

Oops, I see I've made a transcription error. It should read:
var code = "var messy='it was windy/*sunny*'+' and */cold/*'"

But the following may be slightly more interesting:
var code =
"var mess='it\\'s windy//*sunny*'+' & */cold/*' //asdf"

Thomas 'PointedEars' Lahn

Nov 4, 2009, 9:26:47 PM

Csaba Gabor wrote:

> On Nov 4, 9:06 pm, Csaba Gabor <dans...@gmail.com> wrote:
>> You might be able to figure out a way to do this
>> with regular expressions, but I'm thinking that
>> it will be VERY messy because you will have to
>> account for strings and regular expressions such as:
>>
>> var code = "var messy='it was windy/*sunny*'+" and */cold/*"
>
> Oops, I see I've made a transcription error. It should read:
> var code = "var messy='it was windy/*sunny*'+' and */cold/*'"

Still no RegExp here:

var messy='it was windy/*sunny* and */cold/*'
^ ^

> But the following may be slightly more interesting:
> var code =
> "var mess='it\\'s windy//*sunny*'+' & */cold/*' //asdf"

You are still on the wrong track.

var mess='it\\'s windy//*sunny* & */cold/*' //asdf
^ ^

It is really merely an issue to recognize and ignore string literals first,
then to recognize and ignore RegExp initializers outside of them. My
replace function already implements the former; adapting it to also take
care of the latter is left as an exercise to the reader.

PointedEars
--
Prototype.js was written by people who don't know javascript for people
who don't know javascript. People who don't know javascript are not
the best source of advice on designing systems that use javascript.
-- Richard Cornford, cljs, <f806at$ail$1$8300...@news.demon.co.uk>

Lasse Reichstein Nielsen

Nov 5, 2009, 1:19:04 AM

Thomas 'PointedEars' Lahn <Point...@web.de> writes:

> Csaba Gabor wrote:
>
>> abozhilov wrote:
>>> Csaba Gabor wrote:
>>> > // remove trailing comments and whitespace from
>>> > /* the end of code, which is presumed to be valid
>>> > // javascript */
>>> > ... }

...

> How fortunate then that you don't know what you are talking about.
> It is rather easy to do if you do it properly. For example:
>
> code = code.replace(
> /('(?:[^']|\\')*')|("(?:[^"]|\\")*")|(\/\/.*)|(\s+$)/gm,
> function(m, p1, p2, p3, p4) {
> return (p3 || p4) ? "" : m;
> });

The ('(?:[^']|\\')*') part fails to recognize the end of the following
string literal:
'foo \\'
and will match up to the next "'". Ditto for double-quoted strings.
Try
('(?:[^'\\]|\\[^])*')
(Here I'm also allowing backslash-newline in string literals, even
though it's not in the standard, otherwise replace "[^]" with ".").

And it's easy to add standard (not-single-line) comments as well:
(\/\*(?:[^*]*\*+)*\/)

This only works in the absence of regexp literals.
RegExps are harder to recognize, because it's the syntactic starting
point that distinguishes the starting slash from a division.
E.g.,
/foo + 42/g
might be a RegExp, if occurring in an expression context, but not
if it occurs where an operator is expected:
bar/foo + 42/g
(I.e., it's not tokenizable without context information).

And if you can't recognize regexps, you can mess up the recognition
of comments and strings as well.

/L
--
Lasse Reichstein Holst Nielsen
'Javascript frameworks is a disruptive technology'

Csaba Gabor

Nov 5, 2009, 5:20:06 AM

On Nov 5, 7:19 am, Lasse Reichstein Nielsen <lrn.unr...@gmail.com>
wrote:

> Thomas 'PointedEars' Lahn <PointedE...@web.de> writes:
> > Csaba Gabor wrote:
>
> >> abozhilov wrote:
> >>> Csaba Gabor wrote:
> >>> > // remove trailing comments and whitespace from

> >>> > /* the end of code, which is presumed to be valid

> >>> > // javascript */

> >>> > ... }
> ...
> > How fortunate then that you don't know what you are talking about.
> > It is rather easy to do if you do it properly. For example:
>
> > code = code.replace(
> > /('(?:[^']|\\')*')|("(?:[^"]|\\")*")|(\/\/.*)|(\s+$)/gm,
> > function(m, p1, p2, p3, p4) {
> > return (p3 || p4) ? "" : m;
> > });
>
> The ('(?:[^']|\\')*') part fails to recognize the end of the following
> string literal:
> 'foo \\'
> and will match up to the next "'". Ditto for double-quoted strings.
> Try
> ('(?:[^'\\]|\\[^])*')
> (Here I'm also allowing backslash-newline in string literals, even
> though it's not in the standard, otherwise replace "[^]" with ".").

Very interesting. I've not seen that [^] construct in
javascript before. With a PHP regular expression if ] is
the first character following the ^ in a character class,
it means to exclude the right closing bracket ]. Evidently,
PHP's [^]] translates to [^\]] in JS

> And it's easy to add standard (not-single-line) comments as well:
> (\/\*(?:[^*]*\*+)*\/)

Or: (\/\*.*?(?=\*\/)..)
though I have not extensively tested it

> This only works in the absence of regexp literals.
> RegExps are harder to recognize, because it's the syntactic starting
> point that distinguishes the starting slash from a division.
> E.g.,
> /foo + 42/g
> might be a RegExp, if occuring in an expression context, but not
> if it occurs where an operator is expected:
> bar/foo + 42/g
> (I.e., it's not tokenizable without context information).
>
> And if you can't recognize regexps, you can mess up the
> recognition of comments and strings as well.

Indeed. Thanks for that nice reply Lasse. I would be highly
curious to see a reg exp variant developed to completion.
Perhaps there should be a separate 'Remove all comments' thread.

My solution to the 'Remove trailing comments' exercise follows.
My reason in posing the exercise was to highlight that in the
best spirit of programming, one may use the browser's syntax
checking capabilities to do the heavy lifting, rather than
having to parse the entire code string manually.

Reminder, I only want to remove the final comments at the end of
the code, and not at the end of each line. In short, I want to
be able to get at the last code that actually "does something"
(or might be doing something).

After getting rid of trailing whitespace and vacuous lines,
we consider that there are exactly three situations. The final
characters are either:
1) Part of a comment started by //
2) The end of a comment started by /*
3) Not a comment

How to test for this (and what to do when we know which case)?

checkSyntax(code + ' x y') will pass iff case 1 holds
and we have a // style comment. In that situation find
the previous //, strip the final / and perform the test
(on the stripped version). If it passes, recurse (since
we're still in the comment). If it fails, strip off one
more character from the end (the first / of the // pair),
and recurse on that. We can't be too greedy in the
passes case because we may have situations like ///

If case 1, above, does not hold, and the code does not
end with */, then it is evidently not part of a comment,
so it is case 3, and we are done.

Otherwise, find the prior /*. It is either the start
of the comment or in the middle of it. To test for
this, replace the /*...*/ with */
If this passes the syntax check, then we are still
in the middle of a comment, so we recurse on the just
tested string. Otherwise, we're at the start of a
comment so recurse on the just tested string less the
final two characters.

Here's the code:
function stripEndComments(code) {
  // Trim trailing comments from code
  // First trim whitespace and vacuous statements
  code = code.replace(/(\s*;)*\s*$/,"");

  // Next check for double slash type of comment at end
  if (checkSyntax(code + ' x y')) {
    var pos=code.lastIndexOf("//"),
        cS = checkSyntax(code.substr(0,pos+1) + ' x y');
    return stripEndComments(code.substr(0,pos+!!cS)); }

  // In this next case there are no more trailing comments
  if (code.substr(-2)!="*/") return code;

  // Here deal with /* ... /* ... */ comments:
  // replace the final /*...*/ with */ and test that, as described above
  var c = code.substr(0,code.lastIndexOf("/*")) + "*/";
  return stripEndComments(c.substr(0,c.length-2*!checkSyntax(c)));
}

Csaba Gabor from Vienna

Csaba Gabor

Nov 5, 2009, 7:15:42 AM

On Nov 5, 11:20 am, Csaba Gabor <dans...@gmail.com> wrote:
> On Nov 5, 7:19 am, Lasse Reichstein Nielsen <lrn.unr...@gmail.com>
> wrote:
> > Thomas 'PointedEars' Lahn <PointedE...@web.de> writes:
> > > Csaba Gabor wrote:
>
> > >> abozhilov wrote:
> > >>> Csaba Gabor wrote:
> > >>> > // remove trailing comments and whitespace from
> > >>> > /* the end of code, which is presumed to be valid
> > >>> > // javascript */
> > >>> > ... }

> My solution to the 'Remove trailing comments' exercise follows.

> My reason in posing the exercise was to highlight that in the
> best spirit of programming, one may use the browser's syntax
> checking capabilities to do the heavy lifting, rather than
> having to parse the entire code string manually.
>
> Reminder, I only want to remove the final comments at the end of
> the code, and not at the end of each line. In short, I want to
> be able to get at the last code that actually "does something"
> (or might be doing something).
>
> After getting rid of trailing whitespace and vacuous lines,
> we consider that there exactly three situations. The final
> characters are either:
> 1) Part of a comment started by //
> 2) The end of a comment started by /*
> 3) Not a comment

Slightly revised code:

function stripEndComments(code) {
  // Trim trailing comments from code
  // First trim whitespace and vacuous statements
  code = code.replace(/[\s;]*\s*$/,"");

  // Next check for double slash type of comment at end
  if (checkSyntax(code + ' x y')) {
    var pos=code.lastIndexOf("//"),
        cS = checkSyntax(code.substr(0,pos+1) + ' x y');
    return stripEndComments(code.substr(0,pos+!!cS)); }

  // In this next case there are no more trailing comments
  if (code.substr(code.length-2)!="*/") return code;

  // Here deal with /* ... /* ... */ comments:
  // replace the final /*...*/ with */ and test that
  var c = code.substr(0,code.lastIndexOf("/*")) + "*/";
  return stripEndComments(c.substr(0,c.length-2*!checkSyntax(c)));
}

What changed:
code.substr(-2) => code.substr(code.length-2)
since some IEs do not like a negative argument to .substr()
(the trimming regex also changed from /(\s*;)*\s*$/ to /[\s;]*\s*$/)
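
A minimal sketch of running the revised function on the examples from this thread. It restates checkSyntax and stripEndComments so that it runs on its own, under one assumption that is labeled in the code: the last step appends "*/" before the syntax check, which is what the prose description of that step ("replace the /*...*/ with */") calls for. The nested /* input and the use of Array.prototype.map are illustrative choices, not part of the original posts.

```javascript
function checkSyntax(code) {
  try { return new Function(code).toString(); }
  catch (err) { return false; }
}

function stripEndComments(code) {
  // Trim trailing whitespace and empty statements
  code = code.replace(/[\s;]*\s*$/, "");

  // Case 1: the end of the code is inside a // comment
  if (checkSyntax(code + " x y")) {
    var pos = code.lastIndexOf("//"),
        cS = checkSyntax(code.substr(0, pos + 1) + " x y");
    return stripEndComments(code.substr(0, pos + !!cS));
  }

  // Case 3: no trailing comment left
  if (code.substr(code.length - 2) != "*/") return code;

  // Case 2: the code ends a /* ... */ comment.
  // Assumption: "*/" is appended here, per the prose description;
  // if the result parses, the cut fell inside the comment, so keep going.
  var c = code.substr(0, code.lastIndexOf("/*")) + "*/";
  return stripEndComments(c.substr(0, c.length - 2 * !checkSyntax(c)));
}

var examples = [
  'foo + bar // two comments /* or one? *//',
  '"Foo" + "bar" /* three */ // lines\n// of comments /* should all be\n/* stripped off *////',
  'baz/* junk */+borf; fubar ; /* more junk */ ; ;; ;',
  'foo(); // comment1\nbar(); // comment2',
  'foo() /* a /* b */'            // a "/*" nested inside a block comment
];
var results = examples.map(function (src) { return stripEndComments(src); });

for (var i = 0; i < results.length; i++) {
  console.log(JSON.stringify(results[i]));
}
```

The trick relies on the engine's parser to decide whether a cut point is still inside a comment, so no string or RegExp literal has to be recognized by hand.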

SAM

Nov 5, 2009, 8:13:17 AM

On 11/5/09 11:20 AM, Csaba Gabor wrote:

>
> Very interesting. I've not seen that [^] construct in
> javascript before. With a PHP regular expression if ] is
> the first character following the ^ in a character class,
> it means to exclude the right closing bracket ]. Evidently,
> PHP's [^]] translates to [^\]] in JS

Inside a character class [ ], the characters '(', ')' and '[' do not
have to be backslashed;
only the closer ']' has to be.

Other characters that may have to be escaped there:
o '-' except if it is at the very end
(e.g. [m-s-] : one character from m to s, or the sign -)
o '+' does not: it is literal anywhere in a class
(e.g. [+ms] : character m or s or +)

>> And it's easy to add standard (not-single-line) comments as well:
>> (\/\*(?:[^*]*\*+)*\/)
>
> Or: (\/\*.*?(?=\*\/)..)
> though I have not extensively tested it

It all depends on the way you code ...

var reg = /(\/\*.*?(?=\*\/))/g;
var reg = new RegExp('(/\\*.*?(?=\\*/))','g');

<https://developer.mozilla.org/en/Core_JavaScript_1.5_Reference/Global_Objects/RegExp>

var myString = 'some blah /* comment ?!; comment-2 /|\ */ + no comment';
myString = myString.replace(reg, '');
alert(myString);

But that Regexp doesn't work ...
This one is a little better :
var reg = new RegExp('(/\\*[^*]*\\*/)','g');

alert(myString.replace(/(\/\*[^*]*\*\/)/g,''));
or :
alert(myString.replace(/\/\*[^*]*\*\//g,''));

Of course, this RegExp doesn't work with :
myString = 'some blah /* comment?!; comment-2* /|\ */ + no comment';
where one '*' is introduced in the comment.

alert(myString.replace(/\/\*([^*]|\*(?!\/))+\*\//g,''));
OK (for "that" string !)

> Reminder, I only want to remove the final comments at the end of
> the code,

$ : to tell it's the end

> and not at the end of each line. In short, I want to
> be able to get at the last code that actually "does something"
> (or might be doing something).
>
> After getting rid of trailing whitespace and vacuous lines,
> we consider that there exactly three situations. The final
> characters are either:
> 1) Part of a comment started by //
> 2) The end of a comment started by /*
> 3) Not a comment

var reg = /[\/\s][\/*][^};]*(?![};])$/g;

var strg = 'var f = function(){ foo(); /* comment */} //no se';
alert(strg.replace(reg,''));

var strg = 'var f = function(){ foo(); /* comment */} /*no se*/';
alert(strg.replace(reg,''));

both ==> var f = function(){ foo(); /* comment */}

var strg = 'var f = function(){ foo(); // comment \n} /*no se*/';
alert(strg.replace(reg,''));
==>
var f = function(){ foo(); // comment
}

var strg = 'var f = function(){ foo(); // comment **\n} /*no se*/';
alert(strg.replace(reg,''));
==>
var f = function(){ foo(); // comment **
}

Not tested with IE ...

You can try your regexps and your strings here:
<http://www.regextester.com/>
<http://www.google.com/search?q=tester+regex>
<http://stephane.moriaux.pagesperso-orange.fr/truc/js_regexp_testeur>
--
sm

Thomas 'PointedEars' Lahn

Nov 5, 2009, 4:18:16 PM

Lasse Reichstein Nielsen wrote:

> Thomas 'PointedEars' Lahn <Point...@web.de> writes:
>> Csaba Gabor wrote:
>>> abozhilov wrote:
>>>> Csaba Gabor wrote:
>>>> > // remove trailing comments and whitespace from
>>>> > /* the end of code, which is presumed to be valid
>>>> > // javascript */
>>>> > ... }

> ...
>> How fortunate then that you don't know what you are talking about.
>> It is rather easy to do if you do it properly. For example:
>>
>> code = code.replace(
>> /('(?:[^']|\\')*')|("(?:[^"]|\\")*")|(\/\/.*)|(\s+$)/gm,
>> function(m, p1, p2, p3, p4) {
>> return (p3 || p4) ? "" : m;
>> });
>
> The ('(?:[^']|\\')*') part fails to recognize the end of the following
> string literal:
> 'foo \\'
> and will match up to the next "'". Ditto for double-quoted strings.

Not here (Iceweasel 3.5.4, JavaScript 1.8.1). Have you used "'foo \\'" or
"'foo \\\\'" for the test? Because the latter is the representation of
'foo \\' in a string value, while "'foo \\'" as a string value represents
the syntactically invalid 'foo \' (which is why it must be matched up to the
next apostrophe to be a string literal).

/* 'foo \\' */
var code = "'foo \\\\' '";

/* ["'foo \\'", "'foo \\'"] */
/('(?:[^']|\\')*')/.exec(code)

If I am overlooking something, can you explain why the recognition of this
string literal should fail?

> [...]

> And it's easy to add standard (not-single-line) comments as well:
> (\/\*(?:[^*]*\*+)*\/)
>
> This only works in the absence of regexp literals.
> RegExps are harder to recognize, because it's the syntactic starting
> point that distinguishes the starting slash from a division.
> E.g.,
> /foo + 42/g
> might be a RegExp, if occuring in an expression context, but not
> if it occurs where an operator is expected:
> bar/foo + 42/g
> (I.e., it's not tokenizable without context information).
>
> And if you can't recognize regexps, you can mess up the recognition
> of comments and strings as well.

Thank you. I am working on an ECMAScript-compliant source code parser and
you have given me quite something to think about.

PointedEars
--
Danny Goodman's books are out of date and teach practices that are
positively harmful for cross-browser scripting.
-- Richard Cornford, cljs, <cife6q$253$1$8300...@news.demon.co.uk> (2004)

Lasse Reichstein Nielsen

Nov 6, 2009, 12:36:09 PM

Thomas 'PointedEars' Lahn <Point...@web.de> writes:

> Lasse Reichstein Nielsen wrote:
>> The ('(?:[^']|\\')*') part fails to recognize the end of the following
>> string literal:
>> 'foo \\'
>> and will match up to the next "'". Ditto for double-quoted strings.
>
> Not here (Iceweasel 3.5.4, JavaScript 1.8.1). Have you used "'foo \\'" or
> "'foo \\\\'" for the test? Because the latter is the representation of
> 'foo \\' in a string value, while "'foo \\'" as a string value represents
> the syntactically invalid 'foo \' (which is why it must be matched up to the
> next apostrophe to be a string literal).

(I'll write all strings as string literals from here, to (try to) avoid
confusion).

To be honest, I didn't test it, and my argument for why it didn't
work was wrong because of that.
It still doesn't work, but for the opposite reason to my initial guess:
it doesn't exclude "\\'" from ending the string literal, whereas I had
guessed that it wouldn't correctly recognize "\\\\'" as ending it.

Try:

var code = "'abc\\'def'";
// I.e., code contains a single string literal: 'abc\'def'
var re = /('(?:[^']|\\')*')/g;
alert(re.exec(code)[0]);

It alerts the string "'abc\\'", i.e., it does end at the first
"'", even if the quote is escaped.

The reason it does so is that [^'] matches backslash as well, and
with a higher priority than what comes after, so it matches the
backslash as well.

The immediate fix of swapping the alternatives:
var re = /('(?:\\'|[^'])*')/g;
and giving \\' priority over [^'], will match "\\'" as a non-string-ender,
but will also ignore "\\\\'". It's necessary to know whether there is an
even number of backslashes before the quote in order to know whether it's
escaped or not. The RegExp below is the simplest one I have found to do that.

> /* 'foo \\' */
> var code = "'foo \\\\' '";
>
> /* ["'foo \\'", "'foo \\'"] */
> /('(?:[^']|\\')*')/.exec(code)
>
> If I am overlooking something, can you explain why the recognition of this
> string literal should fail?

It works. It's the escaped backslash before a quote that fails:
"'foo \\\\' + 'bar'"

...

> Thank you. I am working on an ECMAScript-compliant source code parser and
> you have given me quite something to think about.

Glad to be of service :)
ECMAScript syntax is ... interesting. Context-dependent lexing combined
with semicolon-insertion gives ample room to make mistakes :)

var b=2,g=1;
var a = 84
/b/g; // <- it's division :)

Csaba Gabor

Nov 7, 2009, 5:21:39 AM

On Nov 6, 6:36 pm, Lasse Reichstein Nielsen <lrn.unr...@gmail.com>
wrote:

> Thomas 'PointedEars' Lahn <PointedE...@web.de> writes:
> > Lasse Reichstein Nielsen wrote:
> var re = /('(?:[^']|\\')*')/g;
> alert(re.exec(code)[0]);
>
> It alerts the string "'abc\\'", i.e., it does end at the first
> "'", even if the quote is escaped.

The above recognizes from one single quote to
the final single quote in the string. One may
just as well write var re = /('.*?')/g

> The reason it does so is that [^'] matches backslash as well, and
> with a higher priority than what comes after, so it matches the
> backslash as well.
>
> The immediate fix of swapping the alternatives:
> var re = /('(?:\\'|[^'])*'/g;

The above recognizes from a single quote to either the next
single quote not preceded by a backslash if such a single
quote exists; else to the last single quote. To observe:

var code = "abc'def\\'ghi'jkl\\\\'mno\\\\'pqr";
var re = /'(?:\\'|[^'])*'/g
alert (code.replace(re, "XXX"));

> and giving \\' priority over [^'], will match "\\'" as a non-string-ender,
> but will also ignore "\\\\'". It's necessary to know whether there is an
> even number of backslashes before the quote in order to know whether it's
> escaped or not. The RegExp below is the simplest one I have found to do that.

Lasse, to me, the RegExp below looks identical to the first
one above. So in the absence of me seeing it, here is a
regular expression that recognizes single quoted strings.
It will match from a single quote to the next single quote
not preceded by an odd number of backslashes.

var re = /'(?:\\.|[^\\'])*'/g
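
A minimal sketch to check that expression against the two cases discussed above. The helper name firstSingleQuoted and the sample inputs are made up for illustration, and console.log stands in for alert so that it also runs outside a browser.

```javascript
// The expression above: either an escaped character (\\.) or any
// character that is neither a backslash nor a single quote ([^\\']).
var re = /'(?:\\.|[^\\'])*'/;

// Hypothetical helper: returns the first single-quoted literal, or null.
function firstSingleQuoted(code) {
  var m = re.exec(code);
  return m ? m[0] : null;
}

// Source text 'abc\'def' : the escaped quote must not end the literal.
console.log(firstSingleQuoted("'abc\\'def'"));

// Source text 'foo \\' + 'bar' : the escaped backslash must not hide
// the quote that really closes the literal.
console.log(firstSingleQuoted("'foo \\\\' + 'bar'"));
```

The first call should give back the whole input, the second only the part up to and including the quote after the two backslashes.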

> > /* 'foo \\' */
> > var code = "'foo \\\\' '";
>
> > /* ["'foo \\'", "'foo \\'"] */
> > /('(?:[^']|\\')*')/.exec(code)

> Glad to be of service :)

> ECMAScript syntax is ... interesting. Context depending lexing combined
> with semicolon-insertion gives ample room to make mistakes :)
>
> var b=2,g=1;
> var a = 84
> /b/g; // <- it's division :)

This is highly interesting, where the interpretation of that
final line also depends on what comes before it. For example:

var b=2,g=1;
var a = 84;

/b/g; // <- it's a regular expression :)

or

whole(truth) /b+c/g; // division
vs.
while(truth) /b+c/g; // RegExp
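
The 84 example can be checked by compiling both variants with new Function, the same trick checkSyntax uses. This is only a sketch; the names withoutSemicolon and withSemicolon are made up for illustration.

```javascript
// The same three lines, without and with the semicolon after 84.
var withoutSemicolon = new Function(
  "var b = 2, g = 1;\n" +
  "var a = 84\n" +
  "/b/g;\n" +
  "return a;");
var withSemicolon = new Function(
  "var b = 2, g = 1;\n" +
  "var a = 84;\n" +
  "/b/g;\n" +
  "return a;");

console.log(withoutSemicolon()); // 42: parsed as 84 / b / g
console.log(withSemicolon());    // 84: /b/g is a separate RegExp statement
```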

I wonder about other examples of (non embedded) code being
interpreted differently depending on what precedes it.

Also, while your example of [^] works on my FF1.5, it does
not compile on my IE 6. I.e., adding
var re=/[^]/;
results in an error message from IE.
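
If the intent of [^] is "any single character, line terminators included", a more widely used spelling is [\s\S]. That reading of the intent is an assumption, and the sketch below has not been checked against IE 6 here.

```javascript
// [\s\S] matches whitespace or non-whitespace, i.e. any single character,
// including the line terminators that "." skips.
var anyChar = /[\s\S]/;

console.log(anyChar.test("\n")); // true
console.log(/./.test("\n"));     // false
```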

Csaba Gabor

Nov 7, 2009, 6:15:51 AM

On Nov 7, 11:21 am, Csaba Gabor <dans...@gmail.com> wrote:
> On Nov 6, 6:36 pm, Lasse Reichstein Nielsen <lrn.unr...@gmail.com>
> wrote:
>
> > Thomas 'PointedEars' Lahn <PointedE...@web.de> writes:
> > > Lasse Reichstein Nielsen wrote:
> > var re = /('(?:[^']|\\')*')/g;
> > alert(re.exec(code)[0]);
>
> > It alerts the string "'abc\\'", i.e., it does end at the first
> > "'", even if the quote is escaped.
>
> The above recognizes from one single quote to
> the final single quote in the string. One may
> just as well write var re = /('.*?')/g

final => next. Sorry about that

If the ? in the RegExp I supplied is omitted, then
it captures till the final single quote

VK

Nov 7, 2009, 12:34:54 PM

Thomas 'PointedEars' Lahn wrote:
> It is really merely an issue to recognize and ignore string literals first,
> then to recognize and ignore RegExp initializers outside of them. My
> replace function already implements the former; adapting it to also take
> care of the latter is left as an exercise to the reader.

Your replace function so far converts a syntactically correct source
into syntactically incorrect one:
/foobar//foobar
comes to
/foobar
which is "unterminated regular expression literal"
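
The effect is easy to reproduce with a deliberately naive stripper. The sketch below is not the replace function under discussion; naiveStripLineComments and isSyntaxOk are hypothetical stand-ins, the latter being checkSyntax from the start of the thread reduced to a boolean.

```javascript
// Same idea as checkSyntax: let the engine decide whether the code parses.
function isSyntaxOk(code) {
  try {
    new Function(code);
    return true;
  } catch (err) {
    return false; // syntax error
  }
}

// Hypothetical naive stripper: removes everything from the first "//"
// on each line, without looking at strings or RegExp literals.
function naiveStripLineComments(code) {
  return code.replace(/\/\/.*$/gm, "");
}

var source = "/foobar//foobar"; // RegExp /foobar/ divided by foobar
var stripped = naiveStripLineComments(source);

console.log(stripped);             // what is left after stripping
console.log(isSyntaxOk(source));   // the original parses
console.log(isSyntaxOk(stripped)); // the stripped version does not
```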

P.S. It is a bit of fun to watch people making a robust parser
algorithm for an algorithmically unparseable matter. But keep going, I
have more...

Thomas 'PointedEars' Lahn

Nov 7, 2009, 2:04:58 PM

VK wrote:

> Thomas 'PointedEars' Lahn wrote:
>> It is really merely an issue to recognize and ignore string literals
>> first, then to recognize and ignore RegExp initializers outside of them.
>> My replace function already implements the former; adapting it to also
>> take care of the latter is left as an exercise to the reader.
>
> Your replace function so far converts a syntactically correct source
> into syntactically incorrect one:
> /foobar//foobar
> comes to
> /foobar
> which is "unterminated regular expression literal"

If you had paid attention, you would have known that I am aware of the
RegExp issue.

> P.S. It is a bit of fun to watch people making a robust parser
> algorithm for an algorithmically unparseable matter.

It is not algorithmically unparseable. Otherwise there would be no script
engine that accepts RegExp initializers, would there? The context in which
`/' is not recognized as the start of a RegExp initializer is grammatically
well-defined, and if you had cared to read the Specification you would have
known.

> But keep going, I have more...

You would.

PointedEars
--
Use any version of Microsoft Frontpage to create your site.
(This won't prevent people from viewing your source, but no one
will want to steal it.)
-- from <http://www.vortex-webdesign.com/help/hidesource.htm> (404-comp.)

Lasse Reichstein Nielsen

Nov 8, 2009, 10:38:36 AM

Csaba Gabor <dan...@gmail.com> writes:

[correct description of how the regexps work]

>> and giving \\' priority over [^'], will match "\\'" as a non-string-ender,
>> but will also ignore "\\\\'". It's necessary to know whether there is an
>> even number of backslashes before the quote in order to know whether it's
>> escaped or not. The RegExp below is the simplest one I have found to do that.
>
> Lasse, to me, the RegExp below looks identical to the first
> one above. So in the absence of me seeing it, here is a
> regular expression that recognizes single quoted strings.
> It will match from a single quote to the next single quote
> not preceded by an odd number of backslashes.
>
> var re = /'(?:\\.|[^\\'])*'/g

My mistake. The "RegExp below" that I was referring to was one that I
had written in a double-quoted message, but I managed to remove that
quote before posting.

It was indeed equivalent to the one you wrote here (I think it had the
alternatives in the opposite order, but that's not important since they
are mutually exclusive).

>> var b=2,g=1;
>> var a = 84
>> /b/g; // <- it's division :)
>
> This is highly interesting, where the interpretation of that
> final line also depends on what comes before it. For example:
>
> var b=2,g=1;
> var a = 84;
> /b/g; // <- it's a regular expression :)
>
> or
>
> whole(truth) /b+c/g; // division
> vs.
> while(truth) /b+c/g; // RegExp
>
> I wonder about other examples of (non embedded) code being
> interpreted differently depending on what precedes it.

There are a few:
An object literal, {foo: 42}, is also a valid statement block
with a labeled expression statement. In an expression context,
it can only be the object literal, in a statement context, it
can only be the statement block, and since expressions can be
statements (ExpressionStatement) there is a rule that says that
an ExpressionStatement cannot begin with "{" (or "function").
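
A small sketch of the difference, using eval so the same characters can be handed to the parser in both contexts. The variable names are made up for illustration.

```javascript
// Statement context: "{" opens a block, foo: is a label and 42 is an
// expression statement, so eval returns the block's completion value.
var asStatement = eval("{foo: 42}");

// Expression context: the parentheses force an object literal.
var asExpression = eval("({foo: 42})");

console.log(typeof asStatement);  // "number"
console.log(typeof asExpression); // "object"
console.log(asExpression.foo);    // 42
```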

> Also, while your example of [^] works on my FF1.5, it does
> not complile on my IE 6. Ie. adding
> var re=/[^]/;
> results in an error message from IE.

Tsk, tsk. :)