"next identifier" parsing proposal

Broofa

Jan 19, 2008, 12:22:46 AM
to jgrousedoc
Hey Denis, I'd like to raise the issue of source code parsing again.
I know this has been discussed before, but our code has reached a size
and rate of change where manually ensuring the names of functions and
vars are kept up to date is simply not practical anymore. I'm finding
or hearing about incorrect names on an almost daily basis.

The frustrating part is that 99% of these problems would go away if
there were an easy way to tell jGD to use the next identifier it sees
in the source code.

I know, I know, jGD doesn't "do" source code. But I'm not suggesting
it do anything fancy - merely that it use a couple simple regex rules
to find whatever identifier might follow the comment. To illustrate
how simple and effective this technique can be, and hopefully provide
a useful tool to jGD users, I've put together the "jgdlint" script (in
the Files section of this group). The script is easy to use: just pass
it the names of the files that you want it to check, and it will do some
basic (very basic) regex manipulation to find the next identifier.
Once it has that, it compares it with the name in the @function/@var
tag and, if the two differ, it logs a short message to stdout. Here's
an example of its output:

localhost> jgdlint ZenSet.js

In ZenSet.js:
- "notifyAdded" (line 115) != "notifyItemAdded" (line 119)
- "notifyRemoved" (line 126) != "notifyItemRemoved" (line 130)
- "notifyUPDATED" (line 137) != "notifyUpdated" (line 139)
- "_attach" (line 211) != "_detach" (line 213)

Found 4 (possible) problems in 25 tags
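
To make the idea concrete, here is a minimal sketch of this kind of check in JavaScript. It is not the actual jgdlint script: the regexes, the keyword list, and the function names (`firstIdentifier`, `lint`) are assumptions made up for illustration, and it only handles @function/@var tags followed by simple declarations.

```javascript
// Hypothetical sketch of a jgdlint-style check; not the real script.
// Matches "@function name" or "@var {modifiers} name" inside a comment.
const TAG = /@(?:function|var)\s+(?:\{[^}]*\}\s*)?([A-Za-z_$][\w$.]*)/;
// Matches an identifier, optionally dotted (e.g. Foo.bar).
const IDENT = /[A-Za-z_$][\w$]*(?:\.[A-Za-z_$][\w$]*)*/;

// Returns the local name of the first identifier on a code line,
// skipping a few keywords, or null if the line has no identifier.
function firstIdentifier(line) {
  const code = line.replace(/\b(?:var|function|this)\b\.?/g, ' ');
  const match = IDENT.exec(code);
  return match ? match[0].split('.').pop() : null;
}

// Compares each tag name with the next identifier found after the comment.
function lint(source) {
  const problems = [];
  let pending = null;    // tag still waiting for its identifier
  let inComment = false;
  source.split('\n').forEach((line, index) => {
    if (line.includes('/*')) inComment = true;
    if (inComment) {
      const tag = TAG.exec(line);
      if (tag) pending = { name: tag[1].split('.').pop(), line: index + 1 };
      if (line.includes('*/')) inComment = false;
      return;
    }
    if (!pending) return;
    const found = firstIdentifier(line);
    if (!found) return;  // blank or punctuation-only line, keep looking
    if (found !== pending.name) {
      problems.push(`"${pending.name}" (line ${pending.line}) != "${found}" (line ${index + 1})`);
    }
    pending = null;
  });
  return problems;
}

const sample = [
  '/**',
  ' * @function notifyAdded',
  ' */',
  'notifyItemAdded: function(item) {},',
  '',
  '/** @var {private Number} _count */',
  '_count: 0',
].join('\n');

console.log(lint(sample));
// [ '"notifyAdded" (line 2) != "notifyItemAdded" (line 4)' ]
```

Only local names are compared here, so a tag such as "Foo.bar" is checked against "bar".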

When I ran this over our current codebase it reported that 64 of 929
tags (7%) were wrong, and only one of those turned out to be a false
positive. I think this demonstrates two important points:
1. Even with engineers who are pretty good at maintaining
documentation, a substantial amount of our code was incorrect.
(Frankly, I'm *damn* good at maintaining docs, and I was amazed at how
many of those problems were my fault!)
2. A simple, regex-based, "next identifier" approach to finding
names works surprisingly well.

Having said that, I have a feature request: Allow us to comment our
code using a special name or token meaning, "use the next identifier
found in the source code as the name". For example, we might use "-"
for this, as follows:

@function - this function's name is parsed from code
@var {private Number} - this var's name is parsed from code

There are two elegant aspects to this. First, this is an _optional_
feature. Users don't have to use it if they don't want to or it
doesn't work - jGD will continue to operate exactly as it already
does. And second, it's a wonderfully easy system to explain, which
means users will actually understand where it will and will not be
appropriate. Ironically, this may actually provide a better
experience than jsdoc or other grammar-based tools offer. But don't
quote me on that. :-)

denis.r...@gmail.com

Jan 19, 2008, 4:34:26 PM
to jgrousedoc
Hmm

I am not a big fan of this approach, but I am ready to incorporate
into jGrouseDoc a Java function that would return the next identifier.
The function would receive a string with start and stop indices and
should return a string with the identifier (the full name, like
com.foo.Bar, if a full name was specified).

The algorithm should be able to work with code styles used in major JS
libraries, like Dojo, jQuery, ExtJS, MooTools and Prototype.

Once such function is provided, I promise to integrate it :-)

Regards,
Denis

Robert Kieffer

Jan 19, 2008, 10:01:28 PM
to jgrou...@googlegroups.com
I think we're still coming at this problem from very different perspectives. By asking for a function that parses fully qualified identifier paths out of source code, you're basically sending us on a Holy Grail quest. It simply does not exist, and never will due to the nature of JavaScript. Even Rhino-based parsers fail in this regard, and they have a complete parse tree at their disposal. Thus, to say jGD will only address the parsing problem when that problem is solved is much like sending the Knights of the Round Table out on a holy crusade; it's a good way of distracting a lot of power-hungry guys with big swords, but not likely to produce much in the way of cups you can drink out of. :-D

If instead we ask, "what is causing jGD users the most pain?", the problem takes on a very different look. The pain in jGD right now is in maintaining the names of namespace _members_ - i.e. the names used in @function/@var/@event/@ifunction/@property comments - not in namespaces (@class/@struct/@object/@interface comments). There are two reasons for this:

  - There tend to be 10-20x more members than namespaces. (e.g. we have ~100 @classes, but ~1000 @functions and @vars)
  - Member names change 4-5x more often than namespace names

Together, this means that for every change to a class name, there are roughly 40-100 changes to function and variable names (10-20x as many members, each changing 4-5x as often).

I draw this distinction between members and namespaces because in practice members are almost always documented using just the local names, while namespaces are documented with the full name. If you're documenting a class, you're gonna start out with a "@class Foo.Bar.MyClass", so you don't have to prefix the name of each function and var with "Foo.Bar.MyClass". Users are strongly motivated to work this way if for no other reason than the amount of typing it saves.

Thus, if jGD can parse **local** names, it solves the naming problem for namespace members, which is 95% of the problem. And parsing local names out of source code is a much, MUCH simpler problem.

If jGD can tell users, "Just type a '-' to use the next local name found in your source code", it gives them a tool that ...
  - Solves most of the parsing problems they care about
  - Is simple and easy to understand in terms of where/when it can be applied (which means it can be used or ignored as they see fit)
  - Is trivial to implement (and customize, if based on regex rules in a config file)

denis.r...@gmail.com

Jan 19, 2008, 10:14:48 PM
to jgrousedoc
Hey, hey, no need to make it more complex than it should be!

My only point regarding full names was that if a full name was
specified, it should be returned as a whole. For example, for the code

/**
@var -
*/
com.foo.bar.myVar = 'bla'

the function in question should return 'com.foo.bar.myVar'

but, for example, if only a local name had been specified

{
....
/**
@var -
*/
myVar : 'bla',
...
}
then only 'myVar' should be returned. The point is that '.' must be
treated as part of the identifier name - that's it!

jGrouseDoc would be responsible for tracking namespaces, etc
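
A minimal sketch of that behavior, written in JavaScript for brevity even though the function requested in this thread is a Java one. The name `nextIdentifier` and the regex are illustrative assumptions, not jGrouseDoc code; the sketch takes a string plus start and stop indices and simply lets '.' be part of the identifier. It makes no attempt to skip keywords such as `var`.

```javascript
// Hypothetical sketch: '.' counts as part of the identifier.
const DOTTED_IDENTIFIER = /[A-Za-z_$][\w$]*(?:\.[A-Za-z_$][\w$]*)*/;

// Returns the first (possibly dotted) identifier found between the
// start index (inclusive) and stop index (exclusive), or null.
function nextIdentifier(source, start, stop) {
  const match = DOTTED_IDENTIFIER.exec(source.slice(start, stop));
  return match ? match[0] : null;
}

const full = "/**\n@var -\n*/\ncom.foo.bar.myVar = 'bla'";
const local = "/**\n@var -\n*/\nmyVar : 'bla',";
const afterComment = (s) => s.indexOf('*/') + 2; // first index past the comment

console.log(nextIdentifier(full, afterComment(full), full.length));    // com.foo.bar.myVar
console.log(nextIdentifier(local, afterComment(local), local.length)); // myVar
```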

Regards,
Denis

Robert Kieffer

Jan 19, 2008, 10:48:31 PM
to jgrou...@googlegroups.com
*heh* sorry, didn't mean to overreact - it's just that the full name issue is where other tools have gone astray.  I think it's much simpler for users if there's absolutely no ambiguity about the behavior of the "-" shortcut and how namespaces are defined.  i.e. We should be telling users, "Use jGD comments to specify namespaces, and use '-' (if you want) to parse out the local name."

If "-" sometimes returns a local name and sometimes returns a full name, and if the parse logic that's used is complex enough that a user can't tell when it will do one vs. the other, then it's just going to be confusing instead of useful.  Moreover, in the cases where the full-name parse is wrong - and it *will* be wrong much of the time - "-" is useless to the user, even if all they needed was the local name.  Here are just a few cases to illustrate my point:

   /* @class - Namespace should be "MyClass", not "window.MyClass" */
   window.MyClass = function() {...
   var p = MyClass.prototype;

   /* @var - Name should just be "myVar", not "p.myVar" */
   p.myVar = 'foo';

   /* @function - I'm not at all sure how this should be parsed */
   MyClass.getInstance().myFunction = function() {...

If you really think full names are required, perhaps a second shortcut may be in order - a double-dash ("--") maybe? But I think the whole fullname thing is a bit of a wild goose chase.  The parse logic is more complex and will be wrong as often as it's right, and it's just not needed anywhere near as often.
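
For comparison, a local-name rule stays unambiguous on all three of the cases above. The following is a minimal sketch, assuming the rule "take the identifier immediately to the left of the first '=' or ':'"; the name `localName` and the regex are made up for illustration, and declarations without an assignment (such as `function foo() {}`) are deliberately out of scope.

```javascript
// Hypothetical rule: the local name is the identifier immediately
// before the first '=' or ':' on the line.
const BEFORE_ASSIGNMENT = /([A-Za-z_$][\w$]*)\s*[=:]/;

function localName(line) {
  const match = BEFORE_ASSIGNMENT.exec(line);
  return match ? match[1] : null;
}

console.log(localName('window.MyClass = function() {'));                   // MyClass
console.log(localName("p.myVar = 'foo';"));                                // myVar
console.log(localName('MyClass.getInstance().myFunction = function() {')); // myFunction
```

The rule never has to decide what "window", "p", or "getInstance()" mean, which is why it is easy to explain to users.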

Cheers,

- rwk

Broofa

Jan 20, 2008, 7:55:23 AM
to jgrousedoc
Side note: It would be interesting/useful to allow the "-" shortcut
in conjunction with a full namespace path. E.g:

/* @var Foo.MyClass.- Manually specify the path, but parse the local
name */

denis.r...@gmail.com

Jan 20, 2008, 9:41:49 AM
to jgrousedoc
OK, point taken. I was thinking from my own programming patterns :-)
Let's settle for a function that returns the local name.

Regards,
Denis

denis.r...@gmail.com

Jan 20, 2008, 3:35:44 PM
to jgrousedoc
And regarding the shortcut - I believe that ? would be a better choice,
since it is typically used to indicate a placeholder for a parameter

Broofa

Feb 1, 2008, 9:20:46 AM
to jgrousedoc
I like your suggestion for the "?" character as the token to use. I'd
like to propose a couple of variations on that to consider:
- If a tag has no identifier and no description then default to the
parsed identifier. E.g. the following would get "foobar" as the
identifier name:
/** @var */
var foobar = ...

/** @function {private} */
function foobar(...

- Allow the ? to be mixed with namespace paths. E.g. the following
would have an identifier of "Some.Namespace.foobar":
/** @function {private} Some.Namespace.? */
function foobar(...
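
The two variations above, together with the plain "?" token, could be resolved along these lines. This is only a sketch of the proposal, not jGrouseDoc's behavior: the `resolveName` function and the `{ name, description }` tag shape are hypothetical, and `parsed` stands for whatever local name the parser found after the comment.

```javascript
// Hypothetical sketch of how a tag's final name could be chosen.
// tag:    { name, description } as written in the comment (assumed shape)
// parsed: the local identifier found in the code after the comment
function resolveName(tag, parsed) {
  const name = tag.name || '';
  const description = tag.description || '';
  if (name === '' && description === '') return parsed; // bare tag
  if (name === '?') return parsed;                      // explicit placeholder
  if (name.endsWith('.?')) return name.slice(0, -1) + parsed; // namespace path + placeholder
  return name;                                          // name given by hand
}

console.log(resolveName({ name: '', description: '' }, 'foobar'));   // foobar
console.log(resolveName({ name: '?', description: '' }, 'foobar'));  // foobar
console.log(resolveName({ name: 'Some.Namespace.?' }, 'foobar'));    // Some.Namespace.foobar
console.log(resolveName({ name: 'explicitName' }, 'foobar'));        // explicitName
```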

In other news, I implemented a Java version of jgdlint. This includes
a "parseIdentifier" method that I believe meets your description above
for parsing a file line-by-line to locate and return an identifier.
Thus, I've opened an enhancement ticket for this feature, and attached
the jgdlint.java file there:

http://code.google.com/p/jgrousedoc/issues/detail?id=68

The usage for this is pretty much the same as before, "java jgdlint
[file1.js [file2.js [etc...]]]". The identifier and jgd tag parsing
logic are a bit smarter than the Bash version I posted previously.
But, of course, the jgd tag parsing code is just a placeholder in lieu
of what jGD actually does.

FWIW, I ran this on our codebase of 200 files. Of the 1127 tags it
checked, it found 45 errors where there were discrepancies between the
parsed identifier and the identifier declared in the comment. Of
those, all but one were legitimate mistakes in our documentation.
i.e. the parser has a ~0.09% false-positive rate (with the caveat that
it ignores namespace-level tags - @class/@interface/@object/@struct -
where its false-positive rate is significantly higher. But those
aren't as important for the reasons I mentioned previously.)

I also added a check for properties that look like they should be
private (i.e. are prefixed with a "_"), but that don't have a
'private' modifier. It found 133 of those.
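
A check like this one is short to express. The sketch below is written in JavaScript and is not taken from jgdlint.java; the `missingPrivate` function and the `{ name, modifiers }` tag shape are assumptions for illustration.

```javascript
// Hypothetical sketch: flag tags whose local name starts with "_"
// but whose modifier list does not include "private".
// Each tag is assumed to look like { name: 'Foo._bar', modifiers: ['Number'] }.
function missingPrivate(tags) {
  return tags.filter((tag) => {
    const local = tag.name.split('.').pop();
    return local.startsWith('_') && !tag.modifiers.includes('private');
  });
}

const tags = [
  { name: '_attach', modifiers: [] },
  { name: '_count', modifiers: ['private', 'Number'] },
  { name: 'notifyUpdated', modifiers: [] },
];

console.log(missingPrivate(tags).map((tag) => tag.name)); // [ '_attach' ]
```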

(Anyhow we could _really_ use this feature, like, today. Those 45
errors? All of them were introduced in the last two weeks!)

denis.r...@gmail.com

Feb 2, 2008, 12:17:21 PM
to jgrousedoc
Thanks for the contribution. I have started integrating this
functionality into jGrouseDoc.

One of the caveats that I have noticed is that it won't work in cases
where you use special functions to create new classes, such as
dojo.declare(...). In this case the extracted name is "declare", and I
doubt that anything can be done about that.

Given that we have introduced a "simplified" way to provide
summaries, I don't think we can get by without using ?;
otherwise the following construct becomes quite ambiguous:

/** @var placeholder for some data*/
var foobar = ...

while
/** @var ? placeholder for some data*/
var foobar = ...

guarantees predictable behavior

Regards,
Denis

Broofa

Feb 4, 2008, 8:47:48 AM
to jgrousedoc
Hey Denis, I've started playing around with the parsing support, and
so far it's working great! Thanks so much for adding this.

Re: making '?' optional, you're right about it being necessary for
that style of @var declaration. I was suggesting that when a
description is not supplied _and_ there's no identifier ('?' or
otherwise) - i.e. it's just a bare tag - then defaulting to the parsed
identifier might be useful.

These cases are currently treated as errors anyway, so there's
little/no downside to this.