[GSoC] Source meta information model proposal

Christopher Medrela

Mar 16, 2015, 6:53:53 PM
to clo...@googlegroups.com
Hello! My name is Christopher Medrela and I'd like to work on the "source metadata
information model" project, mentored by Alex Miller, for Google Summer of Code. I
hope this mailing list is the right place to discuss such projects (if I'm
wrong, please correct me).

I'd like to introduce a standard source meta information model that
represents code from the API perspective. There are many tools, such as
codox, autodoc, Grimoire, ClojureDocs, crossclj.info and so on, and each of
them has some repetitive code responsible for extracting this kind of
information.

I'd like to discuss what the model should be and which tasks you would
benefit from most. You can find the proposed model and tasks in the "model schema" and
"tasks (deliverables)" sections of my [proposal] draft. I'd like to emphasize
that the task list is heavily inspired by Alex Miller.

Francis Avila

Mar 16, 2015, 9:10:08 PM
to clo...@googlegroups.com
Wishlist: for macros, metadata about the vars a macro will define. (E.g., (defmacro defrecord [...]) will define (->NAME arg...), (map->NAME m) when executed.)

This would allow a lot more source analysis for the common case of def* macros which are just fancy ways of def-ing vars, but without having to eval the macro. Maybe even without having to eval anything, if the mechanism is entirely declarative!
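
To make the wish concrete, here is a minimal sketch of what an entirely declarative mechanism could look like. Everything in it is an assumption for illustration: the `defines-spec` map, the `NAME` template convention, and the `vars-defined-by` helper are hypothetical and not part of Clojure or Cursive.

```clojure
(require '[clojure.string :as str])

;; Hypothetical declarative spec: for each def*-style macro, the name
;; templates of the vars it creates. NAME stands for the defined symbol.
(def defines-spec
  {'defrecord ["->NAME" "map->NAME"]})

(defn vars-defined-by
  "Given an unevaluated form such as (defrecord Point [x y]), return the
  symbols the spec says the form would define. Nothing is evaluated or
  macroexpanded; the form is only read as data."
  [spec form]
  (let [[macro-sym target] form]
    (mapv #(symbol (str/replace % "NAME" (name target)))
          (get spec macro-sym []))))

(vars-defined-by defines-spec '(defrecord Point [x y]))
;; => [->Point map->Point]
```

A tool could read such a spec from library metadata and answer "which vars does this form introduce?" without loading the library. Forms whose macro is not in the spec simply yield an empty vector.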

This is an issue Colin Fleming (creator of Cursive) has talked about before, and I remember him saying something about creating such a mechanism for library authors to use for better Cursive integration. It'd be nice if such a mechanism could be standardized so the entire ecosystem could benefit. You should talk to him about this because he's given a fair amount of thought to this problem.

Andy Fingerhut

Mar 16, 2015, 9:33:34 PM
to clo...@googlegroups.com
Christopher:

I think considering autodoc to be no longer maintained because the last commit was Sep 1 2014 might be a bit hasty.  No commits can mean "stable and working", too, not only "abandoned".

Tom Faulhaber commits updates to the published Clojure API docs for Clojure itself and its contrib libraries every time a release of one of those is made (not sure if it is automatically done via scripts, or if he does something by hand each time).

Andy

Alex Miller

Mar 17, 2015, 12:06:03 AM
to clo...@googlegroups.com
Tom lovingly maintains autodoc and the Clojure projects' automated doc generation with it. But it mostly just continues doing what it does.

Reid McKenzie

Mar 17, 2015, 5:40:53 AM
to clo...@googlegroups.com
Hey Christopher,

I'm Reid, the Grimoire maintainer.
I'm delighted to see that someone besides myself and Alex is interested in this project and I wish you the best in your GSoC application.

I'm somewhat concerned in reading your proposal that while you claim the proposed data structure represents a "minimal superset of the other documentation engines", it is by no means a superset of the lib-grimoire infrastructure.
In fact the proposed "list of defs, list of namespaces" fails entirely to address the issues with multiple versions of multiple artifacts on multiple platforms that I've spent the last four releases of Grimoire and lib-grimoire wrangling.
The Thing structure and list/walk/read/write API I've provided via lib-grimoire seem to solve these issues at least in my work of trying to build a general documentation engine behind Grimoire so that I can just add artifacts left and right.

If I wanted to build a system like Grimoire where the fundamental navigation is entirely `group -> artifact -> version -> platform -> ns -> def` hierarchical descent, why would I choose your representation over the representation already present and usable in lib-grimoire?
I note with some entertainment that there exists a set of namespaces duplicated between ClojureScript's Clojure and ClojureScript sources.
For these namespaces and the defs therein, CrossClj.info incorrectly displays the Clojure namespaces when doing a qualified symbol lookup.
Grimoire used to have the same issue until I asked David Nolen and he said that's a bug.
You absolutely need the platform to disambiguate what definition and source you're looking at.

If I wanted to display the documentation for all the forks of `org.clojure/clojure`, how could I get all of that into your system?
All the forks have defs in the namespace `clojure.core`.
How do you propose to tell them apart?
You have to have the Maven artifactid and groupid.

Multiple versions of a single artifact?
Now you need the Maven version as well.

Now we've encoded all the same information that lib-grimoire already does.
Why would I agree to use this pair-of-lists structure and then reconstruct a hierarchy from node data when I could use a fundamentally hierarchical structure, for instance the one proposed [here](https://github.com/clojure-grimoire/lib-grimoire/issues/9)?
[By my metrics](http://conj.io/heatmap), the most-visited URLs on Grimoire specify a single def or other content precisely.
Why would I want to adopt a data format which does not inherently represent this common case of documentation being a lookup when, as above, I could build or use one that does?
That you discard these data with no discussion is worrisome to say the least, since I found that they were needed and was forced to invent them after I had done without them for some time.
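
As a rough illustration of the "documentation is a lookup" argument, the sketch below models the `group -> artifact -> version -> platform -> ns -> def` descent as a nested map. It is not lib-grimoire's actual Thing structure or API; the `docs` store, the `org.example/lib` artifact, and the `lookup` helper are all made up for the example.

```clojure
;; Hypothetical store, nested as
;; group -> artifact -> version -> platform -> ns -> def.
(def docs
  {"org.example"
   {"lib"
    {"0.1.0" {"clj"  {"lib.core" {"greet" {:doc "JVM implementation."}}}
              "cljs" {"lib.core" {"greet" {:doc "JavaScript implementation."}}}}
     "0.2.0" {"clj"  {"lib.core" {"greet" {:doc "JVM implementation, revised."}}}}}}})

(defn lookup
  "Resolve one def by its full coordinate vector
  [group artifact version platform ns def]. Returns nil when any
  segment of the path is missing."
  [store coordinate]
  (get-in store coordinate))

;; The same ns/def pair resolves differently per platform and version.
(:doc (lookup docs ["org.example" "lib" "0.1.0" "cljs" "lib.core" "greet"]))
;; => "JavaScript implementation."
(:doc (lookup docs ["org.example" "lib" "0.2.0" "clj" "lib.core" "greet"]))
;; => "JVM implementation, revised."
```

With a flat list of defs, the same question requires filtering on every coordinate field; with the hierarchy, a precise lookup is a single path traversal, and listing the children of any level is just `keys` on the sub-map.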

I agree that there is a lot of duplicated code between the various documentation systems.
Most of it has to do with finding namespaces on the classpath, loading them and documenting them.
Thankfully we already have `clojure.tools.namespace` which is supposed to solve this problem.
Currently `clojure.tools.namespace` can't find ClojureScript namespaces.
As I recall, codox ducks this limitation by implementing its own classpath searching code.
(apologies to Stuart Sierra who I bugged about patching this a while back.)

Your goal of "Building a repl plugin that will search statically across the code accessible by the project so you can query all artifacts without actually loading them." could use some clarification.
It took me longer than I care to admit to realize you were talking about browsing _documentation_ via some packaged documentation structure(s) before the user actually loads code.

Since the novel work you propose is really a documentation distribution convention, you should say more about that.
Are you suggesting that every artifact package a "doc.edn" file?
Is this an open research question you don't have an answer to yet?

This is a set of problems I've sunk a lot of time into and would like to see addressed.
It is certainly not my intent to dissuade you from this project.
Whatever documentation packaging and distribution system you come up with I'm likely to wind up supporting in Grimoire so it's in my direct interests to make sure that you skip as many of the pitfalls that I hit as possible.
I hope that you find this feedback constructive, and I look forward to seeing how this project evolves.
Feel free to bug me about Grimoire related stuff, especially if you find bugs or documentation issues :P

All the best,

Reid

Francesco Bellomi

Mar 17, 2015, 2:07:00 PM
to clo...@googlegroups.com
Hi Christopher,

I'm Francesco, the maintainer of crossclj.info

I think it's a very interesting project. Some comments on your proposal:

1) I think the information model is by far the most important deliverable. I agree with Reid that a sound "coordinate system" is very important.

2) The toolchain you want to implement is in part already there. 
Almost all the metadata you list in your current model are already extracted by codox; codox already separates the extraction phase from the output phase, so you could simply implement the "indexing library" by providing a different output module, in order to generate your model instead of HTML; the internal representation used by codox is already (somehow) a list of namespaces, each one with a list of vars; codox has a rich set of tools (lein plugin, etc.) which would be immediately reusable. Outputting JSON instead of EDN is a one-liner with Cheshire.

3) I'm not sure about the choice of storing "artifact coordinates" as part of a namespace's metadata. Currently that's not the case: the publishing recipe is kept separate (say, in project.clj or in a POM file), and a namespace at runtime has no notion of its coordinates. I think it's the same for all JVM languages. Are we providing extra flexibility, or are we simply complecting two different kinds of information?

4) One relevant use case is IDE integration: IDEs cannot evaluate source code at runtime in order to generate the relevant metadata (it's not safe), so having this kind of static info would be really useful, i.e. for providing users with hints about vars generated by macros.

Francesco

Alex Miller

Mar 17, 2015, 2:40:14 PM
to clo...@googlegroups.com
Hey all, 

Responding here to several things from both Reid and Francesco. I wanted to step back slightly to set some context. I proposed the project that Chris has posted about, and some of the points that Reid brought up are really sourced from my proposal, so I wanted to take the blame, as it were, for anything in the original proposal that sucked - not his fault. :)

My intention in the proposal was not that it necessarily capture everything but really convey the core idea. I expected that working through all the details *is* the project and would be done as an iterative process talking to me and other experts (like Reid and Francesco!!). I should also mention that another student has already submitted a proposal to GSOC for this project and only one student can work on it. I'm still not exactly sure how that is supposed to get resolved.

Anyhow, more comments below. I may also try to dig out some of the longer form work I've done on this since it seems to have interest now.


On Tuesday, March 17, 2015 at 9:07:00 AM UTC-5, Francesco Bellomi wrote:
Hi Christopher,

I'm Francesco, the maintainer of crossclj.info

I think it's a very interesting project. Some comments on your proposal:

1) I think the information model is by far the most important deliverable. I agree with Reid that a sound "coordinate system" is very important.

Totally agreed, and I've said this to both Chris and the other student. 

I do not think what's on the proposal page captures many of the important aspects of the problem. Even focusing just on vars ignores what I consider to be equally important pieces like records and protocols. I think it's important to capture the parts of the source that are needed to *use* a project, either as a consumer or as an extender. Some of these fringes are particularly the places where the various existing indexers vary.

The coordinate system is particularly important and tricky, and I think keeping multiple serialization formats (text-based and Datomic-backed are two reasonably different targets) in mind is a good way to avoid tailoring it too closely to your output needs (html pages). This is indeed the part I've spent the most time hammocking on. I know you guys have as well.
 
2) The toolchain you want to implement is in part already there. 
Almost all the metadata you list in your current model are already extracted by codox; codox already separates the extraction phase from the output phase, so you could simply implement the "indexing library" by providing a different output module, in order to generate your model instead of HTML; the internal representation used by codox is already (somehow) a list of namespaces, each one with a list of vars; codox has a rich set of tools (lein plugin, etc.) which would be immediately reusable. Outputting JSON instead of EDN is a one-liner with Cheshire.

I don't think JSON is a valuable serialization target. :)  I'd defer any notion of toolchain until there is some consensus on model.

3) I'm not sure about the choice of storing "artifact coordinates" as part of a namespace's metadata. Currently that's not the case: the publishing recipe is kept separate (say, in project.clj or in a POM file), and a namespace at runtime has no notion of its coordinates. I think it's the same for all JVM languages. Are we providing extra flexibility, or are we simply complecting two different kinds of information?

Yeah, that doesn't make sense to me either. I think versioning needs to happen at a higher level that can sync up with maven coords. 

Alex Miller

Mar 17, 2015, 2:56:13 PM
to clo...@googlegroups.com
I'm answering these out of order, sorry. :)


On Tuesday, March 17, 2015 at 12:40:53 AM UTC-5, Reid McKenzie wrote:
Hey Christopher,

I'm Reid, the Grimoire maintainer.
I'm delighted to see that someone besides myself and Alex is interested in this project and I wish you the best in your GSoC application.

I'm somewhat concerned in reading your proposal that while you claim the proposed data structure represents a "minimal superset of the other documentation engines", it is by no means a superset of the lib-grimoire infrastructure.
In fact the proposed "list of defs, list of namespaces" fails entirely to address the issues with multiple versions of multiple artifacts on multiple platforms that I've spent the last four releases of Grimoire and lib-grimoire wrangling.

Agreed, I think addressing versioning is a critical aspect.
 
The Thing structure and list/walk/read/write API I've provided via lib-grimoire seem to solve these issues at least in my work of trying to build a general documentation engine behind Grimoire so that I can just add artifacts left and right.

I haven't looked at this yet, but I think that's a requirement for this.
 
If I wanted to build a system like Grimoire where the fundamental navigation is entirely `group -> artifact -> version -> platform -> ns -> def` hierarchical descent, why would I choose your representation over the representation already present and usable in lib-grimoire?
I note with some entertainment that there exists a set of namespaces duplicated between ClojureScript's Clojure and ClojureScript sources.
For these namespaces and the defs therein, CrossClj.info incorrectly displays the Clojure namespaces when doing a qualified symbol lookup.
Grimoire used to have the same issue until I asked David Nolen and he said that's a bug.
You absolutely need the platform to disambiguate what definition and source you're looking at.

That's interesting. I suspect the new reader conditional and cljc portable file extension will further complicate this.
 
If I wanted to display the documentation for all the forks of `org.clojure/clojure`, how could I get all of that into your system?
All the forks have defs in the namespace `clojure.core`.
How do you propose to tell them apart?
You have to have the Maven artifactid and groupid.

Yes.
 
Multiple versions of a single artifact?
Now you need the Maven version as well.

Yes.
 
Now we've encoded all the same information that lib-grimoire already does.
Why would I agree to use this pair-of-lists structure and then reconstruct a hierarchy from node data when I could use a fundamentally hierarchical structure, for instance the one proposed [here](https://github.com/clojure-grimoire/lib-grimoire/issues/9)?
[By my metrics](http://conj.io/heatmap), the most-visited URLs on Grimoire specify a single def or other content precisely.
Why would I want to adopt a data format which does not inherently represent this common case of documentation being a lookup when, as above, I could build or use one that does?
That you discard these data with no discussion is worrisome to say the least, since I found that they were needed and was forced to invent them after I had done without them for some time.

Like I said in the other mail, Chris (or whoever does this project) is going to start with less context than you or I. The project hasn't started yet and I expect this kind of feedback to be incorporated.
 
I agree that there is a lot of duplicated code between the various documentation systems.
Most of it has to do with finding namespaces on the classpath, loading them and documenting them.
Thankfully we already have `clojure.tools.namespace` which is supposed to solve this problem.
Currently `clojure.tools.namespace` can't find ClojureScript namespaces.
As I recall, codox ducks this limitation by implementing its own classpath searching code.
(apologies to Stuart Sierra who I bugged about patching this a while back.)

If stuff like this needs to be built, I think that's a perfectly adequate sub-goal for the project.
 
Your goal of "Building a repl plugin that will search statically across the code accessible by the project so you can query all artifacts without actually loading them." could use some clarification.
It took me longer than I care to admit to realize you were talking about browsing _documentation_ via some packaged documentation structure(s) before the user actually loads code.

Since the novel work you propose is really a documentation distribution convention, you should say more about that.
Are you suggesting that every artifact package a "doc.edn" file?
Is this an open research question you don't have an answer to yet?

I do not know what the serialization format should be, but I envision this being a jar of stuff that can be automatically built on a project and deployed to maven repos with a known classifier (just like the -javadoc and -source jars built and distributed with most Java projects). Except instead of distributing source metadata as html, we can distribute it as *data*. Given a known classifier, this artifact can be consumed automatically by build or other dev tools. For example, imagine that you could download source metadata artifacts for all of your project dependencies and run a dynamic web service that gave you "clojure docs" specifically for your project. Or let you search across them, scoped specifically to your dependencies.

Another aspect that I didn't list in the original proposal to keep things simpler is the idea of merging multiple metadata artifacts for the same project. For example, examples or other explanatory information (for Clojure itself for example) could be maintained in a secondary repository of just that, then merged into the automatically extracted source metadata to produce something like what you get with clojure docs or grimoire. In particular, people are often bothered by the conciseness of Clojure's docs. There are no plans to change that, but there is the option to combine those docstrings with (for example) a curated repo of examples or extended description. 
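
A minimal sketch of that merge idea, under stated assumptions: both artifacts are plain maps keyed by fully qualified symbol, scalar fields such as `:doc` keep the value from the earlier (extracted) artifact, and vector fields such as `:examples` are concatenated. The `example.core/frob` entry, the field names, and the `merge-metadata` helper are hypothetical; no model has been agreed on yet.

```clojure
;; Hypothetical output of automatic extraction from source.
(def extracted
  {'example.core/frob {:doc "Frobs x." :examples []}})

;; Hypothetical curated repository with examples and extended notes.
(def curated
  {'example.core/frob {:examples ["(frob 1)"]
                       :notes    ["A longer explanation of frob."]}})

(defn merge-entry
  "Merge two metadata maps for the same def. Vector-valued fields are
  concatenated; for any other conflict the earlier value wins, so curated
  data can add to extracted docstrings but never replace them."
  [a b]
  (merge-with (fn [x y] (if (and (vector? x) (vector? y)) (into x y) x))
              a b))

(defn merge-metadata
  "Merge any number of metadata artifacts, earliest first."
  [& artifacts]
  (apply merge-with merge-entry artifacts))

(merge-metadata extracted curated)
;; => {example.core/frob {:doc "Frobs x."
;;                        :examples ["(frob 1)"]
;;                        :notes ["A longer explanation of frob."]}}
```

Letting the extracted artifact win on scalar conflicts is only one possible policy; a real design would need to decide which source is authoritative for each field.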
 

This is a set of problems I've sunk a lot of time into and would like to see addressed.
It is certainly not my intent to dissuade you from this project.
Whatever documentation packaging and distribution system you come up with I'm likely to wind up supporting in Grimoire so it's in my direct interests to make sure that you skip as many of the pitfalls that I hit as possible.
I hope that you find this feedback constructive, and I look forward to seeing how this project evolves.
Feel free to bug me about Grimoire related stuff, especially if you find bugs or documentation issues :P

We will regardless. :) Thanks for the feedback.
 

All the best,

Reid

Reid McKenzie

Mar 18, 2015, 6:18:30 AM
to clo...@googlegroups.com
Alex, glad to see we're on the same wavelength about this more or less.

Christopher, some other deliverables worth considering:

- What format is documentation in? As Grimoire shows, plain doc
text is pretty badly formatted on average, certainly in comparison to
HTML or even markdown. It'd be awesome if we had a convention for
non-plain text documentation and for indicating format to
documentation tools accordingly.

- Notes. Alex mentioned "extended documentation". Clojure has often
been criticized for having "excessively short" docstrings. While I
agree that a few well chosen words typically do better than many
words, there are cases such as the documentation for defmulti where
more than just what passes for the docstring is required. defmulti,
for instance, is just one entry point to the subject of multimethods,
which, while documented well on clojure.org, is not treated in
sufficient generality by any "docstring". Andy F has Thalia, a
project which modifies docstrings in place to try and augment some
of the docs which are considered most .. terse. It seems that Alex
wants whatever format this project produces to support arbitrarily
many "additional" docstrings for a given entity. One arguable defect
of the Grimoire representation as it stands is that I implicitly
limited myself to a single "note" (user docstring/extension) per
entity.

- Examples. Most frequently, example code is presented as REPL
sessions. Again, there is no common format for these. Some "examples"
include the prompt string. Others don't. Some include STDOUT and
STDERR inline with returned results. Others comment it out.
It'd be awesome if we had tooling (including a common format or
format indicator) for representing examples. Right now Grimoire and
ClojureDocs use plain text which defeats all analysis and linking
efforts. Honestly an nREPL session replay could probably be
sufficient, but this is a research question. How you deal with deps
to run an example is a related question.

It'd be awesome for instance if I could share examples with the
4clojure folks or with Devn Walters' project getclojure.

- Links. I previously played with a representation which I called
"var-link" for uniquely naming any of
[group artifact version platform ns def].
The idea was that you could write a URI of the form
ns:org.clojure/clojure/1.6.0/clojure.core or
def:org.clojure/clojure/1.6.0/clojure.core/concat etc.
This idea predates the existing grimoire.things namespace, and was
abandoned in favor of it since grimoire.things does almost this job.
However grimoire.things fails entirely to provide or read any
reasonable URI or other structured text representation for Things.
It'd be great if we had a standard representation as such, so that
documentation writers could explicitly link to other entities from
docstrings. Again this goes back to the docstring formatting goal.
What if I could explicitly link to say nth from first and second's
docs? Link to defmethod and prefer-method from defmulti's docs?
I think you get the idea.
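
For the Links idea, here is a small sketch of reading the two URI shapes quoted above into a map. This is not the code from var-link or grimoire.things; `parse-var-link` and `segment-keys` are hypothetical names, the segment order is taken from the two examples (which carry no platform segment), and def names that themselves contain `/` are not handled.

```clojure
(require '[clojure.string :as str])

;; Segment order as it appears in the example URIs above.
(def segment-keys [:group :artifact :version :ns :def])

(defn parse-var-link
  "Parse a URI such as def:org.clojure/clojure/1.6.0/clojure.core/concat
  into a map of coordinate segments plus the :kind given by the scheme.
  Assumes the input contains a scheme followed by a colon."
  [uri]
  (let [[kind path] (str/split uri #":" 2)
        segments    (str/split path #"/")]
    (assoc (zipmap segment-keys segments) :kind (keyword kind))))

(parse-var-link "def:org.clojure/clojure/1.6.0/clojure.core/concat")
;; => {:group "org.clojure", :artifact "clojure", :version "1.6.0",
;;     :ns "clojure.core", :def "concat", :kind :def}

(parse-var-link "ns:org.clojure/clojure/1.6.0/clojure.core")
;; => {:group "org.clojure", :artifact "clojure", :version "1.6.0",
;;     :ns "clojure.core", :kind :ns}
```

A docstring renderer could scan for such URIs and turn each parsed map into a link to the matching entity.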

Just some ideas that've been rattling around in my head unimplemented.

Reid

richar...@googlemail.com

Mar 18, 2015, 2:50:22 PM
to clo...@googlegroups.com


On Wednesday, March 18, 2015 at 07:18:30 UTC+1, Reid McKenzie wrote:
Alex, glad to see we're on the same wavelength about this more or less.

Christopher, some other deliverables worth considering:

Hi all, so I'm the other student Alex mentioned. I'm not participating so much in this discussion, since there will be enough time during the community bonding phase of the GSoC. My proposal is already on Melange, the place where all proposals go in the end. I've made it public now (http://www.google-melange.com/gsoc/proposal/public/google/gsoc2015/rmoehn/5629499534213120). It will go through some more revisions, though. But it mainly contains general stuff about the project and process and how I intend to go about things. Getting input from the community and knowledgeable people (i.e. you) will be part of those.

Richard

Christopher Medrela

Mar 18, 2015, 8:28:41 PM
to clo...@googlegroups.com
Hello! Alex decided to proceed with Richard, so I'd like to find some
other project. I'm glad to see so much feedback and I'd really like to reply to
all of it, but there is not much time left until the end of the application
period, so I will focus exclusively on another project. I hope that
your feedback will be helpful for Richard (and the entire community). If
that's not the case and I've wasted your time, I'm really sorry.

Reid McKenzie

Mar 19, 2015, 9:04:32 PM
to clo...@googlegroups.com
Found var-link kicking around in my projects dir so I re-published it for this thread.

https://github.com/clojure-grimoire/var-link

Reid

Alex Miller

Mar 19, 2015, 11:06:34 PM
to clo...@googlegroups.com
Chris (and anyone else), Daniel mentioned to me in a note that it is ok for multiple students to submit a proposal for the same project. We do not know how many spots we will be given as an organization and whether particular students will meet whatever guidelines are set out by Google. So I think I was a bit incorrect in my understanding of the process and you would still be welcome to submit a proposal with the caveat that there are no guarantees that any particular proposal will move forward.

Alex

Christopher Medrela

Mar 20, 2015, 9:32:07 PM
to clo...@googlegroups.com
OK, but will you be able to mentor two students at the same time? Google warns
that it's easy to underestimate how much time mentoring takes.

BTW, I found the "typed Overtone" project, which I find as interesting as this
one. And I think the Clojure community will benefit much more if two
students work on different projects. So I will stick with "typed Overtone".

Ambrose Bonnaire-Sergeant

Mar 20, 2015, 9:36:21 PM
to clojure
Hi Christopher,

I recommend still sending a proposal for Alex's project just in case. It's hard to predict
what constraints we will need to satisfy for project allocation (we might have a small number
of allocations from Google, a student may choose a project with another organisation).

Thanks,
Ambrose
