Preserve metadata for qualified identifiers and alias segments


i Dorgan

Jun 4, 2021, 12:00:30 PM
to elixir-lang-core
Hi all,

I'm writing a function that takes a quoted expression and calculates the start and end positions of the node in the source code. So for example for this expression:

:"foo#{
  2
}bar"

It would tell us that it starts at line: 1, column: 1 and ends at line: 3, column: 6. The idea is that by knowing the boundaries of a node, a refactoring tool can say things like "replace the code between these positions with this other code".

The issue I'm facing is that there are two cases where the AST does not contain enough information to calculate those positions. The first one is qualified identifiers:

foo
.
bar

which produces the AST:

{{:., [line: 2, column: 1],
  [
    {:foo, [line: 1, column: 1], nil},
    :bar
  ]},
 [no_parens: true, line: 2, column: 1], []}

Note that we don't have any location information for :bar, only for the dot. This makes it impossible to accurately calculate the ending location of the expression, and we are forced to assume :bar is on the same line as the dot.
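
For anyone who wants to reproduce this, here is a minimal sketch using Code.string_to_quoted!/2 with the columns: true option. The variable names are only illustrative, and the metadata on the outer call may differ between Elixir versions, so the snippet prints whatever the parser returns instead of assuming a particular shape:

```elixir
# Parse the three-line qualified identifier with column tracking enabled.
source = "foo\n.\nbar"
ast = Code.string_to_quoted!(source, columns: true)

# The inner tuple is the dot operator, the outer tuple is the call itself.
{{:., dot_meta, [{:foo, foo_meta, nil}, :bar]}, call_meta, []} = ast

IO.inspect(foo_meta, label: "foo meta")
IO.inspect(dot_meta, label: "dot meta")
IO.inspect(call_meta, label: "call meta")
```

Note that :bar is a bare atom in the argument list, so it has no metadata slot of its own.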

The second case happens with aliases:

Foo.
Bar
.Baz

produces:

{:__aliases__, [line: 1, column: 1], [:Foo, :Bar, :Baz]}

Here we have even less information: we know nothing about the locations of the dots or the segments, and we are forced to assume everything is on the same line.

I looked into the parser and this information is being discarded in the build_dot function for qualified identifiers and in build_dot_alias for aliases.

My proposal is to keep that information in the AST metadata when the :token_metadata option is true instead of discarding it, similarly to how it is done with do/end, closing and end_of_expression.

The quoted form of the first example would be something like this:

{{:.,
  [
    identifier_location: [line: 3, column: 1],
    line: 2,
    column: 1
  ],
  [
    {:foo, [line: 1, column: 1], nil},
    :bar
  ]},
 [no_parens: true, line: 2, column: 1], []}

For the aliases it would be a bit more involved, because there are two kinds of locations that would need to be preserved: dots and segments. I've considered something like this to keep only the segments:

{:__aliases__,
 [
   line: 1,
   column: 1,
   alias_segments: [
     [token: :Foo, line: 1, column: 1],
     [token: :Bar, line: 2, column: 1],
     [token: :Baz, line: 3, column: 2]
   ]
 ], [:Foo, :Bar, :Baz]}

I already have a working version, so I will gladly submit a PR if you consider this to be viable. I'm still unsure how to handle the dot positions in a meaningful way. While knowing just the segment positions is enough for my use cases, I figure dot positions may also need to be preserved for the sake of completeness.

I'd like to know your thoughts!

José Valim

Jun 4, 2021, 3:13:11 PM
to elixir-l...@googlegroups.com
The dot one is easy, I think we can have the outer meta be the meta of the call identifier. A PR is welcome.

For aliases, it is trickier, as you said. One alternative is to have something similar to [end: ...] that we have for constructs like do-blocks, so we can at least say where the whole alias extends to? WDYT?

i Dorgan

Jun 4, 2021, 4:02:08 PM
to elixir-lang-core
> The dot one is easy, I think we can have the outer meta be the meta of the call identifier. A PR is welcome.
Great! I will prepare a PR soon

> One alternative is to have something similar to [end: ...] that we have for constructs like do-blocks, so we can at least say where the whole alias extends to? WDYT? 
Sounds reasonable to me. It would also be way less noisy.
I think what's most valuable is to be able to tell the boundaries of a node, not so much what happens in between.

Regarding the naming of the fields, do you think end_of_expression would be fine for both? It is described as "denotes when the end of expression effectively happens", which is what we would be adding here. Moreover, they would be the same positions that are already added in such field if the expression is part of a block.

José Valim

Jun 4, 2021, 4:10:56 PM
to elixir-l...@googlegroups.com
I think for the dot we don't need end_of_expression, we just update the outer meta to include the outer identifier.
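
A rough sketch of what that would allow, assuming the outer call metadata ends up holding the position of the identifier after the dot. DotRange and end_position/1 are hypothetical names, not part of Elixir, and the AST below is written by hand for foo, dot, bar on three separate lines:

```elixir
defmodule DotRange do
  # Hypothetical helper. Assumes call_meta points at the identifier after
  # the dot, and that the identifier is a plain (unquoted) atom.
  def end_position({{:., _dot_meta, [_left, name]}, call_meta, []}) when is_atom(name) do
    line = Keyword.fetch!(call_meta, :line)
    column = Keyword.fetch!(call_meta, :column)
    # The end column is exclusive: one past the identifier's last character.
    [line: line, column: column + String.length(Atom.to_string(name))]
  end
end

# Hand-written AST with the outer meta moved from the dot to `bar`.
ast =
  {{:., [line: 2, column: 1], [{:foo, [line: 1, column: 1], nil}, :bar]},
   [no_parens: true, line: 3, column: 1], []}

IO.inspect(DotRange.end_position(ast))
```

Quoted identifiers such as foo."bar baz" would need extra handling, since the atom's length no longer matches the source text.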

For aliases, I guess we can reuse closing? Or maybe last_dot? And what happens when the alias has no dot? We don't set it?

i Dorgan

Jun 4, 2021, 4:34:47 PM
to elixir-lang-core
> I think for the dot we don't need end_of_expression, we just update the outer meta to include the outer identifier.
Sounds good to me


> For aliases, I guess we can reuse closing? Or maybe last_dot?
Closing points to the location of the closing pair, which implies there is something wrapped in {}, () or [] (or end in the case of anonymous functions), which is why I was leaning towards end_of_expression. The only issue I see with end_of_expression is that we need to calculate the length of the segment, because end_of_expression always points at the very end of the expression, not just where the last token starts.

The problem with last_dot is that the last segment may or may not be on the same line as the dot, for example:

Foo.
Bar

or

Foo
.

Bar

Both of these parse to the same AST. So the name should refer to the last segment (:Bar in this case). Maybe last_segment? The syntax reference docs mention "each segment separated by dot as an argument", so it would be consistent with that description.

> And what happens when the alias has no dot? We don't set it?
If the alias has no dot I think we could safely skip the new field, especially if we go for last_segment since there is only one segment.

José Valim

Jun 4, 2021, 5:10:41 PM
to elixir-l...@googlegroups.com
Ah, let’s call it :last then and it points to the segment. There is always one too, so it is always available.
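
As an illustration of how a tool could consume this, here is a minimal sketch that derives the end of an alias from :last. AliasRange and end_position/1 are hypothetical names, the AST is written by hand for Foo, Bar and .Baz on three lines, and the fallback to the node's own metadata is only an assumption for ASTs that lack the new field:

```elixir
defmodule AliasRange do
  # Hypothetical helper. Assumes :last holds the start position of the
  # final segment, and that the final segment is an atom.
  def end_position({:__aliases__, meta, segments}) do
    # Fall back to the node's own position when :last is absent.
    last = Keyword.get(meta, :last, meta)
    segment = segments |> List.last() |> Atom.to_string()

    [
      line: Keyword.fetch!(last, :line),
      # Exclusive end column: one past the segment's last character.
      column: Keyword.fetch!(last, :column) + String.length(segment)
    ]
  end
end

ast =
  {:__aliases__, [last: [line: 3, column: 2], line: 1, column: 1],
   [:Foo, :Bar, :Baz]}

IO.inspect(AliasRange.end_position(ast))
```

This only needs the length of the last segment, which is the same calculation mentioned earlier for end_of_expression.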

i Dorgan

Jun 4, 2021, 5:32:19 PM
to elixir-lang-core
Sounds good!
I will send some PRs soon :)

Steve Morin

Jun 5, 2021, 8:45:09 PM
to Elixir Lang Core
Will that metadata extend to comments?



--
Steve Morin | Entrepreneur, Engineering Leader, Startup Advisor 
Editor at | https://productivegrowth.substack.com/
Live the dream start a startup. Make the world ... a better place.

i Dorgan

Jun 6, 2021, 12:20:14 PM
to elixir-lang-core
It wouldn't be needed. With the `preserve_comments` option you already get the comment lines, so it's easy to figure out their boundaries.
And in any case comments are not part of the AST, so it's up to the user to figure out how to merge them together.

Steve Morin

Jun 13, 2021, 12:36:20 PM
to Elixir Lang Core

RE: And in any case comments are not part of the ast, it's up to the user to figure out how to merge them together.

Making it as easy as possible to preserve comments will aid AST manipulation of files that are human created and then augmented programmatically. 

i Dorgan

Jun 13, 2021, 12:53:02 PM
to elixir-lang-core
Hi Steve,

I agree. Part of the work to make that a reality started with the proposal to make some functionality from the Elixir formatter public (https://groups.google.com/u/1/g/elixir-lang-core/c/-8CPorfVTxg).

From a previous discussion on this mailing list (https://groups.google.com/u/1/g/elixir-lang-core/c/GM0yM5Su1Zc/m/poIKsiEVDQAJ), my conclusion was that comments won't be added to the AST, and that having two ASTs (one for macros and one for source code manipulation, with different semantics) would cause confusion. But if we at least had access to a) the internal representation of comments in the formatter and b) a function to transform a regular AST into an algebra document, then half of the work would already be done.

Since that change won't happen, the best place to solve this is userland code; all we need is for the formatter to "cooperate". I started working on the Sourceror library (https://github.com/doorgan/sourceror) to help solve this issue without having Elixir itself introduce breaking changes.

However, after the last comments-in-ast proposal we now have the literal_encoder option for Code.string_to_quoted/2, so maybe it would be worth exploring that front too. It may require a significant amount of work in the tokenizer, though.

-Dorgan

José Valim

Jun 13, 2021, 12:57:49 PM
to elixir-l...@googlegroups.com
The formatter is able to stitch comments back together and this API is now available to users too.
