Keeping the AUTHORS file up to date

76 views
Skip to first unread message

Oscar Benjamin

unread,
Aug 21, 2021, 11:58:45 AM8/21/21
to sympy
Hi all,

I just merged a PR to update the AUTHORS file:
https://github.com/sympy/sympy/pull/21881

This is something that I've needed to do before each release and is
quite time consuming so in that PR I also added the scripts that check
the AUTHORS file to the CI tests. Now any PR that does not have up to
date author information will fail in CI. In particular this means that
any new contributor will need to add their name to the AUTHORS file
before their first PR can be merged.

The name and email address in the AUTHORS file needs to match up with
the name/email in the git commits (as set by git config). Otherwise
there will need to be a line in the .mailmap file associating the
name/email in the authors file with the name/email in the commits. The
most common reason this is needed is if someone makes commits through
the github web UI which will always record a no reply email address.

This will be problematic for new contributors but it means that the
author information will always be up to date and will be more accurate
since anyone contributing will be required to specify the name and
email address up front.

The simple instruction for any new contributor is that you should
first set your name and email address in your git config:

$ git config --global user.name "John Doe"
$ git config --global user.email joh...@example.com

See e.g.:
https://git-scm.com/book/en/v2/Getting-Started-First-Time-Git-Setup

If git is correctly configured in all of the commits then the sympy
git repo has two scripts that can be used to check and update the
author information. The first is

$ bin/mailmap_update.py

This extracts all name/email combinations from all commits in the repo
and checks that every email address is mapped to a unique name. If
there are two commits with the same email addresses but different
names then a mailmap line is needed to specify which name is the one
that should be used in the authors file. Likewise if the same name is
used with multiple email addresses then a mailmap line is needed to
specify which email address should go with the name. (I'm not sure how
to handle genuinely distinct people who actually have the same
name...).

If there are no errors reported by mailmap_update.py then the second
script can be used to update the AUTHORS file:

$ bin/authors_update.py

This also extracts all name/email combinations from commits and then
runs them through the mapping in mailmap and then if any are not
listed in the authors file the script will add them. It will print
something to say what it has done. You can use git diff to see the
changes. These changes in .mailmap and AUTHORS need to be committed
and pushed.

I expect some new contributors will find this difficult especially
since they might not discover that their git config has the wrong
name/email until after they have already made the commits and pushed
them. It will reduce the chances that someone accidentally uses the
wrong git config though (once your commits are merged to master this
is no way to remove them even if you do not want that name/email to be
used publicly). The simple fix is just to add some lines to .mailmap
but many contributors might prefer to actually fix their config and
redo the commits. In general I would rather have correct name/email
information recorded in the commits than have a .mailmap entry that
disambiguates them.

Lastly whenever we have changes to CI like this there is a risk that a
PR that has already passed CI will be merged resulting in CI failure
on the master branch that then prevents any PRs from being merged
until it gets fixed. CI can be rerun for a PR by closing and
reopening.

--
Oscar

Aaron Meurer

unread,
Aug 21, 2021, 11:59:00 PM8/21/21
to sympy
Thanks. I think we can do this, but let's make sure the error messages
from the scripts give sufficient information on what to do when they
fail. Right now, I don't think they really describe things well from
the point of view of someone who doesn't know about them yet. We might
also want to put something longer on the wiki and link to it.
Otherwise, you will spend more time describing to people how to fix
this than you currently do updating the AUTHORS yourself.

Also, for anyone wondering why we do this in the first place, there
are two reasons. Firstly is that AUTHORS credits everyone who has
contributed, and is shipped with every release (so it's present even
when the git history isn't). Secondly, by keeping .mailmap up-to-date,
we can always have an accurate count of how many contributors we have.
Without it, it would be very difficult to count, because the same
person would appear duplicated in the git history. This information is
useful when doing various statistics. For example, consider the
question, "how many people contributed to SymPy in the last year?"
This question is easy to answer with 'git shortlog -ns --since="1 year
ago" | wc -l' (the answer as of right now is 152), but only because
duplicates are properly filtered.

Aaron Meurer

Aaron Meurer
> --
> You received this message because you are subscribed to the Google Groups "sympy" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to sympy+un...@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/sympy/CAHVvXxThP7%2Bb2WQbExN670tgM8AE7hiS3PisJcNSHnUf-H9y-g%40mail.gmail.com.

Chris Smith

unread,
Aug 23, 2021, 2:28:50 PM8/23/21
to sympy
> have the same names

Authors are identified by email addresses. If an author with a given name uses more than one email address, then (according to the `bin/mailmap_update.py` note, 

        Ambiguous name warning: if a person uses more than
        one email address, entries should be added to .mailmap
        to merge them into a single canonical address.
/c

Oscar Benjamin

unread,
Aug 24, 2021, 7:39:07 PM8/24/21
to sympy
I've realised another problem with this which is that if a PR was
opened before this change then it is not enough to merge with the
latest master and rerun the bin/authors_update.py script. The script
updates the AUTHORS file in the order of the commits but that order is
different when merging master into the PR rather than merging the PR
into master which is what happens in CI. That means that any PR from
before this change needs to be rebased against master (rather than
have master merged in) and then the script should be rerun.

It's not clear to me yet if this is just a temporary problem for PRs
that were opened before the change but haven't been merged yet. In
general though anything that involves appending to the end of a file
as the authors_update.py script does can lead to merge conflicts so
this might be an ongoing problem that requires recording the author
information in a different way. For example storing the author
information in alphabetical order would solve most of the problems
(they could be processed into a different order at release time).


Oscar
> --
> You received this message because you are subscribed to the Google Groups "sympy" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to sympy+un...@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/sympy/91fba620-78c3-496b-b2e3-6ba37d3b211dn%40googlegroups.com.

Aaron Meurer

unread,
Aug 24, 2021, 8:07:57 PM8/24/21
to sympy
Are you simulating having the PR run in a "merged with master" state
when running the authors script on CI? Maybe we should update the
script itself so that it can do this.

It tries to order based on when the person first has commits in
master. So if someone made a PR a long time ago, but it gets merged
today, the person's name will be appended to the end of the AUTHORS
rather than somewhere in the middle, which is where it would go if we
sorted by commit date. More specifically, it uses the order that
someone first appears based on git log --topo-order --reverse (with
some minor cleanups due to the fact that early history is not in git).
I'm not 100% sure this logic is correct, so if someone's name is being
added not somewhere other than the end, we should fix that. I believe
the way git works, if two PRs add a line to the end of the file, it
will consider them to be in merge conflict. In other words, you should
have a situation where a PR breaks master unexpectevly. Rather, PRs
that add new people will just create merge conflicts with each other
(which is not really ideal either, but it is preferable of the two).

Sorting the file might be a good idea, given that every PR that adds
someone is basically guaranteed to create merge conflicts with every
other PR that adds someone. With a sorted file, this would only happen
if two names happen to be next to each other alphabetically (I think.
It really depends on the intricacies of how git decides something is a
merge conflict, so we should double check this). We can easily
generate the current date sorted list using the git command plus the
fixups that are currently in the authors script.

Aaron Meurer
> To view this discussion on the web visit https://groups.google.com/d/msgid/sympy/CAHVvXxQ8xcZEJ77oWaX-j6o%2B8gK9x36GfL7LFPy_Lqp3EwcM9A%40mail.gmail.com.

Oscar Benjamin

unread,
Aug 24, 2021, 8:30:13 PM8/24/21
to sympy
On Wed, 25 Aug 2021 at 01:07, Aaron Meurer <asme...@gmail.com> wrote:
>
> Are you simulating having the PR run in a "merged with master" state
> when running the authors script on CI? Maybe we should update the
> script itself so that it can do this.

It's not a simulation. When a PR is pushed GitHub actions will make a
temporary commit that merges the PR into master and then all tests run
on that commit. The problem I'm referring to is that usually if you
suggest that a contributor should "merge with master" in their PR the
merge is in the opposite direction (merge master into the PR rather
than PR into master). For most purposes this doesn't make any
difference because the final state of the changes to files is the
same. However the authors script looks through the topological order
of the commits and that is not the same in both situations. That's why
currently a rebase is needed: before the PR can be merged the
topological order needs to correspond to the chronological order in
which PRs are merged (but before the PRs are merged the *eventual*
chronological order is not actually known).


Oscar

Aaron Meurer

unread,
Aug 25, 2021, 4:22:51 AM8/25/21
to sympy
Right, the point is that the script currently doesn't give the same
results in a branch vs. that branch after being merged into master (or
rebased on top of master). But I think we could modify it so that it
does do this. I don't know if there's a fancy git command we can do,
but I think the simplest way would be to clone the repo into a temp
directory and merge the branch into master before running the git log
--topo-sort.

But even so, I don't think this will solve the merge conflict problem
which, unless I am mistaken, will happen just from multiple people
adding names to the bottom of the file.

Aaron Meurer

>
>
> Oscar
>
> --
> You received this message because you are subscribed to the Google Groups "sympy" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to sympy+un...@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/sympy/CAHVvXxRHj27Xccr7nwtR%2B1yR_-OBY7fribJ1%3DRshzkP3e2iMjw%40mail.gmail.com.

prathamesh bhole

unread,
Aug 25, 2021, 4:43:54 AM8/25/21
to sympy
Today I made a commit to my pull request which was created way before this change. Now the code quality test is failing because of exit code 1 by authors_update.py. Can you please help, so that I can clear the tests

Oscar Benjamin

unread,
Aug 25, 2021, 5:26:49 AM8/25/21
to sympy
Yes, the merge conflict issue comes from insisting that many PRs need to make changes to the end of the same file. A simple fix is to make it so that the changes are not at the end e.g. by listing the names in alphabetical order.

The bin/authors_update.py script has a load of code to carefully get the order right so presumably at some point someone thought that the ordering was important. It's possible that we could have the names temporarily stored in alphabetical order in a separate file that isn't included in the release. Then before release a script could collate the names and write them to the AUTHORS file in commit order.

I wonder how other projects do this...


Oscar
 

Oscar Benjamin

unread,
Aug 25, 2021, 5:33:38 AM8/25/21
to sympy
On Wed, 25 Aug 2021 at 09:43, prathamesh bhole <prathame...@gmail.com> wrote:
Today I made a commit to my pull request which was created way before this change. Now the code quality test is failing because of exit code 1 by authors_update.py. Can you please help, so that I can clear the tests

Hi Prathamesh,

I'm presuming that you have added the main sympy repo as a remote called upstream. If you haven't yet done that then you can do that like:

$ git remote add upstream https://github.com/sympy/sympy.git

Now you can rebase your PR branch over the master branch in the main repo like this:

$ git checkout my_pr_branch_name
$ git fetch upstream
$ git rebase upstream/master

The rebase might work straight away or it might encounter a merge conflict. If there is a conflict then use git status a lot and read the messages from git carefully - they do just about tell you how to resolve the conflict.

When the rebase is complete rerun the authors script and commit the changes

$ bin/authors_update.py
$ git diff   # <-- check what is actually changed
$ git add AUTHORS
$ git commit -m 'maint: update AUTHORS file'

Finally you can force push to update the PR:

$ git push --force


Oscar

prathamesh bhole

unread,
Aug 25, 2021, 6:16:01 AM8/25/21
to sy...@googlegroups.com
@Oscar the test is still failing with the same error

--
You received this message because you are subscribed to the Google Groups "sympy" group.
To unsubscribe from this group and stop receiving emails from it, send an email to sympy+un...@googlegroups.com.

Oscar Benjamin

unread,
Aug 25, 2021, 7:08:30 AM8/25/21
to sympy
On Wed, 25 Aug 2021 at 11:16, prathamesh bhole <prathame...@gmail.com> wrote:
@Oscar the test is still failing with the same error

Look carefully at the diff and check every changed line:

There are changes in the authors file from the merge conflict not being fully resolved.

Run the bin/authors_update.py script locally before pushing. It should look like this:

$ bin/authors_update.py 

No changes made to AUTHORS.



Oscar

Aaron Meurer

unread,
Aug 25, 2021, 6:55:08 PM8/25/21
to sympy
On Wed, Aug 25, 2021 at 3:26 AM Oscar Benjamin
Honestly, the reason is that the AUTHORS file was always in that
order, and I always tried to maintain the ordering when updating it
manually. So when we wrote the script, we made it so that it gave the
same ordering that was already there. See the discussion at
https://github.com/sympy/sympy/pull/10239. So the ordering isn't that
important. IMO alphabetical is fine. The only real issue with it is
that there's no reasonable way to automatically sort by last name
(which not everyone even has), so we have to just sort by first name
(we do a fake last name sort in the release notes, but we should
probably just do first name there too). We can maybe have some flag to
the script that prints the authors in the current order in case anyone
is ever interested in that.

>
> I wonder how other projects do this...

In my experience most other projects aren't great about maintaining a
.mailmap file. Maybe once we make the tooling here work well we can
separate it from SymPy so that other projects can make use of it as
well.

One alternative I know of is to use the authors activity from rever
https://regro.github.io/rever-docs/api/activities/authors.html. This
generates AUTHORS and .mailmap from a yaml file. I believe the code
for it is based on the SymPy scripts. I'm not sure if that level of
indirection actually solves the problem or makes it worse, however.

Aaron Meurer

>
>
> Oscar
>
>
> --
> You received this message because you are subscribed to the Google Groups "sympy" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to sympy+un...@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/sympy/CAHVvXxR16TftRd2mRXrq6Nsdra5S%2BSYbvhtgGX%2Bf9Wmx4vAE9w%40mail.gmail.com.

Aaron Meurer

unread,
Aug 25, 2021, 7:01:32 PM8/25/21
to sympy
Regarding usability, you are going to have to keep explaining things
to everyone until we make the scripts better. I would suggest

- Merge the two scripts together. If we want to be able to do the
functionalities separately we can do that with command line flags.
- Make the output text explain things for someone who doesn't know
what .mailmap is or anything.
- Even better, we can make the script have an interactive mode, which
asks people basic questions and updates things automatically based on
that.
- Write a longer description of everything on the wiki page, and link
to it from the development workflow.

Actually, now that running it is required, this becomes much simpler,
because the only errors will be due to git metadata mismatches for the
person running the script. Before we had it on CI, it would show
duplicate emails and missing lines for lots of different people, but
now, it should almost always be a single person. So we can make this
assumption in the output of the script.

Aaron Meurer

Chris Smith

unread,
Aug 26, 2021, 5:09:35 PM8/26/21
to sympy
>  many PRs need to make changes to the end of the same file. A simple fix is to make it so that the changes are not at the end

Can't you just put a decorator line (or comment?) on the last line so anyone's name would come before that?

/c

Aaron Meurer

unread,
Aug 26, 2021, 5:26:10 PM8/26/21
to sympy
On Thu, Aug 26, 2021 at 3:09 PM Chris Smith <smi...@gmail.com> wrote:
>
> > many PRs need to make changes to the end of the same file. A simple fix is to make it so that the changes are not at the end
>
> Can't you just put a decorator line (or comment?) on the last line so anyone's name would come before that?

I don't think that would fix it. Git can't automatically merge two
changes in the exact same place in a file, the end or otherwise.

If you think about it, say a file in master looks like

A
D
E
F

and one person's branch adds a line

A
+B
D
E
F

and another person's branch adds another line

A
+C
D
E
F

then how can git merge the two changes together? Even if it assumes
that both new lines should be there, what order should they go in,
ABCDEF or ACBDEF? Git refuses the temptation to guess and makes the
user specify how it should be done, in the form of a merge conflict
resolution.

Aaron Meurer
> --
> You received this message because you are subscribed to the Google Groups "sympy" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to sympy+un...@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/sympy/333e6bab-d872-4d89-bf1a-e4fa92637221n%40googlegroups.com.

Oscar

unread,
Aug 30, 2021, 3:46:11 PM8/30/21
to sympy
On Thursday, 26 August 2021 at 22:26:10 UTC+1 Aaron Meurer wrote:
On Thu, Aug 26, 2021 at 3:09 PM Chris Smith <smi...@gmail.com> wrote:
>
> > many PRs need to make changes to the end of the same file. A simple fix is to make it so that the changes are not at the end
>
> Can't you just put a decorator line (or comment?) on the last line so anyone's name would come before that?

I don't think that would fix it. Git can't automatically merge two
changes in the exact same place in a file, the end or otherwise.

It's actually worse than I thought. The file also lists the exact number of authors:


If new contributors A and B open PRs at the same time. Each of them will run the authors_update.py script and it will increase the number of authors by 1. Since they both make the same change to the same line there won't be a merge conflict but after merging both the number of authors will be incorrect unless the authors_update.py script is run again.

That also means that merging any one PR from a new contributor invalidates all other open PRs from new contributors. We can fix that by changing that line to say there are "more than X authors" and then X can just be updated at release time. If we also switch to alphabetical ordering of authors then that would eliminate most of the problem. The instructions for a new contributor would just be to run the script once and merge conflicts would be unlikely.

The alternative to alphabetical ordering in the AUTHORS file itself would be to have a separate file that gives names in alphabetical ordering and then that can be translated into an ordered list at release time. I think either way we would need a step at release time if the file is going to list the number of authors so maybe that's actually the best option. Perhaps in fact the alphabetically ordered file can just be the .mailmap file but it would need to list all authors whereas it currently lists only those who have ambiguous commit metadata.

Okay so here's a proposal:

List all authors in .mailmap which can be sorted alphabetically by primary name/email combination. New contributors have to add their name and email address in there (plus any alternate name/email combinations used in their commits). They should then run a script that sorts the file and commit and push that change to .mailmap. CI checks that the mailmap lists all authors and fails if not. Then at release time the AUTHORS file itself is updated keeping the ordering system that is already in use. On the release branch we can have CI fail if the AUTHORS file is not up to date but new contributors don't usually open PRs for the release branch so conflicts are unlikely. In future that would mean that merge conflicts around the AUTHORS file are unlikely and we would always have an accurate list of names of contributors.

Does that sound reasonable?

Any change here will also lead to the same problem of merge conflicts for new contributors with currently open PRs though. In fact some of those new contributors who have recent open PRs right now have already been struggling with this and would need to merge/rebase and rerun the script again.

(The reason I haven't yet written detailed instructions for how to do this outside this thread is that I expected that the process would need to be changed almost immediately.)

Oscar

Aaron Meurer

unread,
Aug 30, 2021, 5:44:20 PM8/30/21
to sympy
On Mon, Aug 30, 2021 at 1:46 PM Oscar <oscar.j....@gmail.com> wrote:
>
> On Thursday, 26 August 2021 at 22:26:10 UTC+1 Aaron Meurer wrote:
>>
>> On Thu, Aug 26, 2021 at 3:09 PM Chris Smith <smi...@gmail.com> wrote:
>> >
>> > > many PRs need to make changes to the end of the same file. A simple fix is to make it so that the changes are not at the end
>> >
>> > Can't you just put a decorator line (or comment?) on the last line so anyone's name would come before that?
>>
>> I don't think that would fix it. Git can't automatically merge two
>> changes in the exact same place in a file, the end or otherwise.
>
>
> It's actually worse than I thought. The file also lists the exact number of authors:
>
> https://github.com/sympy/sympy/blob/345c3d686c06703fad6990a61e76a5f7043a0760/AUTHORS#L7
>
> If new contributors A and B open PRs at the same time. Each of them will run the authors_update.py script and it will increase the number of authors by 1. Since they both make the same change to the same line there won't be a merge conflict but after merging both the number of authors will be incorrect unless the authors_update.py script is run again.
>
> That also means that merging any one PR from a new contributor invalidates all other open PRs from new contributors. We can fix that by changing that line to say there are "more than X authors" and then X can just be updated at release time. If we also switch to alphabetical ordering of authors then that would eliminate most of the problem. The instructions for a new contributor would just be to run the script once and merge conflicts would be unlikely.

I think we can remove the number line from the file. We can edit the
script to print the number to the terminal so that if you want to know
you can just run the script (or just count the lines, which isn't that
hard).

>
> The alternative to alphabetical ordering in the AUTHORS file itself would be to have a separate file that gives names in alphabetical ordering and then that can be translated into an ordered list at release time. I think either way we would need a step at release time if the file is going to list the number of authors so maybe that's actually the best option. Perhaps in fact the alphabetically ordered file can just be the .mailmap file but it would need to list all authors whereas it currently lists only those who have ambiguous commit metadata.
>
> Okay so here's a proposal:
>
> List all authors in .mailmap which can be sorted alphabetically by primary name/email combination. New contributors have to add their name and email address in there (plus any alternate name/email combinations used in their commits). They should then run a script that sorts the file and commit and push that change to .mailmap. CI checks that the mailmap lists all authors and fails if not. Then at release time the AUTHORS file itself is updated keeping the ordering system that is already in use. On the release branch we can have CI fail if the AUTHORS file is not up to date but new contributors don't usually open PRs for the release branch so conflicts are unlikely. In future that would mean that merge conflicts around the AUTHORS file are unlikely and we would always have an accurate list of names of contributors.
>
> Does that sound reasonable?

I think so. As long as the script is idempotent so that you can always
just run the script to make sure everything is sorted properly. I do
think making the script interactive would be even better, so that
people don't have to figure out how .mailmap works, but this should at
least fix the merge conflict problems.

Aaron Meurer

>
> Any change here will also lead to the same problem of merge conflicts for new contributors with currently open PRs though. In fact some of those new contributors who have recent open PRs right now have already been struggling with this and would need to merge/rebase and rerun the script again.
>
> (The reason I haven't yet written detailed instructions for how to do this outside this thread is that I expected that the process would need to be changed almost immediately.)
>
> Oscar
>
> --
> You received this message because you are subscribed to the Google Groups "sympy" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to sympy+un...@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/sympy/6e4efe78-bc64-41fe-bf82-f0753a17cafcn%40googlegroups.com.

Oscar Benjamin

unread,
Aug 30, 2021, 6:39:30 PM8/30/21
to sympy
On Mon, 30 Aug 2021 at 22:44, Aaron Meurer <asme...@gmail.com> wrote:
On Mon, Aug 30, 2021 at 1:46 PM Oscar <oscar.j....@gmail.com> wrote:
>
> Okay so here's a proposal:
>
> List all authors in .mailmap which can be sorted alphabetically by primary name/email combination. New contributors have to add their name and email address in there (plus any alternate name/email combinations used in their commits). They should then run a script that sorts the file and commit and push that change to .mailmap. CI checks that the mailmap lists all authors and fails if not. Then at release time the AUTHORS file itself is updated keeping the ordering system that is already in use. On the release branch we can have CI fail if the AUTHORS file is not up to date but new contributors don't usually open PRs for the release branch so conflicts are unlikely. In future that would mean that merge conflicts around the AUTHORS file are unlikely and we would always have an accurate list of names of contributors.
>
> Does that sound reasonable?

I think so. As long as the script is idempotent so that you can always
just run the script to make sure everything is sorted properly. I do
think making the script interactive would be even better, so that
people don't have to figure out how .mailmap works, but this should at
least fix the merge conflict problems.

For now a more immediate fix is just to disable this check in CI:

I'll open a new PR once I have implemented a better approach but I think it would be good just to stop causing trouble for currently open PRs right now.


Oscar
Reply all
Reply to author
Forward
0 new messages