StarCoder2, an open source LLM trained on the Software Heritage source code archive, now public

Robbie Morrison

Mar 7, 2024, 4:48:04 AM
to Openmod Initiative
Hello all

Not remotely sure how these developments can and will intersect and
interact with our modeling work and our community, but passing along
this information nonetheless.

Software Heritage (SWH) is a source code archiving initiative, somewhat
comparable to the Wayback Machine for the web. StarCoder2 is a new LLM
(large language model) for software that has been trained on the SWH
archive. StarCoder2 is open source and the full model was just recently
made public. Some URLs:

SWH view this StarCoder exercise "as the most transparent, open, and
ethical approach to building LLMs for code to date". Reliance on the
Software Heritage archive apparently allows for the long-term
availability of the data and the precise identification of its origin.
That in turn makes SWH hope that glitches and possible improvements will
surface, and be addressed, more easily. SWH itself was *not* involved in
the development of StarCoder2.

My guess is most open energy system modeling frameworks are present in
the SWH archive.

I know some on this list have been experimenting with Microsoft Copilot
not only to draft code but also to write basic models specific to a
particular framework. If anyone wants to comment, feel free to respond!

Finally, there is much work in the open source law community regarding
current FOSS licensing, the need to revise fundamental definitions for
FOSS (such as the Open Source Definition), and new and/or modified FOSS
licenses to react to, accommodate, and/or stave off new problems arising
from the AI revolution.

with best wishes, Robbie
Robbie Morrison
Schillerstraße 85
10627 Berlin

