In the area of multi-agent reinforcement learning, there is thus far no single benchmark that is widely accepted as a standard for evaluating the performance of new algorithms. One popular candidate for such a benchmark is the StarCraft Multi-Agent Challenge (SMAC) [1], a multi-agent adaptation of a mini-game within the popular real-time strategy game StarCraft II.
In this section, we provide a brief explanation of the SMAC mini-game and its rules. For a more detailed description, we refer the reader to the original SMAC paper [1]. Inside SMAC, the environment consists of a 2D rectangular map with several soldiers, called "units", each belonging to one of two teams. The goal of each team is to defeat all units belonging to the other team. One team, called "the enemy", is controlled by StarCraft II's in-game AI, while the other team, called "the allies", is controlled by reinforcement learning agents. Each unit in the allied team is controlled by a separate agent, and the agents need to work as a team to defeat the enemies. At the end of each episode (i.e. when one of the teams is eliminated or when a time limit is reached), the agents receive a shared reward based on how much damage they dealt, how many units they killed, and whether they won or lost the encounter.
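To make the reward structure more concrete, below is a minimal sketch of a SMAC-style shaped team reward. The kill and win bonuses are illustrative assumptions rather than the exact constants used by SMAC or SMAClite.

    # Minimal sketch of a SMAC-style shaped team reward.
    # The kill_bonus and win_bonus values are illustrative assumptions,
    # not the exact constants used by SMAC or SMAClite.
    def shaped_team_reward(damage_dealt: float, enemies_killed: int, won: bool,
                           kill_bonus: float = 10.0, win_bonus: float = 200.0) -> float:
        """Shared reward combining damage dealt, kills, and the battle outcome."""
        return damage_dealt + kill_bonus * enemies_killed + (win_bonus if won else 0.0)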
When designing SMAClite, we started with the basic idea that it should be a lightweight version of SMAC, "lightweight" referring primarily to the computational resources required to run it. We kept computational performance in mind throughout the project, but during the implementation process we gradually converged on a set of additional tenets that we believe are important for a good multi-agent reinforcement learning benchmark.
A major tenet that influenced our implementation is the complete decoupling of the SMAClite environment from the StarCraft II game engine. This means the environment can run on any machine with a Python interpreter, without requiring an installation of StarCraft II itself. We implemented the engine in Python because it is widely used in the field of reinforcement learning, so the code should be easy for all users to read and extend.
Next, the resulting environment needed to be easily modifiable, without any expert knowledge of StarCraft. To this end, the environment also serves as a framework capable of loading JSON files that describe the combat scenarios and units used in the game. All of the scenarios and units shipped with the environment are defined in such JSON files, which therefore also serve as examples of valid input for the framework.
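As an illustration, a custom unit definition might look roughly like the following. The field names here are hypothetical and only convey the idea; the JSON files shipped with the environment are the authoritative reference for the actual schema.

    import json

    # Hypothetical sketch of a SMAClite unit definition; the keys below are
    # illustrative assumptions, not the exact schema used by the framework.
    custom_unit = {
        "name": "marine",
        "hit_points": 45,
        "damage": 6,
        "attack_range": 5.0,
        "movement_speed": 3.15,
    }

    # Write the definition to a JSON file that the framework could load.
    with open("custom_marine.json", "w") as f:
        json.dump(custom_unit, f, indent=2)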
Finally, while keeping in mind all of the above, we wished the environment to be as close to the original SMAC as possible. This means that the environment should be able to run the same scenarios as the original SMAC and offer the same units for use in the game, while keeping as many game mechanics as possible identical to StarCraft II. It also means we kept the environment state, observation, action, and reward spaces nearly identical to the original SMAC. We demonstrate this in Section 5.3 of our paper.
Following these tenets, we believe that SMAClite is a good candidate for a lightweight benchmark for multi-agent reinforcement learning algorithms, or that it at the very least improves upon the original SMAC in the above aspects.
We implemented most of the game logic using the NumPy library for array operations in Python, and added a simple graphical interface using the Pygame library. The graphical interface, and the Pygame library itself, are not required to run the environment, but they are useful for verifying its correctness and visualizing the state of the game.
One aspect of the engine worth mentioning is the collision avoidance algorithm. In StarCraft II, in order to move in a realistic manner and not bump into each other, the in-game units use a proprietary algorithm that is not publicly available (although StarCraft's developers gave an interesting talk about it at GDC 2011). We therefore had to either implement our own algorithm mimicking the one in StarCraft II or use an open-source alternative. We opted for the latter and used the ORCA (Optimal Reciprocal Collision Avoidance) algorithm [2]. Notably, this algorithm has also been used in another commercial video game, Warhammer 40,000: Space Marine.
We make available two versions of the ORCA algorithm compatible with SMAClite. The first, written by us in NumPy, is shipped directly with the environment and serves as a direct port of the original implementation. The second is a Python wrapper around the original C++ implementation of the algorithm, which is available in this code repository; this module is a fork of pre-existing Python bindings for the algorithm, which we modified to work with our engine. The C++ version is much faster than the NumPy implementation, but requires extra installation steps. To use it in the environment, one needs to set the use_cpp_rvo2 environment flag to True.
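As a rough sketch, enabling the C++ backend when creating the environment could look like the following. The import and environment id are assumptions based on a Gym-style registration; only the use_cpp_rvo2 flag itself is taken from the description above.

    import gym
    import smaclite  # assumed to register the SMAClite environments with Gym

    # Hedged sketch: the environment id below is an assumption and may differ
    # from the ids used by the released package; use_cpp_rvo2=True selects the
    # C++ RVO2 bindings instead of the bundled NumPy implementation.
    env = gym.make("smaclite/MMM2-v0", use_cpp_rvo2=True)
    observations = env.reset()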
To install SMAClite, one can simply run pip install . in a cloned SMAClite repository. To install the optional C++ bindings, one can run python setup.py build && python setup.py install in a clone of their repository.
To train and evaluate the models, we used the EPyMARL framework made available by our research group. An example command used to train a model using the C++ version of the RVO2 algorithm is shown below.
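The command below is an illustrative reconstruction, assuming EPyMARL's standard Sacred-style invocation with its Gym wrapper; the configuration name, time limit, environment key, and argument pass-through are assumptions and may need adjusting to the released code.

    python src/main.py --config=mappo --env-config=gymma with env_args.time_limit=120 env_args.key="smaclite/MMM2-v0" env_args.use_cpp_rvo2=True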
We trained a number of multi-agent reinforcement learning algorithms on various scenarios within the SMAClite environment. For each scenario/algorithm combination, we used 10 different seeds (0 to 9) to account for the stochasticity of the environment, since in each game step the units execute their commands in random order. The results can be seen in the image below. Note that the graph formatting is identical to that in our recent benchmark paper [3], allowing for a direct comparison with the results observed in the original SMAC.
In general, the ranking of the algorithms within the scope of each scenario is similar in SMAClite and SMAC. This is one of the reasons we believe that SMAClite poses an equivalent challenge to the original SMAC.
In this section, we present and briefly describe a few interesting behaviours observed in the SMAClite environment. The videos below use the Pygame graphical interface and show the behaviours of agents trained using EPyMARL. The agents were trained for 4 million timesteps in the first two scenarios, and 1 million in the last one.
As seen in the video, the agents learned to "attack and run", at all times keeping their distance from the zealots. This strategy is effective, as the zealots are unable to catch up to the stalkers, and the stalkers can deal damage from a safe distance most of the time. Notably, this technique, called kiting, is also used by human players in StarCraft II and many other real-time strategy games.
Notably, the agents learned to exploit the environment's reward scheme to exceed the theoretical maximum reward for this scenario (this can also be seen in the graphs above). Since the enemy medivac is capable of healing enemy units, and the agents are rewarded for dealing damage to enemy units, keeping the enemy medivac alive longer increases the total reward that can be acquired. This is why the agents in the video above keep the enemy medivac alive for as long as possible, and only attack it when it is the last enemy unit left. Note that this behaviour is not present in the original SMAC, where enemy healing incurs a negative reward, but we decided to keep the interaction in SMAClite, as we find it interesting. It can also be noted that, because the handwritten enemy AI is programmed to prioritise healers when attacking, the agents' medivac stays at a safe distance until all enemy marines are dead.
StarCraft and StarCraft-like environments pose extremely interesting and versatile challenges in multi-agent reinforcement learning. We hope that the highly customizable and extensible nature of SMAClite inspires the research community to experiment more with various real-time strategy challenges, not necessarily confined to what StarCraft II has to offer. It would be interesting to experiment with advanced terrain setups, new custom unit types, and more.
One particularly desirable improvement is adapting the stochasticity changes introduced by the authors of SMAC v2 [4]. This version of the SMAC environment randomizes the allied unit types and their starting positions from episode to episode, making it a more difficult challenge for agents to beat.
The full report on the project can be found in the paper, which provides more detailed descriptions of the environment and the algorithms used, as well as details of all the experiments conducted. All technical details are documented together with the code in the repository.