Multi-agent RL

class marl.marl.MAS(agents_list=[], name='mas')[source]

Bases: object

The class of a multi-agent system.

Parameters
  • agents_list – (list) The list of agents in the MAS

  • name – (str) The name of the system

append(agent)[source]

Add an agent to the system.

Parameters

agent – (Agent) The agent to be added

action(observation)[source]

Return the joint action.

Parameters

observation – The joint observation

get_by_name(name)[source]
get_by_id(id)[source]
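A minimal sketch of building a MAS and querying a joint action. The agent class, spaces and observations here are placeholders (any Agent subclass should work), and the joint observation is assumed to be a list with one observation per agent:

    import gym
    from marl.marl import MAS
    from marl.agent.q_agent import QTableAgent

    obs_sp = gym.spaces.Discrete(10)      # placeholder observation space
    act_sp = gym.spaces.Discrete(4)       # placeholder action space

    agent1 = QTableAgent(obs_sp, act_sp, name="agent1")
    agent2 = QTableAgent(obs_sp, act_sp, name="agent2")

    mas = MAS(agents_list=[agent1], name="my-mas")
    mas.append(agent2)                    # add a second agent to the system

    joint_obs = [0, 0]                    # one observation per agent (assumption)
    joint_action = mas.action(joint_obs)  # one action per agent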
class marl.marl.MARL(agents_list=[], name='marl')[source]

Bases: marl.agent.agent.TrainableAgent, marl.marl.MAS

The class for multi-agent reinforcement learning.

Parameters
  • agents_list – (list) The list of agents in the MARL model

  • name – (str) The name of the system

store_experience(*args)[source]

Store a transition in the experience buffer.

update_model(t)[source]

Update the model.

reset_exploration(nb_timesteps)[source]

Reset the exploration process.

update_exploration(t)[source]

Update the exploration process.

action(observation)[source]

Return an action given an observation (the action is selected according to the exploration process).

Parameters

observation – The observation

greedy_action(observation)[source]

Return the greedy action given an observation.

Parameters

observation – The observation

save_policy(folder='.', filename='', timestep=None)[source]

Save the policy in a file called ‘<filename>-<agent_name>-<timestep>’.

Parameters
  • folder – (str) The path to the directory where to save the model(s)

  • filename – (str) A specific name for the file (ex: ‘test2’)

  • timestep – (int) The current timestep

load_model(filename)[source]
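End-to-end usage typically wraps the individual trainable agents in a MARL system and calls the inherited learn() and test() methods. A hedged sketch (the environment helper is hypothetical and stands for any multi-agent Gym-style environment):

    from marl.marl import MARL
    from marl.agent.q_agent import QTableAgent

    env = make_multiagent_env()           # hypothetical helper returning a Gym-style env
    obs_sp, act_sp = env.observation_space, env.action_space

    agent1 = QTableAgent(obs_sp, act_sp, name="agent1")
    agent2 = QTableAgent(obs_sp, act_sp, name="agent2")

    mas = MARL(agents_list=[agent1, agent2], name="marl")

    mas.learn(env, nb_timesteps=100000)   # train all agents jointly
    mas.test(env, nb_episodes=10)         # greedy evaluation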

Agents

Base Agent

class marl.agent.agent.Agent(policy, name='UnknownAgent')[source]

Bases: object

The class of generic agent.

Parameters
  • policy – (Policy) The policy of the agent

  • name – (str) The name of the agent

agents = {'DDPGAgent': <marl.tools.ClassSpec object>, 'DQNAgent': <marl.tools.ClassSpec object>, 'DeepACAgent': <marl.tools.ClassSpec object>, 'MADDPGAgent': <marl.tools.ClassSpec object>, 'MinimaxQAgent': <marl.tools.ClassSpec object>, 'PHCAgent': <marl.tools.ClassSpec object>, 'QTableAgent': <marl.tools.ClassSpec object>}
counter = 0
action(observation)[source]

Return the action given an observation.

Parameters

observation – The observation

greedy_action(observation)[source]

Return the greedy action given an observation.

Parameters

observation – The observation

test(env, nb_episodes=1, max_num_step=200, render=True, time_laps=0.0)[source]

Test a model.

Parameters
  • env – (Gym) The environment

  • nb_episodes – (int) The number of episodes to test

  • max_num_step – (int) The maximum number of steps before stopping an episode

  • render – (bool) Whether to visualize the test (using the environment’s render function)

classmethod make(id, **kwargs)[source]
classmethod register(id, entry_point, **kwargs)[source]
classmethod available()[source]
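The registry class methods let agents be created from string identifiers (the identifiers come from the agents dict shown above). A short sketch, assuming obs_sp and act_sp are gym.Spaces objects defined as in the MAS example above:

    from marl.agent.agent import Agent

    print(Agent.available())              # list of registered agent identifiers

    # make() instantiates a registered agent; extra kwargs are forwarded to its constructor
    agent = Agent.make('QTableAgent',
                       observation_space=obs_sp,
                       action_space=act_sp)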
class marl.agent.agent.TrainableAgent(policy, observation_space=None, action_space=None, model=None, experience='ReplayMemory-10000', exploration='EpsGreedy', gamma=0.99, lr=0.001, batch_size=32, name='TrainableAgent')[source]

Bases: marl.agent.agent.Agent

The class of trainable agent.

Parameters
  • policy – (Policy) The policy

  • observation_space – (gym.Spaces) The observation space

  • action_space – (gym.Spaces) The action space

  • experience – (Experience) The experience memory data structure

  • exploration – (Exploration) The exploration process

  • lr – (float) The learning rate

  • gamma – (float) The discount factor

  • batch_size – (int) The size of a batch

  • name – (str) The name of the agent

property observation_space
property action_space
store_experience(*args)[source]

Store a transition in the experience buffer.

update_model(t)[source]

Update the model.

reset_exploration(nb_timesteps)[source]

Reset the exploration process.

update_exploration(t)[source]

Update the exploration process.

action(observation)[source]

Return an action given an observation (the action is selected according to the exploration process).

Parameters

observation – The observation

save_policy(folder='.', filename='', timestep=None)[source]

Save the policy in a file called ‘<filename>-<agent_name>-<timestep>’.

Parameters
  • filename – (str) A specific name for the file (ex: ‘test2’)

  • timestep – (int) The current timestep

save_all()[source]
learn(env, nb_timesteps, max_num_step=100, test_freq=1000, save_freq=1000, save_folder='models', render=False, time_laps=0.0, verbose=1)[source]

Start the learning process; a usage sketch follows the parameter list.

Parameters
  • env – (Gym) The environment

  • nb_timesteps – (int) The total duration (in number of steps)

  • max_num_step – (int) The maximum number of steps before stopping an episode

  • test_freq – (int) The frequency of testing the model

  • save_freq – (int) The frequency of saving the model
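A hedged sketch of the training call for a single trainable agent (env and agent are assumed to be constructed as in the examples above):

    agent.learn(env,
                nb_timesteps=50000,    # total training duration in steps
                max_num_step=100,      # cap on episode length
                test_freq=1000,        # evaluate the current policy every 1000 steps
                save_freq=1000,        # checkpoint the policy every 1000 steps
                save_folder="models",
                render=False,
                verbose=1)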

class marl.agent.agent.MATrainable(mas, index)[source]

Bases: object

set_mas(mas)[source]
marl.agent.agent.register(id, entry_point, **kwargs)[source]
marl.agent.agent.make(id, **kwargs)[source]
marl.agent.agent.available()[source]

Q-value based model

class marl.agent.q_agent.QAgent(qmodel, observation_space, action_space, experience='ReplayMemory-1', exploration='EpsGreedy', gamma=0.99, lr=0.1, batch_size=1, target_update_freq=None, name='QAgent')[source]

Bases: marl.agent.agent.TrainableAgent

The class of trainable agent using Q-value-based methods

Parameters
  • qmodel – (Model or torch.nn.Module) The q-value model

  • observation_space – (gym.Spaces) The observation space

  • action_space – (gym.Spaces) The action space

  • experience – (Experience) The experience memory data structure

  • exploration – (Exploration) The exploration process

  • gamma – (float) The discount factor

  • lr – (float) The learning rate

  • batch_size – (int) The size of a batch

  • target_update_freq – (int) The update frequency of the target model

  • name – (str) The name of the agent

update_model(t)[source]

Update the model.

Parameters

t – (int) The current timestep

target(Q, batch)[source]

Compute the target value.

Parameters
  • Q – (Model or torch.nn.Module) The model of the Q-value

  • batch – (list) A list containing the required information about the batch

value(observation, action)[source]

Compute the value.

Parameters
  • observation – The observation

  • action – The action

update_q(curr_value, target_value, batch)[source]

Update the Q value.

Parameters
  • curr_value – (torch.Tensor) The current value

  • target_value – (torch.Tensor) The target value

  • batch – (list) A list containing the required information about the batch

update_target_model()[source]

Update the target model.

class marl.agent.q_agent.QTableAgent(observation_space, action_space, exploration='EpsGreedy', gamma=0.99, lr=0.1, target_update_freq=None, name='QTableAgent')[source]

Bases: marl.agent.q_agent.QAgent

The class of trainable agent using a Q-table to model the Q function

Parameters
  • observation_space – (gym.Spaces) The observation space

  • action_space – (gym.Spaces) The action space

  • exploration – (Exploration) The exploration process

  • gamma – (float) The discount factor

  • lr – (float) The learning rate

  • target_update_freq – (int) The update frequency of the target model

  • name – (str) The name of the agent

update_q(curr_value, target_value, batch)[source]

Update the Q value.

Parameters
  • curr_value – (torch.Tensor) The current value

  • target_value – (torch.Tensor) The target value

  • batch – (list) A list containing the required information about the batch

update_target_model()[source]

Update the target model.

target(Q, batch)[source]

Compute the target value.

Parameters
  • Q – (Model or torch.nn.Module) The model of the Q-value

  • batch – (list) A list containing the required information about the batch

value(observation, action)[source]

Compute the value.

Parameters
  • observation – The observation

  • action – The action

class marl.agent.q_agent.MinimaxQAgent(observation_space, my_action_space, other_action_space, index=None, mas=None, exploration='EpsGreedy', gamma=0.99, lr=0.1, target_update_freq=None, name='MinimaxQAgent')[source]

Bases: marl.agent.q_agent.QAgent, marl.agent.agent.MATrainable

The class of trainable agent using the tabular minimax-Q algorithm

Parameters
  • observation_space – (gym.Spaces) The observation space

  • my_action_space – (gym.Spaces) My action space

  • other_action_space – (gym.Spaces) The action space of the other agent

  • index – (int) The position of the agent in the list of agents

  • mas – (marl.agent.MAS) The multi-agent system corresponding to the agent

  • exploration – (Exploration) The exploration process

  • gamma – (float) The discount factor

  • lr – (float) The learning rate

  • target_update_freq – (int) The update frequency of the target model

  • name – (str) The name of the agent

update_q(curr_value, target_value, batch)[source]

Update the Q value.

Parameters
  • curr_value – (torch.Tensor) The current value

  • target_value – (torch.Tensor) The target value

  • batch – (list) A list containing the required information about the batch

update_target_model()[source]

Update the target model.

target(Q, joint_batch)[source]

Compute the target value.

Parameters
  • Q – (Model or torch.nn.Module) The model of the Q-value

  • joint_batch – (list) A list containing the required information about the joint batch

value(observation, action)[source]

Compute the value.

Parameters
  • observation – The observation

  • action – The action
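In a two-player setting, each minimax-Q agent needs its own and the opponent's action spaces, its index in the agent list, and a reference to the surrounding MAS. A hedged sketch with placeholder spaces:

    from marl.marl import MARL
    from marl.agent.q_agent import MinimaxQAgent

    agent1 = MinimaxQAgent(obs_sp, act_sp, act_sp, index=0, name="player1")
    agent2 = MinimaxQAgent(obs_sp, act_sp, act_sp, index=1, name="player2")

    mas = MARL(agents_list=[agent1, agent2])

    # MATrainable agents need to know the MAS they belong to
    agent1.set_mas(mas)
    agent2.set_mas(mas)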

class marl.agent.q_agent.DQNAgent(qmodel, observation_space, action_space, experience='ReplayMemory-10000', exploration='EpsGreedy', gamma=0.99, lr=0.0005, batch_size=32, target_update_freq=1000, name='DQNAgent')[source]

Bases: marl.agent.q_agent.QAgent

The class of trainable agent using a neural network to model the Q function

Parameters
  • qmodel – (Model or torch.nn.Module) The q-value model

  • observation_space – (gym.Spaces) The observation space

  • action_space – (gym.Spaces) The action space

  • experience – (Experience) The experience memory data structure

  • exploration – (Exploration) The exploration process

  • gamma – (float) The discount factor

  • lr – (float) The learning rate

  • batch_size – (int) The size of a batch

  • target_update_freq – (int) The update frequency of the target model

  • name – (str) The name of the agent

update_q(curr_value, target_value, batch)[source]

Update the Q value.

Parameters
  • curr_value – (torch.Tensor) The current value

  • target_value – (torch.Tensor) The target value

  • batch – (list) A list containing the required information about the batch

update_target_model()[source]

Update the target model.

target(Q, batch)[source]

Compute the target value.

Parameters
  • Q – (Model or torch.nn.Module) The model of the Q-value

  • batch – (list) A list containing the required information about the batch

value(observation, action)[source]

Compute the value.

Parameters
  • observation – The observation

  • action – The action
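A hedged sketch of a DQN agent backed by the MlpNet model documented under "Neural network model" (sizes and spaces are placeholders; MlpNet is assumed to output one Q-value per discrete action):

    from marl.agent.q_agent import DQNAgent
    from marl.model.nn.mlpnet import MlpNet

    obs_size, n_actions = 8, 4                       # placeholder dimensions

    qnet = MlpNet(obs_size, n_actions, hidden_size=[64, 64])

    agent = DQNAgent(qmodel=qnet,
                     observation_space=obs_sp,
                     action_space=act_sp,
                     experience='ReplayMemory-10000',
                     exploration='EpsGreedy',
                     gamma=0.99,
                     lr=0.0005,
                     batch_size=32,
                     target_update_freq=1000)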

class marl.agent.q_agent.ContinuousDQNAgent(qmodel, actor_policy, observation_space, action_space, experience='ReplayMemory-10000', exploration='EpsGreedy', gamma=0.99, lr=0.0005, batch_size=32, target_update_freq=1000, name='DQNAgent')[source]

Bases: marl.agent.q_agent.DQNAgent

The class of trainable agent using a neural network to model the Q function for continuous action spaces

Parameters
  • qmodel – (Model or torch.nn.Module) The q-value model

  • observation_space – (gym.Spaces) The observation space

  • action_space – (gym.Spaces) The action space

  • experience – (Experience) The experience memory data structure

  • exploration – (Exploration) The exploration process

  • gamma – (float) The discount factor

  • lr – (float) The learning rate

  • batch_size – (int) The size of a batch

  • target_update_freq – (int) The update frequency of the target model

  • name – (str) The name of the agent

target(Q, batch)[source]

Compute the target value.

Parameters
  • Q – (Model or torch.nn.Module) The model of the Q-value

  • batch – (list) A list containing the required information about the batch

value(observation, action)[source]

Compute the value.

Parameters
  • observation – The observation

  • action – The action

Policy Gradient based model

class marl.agent.pg_agent.PGAgent(critic, actor_policy, observation_space, action_space, actor_model=None, experience='ReplayMemory-1000', exploration='EpsGreedy', lr_actor=0.001, gamma=0.95, batch_size=32, target_update_freq=None, name='PGAgent')[source]

Bases: marl.agent.agent.TrainableAgent

The class of generic trainable agent using policy-based methods

Parameters
  • critic – (QAgent) The critic agent

  • actor_policy – (Policy) The policy for the actor

  • actor_model – (Model or nn.Module) The model for the actor

  • observation_space – (gym.Spaces) The observation space

  • action_space – (gym.Spaces) The action space

  • experience – (Experience) The experience memory data structure

  • exploration – (Exploration) The exploration process

  • lr_actor – (float) The learning rate for the actor

  • lr_critic – (float) The learning rate for the critic

  • gamma – (float) The discount factor

  • batch_size – (int) The size of a batch

  • target_update_freq – (int) The update frequency of the target model

  • name – (str) The name of the agent

property lr_actor
property lr_critic
update_model(t)[source]

Update the model.

Parameters

t – (int) The current timestep

update_target_policy()[source]

Update the target policy.

update_actor(batch)[source]

Update the actor.

class marl.agent.pg_agent.DeepACAgent(critic_model, actor_model, observation_space, action_space, experience='ReplayMemory-1000', exploration='EpsGreedy', lr_actor=0.001, lr_critic=0.001, gamma=0.95, batch_size=32, target_update_freq=None, name='DeepACAgent')[source]

Bases: marl.agent.pg_agent.PGAgent

Deep Actor-Critic Agent class. The critic is trained following the DQN algorithm and the policy is represented by a neural network with a softmax output.

Parameters
  • critic_model – (nn.Module) The critic’s model

  • actor_model – (Model or nn.Module) The model for the actor

  • observation_space – (gym.Spaces) The observation space

  • action_space – (gym.Spaces) The action space

  • experience – (Experience) The experience memory data structure

  • exploration – (Exploration) The exploration process

  • lr_actor – (float) The learning rate for the actor

  • lr_critic – (float) The learning rate for the critic

  • gamma – (float) The discount factor

  • batch_size – (int) The size of a batch

  • target_update_freq – (int) The update frequency of the target model

  • name – (str) The name of the agent

update_target_policy()[source]

Update the target policy.

update_actor(batch)[source]

Update the actor.

class marl.agent.pg_agent.PHCAgent(observation_space, action_space, exploration='EpsGreedy', delta=0.01, lr_critic=0.01, gamma=0.95, target_update_freq=None, name='PHCAgent')[source]

Bases: marl.agent.pg_agent.PGAgent

Policy Hill Climbing Agent’s class. The critic is trained following the standard Q-learning algorithm.

Parameters
  • observation_space – (gym.Spaces) The observation space

  • action_space – (gym.Spaces) The action space

  • exploration – (Exploration) The exploration process

  • delta – (float) The learning rate for the actor

  • lr_critic – (float) The learning rate for the critic

  • gamma – (float) The discount factor

  • target_update_freq – (int) The update frequency of the target model

  • name – (str) The name of the agent

update_target_policy()[source]

Update the target policy.

property delta
update_actor(batch)[source]

Update the actor.

class marl.agent.pg_agent.DDPGAgent(critic_model, actor_model, observation_space, action_space, experience='ReplayMemory-1000', exploration='OUNoise', lr_actor=0.01, lr_critic=0.01, gamma=0.95, batch_size=32, target_update_freq=None, name='DDPGAgent')[source]

Bases: marl.agent.pg_agent.PGAgent

Deep Deterministic Policy Gradient Agent’s class. The critic is trained following the standard “SARSA” algorithm (ContinuousDQN).

Parameters
  • critic_model – (nn.Module) The critic’s model

  • actor_model – (nn.Module) The model for the actor

  • observation_space – (gym.Spaces) The observation space

  • action_space – (gym.Spaces) The action space

  • experience – (Experience) The experience memory data structure

  • exploration – (Exploration) The exploration process

  • lr_actor – (float) The learning rate for the actor

  • lr_critic – (float) The learning rate for the critic

  • gamma – (float) The discount factor

  • batch_size – (int) The size of a batch

  • target_update_freq – (int) The update frequency of the target model

  • name – (str) The name of the agent

update_target_policy()[source]

Update the target policy.

update_actor(batch)[source]

Update the actor.
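A hedged sketch of a DDPG agent for a continuous action space, pairing a deterministic actor with the ContinuousCritic network documented under "Neural network model" (dimensions and network choices are illustrative):

    from marl.agent.pg_agent import DDPGAgent
    from marl.model.nn.mlpnet import MlpNet, ContinuousCritic

    obs_size, act_size = 24, 4                       # placeholder dimensions

    critic = ContinuousCritic(obs_size, act_size)    # models Q(s, a)
    actor = MlpNet(obs_size, act_size)               # deterministic policy network

    agent = DDPGAgent(critic_model=critic,
                      actor_model=actor,
                      observation_space=obs_sp,      # assumed continuous (gym.spaces.Box)
                      action_space=act_sp,
                      exploration='OUNoise',
                      lr_actor=0.01,
                      lr_critic=0.01,
                      gamma=0.95,
                      batch_size=32)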

Multi-agent Policy Gradient based model

class marl.agent.maac_agent.MAPGAgent(critic_model, actor_policy, observation_space, action_space, actor_model=None, index=None, mas=None, experience='ReplayMemory-1000', exploration='EpsGreedy', lr_actor=0.001, lr_critic=0.001, gamma=0.95, batch_size=32, tau=0.01, use_target_net=False, name='MAACAgent')[source]

Bases: marl.agent.agent.TrainableAgent, marl.agent.agent.MATrainable

The class of trainable agent using multi-agent policy gradient methods.

Parameters
  • critic_model – (Model or torch.nn.Module) The critic model

  • actor_policy – (Policy) actor policy

  • actor_model – (Model or torch.nn.Module) The actor model

  • observation_space – (gym.Spaces) The observation space

  • action_space – (gym.Spaces) The action space

  • index – (int) The index of the agent in the multi-agent system

  • mas – (MARL) The multi-agent system in which the agent is included

  • experience – (Experience) The experience memory data structure

  • exploration – (Exploration) The exploration process

  • lr_actor – (float) The learning rate for each actor

  • lr_critic – (float) The learning rate for each critic

  • gamma – (float) The discount factor

  • batch_size – (int) The batch size

  • tau – (float) The update rate

  • name – (str) The name of the agent

soft_update(local_model, target_model, tau)[source]
update_model(t)[source]

Update the model.

update_critic(local_batch, global_batch)[source]
target(local_batch, global_batch)[source]
class marl.agent.maac_agent.MAACAgent(critic_model, actor_model, observation_space, action_space, index=None, experience='ReplayMemory-1000', exploration='EpsGreedy', lr_actor=0.001, lr_critic=0.001, gamma=0.95, batch_size=32, tau=0.01, use_target_net=False, name='MAACAgent')[source]

Bases: marl.agent.maac_agent.MAPGAgent

The class of trainable agent using multi-agent actor-critic methods.

Parameters
  • critic_model – (Model or torch.nn.Module) The critic model

  • actor_model – (Model or torch.nn.Module) The actor model

  • observation_space – (gym.Spaces) The observation space

  • action_space – (gym.Spaces) The action space

  • index – (int) The index of the agent in the multi-agent system

  • mas – (MARL) The multi-agent system in which the agent is included

  • experience – (Experience) The experience memory data structure

  • exploration – (Exploration) The exploration process

  • lr_actor – (float) The learning rate for each actor

  • lr_critic – (float) The learning rate for each critic

  • gamma – (float) The discount factor

  • batch_size – (int) The batch size

  • tau – (float) The update rate

  • use_target_net – (bool) If True, use a target model

  • name – (str) The name of the agent

update_actor(local_batch, global_batch)[source]
class marl.agent.maac_agent.MADDPGAgent(critic_model, actor_model, observation_space, action_space, index=None, experience='ReplayMemory-1000', exploration='OUNoise', lr_actor=0.01, lr_critic=0.01, gamma=0.95, batch_size=32, tau=0.01, use_target_net=100, name='MADDPGAgent')[source]

Bases: marl.agent.maac_agent.MAPGAgent

The class of trainable agent using multi-agent deep deterministic policy gradient methods.

Parameters
  • critic_model – (Model or torch.nn.Module) The critic model

  • actor_model – (Model or torch.nn.Module) The actor model

  • observation_space – (gym.Spaces) The observation space

  • action_space – (gym.Spaces) The action space

  • index – (int) The index of the agent in the multi-agent system

  • mas – (MARL) The multi-agent system in which the agent is included

  • experience – (Experience) The experience memory data structure

  • exploration – (Exploration) The exploration process

  • lr_actor – (float) The learning rate for each actor

  • lr_critic – (float) The learning rate for each critic

  • gamma – (float) The discount factor

  • batch_size – (int) The batch size

  • tau – (float) The update rate

  • use_target_net – (bool) If True, use a target model

  • name – (str) The name of the agent

update_actor(local_batch, global_batch)[source]
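Centralized-critic agents such as MADDPGAgent are meant to live inside a MARL system so that each critic can be updated from the joint batch. A hedged sketch for two agents (dimensions and networks are illustrative):

    from marl.marl import MARL
    from marl.agent.maac_agent import MADDPGAgent
    from marl.model.nn.mlpnet import MlpNet, ContinuousCritic

    n_agents, obs_size, act_size = 2, 16, 2          # placeholder dimensions

    agents = []
    for i in range(n_agents):
        # the centralized critic is assumed to take the joint observation and joint action
        critic = ContinuousCritic(n_agents * obs_size, n_agents * act_size)
        actor = MlpNet(obs_size, act_size)
        agents.append(MADDPGAgent(critic_model=critic,
                                  actor_model=actor,
                                  observation_space=obs_sp,
                                  action_space=act_sp,
                                  index=i,
                                  name="maddpg-" + str(i)))

    mas = MARL(agents_list=agents)
    for ag in agents:
        ag.set_mas(mas)                              # MATrainable agents need the surrounding MAS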

Experience

Experience

class marl.experience.experience.Experience[source]

Bases: object

experience = {'PrioritizedReplayMemory': <marl.tools.ClassSpec object>, 'PrioritizedReplayMemory-1': <marl.tools.ClassSpec object>, 'PrioritizedReplayMemory-100': <marl.tools.ClassSpec object>, 'PrioritizedReplayMemory-1000': <marl.tools.ClassSpec object>, 'PrioritizedReplayMemory-10000': <marl.tools.ClassSpec object>, 'PrioritizedReplayMemory-100000': <marl.tools.ClassSpec object>, 'PrioritizedReplayMemory-2000': <marl.tools.ClassSpec object>, 'PrioritizedReplayMemory-30000': <marl.tools.ClassSpec object>, 'PrioritizedReplayMemory-500': <marl.tools.ClassSpec object>, 'PrioritizedReplayMemory-5000': <marl.tools.ClassSpec object>, 'PrioritizedReplayMemory-50000': <marl.tools.ClassSpec object>, 'RNNPrioritizedReplayMemory': <marl.tools.ClassSpec object>, 'RNNPrioritizedReplayMemory-1': <marl.tools.ClassSpec object>, 'RNNPrioritizedReplayMemory-100': <marl.tools.ClassSpec object>, 'RNNPrioritizedReplayMemory-1000': <marl.tools.ClassSpec object>, 'RNNPrioritizedReplayMemory-10000': <marl.tools.ClassSpec object>, 'RNNPrioritizedReplayMemory-100000': <marl.tools.ClassSpec object>, 'RNNPrioritizedReplayMemory-2000': <marl.tools.ClassSpec object>, 'RNNPrioritizedReplayMemory-30000': <marl.tools.ClassSpec object>, 'RNNPrioritizedReplayMemory-500': <marl.tools.ClassSpec object>, 'RNNPrioritizedReplayMemory-5000': <marl.tools.ClassSpec object>, 'RNNPrioritizedReplayMemory-50000': <marl.tools.ClassSpec object>, 'RNNReplayMemory': <marl.tools.ClassSpec object>, 'RNNReplayMemory-1': <marl.tools.ClassSpec object>, 'RNNReplayMemory-100': <marl.tools.ClassSpec object>, 'RNNReplayMemory-1000': <marl.tools.ClassSpec object>, 'RNNReplayMemory-10000': <marl.tools.ClassSpec object>, 'RNNReplayMemory-100000': <marl.tools.ClassSpec object>, 'RNNReplayMemory-2000': <marl.tools.ClassSpec object>, 'RNNReplayMemory-30000': <marl.tools.ClassSpec object>, 'RNNReplayMemory-500': <marl.tools.ClassSpec object>, 'RNNReplayMemory-5000': <marl.tools.ClassSpec object>, 'RNNReplayMemory-50000': <marl.tools.ClassSpec object>, 'ReplayMemory': <marl.tools.ClassSpec object>, 'ReplayMemory-1': <marl.tools.ClassSpec object>, 'ReplayMemory-100': <marl.tools.ClassSpec object>, 'ReplayMemory-1000': <marl.tools.ClassSpec object>, 'ReplayMemory-10000': <marl.tools.ClassSpec object>, 'ReplayMemory-100000': <marl.tools.ClassSpec object>, 'ReplayMemory-2000': <marl.tools.ClassSpec object>, 'ReplayMemory-30000': <marl.tools.ClassSpec object>, 'ReplayMemory-500': <marl.tools.ClassSpec object>, 'ReplayMemory-5000': <marl.tools.ClassSpec object>, 'ReplayMemory-50000': <marl.tools.ClassSpec object>}
push(*args)[source]
sample(batch_siz=1)[source]
none_transition()[source]
classmethod make(id, **kwargs)[source]
classmethod register(id, entry_point, **kwargs)[source]
classmethod available()[source]
marl.experience.experience.register(id, entry_point, **kwargs)[source]
marl.experience.experience.make(id, **kwargs)[source]
marl.experience.experience.available()[source]
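Experience buffers can be instantiated from the registered identifiers listed above; the '<Name>-<capacity>' identifiers preset the buffer capacity. A short sketch:

    from marl.experience.experience import Experience

    buffer = Experience.make('ReplayMemory-10000')              # replay buffer, capacity 10000
    per_buffer = Experience.make('PrioritizedReplayMemory-10000')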

ReplayBuffer

class marl.experience.replay_buffer.ReplayMemory(capacity, transition_type='FFTransition')[source]

Bases: marl.experience.experience.Experience

push(*transition)[source]
sample(batch_size=1)[source]
get_transition(index)[source]
sample_index(batch_size)[source]
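A minimal sketch of direct buffer usage. The transition layout is an assumption based on the default 'FFTransition' type and on PrioritizedReplayMemory.push_transition below (observation, action, reward, done_flag, next_observation); the variables come from an environment step:

    from marl.experience.replay_buffer import ReplayMemory

    memory = ReplayMemory(capacity=1000)

    # obs, action, reward, done, next_obs are placeholders from one environment step
    memory.push(obs, action, reward, done, next_obs)

    batch = memory.sample(batch_size=32)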
class marl.experience.replay_buffer.PrioritizedReplayMemory(capacity, alpha=0.6, beta=0.4, eps=1e-06, transition_type='FFTransition')[source]

Bases: marl.experience.experience.Experience

beta_increment_per_sampling = 0.001
property capacity
push(error, *transition)[source]
push_error(error)[source]
push_transition(observation, action, reward, done_flag, next_observation)[source]
sample(batch_size=1)[source]
update(idx, error)[source]

Exploration

Exploration

class marl.exploration.expl_process.ExplorationProcess[source]

Bases: object

The generic exploration class

process = {'EpsGreedy': <marl.tools.ClassSpec object>, 'EpsGreedy-cst001': <marl.tools.ClassSpec object>, 'EpsGreedy-cst002': <marl.tools.ClassSpec object>, 'EpsGreedy-cst01': <marl.tools.ClassSpec object>, 'EpsGreedy-cst02': <marl.tools.ClassSpec object>, 'EpsGreedy-cst05': <marl.tools.ClassSpec object>, 'EpsGreedy-cst1': <marl.tools.ClassSpec object>, 'EpsGreedy-lin': <marl.tools.ClassSpec object>, 'Greedy': <marl.tools.ClassSpec object>, 'OUNoise': <marl.tools.ClassSpec object>}
reset(training_duration)[source]

Initialize some additional values and reset the others

Parameters

training_duration – (int) Number of timesteps while training

update(t)[source]

If required, update the exploration parameters

Parameters

t – (int) The current timestep

__call__()[source]

Call self as a function.

classmethod make(id, *args, **kwargs)[source]
classmethod register(id, entry_point, **kwargs)[source]
classmethod available()[source]
marl.exploration.expl_process.register(id, entry_point, **kwargs)[source]
marl.exploration.expl_process.make(id, **kwargs)[source]
marl.exploration.expl_process.available()[source]

Eps-Greedy

class marl.exploration.greedy.Greedy[source]

Bases: marl.exploration.eps_greedy.EpsGreedy

The Greedy process

Parameters
  • eps_deb – (float) The initial amount of exploration

  • eps_fin – (float) The final amount of exploration

__call__(policy, observation)[source]

Choose an action according to the policy and the exploration rate

class marl.exploration.eps_greedy.EpsGreedy(eps_deb=1.0, eps_fin=0.1, deb_expl=0.1, fin_expl=0.9)[source]

Bases: marl.exploration.expl_process.ExplorationProcess

The epsilon-greedy exploration class

Parameters
  • eps_deb – (float) The initial amount of exploration to process

  • eps_fin – (float) The final amount of exploration to process

  • deb_expl – (float) The percentage of time before starting exploration (default: 0.1)

  • fin_expl – (float) The percentage of time after which exploration ends (default: 0.9)

reset(training_duration)[source]

Reinitialize some parameters

update(t)[source]

Update epsilon linearly

__call__(policy, observation)[source]

Choose an action according to the policy and the exploration rate
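The process is reset with the total training duration, updated every timestep, and called with a policy and an observation to pick an action; epsilon is interpolated from eps_deb to eps_fin over the fraction of training delimited by deb_expl and fin_expl. A hedged sketch of manual use (policy and observation are placeholders):

    from marl.exploration.eps_greedy import EpsGreedy

    expl = EpsGreedy(eps_deb=1.0, eps_fin=0.1, deb_expl=0.1, fin_expl=0.9)
    expl.reset(training_duration=10000)       # the epsilon schedule spans 10000 steps

    for t in range(10000):
        expl.update(t)                        # linear decay of epsilon
        action = expl(policy, observation)    # epsilon-greedy action selection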

Ornstein–Uhlenbeck Process

class marl.exploration.ou_noise.OUNoise(size, dt=0.01, mu=0.0, theta=0.15, sigma=0.2)[source]

Bases: marl.exploration.expl_process.ExplorationProcess

The Ornstein-Uhlenbeck process.

Parameters
  • size – (int) The number of variables to add noise to

  • seed – (float) The seed

  • mu – (float) The drift term

  • theta – (float) The rate at which the state reverts towards the mean mu

  • sigma – (float) The amount of noise

reset(t=None)[source]

Reinitialize the state of the process

update(t)[source]

If required, update the exploration parameters

Parameters

t – (int) The current timestep

__call__(policy, observation)[source]

Call self as a function.

sample()[source]

Update internal state and return it as a noise sample.
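A hedged sketch of the noise process used on its own; in an agent it is normally attached through the exploration argument (e.g. exploration='OUNoise' in DDPGAgent above):

    from marl.exploration.ou_noise import OUNoise

    noise = OUNoise(size=4, mu=0.0, theta=0.15, sigma=0.2)
    noise.reset()

    perturbation = noise.sample()             # one noise value per action dimension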

Policies

Base Policy

class marl.policy.policy.Policy[source]

Bases: object

policy = {'DeterministicPolicy': <marl.tools.ClassSpec object>, 'QPolicy': <marl.tools.ClassSpec object>, 'RandomPolicy': <marl.tools.ClassSpec object>, 'StochasticPolicy': <marl.tools.ClassSpec object>}
__call__(state)[source]

Call self as a function.

classmethod make(id, **kwargs)[source]
classmethod register(id, entry_point, **kwargs)[source]
classmethod available()[source]
class marl.policy.policy.ModelBasedPolicy(model)[source]

Bases: marl.policy.policy.Policy

load(filename)[source]
save(filename)[source]
marl.policy.policy.register(id, entry_point, **kwargs)[source]
marl.policy.policy.make(id, **kwargs)[source]
marl.policy.policy.available()[source]

Several Policies

class marl.policy.policies.RandomPolicy(action_space)[source]

Bases: marl.policy.policy.Policy

The class of random policies

Parameters
  • action_space – (gym.Spaces) The action space

__call__(state)[source]

Return a random action given the state

Parameters

state – (Tensor) The current state

class marl.policy.policies.QPolicy(model, observation_space=None, action_space=None)[source]

Bases: marl.policy.policy.ModelBasedPolicy

The class of policies based on a Q function

Parameters
  • model – (Model or torch.nn.Module) The q-value model

  • observation_space – (gym.Spaces) The observation space

  • action_space – (gym.Spaces) The action space

__call__(state)[source]

Return an action given the state

Parameters

state – (Tensor) The current state

property Q
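A hedged sketch combining QPolicy with the tabular QTable model documented under "Models" (state 0 is a placeholder; the call is assumed to return the greedy action under the Q-value model):

    from marl.policy.policies import QPolicy
    from marl.model.qvalue import QTable

    q_model = QTable(obs_sp=10, act_sp=4)     # 10 discrete states, 4 discrete actions
    policy = QPolicy(q_model)

    action = policy(0)                        # greedy action for state 0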
class marl.policy.policies.StochasticPolicy(model, observation_space=None, action_space=None)[source]

Bases: marl.policy.policy.ModelBasedPolicy

The class of stochastic policies

Parameters
  • model – (Model or torch.nn.Module) The model of the policy

  • observation_space – (gym.Spaces) The observation space

  • action_space – (gym.Spaces) The action space

forward(x)[source]
__call__(observation)[source]

Call self as a function.

class marl.policy.policies.DeterministicPolicy(model, observation_space=None, action_space=None)[source]

Bases: marl.policy.policy.ModelBasedPolicy

The class of deterministic policies

Parameters
  • model – (Model or torch.nn.Module) The model of the policy

  • observation_space – (gym.Spaces) The observation space

  • action_space – (gym.Spaces) The action space

__call__(observation)[source]

Call self as a function.

Models

Value and Q-Value array

class marl.model.qvalue.VTable(obs_sp)[source]

Bases: marl.model.model.Model

The class of state value function for discrete state spaces.

Parameters

obs_sp – (int) The number of possible observations

__call__(state=None)[source]

Call self as a function.

property shape
class marl.model.qvalue.QTable(obs_sp, act_sp)[source]

Bases: marl.model.model.Model

The class of action value function for discrete state and action spaces.

Parameters
  • obs_sp – (int) The number of possible observations

  • act_sp – (int) The number of possible actions

property q_table
__call__(state=None, action=None)[source]

Call self as a function.

property shape
class marl.model.qvalue.MultiQTable(obs_sp, act_sp)[source]

Bases: marl.model.model.Model

The class of action value functions for multi-agent settings with discrete state and action spaces. This kind of value function is used in the minimax-Q algorithm.

Parameters
  • obs_sp – (int) The number of possible observations

  • act_sp – (int) The number of possible actions

property q_table
property shape
__call__(state=None, action=None)[source]

Call self as a function.

class marl.model.qvalue.ActionProb(obs_sp, act_sp)[source]

Bases: marl.model.model.Model

The class of action probabilities for the PHC algorithm.

Parameters
  • obs_sp – (int) The number of possible observations

  • act_sp – (int) The number of possible actions

__call__(state=None, action=None)[source]

Call self as a function.

property shape

Neural network model

marl.model.nn.mlpnet.hidden_init(layer)[source]
class marl.model.nn.mlpnet.MlpNet(obs_sp, act_sp, hidden_size=[64, 64], hidden_activ=<class 'torch.nn.modules.activation.ReLU'>, last_activ=None, lay_norm=False)[source]

Bases: torch.nn.modules.module.Module

reset_parameters()[source]
forward(x)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
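A hedged sketch of the MLP model used on its own (sizes are illustrative; as a Q-network it is assumed to output one value per action):

    import torch
    from marl.model.nn.mlpnet import MlpNet

    net = MlpNet(obs_sp=8, act_sp=4, hidden_size=[64, 64])

    obs = torch.rand(1, 8)                    # batch of one observation
    q_values = net(obs)                       # tensor of shape (1, 4)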

class marl.model.nn.mlpnet.GumbelMlpNet(obs_sp, act_sp, hidden_size=[64, 64], hidden_activ=<class 'torch.nn.modules.activation.ReLU'>, tau=1.0, lay_norm=False)[source]

Bases: marl.model.nn.mlpnet.MlpNet

forward(x)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class marl.model.nn.mlpnet.ContinuousCritic(obs_sp, act_sp, hidden_size=[64, 64])[source]

Bases: torch.nn.modules.module.Module

reset_parameters()[source]
forward(obs, act)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.