Multi-agent RL

class marl.marl.MAS(agents_list=[], name='mas')[source]

Bases: object

The class of a multi-agent system.

Parameters
  • agents_list – (list) The list of agents in the MAS

  • name – (str) The name of the system

append(agent)[source]

Add an agent to the system.

Parameters

agent – (Agent) The agent to be added

action(observation)[source]

Return the joint action.

Parameters

observation – The joint observation

get_by_name(name)[source]
get_by_id(id)[source]
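A minimal sketch of building a MAS and querying a joint action. The agent class, spaces and observations here are placeholders (any Agent subclass should work), and the joint observation is assumed to be a list with one observation per agent:

    import gym
    from marl.marl import MAS
    from marl.agent.q_agent import QTableAgent

    obs_sp = gym.spaces.Discrete(10)      # placeholder observation space
    act_sp = gym.spaces.Discrete(4)       # placeholder action space

    agent1 = QTableAgent(obs_sp, act_sp, name="agent1")
    agent2 = QTableAgent(obs_sp, act_sp, name="agent2")

    mas = MAS(agents_list=[agent1], name="my-mas")
    mas.append(agent2)                    # add a second agent to the system

    joint_obs = [0, 0]                    # one observation per agent (assumption)
    joint_action = mas.action(joint_obs)  # one action per agent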
class marl.marl.MARL(agents_list=[], name='marl')[source]

Bases: marl.agent.agent.TrainableAgent, marl.marl.MAS

The class for multi-agent reinforcement learning.

Parameters
  • agents_list – (list) The list of agents in the MARL model

  • name – (str) The name of the system

store_experience(*args)[source]

Store a transition in the experience buffer.

update_model(t)[source]

Update the model.

reset_exploration(nb_timesteps)[source]

Reset the exploration process.

update_exploration(t)[source]

Update the exploration process.

action(observation)[source]

Return an action given an observation (the action is selected according to the exploration process).

Parameters

observation – The observation

greedy_action(observation)[source]

Return the greedy action given an observation.

Parameters

observation – The observation

save_policy(folder='.', filename='', timestep=None)[source]

Save the policy in a file called ‘<filename>-<agent_name>-<timestep>’.

Parameters
  • folder – (str) The path to the directory where to save the model(s)

  • filename – (str) A specific name for the file (ex: ‘test2’)

  • timestep – (int) The current timestep

load_model(filename)[source]
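End-to-end usage typically wraps the individual trainable agents in a MARL system and calls the inherited learn() and test() methods. A hedged sketch (the environment helper is hypothetical and stands for any multi-agent Gym-style environment):

    from marl.marl import MARL
    from marl.agent.q_agent import QTableAgent

    env = make_multiagent_env()           # hypothetical helper returning a Gym-style env
    obs_sp, act_sp = env.observation_space, env.action_space

    agent1 = QTableAgent(obs_sp, act_sp, name="agent1")
    agent2 = QTableAgent(obs_sp, act_sp, name="agent2")

    mas = MARL(agents_list=[agent1, agent2], name="marl")

    mas.learn(env, nb_timesteps=100000)   # train all agents jointly
    mas.test(env, nb_episodes=10)         # greedy evaluation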

Agents

Base Agent

class marl.agent.agent.Agent(policy, name='UnknownAgent')[source]

Bases: object

The class of generic agent.

Parameters
  • policy – (Policy) The policy of the agent

  • name – (str) The name of the agent

agents = {'DDPGAgent': <marl.tools.ClassSpec object>, 'DQNAgent': <marl.tools.ClassSpec object>, 'DeepACAgent': <marl.tools.ClassSpec object>, 'MADDPGAgent': <marl.tools.ClassSpec object>, 'MinimaxQAgent': <marl.tools.ClassSpec object>, 'PHCAgent': <marl.tools.ClassSpec object>, 'QTableAgent': <marl.tools.ClassSpec object>}
counter = 0
action(observation)[source]

Return the action given an observation.

Parameters

observation – The observation

greedy_action(observation)[source]

Return the greedy action given an observation.

Parameters

observation – The observation

test(env, nb_episodes=1, max_num_step=200, render=True, time_laps=0.0)[source]

Test a model.

Parameters
  • env – (Gym) The environment

  • nb_episodes – (int) The number of episodes to test

  • max_num_step – (int) The maximum number of steps before stopping an episode

  • render – (bool) Whether to visualize the test (using the environment’s render function)

classmethod make(id, **kwargs)[source]
classmethod register(id, entry_point, **kwargs)[source]
classmethod available()[source]
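The registry class methods let agents be created from string identifiers (the identifiers come from the agents dict shown above). A short sketch, assuming obs_sp and act_sp are gym.Spaces objects defined as in the MAS example above:

    from marl.agent.agent import Agent

    print(Agent.available())              # list of registered agent identifiers

    # make() instantiates a registered agent; extra kwargs are forwarded to its constructor
    agent = Agent.make('QTableAgent',
                       observation_space=obs_sp,
                       action_space=act_sp)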
class marl.agent.agent.TrainableAgent(policy, observation_space=None, action_space=None, model=None, experience='ReplayMemory-10000', exploration='EpsGreedy', gamma=0.99, lr=0.001, batch_size=32, name='TrainableAgent')[source]

Bases: marl.agent.agent.Agent

The class of trainable agent.

Parameters
  • policy – (Policy) The policy

  • observation_space – (gym.Spaces) The observation space

  • action_space – (gym.Spaces) The action space

  • experience – (Experience) The experience memory data structure

  • exploration – (Exploration) The exploration process

  • lr – (float) The learning rate

  • gamma – (float) The discount factor

  • batch_size – (int) The size of a batch

  • name – (str) The name of the agent

property observation_space
property action_space
store_experience(*args)[source]

Store a transition in the experience buffer.

update_model(t)[source]

Update the model.

reset_exploration(nb_timesteps)[source]

Reset the exploration process.

update_exploration(t)[source]

Update the exploration process.

action(observation)[source]

Return an action given an observation (the action is selected according to the exploration process).

Parameters

observation – The observation

save_policy(folder='.', filename='', timestep=None)[source]

Save the policy in a file called ‘<filename>-<agent_name>-<timestep>’.

Parameters
  • filename – (str) A specific name for the file (ex: ‘test2’)

  • timestep – (int) The current timestep

save_all()[source]
learn(env, nb_timesteps, max_num_step=100, test_freq=1000, save_freq=1000, save_folder='models', render=False, time_laps=0.0, verbose=1)[source]

Start the learning process; a usage sketch follows the parameter list.

Parameters
  • env – (Gym) The environment

  • nb_timesteps – (int) The total duration (in number of steps)

  • max_num_step – (int) The maximum number of steps before stopping an episode

  • test_freq – (int) The frequency of testing the model

  • save_freq – (int) The frequency of saving the model
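A hedged sketch of the training call for a single trainable agent (env and agent are assumed to be constructed as in the examples above):

    agent.learn(env,
                nb_timesteps=50000,    # total training duration in steps
                max_num_step=100,      # cap on episode length
                test_freq=1000,        # evaluate the current policy every 1000 steps
                save_freq=1000,        # checkpoint the policy every 1000 steps
                save_folder="models",
                render=False,
                verbose=1)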

class marl.agent.agent.MATrainable(mas, index)[source]

Bases: object

set_mas(mas)[source]
marl.agent.agent.register(id, entry_point, **kwargs)[source]
marl.agent.agent.make(id, **kwargs)[source]
marl.agent.agent.available()[source]

Q-value based model

class marl.agent.q_agent.QAgent(qmodel, observation_space, action_space, experience='ReplayMemory-1', exploration='EpsGreedy', gamma=0.99, lr=0.1, batch_size=1, target_update_freq=None, name='QAgent')[source]

Bases: marl.agent.agent.TrainableAgent

The class of trainable agent using Q-value-based methods

Parameters
  • qmodel – (Model or torch.nn.Module) The q-value model

  • observation_space – (gym.Spaces) The observation space

  • action_space – (gym.Spaces) The action space

  • experience – (Experience) The experience memory data structure

  • exploration – (Exploration) The exploration process

  • gamma – (float) The discount factor

  • lr – (float) The learning rate

  • batch_size – (int) The size of a batch

  • target_update_freq – (int) The update frequency of the target model

  • name – (str) The name of the agent

update_model(t)[source]

Update the model.

Parameters

t – (int) The current timestep

target(Q, batch)[source]

Compute the target value.

Parameters
  • Q – (Model or torch.nn.Module) The model of the Q-value

  • batch – (list) A list containing the required information about the batch

value(observation, action)[source]

Compute the value.

Parameters
  • observation – The observation

  • action – The action

update_q(curr_value, target_value, batch)[source]

Update the Q value.

Parameters
  • curr_value – (torch.Tensor) The current value

  • target_value – (torch.Tensor) The target value

  • batch – (list) A list containing the required information about the batch

update_target_model()[source]

Update the target model.

class marl.agent.q_agent.QTableAgent(observation_space, action_space, exploration='EpsGreedy', gamma=0.99, lr=0.1, target_update_freq=None, name='QTableAgent')[source]

Bases: marl.agent.q_agent.QAgent

The class of trainable agent using a Q-table to model the Q function

Parameters
  • observation_space – (gym.Spaces) The observation space

  • action_space – (gym.Spaces) The action space

  • exploration – (Exploration) The exploration process

  • gamma – (float) The discount factor

  • lr – (float) The learning rate

  • target_update_freq – (int) The update frequency of the target model

  • name – (str) The name of the agent

update_q(curr_value, target_value, batch)[source]

Update the Q value.

Parameters
  • curr_value – (torch.Tensor) The current value

  • target_value – (torch.Tensor) The target value

  • batch – (list) A list containing the required information about the batch

update_target_model()[source]

Update the target model.

target(Q, batch)[source]

Compute the target value.

Parameters
  • Q – (Model or torch.nn.Module) The model of the Q-value

  • batch – (list) A list containing the required information about the batch

value(observation, action)[source]

Compute the value.

Parameters
  • observation – The observation

  • action – The action

class marl.agent.q_agent.MinimaxQAgent(observation_space, my_action_space, other_action_space, index=None, mas=None, exploration='EpsGreedy', gamma=0.99, lr=0.1, target_update_freq=None, name='MinimaxQAgent')[source]

Bases: marl.agent.q_agent.QAgent, marl.agent.agent.MATrainable

The class of trainable agent using the tabular minimax-Q algorithm

Parameters
  • observation_space – (gym.Spaces) The observation space

  • my_action_space – (gym.Spaces) My action space

  • other_action_space – (gym.Spaces) The action space of the other agent

  • index – (int) The position of the agent in the list of agents

  • mas – (marl.agent.MAS) The multi-agent system corresponding to the agent

  • exploration – (Exploration) The exploration process

  • gamma – (float) The discount factor

  • lr – (float) The learning rate

  • target_update_freq – (int) The update frequency of the target model

  • name – (str) The name of the agent

update_q(curr_value, target_value, batch)[source]

Update the Q value.

Parameters
  • curr_value – (torch.Tensor) The current value

  • target_value – (torch.Tensor) The target value

  • batch – (list) A list containing the required information about the batch

update_target_model()[source]

Update the target model.

target(Q, joint_batch)[source]

Compute the target value.

Parameters
  • Q – (Model or torch.nn.Module) The model of the Q-value

  • joint_batch – (list) A list containing the required information about the joint batch

value(observation, action)[source]

Compute the value.

Parameters
  • observation – The observation

  • action – The action
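In a two-player setting, each minimax-Q agent needs its own and the opponent's action spaces, its index in the agent list, and a reference to the surrounding MAS. A hedged sketch with placeholder spaces:

    from marl.marl import MARL
    from marl.agent.q_agent import MinimaxQAgent

    agent1 = MinimaxQAgent(obs_sp, act_sp, act_sp, index=0, name="player1")
    agent2 = MinimaxQAgent(obs_sp, act_sp, act_sp, index=1, name="player2")

    mas = MARL(agents_list=[agent1, agent2])

    # MATrainable agents need to know the MAS they belong to
    agent1.set_mas(mas)
    agent2.set_mas(mas)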

class marl.agent.q_agent.DQNAgent(qmodel, observation_space, action_space, experience='ReplayMemory-10000', exploration='EpsGreedy', gamma=0.99, lr=0.0005, batch_size=32, target_update_freq=1000, name='DQNAgent')[source]

Bases: marl.agent.q_agent.QAgent

The class of trainable agent using a neural network to model the Q function

Parameters
  • qmodel – (Model or torch.nn.Module) The q-value model

  • observation_space – (gym.Spaces) The observation space

  • action_space – (gym.Spaces) The action space

  • experience – (Experience) The experience memory data structure

  • exploration – (Exploration) The exploration process

  • gamma – (float) The discount factor

  • lr – (float) The learning rate

  • batch_size – (int) The size of a batch

  • target_update_freq – (int) The update frequency of the target model

  • name – (str) The name of the agent

update_q(curr_value, target_value, batch)[source]

Update the Q value.

Parameters
  • curr_value – (torch.Tensor) The current value

  • target_value – (torch.Tensor) The target value

  • batch – (list) A list containing the required information about the batch

update_target_model()[source]

Update the target model.

target(Q, batch)[source]

Compute the target value.

Parameters
  • Q – (Model or torch.nn.Module) The model of the Q-value

  • batch – (list) A list containing the required information about the batch

value(observation, action)[source]

Compute the value.

Parameters
  • observation – The observation

  • action – The action
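A hedged sketch of a DQN agent backed by the MlpNet model documented under "Neural network model" (sizes and spaces are placeholders; MlpNet is assumed to output one Q-value per discrete action):

    from marl.agent.q_agent import DQNAgent
    from marl.model.nn.mlpnet import MlpNet

    obs_size, n_actions = 8, 4                       # placeholder dimensions

    qnet = MlpNet(obs_size, n_actions, hidden_size=[64, 64])

    agent = DQNAgent(qmodel=qnet,
                     observation_space=obs_sp,
                     action_space=act_sp,
                     experience='ReplayMemory-10000',
                     exploration='EpsGreedy',
                     gamma=0.99,
                     lr=0.0005,
                     batch_size=32,
                     target_update_freq=1000)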

class marl.agent.q_agent.ContinuousDQNAgent(qmodel, actor_policy, observation_space, action_space, experience='ReplayMemory-10000', exploration='EpsGreedy', gamma=0.99, lr=0.0005, batch_size=32, target_update_freq=1000, name='DQNAgent')[source]

Bases: marl.agent.q_agent.DQNAgent

The class of trainable agent using a neural network to model the Q function for continuous action spaces

Parameters
  • qmodel – (Model or torch.nn.Module) The q-value model

  • observation_space – (gym.Spaces) The observation space

  • action_space – (gym.Spaces) The action space

  • experience – (Experience) The experience memory data structure

  • exploration – (Exploration) The exploration process

  • gamma – (float) The discount factor

  • lr – (float) The learning rate

  • batch_size – (int) The size of a batch

  • target_update_freq – (int) The update frequency of the target model

  • name – (str) The name of the agent

target(Q, batch)[source]

Compute the target value.

Parameters
  • Q – (Model or torch.nn.Module) The model of the Q-value

  • batch – (list) A list containing the required information about the batch

value(observation, action)[source]

Compute the value.

Parameters
  • observation – The observation

  • action – The action

Policy Gradient based model

class marl.agent.pg_agent.PGAgent(critic, actor_policy, observation_space, action_space, actor_model=None, experience='ReplayMemory-1000', exploration='EpsGreedy', lr_actor=0.001, gamma=0.95, batch_size=32, target_update_freq=None, name='PGAgent')[source]

Bases: marl.agent.agent.TrainableAgent

The class of generic trainable agent using policy-based methods

Parameters
  • critic – (QAgent) The critic agent

  • actor_policy – (Policy) The policy for the actor

  • actor_model – (Model or nn.Module) The model for the actor

  • observation_space – (gym.Spaces) The observation space

  • action_space – (gym.Spaces) The action space

  • experience – (Experience) The experience memory data structure

  • exploration – (Exploration) The exploration process

  • lr_actor – (float) The learning rate for the actor

  • lr_critic – (float) The learning rate for the critic

  • gamma – (float) The discount factor

  • batch_size – (int) The size of a batch

  • target_update_freq – (int) The update frequency of the target model

  • name – (str) The name of the agent

property lr_actor
property lr_critic
update_model(t)[source]

Update the model.

Parameters

t – (int) The current timestep

update_target_policy()[source]

Update the target policy.

update_actor(batch)[source]

Update the actor.

class marl.agent.pg_agent.DeepACAgent(critic_model, actor_model, observation_space, action_space, experience='ReplayMemory-1000', exploration='EpsGreedy', lr_actor=0.001, lr_critic=0.001, gamma=0.95, batch_size=32, target_update_freq=None, name='DeepACAgent')[source]

Bases: marl.agent.pg_agent.PGAgent

Deep Actor-Critic Agent class. The critic is trained following the DQN algorithm and the policy is represented by a neural network with a softmax output.

Parameters
  • critic_model – (nn.Module) The critic’s model

  • actor_model – (Model or nn.Module) The model for the actor

  • observation_space – (gym.Spaces) The observation space

  • action_space – (gym.Spaces) The action space

  • experience – (Experience) The experience memory data structure

  • exploration – (Exploration) The exploration process

  • lr_actor – (float) The learning rate for the actor

  • lr_critic – (float) The learning rate for the critic

  • gamma – (float) The discount factor

  • batch_size – (int) The size of a batch

  • target_update_freq – (int) The update frequency of the target model

  • name – (str) The name of the agent

update_target_policy()[source]

Update the target policy.

update_actor(batch)[source]

Update the actor.

class marl.agent.pg_agent.PHCAgent(observation_space, action_space, exploration='EpsGreedy', delta=0.01, lr_critic=0.01, gamma=0.95, target_update_freq=None, name='PHCAgent')[source]

Bases: marl.agent.pg_agent.PGAgent

Policy Hill Climbing Agent’s class. The critic is trained following the standard Q-learning algorithm.

Parameters
  • observation_space – (gym.Spaces) The observation space

  • action_space – (gym.Spaces) The action space

  • exploration – (Exploration) The exploration process

  • delta – (float) The learning rate for the actor

  • lr_critic – (float) The learning rate for the critic

  • gamma – (float) The discount factor

  • target_update_freq – (int) The update frequency of the target model

  • name – (str) The name of the agent

update_target_policy()[source]

Update the target policy.

property delta
update_actor(batch)[source]

Update the actor.

class marl.agent.pg_agent.DDPGAgent(critic_model, actor_model, observation_space, action_space, experience='ReplayMemory-1000', exploration='OUNoise', lr_actor=0.01, lr_critic=0.01, gamma=0.95, batch_size=32, target_update_freq=None, name='DDPGAgent')[source]

Bases: marl.agent.pg_agent.PGAgent

Deep Deterministic Policy Gradient Agent’s class. The critic is trained following the standard “SARSA” algorithm (ContinuousDQN).

Parameters
  • critic_model – (nn.Module) The critic’s model

  • actor_model – (nn.Module) The model for the actor

  • observation_space – (gym.Spaces) The observation space

  • action_space – (gym.Spaces) The action space

  • experience – (Experience) The experience memory data structure

  • exploration – (Exploration) The exploration process

  • lr_actor – (float) The learning rate for the actor

  • lr_critic – (float) The learning rate for the critic

  • gamma – (float) The discount factor

  • batch_size – (int) The size of a batch

  • target_update_freq – (int) The update frequency of the target model

  • name – (str) The name of the agent

update_target_policy()[source]

Update the target policy.

update_actor(batch)[source]

Update the actor.
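A hedged sketch of a DDPG agent for a continuous action space, pairing a deterministic actor with the ContinuousCritic network documented under "Neural network model" (dimensions and network choices are illustrative):

    from marl.agent.pg_agent import DDPGAgent
    from marl.model.nn.mlpnet import MlpNet, ContinuousCritic

    obs_size, act_size = 24, 4                       # placeholder dimensions

    critic = ContinuousCritic(obs_size, act_size)    # models Q(s, a)
    actor = MlpNet(obs_size, act_size)               # deterministic policy network

    agent = DDPGAgent(critic_model=critic,
                      actor_model=actor,
                      observation_space=obs_sp,      # assumed continuous (gym.spaces.Box)
                      action_space=act_sp,
                      exploration='OUNoise',
                      lr_actor=0.01,
                      lr_critic=0.01,
                      gamma=0.95,
                      batch_size=32)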

Multi-agent Policy Gradient based model

class marl.agent.maac_agent.MAPGAgent(critic_model, actor_policy, observation_space, action_space, actor_model=None, index=None, mas=None, experience='ReplayMemory-1000', exploration='EpsGreedy', lr_actor=0.001, lr_critic=0.001, gamma=0.95, batch_size=32, tau=0.01, use_target_net=False, name='MAACAgent')[source]

Bases: marl.agent.agent.TrainableAgent, marl.agent.agent.MATrainable

The class of trainable agent using multi-agent policy gradient methods.

Parameters
  • critic_model – (Model or torch.nn.Module) The critic model

  • actor_policy – (Policy) actor policy

  • actor_model – (Model or torch.nn.Module) The actor model

  • observation_space – (gym.Spaces) The observation space

  • action_space – (gym.Spaces) The action space

  • index – (int) The index of the agent in the multi-agent system

  • mas – (MARL) The multi-agent system in which the agent is included

  • experience – (Experience) The experience memory data structure

  • exploration – (Exploration) The exploration process

  • lr_actor – (float) The learning rate for each actor

  • lr_critic – (float) The learning rate for each critic

  • gamma – (float) The discount factor

  • batch_size – (int) The batch size

  • tau – (float) The update rate

  • name – (str) The name of the agent

soft_update(local_model, target_model, tau)[source]
update_model(t)[source]

Update the model.

update_critic(local_batch, global_batch)[source]
target(local_batch, global_batch)[source]
class marl.agent.maac_agent.MAACAgent(critic_model, actor_model, observation_space, action_space, index=None, experience='ReplayMemory-1000', exploration='EpsGreedy', lr_actor=0.001, lr_critic=0.001, gamma=0.95, batch_size=32, tau=0.01, use_target_net=False, name='MAACAgent')[source]

Bases: marl.agent.maac_agent.MAPGAgent

The class of trainable agent using multi-agent actor-critic methods.

Parameters
  • critic_model – (Model or torch.nn.Module) The critic model

  • actor_model – (Model or torch.nn.Module) The actor model

  • observation_space – (gym.Spaces) The observation space

  • action_space – (gym.Spaces) The action space

  • index – (int) The index of the agent in the multi-agent system

  • mas – (MARL) The multi-agent system in which the agent is included

  • experience – (Experience) The experience memory data structure

  • exploration – (Exploration) The exploration process

  • lr_actor – (float) The learning rate for each actor

  • lr_critic – (float) The learning rate for each critic

  • gamma – (float) The discount factor

  • batch_size – (int) The batch size

  • tau – (float) The update rate

  • use_target_net – (bool) If True, use a target model

  • name – (str) The name of the agent

update_actor(local_batch, global_batch)[source]
class marl.agent.maac_agent.MADDPGAgent(critic_model, actor_model, observation_space, action_space, index=None, experience='ReplayMemory-1000', exploration='OUNoise', lr_actor=0.01, lr_critic=0.01, gamma=0.95, batch_size=32, tau=0.01, use_target_net=100, name='MADDPGAgent')[source]

Bases: marl.agent.maac_agent.MAPGAgent

The class of trainable agent using multi-agent deep deterministic policy gradient methods.

Parameters
  • critic_model – (Model or torch.nn.Module) The critic model

  • actor_model – (Model or torch.nn.Module) The actor model

  • observation_space – (gym.Spaces) The observation space

  • action_space – (gym.Spaces) The action space

  • index – (int) The index of the agent in the multi-agent system

  • mas – (MARL) The multi-agent system in which the agent is included

  • experience – (Experience) The experience memory data structure

  • exploration – (Exploration) The exploration process

  • lr_actor – (float) The learning rate for each actor

  • lr_critic – (float) The learning rate for each critic

  • gamma – (float) The discount factor

  • batch_size – (int) The batch size

  • tau – (float) The update rate

  • use_target_net – (bool) If True, use a target model

  • name – (str) The name of the agent

update_actor(local_batch, global_batch)[source]
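Centralized-critic agents such as MADDPGAgent are meant to live inside a MARL system so that each critic can be updated from the joint batch. A hedged sketch for two agents (dimensions and networks are illustrative):

    from marl.marl import MARL
    from marl.agent.maac_agent import MADDPGAgent
    from marl.model.nn.mlpnet import MlpNet, ContinuousCritic

    n_agents, obs_size, act_size = 2, 16, 2          # placeholder dimensions

    agents = []
    for i in range(n_agents):
        # the centralized critic is assumed to take the joint observation and joint action
        critic = ContinuousCritic(n_agents * obs_size, n_agents * act_size)
        actor = MlpNet(obs_size, act_size)
        agents.append(MADDPGAgent(critic_model=critic,
                                  actor_model=actor,
                                  observation_space=obs_sp,
                                  action_space=act_sp,
                                  index=i,
                                  name="maddpg-" + str(i)))

    mas = MARL(agents_list=agents)
    for ag in agents:
        ag.set_mas(mas)                              # MATrainable agents need the surrounding MAS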

Experience

Experience

class marl.experience.experience.Experience[source]

Bases: object

experience = {'PrioritizedReplayMemory': <marl.tools.ClassSpec object>, 'PrioritizedReplayMemory-1': <marl.tools.ClassSpec object>, 'PrioritizedReplayMemory-100': <marl.tools.ClassSpec object>, 'PrioritizedReplayMemory-1000': <marl.tools.ClassSpec object>, 'PrioritizedReplayMemory-10000': <marl.tools.ClassSpec object>, 'PrioritizedReplayMemory-100000': <marl.tools.ClassSpec object>, 'PrioritizedReplayMemory-2000': <marl.tools.ClassSpec object>, 'PrioritizedReplayMemory-30000': <marl.tools.ClassSpec object>, 'PrioritizedReplayMemory-500': <marl.tools.ClassSpec object>, 'PrioritizedReplayMemory-5000': <marl.tools.ClassSpec object>, 'PrioritizedReplayMemory-50000': <marl.tools.ClassSpec object>, 'RNNPrioritizedReplayMemory': <marl.tools.ClassSpec object>, 'RNNPrioritizedReplayMemory-1': <marl.tools.ClassSpec object>, 'RNNPrioritizedReplayMemory-100': <marl.tools.ClassSpec object>, 'RNNPrioritizedReplayMemory-1000': <marl.tools.ClassSpec object>, 'RNNPrioritizedReplayMemory-10000': <marl.tools.ClassSpec object>, 'RNNPrioritizedReplayMemory-100000': <marl.tools.ClassSpec object>, 'RNNPrioritizedReplayMemory-2000': <marl.tools.ClassSpec object>, 'RNNPrioritizedReplayMemory-30000': <marl.tools.ClassSpec object>, 'RNNPrioritizedReplayMemory-500': <marl.tools.ClassSpec object>, 'RNNPrioritizedReplayMemory-5000': <marl.tools.ClassSpec object>, 'RNNPrioritizedReplayMemory-50000': <marl.tools.ClassSpec object>, 'RNNReplayMemory': <marl.tools.ClassSpec object>, 'RNNReplayMemory-1': <marl.tools.ClassSpec object>, 'RNNReplayMemory-100': <marl.tools.ClassSpec object>, 'RNNReplayMemory-1000': <marl.tools.ClassSpec object>, 'RNNReplayMemory-10000': <marl.tools.ClassSpec object>, 'RNNReplayMemory-100000': <marl.tools.ClassSpec object>, 'RNNReplayMemory-2000': <marl.tools.ClassSpec object>, 'RNNReplayMemory-30000': <marl.tools.ClassSpec object>, 'RNNReplayMemory-500': <marl.tools.ClassSpec object>, 'RNNReplayMemory-5000': <marl.tools.ClassSpec object>, 'RNNReplayMemory-50000': <marl.tools.ClassSpec object>, 'ReplayMemory': <marl.tools.ClassSpec object>, 'ReplayMemory-1': <marl.tools.ClassSpec object>, 'ReplayMemory-100': <marl.tools.ClassSpec object>, 'ReplayMemory-1000': <marl.tools.ClassSpec object>, 'ReplayMemory-10000': <marl.tools.ClassSpec object>, 'ReplayMemory-100000': <marl.tools.ClassSpec object>, 'ReplayMemory-2000': <marl.tools.ClassSpec object>, 'ReplayMemory-30000': <marl.tools.ClassSpec object>, 'ReplayMemory-500': <marl.tools.ClassSpec object>, 'ReplayMemory-5000': <marl.tools.ClassSpec object>, 'ReplayMemory-50000': <marl.tools.ClassSpec object>}
push(*args)[source]
sample(batch_siz=1)[source]
none_transition()[source]
classmethod make(id, **kwargs)[source]
classmethod register(id, entry_point, **kwargs)[source]
classmethod available()[source]
marl.experience.experience.register(id, entry_point, **kwargs)[source]
marl.experience.experience.make(id, **kwargs)[source]
marl.experience.experience.available()[source]
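Experience buffers can be instantiated from the registered identifiers listed above; the '<Name>-<capacity>' identifiers preset the buffer capacity. A short sketch:

    from marl.experience.experience import Experience

    buffer = Experience.make('ReplayMemory-10000')              # replay buffer, capacity 10000
    per_buffer = Experience.make('PrioritizedReplayMemory-10000')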

ReplayBuffer

class marl.experience.replay_buffer.ReplayMemory(capacity, transition_type='FFTransition')[source]

Bases: marl.experience.experience.Experience

push(*transition)[source]
sample(batch_size=1)[source]
get_transition(index)[source]
sample_index(batch_size)[source]
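A minimal sketch of direct buffer usage. The transition layout is an assumption based on the default 'FFTransition' type and on PrioritizedReplayMemory.push_transition below (observation, action, reward, done_flag, next_observation); the variables come from an environment step:

    from marl.experience.replay_buffer import ReplayMemory

    memory = ReplayMemory(capacity=1000)

    # obs, action, reward, done, next_obs are placeholders from one environment step
    memory.push(obs, action, reward, done, next_obs)

    batch = memory.sample(batch_size=32)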
class marl.experience.replay_buffer.PrioritizedReplayMemory(capacity, alpha=0.6, beta=0.4, eps=1e-06, transition_type='FFTransition')[source]

Bases: marl.experience.experience.Experience

beta_increment_per_sampling = 0.001
property capacity
push(error, *transition)[source]
push_error(error)[source]
push_transition(observation, action, reward, done_flag, next_observation)[source]
sample(batch_size=1)[source]
update(idx, error)[source]

Exploration

Exploration

class marl.exploration.expl_process.ExplorationProcess[source]

Bases: object

The generic exploration class

process = {'EpsGreedy': <marl.tools.ClassSpec object>, 'EpsGreedy-cst001': <marl.tools.ClassSpec object>, 'EpsGreedy-cst002': <marl.tools.ClassSpec object>, 'EpsGreedy-cst01': <marl.tools.ClassSpec object>, 'EpsGreedy-cst02': <marl.tools.ClassSpec object>, 'EpsGreedy-cst05': <marl.tools.ClassSpec object>, 'EpsGreedy-cst1': <marl.tools.ClassSpec object>, 'EpsGreedy-lin': <marl.tools.ClassSpec object>, 'Greedy': <marl.tools.ClassSpec object>, 'OUNoise': <marl.tools.ClassSpec object>}
reset(training_duration)[source]

Initialize some additional values and reset the others

Parameters

training_duration – (int) Number of timesteps while training

update(t)[source]

If required, update the exploration parameters

Parameters

t – (int) The current timestep

__call__()[source]

Call self as a function.

classmethod make(id, *args, **kwargs)[source]
classmethod register(id, entry_point, **kwargs)[source]
classmethod available()[source]
marl.exploration.expl_process.register(id, entry_point, **kwargs)[source]
marl.exploration.expl_process.make(id, **kwargs)[source]
marl.exploration.expl_process.available()[source]

Eps-Greedy

class marl.exploration.greedy.Greedy[source]

Bases: marl.exploration.eps_greedy.EpsGreedy

The Greedy process

Parameters
  • eps_deb – (float) The initial amount of exploration

  • eps_fin – (float) The final amount of exploration

__call__(policy, observation)[source]

Choose an action according to the policy and the exploration rate

class marl.exploration.eps_greedy.EpsGreedy(eps_deb=1.0, eps_fin=0.1, deb_expl=0.1, fin_expl=0.9)[source]

Bases: marl.exploration.expl_process.ExplorationProcess

The epsilon-greedy exploration class

Parameters
  • eps_deb – (float) The initial amount of exploration to process

  • eps_fin – (float) The final amount of exploration to process

  • deb_expl – (float) The percentage of time before starting exploration (default: 0.1)

  • fin_expl – (float) The percentage of time after which exploration ends (default: 0.9)

reset(training_duration)[source]

Reinitialize some parameters

update(t)[source]

Update epsilon linearly

__call__(policy, observation)[source]

Choose an action according to the policy and the exploration rate
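The process is reset with the total training duration, updated every timestep, and called with a policy and an observation to pick an action; epsilon is interpolated from eps_deb to eps_fin over the fraction of training delimited by deb_expl and fin_expl. A hedged sketch of manual use (policy and observation are placeholders):

    from marl.exploration.eps_greedy import EpsGreedy

    expl = EpsGreedy(eps_deb=1.0, eps_fin=0.1, deb_expl=0.1, fin_expl=0.9)
    expl.reset(training_duration=10000)       # the epsilon schedule spans 10000 steps

    for t in range(10000):
        expl.update(t)                        # linear decay of epsilon
        action = expl(policy, observation)    # epsilon-greedy action selection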

Ornstein–Uhlenbeck Process

class marl.exploration.ou_noise.OUNoise(size, dt=0.01, mu=0.0, theta=0.15, sigma=0.2)[source]

Bases: marl.exploration.expl_process.ExplorationProcess

The Ornstein-Uhlenbeck process.

Parameters
  • size – (int) The number of variables to add noise to

  • seed – (float) The seed

  • mu – (float) The drift term

  • theta – (float) The rate at which the state reverts towards the mean mu

  • sigma – (float) The amount of noise

reset(t=None)[source]

Reinitialize the state of the process

update(t)[source]

If required, update the exploration parameters

Parameters

t – (int) The current timestep

__call__(policy, observation)[source]

Call self as a function.

sample()[source]

Update internal state and return it as a noise sample.
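A hedged sketch of the noise process used on its own; in an agent it is normally attached through the exploration argument (e.g. exploration='OUNoise' in DDPGAgent above):

    from marl.exploration.ou_noise import OUNoise

    noise = OUNoise(size=4, mu=0.0, theta=0.15, sigma=0.2)
    noise.reset()

    perturbation = noise.sample()             # one noise value per action dimension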

Policies

Base Policy

class marl.policy.policy.Policy[source]

Bases: object

policy = {'DeterministicPolicy': <marl.tools.ClassSpec object>, 'QPolicy': <marl.tools.ClassSpec object>, 'RandomPolicy': <marl.tools.ClassSpec object>, 'StochasticPolicy': <marl.tools.ClassSpec object>}
__call__(state)[source]

Call self as a function.

classmethod make(id, **kwargs)[source]
classmethod register(id, entry_point, **kwargs)[source]
classmethod available()[source]
class marl.policy.policy.ModelBasedPolicy(model)[source]

Bases: marl.policy.policy.Policy

load(filename)[source]
save(filename)[source]
marl.policy.policy.register(id, entry_point, **kwargs)[source]
marl.policy.policy.make(id, **kwargs)[source]
marl.policy.policy.available()[source]

Several Policies

class marl.policy.policies.RandomPolicy(action_space)[source]

Bases: marl.policy.policy.Policy

The class of random policies

Parameters
  • action_space – (gym.Spaces) The action space

__call__(state)[source]

Return a random action given the state

Parameters

state – (Tensor) The current state

class marl.policy.policies.QPolicy(model, observation_space=None, action_space=None)[source]

Bases: marl.policy.policy.ModelBasedPolicy

The class of policies based on a Q function

Parameters
  • model – (Model or torch.nn.Module) The q-value model

  • observation_space – (gym.Spaces) The observation space

  • action_space – (gym.Spaces) The action space

__call__(state)[source]

Return an action given the state

Parameters

state – (Tensor) The current state

property Q
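A hedged sketch combining QPolicy with the tabular QTable model documented under "Models" (state 0 is a placeholder; the call is assumed to return the greedy action under the Q-value model):

    from marl.policy.policies import QPolicy
    from marl.model.qvalue import QTable

    q_model = QTable(obs_sp=10, act_sp=4)     # 10 discrete states, 4 discrete actions
    policy = QPolicy(q_model)

    action = policy(0)                        # greedy action for state 0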
class marl.policy.policies.StochasticPolicy(model, observation_space=None, action_space=None)[source]

Bases: marl.policy.policy.ModelBasedPolicy

The class of stochastic policies

Parameters
  • model – (Model or torch.nn.Module) The model of the policy

  • observation_space – (gym.Spaces) The observation space

  • action_space – (gym.Spaces) The action space

forward(x)[source]
__call__(observation)[source]

Call self as a function.

class marl.policy.policies.DeterministicPolicy(model, observation_space=None, action_space=None)[source]

Bases: marl.policy.policy.ModelBasedPolicy

The class of deterministic policies

Parameters
  • model – (Model or torch.nn.Module) The model of the policy

  • observation_space – (gym.Spaces) The observation space

  • action_space – (gym.Spaces) The action space

__call__(observation)[source]

Call self as a function.

Models

Value and Q-Value array

class marl.model.qvalue.VTable(obs_sp)[source]

Bases: marl.model.model.Model

The class of state value function for discrete state spaces.

Parameters

obs_sp – (int) The number of possible observations

__call__(state=None)[source]

Call self as a function.

property shape
class marl.model.qvalue.QTable(obs_sp, act_sp)[source]

Bases: marl.model.model.Model

The class of action value function for discrete state and action spaces.

Parameters
  • obs_sp – (int) The number of possible observations

  • act_sp – (int) The number of possible actions

property q_table
__call__(state=None, action=None)[source]

Call self as a function.

property shape
class marl.model.qvalue.MultiQTable(obs_sp, act_sp)[source]

Bases: marl.model.model.Model

The class of action value functions for multi-agent settings with discrete state and action spaces. This kind of value function is used in the minimax-Q algorithm.

Parameters
  • obs_sp – (int) The number of possible observations

  • act_sp – (int) The number of possible actions

property q_table
property shape
__call__(state=None, action=None)[source]

Call self as a function.

class marl.model.qvalue.ActionProb(obs_sp, act_sp)[source]

Bases: marl.model.model.Model

The class of action probabilities for the PHC algorithm.

Parameters
  • obs_sp – (int) The number of possible observations

  • act_sp – (int) The number of possible actions

__call__(state=None, action=None)[source]

Call self as a function.

property shape

Neural network model

marl.model.nn.mlpnet.hidden_init(layer)[source]
class marl.model.nn.mlpnet.MlpNet(obs_sp, act_sp, hidden_size=[64, 64], hidden_activ=<class 'torch.nn.modules.activation.ReLU'>, last_activ=None, lay_norm=False)[source]

Bases: torch.nn.modules.module.Module

reset_parameters()[source]
forward(x)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
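A hedged sketch of the MLP model used on its own (sizes are illustrative; as a Q-network it is assumed to output one value per action):

    import torch
    from marl.model.nn.mlpnet import MlpNet

    net = MlpNet(obs_sp=8, act_sp=4, hidden_size=[64, 64])

    obs = torch.rand(1, 8)                    # batch of one observation
    q_values = net(obs)                       # tensor of shape (1, 4)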

class marl.model.nn.mlpnet.GumbelMlpNet(obs_sp, act_sp, hidden_size=[64, 64], hidden_activ=<class 'torch.nn.modules.activation.ReLU'>, tau=1.0, lay_norm=False)[source]

Bases: marl.model.nn.mlpnet.MlpNet

forward(x)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class marl.model.nn.mlpnet.ContinuousCritic(obs_sp, act_sp, hidden_size=[64, 64])[source]

Bases: torch.nn.modules.module.Module

reset_parameters()[source]
forward(obs, act)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.