Multi-agent RL

class marl.marl.MAS(agents_list=[], name='mas')
    Bases: object

    The class of a multi-agent system (MAS).

    Parameters:
        agents_list – (list) The list of agents in the MAS
        name – (str) The name of the system

    append(agent)
        Add an agent to the system.

        Parameters:
            agent – (Agent) The agent to be added
class marl.marl.MARL(agents_list=[], name='marl')
    Bases: marl.agent.agent.TrainableAgent, marl.marl.MAS

    The class for multi-agent reinforcement learning: a MAS that is itself trainable.

    Parameters:
        agents_list – (list) The list of agents in the MARL model
        name – (str) The name of the system

    action(observation)
        Return an action given an observation (the action is selected according to the exploration process).

        Parameters:
            observation – The observation

    greedy_action(observation)
        Return the greedy action given an observation.

        Parameters:
            observation – The observation

    save_policy(folder='.', filename='', timestep=None)
        Save the policy in a file called '<filename>-<agent_name>-<timestep>'.

        Parameters:
            folder – (str) The path to the directory where to save the model(s)
            filename – (str) A specific name for the file (e.g. 'test2')
            timestep – (int) The current timestep
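Example: a minimal sketch of composing and training a MARL system, assuming the import paths match the module names above, that gym is installed, and that env is a compatible multi-agent environment (the spaces and sizes below are purely illustrative):

    import gym.spaces
    from marl.marl import MARL
    from marl.agent.q_agent import QTableAgent

    obs_sp = gym.spaces.Discrete(25)   # illustrative discrete spaces
    act_sp = gym.spaces.Discrete(4)

    agent0 = QTableAgent(obs_sp, act_sp, exploration='EpsGreedy', gamma=0.99, lr=0.1, name='agent0')
    agent1 = QTableAgent(obs_sp, act_sp, exploration='EpsGreedy', gamma=0.99, lr=0.1, name='agent1')

    mas = MARL(agents_list=[agent0, agent1], name='marl')
    # mas.learn(env, nb_timesteps=100000)                 # MARL inherits learn() from TrainableAgent
    # mas.save_policy(folder='models', filename='test2')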
Agents
Base Agent

class marl.agent.agent.Agent(policy, name='UnknownAgent')
    Bases: object

    The class of a generic agent.

    Parameters:
        policy – (Policy) The policy of the agent
        name – (str) The name of the agent

    agents = registry mapping agent names ('DDPGAgent', 'DQNAgent', 'DeepACAgent', 'MADDPGAgent', 'MinimaxQAgent', 'PHCAgent', 'QTableAgent') to marl.tools.ClassSpec entries

    counter = 0

    action(observation)
        Return the action given an observation.

        Parameters:
            observation – The observation

    greedy_action(observation)
        Return the greedy action given an observation.

        Parameters:
            observation – The observation

    test(env, nb_episodes=1, max_num_step=200, render=True, time_laps=0.0)
        Test a model.

        Parameters:
            env – (Gym) The environment
            nb_episodes – (int) The number of episodes to test
            max_num_step – (int) The maximum number of steps before stopping an episode
            render – (bool) Whether to visualize the test (using the environment's render function)
class marl.agent.agent.TrainableAgent(policy, observation_space=None, action_space=None, model=None, experience='ReplayMemory-10000', exploration='EpsGreedy', gamma=0.99, lr=0.001, batch_size=32, name='TrainableAgent')
    Bases: marl.agent.agent.Agent

    The class of a trainable agent.

    Parameters:
        policy – (Policy) The policy
        observation_space – (gym.Spaces) The observation space
        action_space – (gym.Spaces) The action space
        experience – (Experience) The experience memory data structure
        exploration – (Exploration) The exploration process
        gamma – (float) The discount factor
        lr – (float) The learning rate
        batch_size – (int) The size of a batch
        name – (str) The name of the agent

    property observation_space

    property action_space

    action(observation)
        Return an action given an observation (the action is selected according to the exploration process).

        Parameters:
            observation – The observation

    save_policy(folder='.', filename='', timestep=None)
        Save the policy in a file called '<filename>-<agent_name>-<timestep>'.

        Parameters:
            filename – (str) A specific name for the file (e.g. 'test2')
            timestep – (int) The current timestep

    learn(env, nb_timesteps, max_num_step=100, test_freq=1000, save_freq=1000, save_folder='models', render=False, time_laps=0.0, verbose=1)
        Start the learning phase.

        Parameters:
            env – (Gym) The environment
            nb_timesteps – (int) The total training duration (in number of timesteps)
            max_num_step – (int) The maximum number of steps before stopping an episode
            test_freq – (int) The frequency of testing the model
            save_freq – (int) The frequency of saving the model
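Example: a minimal sketch of the train/evaluate cycle, assuming agent is any concrete trainable agent (e.g. one of the Q-value or policy-gradient agents below) and env is a compatible Gym environment:

    # Train for 50,000 timesteps, testing and saving every 1,000 timesteps.
    agent.learn(env, nb_timesteps=50000, max_num_step=100,
                test_freq=1000, save_freq=1000, save_folder='models')

    # Evaluate the current policy for 10 episodes without rendering.
    agent.test(env, nb_episodes=10, max_num_step=200, render=False)

    # Save the policy; the file name follows '<filename>-<agent_name>-<timestep>'.
    agent.save_policy(folder='models', filename='run1')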
Q-value based model

class marl.agent.q_agent.QAgent(qmodel, observation_space, action_space, experience='ReplayMemory-1', exploration='EpsGreedy', gamma=0.99, lr=0.1, batch_size=1, target_update_freq=None, name='QAgent')
    Bases: marl.agent.agent.TrainableAgent

    The class of trainable agents using Q-value-based methods.

    Parameters:
        qmodel – (Model or torch.nn.Module) The Q-value model
        observation_space – (gym.Spaces) The observation space
        action_space – (gym.Spaces) The action space
        experience – (Experience) The experience memory data structure
        exploration – (Exploration) The exploration process
        gamma – (float) The discount factor
        lr – (float) The learning rate
        batch_size – (int) The size of a batch
        target_update_freq – (int) The update frequency of the target model
        name – (str) The name of the agent

    target(Q, batch)
        Compute the target value.

        Parameters:
            Q – (Model or torch.nn.Module) The model of the Q-value
            batch – (list) The batch information required for the computation

    value(observation, action)
        Compute the value.

        Parameters:
            observation – The observation
            action – The action
class marl.agent.q_agent.QTableAgent(observation_space, action_space, exploration='EpsGreedy', gamma=0.99, lr=0.1, target_update_freq=None, name='QTableAgent')
    Bases: marl.agent.q_agent.QAgent

    The class of trainable agents using a Q-table to model the Q function.

    Parameters:
        observation_space – (gym.Spaces) The observation space
        action_space – (gym.Spaces) The action space
        exploration – (Exploration) The exploration process
        gamma – (float) The discount factor
        lr – (float) The learning rate
        target_update_freq – (int) The update frequency of the target model
        name – (str) The name of the agent

    update_q(curr_value, target_value, batch)
        Update the Q-value.

        Parameters:
            curr_value – (torch.Tensor) The current value
            target_value – (torch.Tensor) The target value
            batch – (list) The batch information required for the update
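Example: a single-agent tabular sketch, assuming gym is installed (the environment id is illustrative and its version suffix may differ with your gym installation):

    import gym
    from marl.agent.q_agent import QTableAgent

    env = gym.make("FrozenLake-v1")            # discrete observations and actions
    agent = QTableAgent(env.observation_space, env.action_space,
                        exploration='EpsGreedy', gamma=0.99, lr=0.1)
    agent.learn(env, nb_timesteps=20000)
    agent.test(env, nb_episodes=5, render=False)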
class marl.agent.q_agent.MinimaxQAgent(observation_space, my_action_space, other_action_space, index=None, mas=None, exploration='EpsGreedy', gamma=0.99, lr=0.1, target_update_freq=None, name='MinimaxQAgent')
    Bases: marl.agent.q_agent.QAgent, marl.agent.agent.MATrainable

    The class of trainable agents using the minimax-Q-table algorithm.

    Parameters:
        observation_space – (gym.Spaces) The observation space
        my_action_space – (gym.Spaces) The agent's own action space
        other_action_space – (gym.Spaces) The action space of the other agent
        index – (int) The position of the agent in the list of agents
        mas – (marl.agent.MAS) The multi-agent system the agent belongs to
        exploration – (Exploration) The exploration process
        gamma – (float) The discount factor
        lr – (float) The learning rate
        target_update_freq – (int) The update frequency of the target model
        name – (str) The name of the agent

    update_q(curr_value, target_value, batch)
        Update the Q-value.

        Parameters:
            curr_value – (torch.Tensor) The current value
            target_value – (torch.Tensor) The target value
            batch – (list) The batch information required for the update
class marl.agent.q_agent.DQNAgent(qmodel, observation_space, action_space, experience='ReplayMemory-10000', exploration='EpsGreedy', gamma=0.99, lr=0.0005, batch_size=32, target_update_freq=1000, name='DQNAgent')
    Bases: marl.agent.q_agent.QAgent

    The class of trainable agents using a neural network to model the Q function.

    Parameters:
        qmodel – (Model or torch.nn.Module) The Q-value model
        observation_space – (gym.Spaces) The observation space
        action_space – (gym.Spaces) The action space
        experience – (Experience) The experience memory data structure
        exploration – (Exploration) The exploration process
        gamma – (float) The discount factor
        lr – (float) The learning rate
        batch_size – (int) The size of a batch
        target_update_freq – (int) The update frequency of the target model
        name – (str) The name of the agent

    update_q(curr_value, target_value, batch)
        Update the Q-value.

        Parameters:
            curr_value – (torch.Tensor) The current value
            target_value – (torch.Tensor) The target value
            batch – (list) The batch information required for the update
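Example: a DQN sketch, assuming gym is installed and that MlpNet (see "Neural network model" below) takes the flattened observation size and the number of discrete actions as integers:

    import gym
    from marl.agent.q_agent import DQNAgent
    from marl.model.nn.mlpnet import MlpNet

    env = gym.make("CartPole-v1")
    obs_size = env.observation_space.shape[0]    # 4 for CartPole
    n_actions = env.action_space.n               # 2 for CartPole

    qmodel = MlpNet(obs_size, n_actions, hidden_size=[64, 64])
    agent = DQNAgent(qmodel, env.observation_space, env.action_space,
                     experience='ReplayMemory-10000', exploration='EpsGreedy',
                     gamma=0.99, lr=0.0005, batch_size=32, target_update_freq=1000)
    agent.learn(env, nb_timesteps=100000)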
class marl.agent.q_agent.ContinuousDQNAgent(qmodel, actor_policy, observation_space, action_space, experience='ReplayMemory-10000', exploration='EpsGreedy', gamma=0.99, lr=0.0005, batch_size=32, target_update_freq=1000, name='DQNAgent')
    Bases: marl.agent.q_agent.DQNAgent

    The class of trainable agents using a neural network to model the Q function with continuous actions; an actor policy is used to select actions.

    Parameters:
        qmodel – (Model or torch.nn.Module) The Q-value model
        actor_policy – (Policy) The policy used to select actions
        observation_space – (gym.Spaces) The observation space
        action_space – (gym.Spaces) The action space
        experience – (Experience) The experience memory data structure
        exploration – (Exploration) The exploration process
        gamma – (float) The discount factor
        lr – (float) The learning rate
        batch_size – (int) The size of a batch
        target_update_freq – (int) The update frequency of the target model
        name – (str) The name of the agent
Policy Gradient based model

class marl.agent.pg_agent.PGAgent(critic, actor_policy, observation_space, action_space, actor_model=None, experience='ReplayMemory-1000', exploration='EpsGreedy', lr_actor=0.001, gamma=0.95, batch_size=32, target_update_freq=None, name='PGAgent')
    Bases: marl.agent.agent.TrainableAgent

    The generic class of trainable agents using policy-gradient methods.

    Parameters:
        critic – (QAgent) The critic agent
        actor_policy – (Policy) The policy for the actor
        actor_model – (Model or nn.Module) The model for the actor
        observation_space – (gym.Spaces) The observation space
        action_space – (gym.Spaces) The action space
        experience – (Experience) The experience memory data structure
        exploration – (Exploration) The exploration process
        lr_actor – (float) The learning rate for the actor
        lr_critic – (float) The learning rate for the critic
        gamma – (float) The discount factor
        batch_size – (int) The size of a batch
        target_update_freq – (int) The update frequency of the target model
        name – (str) The name of the agent

    property lr_actor

    property lr_critic
class marl.agent.pg_agent.DeepACAgent(critic_model, actor_model, observation_space, action_space, experience='ReplayMemory-1000', exploration='EpsGreedy', lr_actor=0.001, lr_critic=0.001, gamma=0.95, batch_size=32, target_update_freq=None, name='DeepACAgent')
    Bases: marl.agent.pg_agent.PGAgent

    Deep Actor-Critic agent class. The critic is trained following the DQN algorithm, and the policy is represented by a neural network with a softmax output.

    Parameters:
        critic_model – (nn.Module) The critic's model
        actor_model – (Model or nn.Module) The model for the actor
        observation_space – (gym.Spaces) The observation space
        action_space – (gym.Spaces) The action space
        experience – (Experience) The experience memory data structure
        exploration – (Exploration) The exploration process
        lr_actor – (float) The learning rate for the actor
        lr_critic – (float) The learning rate for the critic
        gamma – (float) The discount factor
        batch_size – (int) The size of a batch
        target_update_freq – (int) The update frequency of the target model
        name – (str) The name of the agent
class marl.agent.pg_agent.PHCAgent(observation_space, action_space, exploration='EpsGreedy', delta=0.01, lr_critic=0.01, gamma=0.95, target_update_freq=None, name='PHCAgent')
    Bases: marl.agent.pg_agent.PGAgent

    Policy Hill Climbing agent class. The critic is trained following the standard Q-learning algorithm.

    Parameters:
        observation_space – (gym.Spaces) The observation space
        action_space – (gym.Spaces) The action space
        exploration – (Exploration) The exploration process
        delta – (float) The learning rate for the actor
        lr_critic – (float) The learning rate for the critic
        gamma – (float) The discount factor
        target_update_freq – (int) The update frequency of the target model
        name – (str) The name of the agent

    property delta
class marl.agent.pg_agent.DDPGAgent(critic_model, actor_model, observation_space, action_space, experience='ReplayMemory-1000', exploration='OUNoise', lr_actor=0.01, lr_critic=0.01, gamma=0.95, batch_size=32, target_update_freq=None, name='DDPGAgent')
    Bases: marl.agent.pg_agent.PGAgent

    Deep Deterministic Policy Gradient agent class. The critic is trained following a standard SARSA-style algorithm (ContinuousDQN).

    Parameters:
        critic_model – (nn.Module) The critic's model
        actor_model – (nn.Module) The model for the actor
        observation_space – (gym.Spaces) The observation space
        action_space – (gym.Spaces) The action space
        experience – (Experience) The experience memory data structure
        exploration – (Exploration) The exploration process
        lr_actor – (float) The learning rate for the actor
        lr_critic – (float) The learning rate for the critic
        gamma – (float) The discount factor
        batch_size – (int) The size of a batch
        target_update_freq – (int) The update frequency of the target model
        name – (str) The name of the agent
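Example: a continuous-control sketch, assuming gym is installed; MlpNet is used as the actor and ContinuousCritic as the critic (both listed under "Neural network model" below), with sizes taken from the environment's Box spaces:

    import gym
    from marl.agent.pg_agent import DDPGAgent
    from marl.model.nn.mlpnet import MlpNet, ContinuousCritic

    env = gym.make("Pendulum-v1")                 # illustrative continuous-action task
    obs_size = env.observation_space.shape[0]
    act_size = env.action_space.shape[0]

    actor = MlpNet(obs_size, act_size, hidden_size=[64, 64])
    critic = ContinuousCritic(obs_size, act_size, hidden_size=[64, 64])

    agent = DDPGAgent(critic, actor, env.observation_space, env.action_space,
                      exploration='OUNoise', lr_actor=0.01, lr_critic=0.01,
                      gamma=0.95, batch_size=32)
    agent.learn(env, nb_timesteps=100000)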
Multi-agent Policy Gradient based model

class marl.agent.maac_agent.MAPGAgent(critic_model, actor_policy, observation_space, action_space, actor_model=None, index=None, mas=None, experience='ReplayMemory-1000', exploration='EpsGreedy', lr_actor=0.001, lr_critic=0.001, gamma=0.95, batch_size=32, tau=0.01, use_target_net=False, name='MAACAgent')
    Bases: marl.agent.agent.TrainableAgent, marl.agent.agent.MATrainable

    The class of trainable agents using multi-agent policy-gradient methods.

    Parameters:
        critic_model – (Model or torch.nn.Module) The critic model
        actor_policy – (Policy) The actor policy
        actor_model – (Model or torch.nn.Module) The actor model
        observation_space – (gym.Spaces) The observation space
        action_space – (gym.Spaces) The action space
        index – (int) The index of the agent in the multi-agent system
        mas – (MARL) The multi-agent system in which the agent is included
        experience – (Experience) The experience memory data structure
        exploration – (Exploration) The exploration process
        lr_actor – (float) The learning rate for each actor
        lr_critic – (float) The learning rate for each critic
        gamma – (float) The discount factor
        batch_size – (int) The batch size
        tau – (float) The update rate
        name – (str) The name of the agent
class marl.agent.maac_agent.MAACAgent(critic_model, actor_model, observation_space, action_space, index=None, experience='ReplayMemory-1000', exploration='EpsGreedy', lr_actor=0.001, lr_critic=0.001, gamma=0.95, batch_size=32, tau=0.01, use_target_net=False, name='MAACAgent')
    Bases: marl.agent.maac_agent.MAPGAgent

    The class of trainable agents using multi-agent actor-critic methods.

    Parameters:
        critic_model – (Model or torch.nn.Module) The critic model
        actor_model – (Model or torch.nn.Module) The actor model
        observation_space – (gym.Spaces) The observation space
        action_space – (gym.Spaces) The action space
        index – (int) The index of the agent in the multi-agent system
        mas – (MARL) The multi-agent system in which the agent is included
        experience – (Experience) The experience memory data structure
        exploration – (Exploration) The exploration process
        lr_actor – (float) The learning rate for each actor
        lr_critic – (float) The learning rate for each critic
        gamma – (float) The discount factor
        batch_size – (int) The batch size
        tau – (float) The update rate
        use_target_net – (bool) If true, use a target model
        name – (str) The name of the agent
class marl.agent.maac_agent.MADDPGAgent(critic_model, actor_model, observation_space, action_space, index=None, experience='ReplayMemory-1000', exploration='OUNoise', lr_actor=0.01, lr_critic=0.01, gamma=0.95, batch_size=32, tau=0.01, use_target_net=100, name='MADDPGAgent')
    Bases: marl.agent.maac_agent.MAPGAgent

    The class of trainable agents using multi-agent deep deterministic policy gradient (MADDPG) methods.

    Parameters:
        critic_model – (Model or torch.nn.Module) The critic model
        actor_model – (Model or torch.nn.Module) The actor model
        observation_space – (gym.Spaces) The observation space
        action_space – (gym.Spaces) The action space
        index – (int) The index of the agent in the multi-agent system
        mas – (MARL) The multi-agent system in which the agent is included
        experience – (Experience) The experience memory data structure
        exploration – (Exploration) The exploration process
        lr_actor – (float) The learning rate for each actor
        lr_critic – (float) The learning rate for each critic
        gamma – (float) The discount factor
        batch_size – (int) The batch size
        tau – (float) The update rate
        use_target_net – (bool) If true, use a target model
        name – (str) The name of the agent
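Example: a two-agent MADDPG sketch; the spaces and sizes are illustrative, and the critic sizing assumes a centralized critic receiving the concatenated observations and actions of both agents (that sizing convention is an assumption, not taken from the documentation above):

    import numpy as np
    import gym.spaces
    from marl.marl import MARL
    from marl.agent.maac_agent import MADDPGAgent
    from marl.model.nn.mlpnet import MlpNet, ContinuousCritic

    n_agents, obs_size, act_size = 2, 8, 2
    obs_sp = gym.spaces.Box(low=-1.0, high=1.0, shape=(obs_size,), dtype=np.float32)
    act_sp = gym.spaces.Box(low=-1.0, high=1.0, shape=(act_size,), dtype=np.float32)

    agents = []
    for i in range(n_agents):
        actor = MlpNet(obs_size, act_size)
        critic = ContinuousCritic(n_agents * obs_size, n_agents * act_size)   # assumed joint inputs
        agents.append(MADDPGAgent(critic, actor, obs_sp, act_sp, index=i,
                                  exploration='OUNoise', name='maddpg-%d' % i))

    mas = MARL(agents_list=agents)
    # mas.learn(multi_agent_env, nb_timesteps=100000)   # `multi_agent_env` is hypothetical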
Experience
Experience

class marl.experience.experience.Experience
    Bases: object

    experience = registry mapping experience names to marl.tools.ClassSpec entries. The available base types are 'ReplayMemory', 'PrioritizedReplayMemory', 'RNNReplayMemory' and 'RNNPrioritizedReplayMemory'; each is also registered with a preset capacity suffix ('-1', '-100', '-500', '-1000', '-2000', '-5000', '-10000', '-30000', '-50000', '-100000').
ReplayBuffer

class marl.experience.replay_buffer.PrioritizedReplayMemory(capacity, alpha=0.6, beta=0.4, eps=1e-06, transition_type='FFTransition')
    Bases: marl.experience.experience.Experience

    beta_increment_per_sampling = 0.001

    property capacity
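Example: a construction sketch using the documented signature; agents can equivalently select a buffer through the registry string (e.g. experience='PrioritizedReplayMemory-10000'):

    from marl.experience.replay_buffer import PrioritizedReplayMemory

    memory = PrioritizedReplayMemory(capacity=10000, alpha=0.6, beta=0.4)
    print(memory.capacity)    # 10000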
Exploration
Exploration

class marl.exploration.expl_process.ExplorationProcess
    Bases: object

    The generic exploration process class.

    process = registry mapping exploration-process names ('EpsGreedy', 'EpsGreedy-cst001', 'EpsGreedy-cst002', 'EpsGreedy-cst01', 'EpsGreedy-cst02', 'EpsGreedy-cst05', 'EpsGreedy-cst1', 'EpsGreedy-lin', 'Greedy', 'OUNoise') to marl.tools.ClassSpec entries

    reset(training_duration)
        Initialize some additional values and reset the others.

        Parameters:
            training_duration – (int) The number of training timesteps
Eps-Greedy

class marl.exploration.greedy.Greedy
    Bases: marl.exploration.eps_greedy.EpsGreedy

    The greedy exploration process.

    Parameters:
        eps_deb – (float) The initial amount of exploration
        eps_fin – (float) The final amount of exploration
class marl.exploration.eps_greedy.EpsGreedy(eps_deb=1.0, eps_fin=0.1, deb_expl=0.1, fin_expl=0.9)
    Bases: marl.exploration.expl_process.ExplorationProcess

    The epsilon-greedy exploration class.

    Parameters:
        eps_deb – (float) The initial amount of exploration
        eps_fin – (float) The final amount of exploration
        deb_expl – (float) The percentage of time before starting exploration (default: 0.1)
        fin_expl – (float) The percentage of time after which exploration ends (default: 0.9)
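Example: a sketch of configuring the process explicitly instead of using the default registry string; an ExplorationProcess instance can be passed to an agent's exploration parameter:

    import gym.spaces
    from marl.exploration.eps_greedy import EpsGreedy
    from marl.agent.q_agent import QTableAgent

    # Decay epsilon from 1.0 to 0.05; deb_expl and fin_expl bound the decay
    # window as fractions of the training duration (interpretation of the defaults above).
    expl = EpsGreedy(eps_deb=1.0, eps_fin=0.05, deb_expl=0.1, fin_expl=0.9)
    agent = QTableAgent(gym.spaces.Discrete(16), gym.spaces.Discrete(4),
                        exploration=expl, gamma=0.99, lr=0.1)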
Ornstein–Uhlenbeck Process

class marl.exploration.ou_noise.OUNoise(size, dt=0.01, mu=0.0, theta=0.15, sigma=0.2)
    Bases: marl.exploration.expl_process.ExplorationProcess

    The Ornstein-Uhlenbeck noise process.

    Parameters:
        size – (int) The number of variables to which noise is added
        seed – (float) The random seed
        mu – (float) The drift term
        theta – (float) The mean-reversion rate (how strongly the process is pulled back toward mu)
        sigma – (float) The amount of noise
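Example: a construction sketch from the documented signature; the resulting process can be passed to a continuous-action agent via its exploration parameter (e.g. DDPGAgent above):

    from marl.exploration.ou_noise import OUNoise

    # Correlated noise over a 2-dimensional continuous action.
    noise = OUNoise(size=2, dt=0.01, mu=0.0, theta=0.15, sigma=0.2)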
Policies
Base Policy

class marl.policy.policy.Policy
    Bases: object

    policy = registry mapping policy names ('DeterministicPolicy', 'QPolicy', 'RandomPolicy', 'StochasticPolicy') to marl.tools.ClassSpec entries

class marl.policy.policy.ModelBasedPolicy(model)
    Bases: marl.policy.policy.Policy
Several Policies

class marl.policy.policies.RandomPolicy(action_space)
    Bases: marl.policy.policy.Policy

    The class of random policies.

    Parameters:
        action_space – (gym.Spaces) The action space
class marl.policy.policies.QPolicy(model, observation_space=None, action_space=None)
    Bases: marl.policy.policy.ModelBasedPolicy

    The class of policies based on a Q function.

    Parameters:
        model – (Model or torch.nn.Module) The Q-value model
        observation_space – (gym.Spaces) The observation space
        action_space – (gym.Spaces) The action space

    __call__(state)
        Return an action given the state.

        Parameters:
            state – (Tensor) The current state

    property Q
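Example: a sketch pairing QPolicy with a tabular Q model (QTable, listed under "Models" below); the exact state encoding expected by the call depends on the underlying model:

    import torch
    import gym.spaces
    from marl.policy.policies import QPolicy
    from marl.model.qvalue import QTable

    obs_sp, act_sp = gym.spaces.Discrete(16), gym.spaces.Discrete(4)
    qmodel = QTable(16, 4)
    policy = QPolicy(qmodel, observation_space=obs_sp, action_space=act_sp)

    state = torch.tensor(3)      # an observation index (illustrative)
    action = policy(state)       # pick an action from the Q-values for this state
    # The underlying Q model remains accessible through policy.Q.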
class marl.policy.policies.StochasticPolicy(model, observation_space=None, action_space=None)
    Bases: marl.policy.policy.ModelBasedPolicy

    The class of stochastic policies.

    Parameters:
        model – (Model or torch.nn.Module) The model of the policy
        observation_space – (gym.Spaces) The observation space
        action_space – (gym.Spaces) The action space
class marl.policy.policies.DeterministicPolicy(model, observation_space=None, action_space=None)
    Bases: marl.policy.policy.ModelBasedPolicy

    The class of deterministic policies.

    Parameters:
        model – (Model or torch.nn.Module) The model of the policy
        observation_space – (gym.Spaces) The observation space
        action_space – (gym.Spaces) The action space
Models
Value and Q-Value array

class marl.model.qvalue.VTable(obs_sp)
    Bases: marl.model.model.Model

    The class of state-value functions for a discrete state space.

    Parameters:
        obs_sp – (int) The number of possible observations

    property shape
class marl.model.qvalue.QTable(obs_sp, act_sp)
    Bases: marl.model.model.Model

    The class of action-value functions for discrete state and action spaces.

    Parameters:
        obs_sp – (int) The number of possible observations
        act_sp – (int) The number of possible actions

    property q_table

    property shape
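Example: a direct construction sketch; the exact type behind the q_table property is not stated above, so it is only read here:

    from marl.model.qvalue import QTable

    q = QTable(obs_sp=16, act_sp=4)
    print(q.shape)        # expected to reflect the (16, 4) table
    table = q.q_table     # the raw table behind the model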
class marl.model.qvalue.MultiQTable(obs_sp, act_sp)
    Bases: marl.model.model.Model

    The class of action-value functions for multi-agent settings with discrete state and action spaces. This kind of value function is used in the minimax-Q algorithm.

    Parameters:
        obs_sp – (int) The number of possible observations
        act_sp – (int) The number of possible actions

    property q_table

    property shape
Neural network model

class marl.model.nn.mlpnet.MlpNet(obs_sp, act_sp, hidden_size=[64, 64], hidden_activ=<class 'torch.nn.modules.activation.ReLU'>, last_activ=None, lay_norm=False)
    Bases: torch.nn.modules.module.Module

    forward(x)
        Defines the computation performed at every call. Should be overridden by all subclasses.

        Note: although the forward pass must be defined within this function, one should call the Module instance itself afterwards instead of this method, since the former takes care of running the registered hooks while the latter silently ignores them.
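Example: a standalone forward pass sketch, assuming the network takes a flat float tensor of size obs_sp and produces act_sp outputs:

    import torch
    from marl.model.nn.mlpnet import MlpNet

    net = MlpNet(obs_sp=4, act_sp=2, hidden_size=[64, 64])
    x = torch.randn(1, 4)        # a batch containing one 4-dimensional observation
    out = net(x)                 # call the module (not net.forward) so hooks run
    print(out.shape)             # expected: torch.Size([1, 2])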
class marl.model.nn.mlpnet.GumbelMlpNet(obs_sp, act_sp, hidden_size=[64, 64], hidden_activ=<class 'torch.nn.modules.activation.ReLU'>, tau=1.0, lay_norm=False)
    Bases: marl.model.nn.mlpnet.MlpNet

    forward(x)
        Defines the computation performed at every call. Should be overridden by all subclasses.

        Note: although the forward pass must be defined within this function, one should call the Module instance itself afterwards instead of this method, since the former takes care of running the registered hooks while the latter silently ignores them.
class marl.model.nn.mlpnet.ContinuousCritic(obs_sp, act_sp, hidden_size=[64, 64])
    Bases: torch.nn.modules.module.Module

    forward(obs, act)
        Defines the computation performed at every call. Should be overridden by all subclasses.

        Note: although the forward pass must be defined within this function, one should call the Module instance itself afterwards instead of this method, since the former takes care of running the registered hooks while the latter silently ignores them.