The config file that I used.
```yaml
defaults:
  - experiment: base_experiment
  - algorithm: ippo
  - task: meltingpot/predator_prey__orchard
  - model: layers/cnn
  - model@critic_model: layers/cnn
  - _self_

hydra:
  searchpath:
    - pkg://benchmarl/conf

seed: 199

task:
  max_steps: 500
  group_map:
    pred_policy: ["player_0", "player_1", "player_2", "player_3", "player_4"]
    prey_policy: ["player_5", "player_6", "player_7", "player_8", "player_9", "player_10", "player_11", "player_12"]

model:
  mlp_num_cells: [256, 256]
  cnn_num_cells: [16, 32, 256]
  cnn_kernel_sizes: [8, 4, 11]
  cnn_strides: [4, 2, 1]
  cnn_paddings: [2, 1, 5]
  cnn_activation_class: torch.nn.ReLU

critic_model:
  mlp_num_cells: [256, 256]
  cnn_num_cells: [16, 32, 256]
  cnn_kernel_sizes: [8, 4, 11]
  cnn_strides: [4, 2, 1]
  cnn_paddings: [2, 1, 5]
  cnn_activation_class: torch.nn.ReLU

algorithm:
  entropy_coef: 0.001
  use_tanh_normal: True

experiment:
  sampling_device: "cpu"
  train_device: "cuda:1"
  buffer_device: "cpu"
  share_policy_params: False
  prefer_continuous_actions: False
  collect_with_grad: False
  gamma: 0.99
  adam_eps: 0.000001
  lr: 0.00025
  clip_grad_norm: True
  clip_grad_val: 5
  max_n_iters: null
  max_n_frames: 10_000_000
  parallel_collection: True
  on_policy_collected_frames_per_batch: 2_000
  on_policy_n_envs_per_worker: 20
  on_policy_n_minibatch_iters: 32
  on_policy_minibatch_size: 100
  evaluation: true
  render: True
  evaluation_interval: 20_000
  evaluation_episodes: 5
  evaluation_deterministic_actions: False
  evaluation_static: False
  loggers: [wandb, csv]
  create_json: False
  save_folder: null
  restore_file: null
  restore_map_location: null
  checkpoint_interval: 200_000
  checkpoint_at_end: true
  keep_checkpoints_num: 10000
  exclude_buffer_from_checkpoint: True
```
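As a quick sanity check on the two-group split in the config, one can verify that the predator and prey groups are disjoint and together cover all 13 players (the player names and grouping are taken from the config; the check itself is plain set arithmetic):

```python
# Sanity-check the group_map from the config: the two policy groups
# should be disjoint and together cover all 13 players.
pred_policy = [f"player_{i}" for i in range(5)]       # player_0..player_4
prey_policy = [f"player_{i}" for i in range(5, 13)]   # player_5..player_12

pred, prey = set(pred_policy), set(prey_policy)
assert pred.isdisjoint(prey), "a player appears in both groups"
assert pred | prey == {f"player_{i}" for i in range(13)}, "some player is unassigned"
print(f"{len(pred)} predators, {len(prey)} prey, {len(pred | prey)} players total")
```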
Setup: applying the `group_map` settings recommended in #78 for the `meltingpot/predator_prey__orchard` environment, with IPPO + a CNN encoder.

BenchMARL: 1.5.0
torchrl: 0.10.0
torch: 2.9.0+cu126
hardware: 96-core CPU; RTX 4090 GPU
system: Ubuntu
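The CNN geometry in the config (kernels `[8, 4, 11]`, strides `[4, 2, 1]`, paddings `[2, 1, 5]`) can be sanity-checked with the standard convolution output-size formula. The 88×88 input below is only an illustrative assumption, not a confirmed Melting Pot observation size:

```python
def conv_out(size: int, kernel: int, stride: int, padding: int) -> int:
    """Standard conv output-size formula: floor((n + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1

# Layer parameters taken from the config above.
kernels, strides, paddings = [8, 4, 11], [4, 2, 1], [2, 1, 5]

size = 88  # ASSUMED input resolution, for illustration only
for k, s, p in zip(kernels, strides, paddings):
    size = conv_out(size, k, s, p)
    print(f"k={k} s={s} p={p} -> {size}x{size}")
# For an 88x88 input this gives 22 -> 11 -> 11.
```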
Expected:
Grouped IPPO should show a positive learning trend on orchard (comparable to the ungrouped baseline within variance).
Actual observation:
Episode return fluctuates up and down, likely due to sampling noise.
Weight-norm trajectories show very small drift.
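One way to quantify the "very small drift" above is to compare global parameter L2 norms across checkpoints. A minimal pure-Python sketch; checkpoint loading and flattening of the state dict into a list of floats are assumed to happen elsewhere (e.g. via `torch.load` on the saved checkpoints):

```python
import math

def l2_norm(flat_params):
    """Global L2 norm of a flat list of parameter values."""
    return math.sqrt(sum(v * v for v in flat_params))

def relative_drift(norms):
    """Relative change in parameter norm between consecutive checkpoints."""
    return [abs(b - a) / a for a, b in zip(norms, norms[1:])]

# Hypothetical flattened parameters for three checkpoints; in practice
# these would come from the saved state_dicts.
checkpoints = [[1.0, 2.0, 2.0], [1.01, 2.0, 2.0], [1.02, 2.01, 2.0]]
norms = [l2_norm(p) for p in checkpoints]
print(relative_drift(norms))  # small values -> parameters barely moving
```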
Diagnostics done:
Setting `share_policy_params` to `False` leads to GPU memory overflow; I have not yet figured out how to deal with this.

Some other questions: