Multi agent deep reinforcement learning

Algorithm: https://github.com/openai/maddpg

Environment: https://github.com/openai/multiagent-particle-envs

There is one issue related to action selection: action exploration & Gumbel-Softmax

djbitbyte commented on Mar 12

Hello, I have questions on exploration and Gumbel-Softmax.

In the pseudocode, it mentions initializing a random process N for action exploration, which is the same as in the DDPG paper. But I have difficulty understanding the exploration in your implementation. Is an Ornstein-Uhlenbeck process used for this algorithm, the same as in DDPG? Could you explain how you handled action exploration?

Another question: did you use softmax instead of Gumbel-Softmax?

I have tried to implement MADDPG on the simple-speaker-listener scenario, but without an Ornstein-Uhlenbeck process for action exploration, and with only softmax for the actor network. The other parts are the same as in the paper, but my speaker converges to telling the same wrong target landmark, and the listener wanders around or in between the 3 landmarks. I guess the listener ignored the speaker, as described in the paper.
And I've tried yours on simple-speaker-listener; it converges correctly for some trainings. Are the action exploration and activation functions the reasons for wrong convergence? Do they have a big impact on the training process?

Thanks for your time!

PengZhenghao commented on Mar 13 (edited)

I think in this implementation they use softmax as the output activation function when sampling actions.
Looking at the code below, you can see that they considered using an argmax activation by returning CategoricalPdType(ac_space.n) when sampling, but eventually they use the softmax activation when training the Q net.

def make_pdtype(ac_space):
    from gym import spaces
    if isinstance(ac_space, spaces.Box):
        # Continuous actions: diagonal Gaussian distribution.
        assert len(ac_space.shape) == 1
        return DiagGaussianPdType(ac_space.shape[0])
    elif isinstance(ac_space, spaces.Discrete):
        # Discrete actions: the hard (argmax) categorical type is commented out;
        # the "soft" categorical type outputs a softmax over the logits instead.
        # return CategoricalPdType(ac_space.n)
        return SoftCategoricalPdType(ac_space.n)
    elif isinstance(ac_space, spaces.MultiDiscrete):
        #return MultiCategoricalPdType(ac_space.low, ac_space.high)
        return SoftMultiCategoricalPdType(ac_space.low, ac_space.high)
    elif isinstance(ac_space, spaces.MultiBinary):
        return BernoulliPdType(ac_space.n)
    else:
        raise NotImplementedError
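
(A minimal NumPy sketch, my own rather than the repository's TensorFlow code, of why the soft output matters: the hard argmax sample is a one-hot vector that gradients cannot flow through, while the softmax output is a differentiable probability vector that the centralized Q network can be trained through.)

import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

logits = np.array([2.0, 0.5, -1.0])   # hypothetical actor output for 3 discrete actions

# Hard sample (what the commented-out CategoricalPdType corresponds to):
# a one-hot action; argmax is not differentiable, so no gradient reaches the actor.
hard_action = np.eye(len(logits))[np.argmax(logits)]

# Soft output (what SoftCategoricalPdType gives, here without noise):
# a probability vector, differentiable, so the Q net can be trained through it.
soft_action = softmax(logits)

print(hard_action)   # [1. 0. 0.]
print(soft_action)   # approx. [0.79 0.18 0.04]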

djbitbyte commented on Mar 14

Hello, @PengZhenghao!

I've looked into the involved functions again. I guess they use SoftCategoricalPdType(ac_space.n), then SoftCategoricalPdType.sample() to somehow add noise to the actions, and finally use softmax(logits - noise) as the output of the actor network.

And the noise added to the action comes from:

def sample(self):
    u = tf.random_uniform(tf.shape(self.logits))
    return U.softmax(self.logits - tf.log(-tf.log(u)), axis=-1)

I don't quite get why they handle the noise this way.
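
(One way to read that sample(): with u drawn from Uniform(0, 1), the quantity -log(-log(u)) is a standard Gumbel(0, 1) sample, so logits - log(-log(u)) is just logits plus Gumbel noise, and its softmax is a Gumbel-Softmax sample with temperature 1. A minimal NumPy sketch, my own and not code from the repository, checking that the argmax of the noisy logits reproduces the categorical distribution softmax(logits), which is why this particular noise gives correct discrete sampling:)

import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def gumbel_softmax_sample(logits, temperature=1.0):
    u = rng.uniform(size=logits.shape)
    g = -np.log(-np.log(u))                   # Gumbel(0, 1) noise
    return softmax((logits + g) / temperature)

logits = np.array([2.0, 0.5, -1.0])

# Gumbel-max check: over many draws, the argmax of (logits + Gumbel noise)
# follows the categorical distribution softmax(logits).
samples = np.array([gumbel_softmax_sample(logits) for _ in range(20000)])
empirical = np.bincount(samples.argmax(axis=1), minlength=3) / len(samples)
print(empirical)          # close to the line below
print(softmax(logits))    # approx. [0.79 0.18 0.04]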

djbitbyte commented on Mar 15

The sample function in the distribution is an implementation of Gumbel-Softmax. I added it to my code, and now it helps to speed up and stabilize the training, but my speaker still cannot tell the different landmarks apart.

How do you handle the action exploration then?
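
(For comparison with the DDPG-style exploration asked about above, here is a minimal sketch of an Ornstein-Uhlenbeck noise process. The parameter values are common defaults and are not taken from this repository, which, as discussed above, appears to rely on the Gumbel-Softmax sampling instead.)

import numpy as np

class OUNoise:
    """Ornstein-Uhlenbeck process, the exploration noise DDPG adds to continuous actions."""
    def __init__(self, size, mu=0.0, theta=0.15, sigma=0.2, dt=1e-2, seed=0):
        self.mu = mu * np.ones(size)
        self.theta = theta
        self.sigma = sigma
        self.dt = dt
        self.rng = np.random.default_rng(seed)
        self.reset()

    def reset(self):
        self.state = self.mu.copy()

    def sample(self):
        # dx = theta * (mu - x) * dt + sigma * sqrt(dt) * dW
        dx = (self.theta * (self.mu - self.state) * self.dt
              + self.sigma * np.sqrt(self.dt) * self.rng.standard_normal(self.mu.shape))
        self.state = self.state + dx
        return self.state

noise = OUNoise(size=2)
deterministic_action = np.array([0.3, -0.1])   # hypothetical continuous actor output
exploratory_action = deterministic_action + noise.sample()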
