Hello, I have questions on exploration and Gumbel-Softmax.
In the pseudocode, it mentions initializing a random process N for action exploration, the same as in the DDPG paper, but I have difficulty understanding the exploration in your implementation. Is the Ornstein-Uhlenbeck process used for this algorithm, as in DDPG? Could you explain how you handled action exploration?
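For context, by "random process" I mean something like the Ornstein-Uhlenbeck noise from the DDPG paper. A minimal sketch of what I had in mind (my own illustration with made-up hyperparameters, not code from this repo):

```python
import numpy as np

class OUNoise:
    """Ornstein-Uhlenbeck process, as used for exploration in the DDPG paper.

    theta, sigma and dt are illustrative defaults, not values from this repo.
    """
    def __init__(self, size, mu=0.0, theta=0.15, sigma=0.2, dt=1e-2):
        self.mu = mu * np.ones(size)
        self.theta, self.sigma, self.dt = theta, sigma, dt
        self.reset()

    def reset(self):
        # Start at the long-run mean
        self.x = self.mu.copy()

    def sample(self):
        # Mean-reverting drift toward mu plus a Gaussian increment
        self.x = (self.x
                  + self.theta * (self.mu - self.x) * self.dt
                  + self.sigma * np.sqrt(self.dt) * np.random.randn(*self.x.shape))
        return self.x

# e.g. action = np.clip(actor(obs) + noise.sample(), -1.0, 1.0)
```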
Another question: did you use softmax instead of Gumbel-Softmax?
I have tried to implement MADDPG on the simple-speaker-listener scenario, but without the Ornstein-Uhlenbeck process for action exploration, and with only a softmax on the actor network. The other parts are the same as in the paper, but my speaker converges to always signalling the same wrong target landmark, and the listener wanders around or between the 3 landmarks. I guess the listener learned to ignore the speaker, as described in the paper.
And I’ve tried your implementation on simple-speaker-listener; it converges correctly on some training runs. Are the action exploration and the activation functions the reasons for the wrong convergence? Do they have a big impact on the training process?
I think this implementation uses softmax as the output activation function when sampling actions.
And in the code below, you can find that they attempted to use an argmax activation by returning CategoricalPdType(ac_space.n) when sampling, but eventually they use a softmax activation when training the Q net.
I’ve looked into the involved functions again. I guess they use SoftCategoricalPdType(ac_space.n), then SoftCategoricalPdType.sample() to add noise to the actions, and finally softmax(logits - noise) as the output of the actor network.
And the noise added to the action comes from:
```python
u = tf.random_uniform(tf.shape(self.logits))
return U.softmax(self.logits - tf.log(-tf.log(u)), axis=-1)
```
I don’t quite get why they handle the noise in this way.
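After looking at it a bit more, I think the trick is: for u ~ Uniform(0, 1), -tf.log(-tf.log(u)) is a sample from the standard Gumbel(0, 1) distribution, so that line computes softmax(logits + gumbel_noise), i.e. Gumbel-softmax with temperature 1. A quick NumPy check of the underlying Gumbel-max trick (my own sketch, not code from the repo):

```python
import numpy as np

logits = np.array([1.0, 2.0, 0.5])
n = 100_000

# Gumbel(0, 1) noise: -log(-log(u)) with u ~ Uniform(0, 1)
u = np.random.uniform(size=(n, logits.size))
g = -np.log(-np.log(u))

# Gumbel-max trick: argmax(logits + g) ~ Categorical(softmax(logits))
counts = np.bincount(np.argmax(logits + g, axis=1), minlength=logits.size)
print(counts / n)                             # empirical frequencies
print(np.exp(logits) / np.exp(logits).sum())  # softmax(logits)
```

Replacing the hard argmax with a softmax is the differentiable relaxation, which is what lets gradients flow from the Q net back into the actor.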
The sample function in the distribution is an implementation of Gumbel-softmax. I added it to my code, and it now helps to speed up and stabilize the training, but my speaker still cannot tell the different landmarks apart.
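For reference, here is roughly the standalone helper I added (a sketch in the repo's TF1 style; the gumbel_softmax name and the temperature argument are my own, the repo's SoftCategoricalPd.sample() corresponds to temperature = 1):

```python
import tensorflow as tf

def gumbel_softmax(logits, temperature=1.0):
    """Differentiable sample from Categorical(softmax(logits)).

    Lower temperature -> closer to one-hot samples but noisier gradients.
    """
    u = tf.random_uniform(tf.shape(logits), minval=0, maxval=1)
    gumbel = -tf.log(-tf.log(u))  # Gumbel(0, 1) noise
    return tf.nn.softmax((logits + gumbel) / temperature)
```

Annealing the temperature toward 0 over training makes the sampled actions closer to one-hot, which might help the listener learn not to ignore the speaker.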