Ftrl

Ftrl class

tf.keras.optimizers.Ftrl(
    learning_rate=0.001,
    learning_rate_power=-0.5,
    initial_accumulator_value=0.1,
    l1_regularization_strength=0.0,
    l2_regularization_strength=0.0,
    name="Ftrl",
    l2_shrinkage_regularization_strength=0.0,
    beta=0.0,
    **kwargs
)

Optimizer that implements the FTRL algorithm.

See Algorithm 1 of this paper. This version has support for both online L2 (the L2 penalty given in the paper above) and shrinkage-type L2 (which is the addition of an L2 penalty to the loss function).

Initialization: $$t = 0$$ $$n_{0} = 0$$ $$\sigma_{0} = 0$$ $$z_{0} = 0$$

Update ($$i$$ is variable index, $$\alpha$$ is the learning rate): $$t = t + 1$$ $$n_{t,i} = n_{t-1,i} + g_{t,i}^{2}$$ $$\sigma_{t,i} = (\sqrt{n_{t,i}} - \sqrt{n_{t-1,i}}) / \alpha$$ $$z_{t,i} = z_{t-1,i} + g_{t,i} - \sigma_{t,i} * w_{t,i}$$ $$w_{t,i} = - ((\beta+\sqrt{n_{t,i}}) / \alpha + 2 * \lambda_{2})^{-1} * (z_{i} - sgn(z_{i}) * \lambda_{1}) if \abs{z_{i}} > \lambda_{i} else 0$$

Check the documentation for the l2_shrinkage_regularization_strength parameter for more details when shrinkage is enabled, in which case gradient is replaced with gradient_with_shrinkage.

Arguments

  • learning_rate: A Tensor, floating point value, or a schedule that is a tf.keras.optimizers.schedules.LearningRateSchedule. The learning rate.
  • learning_rate_power: A float value, must be less or equal to zero. Controls how the learning rate decreases during training. Use zero for a fixed learning rate.
  • initial_accumulator_value: The starting value for accumulators. Only zero or positive values are allowed.
  • l1_regularization_strength: A float value, must be greater than or equal to zero.
  • l2_regularization_strength: A float value, must be greater than or equal to zero.
  • name: Optional name prefix for the operations created when applying gradients. Defaults to "Ftrl".
  • l2_shrinkage_regularization_strength: A float value, must be greater than or equal to zero. This differs from L2 above in that the L2 above is a stabilization penalty, whereas this L2 shrinkage is a magnitude penalty. When input is sparse shrinkage will only happen on the active weights.
  • beta: A float value, representing the beta value from the paper.
  • **kwargs: Keyword arguments. Allowed to be one of "clipnorm" or "clipvalue". "clipnorm" (float) clips gradients by norm; "clipvalue" (float) clips gradients by value.

Reference