Nonlinear activation functions allow a stack of linear layers to represent nonlinear functions. Without them, a stack of linear layers has exactly the same capabilities as a single linear layer: it can only apply a linear transformation to the input matrix.
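A minimal NumPy sketch of that collapse: two stacked linear layers with no activation in between compute the same thing as one layer with merged weights. The shapes, the seed, and the names (`W1`, `b1`, `W2`, `b2`) are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: batch of 4 inputs, 3 -> 5 -> 2 features.
x = rng.normal(size=(4, 3))
W1, b1 = rng.normal(size=(3, 5)), rng.normal(size=5)
W2, b2 = rng.normal(size=(5, 2)), rng.normal(size=2)

# Two stacked linear layers with no activation in between.
stacked = (x @ W1 + b1) @ W2 + b2

# The same map written as a single linear layer with merged parameters.
W = W1 @ W2
b = b1 @ W2 + b2
single = x @ W + b

# With a ReLU between the layers, the merged layer no longer reproduces the output.
with_relu = np.maximum(x @ W1 + b1, 0.0) @ W2 + b2

collapses = np.allclose(stacked, single)
relu_differs = not np.allclose(with_relu, single)
print(collapses, relu_differs)
```

The merge works because matrix multiplication is associative: `(x @ W1) @ W2` equals `x @ (W1 @ W2)`, and the bias terms fold into a single vector.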
Biases are not strictly necessary and are often explicitly omitted, since leaving them out has been found empirically to improve training stability:
No Biases – No biases were used in any of the dense kernels or layer norms. We found this to result in increased training stability for large models.
Biases are unbounded parameters that can drift to large magnitudes and contribute to activation outliers.
sigmoid
softmax
ReLU
GeLU
LeakyReLU
tanh
TanhExp
Swish
Mish
xGLU
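For reference, the functions listed above can be sketched in NumPy roughly as follows. These are the commonly cited definitions, written for readability rather than numerical robustness. The `gelu` here uses the tanh approximation, and `xglu` is a hypothetical helper name for the gated family (bias-free, with `W` and `V` as assumed weight matrices), where the choice of `act` selects the variant.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x, axis=-1):
    # Subtracting the max does not change the result but avoids overflow in exp.
    z = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=axis, keepdims=True)

def relu(x):
    return np.maximum(x, 0.0)

def leaky_relu(x, slope=0.01):
    # Negative inputs are scaled by a small slope instead of being zeroed.
    return np.where(x > 0, x, slope * x)

def gelu(x):
    # tanh approximation of GELU.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def swish(x, beta=1.0):
    return x * sigmoid(beta * x)

def softplus(x):
    # log(1 + exp(x)), computed stably.
    return np.logaddexp(0.0, x)

def mish(x):
    return x * np.tanh(softplus(x))

def tanh_exp(x):
    return x * np.tanh(np.exp(x))

def xglu(x, W, V, act=sigmoid):
    # Gated linear unit family: one projection is passed through `act`
    # and multiplied elementwise with a second, ungated projection.
    return act(x @ W) * (x @ V)
```

With `act=sigmoid` this gives the original GLU; passing `relu`, `gelu`, or `swish` instead gives ReGLU, GEGLU, or SwiGLU respectively.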
Alfcanz YT & notebook: visualization of the effect of activation functions on point clouds (leaky ReLU compresses the negative parts onto lines, tanh turns the circle into a square, and sigmoid turns it into a smooth square centered at (0.5, 0.5)).
Short overview of sigmoid-xGLU
TanhExp - A Smooth Activation Function with High Convergence Speed for Lightweight Neural Networks