When searching for information on choosing the number of hidden layers in a neural network, I have come across the following table multiple times, including in this answer:
| Number of Hidden Layers | Result |
| --- | --- |
| 0 | Only capable of representing linear separable functions or decisions. |
| 1 | Can approximate any function that contains a continuous mapping from one finite space to another. |
| 2 | Can represent an arbitrary decision boundary to arbitrary accuracy with rational activation functions and can approximate any smooth mapping to any accuracy. |
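To make the difference between the first two rows concrete, here is a minimal sketch I put together (assuming scikit-learn; `LogisticRegression` stands in for a network with no hidden layer, `MLPClassifier` for one hidden layer) on XOR, which is not linearly separable:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

# XOR: the classic example of a problem that is not linearly separable.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

# 0 hidden layers: a purely linear decision boundary cannot fit XOR.
linear = LogisticRegression().fit(X, y)
print("0 hidden layers, accuracy:", linear.score(X, y))  # at most 0.75

# 1 hidden layer: enough to carve out the XOR decision regions.
mlp = MLPClassifier(hidden_layer_sizes=(8,), activation="tanh",
                    max_iter=5000, random_state=0).fit(X, y)
print("1 hidden layer, accuracy:", mlp.score(X, y))  # typically 1.0
```

So the jump from row 0 to row 1 is clear to me; it is the claimed jump from row 1 to row 2 that I cannot pin down.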
I am familiar with the universal approximation theorem for 1 hidden layer, but not with the purported result about the additional power of 2 hidden layers. Is it true? If so, where can I find a detailed explanation and proof?
Edit: Apparently the table comes from Jeff Heaton.