I am sharing some rough notes (in R and Python) here on how, while dot(a, b) fulfills "Mercer's condition" (by definition! and I'll just informally call these beasts "Mercer Kernels"), the seemingly harmless variations abs(dot(a, b)) and relu(dot(a, b)) are not Mercer Kernels (where relu(x) = max(0, x) = (abs(x) + x)/2). It turns out they fail the required positive semi-definiteness checks.
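Here is a minimal numerical sketch of the failure (my own example, built from the definitions above, not necessarily the one used in the notes): five unit vectors in R^2 at angles 0, 45, 90, 135, and 180 degrees give a relu(dot(a, b)) Gram matrix with a clearly negative eigenvalue.

    # Sketch: exhibit a relu(dot(a, b)) Gram matrix that is not
    # positive semi-definite (example vectors are my own choice).
    import numpy as np

    # five unit vectors in R^2 at angles 0, 45, 90, 135, 180 degrees
    angles = np.deg2rad([0, 45, 90, 135, 180])
    V = np.column_stack([np.cos(angles), np.sin(angles)])

    G = V @ V.T            # plain dot-product Gram matrix
    K = np.maximum(G, 0)   # relu(dot(a, b)), applied entrywise

    print(np.linalg.eigvalsh(G).min())  # ~0: the dot product passes
    print(np.linalg.eigvalsh(K).min())  # about -0.22: relu fails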
It is kind of a tricky point, though one very close to the definitions. Here I'll just try to state what is true, without confusing it with the derivations of why it is true.
An interesting wrinkle: if a and b are in R^1, then these two forms are Mercer Kernels! This is because in this case abs(dot(a, b)) = dot(abs(a), abs(b)), plus the usual rules for building new kernels (relu(x) = (abs(x) + x)/2, and sums of kernels are kernels).
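As a quick numerical sanity check of the R^1 claim (again my own sketch): relu(a*b) = (abs(a)*abs(b) + a*b)/2 is a nonnegative combination of two kernels, so its Gram matrices should pass the positive semi-definiteness check up to floating-point noise.

    # Sketch: spot-check that relu(dot(a, b)) is PSD when a, b are scalars.
    import numpy as np

    rng = np.random.default_rng(2023)
    a = rng.normal(size=50)             # 50 points in R^1
    K = np.maximum(np.outer(a, a), 0)   # relu(dot(a_i, a_j)) Gram matrix
    print(np.linalg.eigvalsh(K).min())  # no worse than roundoff: PSD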
And if we only check systems up to 3 by 3 for positive semi-definiteness, we also get deceived into thinking abs and relu are Mercer Kernels, by variations of Sylvester's criterion.
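To see how convincing the small systems are, here is a random spot-check (my own sketch): every 3 by 3 relu(dot(a, b)) Gram matrix sampled below comes out positive semi-definite, even though the 5 by 5 example above fails.

    # Sketch: random 3x3 relu-Gram matrices keep passing the PSD check,
    # which is exactly how checking only small systems deceives.
    import numpy as np

    rng = np.random.default_rng(2023)
    worst = min(
        np.linalg.eigvalsh(np.maximum(V @ V.T, 0)).min()
        for V in (rng.normal(size=(3, 2)) for _ in range(10000))
    )
    print(worst)  # no worse than roundoff: all samples PSD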
At this point one is sufficiently confused/frustrated that it is worth re-checking that the dot-product itself (which is the prototype for a Mercer Kernel) is in fact a Mercer Kernel under the check-definition.
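For the record, that re-check is one line: for any vectors a_1, ..., a_n and real coefficients c_1, ..., c_n we have sum_{i,j} c_i c_j dot(a_i, a_j) = dot(sum_i c_i a_i, sum_i c_i a_i) >= 0, so every Gram matrix built from dot(a, b) is positive semi-definite.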
The most common reason one cares is that positive semi-definiteness is used to establish that the optimization problem associated with support vector machines is convex, and hence "easy."
If you are interested in kernelized machine learning and (like everyone) need to see the so-called "obvious" steps checked, I invite you to check out these very rough notes. Or, at least be inoculated: know that relu(dot(a, b)) is not a Mercer Kernel in general, even if you can't immediately regurgitate why.