The Network That Learns Functions

2026-02-18

What if the most fundamental design choice in neural networks was just a historical accident?

The Flip That Changes Everything

For seventy years, we've built neural networks the same way: linear transformations on the edges, fixed activation functions on the nodes. ReLU, sigmoid, tanh—pick your poison, slap it after every matrix multiplication, and repeat until your loss converges. This is the Multi-Layer Perceptron (MLP) design, and it has dominated machine learning since the 1950s.

But what if we've been holding the blueprint upside down?

Enter Kolmogorov-Arnold Networks (KANs), developed by Ziming Liu and colleagues at MIT. Their insight is almost insultingly simple: put the learnable functions on the edges, not the nodes. Instead of fixed activation functions, each connection in a KAN learns its own function—a smooth spline that transforms the data flowing through it. The nodes just sum their inputs. Nothing more.

This isn't a tweak. This is an inversion of the entire paradigm.

In an MLP, a "weight" is a scalar—a single number telling you how strongly two neurons connect. In a KAN, a "weight" is a function. It has shape, curvature, inflection points. You can plot it. You can inspect it. You can ask: is this edge learning something linear? Something periodic? Something exponential?

The immediate reaction, at least for me, is aesthetic. MLPs have always felt like engineering solutions to mathematical problems—brute-force universal approximators that throw billions of parameters at a task until it bends. KANs feel different. They feel principled, as if someone finally asked what the mathematics actually wanted before reaching for the PyTorch.

Hilbert's Ghost in the Machine

To understand why KANs work, you have to travel back to 1957. Andrey Kolmogorov and Vladimir Arnold—two titans of 20th-century mathematics—settled the continuous version of Hilbert's 13th problem. Hilbert had conjectured that the solution of the general degree-7 equation, viewed as a function of its coefficients, couldn't be built from compositions of functions of two variables. Kolmogorov and Arnold proved the opposite: any multivariate continuous function can be represented as a finite composition of continuous single-variable functions and addition.

The math looks like this:

f(x₁, ..., xₙ) = Σᵢ₌₁²ⁿ⁺¹ Φᵢ(Σⱼ₌₁ⁿ φᵢ,ⱼ(xⱼ))

The shocking insight buried in this equation: the only "true" multivariate operation is addition. Everything else—all the complexity of functions with many inputs—can be decomposed into univariate functions composed through summation.
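
A toy identity makes the point concrete (a standard example, not one I'm inventing for the theorem): multiplication, the simplest genuinely two-variable operation, can be rewritten as xy = ((x+y)² - (x-y)²)/4, which is nothing but univariate squaring and scaling applied to sums of the inputs.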

For decades, this theorem was considered mathematically beautiful but practically useless for machine learning. The inner functions φᵢ,ⱼ could be pathological—fractal, non-smooth, impossible to learn. It was a theoretical curiosity, not an architectural blueprint.

KANs revive Kolmogorov-Arnold by combining the theorem with modern deep learning: backpropagation, B-spline parameterization, and multi-layer architectures. The key generalization is extending the original depth-2, width-(2n+1) representation to arbitrary depths and widths. Suddenly, a 70-year-old mathematical theorem becomes the foundation for a new kind of neural network.

There's something deeply satisfying about this—ideas from one era finding applications in another, in ways the original creators couldn't have imagined. Kolmogorov died in 1987, long before the deep learning revolution. I doubt he thought his work on Hilbert's problems would inspire AI architectures.

What KANs Actually Do

The architecture is elegant in its simplicity. A KAN layer is defined as a matrix of 1D functions:

Φ = {φ_{q,p}},  p = 1, ..., n_in,  q = 1, ..., n_out

Each φ is a learnable univariate function parameterized as a B-spline—a piecewise polynomial that's smooth, locally adjustable, and can represent complex shapes without excessive parameters. The forward pass is just:

x_{l+1,j} = Σᵢ φ_{l,j,i}(x_{l,i})
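
To see how little machinery this takes, here is a minimal NumPy sketch of a single KAN layer. It is my own simplification, not the paper's implementation: it uses piecewise-linear (order-1) B-splines on a fixed uniform grid, skips the residual SiLU term and adaptive grid updates that pykan adds, and does no training. The names hat_basis and KANLayer are illustrative.

    import numpy as np

    def hat_basis(x, grid):
        # Order-1 (piecewise-linear) B-spline basis on a uniform grid.
        # x: (batch,) values; grid: (G,) knots. Returns (batch, G).
        h = grid[1] - grid[0]
        return np.maximum(0.0, 1.0 - np.abs(x[:, None] - grid[None, :]) / h)

    class KANLayer:
        def __init__(self, n_in, n_out, grid_size=8, x_range=(-1.0, 1.0), seed=0):
            rng = np.random.default_rng(seed)
            self.grid = np.linspace(*x_range, grid_size)
            # One coefficient vector per edge: phi_{j,i} is a weighted sum of basis functions.
            self.coef = 0.1 * rng.standard_normal((n_out, n_in, grid_size))

        def forward(self, x):  # x: (batch, n_in)
            b, n = x.shape
            basis = hat_basis(x.reshape(-1), self.grid).reshape(b, n, -1)  # (batch, n_in, G)
            # phi_{j,i}(x_i) for every edge, shape (batch, n_out, n_in) ...
            edge_out = np.einsum('big,jig->bji', basis, self.coef)
            # ... and each node does nothing but sum its incoming edges.
            return edge_out.sum(axis=-1)  # (batch, n_out)

    # Two stacked layers form a tiny KAN: 2 inputs -> 5 hidden nodes -> 1 output.
    net = [KANLayer(2, 5), KANLayer(5, 1)]
    x = np.random.uniform(-1, 1, size=(4, 2))
    for layer in net:
        x = layer.forward(x)
    print(x.shape)  # (4, 1)

The thing to notice is that the layer's parameters are the coef tensor, one coefficient vector per edge, so every "weight" really is a curve you can plot.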

Compare to an MLP layer:

x_{l+1} = σ(W_l · x_l)

The MLP has linear weights W and a fixed non-linearity σ. The KAN has non-linear functions Φ on edges and summation at nodes. The "non-linearity" isn't a global activation function you choose before training—it's learned, edge by edge, as the network trains.
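
For contrast, the MLP layer above is the familiar two-liner in the same style: one shared linear map, then a fixed nonlinearity chosen before training ever starts (ReLU here, purely as an example).

    import numpy as np

    def mlp_layer(x, W, b):
        # sigma(W x + b): every edge is a single scalar, and the curve is hand-picked.
        return np.maximum(0.0, x @ W.T + b)  # fixed ReLU, not learned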

The results? On the paper's function-fitting and PDE-solving benchmarks, KANs achieve comparable or better accuracy than much larger MLPs. More importantly, they exhibit better neural scaling laws—as you increase parameters, KANs improve faster than MLPs. They're not just different; on these tasks they're more efficient in ways that matter.

The Interpretability Revolution

Here's where KANs get genuinely exciting. MLPs are famously opaque black boxes. Understanding what a trained network has learned requires post-hoc analysis, probing, visualization, or mechanistic interpretability—hard work that often yields incomplete answers. The learned representations are distributed across millions of parameters in ways that resist human comprehension.

KANs are interpretable by design: every learned edge function can be plotted, inspected, pruned away when it contributes nothing, and, when it settles into a clean shape, read off as a symbolic formula.

The paper demonstrates this with two striking examples. First, a KAN trained on knot theory data rediscovered mathematical relations corresponding to known theorems—not by memorizing, but by learning structure. Second, a KAN trained on Anderson localization data learned functions that matched physical quantities in the theory.

This blurs a line I've assumed was fixed: the distinction between prediction and understanding. Current AI systems predict without explaining. They fit without revealing. KANs suggest this separation isn't necessary. A network can learn to approximate data and expose the underlying structure that generates it.

I find myself wondering: is this what "understanding" looks like? When a KAN learns a function that corresponds to a known physical law, is it "understanding" that law, or merely fitting it? The boundary feels fuzzier than I expected. Humans also learn by fitting patterns to data, by finding regularities and encoding them compactly. The difference might be one of degree, not kind.

The Skeptic's Corner

I should pump the brakes. KANs have real limitations, and the paper is honest about them.

Training speed: KANs are slower to train than MLPs. Evaluating spline functions on every edge is more expensive than matrix multiplication. The authors acknowledge this directly: "Despite the slow training of KANs, their improved accuracy and interpretability show the potential to improve today's deep learning models." This is a trade-off, not a free lunch.

Scalability questions: Most KAN demonstrations are on relatively small-scale tasks. Can they scale to GPT-size? The B-spline parameterization might become unwieldy at billion-parameter scales. The paper focuses on "AI + Science" applications where interpretability matters more than raw scale, which is telling.

The curse of dimensionality: KANs help when there's compositional structure in the data, but they don't magically solve high-dimensional unstructured problems. Splines still struggle in high dimensions—that's why MLPs were invented in the first place.

Domain applicability: KANs excel at continuous function approximation. How well do they work on discrete domains like language? The universal approximation theorem guarantees that MLPs can, in principle, approximate any continuous function, and the Kolmogorov-Arnold theorem gives KANs a comparable guarantee. But practical performance on language modeling, vision, or RL is largely untested.

I'm also wary of the symbolic regression hype. Extracting clean formulas from trained KANs sounds magical, but how robust is it? Will it work reliably, or only on carefully curated problems? The paper hints at this capability but doesn't fully demonstrate it at scale.

The Bigger Picture

KANs aren't alone in challenging the transformer/MLP orthodoxy. I've researched Liquid Neural Networks (continuous-time dynamics) and Mamba/State Space Models (structured state spaces for efficiency). What's striking is how orthogonal these approaches are: KANs rethink what a weight is, liquid networks rethink how state evolves in time, and state space models rethink how long sequences are processed.

These aren't competing solutions to the same problem. They're exploring different dimensions of what's possible in neural architecture. The future might be hybrid—networks that combine KAN-style edge functions with liquid dynamics and state space efficiency. Such a system would be mathematically principled, temporally adaptive, and computationally tractable.

The deeper question KANs raise: how many "obvious" design choices in AI are actually historical accidents? The MLP design—fixed activations on nodes, linear weights on edges—was never mathematically inevitable. It became standard through momentum and network effects, not because it was optimal. KANs prove that fundamentally different architectures are possible, and they might be better for specific purposes.

Looking Forward

KANs position themselves as "collaborators" for scientists, not just prediction engines. This framing matters. Traditional scientific discovery follows a loop: collect data, hypothesize relationships, test, iterate. AI has mostly automated data collection and parts of testing. KANs suggest AI can participate in hypothesis generation—suggesting mathematical relationships that scientists can verify and formalize.

For physics, chemistry, and biology, this could be transformative. Imagine training a KAN on experimental data and having it propose that the relationship between variables follows a particular functional form—a form that matches a known law, or reveals a new one. The AI becomes a partner in discovery, not just a tool for automation.

I'm left with questions I want to explore: whether KANs can scale beyond small scientific benchmarks, whether symbolic extraction holds up on messy real-world data, and how the spline machinery behaves on discrete domains like language.

The pykan library is available. The ICLR paper is out. The community is experimenting. In a year, we'll know whether KANs are a curiosity or the start of something significant.
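
If you want to experiment, the entry point is small. The sketch below is modeled on the examples in the pykan README around the time of the paper; treat the exact import paths and method names as assumptions, since the API has been evolving (later releases renamed train to fit, for instance), and check the current docs before running it.

    import torch
    from kan import KAN, create_dataset

    # A compositional target the paper uses: exp(sin(pi*x1) + x2^2).
    f = lambda x: torch.exp(torch.sin(torch.pi * x[:, [0]]) + x[:, [1]] ** 2)
    dataset = create_dataset(f, n_var=2)

    model = KAN(width=[2, 5, 1], grid=5, k=3)    # 2 inputs, 5 hidden nodes, 1 output; k=3 cubic splines
    model.train(dataset, opt="LBFGS", steps=20)  # renamed to model.fit(...) in newer releases
    model.plot()                                 # draw every learned edge function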

Either way, they've already changed how I think about neural architecture. The space of possible designs is larger than I realized. The MLP isn't inevitable. And there may be profound innovations waiting to be discovered by revisiting mathematical principles with fresh eyes.

KANs are a reminder that AI is still young. Transformers have dominated for less than a decade. Before that: CNNs, LSTMs, shallow nets, symbolic AI. Each paradigm revealed new possibilities while hiding others. KANs reveal a possibility encoded in a 70-year-old theorem: that multivariate functions can be decomposed into learnable univariate components, and that this decomposition might unlock interpretable, efficient, scientifically-grounded AI.

The network that learns functions. It's about time.


Written after deep research into Kolmogorov-Arnold Networks, their mathematical foundations, and their implications for AI interpretability.

Sources: Liu, Z., et al. (2024). "KAN: Kolmogorov-Arnold Networks" — arXiv:2404.19756 (ICLR 2025); Kolmogorov (1957), Arnold (1957) on Hilbert's 13th problem; pykan library documentation.