Building on the insights into memory and learning explored in AI Model Compression Part V: The LSTM: A New Kind of Memory, we now turn to another cornerstone of intelligence: vision. Understanding how we and machines perceive the world has been pivotal in shaping the design of neural networks like CNNs.
The Ancient Art of Seeing
In 1959, neuroscientists David Hubel and Torsten Wiesel made a landmark discovery while studying how cats see. They found that certain cells in the visual cortex fired only when the cat saw lines at particular angles. This simple insight, that vision starts with recognizing basic patterns, later helped shape the design of convolutional neural networks. But the story actually starts much earlier.
The Hierarchy of Sight
Consider how a Renaissance artist learns to draw a face:
- First, they see basic lines and edges
- These combine into simple shapes
- Shapes form features like eyes and nose
- Features compose into the complete face
This step-by-step approach is similar to how our brain processes what we see and, later, how convolutional networks learn to "see."
The Mathematics of Vision: One Pixel at a Time
The Convolution Operation: Nature's Pattern Detector
Convolution Operation:
(f * g)(x) = ∑_τ f(τ) g(x − τ)
In vision terms:
- f is what we're looking at (the image)
- g is what we're looking for (the kernel/filter)
- τ represents shifting viewpoint
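The formula above can be sketched directly in code. This is a minimal, illustrative 1D version of the sum over τ; optimized library routines such as `numpy.convolve` compute the same thing.

```python
# Discrete 1D convolution, matching (f * g)(x) = Σ_τ f(τ) g(x − τ).
# A minimal sketch for clarity, not an optimized implementation.

def convolve_1d(f, g):
    """Full discrete convolution of two sequences."""
    n = len(f) + len(g) - 1
    out = [0.0] * n
    for x in range(n):
        for tau in range(len(f)):
            if 0 <= x - tau < len(g):   # only sum where g is defined
                out[x] += f[tau] * g[x - tau]
    return out

signal = [1, 2, 3]          # f: what we're looking at
kernel = [0, 1, 0.5]        # g: what we're looking for
print(convolve_1d(signal, kernel))  # → [0.0, 1.0, 2.5, 4.0, 1.5]
```

Note how the kernel is "slid" across the signal: each output position is one placement of g against f.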
Think of it like an ancient tracker reading marks in the sand:
- The tracker has learned patterns (kernels) to recognize
- They slide their gaze across the ground (convolution)
- At each point, they compute how well the pattern matches
The math is similar to this ancient skill:
For a 3x3 kernel scanning an image:
Image Patch        Kernel
[p1 p2 p3]        [1 0 1]
[p4 p5 p6]   ⊙    [0 1 0]   =  Σ(pᵢkᵢ)
[p7 p8 p9]        [1 0 1]
This isn't just multiplication; it's math that shows how pattern recognition works.
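The patch-kernel step above is just an elementwise multiply followed by a sum. Here is that single step with concrete (hypothetical) pixel values p1…p9 = 1…9:

```python
# One convolution step: elementwise multiply the patch by the kernel, then sum.
patch = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0, 1],
          [0, 1, 0],
          [1, 0, 1]]

response = sum(p * k
               for prow, krow in zip(patch, kernel)
               for p, k in zip(prow, krow))
print(response)  # 1 + 3 + 5 + 7 + 9 = 25
```

The kernel's zeros ignore some pixels entirely; the ones "pass through" the corners and center. That selective weighting is the pattern matching.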
Pooling: The Art of Summarizing
After detecting patterns, our visual system (and CNNs) must summarize what it sees. Max pooling tells this story mathematically:
Max Pooling Operation:
2x2 Region:       Summary:
[2.1 1.4]
[0.8 1.9]    →    2.1
Like an artist stepping back from their canvas, seeing the larger composition rather than individual strokes.
This operation embodies a deep truth about perception: sometimes, to understand better, we need to see less. It's the visual equivalent of seeing the forest instead of the trees.
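The summarizing step can be sketched as a small function. This is a minimal 2x2 max pooling with stride 2, applied to a hypothetical 4x4 feature map:

```python
# 2x2 max pooling with stride 2: keep the strongest response per region.
def max_pool_2x2(image):
    pooled = []
    for r in range(0, len(image) - 1, 2):
        row = []
        for c in range(0, len(image[0]) - 1, 2):
            # take the maximum of the 2x2 window
            row.append(max(image[r][c], image[r][c + 1],
                           image[r + 1][c], image[r + 1][c + 1]))
        pooled.append(row)
    return pooled

feature_map = [[2.1, 1.4, 0.3, 0.0],
               [0.8, 1.9, 0.5, 0.2],
               [1.1, 0.4, 3.0, 0.9],
               [0.6, 0.7, 0.1, 2.2]]
print(max_pool_2x2(feature_map))  # → [[2.1, 0.5], [1.1, 3.0]]
```

Each output value answers one question: was the pattern present anywhere in this region? The exact position is discarded, which is what makes the summary robust to small shifts.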
The Deep Architecture of Vision
Layer by Layer: Building Complexity
A CNN's structure mirrors the evolutionary development of vision itself:
Layer 1 (Simple Cells):
Kernel examples (a vertical-edge detector and a horizontal-edge detector):

Vertical edges     Horizontal edges
[-1  0  1]         [ 1  1  1]
[-2  0  2]         [ 0  0  0]
[-1  0  1]         [-1 -1 -1]
These detect edges, just like the basic cells Hubel and Wiesel found in the brain.
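To see these kernels at work, here is a minimal sketch that slides each one over a tiny synthetic image with a vertical edge (dark left half, bright right half). Note that CNN layers actually compute cross-correlation, sliding the kernel without flipping it, which is what this sketch does:

```python
# Slide a kernel over an image (cross-correlation, as CNN layers compute it).
def apply_kernel(image, kernel):
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for r in range(len(image) - kh + 1):
        row = []
        for c in range(len(image[0]) - kw + 1):
            row.append(sum(image[r + i][c + j] * kernel[i][j]
                           for i in range(kh) for j in range(kw)))
        out.append(row)
    return out

# A vertical edge: dark left half, bright right half.
image = [[0, 0, 1, 1]] * 4

vertical_edge = [[-1, 0, 1],
                 [-2, 0, 2],
                 [-1, 0, 1]]
horizontal_edge = [[ 1,  1,  1],
                   [ 0,  0,  0],
                   [-1, -1, -1]]

print(apply_kernel(image, vertical_edge))    # → [[4, 4], [4, 4]]  strong response
print(apply_kernel(image, horizontal_edge))  # → [[0, 0], [0, 0]]  no horizontal edge
```

The vertical-edge kernel fires everywhere along the boundary, while the horizontal-edge kernel stays silent: each simple cell responds only to its own pattern.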
As we go deeper, the network begins to recognize more complex patterns. This hierarchy is like how a child learns to see: first edges → then shapes → then objects → finally meaning.
The Backpropagation Story: Learning to See Better
The way CNNs learn is a very human process. When we calculate gradients through the network, we measure how much each weight contributed to the final error, then nudge each one in the direction that reduces that error. This isn't just calculus; it's math that shows how we learn from mistakes. Like an art student improving their work based on feedback, each backpropagation step helps the network get better at understanding what it sees.
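The learn-from-mistakes loop can be sketched with a single weight. This toy example (the input, target, and learning rate are all made up for illustration) runs gradient descent on a squared error, exactly the update rule backpropagation applies to every kernel weight in a CNN:

```python
# A minimal sketch of gradient descent on one weight.
# Loss: (w * x − target)², gradient: 2 · (w * x − target) · x

x, target = 2.0, 6.0   # hypothetical input and desired output
w = 0.5                # hypothetical starting weight
lr = 0.05              # learning rate

for step in range(50):
    prediction = w * x
    error = prediction - target
    grad = 2 * error * x      # how much the loss changes as w changes
    w -= lr * grad            # step against the gradient: the "feedback"

print(round(w, 3))  # → 3.0, since 3.0 * 2.0 == 6.0
```

Each pass through the loop is one round of feedback: predict, compare, adjust. A CNN does the same thing simultaneously for millions of weights.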
The Activation Story: ReLU and the Art of Decision
The choice of ReLU (Rectified Linear Unit) as an activation function tells a surprisingly profound story:
ReLU(x) = max(0, x)
Graphically, ReLU is flat at zero for every negative input, then rises as a straight line with slope 1 for positive inputs.
This simple function shows an important truth about both biological and artificial vision: neurons must decide whether to fire or not. Why is this so effective? Think about how we see:
- Some features in a scene matter (positive activation)
- Others don't (set to zero)
- There's no need for complicated in-between details
This is similar to how we pay attention—either we notice something, or we don’t. The math behind ReLU captures this simple yes/no process of seeing, while still allowing the network to learn.
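The yes/no behavior is visible when ReLU is applied elementwise to a set of (hypothetical) neuron activations:

```python
# ReLU: negative activations are silenced, positive ones pass through unchanged.
def relu(x):
    return max(0.0, x)

activations = [-2.0, -0.5, 0.0, 1.3, 4.2]
print([relu(a) for a in activations])  # → [0.0, 0.0, 0.0, 1.3, 4.2]
```

Everything below zero collapses to the same "didn't notice it" value, while the positive responses keep their full strength, preserving how strongly each feature was detected.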