
AI Model Compression Part VI: The Awakening of Vision - From Simple Cells to Convolutional Understanding

Ateeb Taseer · 6 Min Read Time

Building on the insights into memory and learning explored in AI Model Compression Part V: The LSTM: A New Kind of Memory, we now turn to another cornerstone of intelligence: vision. Understanding how we and machines perceive the world has been pivotal in shaping the design of neural networks like CNNs.

 

The Ancient Art of Seeing

In 1959, neuroscientists David Hubel and Torsten Wiesel made a landmark discovery while studying how cats see. They found that certain neurons in the cat's visual cortex fired only in response to lines at specific angles. This simple insight, that vision begins with recognizing basic patterns, later helped shape the design of convolutional neural networks. But the story actually starts much earlier.

 

The Hierarchy of Sight

Consider how a Renaissance artist learns to draw a face:

  1. First, they see basic lines and edges
  2. These combine into simple shapes
  3. Shapes form features like eyes and nose
  4. Features compose into the complete face

 

This step-by-step approach is similar to how our brain processes what we see and, later, how convolutional networks learn to "see."

 

The Mathematics of Vision: One Pixel at a Time

The Convolution Operation: Nature's Pattern Detector


Convolution Operation:
(f * g)(x) = ∑_τ f(τ)g(x − τ)

In vision terms:
- f is what we're looking at (the image)
- g is what we're looking for (the kernel/filter)
- τ represents the shifting viewpoint
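A minimal sketch of this sum in Python (NumPy, with made-up example values): a hand-rolled loop over τ, checked against NumPy's built-in `np.convolve`.

```python
import numpy as np

f = np.array([1.0, 2.0, 3.0])   # the "image" (signal)
g = np.array([0.0, 1.0, 0.5])   # the "kernel" (pattern we look for)

# Direct evaluation of (f * g)(x) = sum over tau of f(tau) * g(x - tau)
out = np.zeros(len(f) + len(g) - 1)
for x in range(len(out)):
    for tau in range(len(f)):
        if 0 <= x - tau < len(g):
            out[x] += f[tau] * g[x - tau]

print(out)                 # same result as np.convolve(f, g)
print(np.convolve(f, g))
```

The loop makes the shifting viewpoint explicit: for each output position x, the kernel is slid so its values line up against different parts of f.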

 

Think of it like an ancient tracker reading marks in the sand:

  1. The tracker has learned patterns (kernels) to recognize
  2. They slide their gaze across the ground (convolution)
  3. At each point, they compute how well the pattern matches

 

The math is similar to this ancient skill:

For a 3×3 kernel scanning an image:

Image Patch       Kernel
[p1 p2 p3]     [1  0  1]
[p4 p5 p6] ⊙   [0  1  0]  =  Σ pᵢkᵢ
[p7 p8 p9]     [1  0  1]

This isn't just multiplication; it's pattern matching in mathematical form: the larger the sum, the more closely the patch resembles the kernel's pattern.
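The patch-by-patch scan above can be sketched in a few lines of Python (NumPy, with a made-up 4×4 image). Note that, as in most CNN libraries, the kernel is slid without flipping, so this is strictly cross-correlation:

```python
import numpy as np

def conv2d(image, kernel):
    """Slide `kernel` over `image`; at each position take sum(patch * kernel)."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

image = np.arange(16.0).reshape(4, 4)       # toy 4x4 "image"
kernel = np.array([[1., 0., 1.],
                   [0., 1., 0.],
                   [1., 0., 1.]])           # the kernel from the diagram
print(conv2d(image, kernel))                # 2x2 map of pattern-match scores
```

Each entry of the output is one Σ pᵢkᵢ from the diagram, computed at a different viewpoint.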

 

Pooling: The Art of Summarizing

After detecting patterns, our visual system (and CNNs) must summarize what it sees. Max pooling tells this story mathematically:

 

Max Pooling Operation:
2x2 Region:      Summary:
[2.1  1.4]   →     2.1
[0.8  1.9]

 

Like an artist stepping back from the canvas to see the larger composition rather than individual strokes, this operation embodies a deep truth about perception: sometimes, to understand better, we need to see less. It's the visual equivalent of seeing the forest instead of the trees.
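The 2×2 summary above takes only a few lines in Python (NumPy, hypothetical feature-map values):

```python
import numpy as np

def max_pool_2x2(x):
    """Summarize each non-overlapping 2x2 region by its largest value."""
    h, w = x.shape
    # Group into 2x2 blocks, then take the max within each block
    return x[:h//2*2, :w//2*2].reshape(h//2, 2, w//2, 2).max(axis=(1, 3))

fmap = np.array([[2.1, 1.4, 0.3, 0.0],
                 [0.8, 1.9, 0.5, 0.2],
                 [0.1, 0.4, 3.3, 1.0],
                 [0.6, 0.2, 0.9, 2.7]])
print(max_pool_2x2(fmap))   # each 2x2 region collapses to its strongest response
```

The top-left block [2.1, 1.4; 0.8, 1.9] collapses to 2.1, exactly as in the diagram above; the rest of the map shrinks the same way, halving each spatial dimension.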

 

The Deep Architecture of Vision

Layer by Layer: Building Complexity

 

A CNN's structure mirrors the evolutionary development of vision itself:

Layer 1 (Simple Cells):
Kernel examples:
[-1  0  1]    [1   1   1]
[-2  0  2]    [0   0   0]
[-1  0  1]    [-1 -1  -1]

 

These detect edges, just like the basic cells Hubel and Wiesel found in the brain.
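A rough sketch of why these kernels act as edge detectors, in Python (NumPy, made-up pixel values): on a tiny image whose left half is dark and right half is bright, the first (Sobel-style) kernel responds strongly at the vertical edge, while the horizontal-edge kernel stays silent.

```python
import numpy as np

sobel_x = np.array([[-1., 0., 1.],
                    [-2., 0., 2.],
                    [-1., 0., 1.]])    # responds to vertical edges
horiz   = np.array([[ 1.,  1.,  1.],
                    [ 0.,  0.,  0.],
                    [-1., -1., -1.]])  # responds to horizontal edges

# 4x4 image: dark left half, bright right half -> one vertical edge
img = np.array([[0., 0., 1., 1.]] * 4)

def respond(image, kernel):
    kh, kw = kernel.shape
    return np.array([[np.sum(image[i:i+kh, j:j+kw] * kernel)
                      for j in range(image.shape[1] - kw + 1)]
                     for i in range(image.shape[0] - kh + 1)])

print(respond(img, sobel_x))  # strong responses: the vertical edge is seen
print(respond(img, horiz))    # all zeros: there is no horizontal edge
```

Like Hubel and Wiesel's simple cells, each kernel fires only for its preferred orientation.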

As we go further, the network begins to recognize more complex patterns:

 

[Image: visualizations of the increasingly complex features learned by deeper layers]

 

This hierarchy is like how a child learns to see: First edges → Then shapes → Then objects → Finally meaning.

 

The Backpropagation Story: Learning to See Better

The way CNNs learn is a very human process. When we calculate gradients through the network:

[Image: gradients flowing backwards through the network's layers]


This isn't just calculus; it's the mathematics of learning from mistakes. The chain rule traces each output error backwards, assigning every kernel weight its share of the blame. Like an art student refining their work based on feedback, each backpropagation step nudges the kernels so the network gets better at understanding what it sees.
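A minimal sketch of that chain rule for a single convolution output, in Python (NumPy, hypothetical patch and kernel values): since y = Σ pᵢkᵢ, the gradient of the loss with respect to each kernel weight is simply pᵢ · ∂L/∂y, which we verify against a numerical finite-difference check.

```python
import numpy as np

patch  = np.array([[1., 2., 0.],
                   [0., 1., 3.],
                   [2., 0., 1.]])
kernel = np.array([[0.5, -1.0,  0.0],
                   [1.0,  0.5,  0.0],
                   [0.0,  1.0, -0.5]])

def loss(k):
    y = np.sum(patch * k)   # one convolution output
    return 0.5 * y ** 2     # toy loss

# Analytic gradient from the chain rule: dL/dk_i = p_i * dL/dy, with dL/dy = y
y = np.sum(patch * kernel)
analytic = patch * y

# Numerical check: nudge each weight slightly and watch how the loss moves
eps = 1e-6
numeric = np.zeros_like(kernel)
for i in range(3):
    for j in range(3):
        k = kernel.copy()
        k[i, j] += eps
        numeric[i, j] = (loss(k) - loss(kernel)) / eps

print(np.allclose(analytic, numeric, atol=1e-4))  # True
```

The two gradients agree: the "feedback" each weight receives is just the pixel it looked at, scaled by how wrong the output was.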

 

The Activation Story: ReLU and the Art of Decision

The choice of ReLU (Rectified Linear Unit) as an activation function tells a surprisingly profound story:

ReLU(x) = max(0, x)

 

Graphically:

[Image: graph of ReLU, flat at zero for x < 0 and the identity line for x ≥ 0]

 

This simple function shows an important truth about both biological and artificial vision: neurons must decide whether to fire or not. Why is this so effective? Think about how we see:

  • Some features in a scene matter (positive activation)
  • Others don't (set to zero)
  • There's no need for complicated in-between details

 

This is similar to how we pay attention—either we notice something, or we don’t. The math behind ReLU captures this simple yes/no process of seeing, while still allowing the network to learn. 
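In code, this yes/no gating is a one-liner (Python/NumPy, example activations made up):

```python
import numpy as np

def relu(x):
    # Fire (keep the value) or stay silent (zero) -- nothing in between
    return np.maximum(0.0, x)

activations = np.array([-2.0, -0.5, 0.0, 0.5, 3.0])
print(relu(activations))   # negative inputs are silenced, positive ones pass
```

Everything below zero is discarded; everything above passes through unchanged, keeping the gradient alive for the features that matter.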

Check out our previous blog, AI Model Compression Part III: Foundation of Optimization: The Birth of Neural Computation, for a deeper understanding.
