💊 Pill of the Week
Convolutional Neural Networks (CNNs) are a class of deep learning models particularly well-suited for processing grid-structured data, especially images. They are central to many modern computer vision tasks, from object detection to medical diagnostics, due to their ability to capture spatial hierarchies in data.
Why CNNs for Images?
Unlike traditional feedforward neural networks, which require flattening images into 1D vectors (thus discarding spatial relationships), CNNs preserve the 2D structure of images. This allows them to effectively model local and global patterns, such as edges, textures, shapes, and object parts, without losing information about how pixels relate spatially.
Core Components of CNNs
CNNs are built by stacking several specialized layers, each responsible for progressively abstracting features from the input. The three fundamental types of layers are:
Convolutional Layers: These apply filters (kernels) that scan across the input image to extract features like edges or textures. Each filter produces a feature map, and multiple filters enable the model to detect different kinds of patterns.
Mathematically, this involves a sliding window operation performing element-wise multiplication and summation.
These filters are learned during training, allowing the network to optimize for task-relevant features.
Activation Functions: After convolution, a non-linear activation function like ReLU (f(x) = max(0, x)) introduces non-linearity, enabling the network to model complex functions beyond linear transformations.
Pooling Layers: Pooling layers downsample feature maps to reduce dimensionality and computational load, while retaining the most salient information. They also help with translational invariance, making the model more robust to small shifts or distortions in the input.
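Put together, a typical CNN simply alternates these layers. Here is a minimal sketch in Keras; the filter counts and input size are illustrative assumptions, not taken from any specific architecture:
import tensorflow as tf

# A minimal CNN block: convolution -> ReLU -> pooling, stacked twice
model = tf.keras.Sequential([
    tf.keras.Input(shape=(32, 32, 3)),              # e.g. 32x32 RGB images (hypothetical)
    tf.keras.layers.Conv2D(16, 3, activation='relu'),
    tf.keras.layers.MaxPooling2D(2),
    tf.keras.layers.Conv2D(32, 3, activation='relu'),
    tf.keras.layers.MaxPooling2D(2),
])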
Stacking and Hierarchies
As you move deeper into the network, CNNs learn more abstract and high-level representations:
Early layers: low-level features like edges and textures.
Middle layers: combinations of features forming parts of objects.
Deeper layers: object-level concepts (e.g. “eye,” “wheel”).
This hierarchical feature learning is what makes CNNs so powerful for vision tasks.
Why CNNs Generalize Well
By sharing parameters (filters) across space and using local connectivity, CNNs significantly reduce the number of learnable parameters compared to fully connected networks. This not only helps with computational efficiency but also reduces overfitting, making them more scalable and generalizable.
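To see the effect of weight sharing in numbers, compare the parameter counts of a small convolutional layer and a dense layer applied to the same input; the 32x32 RGB input and layer sizes below are made up for illustration:
import tensorflow as tf

inputs = tf.keras.Input(shape=(32, 32, 3))          # hypothetical 32x32 RGB input
conv_out = tf.keras.layers.Conv2D(16, 3)(inputs)    # 3*3*3*16 + 16 = 448 parameters
flat = tf.keras.layers.Flatten()(inputs)            # 32*32*3 = 3,072 values per image
dense_out = tf.keras.layers.Dense(16)(flat)         # 3072*16 + 16 = 49,168 parameters
The convolutional layer reuses the same 448 weights at every spatial position, which is exactly the parameter sharing described above.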
The Building Blocks of CNNs
Now that we understand the high-level intuition behind CNNs—preserving spatial structure, learning hierarchical features, and achieving generalization through shared weights—let’s examine how this is actually implemented. At the heart of a CNN are a few key components that transform raw pixel data into powerful, abstract feature representations: convolutional layers, activation functions, and pooling operations.
Each of these plays a distinct yet complementary role in the network’s ability to perceive and understand visual patterns. In the sections that follow, we'll break down how these layers work, how they interact, and how their configuration affects the model's performance—both conceptually and practically through TensorFlow/Keras examples.
Convolutional Layers
The convolutional layer is the core building block of a CNN. Its purpose is to extract patterns or features from the input data, usually images, by applying a set of learnable filters. These filters are small matrices that slide across the input data, performing a mathematical operation known as a convolution. The result of this operation is a new set of matrices known as feature maps, which summarize the presence and location of the features detected by each filter.
A convolution works by aligning the filter to a small region of the input and computing the dot product between the filter's values and the corresponding input values. This value is then recorded as a single pixel in the output feature map. The filter then shifts over to the next region of the image, and the process repeats, moving left to right and top to bottom.
For example, suppose we have a 5x5 input and a 3x3 filter with a stride of 1. The filter will convolve over the input, producing a smaller feature map that encodes local information at each position. The values in this feature map reflect how strongly the input region matches the pattern encoded by the filter.
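The following NumPy sketch spells out that sliding-window computation for the 5x5 input and 3x3 filter above. The filter values here are made up for illustration; in a real CNN they are learned during training:
import numpy as np

x = np.arange(25).reshape(5, 5)            # a made-up 5x5 input
k = np.array([[1, 0, -1],
              [1, 0, -1],
              [1, 0, -1]])                 # a 3x3 vertical-edge filter (illustrative)

out = np.zeros((3, 3))                     # output size: (5 - 3) / 1 + 1 = 3
for i in range(3):
    for j in range(3):
        # align the filter with a 3x3 region, multiply element-wise, and sum
        out[i, j] = np.sum(x[i:i+3, j:j+3] * k)

print(out)                                 # each entry is one pixel of the feature map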
Several parameters determine how the convolution operation behaves:
The number of filters determines how many different patterns the layer can detect. Each filter learns to focus on a distinct feature, such as edges, textures, or shapes. This number defines the depth of the output feature map and is controlled via the filters argument in Keras and TensorFlow.
The size of the filter, specified by the kernel_size parameter, defines how much of the input each filter examines at a time. For example, a kernel size of 3 means the filter looks at 3x3 patches of the input image.
The stride determines how far the filter moves after each operation. A stride of 1 results in overlapping applications of the filter, while a larger stride skips input values, effectively downsampling the output. This is specified via the strides argument.
Padding controls what happens when the filter reaches the edge of the image. Without padding, the filter cannot cover the edges, resulting in smaller output dimensions. By adding zeros around the input (zero-padding), we allow the filter to include border regions. This behavior is set using the padding argument: "valid" applies no padding, while "same" pads the input so that, with a stride of 1, the output keeps the same spatial dimensions as the input.
To illustrate how these parameters affect the output, consider a batch of ten color images, each of size 16x8 pixels. Because they are color images, they have three channels (red, green, and blue), so the full input shape is (10, 8, 16, 3).
Suppose we apply a convolutional layer with 5 filters, a 3x3 kernel, a stride of (2, 2), and "valid" padding:
import tensorflow as tf

tf.keras.layers.Conv2D(
    filters=5,         # number of filters -> output depth of 5
    kernel_size=3,     # each filter looks at 3x3 patches
    strides=(2, 2),    # move 2 pixels at a time in both directions
    padding='valid'    # no zero-padding around the edges
)
This layer will generate output feature maps with depth 5 (one per filter). With 'valid' padding, each spatial dimension of the output is given by floor((input - kernel) / stride) + 1. For width:
output_width = floor((16 - 3) / 2) + 1 = floor(6.5) + 1 = 7
And for height:
output_height = floor((8 - 3) / 2) + 1 = floor(2.5) + 1 = 3
The resulting output shape is (10, 3, 7, 5): 10 samples, a height of 3, a width of 7, and 5 channels (one per filter).
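As a quick sanity check (not part of a real training pipeline), you can run the layer on a random tensor of the stated input shape and inspect the result:
import tensorflow as tf

x = tf.random.normal((10, 8, 16, 3))   # batch of 10 images, 8x16 pixels, 3 channels
conv = tf.keras.layers.Conv2D(filters=5, kernel_size=3, strides=(2, 2), padding='valid')
print(conv(x).shape)                   # (10, 3, 7, 5)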
Activation Layers
After computing the feature maps through convolution, the CNN passes them through an activation function. This operation is essential for introducing non-linearity into the model. Without non-linear activation functions, no matter how many layers we stack, the model would behave like a simple linear transformation, making it incapable of capturing complex visual structures.
The most common activation function used in CNNs is the Rectified Linear Unit (ReLU), defined as f(x) = max(0, x): it keeps all positive values unchanged and sets all negative values to zero. This has the effect of enhancing prominent activations while suppressing weak or irrelevant signals.
To understand the impact of ReLU, imagine a convolutional filter that detects vertical lines. The output feature map may contain both positive and negative values, depending on how strongly a vertical line is present in a region. Applying ReLU preserves only the regions with a strong positive response, making the detection sharper and more informative for subsequent layers.
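A tiny example with made-up feature-map values shows the effect:
import tensorflow as tf

fmap = tf.constant([[-2.0, 1.0],
                    [3.0, -0.5]])      # hypothetical feature-map patch
print(tf.nn.relu(fmap))                # [[0., 1.], [3., 0.]] - negatives are zeroed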
ReLU (or other activation functions) is typically applied immediately after the convolution operation. In Keras and TensorFlow, you can specify the activation directly within the convolutional layer:
tf.keras.layers.Conv2D(
    filters=5,
    kernel_size=3,
    strides=(2, 2),
    padding='valid',
    activation='relu'  # ReLU applied element-wise after the convolution
)
In the earlier example, this would produce an output shape of (10, 3, 7, 5), identical to the output of the convolutional layer before ReLU: the activation function modifies the values but not the dimensions.
Pooling Layers
Although convolution and activation layers are excellent for detecting local patterns, they are sensitive to the location of those patterns. For example, a cat that appears in the top-left corner of one image might appear in the center of another. A CNN trained without any positional invariance would need to relearn the same feature in many different locations.
To address this, CNNs use pooling layers, which perform a form of downsampling. Pooling reduces the spatial dimensions of the feature maps while preserving the most important information. This not only decreases computational cost but also provides translation invariance, helping the network generalize better.
The most commonly used pooling operation is max pooling. This operation slides a small window over each feature map and retains only the maximum value within that window. The idea is to capture the most salient features while discarding less relevant data. Other types of pooling include mean pooling (averaging the values) and sum pooling (adding them together), but max pooling tends to work best in practice.
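As a small worked example with made-up values, a 2x2 max pooling with stride 2 keeps one value per window:
import numpy as np

fmap = np.array([[1, 3, 2, 0],
                 [4, 2, 1, 5],
                 [0, 1, 3, 2],
                 [2, 6, 0, 1]])                      # hypothetical 4x4 feature map

# split into 2x2 blocks and take the maximum of each block
pooled = fmap.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)                                        # [[4 5]
                                                     #  [6 3]]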
Like convolution, pooling is controlled by a window size (pool_size), a stride, and optional padding. Unlike convolution, however, pooling uses a fixed operation (such as max) and involves no learnable weights.
Consider again the output from the previous layer, with shape (10, 3, 7, 5). Now we apply a max pooling layer with a 2x2 window, a stride of 2, and "same" padding:
tf.keras.layers.MaxPooling2D(
    pool_size=(2, 2),  # 2x2 window
    strides=(2, 2),    # non-overlapping windows
    padding='same'     # pad edges so border values are still covered
)
The pooling layer is applied independently to each of the 5 feature maps in every image.
With 'same' padding, each spatial dimension of the output is simply ceil(input / stride). To compute the new height:
output_height = ceil(3 / 2) = 2
And the new width:
output_width = ceil(7 / 2) = 4
Thus, the output shape becomes (10, 2, 4, 5). This output is more compact and focuses on the most prominent features extracted from the input.
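Again as a sanity check, running the pooling layer on a tensor with the previous layer's output shape confirms the result:
import tensorflow as tf

x = tf.random.normal((10, 3, 7, 5))    # output of the previous conv layer
pool = tf.keras.layers.MaxPooling2D(pool_size=(2, 2), strides=(2, 2), padding='same')
print(pool(x).shape)                   # (10, 2, 4, 5)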
Summary
In a Convolutional Neural Network, the convolutional layers are responsible for detecting patterns through localized filters, the activation layers introduce essential non-linearity to enhance the learning of complex relationships, and the pooling layers reduce dimensionality and improve robustness to translation. These three types of layers work in unison to transform raw pixel values into high-level features that are meaningful for classification or detection.
A solid understanding of how these layers function—and how their parameters interact—is crucial for designing effective CNN architectures. In future steps, these feature representations are typically passed to fully connected layers, which interpret the features and make final predictions.
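As a hedged sketch of that final step (the layer sizes and class count here are illustrative assumptions), the pooled feature maps are flattened into a vector and passed through dense layers that produce the prediction:
import tensorflow as tf

head = tf.keras.Sequential([
    tf.keras.layers.Flatten(),                        # feature maps -> 1D vector
    tf.keras.layers.Dense(64, activation='relu'),     # interpret the features
    tf.keras.layers.Dense(10, activation='softmax'),  # e.g. 10-class prediction
])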
To explore more about how to implement these layers, consult the official TensorFlow documentation.
Want to dive deeper into how to structure, optimize, and train full CNN models for real-world tasks? Check the ⚡Power-Up Corner section below!
📖 Book of the Week
If you're building or want to build algorithmic trading strategies — whether you're an independent quant, fintech dev, or Python-savvy investor — you need to check this out:
“Python for Algorithmic Trading Cookbook”
By Jason Strimpel
This book isn’t just about theory — it’s a hands-on guide to designing, testing, and deploying real-world trading strategies in Python. From sourcing data to executing trades, it gives you a production-ready toolkit for every stage of the trading workflow.
What sets it apart?
It takes you from backtesting to live execution — with clean Python code and modern libraries tailored to quant workflows:
✅ Acquire market data using OpenBB, store it with ArcticDB, HDF5, or SQLite
✅ Engineer alpha factors using statsmodels and SciPy
✅ Backtest and optimize strategies with VectorBT and Zipline Reloaded
✅ Evaluate performance with Alphalens and Pyfolio
✅ Connect to Interactive Brokers’ API for live trading and portfolio management
This is a must-read for:
📈 Quant traders building from scratch
🐍 Python developers entering the finance space
📊 Investors exploring data-driven strategies
💼 Engineers and analysts in fintech or hedge fund environments
If you're ready to move from static signals to live trading systems — and want to master every step of the pipeline — this book is for you.
⚡Power-Up Corner
Getting convolutional neural networks (CNNs) to work in real-world settings means more than just knowing the theory—it’s about choosing the right architecture, applying effective design patterns, and training models with best practices. This section delivers practical, experience-backed advice to help you design CNNs that perform reliably in real applications.
What You'll Learn in This Section:
How Deep Should Your CNN Be? Guidelines for selecting the right network depth based on task complexity.
Design Patterns: Proven CNN Architectures. Overview of VGG, ResNet, Inception, MobileNet, and EfficientNet, with use-case advice.
Best Practices for Training CNNs. Tips on data augmentation, normalization, regularization, and training dynamics.
Bonus: Transfer Learning with EfficientNet in Keras. Hands-on example of using EfficientNetB0 as a frozen backbone for fast, accurate training on small datasets (a brief sketch of this pattern follows below).
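As a rough preview of that bonus pattern, here is a minimal sketch; the input size and class count are illustrative assumptions, not the article's exact code:
import tensorflow as tf

# EfficientNetB0 as a frozen feature extractor (illustrative sketch)
base = tf.keras.applications.EfficientNetB0(
    include_top=False, weights='imagenet', input_shape=(224, 224, 3))
base.trainable = False                                  # freeze the pretrained weights

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10, activation='softmax'),    # e.g. 10 target classes
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])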
Designing CNN Architectures for Real-World Projects