A Convolutional Neural Network (CNN) analyzes images by learning filters that detect features like edges and textures. These filters build increasingly complex patterns layer by layer, enabling CNNs to recognize objects and understand spatial relationships, making them ideal for image-related tasks.
A CNN is preferable to a Fully Connected Neural Network when working with image data or other spatially structured data. This preference stems from several key advantages:
Parameter Efficiency: CNNs significantly reduce the number of parameters by sharing weights across spatial dimensions, making them more efficient and less prone to overfitting, especially in large-scale image processing tasks.
Feature Learning: CNNs automatically learn hierarchical feature representations from raw data, starting from low-level features like edges to high-level features like objects.
Translational Invariance: Through techniques like pooling and convolution, CNNs can recognize patterns regardless of their position in the input, making them highly effective for tasks like object detection and image classification.
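The parameter-efficiency advantage can be made concrete with a back-of-the-envelope comparison. The layer sizes below (a 224×224 RGB input, 1000 hidden units, 64 filters of size 3×3) are illustrative assumptions, not values from the text:

```python
# Fully connected: every input pixel connects to every hidden unit.
h, w, c = 224, 224, 3          # illustrative input: 224x224 RGB image
hidden_units = 1000
fc_params = (h * w * c) * hidden_units + hidden_units   # weights + biases

# Convolutional: 64 filters of size 3x3x3, shared across all spatial positions.
filters, k = 64, 3
conv_params = filters * (k * k * c) + filters           # weights + biases

print(f"fully connected: {fc_params:,}")   # 150,529,000
print(f"convolutional:   {conv_params:,}") # 1,792
```

Because the same small set of filter weights is reused at every spatial position, the convolutional layer needs orders of magnitude fewer parameters than the fully connected one.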
A filter, or kernel, is a small matrix of weights in a CNN that detects specific characteristics within the input data, such as edges, textures, or patterns. During the convolution operation, the filter slides across the input (e.g., an image), performing element-wise multiplication and summing the results. Each filter learns to capture a distinct feature of the image. Figure 3 illustrates how six different filters extract distinct characteristics from the same image.
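The slide-multiply-sum operation can be sketched in a few lines of plain Python. The vertical-edge filter below is a common illustrative choice, not one from the text, and the loop implements cross-correlation, which is what most deep-learning libraries actually compute under the name "convolution":

```python
def convolve2d(image, kernel):
    """Valid 2D convolution (cross-correlation): slide the kernel over the
    image, multiply element-wise, and sum each window."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(ih - kh + 1):
        row = []
        for j in range(iw - kw + 1):
            s = sum(image[i + di][j + dj] * kernel[di][dj]
                    for di in range(kh) for dj in range(kw))
            row.append(s)
        out.append(row)
    return out

# A vertical-edge filter responds where intensity changes left to right:
img = [[0, 0, 0, 1, 1],
       [0, 0, 0, 1, 1],
       [0, 0, 0, 1, 1],
       [0, 0, 0, 1, 1]]
edge = [[-1, 0, 1],
        [-1, 0, 1],
        [-1, 0, 1]]
print(convolve2d(img, edge))  # → [[0, 3, 3], [0, 3, 3]]
```

The output is large exactly where the filter's window straddles the dark-to-bright boundary, which is how a single filter localizes one feature across the whole image.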
When an image is passed through a convolutional layer, the spatial dimensions of the output are determined by:

O = floor((I - K + 2P) / S) + 1

where:
- I: Input size (height or width)
- K: Kernel (filter) size
- P: Padding (number of pixels added to each side of the input)
- S: Stride (step size of the kernel)
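These quantities can be checked numerically with the standard output-size relation O = floor((I − K + 2P) / S) + 1; the 224×224 input size below is an illustrative assumption:

```python
def conv_output_size(i, k, p=0, s=1):
    """Spatial output size of a convolution: floor((I - K + 2P) / S) + 1."""
    return (i - k + 2 * p) // s + 1

# A 224x224 input with a 3x3 kernel, padding 1, stride 1 keeps its size:
print(conv_output_size(224, 3, p=1, s=1))  # → 224
# Stride 2 roughly halves each spatial dimension:
print(conv_output_size(224, 3, p=1, s=2))  # → 112
```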
Stride refers to the step size with which the convolution filter moves across the input data. A stride of 1 means the filter moves one pixel at a time, while a stride of 2 or more means the filter skips pixels as it moves. Larger strides reduce the spatial dimensions of the output but may result in a loss of information.
Padding involves adding extra pixels around the input data to control the spatial dimensions of the output. This is often done to preserve the original size of the input after convolution.
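Zero padding, the most common choice, is simply a border of zeros around the input. A minimal sketch (the 2×2 input is an illustrative assumption):

```python
def zero_pad(image, p):
    """Add p rows and columns of zeros around a 2D image."""
    w = len(image[0]) + 2 * p
    padded = [[0] * w for _ in range(p)]
    for row in image:
        padded.append([0] * p + list(row) + [0] * p)
    padded += [[0] * w for _ in range(p)]
    return padded

img = [[1, 2],
       [3, 4]]
print(zero_pad(img, 1))
# → [[0, 0, 0, 0], [0, 1, 2, 0], [0, 3, 4, 0], [0, 0, 0, 0]]
```

With a 3×3 kernel and stride 1, a padding of 1 keeps this 2×2 input at 2×2 after convolution, which is the "preserve the original size" behavior described above.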
Pooling is a downsampling operation that reduces the spatial dimensions of the input volume, thereby decreasing the computational load and helping to achieve spatial invariance.
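Max pooling, the most widely used variant, can be sketched as taking the maximum over each window. The 4×4 feature map below is an illustrative assumption:

```python
def max_pool(image, size=2, stride=2):
    """Max pooling: take the maximum over each size x size window,
    moving by `stride` pixels between windows."""
    out = []
    for i in range(0, len(image) - size + 1, stride):
        row = []
        for j in range(0, len(image[0]) - size + 1, stride):
            row.append(max(image[i + di][j + dj]
                           for di in range(size) for dj in range(size)))
        out.append(row)
    return out

feature_map = [[1, 3, 2, 4],
               [5, 6, 1, 2],
               [7, 2, 9, 0],
               [1, 4, 3, 8]]
print(max_pool(feature_map))  # → [[6, 4], [7, 9]]
```

Each 2×2 block collapses to its strongest activation, halving both spatial dimensions while keeping the most salient response in each region; this is also why small translations of a feature leave the pooled output unchanged.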






