A CNN doesn't look at the whole image at once. It uses **sliding filters (kernels)** to scan for local patterns.
**Key layers:**
- **Conv layer**: apply learnable filters → detects edges, textures
- **ReLU**: adds non-linearity (negative values → 0)
- **Pooling**: reduce spatial dimensions (max-pool takes the biggest value in a region)
- **Fully connected**: combine learned features → make prediction
**The magic:** Early layers detect simple patterns (edges). Later layers combine them into complex objects (faces, cars). This hierarchical feature learning is why CNNs are so powerful for vision.