Vision Transformers dominate the research benchmarks. Every new paper seems to propose another attention-based architecture that achieves state-of-the-art accuracy on ImageNet. If you followed only the headlines, you'd think convolutional neural networks were obsolete.
They're not. In production computer vision systems—where latency matters, compute is constrained, and reliability trumps benchmark scores—CNNs remain the dominant architecture. Understanding why reveals important lessons about the gap between research and deployment.
The Research vs. Production Gap
Academic benchmarks optimize for accuracy. A model that achieves 88.5% top-1 accuracy on ImageNet gets published; one that achieves 88.0% doesn't. This creates pressure to add complexity—more parameters, more attention layers, more compute—for marginal accuracy gains.
Production systems optimize for different objectives:
Latency: A surveillance system needs to process 30 frames per second. A quality inspection system on a manufacturing line can't introduce bottlenecks. Real-time requirements constrain model architecture; a measurement sketch follows this list.
Throughput: Processing a million images per day is different from processing one image for a demo. Batch efficiency matters as much as single-image inference time.
Resource efficiency: Cloud compute costs money. Edge devices have fixed resources. A model that uses half the GPU memory while achieving 95% of the accuracy is often the better choice.
Reliability: Models that perform consistently across conditions—lighting changes, camera variations, distribution shifts—matter more than models that excel on curated benchmarks.
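To make latency and throughput concrete, here is a minimal measurement sketch in PyTorch. The model, input size, batch sizes, and iteration counts are placeholders; the point is to time the model you actually plan to deploy, on the hardware it will run on.

```python
import time
import torch
import torchvision.models as models

# Placeholder model: swap in whatever architecture you are evaluating.
model = models.mobilenet_v3_large(weights=None).eval()
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

def measure(batch_size, n_iters=50, warmup=10):
    """Rough single-process latency/throughput measurement."""
    x = torch.randn(batch_size, 3, 224, 224, device=device)
    with torch.no_grad():
        for _ in range(warmup):              # warm up caches / CUDA kernels
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(n_iters):
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    latency_ms = elapsed / n_iters * 1000
    throughput = batch_size * n_iters / elapsed
    return latency_ms, throughput

for bs in (1, 32):
    lat, thr = measure(bs)
    print(f"batch={bs}: {lat:.1f} ms/batch, {thr:.0f} images/sec")
```

Batch size 1 approximates interactive latency; larger batches approximate bulk throughput.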
Why CNNs Excel in Production
Computational Efficiency
Convolutions are extremely well-optimized in hardware and software. Decades of work have produced CUDA kernels, specialized accelerators, and inference engines tuned specifically for convolutional operations.
Self-attention, the core operation in Vision Transformers, has quadratic complexity in the number of tokens, and for images the token count grows with resolution. For high-resolution inputs this becomes prohibitive. Various approximations exist, but they add complexity and often sacrifice accuracy.
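To see where the quadratic term bites, a back-of-the-envelope count helps. The patch size and hidden dimension below mirror ViT-Base and are illustrative only; the count is in multiply-accumulates and ignores the softmax and MLP blocks.

```python
# Rough cost of one self-attention layer on a square image.
# patch=16, dim=768 mirror ViT-Base; values are illustrative only.
def attention_flops(image_size, patch=16, dim=768):
    tokens = (image_size // patch) ** 2          # sequence length N
    qk_scores = tokens * tokens * dim            # Q @ K^T
    weighted_sum = tokens * tokens * dim         # softmax(QK^T) @ V
    projections = 4 * tokens * dim * dim         # Q, K, V, and output projections
    return qk_scores + weighted_sum + projections

for size in (224, 448, 896):
    print(f"{size}px: {attention_flops(size) / 1e9:.1f} GFLOPs per attention layer")
```

Doubling the resolution quadruples the token count, so the token-interaction terms grow sixteen-fold while the per-token projections grow only four-fold; that is why the cost curve turns prohibitive at high resolutions.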
A well-optimized CNN like EfficientNet or MobileNet can process images at 5-10x the throughput of a comparably accurate Vision Transformer on the same hardware.
Inductive Bias
CNNs bake in assumptions about images: local connectivity (nearby pixels are related), translation equivariance (a cat in the corner is still a cat), and hierarchical feature composition (edges combine into textures, textures into parts, parts into objects).
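Translation equivariance is easy to check directly: shifting the input shifts a convolution's output by the same amount. The sketch below uses circular padding so the property holds exactly at the borders; it is an illustration, not a training recipe.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Circular padding makes translation equivariance exact, including at image borders.
conv = nn.Conv2d(3, 8, kernel_size=3, padding=1, padding_mode="circular", bias=False)

x = torch.randn(1, 3, 32, 32)
shifted_x = torch.roll(x, shifts=(5, 7), dims=(2, 3))   # shift the input

out_then_shift = torch.roll(conv(x), shifts=(5, 7), dims=(2, 3))
shift_then_out = conv(shifted_x)

# The two orders of operation agree: conv(shift(x)) == shift(conv(x)).
print(torch.allclose(out_then_shift, shift_then_out, atol=1e-6))  # True
```

A transformer offers no such guarantee out of the box; it has to learn approximate shift-robustness from data or augmentation.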
These assumptions match reality. They let CNNs learn efficiently from limited data. A CNN trained on 1,000 images often outperforms a Vision Transformer trained on the same data because the transformer has to learn these properties from scratch.
When you have billions of training images, transformers can overcome this disadvantage. When you have thousands—which is most real-world computer vision problems—CNNs' inductive biases are an advantage.
Edge Deployment
CNNs have been deployed to edge devices for years. The tooling is mature: TensorRT, ONNX Runtime, TFLite, and CoreML all optimize CNN architectures extremely well. Quantization to INT8 or even INT4 is well-understood.
Vision Transformers are newer, and edge deployment tooling is less mature. Attention operations don't quantize as cleanly. Memory access patterns are less predictable, which matters on constrained devices.
For deployment on embedded GPUs, mobile phones, or custom silicon, CNNs are often the only viable option.
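As a small illustration of how mature this path is, the sketch below exports a MobileNet to ONNX and runs it with ONNX Runtime. The file name, input shape, and execution provider are placeholders; INT8 quantization or TensorRT conversion would start from the same exported graph.

```python
import numpy as np
import torch
import torchvision.models as models
import onnxruntime as ort

# Placeholder model and file name; substitute your trained weights.
model = models.mobilenet_v3_small(weights=None).eval()
dummy = torch.randn(1, 3, 224, 224)

torch.onnx.export(
    model, dummy, "mobilenet.onnx",
    input_names=["input"], output_names=["logits"],
    dynamic_axes={"input": {0: "batch"}},   # allow variable batch size
    opset_version=17,
)

# Run the exported graph with ONNX Runtime (CPU provider here; swap in the
# execution provider that matches your target device).
session = ort.InferenceSession("mobilenet.onnx", providers=["CPUExecutionProvider"])
logits = session.run(None, {"input": np.random.randn(1, 3, 224, 224).astype(np.float32)})[0]
print(logits.shape)  # (1, 1000)
```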
When Transformers Make Sense
This isn't to say Vision Transformers are never the right choice. They excel in specific scenarios:
Large-scale pretraining: If you're training foundation models on billions of images, transformers' flexibility lets them capture patterns that CNNs' fixed receptive fields miss.
Multi-modal applications: Transformers' attention mechanism naturally handles heterogeneous inputs—images plus text, images plus metadata, multiple image types. Architectures like CLIP benefit from this flexibility; a short example appears after this list.
Long-range dependencies: Tasks requiring global understanding of images—scene classification, visual question answering—benefit from attention's ability to connect distant image regions.
When accuracy at any cost is acceptable: Some applications can tolerate slower inference if accuracy improvements are valuable enough. Medical imaging diagnosis, satellite imagery analysis, and similar high-stakes applications might justify the compute overhead.
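As one concrete multi-modal example from the list above, the sketch below scores an image against text prompts with a pretrained CLIP model via the Hugging Face transformers library; the checkpoint name, image path, and prompts are illustrative placeholders.

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Illustrative checkpoint; other CLIP variants expose the same interface.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")                 # placeholder image path
prompts = ["a defective part", "a normal part"]   # placeholder prompts

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# logits_per_image: similarity of the image to each text prompt.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(prompts, probs[0].tolist())))
```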
The Hybrid Future
The most promising research direction combines convolutional and attention mechanisms. Architectures like ConvNeXt take lessons learned from transformers—larger kernels, different normalization, training recipes—and apply them to CNNs, achieving competitive accuracy with convolutional efficiency.
Other approaches use convolutions for early feature extraction (where local patterns matter) and attention for later layers (where global context helps). This balances the strengths of both approaches.
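A toy version of that split, assuming a convolutional stem for local feature extraction followed by a single standard transformer encoder layer for global context (all dimensions here are arbitrary and chosen only for illustration):

```python
import torch
import torch.nn as nn

class TinyHybrid(nn.Module):
    """Toy hybrid: conv stem for local patterns, attention for global context."""
    def __init__(self, dim=128, num_classes=10):
        super().__init__()
        # Convolutional stem: cheap local feature extraction and 4x downsampling.
        self.stem = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, dim, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        # Attention over the much shorter sequence of stem features.
        self.encoder = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):
        feats = self.stem(x)                         # (B, dim, H/4, W/4)
        tokens = feats.flatten(2).transpose(1, 2)    # (B, N, dim) token sequence
        tokens = self.encoder(tokens)                # global mixing via attention
        return self.head(tokens.mean(dim=1))         # average pool + classify

logits = TinyHybrid()(torch.randn(2, 3, 64, 64))
print(logits.shape)  # torch.Size([2, 10])
```

Because attention runs on the downsampled token sequence produced by the stem, the quadratic cost applies to far fewer tokens than it would at full resolution.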
Practical Recommendations
If you're building a production computer vision system:
Start with CNNs. EfficientNet, MobileNet, or ResNet variants provide strong baselines with well-understood deployment characteristics.
Measure what matters. Accuracy on a holdout set matters, but so do latency, throughput, and resource consumption. Optimize for your actual constraints.
Benchmark realistically. Test with your actual inference hardware, not just development GPUs. Edge devices, cloud instances, and development machines have very different performance characteristics.
Consider transformers when justified. If you have massive training data, multi-modal inputs, or tasks requiring global reasoning, transformers may be worth the computational overhead.
Watch the hybrid space. ConvNeXt-style architectures and convolutional-attention hybrids may offer the best of both worlds as tooling matures.
The Bottom Line
Fashion in machine learning research doesn't always align with production needs. The architectures that produce papers aren't always the architectures that produce deployed systems.
CNNs aren't obsolete—they're battle-tested. For many production computer vision applications, they remain the right choice: efficient, reliable, well-tooled, and accurate enough. Understanding when to use them versus when to reach for more complex architectures is an essential skill for any ML practitioner building systems that need to work in the real world.