RTX 4090 vs RTX 3090 for LLM Inference
Same VRAM, different speeds. Where the 4090's extra bandwidth and compute pay off, and where the older 3090 still earns its spot.
Both cards have 24 GB of VRAM. The 4090 delivers 1008 GB/s of memory bandwidth and roughly 82 TFLOPS of FP16 compute; the 3090 delivers 936 GB/s and roughly 35 TFLOPS.
For LLM token generation (memory-bound), the 4090 is roughly 10-15% faster on the same workload, tracking its modest bandwidth edge. For prefill on long prompts (compute-bound), the 4090's much larger compute budget dominates, often roughly doubling throughput.
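The decode-speed gap follows directly from the specs. A back-of-envelope sketch (assumed numbers, not benchmarks; the 14 GB model size is a hypothetical 7B model at FP16):

```python
# Decode is memory-bound: each generated token reads all weights once,
# so the tokens/s ceiling is roughly bandwidth / weight bytes.
# Prefill is compute-bound and scales with FP16 TFLOPS instead.

GPUS = {
    "RTX 4090": {"bw_gb_s": 1008, "fp16_tflops": 82},
    "RTX 3090": {"bw_gb_s": 936,  "fp16_tflops": 35},
}

MODEL_GB = 14  # hypothetical 7B model at FP16 (~2 bytes/param)

for name, g in GPUS.items():
    decode_ceiling = g["bw_gb_s"] / MODEL_GB  # tokens/s upper bound
    print(f"{name}: ~{decode_ceiling:.0f} tok/s decode ceiling, "
          f"{g['fp16_tflops']} TFLOPS for prefill")
```

Run the numbers and the ratios match the claims: 1008/936 gives about an 8% decode edge, while 82/35 gives about a 2.3x compute edge for prefill.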
Used 3090s sit at $700-900. New 4090s run $1500-1800. If you run one small-to-medium model at a time, the 4090 wins. For two-card 70B setups, two 3090s give twice the VRAM at roughly half the cost per GB.
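The cost-per-GB claim is simple arithmetic, using midpoints of the price ranges above:

```python
# Rough $/GB-of-VRAM comparison (midpoint prices, assumed for illustration).
configs = {
    "1x RTX 4090 (new)":  {"price": 1650, "vram_gb": 24},
    "1x RTX 3090 (used)": {"price": 800,  "vram_gb": 24},
    "2x RTX 3090 (used)": {"price": 1600, "vram_gb": 48},
}

for name, c in configs.items():
    print(f"{name}: ${c['price'] / c['vram_gb']:.0f} per GB of VRAM")
```

At these prices the 4090 lands near $69/GB against roughly $33/GB for 3090s, single or paired.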
Pick the 4090 for: 7B-13B speed, image generation as a side use, single-card simplicity. Pick 2x 3090 for: 70B class models, 13B at 32k+ context, multi-user serving with batching.
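Whether a model class fits on one card or needs two comes down to weight size plus overhead. A minimal fit check, assuming a hypothetical 20% overhead for KV cache and activations:

```python
def fits(params_b: float, bits: int, vram_gb: float,
         overhead: float = 1.2) -> bool:
    """Approximate: weight bytes = params * bits / 8, padded by overhead
    for KV cache and activations (the 20% figure is an assumption)."""
    weights_gb = params_b * bits / 8
    return weights_gb * overhead <= vram_gb

print(fits(70, 4, 24))   # 70B at 4-bit on one 24 GB card -> False
print(fits(70, 4, 48))   # 70B at 4-bit on 2x 3090 (48 GB) -> True
print(fits(13, 16, 24))  # 13B at FP16 on one 24 GB card -> False
```

This is why the 70B class lands in the two-3090 column: even at 4-bit, the weights alone are about 35 GB.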