Fundamentals

Multimodal

Quick Answer

An LLM that processes multiple types of input: text, images, audio, or video.

Multimodal models can handle different input types simultaneously. Modern models like Claude, GPT-4V, and Gemini accept both text and images. Some models also handle audio or video. Multimodality opens new applications: document analysis (OCR + reasoning), image understanding, content moderation, and more. However, multimodal models are more complex than text-only models. They require careful handling of different modalities, separate encoders for each type, and more computational resources. The quality of each modality varies—most multimodal models are strong on text and images, weaker on audio.

Last verified: 2026-04-08

Compare models

See how different LLMs compare on benchmarks, pricing, and speed.

Browse all models →