For the first three years of the LLM era, models were purely text in, text out. Multimodal AI breaks that barrier — models that can process images, audio, video, code, and documents alongside text.
**What multimodal means in practice:**
**Vision**: Show the model a photo, chart, diagram, or screenshot. It can describe, analyze, extract data, debug, or answer questions about it. GPT-4V, Claude 3, and Gemini are all vision-capable.
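In practice, vision input usually means attaching an image to an ordinary chat message. The sketch below builds an OpenAI-style user message that pairs a question with an inline base64 image; `build_vision_message` is a hypothetical helper, and only the message shape reflects the actual API format.

```python
import base64
import json

def build_vision_message(question: str, image_bytes: bytes) -> dict:
    """Build an OpenAI-style chat message pairing text with an inline image.

    The image travels as a base64 data URL inside the message content,
    so the model sees the question and the picture together.
    """
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ],
    }

# Stand-in bytes; a real call would read an actual PNG file.
msg = build_vision_message("What does this chart show?", b"\x89PNG fake bytes")
print(json.dumps(msg, indent=2))
```

The same two-part content list works for screenshots, charts, or diagrams; only the text prompt changes.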
**Audio**: Whisper (OpenAI) transcribes speech with near-human accuracy in English and supports roughly 100 languages. Gemini 1.5 can process raw audio directly and answer questions about it: transcribing, summarizing, even characterizing tone.
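One practical wrinkle with transcription services is that uploads are usually capped in length or size, so long recordings get split client-side before being sent. A minimal sketch, assuming a 10-minute per-chunk limit (the limit and the `chunk_audio` helper are illustrative, not any specific API's):

```python
def chunk_audio(total_seconds: float, chunk_seconds: float = 600.0,
                overlap_seconds: float = 5.0) -> list[tuple[float, float]]:
    """Split a long recording into overlapping (start, end) windows.

    A few seconds of overlap keeps words that straddle a boundary
    from being cut in half; the duplicated text is deduplicated
    when the per-chunk transcripts are stitched back together.
    """
    spans = []
    start = 0.0
    while start < total_seconds:
        end = min(start + chunk_seconds, total_seconds)
        spans.append((start, end))
        if end >= total_seconds:
            break
        start = end - overlap_seconds
    return spans

# A 25-minute recording becomes three overlapping windows.
spans = chunk_audio(1500.0)
```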
**Video**: Gemini 1.5 Pro can process up to 1 hour of video. Google demonstrated it analyzing an entire Buster Keaton silent film. Practical uses: meeting summaries, lecture notes, video search.
**Documents**: PDF processing with layout understanding — the model knows a header from a table from a paragraph. Great for extracting structured data from invoices, contracts, and reports.
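For structured extraction, the usual pattern is to ask for JSON and then validate the reply before trusting it, since models sometimes wrap JSON in code fences or drop a field. A sketch with a hypothetical invoice schema and a simulated model reply (the field names and `parse_invoice_reply` helper are assumptions for illustration):

```python
import json

EXTRACTION_PROMPT = (
    "Extract the invoice number, total, and currency from this document. "
    'Reply with JSON only, e.g. {"invoice_number": "...", "total": 0, "currency": "..."}'
)

def parse_invoice_reply(reply: str) -> dict:
    """Parse and validate a model's JSON reply.

    Strips the ```json fences some models add, then checks that
    every required field is present before returning the data.
    """
    cleaned = reply.strip().removeprefix("```json").removesuffix("```").strip()
    data = json.loads(cleaned)
    for field in ("invoice_number", "total", "currency"):
        if field not in data:
            raise ValueError(f"missing field: {field}")
    return data

# Stand-in for a real model reply:
reply = '```json\n{"invoice_number": "INV-0042", "total": 129.5, "currency": "EUR"}\n```'
invoice = parse_invoice_reply(reply)
```

Validating at the boundary like this turns a flaky free-text answer into data a downstream system can rely on.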
**How it works technically**: The key is a shared embedding space. Images are divided into patches and encoded into vectors by a Vision Transformer (ViT). Those vectors are projected into the same space as text token embeddings, then processed together by the LLM. The model learns to relate visual and textual concepts.
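The patch-then-project step can be shown in miniature. This toy sketch splits a grayscale image into patches, flattens each, and applies a linear projection so every patch becomes a vector with the same width as a text token embedding; real ViTs add learned weights, positional embeddings, and many layers on top, so treat this only as the shape of the idea.

```python
import random

def patchify(image: list[list[float]], patch: int) -> list[list[float]]:
    """Split an H x W grayscale image (list of rows) into flattened patches."""
    patches = []
    for r in range(0, len(image), patch):
        for c in range(0, len(image[0]), patch):
            patches.append([image[r + i][c + j]
                            for i in range(patch) for j in range(patch)])
    return patches

def project(vec: list[float], weights: list[list[float]]) -> list[float]:
    """Linear projection: map a flattened patch into the embedding space."""
    return [sum(x * w for x, w in zip(vec, row)) for row in weights]

random.seed(0)
image = [[random.random() for _ in range(8)] for _ in range(8)]
d_model = 4                      # toy embedding width shared with text tokens
weights = [[random.random() for _ in range(16)] for _ in range(d_model)]

# An 8x8 image with 4x4 patches yields 4 patch "tokens",
# each a d_model-dim vector, ready to sit in the same sequence as text.
patch_embeddings = [project(p, weights) for p in patchify(image, 4)]
```

Once patches and text tokens share one vector width, the transformer processes them as a single sequence, which is how the model learns to relate visual and textual concepts.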
**Emerging: native audio-out**: GPT-4o and Gemini 2.0 Flash can generate speech directly, rather than as a text-to-speech post-processing step. This enables real-time voice conversations with emotional expression.
**Key takeaway:** Multimodal AI processes images, audio, and video alongside text — unlocking use cases that pure text models can't touch.