For the first three years of the LLM era, models were purely text in, text out. Multimodal AI breaks that barrier — models that can process images, audio, video, code, and documents alongside text.
**What multimodal means in practice:**
**Vision**: Show the model a photo, chart, diagram, or screenshot. It can describe, analyze, extract data, debug, or answer questions about it. GPT-4V, Claude 3, and Gemini are all vision-capable.
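In practice, vision input usually means attaching an image to an ordinary chat message. The sketch below builds an OpenAI-style user message that pairs a question with an inline base64 image; `build_vision_message` is a hypothetical helper, and only the message shape reflects the actual API format.

```python
import base64
import json

def build_vision_message(question: str, image_bytes: bytes) -> dict:
    """Build an OpenAI-style chat message pairing text with an inline image.

    The image travels as a base64 data URL inside the message content,
    so the model sees the question and the picture together.
    """
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ],
    }

# Stand-in bytes; a real call would read an actual PNG file.
msg = build_vision_message("What does this chart show?", b"\x89PNG fake bytes")
print(json.dumps(msg, indent=2))
```

The same two-part content list works for screenshots, charts, or diagrams; only the text prompt changes.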
**Audio**: Whisper (OpenAI) transcribes speech with near-human accuracy in English and supports roughly 100 languages. Gemini 1.5 can process raw audio directly and answer questions about it: transcribing, summarizing, even characterizing tone.
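One practical wrinkle with transcription services is that uploads are usually capped in length or size, so long recordings get split client-side before being sent. A minimal sketch, assuming a 10-minute per-chunk limit (the limit and the `chunk_audio` helper are illustrative, not any specific API's):

```python
def chunk_audio(total_seconds: float, chunk_seconds: float = 600.0,
                overlap_seconds: float = 5.0) -> list[tuple[float, float]]:
    """Split a long recording into overlapping (start, end) windows.

    A few seconds of overlap keeps words that straddle a boundary
    from being cut in half; the duplicated text is deduplicated
    when the per-chunk transcripts are stitched back together.
    """
    spans = []
    start = 0.0
    while start < total_seconds:
        end = min(start + chunk_seconds, total_seconds)
        spans.append((start, end))
        if end >= total_seconds:
            break
        start = end - overlap_seconds
    return spans

# A 25-minute recording becomes three overlapping windows.
spans = chunk_audio(1500.0)
```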
**Video**: Gemini 1.5 Pro can process up to 1 hour of video. Google demonstrated it analyzing an entire Buster Keaton silent film. Practical uses: meeting summaries, lecture notes, video search.
**Documents**: PDF processing with layout understanding — the model knows a header from a table from a paragraph. Great for extracting structured data from invoices, contracts, and reports.
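For structured extraction, the usual pattern is to ask for JSON and then validate the reply before trusting it, since models sometimes wrap JSON in code fences or drop a field. A sketch with a hypothetical invoice schema and a simulated model reply (the field names and `parse_invoice_reply` helper are assumptions for illustration):

```python
import json

EXTRACTION_PROMPT = (
    "Extract the invoice number, total, and currency from this document. "
    'Reply with JSON only, e.g. {"invoice_number": "...", "total": 0, "currency": "..."}'
)

def parse_invoice_reply(reply: str) -> dict:
    """Parse and validate a model's JSON reply.

    Strips the ```json fences some models add, then checks that
    every required field is present before returning the data.
    """
    cleaned = reply.strip().removeprefix("```json").removesuffix("```").strip()
    data = json.loads(cleaned)
    for field in ("invoice_number", "total", "currency"):
        if field not in data:
            raise ValueError(f"missing field: {field}")
    return data

# Stand-in for a real model reply:
reply = '```json\n{"invoice_number": "INV-0042", "total": 129.5, "currency": "EUR"}\n```'
invoice = parse_invoice_reply(reply)
```

Validating at the boundary like this turns a flaky free-text answer into data a downstream system can rely on.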
**How it works technically**: The key is a shared embedding space. Images are divided into patches and encoded into vectors by a Vision Transformer (ViT). Those vectors are projected into the same space as text token embeddings, then processed together by the LLM. The model learns to relate visual and textual concepts.
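The patch-then-project step can be shown in miniature. This toy sketch splits a grayscale image into patches, flattens each, and applies a linear projection so every patch becomes a vector with the same width as a text token embedding; real ViTs add learned weights, positional embeddings, and many layers on top, so treat this only as the shape of the idea.

```python
import random

def patchify(image: list[list[float]], patch: int) -> list[list[float]]:
    """Split an H x W grayscale image (list of rows) into flattened patches."""
    patches = []
    for r in range(0, len(image), patch):
        for c in range(0, len(image[0]), patch):
            patches.append([image[r + i][c + j]
                            for i in range(patch) for j in range(patch)])
    return patches

def project(vec: list[float], weights: list[list[float]]) -> list[float]:
    """Linear projection: map a flattened patch into the embedding space."""
    return [sum(x * w for x, w in zip(vec, row)) for row in weights]

random.seed(0)
image = [[random.random() for _ in range(8)] for _ in range(8)]
d_model = 4                      # toy embedding width shared with text tokens
weights = [[random.random() for _ in range(16)] for _ in range(d_model)]

# An 8x8 image with 4x4 patches yields 4 patch "tokens",
# each a d_model-dim vector, ready to sit in the same sequence as text.
patch_embeddings = [project(p, weights) for p in patchify(image, 4)]
```

Once patches and text tokens share one vector width, the transformer processes them as a single sequence, which is how the model learns to relate visual and textual concepts.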
**Emerging: native audio-out**: GPT-4o and Gemini 2.0 Flash can generate speech directly, rather than as a text-to-speech post-processing step. This enables real-time voice conversations with emotional expression.
**Key takeaway:** Multimodal AI processes images, audio, and video alongside text — unlocking use cases that pure text models can't touch.