Multi-Modal Chatbot

Learn how to build a chatbot that understands both images and PDFs using the AI SDK.

What is Multi-Modal?

Multi-modal refers to the ability of AI models to process and understand multiple input formats. In this guide, we’ll focus on:

Images: Screenshots, photos, diagrams
PDFs: Documents, reports, forms
Text: Regular chat messages

Prerequisites

Node.js 18+
A Vercel AI Gateway API key
Basic knowledge of Next.js and React

Setup

Create a new Next.js application:

pnpm create next-app@latest multi-modal-chatbot
cd multi-modal-chatbot

Install dependencies:

pnpm add ai @ai-sdk/react

Configure your API key by creating a .env.local file in the project root:

touch .env.local

Then add your AI Gateway API key to it:

AI_GATEWAY_API_KEY=your_api_key_here

Implementation

Create the API Route

Create a route handler at app/api/chat/route.ts (the frontend below posts to /api/chat) that processes multi-modal messages:

import { streamText, convertToModelMessages, UIMessage } from 'ai';

export const maxDuration = 30;

export async function POST(req: Request) {
  const { messages }: { messages: UIMessage[] } = await req.json();

  const result = streamText({
    model: 'openai/gpt-4o',
    messages: await convertToModelMessages(messages),
  });

  return result.toUIMessageStreamResponse();
}

The convertToModelMessages function converts the UIMessage array sent by the client, including its image and PDF file parts, into the model message format that streamText expects.

File Upload Helper

Create a helper in lib/file-utils.ts that converts the selected files to data URLs:

export async function convertFilesToDataURLs(files: FileList) {
  return Promise.all(
    Array.from(files).map(
      file =>
        new Promise<{
          type: 'file';
          mediaType: string;
          url: string;
        }>((resolve, reject) => {
          const reader = new FileReader();
          reader.onload = () => {
            resolve({
              type: 'file',
              mediaType: file.type,
              url: reader.result as string,
            });
          };
          reader.onerror = reject;
          reader.readAsDataURL(file);
        }),
    ),
  );
}
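FileReader exists only in the browser. If you want a variant you can also unit-test in Node 18+ (where Blob and btoa are globals), the sketch below builds the same kind of data URL from Blob.arrayBuffer(). The name blobToDataURL is hypothetical, and the 'application/octet-stream' fallback for untyped blobs is an assumption, not a claim about how FileReader behaves. Because a browser File is a Blob, it accepts files from the input directly.

```typescript
// Hypothetical alternative to FileReader.readAsDataURL.
// Assumes a runtime with global Blob and btoa (modern browsers, Node 18+).
async function blobToDataURL(blob: Blob): Promise<string> {
  const bytes = new Uint8Array(await blob.arrayBuffer());

  // btoa expects a "binary string" with one character per byte.
  let binary = '';
  for (let i = 0; i < bytes.length; i++) {
    binary += String.fromCharCode(bytes[i]);
  }

  // Assumed fallback when the Blob carries no media type.
  const mediaType = blob.type || 'application/octet-stream';
  return `data:${mediaType};base64,${btoa(binary)}`;
}
```

It could stand in for the FileReader promise inside convertFilesToDataURLs, for example `{ type: 'file', mediaType: file.type, url: await blobToDataURL(file) }`.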

Chat Interface with File Upload

Build the frontend in app/page.tsx with support for uploading images and PDFs:

'use client';

import { useChat } from '@ai-sdk/react';
import { DefaultChatTransport } from 'ai';
import { useRef, useState } from 'react';
import Image from 'next/image';
import { convertFilesToDataURLs } from '@/lib/file-utils';

export default function Chat() {
  const [input, setInput] = useState('');
  const [files, setFiles] = useState<FileList | undefined>(undefined);
  const fileInputRef = useRef<HTMLInputElement>(null);

  const { messages, sendMessage } = useChat({
    transport: new DefaultChatTransport({
      api: '/api/chat',
    }),
  });

  return (
    <div className="flex flex-col w-full max-w-md py-24 mx-auto stretch">
      {messages.map(m => (
        <div key={m.id} className="whitespace-pre-wrap">
          {m.role === 'user' ? 'User: ' : 'AI: '}
          {m.parts.map((part, index) => {
            if (part.type === 'text') {
              return <span key={`${m.id}-text-${index}`}>{part.text}</span>;
            }
            if (part.type === 'file' && part.mediaType?.startsWith('image/')) {
              return (
                <Image
                  key={`${m.id}-image-${index}`}
                  src={part.url}
                  width={500}
                  height={500}
                  alt={`attachment-${index}`}
                />
              );
            }
            if (part.type === 'file' && part.mediaType === 'application/pdf') {
              return (
                <iframe
                  key={`${m.id}-pdf-${index}`}
                  src={part.url}
                  width={500}
                  height={600}
                  title={`pdf-${index}`}
                />
              );
            }
            return null;
          })}
        </div>
      ))}

      <form
        className="fixed bottom-0 w-full max-w-md p-2 mb-8 border border-gray-300 rounded shadow-xl space-y-2"
        onSubmit={async event => {
          event.preventDefault();

          const fileParts =
            files && files.length > 0
              ? await convertFilesToDataURLs(files)
              : [];

          sendMessage({
            role: 'user',
            parts: [{ type: 'text', text: input }, ...fileParts],
          });

          setInput('');
          setFiles(undefined);

          if (fileInputRef.current) {
            fileInputRef.current.value = '';
          }
        }}
      >
        <input
          type="file"
          accept="image/*,application/pdf"
          onChange={event => {
            if (event.target.files) {
              setFiles(event.target.files);
            }
          }}
          multiple
          ref={fileInputRef}
        />
        <input
          className="w-full p-2"
          value={input}
          placeholder="Say something..."
          onChange={e => setInput(e.target.value)}
        />
      </form>
    </div>
  );
}

Key Features

Message Parts Structure

Each message carries a parts array. The parts used in this guide have the following shapes (simplified from the SDK's own part types):

type MessagePart =
  | { type: 'text'; text: string }
  | { type: 'file'; mediaType: string; url: string };
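As an illustration of how these shapes combine, here is a sketch of assembling the parts for one user turn. buildUserParts is a hypothetical helper, and it deliberately differs from the form handler in this guide: it drops the text part when the input is blank, so a file-only message carries no empty text.

```typescript
// Local copy of the simplified part shapes above (not imported from 'ai').
type MessagePart =
  | { type: 'text'; text: string }
  | { type: 'file'; mediaType: string; url: string };

type FilePart = Extract<MessagePart, { type: 'file' }>;

// Hypothetical helper: text first (if any), then the attached files.
function buildUserParts(text: string, files: FilePart[]): MessagePart[] {
  const trimmed = text.trim();
  const textParts: MessagePart[] =
    trimmed.length > 0 ? [{ type: 'text', text: trimmed }] : [];
  return [...textParts, ...files];
}

// Example: a question plus one attached image (placeholder data URL).
const parts = buildUserParts('What is in this image?', [
  { type: 'file', mediaType: 'image/png', url: 'data:image/png;base64,iVBORw0KGgo=' },
]);
```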

File Processing

User selects files via input field
Files are converted to data URLs using FileReader API
Data URLs are sent as part of the message
Model processes the files alongside text
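A data URL produced by readAsDataURL has the form data:<mediaType>;base64,<data>. The sketch below splits one apart and computes the decoded byte count, which is handy for logging or for checking sizes on the server. Both function names are hypothetical, and the regex only handles base64 data URLs with an explicit media type.

```typescript
// Hypothetical helper: split "data:<mediaType>;base64,<data>" into its pieces.
// Returns null for anything else (plain URLs, non-base64 data URLs).
function parseDataUrl(url: string): { mediaType: string; base64: string } | null {
  const match = /^data:([^;,]+);base64,(.*)$/.exec(url);
  if (!match) return null;
  return { mediaType: match[1], base64: match[2] };
}

// Decoded size in bytes for well-formed, padded base64:
// every 4 characters encode 3 bytes, minus one byte per '=' of padding.
function decodedByteLength(base64: string): number {
  const padding = base64.endsWith('==') ? 2 : base64.endsWith('=') ? 1 : 0;
  return (base64.length / 4) * 3 - padding;
}
```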

Rendering Different Media Types

The interface chooses how to render each part from its type and mediaType:

Text: Displayed as plain text
Images (image/*): Rendered using the Next.js Image component
PDFs (application/pdf): Displayed in an iframe
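Pulling that decision out of the JSX into a pure function makes it easy to unit-test. This is a sketch: renderKind and RenderKind are hypothetical names, and the part type is the simplified one from Message Parts Structure rather than the SDK's own type.

```typescript
// Local copy of the simplified part shapes (not imported from 'ai').
type MessagePart =
  | { type: 'text'; text: string }
  | { type: 'file'; mediaType: string; url: string };

type RenderKind = 'text' | 'image' | 'pdf' | 'unsupported';

// Mirrors the branching in the Chat component: text as text, image/* files
// via <Image>, PDFs via <iframe>, and anything else is not rendered.
function renderKind(part: MessagePart): RenderKind {
  if (part.type === 'text') return 'text';
  if (part.mediaType.startsWith('image/')) return 'image';
  if (part.mediaType === 'application/pdf') return 'pdf';
  return 'unsupported';
}
```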

Running the Application

pnpm run dev

Visit http://localhost:3000 and try:

Upload an image and ask “What’s in this image?”
Upload a PDF and ask “Summarize this document”
Send a regular text message

Using Other Providers

Because the model is specified as a string routed through the AI Gateway, switching to another multi-modal provider only requires changing that string in the route handler. PDF support varies between models, so check that the model you choose accepts the file types you need:

const result = streamText({
  // Anthropic
  model: 'anthropic/claude-sonnet-4-20250514',
  // Google (alternative):
  // model: 'google/gemini-2.5-flash',
  messages: await convertToModelMessages(messages),
});

Best Practices

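File Size: Files are sent as base64 data URLs, which are roughly a third larger than the original bytes; check your provider's size limits and validate files before sending
Image Quality: Balance image quality against upload size and speed
Error Handling: Handle failures gracefully; the helper rejects if FileReader fails, and useChat exposes an error value for failed requests
Loading States: Show progress indicators while files are read and while the response streams; the status value from useChat ('submitted', 'streaming', 'ready', 'error') can drive them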
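A minimal client-side check before calling convertFilesToDataURLs might look like the sketch below. The 5 MB MAX_FILE_BYTES value is a placeholder, not a documented provider limit; set it from your provider's documentation. The accept attribute on the file input is only a hint to the file picker, so the type is re-checked here. FileLike, isAcceptedType, and validateFiles are hypothetical names; a FileList satisfies ArrayLike<FileLike>.

```typescript
// Minimal shape of a browser File that the check needs; File satisfies it.
interface FileLike {
  name: string;
  type: string;
  size: number;
}

// Placeholder limit (assumption): replace with your provider's documented limit.
const MAX_FILE_BYTES = 5 * 1024 * 1024;

// Same rule as accept="image/*,application/pdf" on the file input.
function isAcceptedType(type: string): boolean {
  return type.startsWith('image/') || type === 'application/pdf';
}

// Returns one human-readable error per rejected file; an empty array means all OK.
function validateFiles(files: ArrayLike<FileLike>, maxBytes = MAX_FILE_BYTES): string[] {
  const errors: string[] = [];
  for (const file of Array.from(files)) {
    if (!isAcceptedType(file.type)) {
      errors.push(`${file.name}: unsupported type "${file.type || 'unknown'}"`);
    } else if (file.size > maxBytes) {
      errors.push(`${file.name}: ${file.size} bytes exceeds the ${maxBytes}-byte limit`);
    }
  }
  return errors;
}
```

In the form's onSubmit handler, you could call validateFiles(files) first and return early (showing the messages) if the result is non-empty.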

Next Steps

Add file size validation
Implement drag-and-drop file upload
Add support for more file types
Implement file preview before sending
Add tools for more advanced interactions
