Multi-Modal Chatbot
Learn how to build a chatbot capable of understanding both images and PDFs using the AI SDK.
What is Multi-Modal?
Multi-modal refers to the ability of AI models to process and understand multiple types of input formats. In this guide, we’ll focus on:
- Images: Screenshots, photos, diagrams
- PDFs: Documents, reports, forms
- Text: Regular chat messages
Prerequisites
- Node.js 18+
- A Vercel AI Gateway API key
- Basic knowledge of Next.js and React
Setup
Create a new Next.js application.
Implementation
Create the API Route
Create a route handler that processes multi-modal messages.
The convertToModelMessages function automatically handles the conversion of images and PDFs from the UI format to the model’s expected format.
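The SDK's internals are not shown in this guide. Purely as an illustration of what going from a data URL to a media type plus a base64 payload involves, here is a hand-rolled sketch. The names `ParsedDataUrl` and `parseDataUrl` are hypothetical and are not part of the AI SDK.

```typescript
// Illustrative only: this is NOT the AI SDK's implementation.
interface ParsedDataUrl {
  mediaType: string; // e.g. "image/png" or "application/pdf"
  base64: string; // the payload after the comma
}

// Split a base64 data URL of the form data:<mediaType>;base64,<payload>.
// Returns null for anything that does not match that shape.
// Extra parameters such as ";charset=utf-8" are not handled in this sketch.
function parseDataUrl(url: string): ParsedDataUrl | null {
  const match = /^data:([^;,]+);base64,(.*)$/.exec(url);
  if (match === null) return null;
  return { mediaType: match[1], base64: match[2] };
}
```

In the application itself you would rely on convertToModelMessages rather than parsing data URLs by hand.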
File Upload Helper
Create a helper function to convert files to data URLs.
Chat Interface with File Upload
Build the frontend with support for uploading images and PDFs.
Key Features
Message Parts Structure
Messages use a parts array that can contain different types of content, such as text parts and file parts.
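As a rough sketch of that structure, the example below models a user message with one text part and one image part. The types are simplified local definitions, assumed to mirror the shape of the SDK's message parts. `ChatMessage`, `MessagePart`, and `getMessageText` are hypothetical names introduced here for illustration.

```typescript
// Simplified local types (an assumption, not imported from the AI SDK).
type TextPart = { type: 'text'; text: string };
type FilePart = { type: 'file'; mediaType: string; url: string; filename?: string };
type MessagePart = TextPart | FilePart;

interface ChatMessage {
  id: string;
  role: 'user' | 'assistant';
  parts: MessagePart[];
}

// A user message that carries a question plus an attached image.
const exampleMessage: ChatMessage = {
  id: 'msg-1',
  role: 'user',
  parts: [
    { type: 'text', text: 'What is in this image?' },
    {
      type: 'file',
      mediaType: 'image/png',
      url: 'data:image/png;base64,iVBORw0KGgo=',
      filename: 'screenshot.png',
    },
  ],
};

// Collect only the text of a message, skipping file parts.
function getMessageText(message: ChatMessage): string {
  return message.parts
    .filter((part): part is TextPart => part.type === 'text')
    .map((part) => part.text)
    .join('');
}
```

Keeping text and files as separate parts lets the interface decide how to render each one independently.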
File Processing
- User selects files via input field
- Files are converted to data URLs using FileReader API
- Data URLs are sent as part of the message
- Model processes the files alongside text
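The conversion step above can be sketched as follows. Note the swap: this version reads the bytes with `Blob.arrayBuffer()` and encodes them with `btoa` instead of using the FileReader API, so that it also runs outside the browser (for example in Node.js 18+). In the browser, `FileReader.readAsDataURL` produces an equivalent data URL. The function name `blobToDataUrl` is hypothetical.

```typescript
// Convert a Blob (or a File, which extends Blob) to a base64 data URL.
// Sketch only: the per-byte string concatenation is simple but slow for
// very large files.
async function blobToDataUrl(blob: Blob): Promise<string> {
  const bytes = new Uint8Array(await blob.arrayBuffer());
  let binary = '';
  for (let i = 0; i < bytes.length; i++) {
    binary += String.fromCharCode(bytes[i]);
  }
  // Fall back to a generic type when the Blob does not report one.
  const mediaType = blob.type || 'application/octet-stream';
  return `data:${mediaType};base64,${btoa(binary)}`;
}
```

For example, `blobToDataUrl(new Blob(['hi'], { type: 'text/plain' }))` resolves to `data:text/plain;base64,aGk=`. The resulting string can then be sent as the URL of a file part.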
Rendering Different Media Types
The interface renders different parts appropriately:
- Text: Displayed as plain text
- Images: Rendered using Next.js Image component
- PDFs: Displayed in an iframe
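One way to express this dispatch, independent of any React code, is a small function that maps a part to the kind of element it should be rendered as. This is a minimal sketch: the `Part` type is a simplified local assumption, and `RenderKind` and `renderKindFor` are hypothetical names.

```typescript
// Simplified local part type (an assumption, not imported from the AI SDK).
type Part =
  | { type: 'text'; text: string }
  | { type: 'file'; mediaType: string; url: string };

type RenderKind = 'text' | 'image' | 'pdf' | 'unsupported';

// Decide how a part should be displayed, based on its type and media type.
function renderKindFor(part: Part): RenderKind {
  if (part.type === 'text') return 'text';
  if (part.mediaType.startsWith('image/')) return 'image';
  if (part.mediaType === 'application/pdf') return 'pdf';
  return 'unsupported';
}
```

A component could switch on the returned value to choose between plain text, an image element, and an iframe, and show a fallback for anything else.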
Running the Application
Start the development server, then open http://localhost:3000 and try:
- Upload an image and ask “What’s in this image?”
- Upload a PDF and ask “Summarize this document”
- Send a regular text message
Using Other Providers
The AI SDK supports multiple providers with multi-modal capabilities.
Best Practices
- File Size: Be mindful of file size limits for different providers
- Image Quality: Balance image quality with upload speed
- Error Handling: Handle file upload errors gracefully
- Loading States: Show progress indicators during file processing
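The file size and error handling points can be combined into a small check that runs before a file is converted and sent. This is a sketch under stated assumptions: the 5 MiB cap is a made-up placeholder rather than any provider's documented limit, and `FileInfo`, `ValidationResult`, and `validateUpload` are hypothetical names. A browser `File` object has the `name`, `size`, and `type` fields used here.

```typescript
// Hypothetical cap; replace with the limit documented by your provider.
const MAX_FILE_BYTES = 5 * 1024 * 1024;

// The subset of File fields that this check needs.
interface FileInfo {
  name: string;
  size: number; // size in bytes
  type: string; // media type, e.g. "image/png"
}

type ValidationResult = { ok: true } | { ok: false; reason: string };

// Accept only images and PDFs that fit within the size cap.
function validateUpload(file: FileInfo, maxBytes: number = MAX_FILE_BYTES): ValidationResult {
  const supported = file.type.startsWith('image/') || file.type === 'application/pdf';
  if (!supported) {
    return { ok: false, reason: `${file.name}: unsupported type ${file.type || 'unknown'}` };
  }
  if (file.size > maxBytes) {
    return { ok: false, reason: `${file.name}: larger than ${maxBytes} bytes` };
  }
  return { ok: true };
}
```

Returning a reason string instead of throwing makes it straightforward to show a message next to the file input.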
Next Steps
- Add file size validation
- Implement drag-and-drop file upload
- Add support for more file types
- Implement file preview before sending
- Add tools for more advanced interactions