---
name: ck-ai-multimodal
description: >
  Analyzes images, videos, PDFs, and documents using multimodal AI models.
  Activate when user says 'analyze this image', 'describe what you see', 'read this PDF',
  'extract text from screenshot', 'what is in this photo', or 'process this document'.
  Accepts image files, PDFs, video frames, and URLs to visual content.
---

## Overview
Orchestrates multimodal AI analysis on images, documents, and visual content. Extracts structured information, descriptions, OCR text, or domain-specific insights from non-text inputs.

## When to Use
- Analyzing uploaded images for content, objects, or scene description
- Extracting text or data from screenshots, PDFs, or scanned documents
- Comparing multiple images for differences or similarities
- Processing diagrams, charts, or UI mockups to generate code or descriptions
- Describing visual content for accessibility or documentation purposes
- Video frame analysis and summarization

## Don't Use When
- Input is plain text only (no visual component)
- User needs to generate new images (use ck-ai-artist)
- Task is a simple file format conversion with no AI analysis needed
- Document is a machine-readable text PDF (use direct text extraction)

## Steps / Instructions

### 1. Identify Input Type and Goal
Determine:
- Input format: image (JPEG/PNG/WebP), PDF, video, screenshot
- Analysis goal: description, OCR, data extraction, comparison, code generation
- Output format: plain text, JSON, markdown table, code snippet
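
Input-format detection can be sketched with the standard library alone. The helper below is a hypothetical illustration, not part of this skill: `classify_input` and its category names are assumptions, and it relies only on the file extension, so a mislabeled file will be misclassified.

```python
import mimetypes
from urllib.parse import urlparse


def classify_input(source: str) -> str:
    """Return 'image', 'pdf', 'video', or 'unknown' for a local path or URL.

    Hypothetical helper: classification is based on the file extension only.
    """
    # For URLs, look only at the path component so query strings are ignored.
    path = urlparse(source).path if "://" in source else source
    mime, _ = mimetypes.guess_type(path)
    if mime is None:
        return "unknown"
    if mime == "application/pdf":
        return "pdf"
    if mime.startswith("image/"):
        return "image"
    if mime.startswith("video/"):
        return "video"
    return "unknown"
```

A caller might route `classify_input("scan.pdf")` to the PDF preparation path and treat `"unknown"` as a signal to ask the user what the file contains.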

### 2. Prepare Input

**Images:**
- Ensure file is accessible (local path or URL)
- For large images, consider resizing to reduce token cost while preserving detail

**PDFs:**
- Extract pages as images if needed

**Video:**
- Extract key frames at regular intervals or at scene changes
- Process frames individually or as a batch
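
The resizing and regular-interval sampling steps reduce to simple arithmetic, sketched below. Both function names and the `max_side=1024` default are assumptions chosen for illustration. The actual resize and frame extraction would be done by an imaging or video tool of your choice, and scene-change detection is not covered here.

```python
import math


def fit_within(width: int, height: int, max_side: int = 1024) -> tuple[int, int]:
    """Scale (width, height) down so the longer side is at most max_side.

    The 1024 default is an arbitrary illustrative value, not a model limit.
    """
    longest = max(width, height)
    if longest <= max_side:
        return width, height  # already small enough; never upscale
    scale = max_side / longest
    # Keep the aspect ratio and never let a side collapse to zero pixels.
    return max(1, round(width * scale)), max(1, round(height * scale))


def sample_timestamps(duration_s: float, interval_s: float) -> list[float]:
    """Return frame timestamps in seconds at regular intervals, starting at 0.

    The end of the video itself is excluded, since no frame exists there.
    """
    if duration_s < 0 or interval_s <= 0:
        raise ValueError("duration must be >= 0 and interval must be > 0")
    count = math.ceil(duration_s / interval_s)
    return [i * interval_s for i in range(count)]
```

For example, a 4000x3000 photo with `max_side=1000` would be resized to 1000x750, and a 10-second clip sampled every 4 seconds yields frames at 0, 4, and 8 seconds.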

### 3. Craft Analysis Prompt

Be specific about what to extract:

```
# For structured extraction:
"Extract all text from this receipt image and return as JSON with fields:
merchant, date, items (array of {name, price}), total, tax."

# For description:
"Describe this UI screenshot in detail, including layout, colors,
components, and any text visible. Focus on structure for a developer."

# For comparison:
"Compare these two screenshots. List all visible differences
in UI layout, text, and styling."

# For diagram-to-code:
"This is a flowchart. Convert it to a Mermaid diagram."
```
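
When the same kind of structured extraction is run repeatedly, the prompt can be assembled from a field list so every field is named explicitly. This is a minimal sketch; `build_extraction_prompt` and the wording of its template are assumptions, not a prescribed format.

```python
def build_extraction_prompt(document_kind: str, fields: dict[str, str]) -> str:
    """Build a structured-extraction prompt that names every JSON field.

    Hypothetical helper: `fields` maps a field name to a short description.
    """
    field_lines = "\n".join(
        f"- {name}: {description}" for name, description in fields.items()
    )
    return (
        f"Extract all text from this {document_kind} image and return only JSON "
        f"with these fields:\n{field_lines}\n"
        "Use null for any field that is not visible."
    )


receipt_prompt = build_extraction_prompt(
    "receipt",
    {
        "merchant": "store name",
        "date": "purchase date",
        "items": "array of {name, price}",
        "total": "final amount charged",
        "tax": "tax amount",
    },
)
```

Asking for `null` on missing fields is one way to discourage the model from guessing values it cannot see.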

### 4. Call Multimodal Model

Using Google Gemini (via venv Python):
```python
# Use: ~/.claude/skills/.venv/bin/python3
import os
import pathlib

import google.generativeai as genai

# Read the API key from the environment rather than hardcoding it.
genai.configure(api_key=os.environ['GOOGLE_API_KEY'])
model = genai.GenerativeModel('gemini-1.5-pro')

# Inline image parts take the raw file bytes plus a MIME type.
image_data = pathlib.Path('input.png').read_bytes()
image_part = {'mime_type': 'image/png', 'data': image_data}

# Send the image and the analysis prompt together in one request.
response = model.generate_content([image_part, 'Describe this image in detail.'])
print(response.text)
```

Using OpenAI Vision:
```python
import base64
import os

import openai

client = openai.OpenAI(api_key=os.environ['OPENAI_API_KEY'])

# Encode the image as base64 so it can be embedded in a data URL.
with open('input.png', 'rb') as f:
    b64 = base64.b64encode(f.read()).decode()

# Send the image and the text prompt as parts of a single user message.
response = client.chat.completions.create(
    model='gpt-4o',
    messages=[{
        'role': 'user',
        'content': [
            {'type': 'image_url', 'image_url': {'url': f'data:image/png;base64,{b64}'}},
            {'type': 'text', 'text': 'Describe this image.'}
        ]
    }]
)
print(response.choices[0].message.content)
```

### 5. Post-Process Output
- Parse JSON if structured extraction was requested
- Validate extracted data against expected schema
- For OCR results, clean whitespace and correct obvious errors
- For code generation from diagrams, run syntax check
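
The parse-and-validate steps can be sketched as follows, assuming the model may wrap its JSON in a Markdown code fence. `parse_model_json` and `missing_fields` are hypothetical helpers, and the check covers only the presence of top-level keys, not full schema validation.

```python
import json

FENCE = "`" * 3  # a Markdown code fence marker


def parse_model_json(raw: str) -> dict:
    """Parse a JSON object from model output, tolerating a surrounding code fence."""
    text = raw.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line (with or without a language tag) ...
        text = text.split("\n", 1)[1] if "\n" in text else ""
        # ... and everything from the closing fence onward.
        text = text.rsplit(FENCE, 1)[0]
    data = json.loads(text)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    return data


def missing_fields(data: dict, required: list[str]) -> list[str]:
    """Return the required top-level keys that are absent from data."""
    return [name for name in required if name not in data]
```

A non-empty result from `missing_fields` is a reasonable trigger for the retry described in the next step.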

### 6. Handle Errors
- If model returns incomplete extraction, retry with more specific prompt
- For large PDFs, process in page chunks
- If image quality is poor, note limitations in output
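
A minimal sketch of the retry and chunking ideas, under stated assumptions: `call_model` and `is_complete` are caller-supplied callables standing in for whichever model call and completeness check you use, and both function names are hypothetical.

```python
from typing import Callable, Optional


def chunk_pages(total_pages: int, chunk_size: int) -> list[range]:
    """Split 1-based page numbers into consecutive chunks of at most chunk_size."""
    if total_pages < 0 or chunk_size <= 0:
        raise ValueError("total_pages must be >= 0 and chunk_size must be > 0")
    return [
        range(start, min(start + chunk_size, total_pages + 1))
        for start in range(1, total_pages + 1, chunk_size)
    ]


def extract_with_retry(
    call_model: Callable[[str], str],
    prompts: list[str],
    is_complete: Callable[[str], bool],
) -> Optional[str]:
    """Try prompts in order, general first and more specific later.

    Returns the first result that is_complete accepts, or None if none does.
    """
    for prompt in prompts:
        result = call_model(prompt)
        if is_complete(result):
            return result
    return None
```

For instance, `chunk_pages(10, 4)` covers pages 1-4, 5-8, and 9-10, so each chunk can be sent as a separate request.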

## Notes
- Never hardcode API keys; use environment variables
- Gemini 1.5 Pro supports a larger context window, which suits longer documents
- GPT-4o excels at UI/code understanding
- Always state a confidence level when extracting critical data (e.g., financial figures)