antigravity-claudekit/skills/ck-ai-multimodal/SKILL.md

---
name: ck-ai-multimodal
description: >
  Analyzes images, videos, PDFs, and documents using multimodal AI models.
  Activate when user says 'analyze this image', 'describe what you see', 'read this PDF',
  'extract text from screenshot', 'what is in this photo', or 'process this document'.
  Accepts image files, PDFs, video frames, and URLs to visual content.
---

## Overview
Orchestrates multimodal AI analysis on images, documents, and visual content. Extracts structured information, descriptions, OCR text, or domain-specific insights from non-text inputs.

## When to Use
- Analyzing uploaded images for content, objects, or scene description
- Extracting text or data from screenshots, PDFs, or scanned documents
- Comparing multiple images for differences or similarities
- Processing diagrams, charts, or UI mockups to generate code or descriptions
- Describing visual content for accessibility or documentation purposes
- Analyzing and summarizing video frames

## Don't Use When
- Input is plain text only (no visual component)
- User needs to generate new images (use ck-ai-artist)
- Task is simple file format conversion with no AI analysis needed
- Document is a machine-readable text PDF (use direct text extraction instead)

## Steps / Instructions

### 1. Identify Input Type and Goal
Determine:
- Input format: image (JPEG/PNG/WebP), PDF, video, screenshot
- Analysis goal: description, OCR, data extraction, comparison, code generation
- Output format: plain text, JSON, markdown table, code snippet
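
As a rough illustration of the input-format check, the sketch below classifies a path or URL by its extension using the standard library. The function name `classify_input` and the category labels are hypothetical, and extension-based detection is only a first guess; it does not inspect file contents.

```python
import mimetypes


def classify_input(path_or_url: str) -> str:
    """Return 'image', 'pdf', 'video', or 'unknown' based on the extension.

    Hypothetical helper: the category names are illustrative only.
    """
    mime, _ = mimetypes.guess_type(path_or_url)
    if mime is None:
        return 'unknown'
    if mime == 'application/pdf':
        return 'pdf'
    if mime.startswith('image/'):
        return 'image'
    if mime.startswith('video/'):
        return 'video'
    return 'unknown'
```

The result can then drive which preparation path in the next step applies.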

### 2. Prepare Input

**Images:**
- Ensure the file is accessible (local path or URL)
- For large images, consider resizing to reduce token cost while preserving detail

**PDFs:**
- Extract pages as images if needed

**Video:**
- Extract key frames at regular intervals or scene changes
- Process frames individually or as a batch
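
The resizing and frame-sampling advice above comes down to simple arithmetic, sketched here with the standard library only. `fit_within` and `frame_timestamps` are hypothetical helper names, and the 2000-pixel limit in the usage note is an arbitrary example, not a documented model limit. The actual resize or frame extraction would be done by an image or video tool of your choice.

```python
import math


def fit_within(width: int, height: int, max_side: int) -> tuple[int, int]:
    """Scale (width, height) down so the longer side is at most max_side.

    The aspect ratio is preserved; images already small enough are unchanged.
    """
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


def frame_timestamps(duration_s: float, interval_s: float) -> list[float]:
    """Return timestamps (in seconds) for frames sampled at a regular interval."""
    if interval_s <= 0:
        raise ValueError('interval_s must be positive')
    count = math.ceil(duration_s / interval_s)
    return [i * interval_s for i in range(count)]
```

For example, `fit_within(4000, 3000, 2000)` yields `(2000, 1500)`, and `frame_timestamps(10.0, 2.5)` yields `[0.0, 2.5, 5.0, 7.5]`.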

### 3. Craft Analysis Prompt

Be specific about what to extract:

```
# For structured extraction:
"Extract all text from this receipt image and return as JSON with fields:
merchant, date, items (array of {name, price}), total, tax."

# For description:
"Describe this UI screenshot in detail, including layout, colors,
components, and any text visible. Focus on structure for a developer."

# For comparison:
"Compare these two screenshots. List all visible differences
in UI layout, text, and styling."

# For diagram-to-code:
"This is a flowchart. Convert it to a Mermaid diagram."
```
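
If the same kind of extraction runs repeatedly, the prompt can be assembled from a field list instead of being written by hand each time. This is a minimal sketch; `build_extraction_prompt` is a hypothetical helper and the wording of the generated prompt is only one reasonable choice.

```python
def build_extraction_prompt(subject: str, fields: list[str]) -> str:
    """Build a structured-extraction prompt naming the exact JSON fields wanted."""
    if not fields:
        raise ValueError('at least one field is required')
    field_list = ', '.join(fields)
    return (
        f'Extract all text from this {subject} and return only JSON '
        f'with fields: {field_list}. '
        'Use null for any field you cannot read.'
    )


prompt = build_extraction_prompt('receipt image', ['merchant', 'date', 'total', 'tax'])
```

Naming the fields explicitly and asking for `null` on unreadable values makes the response easier to validate in the post-processing step.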

### 4. Call Multimodal Model

Using Google Gemini (via venv Python):
```python
# Use: ~/.claude/skills/.venv/bin/python3
import google.generativeai as genai
import os, pathlib

genai.configure(api_key=os.environ['GOOGLE_API_KEY'])
model = genai.GenerativeModel('gemini-1.5-pro')

# The SDK accepts raw bytes for inline data, so no base64 step is needed.
image_data = pathlib.Path('input.png').read_bytes()
image_part = {'mime_type': 'image/png', 'data': image_data}

response = model.generate_content([image_part, 'Describe this image in detail.'])
print(response.text)
```

Using OpenAI Vision:
```python
import openai, base64, os

client = openai.OpenAI(api_key=os.environ['OPENAI_API_KEY'])
with open('input.png', 'rb') as f:
    b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model='gpt-4o',
    messages=[{
        'role': 'user',
        'content': [
            {'type': 'image_url', 'image_url': {'url': f'data:image/png;base64,{b64}'}},
            {'type': 'text', 'text': 'Describe this image.'}
        ]
    }]
)
print(response.choices[0].message.content)
```
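
Both examples above hardcode `image/png`. For other formats, the MIME type can be derived from the filename, as in this sketch. `data_url` is a hypothetical helper; it relies on the extension being correct and does not inspect the bytes.

```python
import base64
import mimetypes


def data_url(data: bytes, filename: str) -> str:
    """Build a base64 data URL, guessing the MIME type from the filename."""
    mime, _ = mimetypes.guess_type(filename)
    if mime is None:
        raise ValueError(f'cannot determine MIME type for {filename!r}')
    encoded = base64.b64encode(data).decode('ascii')
    return f'data:{mime};base64,{encoded}'
```

The returned string can be passed as the `url` value in the OpenAI `image_url` content part shown above.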

### 5. Post-Process Output
- Parse JSON if structured extraction was requested
- Validate extracted data against expected schema
- For OCR results, clean whitespace and correct obvious errors
- For code generation from diagrams, run a syntax check on the output
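
The JSON parsing and schema check can be sketched as follows. It assumes the model may wrap its JSON in a markdown code fence, which is common but not guaranteed. `parse_model_json` is a hypothetical helper, and the check only confirms that required keys are present, not that their values are correct.

```python
import json
import re

FENCE = '`' * 3  # markdown code fence marker


def parse_model_json(text: str, required: set[str]) -> dict:
    """Parse a JSON object from model output and check required keys exist."""
    # Strip a surrounding markdown fence (with optional 'json' tag) if present.
    match = re.search(FENCE + r'(?:json)?\s*(.*?)' + FENCE, text, re.DOTALL)
    payload = match.group(1) if match else text
    data = json.loads(payload)  # raises json.JSONDecodeError on invalid JSON
    if not isinstance(data, dict):
        raise ValueError('expected a JSON object')
    missing = required - set(data)
    if missing:
        raise ValueError(f'missing fields: {sorted(missing)}')
    return data
```

A `ValueError` or `json.JSONDecodeError` here is a reasonable trigger for the retry described in the next step.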

### 6. Handle Errors
- If the model returns an incomplete extraction, retry with a more specific prompt
- For large PDFs, process in page chunks
- If image quality is poor, note the limitations in the output
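
The retry and chunking strategies above can be sketched without tying them to a particular model. `call_model` and `is_complete` are hypothetical callables you would supply (for example, a wrapper around one of the API calls in step 4 and a schema check), and the chunk size is an arbitrary choice.

```python
from typing import Callable, Optional, Sequence


def page_chunks(total_pages: int, chunk_size: int) -> list[range]:
    """Split 1-based page numbers into consecutive ranges of at most chunk_size."""
    if chunk_size <= 0:
        raise ValueError('chunk_size must be positive')
    return [
        range(start, min(start + chunk_size, total_pages + 1))
        for start in range(1, total_pages + 1, chunk_size)
    ]


def extract_with_retry(
    call_model: Callable[[str], str],
    prompts: Sequence[str],
    is_complete: Callable[[str], bool],
) -> Optional[str]:
    """Try prompts in order (general to specific) until one result is complete."""
    for prompt in prompts:
        result = call_model(prompt)
        if is_complete(result):
            return result
    return None  # caller should report that extraction failed
```

For a 10-page PDF with `chunk_size=4`, `page_chunks` returns ranges covering pages 1-4, 5-8, and 9-10.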

## Notes
- Never hardcode API keys; use environment variables
- Gemini 1.5 Pro has a larger context window than GPT-4o, which suits longer documents
- GPT-4o excels at UI/code understanding
- Always state confidence level when extracting critical data (e.g., financial figures)