Initial commit: antigravity-claudekit

skills/ck-ai-multimodal/SKILL.md (121 lines, new file)

---
name: ck-ai-multimodal
description: >
  Analyzes images, videos, PDFs, and documents using multimodal AI models.
  Activate when user says 'analyze this image', 'describe what you see', 'read this PDF',
  'extract text from screenshot', 'what is in this photo', or 'process this document'.
  Accepts image files, PDFs, video frames, and URLs to visual content.
---

## Overview

Orchestrates multimodal AI analysis on images, documents, and visual content. Extracts structured information, descriptions, OCR text, or domain-specific insights from non-text inputs.

## When to Use

- Analyzing uploaded images for content, objects, or scene description
- Extracting text or data from screenshots, PDFs, or scanned documents
- Comparing multiple images for differences or similarities
- Processing diagrams, charts, or UI mockups to generate code or descriptions
- Describing visual content for accessibility or documentation purposes
- Analyzing and summarizing video frames

## Don't Use When

- Input is plain text only (no visual component)
- User needs to generate new images (use ck-ai-artist)
- Task is a simple file-format conversion with no AI analysis needed
- Document is a machine-readable text PDF (use direct text extraction)

## Steps / Instructions

### 1. Identify Input Type and Goal

Determine:

- Input format: image (JPEG/PNG/WebP), PDF, video, screenshot
- Analysis goal: description, OCR, data extraction, comparison, code generation
- Output format: plain text, JSON, markdown table, code snippet

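The input-format check can be sketched with the standard-library `mimetypes` module. `classify_input` is a hypothetical helper name, not part of this skill's required API, and its coverage of extensions is minimal:

```python
# Hypothetical helper: map a local path or URL to the broad input
# categories this skill handles, using stdlib mimetypes.
import mimetypes


def classify_input(path: str) -> str:
    """Return 'image', 'pdf', 'video', or 'unknown' for a path or URL."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None:
        return 'unknown'
    if mime == 'application/pdf':
        return 'pdf'
    if mime.startswith('image/'):
        return 'image'
    if mime.startswith('video/'):
        return 'video'
    return 'unknown'
```

For example, `classify_input('scan.pdf')` returns `'pdf'` and `classify_input('shot.png')` returns `'image'`.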
### 2. Prepare Input

**Images:**

- Ensure the file is accessible (local path or URL)
- For large images, consider resizing to reduce token cost while preserving detail
- For PDFs, extract pages as images if needed

**Video:**

- Extract key frames at regular intervals or at scene changes
- Process frames individually or as a batch

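The video frame-extraction step above can be sketched with ffmpeg, assuming it is installed and on PATH. `ffmpeg_frame_cmd` and `extract_frames` are hypothetical helper names chosen for illustration:

```python
# Sketch of key-frame extraction via ffmpeg (assumed to be on PATH).
import pathlib
import subprocess


def ffmpeg_frame_cmd(video: str, out_dir: str, every_n_seconds: int = 5) -> list:
    """Build an ffmpeg command that writes one PNG frame every N seconds."""
    return [
        'ffmpeg', '-i', video,
        '-vf', f'fps=1/{every_n_seconds}',  # sample 1 frame per N seconds
        f'{out_dir}/frame_%04d.png',
    ]


def extract_frames(video: str, out_dir: str, every_n_seconds: int = 5) -> list:
    """Run ffmpeg and return the extracted frame paths, sorted."""
    pathlib.Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(ffmpeg_frame_cmd(video, out_dir, every_n_seconds), check=True)
    return sorted(pathlib.Path(out_dir).glob('frame_*.png'))
```

Sampling at fixed intervals is the simplest strategy; scene-change detection (e.g. ffmpeg's `select` filter) trades simplicity for fewer redundant frames.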
### 3. Craft Analysis Prompt

Be specific about what to extract:

```
# For structured extraction:
"Extract all text from this receipt image and return as JSON with fields:
merchant, date, items (array of {name, price}), total, tax."

# For description:
"Describe this UI screenshot in detail, including layout, colors,
components, and any text visible. Focus on structure for a developer."

# For comparison:
"Compare these two screenshots. List all visible differences
in UI layout, text, and styling."

# For diagram-to-code:
"This is a flowchart. Convert it to a Mermaid diagram."
```

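Structured-extraction prompts like the receipt example can also be assembled programmatically from a field spec. `build_extraction_prompt` is a hypothetical helper sketched here, not part of this skill:

```python
# Hypothetical prompt builder: render a JSON-extraction prompt from a
# {field_name: description} mapping. Field names are illustrative.
def build_extraction_prompt(subject: str, fields: dict) -> str:
    """Build an extraction prompt listing each required JSON field."""
    field_lines = '\n'.join(f'- {name}: {desc}' for name, desc in fields.items())
    return (
        f'Extract all text from this {subject} and return as JSON '
        f'with these fields:\n{field_lines}\n'
        'Return only the JSON object, with no surrounding prose.'
    )


prompt = build_extraction_prompt('receipt image', {
    'merchant': 'store name',
    'total': 'grand total as a number',
})
```

Asking for "only the JSON object" reduces, but does not eliminate, the chance of prose or code fences wrapping the reply, so post-processing should still tolerate them.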
### 4. Call Multimodal Model

Using Google Gemini (via venv Python):

```python
# Use: ~/.claude/skills/.venv/bin/python3
import os
import pathlib

import google.generativeai as genai

genai.configure(api_key=os.environ['GOOGLE_API_KEY'])
model = genai.GenerativeModel('gemini-1.5-pro')

# Pass raw bytes in the blob dict; the SDK handles encoding internally.
image_data = pathlib.Path('input.png').read_bytes()
image_part = {'mime_type': 'image/png', 'data': image_data}

response = model.generate_content([image_part, 'Describe this image in detail.'])
print(response.text)
```

Using OpenAI Vision:

```python
import base64
import os

import openai

client = openai.OpenAI(api_key=os.environ['OPENAI_API_KEY'])
with open('input.png', 'rb') as f:
    b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model='gpt-4o',
    messages=[{
        'role': 'user',
        'content': [
            {'type': 'image_url', 'image_url': {'url': f'data:image/png;base64,{b64}'}},
            {'type': 'text', 'text': 'Describe this image.'}
        ]
    }]
)
print(response.choices[0].message.content)
```

### 5. Post-Process Output

- Parse JSON if structured extraction was requested
- Validate extracted data against the expected schema
- For OCR results, clean whitespace and correct obvious errors
- For code generated from diagrams, run a syntax check

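The parse-and-validate steps above can be sketched as follows. `extract_json` and `validate_fields` are hypothetical helpers; the first also strips the markdown code fences that models often wrap JSON replies in:

```python
# Hypothetical post-processing helpers: strip an optional ```json fence,
# parse the payload, and report missing required keys.
import json
import re


def extract_json(raw: str) -> dict:
    """Parse a model reply that may wrap JSON in a markdown code fence."""
    match = re.search(r'```(?:json)?\s*(.*?)\s*```', raw, re.DOTALL)
    text = match.group(1) if match else raw
    return json.loads(text)


def validate_fields(data: dict, required: list) -> list:
    """Return the names of required fields missing from the parsed output."""
    return [f for f in required if f not in data]
```

A non-empty result from `validate_fields` is a natural trigger for the retry logic in step 6.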
### 6. Handle Errors

- If the model returns an incomplete extraction, retry with a more specific prompt
- For large PDFs, process in page chunks
- If image quality is poor, note the limitations in the output

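The retry-on-incomplete-extraction idea can be sketched as a generic loop. `analyze`, `is_complete`, and `refine` are hypothetical callables supplied by the caller, not part of this skill's API:

```python
# Sketch of a bounded retry loop: call the model, check completeness,
# and retry with a refined (more specific) prompt up to max_attempts.
def retry_extraction(analyze, is_complete, refine, prompt, max_attempts=3):
    """analyze: prompt -> parsed output; is_complete: output -> bool;
    refine: (prompt, output) -> more specific prompt."""
    result = analyze(prompt)
    for _ in range(max_attempts - 1):
        if is_complete(result):
            break
        prompt = refine(prompt, result)
        result = analyze(prompt)
    return result
```

Bounding the attempts matters: each retry costs tokens, and a prompt that fails twice usually needs a different strategy (e.g. better input quality) rather than a third rewording.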
## Notes

- Never hardcode API keys; use environment variables
- Gemini 1.5 Pro handles larger context windows and longer documents
- GPT-4o excels at UI and code understanding
- Always state a confidence level when extracting critical data (e.g., financial figures)