Give Your AI Eyes: How to Add Image Recognition to Claude Code

Most text-based AI models are blind. They can reason about code, write essays, and debug complex logic, but show them an image and they have nothing to say. claude-vision-skill fixes that: it forwards images to a vision-capable model API and returns a text description to the conversation.

Here is what it does, how it works, and how to set it up.


What Is claude-vision-skill?

claude-vision-skill is an open-source tool that adds image recognition to AI models that do not have native vision support. It was originally built for models like DeepSeek, but it works with any AI assistant that can execute scripts.

The core idea is simple:

  1. A user sends an image in the conversation
  2. The script converts the image to base64
  3. It sends the encoded image to a vision model API using the OpenAI-compatible format
  4. The text description is returned to the conversation

No manual commands needed. Once configured, you drop an image into the chat and the AI processes it automatically.
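
Steps 2 and 3 above can be sketched roughly as follows. This is a minimal illustration, not the actual contents of vision.js: the function names `imageToDataUrl` and `buildVisionRequest`, the MIME table, and the default prompt are all assumptions made for the example. The request body follows the chat-completions shape that OpenAI-compatible vision endpoints accept.

```javascript
// Sketch only: the real vision.js may be structured differently.
const fs = require("node:fs");
const path = require("node:path");

// Illustrative subset of image types; extend as needed.
const MIME_TYPES = {
  ".png": "image/png",
  ".jpg": "image/jpeg",
  ".jpeg": "image/jpeg",
  ".gif": "image/gif",
  ".webp": "image/webp",
};

// Step 2: read the image from disk and encode it as a base64 data URL.
function imageToDataUrl(imagePath) {
  const ext = path.extname(imagePath).toLowerCase();
  const mime = MIME_TYPES[ext];
  if (!mime) {
    throw new Error(`Unsupported image type: ${ext || "(none)"}`);
  }
  const base64 = fs.readFileSync(imagePath).toString("base64");
  return `data:${mime};base64,${base64}`;
}

// Step 3: build a request body in the OpenAI-compatible chat format.
// The default prompt is a placeholder chosen for this example.
function buildVisionRequest(model, dataUrl, prompt = "Describe this image in detail.") {
  return {
    model,
    messages: [
      {
        role: "user",
        content: [
          { type: "text", text: prompt },
          { type: "image_url", image_url: { url: dataUrl } },
        ],
      },
    ],
  };
}
```

The resulting object would be serialized with JSON.stringify and sent as the POST body to the provider's chat-completions endpoint.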

Why this matters

Many developers use cost-effective AI models through proxies like ccswitch or CodingAPI. These models are strong on text tasks but often lack vision. claude-vision-skill lets you keep your preferred model for text while offloading image understanding to a specialized vision model.


How It Works Under the Hood

The project contains three key files:

File                  Purpose
vision.js             Core script. Handles image reading, base64 encoding, and API communication
CLAUDE.md             Project instructions telling the AI when and how to use vision.js
cyberboss-setup.md    Optional setup for the Cyberboss WeChat platform

The script uses the OpenAI-compatible API format, which means it works with any provider that follows the same spec. You can plug in Alibaba Cloud Bailian, OpenAI, or a self-hosted endpoint.


Choosing a Vision API Provider

Before installing, you need a vision-capable API. A few options:

Option 1: Alibaba Cloud Bailian

  • Models: qwen3.5-omni-plus or qwen-vl-max
  • Cost: New users get 1 million free tokens (roughly 0.02 yuan per request after that)
  • Cheapest option, good Chinese language support, easy signup

Option 2: OpenAI

  • Model: gpt-4o-mini
  • Cost: Standard OpenAI pricing
  • Best English-language image understanding, requires international payment

Option 3: Any OpenAI-compatible service

  • Set a custom BASE_URL and model name in vision.js
  • Works with local models, third-party proxies, or self-hosted endpoints

Installation

Method A: Automatic setup with Claude Code

This is the easiest path. Clone the repository and let Claude Code do the rest.

Step 1: Clone the repository

git clone https://github.com/asuojun/claude-vision-skill.git

Step 2: Ask Claude Code to configure it

Open Claude Code and paste:

Read the claude-vision-skill README and help me configure vision support.

Claude Code will prompt you for:

  • Your preferred vision service
  • Your API key
  • The model name

It handles the file placement and configuration automatically.

Method B: Manual setup

If you prefer full control, follow these steps.

Step 1: Copy vision.js to your project root

Place the vision.js file in the root directory of your project.

Step 2: Configure your API credentials

Open vision.js and replace the placeholders:

// Replace these values
const API_KEY = "sk-xxx";        // Your actual API key
const MODEL = "xxx";             // Model name, e.g., "qwen-vl-max"
const BASE_URL = "xxx";          // API endpoint (keep default for Qwen)

For Alibaba Cloud Bailian, the default BASE_URL already points to the correct endpoint. You only need to fill in API_KEY and MODEL.

For OpenAI, change BASE_URL to:

https://api.openai.com/v1

For other providers, use their OpenAI-compatible endpoint.
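
Before testing, it can help to confirm that none of the placeholders were left in place. The check below is a hypothetical helper, not part of the project: the name `findUnsetConfig` is invented for this example, and it assumes the placeholder strings are exactly those shown in the snippet above.

```javascript
// Hypothetical pre-flight check for the three values configured in vision.js.
// Returns a list of human-readable problems; an empty list means the values look usable.
function findUnsetConfig({ API_KEY, MODEL, BASE_URL }) {
  const problems = [];

  if (!API_KEY || API_KEY === "sk-xxx") {
    problems.push("API_KEY is missing or still a placeholder");
  }
  if (!MODEL || MODEL === "xxx") {
    problems.push("MODEL is missing or still a placeholder");
  }

  // BASE_URL must at least parse as an absolute URL.
  try {
    new URL(BASE_URL);
  } catch {
    problems.push("BASE_URL is not a valid URL");
  }

  return problems;
}
```

Calling `findUnsetConfig({ API_KEY, MODEL, BASE_URL })` and printing the result gives a quick sanity check. It only validates the shape of the values, not whether the key is accepted by the provider.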

Step 3: Copy CLAUDE.md to your project root

Place the CLAUDE.md file alongside vision.js. This file contains instructions that tell Claude Code when and how to invoke the vision script.

Step 4: Test it

Send an image in your Claude Code conversation. If configured correctly, the AI will automatically process the image and describe its contents.


Cyberboss / WeChat Integration

If you are running the Cyberboss platform (a WeChat-based AI assistant), there is an additional step:

  1. Complete the base setup above
  2. Follow the instructions in cyberboss-setup.md to modify the persona and src/core/app.js
  3. Restart Cyberboss

After this, sending images through WeChat will trigger automatic image recognition.


Troubleshooting

Image not being recognized

  • Verify vision.js is in the project root
  • Check that CLAUDE.md is also in the project root
  • Confirm your API key is valid and has remaining credits

API errors or timeouts

  • Test your API key directly with a curl request first
  • Ensure BASE_URL includes /v1 if your provider requires it
  • Check network connectivity to the vision API endpoint
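
The same kind of direct check can be done from Node instead of curl. The sketch below sends a tiny text-only request, which isolates key, URL, and network problems from anything image-related. It assumes Node 18 or later (for the built-in fetch and AbortSignal.timeout) and that the provider exposes the standard /chat/completions path under BASE_URL; `buildSmokeTest` and `runSmokeTest` are names made up for this example.

```javascript
// Build the URL and fetch options for a minimal text-only request.
function buildSmokeTest(baseUrl, apiKey, model) {
  return {
    // Strip trailing slashes so the path is joined exactly once.
    url: baseUrl.replace(/\/+$/, "") + "/chat/completions",
    options: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${apiKey}`,
      },
      body: JSON.stringify({
        model,
        messages: [{ role: "user", content: "ping" }],
        max_tokens: 5,
      }),
    },
  };
}

// Send the request with a timeout and return the HTTP status plus raw body text.
async function runSmokeTest(baseUrl, apiKey, model, timeoutMs = 15000) {
  const { url, options } = buildSmokeTest(baseUrl, apiKey, model);
  const response = await fetch(url, {
    ...options,
    signal: AbortSignal.timeout(timeoutMs),
  });
  return { status: response.status, body: await response.text() };
}
```

Running `runSmokeTest(BASE_URL, API_KEY, MODEL).then(console.log, console.error)` prints the status and body. A 401 typically points to the API key, a 404 often means BASE_URL is missing a path segment such as /v1, and a timeout error suggests a connectivity problem.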

Wrong or empty descriptions

  • Try a different vision model (e.g., switch from qwen-vl-max to qwen3.5-omni-plus)
  • Some models perform better on certain image types (documents vs. photos vs. screenshots)
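
When a description comes back empty, it helps to fail loudly rather than pass a blank string to the conversation. The helper below is a sketch under the assumption that the provider returns the standard OpenAI-compatible chat-completions response, with the text at choices[0].message.content. The name `extractDescription` is hypothetical and is not taken from vision.js.

```javascript
// Pull the description text out of an OpenAI-compatible chat-completions response.
// Throws if the model returned nothing usable.
function extractDescription(responseJson) {
  const content = responseJson?.choices?.[0]?.message?.content;

  // Defensive: also accept content delivered as an array of text parts,
  // in case a provider returns it that way.
  const text = Array.isArray(content)
    ? content.map((part) => part?.text ?? "").join("")
    : content ?? "";

  const trimmed = String(text).trim();
  if (!trimmed) {
    throw new Error("Vision model returned an empty description");
  }
  return trimmed;
}
```

Catching this error is a natural place to retry with a different model name.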

Final thoughts

claude-vision-skill fills a real gap in Claude Code setups that run on a text-only model. You get image recognition without switching models or paying for an expensive multimodal API for everyday tasks.

Setup takes about five minutes. After that, you no longer have to manually describe images to your AI.

Repository: https://github.com/asuojun/claude-vision-skill

Newman

Writer and builder at BePhil. Passionate about design systems, frontend engineering, and clear thinking.