Agent Visual Checker
VLM-based visual validation tool for automated GUI testing
Overview
Agent Visual Checker automates visual validation of GUI applications using Vision Language Models (VLMs). Instead of relying on a human in the loop, a VLM agent acts as a senior testing engineer: it analyzes screen captures and returns validation results.
Key Features
- High-Frequency Screen Capture: Supports 30-60Hz screenshot capture for detailed recording
- Session-Based Recording: Stateful recording sessions that capture complete application workflows
- Cross-Platform Support: Windows-first implementation with macOS and Linux support planned
- MCP Tools: Model Context Protocol tools for seamless VLM integration
- Window Management: Bring applications to foreground, minimize, enumerate windows
- VLM Flexibility: Support for local VLMs (Ollama, vLLM) and API-based VLMs (OpenAI, Claude)
- Web Dashboard: Real-time monitoring, session replay, and feedback visualization
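A 30-60Hz capture rate leaves a budget of roughly 33 ms down to 17 ms per frame. The following is a minimal sketch of a fixed-rate capture loop, not the project's actual implementation. The names `frame_interval`, `capture_frames`, and the `grab_frame` callable are hypothetical and stand in for whatever the capture layer really exposes.

```python
import time
from typing import Callable, List


def frame_interval(fps: int) -> float:
    """Return the time budget per frame, in seconds, for a capture rate."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1.0 / fps


def capture_frames(
    grab_frame: Callable[[], bytes], fps: int, count: int
) -> List[bytes]:
    """Capture `count` frames, pacing against absolute deadlines.

    `grab_frame` is a hypothetical stand-in for a platform screen grab.
    """
    interval = frame_interval(fps)
    frames: List[bytes] = []
    start = time.perf_counter()
    for i in range(count):
        frames.append(grab_frame())
        # Sleep until the next absolute deadline instead of a fixed amount,
        # so time spent inside grab_frame does not accumulate as drift.
        deadline = start + (i + 1) * interval
        remaining = deadline - time.perf_counter()
        if remaining > 0:
            time.sleep(remaining)
    return frames
```

Pacing against absolute deadlines keeps the average rate close to the target even when individual grabs take a variable amount of time. If a grab overruns its budget, the loop simply skips the sleep.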
Architecture
┌─────────────────────────────────────────────────────────────────────────────┐
│ AGENT VISUAL CHECKER │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ CROSS-PLATFORM ABSTRACTION │ │
│ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────────┐ │ │
│ │ │ Win32/WinRT│ │ macOS │ │ Linux (X11/Wayland) │ │ │
│ │ └─────────────┘ └─────────────┘ └─────────────────────────┘ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ┌───────────────────────────────────▼────────────────────────────────┐ │
│ │ CAPTURE SERVICE LAYER │ │
│ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────────────┐ │ │
│ │ │ High-Freq │ │ Session │ │ Compression │ │ │
│ │ │ Screenshot │ │ Manager │ │ Encoder │ │ │
│ │ └──────────────┘ └──────────────┘ └──────────────────────┘ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ┌───────────────────────────────────▼────────────────────────────────┐ │
│ │ MCP SERVER LAYER │ │
│ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────────────┐ │ │
│ │ │ Screen Tools │ │Window Tools │ │ Session Tools │ │ │
│ │ └──────────────┘ └──────────────┘ └──────────────────────┘ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ┌───────────────────────────────────▼────────────────────────────────┐ │
│ │ VLM ADAPTER LAYER │ │
│ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────────────┐ │ │
│ │ │ Local VLM │ │ API VLM │ │ Feedback │ │ │
│ │ └──────────────┘ └──────────────┘ └──────────────────────┘ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ┌───────────────────────────────────▼────────────────────────────────┐ │
│ │ WEB UI LAYER │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
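The layering in the diagram can be read as a chain of narrow interfaces, where each layer depends only on the one above it. The sketch below illustrates that idea under stated assumptions: `CaptureBackend`, `CaptureService`, `VlmAdapter`, and the fake implementations are hypothetical names chosen for illustration, not classes confirmed to exist in `src/`.

```python
from dataclasses import dataclass
from typing import Protocol


class CaptureBackend(Protocol):
    """Hypothetical platform abstraction: one implementation per OS."""

    def grab(self) -> bytes: ...


@dataclass
class FakeBackend:
    """Stand-in backend; a real one would call Win32, X11, and so on."""

    payload: bytes = b"frame"

    def grab(self) -> bytes:
        return self.payload


class CaptureService:
    """Capture service layer: depends only on the CaptureBackend interface."""

    def __init__(self, backend: CaptureBackend) -> None:
        self._backend = backend

    def screenshot(self) -> bytes:
        return self._backend.grab()


class VlmAdapter(Protocol):
    """Hypothetical VLM adapter interface (local or API-based)."""

    def analyze(self, image: bytes) -> str: ...


class EchoVlm:
    """Stand-in adapter that only reports the image size."""

    def analyze(self, image: bytes) -> str:
        return f"received {len(image)} bytes"


def validate_once(service: CaptureService, vlm: VlmAdapter) -> str:
    """Capture one frame and pass it down to the VLM adapter."""
    return vlm.analyze(service.screenshot())
```

With structural typing via `Protocol`, a Windows backend and a future macOS or Linux backend would only need to provide a matching `grab` method. The layers below would not need to change.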
Quick Start
Prerequisites
- Python 3.10+
- For Windows: Windows 10/11
- For macOS: macOS 11+ (planned)
- For Linux: X11 or Wayland (planned)
- VLM endpoint (Ollama, vLLM, OpenAI, etc.)
Installation
pip install agent-visual-checker
Or install from source:
git clone https://git.nazimyildiz.com/NAMCHO/agent-visual-checker.git
cd agent-visual-checker
pip install -e ".[dev]"
Configuration
Edit configs/default.yaml:
capture:
fps: 30
quality: 85
format: "png"
storage:
base_path: "./sessions"
retention_days: 7
vlm:
provider: "ollama"
endpoint: "http://localhost:11434"
model: "llama3.2-vision"
mcp:
host: "0.0.0.0"
port: 8765
webui:
host: "0.0.0.0"
port: 8000
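One way to guard against typos in this file is to validate each section after parsing. The sketch below checks the `capture` section using a plain dictionary of the shape a YAML parser (for example PyYAML's `yaml.safe_load`) would return. `CaptureConfig` and `load_capture_config` are hypothetical names, and the accepted ranges (1-60 for `fps`, 1-100 for `quality`) are assumptions, not documented limits.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass(frozen=True)
class CaptureConfig:
    """Hypothetical typed view of the `capture` section."""

    fps: int = 30
    quality: int = 85
    format: str = "png"

    def __post_init__(self) -> None:
        # Assumed bounds for illustration; the real limits may differ.
        if not 1 <= self.fps <= 60:
            raise ValueError(f"fps out of range: {self.fps}")
        if not 1 <= self.quality <= 100:
            raise ValueError(f"quality out of range: {self.quality}")


def load_capture_config(raw: Dict[str, Any]) -> CaptureConfig:
    """Build a CaptureConfig from an already-parsed config dictionary."""
    section = raw.get("capture", {})
    # Unknown keys raise TypeError here, which surfaces misspelled options.
    return CaptureConfig(**section)


# Parsed form of the `capture` section shown above.
parsed = {"capture": {"fps": 30, "quality": 85, "format": "png"}}
config = load_capture_config(parsed)
```

Freezing the dataclass keeps the configuration immutable once loaded, so a running capture session cannot have its frame rate changed by accident.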
Running
# Start MCP server
python -m src.mcp.server
# Start Web UI (in another terminal)
python -m src.ui.main
MCP Tools
Screen Tools
- screenshot - Capture a single screenshot
- list_windows - List all visible windows
- get_window_info - Get detailed window information
- bring_to_front - Bring a window to foreground
- minimize_window - Minimize a window
Session Tools
- start_recording_session - Start a new recording session
- stop_recording_session - Stop an active recording session
- list_sessions - List all recording sessions
- get_session_info - Get session metadata
- delete_session - Delete a session
Analysis Tools
- analyze_screenshot - Analyze a single screenshot
- analyze_session - Analyze a complete recording session
- get_validation_feedback - Get validation feedback
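MCP clients invoke tools through JSON-RPC 2.0 requests that use the `tools/call` method. The sketch below builds such a request for the tools listed above. `build_tool_call` is a hypothetical helper, and the `window_title` argument is an assumption, since the tools' parameter schemas are not documented here.

```python
import json
from itertools import count
from typing import Any, Dict, Optional

# Monotonic request ids, as JSON-RPC expects each request to carry an id.
_request_ids = count(1)


def build_tool_call(name: str, arguments: Optional[Dict[str, Any]] = None) -> str:
    """Serialize an MCP `tools/call` request for the named tool."""
    request = {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments or {}},
    }
    return json.dumps(request)


# `window_title` is a hypothetical argument name used only for illustration.
message = build_tool_call("bring_to_front", {"window_title": "Notepad"})
```

In practice an MCP client library would build and send these messages itself. The sketch only shows the shape of the payload that reaches the server.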
Session Workflow
┌─────────────────────────────────────────────────────────────────────────────┐
│ RECORDING SESSION FLOW │
│ │
│ Agent MCP Server Capture Service │
│ │ │ │ │
│ │──start_recording_session──► │ │
│ │ │──start_session───────────────► │
│ │ │ │ │
│ │ │◄──session_id──────────────────│ │
│ │◄──session_id───────────────│ │ │
│ │ │ │ │
│ │ (do actions in app) │ │ │
│ │ │◄─continuous capture @30-60Hz──│ │
│ │ │ │ │
│ │──stop_recording_session───► │ │
│ │ │──stop_session────────────────► │
│ │ │ │ │
│ │ │◄──session_summary─────────────│ │
│ │◄──session_summary──────────│ │ │
│ │ │ │ │
│ │──analyze_session──────────► │ │
│ │ │──VLM analysis─────────────────► │
│ │ │ │ │
│ │◄──validation_feedback──────│ │ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
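The flow above can be exercised end to end against an in-memory stand-in. The sketch below mirrors the three tool names from the diagram. `FakeSessionServer`, `record_frame`, and the fields of the returned dictionaries are assumptions made for illustration, not the project's real response format.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Session:
    """Hypothetical session record: an id, captured frames, and a state flag."""

    session_id: str
    frames: List[bytes] = field(default_factory=list)
    active: bool = True


class FakeSessionServer:
    """In-memory stand-in for the MCP server and capture service pair."""

    def __init__(self) -> None:
        self._sessions: Dict[str, Session] = {}

    def start_recording_session(self) -> str:
        session_id = uuid.uuid4().hex
        self._sessions[session_id] = Session(session_id)
        return session_id

    def record_frame(self, session_id: str, frame: bytes) -> None:
        """Simulate the continuous capture that runs while a session is active."""
        session = self._sessions[session_id]
        if not session.active:
            raise RuntimeError("session is not recording")
        session.frames.append(frame)

    def stop_recording_session(self, session_id: str) -> Dict[str, object]:
        session = self._sessions[session_id]
        session.active = False
        # Hypothetical summary shape.
        return {"session_id": session_id, "frame_count": len(session.frames)}

    def analyze_session(self, session_id: str) -> Dict[str, object]:
        session = self._sessions[session_id]
        if session.active:
            raise RuntimeError("stop the session before analysis")
        # A real implementation would send the frames to a VLM here.
        return {"session_id": session_id, "frames_analyzed": len(session.frames)}
```

A caller would start a session, perform its actions in the application while frames accumulate, stop the session to receive the summary, and then request analysis. This matches the order of the arrows in the diagram.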
Development
Running Tests
# Unit tests
pytest tests/unit/
# Integration tests
pytest tests/integration/
# With coverage
pytest --cov=src tests/unit/
Code Quality
# Lint
ruff check src/
# Type check
mypy src/
# Format
ruff format src/
Project Structure
agent-visual-checker/
├── src/
│ ├── capture/ # Cross-platform screen capture
│ ├── session/ # Recording session management
│ ├── mcp/ # MCP server and tools
│ ├── vlm/ # VLM adapters
│ ├── feedback/ # Validation feedback engine
│ └── ui/ # Web dashboard
├── tests/
│ ├── unit/
│ ├── integration/
│ └── manual/
├── configs/
├── docs/
├── README.md
├── AGENTS.md
└── pyproject.toml
Milestones
| Milestone | Description | Status |
|---|---|---|
| M1 | Project Foundation & Cross-Platform Abstraction | TODO |
| M2 | Windows Capture Implementation | TODO |
| M3 | MCP Server with Core Tools | TODO |
| M4 | Session Management & Storage | TODO |
| M5 | VLM Adapter Layer | TODO |
| M6 | Web UI Dashboard | TODO |
| M7 | Feedback Engine (Senior Tester) | TODO |
| M8 | macOS/Linux Capture Ports | TODO |
| M9 | Integration Testing & Polish | TODO |
License
MIT License
Contributing
Contributions are welcome! Please read AGENTS.md for development guidelines and GIT_WORKFLOW.md for commit conventions.