
Agent Visual Checker

VLM-based visual validation tool for automated GUI testing

Overview

Agent Visual Checker automates visual validation of GUI applications using Vision Language Models (VLMs). Instead of relying on human-in-the-loop review, a VLM agent acts as a senior testing engineer: it analyzes screen captures and returns validation results.

Key Features

  • High-Frequency Screen Capture: Captures screenshots at 30-60 Hz for detailed recording
  • Session-Based Recording: Stateful recording sessions that capture complete application workflows
  • Cross-Platform Support: Windows-first implementation with macOS and Linux support planned
  • MCP Tools: Model Context Protocol tools for seamless VLM integration
  • Window Management: Bring applications to foreground, minimize, enumerate windows
  • VLM Flexibility: Support for local VLMs (Ollama, vLLM) and API-based VLMs (OpenAI, Claude)
  • Web Dashboard: Real-time monitoring, session replay, and feedback visualization
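
The 30-60 Hz capture rate listed above implies some form of frame pacing. The following is a minimal sketch of one way to do it, not the project's implementation: `capture_frames`, its parameters, and the `grab` callable (assumed to return one encoded frame) are all hypothetical names introduced for illustration.

```python
import time
from typing import Callable, List


def capture_frames(
    grab: Callable[[], bytes],
    fps: float,
    frame_count: int,
    clock: Callable[[], float] = time.perf_counter,
    sleep: Callable[[float], None] = time.sleep,
) -> List[bytes]:
    """Call `grab` `frame_count` times, aiming for `fps` frames per second.

    `grab` is a hypothetical callable standing in for a platform capture call.
    `clock` and `sleep` are injectable so the pacing can be tested without waiting.
    """
    if fps <= 0:
        raise ValueError("fps must be positive")
    interval = 1.0 / fps
    start = clock()
    frames: List[bytes] = []
    for index in range(frame_count):
        # Schedule each frame against the start time so that small delays
        # in one iteration do not accumulate into drift over the session.
        deadline = start + index * interval
        remaining = deadline - clock()
        if remaining > 0:
            sleep(remaining)
        frames.append(grab())
    return frames
```

Scheduling against absolute deadlines rather than sleeping a fixed interval after each grab keeps the average rate close to the target even when individual captures are slow.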

Architecture

┌─────────────────────────────────────────────────────────────────────────────┐
│                        AGENT VISUAL CHECKER                                 │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │                     CROSS-PLATFORM ABSTRACTION                       │   │
│  │   ┌─────────────┐  ┌─────────────┐  ┌─────────────────────────┐   │   │
│  │   │ Win32/WinRT │  │    macOS    │  │   Linux (X11/Wayland)   │   │   │
│  │   └─────────────┘  └─────────────┘  └─────────────────────────┘   │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                      │                                      │
│  ┌───────────────────────────────────▼────────────────────────────────┐   │
│  │                      CAPTURE SERVICE LAYER                          │   │
│  │   ┌──────────────┐  ┌──────────────┐  ┌──────────────────────┐   │   │
│  │   │ High-Freq    │  │ Session      │  │ Compression          │   │   │
│  │   │ Screenshot   │  │ Manager      │  │ Encoder              │   │   │
│  │   └──────────────┘  └──────────────┘  └──────────────────────┘   │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                      │                                      │
│  ┌───────────────────────────────────▼────────────────────────────────┐   │
│  │                        MCP SERVER LAYER                             │   │
│  │   ┌──────────────┐  ┌──────────────┐  ┌──────────────────────┐   │   │
│  │   │ Screen Tools │  │ Window Tools │  │ Session Tools        │   │   │
│  │   └──────────────┘  └──────────────┘  └──────────────────────┘   │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                      │                                      │
│  ┌───────────────────────────────────▼────────────────────────────────┐   │
│  │                      VLM ADAPTER LAYER                              │   │
│  │   ┌──────────────┐  ┌──────────────┐  ┌──────────────────────┐   │   │
│  │   │ Local VLM    │  │ API VLM      │  │ Feedback             │   │   │
│  │   └──────────────┘  └──────────────┘  └──────────────────────┘   │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                      │                                      │
│  ┌───────────────────────────────────▼────────────────────────────────┐   │
│  │                        WEB UI LAYER                                 │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
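
To illustrate the cross-platform abstraction at the top of the diagram, here is a minimal sketch of how a capture backend might be selected at runtime. The class names, the `BACKENDS` registry, and `select_backend` are assumptions made for this example; only the Windows-first ordering comes from this README.

```python
import sys
from abc import ABC, abstractmethod
from typing import Dict, Type


class CaptureBackend(ABC):
    """Hypothetical interface that each platform implementation would satisfy."""

    @abstractmethod
    def grab(self) -> bytes:
        """Return one encoded frame."""


class WindowsBackend(CaptureBackend):
    """Placeholder for a Win32/WinRT implementation."""

    def grab(self) -> bytes:
        raise NotImplementedError("Win32/WinRT capture would go here")


# Keys are values of sys.platform. macOS ("darwin") and Linux ("linux")
# entries would be added once those ports exist.
BACKENDS: Dict[str, Type[CaptureBackend]] = {"win32": WindowsBackend}


def select_backend(platform: str = sys.platform) -> CaptureBackend:
    """Instantiate the backend registered for `platform`, or fail clearly."""
    try:
        backend_cls = BACKENDS[platform]
    except KeyError:
        raise RuntimeError(f"No capture backend for platform {platform!r} yet") from None
    return backend_cls()
```

Keeping the registry keyed by `sys.platform` means the layers below the abstraction never need to branch on the operating system themselves.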

Quick Start

Prerequisites

  • Python 3.10+
  • For Windows: Windows 10/11
  • For macOS: macOS 11+ (planned)
  • For Linux: X11 or Wayland (planned)
  • VLM endpoint (Ollama, vLLM, OpenAI, etc.)

Installation

pip install agent-visual-checker

Or install from source:

git clone https://git.nazimyildiz.com/NAMCHO/agent-visual-checker.git
cd agent-visual-checker
pip install -e ".[dev]"

Configuration

Edit configs/default.yaml:

capture:
  fps: 30
  quality: 85
  format: "png"

storage:
  base_path: "./sessions"
  retention_days: 7

vlm:
  provider: "ollama"
  endpoint: "http://localhost:11434"
  model: "llama3.2-vision"

mcp:
  host: "0.0.0.0"
  port: 8765

webui:
  host: "0.0.0.0"
  port: 8000
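
Once the YAML above has been parsed into a dictionary (with whichever YAML library the project uses), the values could be validated with typed objects. This is a sketch under assumptions: the class names, `parse_config`, and the validation bounds are hypothetical, and only the `capture` and `vlm` sections are modelled to keep it short.

```python
from dataclasses import dataclass, field
from typing import Any, Mapping


@dataclass(frozen=True)
class CaptureConfig:
    fps: int = 30
    quality: int = 85
    format: str = "png"

    def __post_init__(self) -> None:
        # Assumed bounds for illustration; the project may enforce others.
        if self.fps <= 0:
            raise ValueError("capture.fps must be positive")
        if not 0 <= self.quality <= 100:
            raise ValueError("capture.quality must be between 0 and 100")


@dataclass(frozen=True)
class VlmConfig:
    provider: str = "ollama"
    endpoint: str = "http://localhost:11434"
    model: str = "llama3.2-vision"


@dataclass(frozen=True)
class AppConfig:
    capture: CaptureConfig = field(default_factory=CaptureConfig)
    vlm: VlmConfig = field(default_factory=VlmConfig)


def parse_config(raw: Mapping[str, Any]) -> AppConfig:
    """Build an AppConfig from an already-parsed mapping.

    Missing sections fall back to the defaults shown in configs/default.yaml.
    Other sections (storage, mcp, webui) are ignored by this sketch.
    Unknown keys inside a modelled section raise TypeError.
    """
    return AppConfig(
        capture=CaptureConfig(**raw.get("capture", {})),
        vlm=VlmConfig(**raw.get("vlm", {})),
    )
```

For example, `parse_config({"capture": {"fps": 60}})` keeps every default except the frame rate.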

Running

# Start MCP server
python -m src.mcp.server

# Start Web UI (in another terminal)
python -m src.ui.main

MCP Tools

Screen Tools

  • screenshot - Capture a single screenshot

Window Tools

  • list_windows - List all visible windows
  • get_window_info - Get detailed window information
  • bring_to_front - Bring a window to the foreground
  • minimize_window - Minimize a window

Session Tools

  • start_recording_session - Start a new recording session
  • stop_recording_session - Stop an active recording session
  • list_sessions - List all recording sessions
  • get_session_info - Get session metadata
  • delete_session - Delete a session

Analysis Tools

  • analyze_screenshot - Analyze a single screenshot
  • analyze_session - Analyze a complete recording session
  • get_validation_feedback - Get validation feedback

Session Workflow

┌─────────────────────────────────────────────────────────────────────────────┐
│                         RECORDING SESSION FLOW                              │
│                                                                             │
│   Agent                      MCP Server                    Capture Service  │
│    │                            │                               │            │
│    │──start_recording_session──►                               │            │
│    │                            │──start_session───────────────►            │
│    │                            │                               │            │
│    │                            │◄──session_id──────────────────│            │
│    │◄──session_id───────────────│                               │            │
│    │                            │                               │            │
│    │  (do actions in app)      │                               │            │
│    │                            │◄─continuous capture @30-60Hz──│            │
│    │                            │                               │            │
│    │──stop_recording_session───►                               │            │
│    │                            │──stop_session────────────────►            │
│    │                            │                               │            │
│    │                            │◄──session_summary─────────────│            │
│    │◄──session_summary──────────│                               │            │
│    │                            │                               │            │
│    │──analyze_session──────────►                               │            │
│    │                            │──VLM analysis─────────────────►            │
│    │                            │                               │            │
│    │◄──validation_feedback──────│                               │            │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
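
The start and stop steps of the flow above can be sketched as a small in-memory state holder. This is an assumption-laden illustration, not the project's session manager: the `Session` and `SessionManager` classes, the `add_frame` method, and the summary fields are invented for the example, while the two tool names come from the MCP Tools list.

```python
import uuid
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Session:
    session_id: str
    frames: List[bytes] = field(default_factory=list)
    active: bool = True


class SessionManager:
    """Tracks recording sessions in memory (no persistence in this sketch)."""

    def __init__(self) -> None:
        self._sessions: Dict[str, Session] = {}

    def start_recording_session(self) -> str:
        """Create a session and return its id, as in the first arrow of the flow."""
        session_id = uuid.uuid4().hex
        self._sessions[session_id] = Session(session_id)
        return session_id

    def add_frame(self, session_id: str, frame: bytes) -> None:
        """Called by the capture service for every captured frame."""
        session = self._sessions[session_id]
        if not session.active:
            raise RuntimeError("session is already stopped")
        session.frames.append(frame)

    def stop_recording_session(self, session_id: str) -> Dict[str, Any]:
        """Mark the session finished and return a summary of what was recorded."""
        session = self._sessions[session_id]
        session.active = False
        return {"session_id": session_id, "frame_count": len(session.frames)}
```

Rejecting frames after a stop keeps the summary returned to the agent consistent with what a later analyze_session call would see.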

Development

Running Tests

# Unit tests
pytest tests/unit/

# Integration tests
pytest tests/integration/

# With coverage
pytest --cov=src tests/unit/

Code Quality

# Lint
ruff check src/

# Type check
mypy src/

# Format
ruff format src/

Project Structure

agent-visual-checker/
├── src/
│   ├── capture/              # Cross-platform screen capture
│   ├── session/              # Recording session management
│   ├── mcp/                  # MCP server and tools
│   ├── vlm/                  # VLM adapters
│   ├── feedback/             # Validation feedback engine
│   └── ui/                   # Web dashboard
├── tests/
│   ├── unit/
│   ├── integration/
│   └── manual/
├── configs/
├── docs/
├── README.md
├── AGENTS.md
└── pyproject.toml

Milestones

Milestone  Description                                      Status
M1         Project Foundation & Cross-Platform Abstraction  TODO
M2         Windows Capture Implementation                   TODO
M3         MCP Server with Core Tools                       TODO
M4         Session Management & Storage                     TODO
M5         VLM Adapter Layer                                TODO
M6         Web UI Dashboard                                 TODO
M7         Feedback Engine (Senior Tester)                  TODO
M8         macOS/Linux Capture Ports                        TODO
M9         Integration Testing & Polish                     TODO

License

MIT License

Contributing

Contributions are welcome! Please read AGENTS.md for development guidelines and GIT_WORKFLOW.md for commit conventions.
