Skip to content

AgenticART Dojo Framework

Transform AgenticART from a tool into a training ground for security LLMs

Status: V2 architecture fully implemented with Praxis Loop, RAG system, and MCP integration.



Philosophy

A dojo is not just a place where techniques are practicedβ€”it's a system that: - Provides structured progression (belt levels) - Offers immediate feedback (sensei corrections) - Captures successful patterns (kata) - Measures mastery (grading) - Enables continuous improvement (deliberate practice)

AgenticART already captures exploitation trajectories. The Dojo Framework formalizes this into a self-improving training loop.


Architecture Overview

+-------------------------------------------------------------------------+
|                          DOJO FRAMEWORK                                 |
+-------------------------------------------------------------------------+
|                                                                         |
|  +------------------+    +------------------+    +------------------+   |
|  |    CURRICULUM    |    |     SPARRING     |    |     GRADING      |   |
|  |   (Challenge     |--->|   (Execution     |--->|    (Quality      |   |
|  |    Progression)  |    |    Against AVD)  |    |    Assessment)   |   |
|  +------------------+    +------------------+    +------------------+   |
|          |                       |                       |              |
|          |                       v                       |              |
|          |              +------------------+             |              |
|          |              |      SENSEI      |             |              |
|          |              |   (Feedback &    |<------------+              |
|          |              |    Correction)   |                            |
|          |              +------------------+                            |
|          |                       |                                      |
|          v                       v                                      |
|  +------------------------------------------------------------------+  |
|  |                    TRAINING DATA PIPELINE                         |  |
|  |  +--------------+  +--------------+  +--------------+             |  |
|  |  |   Positive   |  |   Negative   |  |    Error     |             |  |
|  |  |   Examples   |  |   Examples   |  |   Recovery   |             |  |
|  |  +--------------+  +--------------+  +--------------+             |  |
|  +------------------------------------------------------------------+  |
|                              |                                          |
|                              v                                          |
|                    +------------------+                                 |
|                    |   FINE-TUNING    |                                 |
|                    |    (LoRA/MLX)    |                                 |
|                    +------------------+                                 |
|                                                                         |
+-------------------------------------------------------------------------+

Directory Structure

dojo/
|-- __init__.py              # Package exports
|-- config.py                # DojoConfig settings
|-- models.py                # Core data models (Belt, Grade, Challenge, etc.)
|-- exceptions.py            # Custom exceptions
|
|-- curriculum/              # Challenge system
|   |-- __init__.py
|   |-- challenger.py        # Orchestrates attempts with feedback loop
|   |-- loader.py            # Loads challenges from YAML
|   |-- executor.py          # Executes commands against device
|   |-- context_injector.py  # Injects error context for retries
|   |-- error_extractor.py   # Extracts actionable error information
|   |
|   |-- white_belt/          # Fundamentals
|   |   +-- challenges.yaml
|   |-- yellow_belt/         # Reconnaissance
|   |   +-- challenges.yaml
|   +-- orange_belt/         # Vulnerability mapping
|       +-- challenges.yaml
|
|-- sensei/                  # Grading and training data
|   |-- __init__.py
|   |-- sensei.py            # Main orchestrator
|   |-- grader.py            # Evaluates challenge sessions
|   |-- exporter.py          # Exports to Alpaca/ShareGPT/DPO formats
|   |-- progress_tracker.py  # Tracks model progress across sessions
|   +-- training_extractor.py # Extracts training examples from sessions
|
|-- finetune/                # Model training utilities
|   |-- __init__.py
|   |-- config.py            # FinetuneConfig
|   +-- packager.py          # Packages data for GPU training
|
|-- test_end_to_end.py       # Integration test
|-- test_phase2.py           # Curriculum tests
+-- test_phase3.py           # Sensei tests

V2 Praxis Loop

The V2 architecture introduces the Praxis Loopβ€”a reasoning β†’ verification β†’ calibration cycle:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                          PRAXIS LOOP                                 β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                                                      β”‚
β”‚    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                                                   β”‚
β”‚    β”‚  Challenge  β”‚                                                   β”‚
β”‚    β”‚   Input     β”‚                                                   β”‚
β”‚    β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜                                                   β”‚
β”‚           β”‚                                                          β”‚
β”‚           β–Ό                                                          β”‚
β”‚    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”      β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                             β”‚
β”‚    β”‚  RAG System │─────▢│ Augmented   β”‚                             β”‚
β”‚    β”‚  (Context)  β”‚      β”‚   Prompt    β”‚                             β”‚
β”‚    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜      β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜                             β”‚
β”‚                                β”‚                                     β”‚
β”‚                                β–Ό                                     β”‚
β”‚    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”‚
β”‚    β”‚                    REASONING PHASE                        β”‚     β”‚
β”‚    β”‚  OBSERVE β†’ HYPOTHESIZE β†’ TEST β†’ CALIBRATE β†’ CORRECT      β”‚     β”‚
β”‚    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β”‚
β”‚                                β”‚                                     β”‚
β”‚                                β–Ό                                     β”‚
β”‚    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”      β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”      β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”       β”‚
β”‚    β”‚   MCP       │─────▢│   Tool      │─────▢│ Calibration β”‚       β”‚
β”‚    β”‚  Executor   β”‚      β”‚  Results    β”‚      β”‚   Signal    β”‚       β”‚
β”‚    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜      β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜      β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜       β”‚
β”‚                                                      β”‚               β”‚
β”‚                                                      β–Ό               β”‚
β”‚                                               β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”       β”‚
β”‚                                               β”‚  DPO Pairs  β”‚       β”‚
β”‚                                               β”‚ (Training)  β”‚       β”‚
β”‚                                               β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜       β”‚
β”‚                                                                      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

7 Pillars (V2 Curriculum)

Challenges are organized by cognitive skill rather than just difficulty:

Pillar Focus Challenge Types
static_analysis Code review, artifact identification OBSERVATION, HYPOTHESIS
negative_knowledge Recognizing secure patterns NEGATIVE
root_cause Understanding WHY vulnerabilities exist ROOT_CAUSE
pattern_transfer Applying patterns across contexts TRANSFER
methodology Tool selection, test design VERIFICATION
taxonomy CWE/OWASP classification All types with classification focus
patch_analysis Analyzing security fixes OBSERVATION, ROOT_CAUSE

PraxisRunner

The main execution engine for V2 challenges:

from dojo.graders.praxis_runner import PraxisRunner
from dojo.mcp import MCPExecutor
from dojo.rag import RAGSystem

# Initialize components
executor = MCPExecutor()
rag = RAGSystem(persist_dir=Path(".rag_data"))

runner = PraxisRunner(
    llm_client=client,
    mcp_executor=executor,
    enable_rag=True,
    rag_system=rag,
)

# Run challenge
result = runner.run_challenge(challenge)
print(f"Score: {result.total_score}")
print(f"Calibration Error: {result.calibration_error}")

1. Belt Progression System

Belt Levels

Belt Name Challenge Type Success Criteria
⬜ White Observation Systematic artifact identification 100% key artifact discovery
🟨 Yellow Hypothesis Testable security hypotheses Accurate CWE mapping & Test Design
🟧 Orange Verification Execution against live targets Successful exploit verification

Current Implementation: White belt is 100% operational with pre-built targets. Yellow belt is in the "Connector Phase" with physical targets currently being integrated.

Challenge YAML Format

# dojo/curriculum/white_belt/challenges.yaml
challenges:
  - id: white_001
    name: "Device Android Version"
    description: |
      Write an ADB shell command that outputs the Android version.
      The output should be just the version number (e.g., "11").
    belt: white
    difficulty: 1
    script_type: adb

    inputs:
      device_id: "emulator-5554"
      device_context:
        connection: "adb"
        task: "retrieve Android version"

    validation:
      type: regex_match
      pattern: "^\\d+(\\.\\d+)*"

    hints:
      - "Use 'adb shell getprop' to read system properties"
      - "Android version is stored in ro.build.version.release"

    kata_solution: "shell getprop ro.build.version.release"

    tags:
      - fundamentals
      - device-info

2. Core Components

ChallengeLoader

Loads and validates challenges from YAML files.

from dojo.curriculum import ChallengeLoader

loader = ChallengeLoader()
challenge = loader.load("white_001")
all_white = loader.load_belt(Belt.WHITE)

Challenger

Orchestrates challenge attempts with the feedback loop.

from dojo.curriculum import Challenger, ChallengeSession

challenger = Challenger(
    loader=loader,
    executor=executor,
    llm_client=llm_client,
    max_attempts=3,
)

session: ChallengeSession = challenger.run_challenge("white_001")
print(f"Success: {session.final_success}")
print(f"Attempts: {len(session.attempts)}")

Executor

Executes commands against the Android device with tier tracking.

from dojo.curriculum import Executor, ExecutionResult

executor = Executor(adb_path="/usr/bin/adb", device_id="emulator-5554")
result: ExecutionResult = executor.execute("shell getprop ro.build.version.release")

print(f"Success: {result.success}")
print(f"Output: {result.stdout}")
print(f"Tier: {result.tier_used}")  # 1=ADB, 2=ON_DEVICE

ErrorExtractor & ContextInjector

Extracts error information and injects it into retry prompts.

from dojo.curriculum import ErrorExtractor, ContextInjector

extractor = ErrorExtractor()
error_context = extractor.extract(result)

injector = ContextInjector()
retry_prompt = injector.inject(original_prompt, error_context)

3. Sensei Module (Grading & Training Data)

Sensei

The main orchestrator that connects grading, extraction, and export.

from dojo.sensei import Sensei

sensei = Sensei(output_dir=Path("./dojo_output"))

# Evaluate a single session
assessment, examples = sensei.evaluate_session(session, model_id="qwen-v1")

# Evaluate multiple sessions and run full cycle
result = sensei.run_training_cycle(
    sessions=sessions,
    model_id="qwen-v1",
    export_formats=[ExportFormat.ALPACA, ExportFormat.DPO],
)
print(result.summary())

Grader (V2)

The V2 Reasoning Grader goes beyond simple pass/fail checks. It evaluates the quality of thought using multiple metrics:

Metric Description
Epistemic Calibration Does the model's confidence match its accuracy? (Brier Score, ECE)
Reasoning Quality Completeness, depth, and logical coherence of the analysis.
Hallucination Detection Identifies fabricated library calls, APIs, or CVEs using artifact cross-referencing.
Stability Consistency of results across multiple independent runs (Stability Score).
from dojo.graders.reasoning_grader import ReasoningGrader, GradingResult

grader = ReasoningGrader(challenge)
result: GradingResult = grader.grade_phase(PhaseID.OBSERVE, response)

print(f"Score: {result.total_score}")
print(f"Hallucinations: {result.hallucinations}")
print(f"Calibration Error: {result.calibration_error}")

Grade Enum

from dojo.models import Grade

class Grade(Enum):
    PERFECT = "A"      # No corrections needed -> positive example
    GOOD = "B"         # Minor issues -> positive with notes
    ACCEPTABLE = "C"   # Functional but needs improvement
    POOR = "D"         # Major issues -> negative example with correction
    FAIL = "F"         # Non-functional -> negative example

TrainingExtractor

Extracts training examples from graded sessions.

from dojo.sensei import TrainingExtractor

extractor = TrainingExtractor()
examples: list[TrainingExample] = extractor.extract_from_session(session, assessment)

# Examples include:
# - Positive examples (Grade A/B)
# - Negative examples (Grade D/F)
# - Error recovery pairs (failed -> fixed)

TrainingDataExporter

Exports training data in multiple formats.

from dojo.sensei import TrainingDataExporter, ExportFormat

exporter = TrainingDataExporter(output_dir=Path("./training_data"))

# Export in Alpaca format (instruction/input/output)
path = exporter.export(examples, ExportFormat.ALPACA)

# Export in DPO format (chosen/rejected pairs)
path = exporter.export(examples, ExportFormat.DPO)

# Export in ShareGPT format (conversations)
path = exporter.export(examples, ExportFormat.SHAREGPT)

ProgressTracker

Tracks model progress across training sessions.

from dojo.sensei import ProgressTracker

tracker = ProgressTracker(storage_path=Path("./progress"))
tracker.record_assessment(model_id, assessment)

progress = tracker.get_progress(model_id)
print(f"Belt: {progress.current_belt}")
print(f"Pass Rate: {progress.pass_rate}%")
print(f"Ready for Promotion: {progress.ready_for_promotion}")

4. Execution Tier System

Overview

The Dojo uses a tiered execution model that prioritizes resource efficiency.

Tier Name Description When to Use
1 ADB Pure shell commands via ADB Always try first
2 On-Device Tools on Android (sqlite3, toybox) When ADB insufficient
3 External Kali tools (nmap, metasploit) Preprocessing ONLY

Tier Exhaustion Strategy

  1. Try Tier 1 first: Can this be done with pure ADB commands?
  2. Escalate to Tier 2: If ADB is insufficient, use on-device tools
  3. Tier 3 is preprocessing only: Kali tools embed results in challenge metadata

ExecutionResult Metadata

@dataclass
class ExecutionResult:
    success: bool
    exit_code: int
    stdout: str
    stderr: str
    duration: float
    command: str
    tier_used: ExecutionTier  # SHELL, ON_DEVICE, EXTERNAL
    tools_used: list[str]

5. Fine-tuning Pipeline

TrainingPackager

Creates portable packages for GPU training.

from dojo.finetune import TrainingPackager, FinetuneConfig

packager = TrainingPackager(output_dir=Path("./packages"))

config = FinetuneConfig(
    base_model="Qwen/Qwen2.5-Coder-7B",
    adapter_type="lora",
    lora_rank=16,
    learning_rate=1e-4,
    epochs=3,
)

package_path = packager.create_package(
    training_data_path=Path("./training_data/combined.json"),
    config=config,
)

Package Contents

finetune_package_20250104_120000/
|-- data/
|   +-- training_data.json    # Alpaca format
|-- config.json               # FinetuneConfig
|-- train.py                  # Training script
|-- train_mlx.py              # MLX training (Apple Silicon)
+-- README.md                 # Instructions

6. Data Models

Core Models (dojo/models.py)

from dojo.models import (
    Belt,              # WHITE, YELLOW, ORANGE, GREEN, BLUE, PURPLE, BROWN, BLACK
    Grade,             # PERFECT, GOOD, ACCEPTABLE, POOR, FAIL
    ScriptType,        # ADB, PYTHON, FRIDA, BASH
    Challenge,         # Challenge definition
    ChallengeInput,    # Input context for challenge
    ExpectedOutput,    # Expected output specification
    ScoringRubric,     # Scoring weights
    SenseiAssessment,  # Grading result
    TrainingExample,   # Extracted training sample
    ModelProgress,     # Model's progress tracking
)

Belt Model

class Belt(Enum):
    WHITE = "white"
    YELLOW = "yellow"
    ORANGE = "orange"
    GREEN = "green"
    BLUE = "blue"
    PURPLE = "purple"
    BROWN = "brown"
    BLACK = "black"

    @property
    def display(self) -> str:
        """Belt with color emoji."""
        icons = {"white": "⬜", "yellow": "🟨", ...}
        return f"{icons[self.value]} {self.value.title()}"

    def next_belt(self) -> Optional[Belt]:
        """Get the next belt in progression."""
        ...

7. Running the Dojo

End-to-End Test

# Run the complete dojo pipeline
python -m dojo.test_end_to_end

# This will:
# 1. Load white belt challenges
# 2. Run model against challenges
# 3. Grade outputs with Sensei
# 4. Extract training examples
# 5. Export to Alpaca format

Programmatic Usage

from dojo import (
    ChallengeLoader,
    Challenger,
    Executor,
    Sensei,
)
from agent.llm_client import OllamaClient

# Setup
loader = ChallengeLoader()
executor = Executor(device_id="emulator-5554")
llm = OllamaClient(model="qwen2.5-coder:7b")

challenger = Challenger(
    loader=loader,
    executor=executor,
    llm_client=llm,
    max_attempts=3,
)

sensei = Sensei()

# Run challenges
sessions = []
for challenge_id in ["white_001", "white_002", "white_003"]:
    session = challenger.run_challenge(challenge_id)
    sessions.append(session)

# Grade and export
result = sensei.run_training_cycle(
    sessions=sessions,
    model_id="qwen-v1",
    export_formats=[ExportFormat.ALPACA],
)

print(result.summary())

8. Integration with AgenticART

Existing Components -> Dojo

Existing Component Dojo Integration
agent/llm_client.py LLM provider for Challenger
agent/script_generator.py Can use Sensei for grading
core/exploitation/ Executor wraps these modules

Dojo Outputs -> Fine-tuning

dojo_output/
|-- training_data/
|   |-- alpaca_20250104_120000.json
|   |-- dpo_20250104_120000.json
|   +-- sharegpt_20250104_120000.json
|-- progress/
|   +-- model_progress.json
+-- packages/
    +-- finetune_package_20250104_120000/

9. Continuous Improvement Workflow

The Dojo Loop

+---------------------------------------------------------------------+
|                        TRAINING CYCLE                               |
+---------------------------------------------------------------------+
|                                                                     |
|  1. Challenge Session                                               |
|  +---------------------------------------------------------------+  |
|  | Load challenges -> Run model -> Execute -> Collect attempts   |  |
|  +---------------------------------------------------------------+  |
|                              |                                      |
|                              v                                      |
|  2. Grading                                                         |
|  +---------------------------------------------------------------+  |
|  | Sensei grades -> Extract examples -> Update progress          |  |
|  +---------------------------------------------------------------+  |
|                              |                                      |
|                              v                                      |
|  3. Export                                                          |
|  +---------------------------------------------------------------+  |
|  | Export Alpaca/DPO -> Package for training                     |  |
|  +---------------------------------------------------------------+  |
|                              |                                      |
|                              v                                      |
|  4. Fine-tune (External)                                            |
|  +---------------------------------------------------------------+  |
|  | Run LoRA training -> Evaluate -> Deploy improved model        |  |
|  +---------------------------------------------------------------+  |
|                              |                                      |
|                              v                                      |
|  5. Belt Evaluation                                                 |
|  +---------------------------------------------------------------+  |
|  | Run belt suite -> Check promotion -> Unlock next belt         |  |
|  +---------------------------------------------------------------+  |
|                                                                     |
+---------------------------------------------------------------------+

10. Metrics

TrainingCycleResult

@dataclass
class TrainingCycleResult:
    assessments: list[SenseiAssessment]
    examples: list[TrainingExample]
    exports: dict[ExportFormat, Path]
    progress: ModelProgress
    promotion: Optional[Belt] = None
    stats: dict = field(default_factory=dict)

    def summary(self) -> str:
        """Human-readable summary."""
        return f"""
=== Training Cycle Complete ===
Sessions graded: {len(self.assessments)}
Examples extracted: {len(self.examples)}
Files exported: {len(self.exports)}

Model: {self.progress.model_id}
Belt: {self.progress.current_belt.display}
Pass Rate: {self.progress.pass_rate:.1f}%
"""

ModelProgress

@dataclass
class ModelProgress:
    model_id: str
    current_belt: Belt
    challenges_attempted: int
    challenges_passed: int
    total_score: int
    assessment_count: int

    @property
    def pass_rate(self) -> float:
        if self.challenges_attempted == 0:
            return 0.0
        return (self.challenges_passed / self.challenges_attempted) * 100

    @property
    def average_score(self) -> float:
        if self.assessment_count == 0:
            return 0.0
        return self.total_score / self.assessment_count

Implementation Status

Completed βœ…

  • V2 Curriculum Architecture - 7 pillars, multi-phase challenges
  • PraxisRunner - Main execution engine with reasoning loop
  • RAG System - ChromaDB-based retrieval with OWASP/CWE knowledge
  • MCP Integration - JADX and Apktool servers for verification
  • DPO Training Pipeline - Chosen/rejected pair extraction
  • ReasoningGrader - Epistemic calibration, hallucination detection
  • White/Yellow Belt Challenges - Foundation curriculum complete

In Progress πŸ”„

  • Green+ Belt Challenges - Extending curriculum depth
  • Frida MCP Server - Dynamic instrumentation integration
  • Metrics Dashboard - Streamlit visualization
  • Automated Training Loop - Scheduled challenge runs

Planned πŸ“‹

  • CLI Interface - python -m dojo train, python -m dojo export
  • Pattern Family Clustering - For transfer learning
  • Multi-APK Synthesis Challenges - Black belt complexity
  • Real-time Progress Tracking - WebSocket updates


"A black belt is a white belt who never quit."