AgenticART Dojo Framework¶
Transform AgenticART from a tool into a training ground for security LLMs
Status: V2 architecture fully implemented with Praxis Loop, RAG system, and MCP integration.
Quick Links¶
- Architecture Overview - System architecture and implementation phases
- RAG System - Retrieval-Augmented Generation for context
- MCP Integration - Model Context Protocol for tool execution
Philosophy¶
A dojo is not just a place where techniques are practiced; it is a system that:

- Provides structured progression (belt levels)
- Offers immediate feedback (sensei corrections)
- Captures successful patterns (kata)
- Measures mastery (grading)
- Enables continuous improvement (deliberate practice)
AgenticART already captures exploitation trajectories. The Dojo Framework formalizes this into a self-improving training loop.
Architecture Overview¶
+-------------------------------------------------------------------------+
|                             DOJO FRAMEWORK                              |
+-------------------------------------------------------------------------+
|                                                                         |
|  +------------------+    +------------------+    +------------------+   |
|  |    CURRICULUM    |    |     SPARRING     |    |     GRADING      |   |
|  |    (Challenge    |--->|    (Execution    |--->|     (Quality     |   |
|  |   Progression)   |    |   Against AVD)   |    |    Assessment)   |   |
|  +------------------+    +------------------+    +------------------+   |
|           |                       |                       |             |
|           |                       v                       |             |
|           |              +------------------+             |             |
|           |              |      SENSEI      |             |             |
|           |              |   (Feedback &    |<------------+             |
|           |              |   Correction)    |                           |
|           |              +------------------+                           |
|           |                       |                                     |
|           v                       v                                     |
|  +-------------------------------------------------------------------+  |
|  |                      TRAINING DATA PIPELINE                       |  |
|  |  +--------------+       +--------------+       +--------------+   |  |
|  |  |   Positive   |       |   Negative   |       |    Error     |   |  |
|  |  |   Examples   |       |   Examples   |       |   Recovery   |   |  |
|  |  +--------------+       +--------------+       +--------------+   |  |
|  +-------------------------------------------------------------------+  |
|                                    |                                    |
|                                    v                                    |
|                          +------------------+                           |
|                          |   FINE-TUNING    |                           |
|                          |    (LoRA/MLX)    |                           |
|                          +------------------+                           |
|                                                                         |
+-------------------------------------------------------------------------+
Directory Structure¶
dojo/
|-- __init__.py                  # Package exports
|-- config.py                    # DojoConfig settings
|-- models.py                    # Core data models (Belt, Grade, Challenge, etc.)
|-- exceptions.py                # Custom exceptions
|
|-- curriculum/                  # Challenge system
|   |-- __init__.py
|   |-- challenger.py            # Orchestrates attempts with feedback loop
|   |-- loader.py                # Loads challenges from YAML
|   |-- executor.py              # Executes commands against device
|   |-- context_injector.py      # Injects error context for retries
|   |-- error_extractor.py       # Extracts actionable error information
|   |
|   |-- white_belt/              # Fundamentals
|   |   +-- challenges.yaml
|   |-- yellow_belt/             # Reconnaissance
|   |   +-- challenges.yaml
|   +-- orange_belt/             # Vulnerability mapping
|       +-- challenges.yaml
|
|-- sensei/                      # Grading and training data
|   |-- __init__.py
|   |-- sensei.py                # Main orchestrator
|   |-- grader.py                # Evaluates challenge sessions
|   |-- exporter.py              # Exports to Alpaca/ShareGPT/DPO formats
|   |-- progress_tracker.py      # Tracks model progress across sessions
|   +-- training_extractor.py    # Extracts training examples from sessions
|
|-- finetune/                    # Model training utilities
|   |-- __init__.py
|   |-- config.py                # FinetuneConfig
|   +-- packager.py              # Packages data for GPU training
|
|-- test_end_to_end.py           # Integration test
|-- test_phase2.py               # Curriculum tests
+-- test_phase3.py               # Sensei tests
V2 Praxis Loop¶
The V2 architecture introduces the Praxis Loop, a reasoning -> verification -> calibration cycle:
+-----------------------------------------------------------------------+
|                              PRAXIS LOOP                              |
+-----------------------------------------------------------------------+
|                                                                       |
|   +-------------+                                                     |
|   |  Challenge  |                                                     |
|   |    Input    |                                                     |
|   +------+------+                                                     |
|          |                                                            |
|          v                                                            |
|   +-------------+      +-------------+                                |
|   | RAG System  |----->|  Augmented  |                                |
|   |  (Context)  |      |   Prompt    |                                |
|   +-------------+      +------+------+                                |
|                               |                                       |
|                               v                                       |
|   +-----------------------------------------------------------+       |
|   |                      REASONING PHASE                      |       |
|   |  OBSERVE -> HYPOTHESIZE -> TEST -> CALIBRATE -> CORRECT   |       |
|   +-----------------------------------------------------------+       |
|          |                                                            |
|          v                                                            |
|   +-------------+      +-------------+      +-------------+           |
|   |     MCP     |----->|    Tool     |----->| Calibration |           |
|   |  Executor   |      |   Results   |      |   Signal    |           |
|   +-------------+      +-------------+      +------+------+           |
|                                                    |                  |
|                                                    v                  |
|                                             +-------------+           |
|                                             |  DPO Pairs  |           |
|                                             | (Training)  |           |
|                                             +-------------+           |
|                                                                       |
+-----------------------------------------------------------------------+
7 Pillars (V2 Curriculum)¶
Challenges are organized by cognitive skill rather than just difficulty:
| Pillar | Focus | Challenge Types |
|---|---|---|
| static_analysis | Code review, artifact identification | OBSERVATION, HYPOTHESIS |
| negative_knowledge | Recognizing secure patterns | NEGATIVE |
| root_cause | Understanding WHY vulnerabilities exist | ROOT_CAUSE |
| pattern_transfer | Applying patterns across contexts | TRANSFER |
| methodology | Tool selection, test design | VERIFICATION |
| taxonomy | CWE/OWASP classification | All types, with a classification focus |
| patch_analysis | Analyzing security fixes | OBSERVATION, ROOT_CAUSE |
PraxisRunner¶
The main execution engine for V2 challenges:
from pathlib import Path

from dojo.graders.praxis_runner import PraxisRunner
from dojo.mcp import MCPExecutor
from dojo.rag import RAGSystem

# Initialize components (`client` is your LLM client, e.g. an OllamaClient)
executor = MCPExecutor()
rag = RAGSystem(persist_dir=Path(".rag_data"))
runner = PraxisRunner(
    llm_client=client,
    mcp_executor=executor,
    enable_rag=True,
    rag_system=rag,
)
# Run challenge
result = runner.run_challenge(challenge)
print(f"Score: {result.total_score}")
print(f"Calibration Error: {result.calibration_error}")
1. Belt Progression System¶
Belt Levels¶
| Belt | Name | Challenge Type | Success Criteria |
|---|---|---|---|
| ⬜ White | Observation | Systematic artifact identification | 100% key artifact discovery |
| 🟨 Yellow | Hypothesis | Testable security hypotheses | Accurate CWE mapping & test design |
| 🟧 Orange | Verification | Execution against live targets | Successful exploit verification |
Current Implementation: White belt is 100% operational with pre-built targets. Yellow belt is in the "Connector Phase" with physical targets currently being integrated.
Challenge YAML Format¶
# dojo/curriculum/white_belt/challenges.yaml
challenges:
  - id: white_001
    name: "Device Android Version"
    description: |
      Write an ADB shell command that outputs the Android version.
      The output should be just the version number (e.g., "11").
    belt: white
    difficulty: 1
    script_type: adb
    inputs:
      device_id: "emulator-5554"
      device_context:
        connection: "adb"
        task: "retrieve Android version"
    validation:
      type: regex_match
      pattern: "^\\d+(\\.\\d+)*"
    hints:
      - "Use 'adb shell getprop' to read system properties"
      - "Android version is stored in ro.build.version.release"
    kata_solution: "shell getprop ro.build.version.release"
    tags:
      - fundamentals
      - device-info
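The `regex_match` validation type above amounts to matching the trimmed command output against the challenge's pattern. A minimal sketch (the dojo's actual validator may normalise output differently):

```python
import re

def validate_regex(output: str, pattern: str) -> bool:
    """Pass when the trimmed command output matches the challenge pattern."""
    return re.match(pattern, output.strip()) is not None

# white_001 accepts a bare version number such as "11" or "13.0".
WHITE_001_PATTERN = r"^\d+(\.\d+)*"
```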
2. Core Components¶
ChallengeLoader¶
Loads and validates challenges from YAML files.
from dojo.curriculum import ChallengeLoader
from dojo.models import Belt

loader = ChallengeLoader()
challenge = loader.load("white_001")
all_white = loader.load_belt(Belt.WHITE)
Challenger¶
Orchestrates challenge attempts with the feedback loop.
from dojo.curriculum import Challenger, ChallengeSession
challenger = Challenger(
    loader=loader,
    executor=executor,
    llm_client=llm_client,
    max_attempts=3,
)
session: ChallengeSession = challenger.run_challenge("white_001")
print(f"Success: {session.final_success}")
print(f"Attempts: {len(session.attempts)}")
Executor¶
Executes commands against the Android device with tier tracking.
from dojo.curriculum import Executor, ExecutionResult
executor = Executor(adb_path="/usr/bin/adb", device_id="emulator-5554")
result: ExecutionResult = executor.execute("shell getprop ro.build.version.release")
print(f"Success: {result.success}")
print(f"Output: {result.stdout}")
print(f"Tier: {result.tier_used}") # 1=ADB, 2=ON_DEVICE
ErrorExtractor & ContextInjector¶
Extracts error information and injects it into retry prompts.
from dojo.curriculum import ErrorExtractor, ContextInjector
extractor = ErrorExtractor()
error_context = extractor.extract(result)
injector = ContextInjector()
retry_prompt = injector.inject(original_prompt, error_context)
3. Sensei Module (Grading & Training Data)¶
Sensei¶
The main orchestrator that connects grading, extraction, and export.
from pathlib import Path

from dojo.sensei import Sensei, ExportFormat

sensei = Sensei(output_dir=Path("./dojo_output"))

# Evaluate a single session
assessment, examples = sensei.evaluate_session(session, model_id="qwen-v1")

# Evaluate multiple sessions and run full cycle
result = sensei.run_training_cycle(
    sessions=sessions,
    model_id="qwen-v1",
    export_formats=[ExportFormat.ALPACA, ExportFormat.DPO],
)
print(result.summary())
Grader (V2)¶
The V2 Reasoning Grader goes beyond simple pass/fail checks. It evaluates the quality of thought using multiple metrics:
| Metric | Description |
|---|---|
| Epistemic Calibration | Does the model's confidence match its accuracy? (Brier Score, ECE) |
| Reasoning Quality | Completeness, depth, and logical coherence of the analysis. |
| Hallucination Detection | Identifies fabricated library calls, APIs, or CVEs using artifact cross-referencing. |
| Stability | Consistency of results across multiple independent runs (Stability Score). |
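For reference, the calibration metrics in the table can be computed from (confidence, was_correct) pairs. A minimal sketch, not the grader's actual implementation:

```python
# Brier score and Expected Calibration Error from (confidence, correct) pairs.
def brier_score(preds: list[tuple[float, bool]]) -> float:
    """Mean squared gap between stated confidence and actual correctness."""
    return sum((c - float(ok)) ** 2 for c, ok in preds) / len(preds)

def expected_calibration_error(preds: list[tuple[float, bool]], bins: int = 10) -> float:
    """ECE: per-bin |mean confidence - accuracy|, weighted by bin size."""
    ece = 0.0
    for b in range(bins):
        lo, hi = b / bins, (b + 1) / bins
        bucket = [(c, ok) for c, ok in preds
                  if lo <= c < hi or (b == bins - 1 and c == 1.0)]
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / len(preds)) * abs(avg_conf - accuracy)
    return ece
```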
from dojo.graders.reasoning_grader import ReasoningGrader, GradingResult

# `challenge` is a loaded Challenge; `response` is the model's output for the
# phase being graded. PhaseID names the reasoning phase (OBSERVE, HYPOTHESIZE, ...)
grader = ReasoningGrader(challenge)
result: GradingResult = grader.grade_phase(PhaseID.OBSERVE, response)
print(f"Score: {result.total_score}")
print(f"Hallucinations: {result.hallucinations}")
print(f"Calibration Error: {result.calibration_error}")
Grade Enum¶
from dojo.models import Grade
class Grade(Enum):
    PERFECT = "A"     # No corrections needed -> positive example
    GOOD = "B"        # Minor issues -> positive with notes
    ACCEPTABLE = "C"  # Functional but needs improvement
    POOR = "D"        # Major issues -> negative example with correction
    FAIL = "F"        # Non-functional -> negative example
TrainingExtractor¶
Extracts training examples from graded sessions.
from dojo.sensei import TrainingExtractor
from dojo.models import TrainingExample

# `session` and `assessment` come from a graded challenge run
extractor = TrainingExtractor()
examples: list[TrainingExample] = extractor.extract_from_session(session, assessment)
# Examples include:
# - Positive examples (Grade A/B)
# - Negative examples (Grade D/F)
# - Error recovery pairs (failed -> fixed)
TrainingDataExporter¶
Exports training data in multiple formats.
from pathlib import Path

from dojo.sensei import TrainingDataExporter, ExportFormat

exporter = TrainingDataExporter(output_dir=Path("./training_data"))
# Export in Alpaca format (instruction/input/output)
path = exporter.export(examples, ExportFormat.ALPACA)
# Export in DPO format (chosen/rejected pairs)
path = exporter.export(examples, ExportFormat.DPO)
# Export in ShareGPT format (conversations)
path = exporter.export(examples, ExportFormat.SHAREGPT)
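The three formats differ mainly in record shape. Illustrative records for each (simplified; the exporter's exact field contents will differ):

```python
alpaca_record = {  # single-turn instruction tuning
    "instruction": "Write an ADB command that outputs the Android version.",
    "input": "device: emulator-5554",
    "output": "adb shell getprop ro.build.version.release",
}

dpo_record = {  # preference pair: chosen vs. rejected
    "prompt": "Write an ADB command that outputs the Android version.",
    "chosen": "adb shell getprop ro.build.version.release",
    "rejected": "adb shell cat /system/build.prop",
}

sharegpt_record = {  # multi-turn conversation
    "conversations": [
        {"from": "human", "value": "Write an ADB command that outputs the Android version."},
        {"from": "gpt", "value": "adb shell getprop ro.build.version.release"},
    ]
}
```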
ProgressTracker¶
Tracks model progress across training sessions.
from pathlib import Path

from dojo.sensei import ProgressTracker

tracker = ProgressTracker(storage_path=Path("./progress"))
tracker.record_assessment(model_id, assessment)  # from a graded session
progress = tracker.get_progress(model_id)
print(f"Belt: {progress.current_belt}")
print(f"Pass Rate: {progress.pass_rate}%")
print(f"Ready for Promotion: {progress.ready_for_promotion}")
4. Execution Tier System¶
Overview¶
The Dojo uses a tiered execution model that prioritizes resource efficiency.
| Tier | Name | Description | When to Use |
|---|---|---|---|
| 1 | ADB | Pure shell commands via ADB | Always try first |
| 2 | On-Device | Tools on Android (sqlite3, toybox) | When ADB insufficient |
| 3 | External | Kali tools (nmap, metasploit) | Preprocessing ONLY |
Tier Exhaustion Strategy¶
- Try Tier 1 first: Can this be done with pure ADB commands?
- Escalate to Tier 2: If ADB is insufficient, use on-device tools
- Tier 3 is preprocessing only: Kali tools embed results in challenge metadata
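The exhaustion strategy can be sketched as a simple escalation loop. The tier callables below are hypothetical stand-ins for real ADB and on-device execution; note that Tier 3 never appears, because Kali tools run as preprocessing and their results are embedded in challenge metadata before a session starts:

```python
from typing import Callable, Optional

def run_with_tier_exhaustion(
    tier1_adb: Callable[[], Optional[str]],
    tier2_on_device: Callable[[], Optional[str]],
) -> tuple[int, Optional[str]]:
    """Try pure ADB first; escalate to on-device tools only on failure."""
    for tier, attempt in ((1, tier1_adb), (2, tier2_on_device)):
        result = attempt()
        if result is not None:
            return tier, result
    return 0, None  # both tiers exhausted

tier, out = run_with_tier_exhaustion(lambda: None, lambda: "rows: 42")
```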
ExecutionResult Metadata¶
from dataclasses import dataclass

@dataclass
class ExecutionResult:
    success: bool
    exit_code: int
    stdout: str
    stderr: str
    duration: float
    command: str
    tier_used: ExecutionTier  # SHELL, ON_DEVICE, EXTERNAL
    tools_used: list[str]
5. Fine-tuning Pipeline¶
TrainingPackager¶
Creates portable packages for GPU training.
from pathlib import Path

from dojo.finetune import TrainingPackager, FinetuneConfig

packager = TrainingPackager(output_dir=Path("./packages"))
config = FinetuneConfig(
    base_model="Qwen/Qwen2.5-Coder-7B",
    adapter_type="lora",
    lora_rank=16,
    learning_rate=1e-4,
    epochs=3,
)
package_path = packager.create_package(
    training_data_path=Path("./training_data/combined.json"),
    config=config,
)
Package Contents¶
finetune_package_20250104_120000/
|-- data/
|   +-- training_data.json   # Alpaca format
|-- config.json              # FinetuneConfig
|-- train.py                 # Training script
|-- train_mlx.py             # MLX training (Apple Silicon)
+-- README.md                # Instructions
6. Data Models¶
Core Models (dojo/models.py)¶
from dojo.models import (
    Belt,              # WHITE, YELLOW, ORANGE, GREEN, BLUE, PURPLE, BROWN, BLACK
    Grade,             # PERFECT, GOOD, ACCEPTABLE, POOR, FAIL
    ScriptType,        # ADB, PYTHON, FRIDA, BASH
    Challenge,         # Challenge definition
    ChallengeInput,    # Input context for challenge
    ExpectedOutput,    # Expected output specification
    ScoringRubric,     # Scoring weights
    SenseiAssessment,  # Grading result
    TrainingExample,   # Extracted training sample
    ModelProgress,     # Model's progress tracking
)
Belt Model¶
class Belt(Enum):
    WHITE = "white"
    YELLOW = "yellow"
    ORANGE = "orange"
    GREEN = "green"
    BLUE = "blue"
    PURPLE = "purple"
    BROWN = "brown"
    BLACK = "black"

    @property
    def display(self) -> str:
        """Belt with color emoji."""
        icons = {"white": "⬜", "yellow": "🟨", ...}
        return f"{icons[self.value]} {self.value.title()}"

    def next_belt(self) -> Optional["Belt"]:
        """Get the next belt in progression."""
        ...
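The elided next_belt body could lean on the enum's declaration order, since Enum members iterate in the order they are defined. A possible sketch (re-declaring the enum locally so the example is self-contained; the real implementation may differ):

```python
from enum import Enum
from typing import Optional

class Belt(Enum):
    WHITE = "white"
    YELLOW = "yellow"
    ORANGE = "orange"
    GREEN = "green"
    BLUE = "blue"
    PURPLE = "purple"
    BROWN = "brown"
    BLACK = "black"

    def next_belt(self) -> Optional["Belt"]:
        """Return the belt after this one, or None at black belt."""
        order = list(Belt)  # members iterate in declaration order
        idx = order.index(self)
        return order[idx + 1] if idx + 1 < len(order) else None
```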
7. Running the Dojo¶
End-to-End Test¶
# Run the complete dojo pipeline
python -m dojo.test_end_to_end
# This will:
# 1. Load white belt challenges
# 2. Run model against challenges
# 3. Grade outputs with Sensei
# 4. Extract training examples
# 5. Export to Alpaca format
Programmatic Usage¶
from dojo import (
    ChallengeLoader,
    Challenger,
    Executor,
    Sensei,
)
from dojo.sensei import ExportFormat
from agent.llm_client import OllamaClient

# Setup
loader = ChallengeLoader()
executor = Executor(device_id="emulator-5554")
llm = OllamaClient(model="qwen2.5-coder:7b")
challenger = Challenger(
    loader=loader,
    executor=executor,
    llm_client=llm,
    max_attempts=3,
)
sensei = Sensei()

# Run challenges
sessions = []
for challenge_id in ["white_001", "white_002", "white_003"]:
    session = challenger.run_challenge(challenge_id)
    sessions.append(session)

# Grade and export
result = sensei.run_training_cycle(
    sessions=sessions,
    model_id="qwen-v1",
    export_formats=[ExportFormat.ALPACA],
)
print(result.summary())
8. Integration with AgenticART¶
Existing Components -> Dojo¶
| Existing Component | Dojo Integration |
|---|---|
| agent/llm_client.py | LLM provider for Challenger |
| agent/script_generator.py | Can use Sensei for grading |
| core/exploitation/ | Executor wraps these modules |
Dojo Outputs -> Fine-tuning¶
dojo_output/
|-- training_data/
|   |-- alpaca_20250104_120000.json
|   |-- dpo_20250104_120000.json
|   +-- sharegpt_20250104_120000.json
|-- progress/
|   +-- model_progress.json
+-- packages/
    +-- finetune_package_20250104_120000/
9. Continuous Improvement Workflow¶
The Dojo Loop¶
+---------------------------------------------------------------------+
|                           TRAINING CYCLE                            |
+---------------------------------------------------------------------+
|                                                                     |
|  1. Challenge Session                                               |
|  +---------------------------------------------------------------+  |
|  |  Load challenges -> Run model -> Execute -> Collect attempts  |  |
|  +---------------------------------------------------------------+  |
|                                  |                                  |
|                                  v                                  |
|  2. Grading                                                         |
|  +---------------------------------------------------------------+  |
|  |  Sensei grades -> Extract examples -> Update progress         |  |
|  +---------------------------------------------------------------+  |
|                                  |                                  |
|                                  v                                  |
|  3. Export                                                          |
|  +---------------------------------------------------------------+  |
|  |  Export Alpaca/DPO -> Package for training                    |  |
|  +---------------------------------------------------------------+  |
|                                  |                                  |
|                                  v                                  |
|  4. Fine-tune (External)                                            |
|  +---------------------------------------------------------------+  |
|  |  Run LoRA training -> Evaluate -> Deploy improved model       |  |
|  +---------------------------------------------------------------+  |
|                                  |                                  |
|                                  v                                  |
|  5. Belt Evaluation                                                 |
|  +---------------------------------------------------------------+  |
|  |  Run belt suite -> Check promotion -> Unlock next belt        |  |
|  +---------------------------------------------------------------+  |
|                                                                     |
+---------------------------------------------------------------------+
10. Metrics¶
TrainingCycleResult¶
from dataclasses import dataclass, field
from pathlib import Path
from typing import Optional

@dataclass
class TrainingCycleResult:
    assessments: list[SenseiAssessment]
    examples: list[TrainingExample]
    exports: dict[ExportFormat, Path]
    progress: ModelProgress
    promotion: Optional[Belt] = None
    stats: dict = field(default_factory=dict)

    def summary(self) -> str:
        """Human-readable summary."""
        return f"""
=== Training Cycle Complete ===
Sessions graded: {len(self.assessments)}
Examples extracted: {len(self.examples)}
Files exported: {len(self.exports)}
Model: {self.progress.model_id}
Belt: {self.progress.current_belt.display}
Pass Rate: {self.progress.pass_rate:.1f}%
"""
ModelProgress¶
from dataclasses import dataclass

@dataclass
class ModelProgress:
    model_id: str
    current_belt: Belt
    challenges_attempted: int
    challenges_passed: int
    total_score: int
    assessment_count: int

    @property
    def pass_rate(self) -> float:
        if self.challenges_attempted == 0:
            return 0.0
        return (self.challenges_passed / self.challenges_attempted) * 100

    @property
    def average_score(self) -> float:
        if self.assessment_count == 0:
            return 0.0
        return self.total_score / self.assessment_count
Implementation Status¶
Completed¶
- V2 Curriculum Architecture - 7 pillars, multi-phase challenges
- PraxisRunner - Main execution engine with reasoning loop
- RAG System - ChromaDB-based retrieval with OWASP/CWE knowledge
- MCP Integration - JADX and Apktool servers for verification
- DPO Training Pipeline - Chosen/rejected pair extraction
- ReasoningGrader - Epistemic calibration, hallucination detection
- White/Yellow Belt Challenges - Foundation curriculum complete
In Progress¶
- Green+ Belt Challenges - Extending curriculum depth
- Frida MCP Server - Dynamic instrumentation integration
- Metrics Dashboard - Streamlit visualization
- Automated Training Loop - Scheduled challenge runs
Planned¶
- CLI Interface - python -m dojo train, python -m dojo export
- Pattern Family Clustering - For transfer learning
- Multi-APK Synthesis Challenges - Black belt complexity
- Real-time Progress Tracking - WebSocket updates
Related Documentation¶
- Architecture - System design and component overview
- RAG System - Knowledge retrieval details
- MCP Integration - Tool execution protocol
- Quickstart - Getting started guide
"A black belt is a white belt who never quit."