AgenticART Architecture: Praxis (V2)¶
Executive Summary¶
The AgenticART architecture trains models to reason about vulnerabilities through structured cognitive phases, using tool execution as binary ground truth.
Key Components¶
| Component | Status | Description |
|---|---|---|
| Praxis Loop | ✅ Implemented | Reasoning → Verification → Calibration cycle |
| MCP Integration | ✅ Implemented | Model Context Protocol for Android security tools |
| RAG System | ✅ Implemented | Retrieval-Augmented Generation for context |
| V2 Curriculum | ✅ Implemented | 7 pillars, multi-phase challenges |
| DPO Training | ✅ Implemented | Preference pair extraction |
| Belt Progression | 🚧 In Progress | White through Black belt challenges |
Core Paradigm¶
Reasoning Chain¶
Input: APK + manifest + decompiled code

1. OBSERVE: Identify security-relevant artifacts.
2. HYPOTHESIZE: Identify attack surface and potential vulnerabilities.
3. TEST: Design and execute MCP verification tasks.
4. CALIBRATE: Compare confidence to execution pass rate.
5. CORRECT: If execution fails, revise hypothesis.
6. TRAIN: Capture high-quality DPO chosen/rejected pairs.
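The CALIBRATE and CORRECT steps can be sketched in a few lines of Python. The class and function names here are illustrative, not the project's actual API:

```python
from dataclasses import dataclass

@dataclass
class PhaseResult:
    hypothesis: str
    confidence: float  # model's stated confidence, 0.0-1.0
    passed: bool       # binary ground truth from MCP tool execution

def calibration_error(results: list[PhaseResult]) -> float:
    """CALIBRATE: gap between mean stated confidence and observed pass rate."""
    if not results:
        return 0.0
    mean_conf = sum(r.confidence for r in results) / len(results)
    pass_rate = sum(r.passed for r in results) / len(results)
    return abs(mean_conf - pass_rate)

def needs_correction(result: PhaseResult) -> bool:
    """CORRECT: a failed execution sends the hypothesis back for revision."""
    return not result.passed
```

A model that claims 0.9 confidence but verifies only half its hypotheses has a calibration error of 0.4; driving that gap toward zero is the point of using tool execution as ground truth.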
New Challenge Types¶
Type 1: OBSERVATION Challenge¶
- Purpose: Train the model to identify security-relevant artifacts
- Input: Code, manifest, binary properties, runtime traces
- Output: Structured list of observations with security relevance scores
- Evaluation: Completeness, accuracy, relevance ranking
Type 2: HYPOTHESIS Challenge¶
- Purpose: Train the model to form testable security hypotheses
- Input: Observations + context
- Output: Ranked hypotheses with confidence and test plans
- Evaluation: Hypothesis validity, testability, reasoning quality
Type 3: VERIFICATION Challenge¶
- Purpose: Train the model to design and execute tests
- Input: Hypothesis + available tools
- Output: Test plan + execution + result interpretation
- Evaluation: Test coverage, methodology soundness, interpretation accuracy
Type 4: ROOT CAUSE Challenge¶
- Purpose: Train deep understanding of WHY vulnerabilities exist
- Input: Verified vulnerability + code
- Output: Root cause analysis with vulnerability taxonomy mapping
- Evaluation: Depth of understanding, correct classification, generalization
Type 5: NEGATIVE Challenge¶
- Purpose: Train recognition of SECURE patterns
- Input: Secure code implementation
- Output: Security analysis explaining why it's NOT vulnerable
- Evaluation: Accuracy of security property identification, attack resistance analysis
Type 6: TRANSFER Challenge¶
- Purpose: Train pattern recognition across contexts
- Input: Multiple code samples with same vulnerability class
- Output: Pattern abstraction + application to new code
- Evaluation: Pattern generalization quality, correct application
Type 7: SYNTHESIS Challenge (Black Belt)¶
- Purpose: Train end-to-end discovery on novel targets
- Input: Previously unseen APK
- Output: Complete vulnerability report with all phases documented
- Evaluation: Discovery of planted vulnerability OR novel finding
New Data Structures¶
Challenge Schema V2¶
challenge:
  id: string
  name: string
  version: 2

  # Challenge classification
  type: observation | hypothesis | verification | root_cause | negative | transfer | synthesis
  pillar: static_analysis | negative_knowledge | root_cause | pattern_transfer | methodology | taxonomy | patch_analysis
  belt: white | yellow | orange | green | blue | purple | brown | black
  difficulty: 1-10

  # Input artifacts (what the model receives)
  artifacts:
    - type: decompiled_code | manifest | binary_properties | runtime_trace | network_capture | previous_output
      content: string | file_reference
      context: string                    # What this artifact represents

  # Phase-specific configuration
  phases:
    - phase_id: observe | hypothesize | test | analyze | synthesize
      instruction: string                # What to do in this phase
      expected_output_schema: object     # Structure of expected output
      evaluation_criteria: list[string]  # How to grade this phase
      max_tokens: int                    # Token limit for this phase response

  # Ground truth for evaluation
  ground_truth:
    vulnerability_present: bool
    vulnerability_type: string | null
    cwe_id: string | null
    root_cause: string | null
    secure_properties: list[string]      # For negative challenges
    key_observations: list[string]       # Must-find items
    valid_hypotheses: list[object]
    valid_tests: list[object]

  # Training metadata
  training:
    reasoning_chain_required: bool
    dpo_pairs_available: bool
    negative_examples: list[string]      # What NOT to conclude
    common_mistakes: list[string]        # Frequent errors to train against

  # Relationships
  prerequisites: list[challenge_id]
  unlocks: list[challenge_id]
  pattern_family: string                 # For transfer learning grouping
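A minimal validator for the schema above can be sketched in Python. The function name and error-list return style are illustrative, not the project's actual loader API:

```python
VALID_TYPES = {"observation", "hypothesis", "verification",
               "root_cause", "negative", "transfer", "synthesis"}
VALID_BELTS = {"white", "yellow", "orange", "green",
               "blue", "purple", "brown", "black"}

def validate_challenge_v2(challenge: dict) -> list[str]:
    """Return a list of schema violations; an empty list means valid."""
    errors = []
    if challenge.get("version") != 2:
        errors.append("version must be 2")
    if challenge.get("type") not in VALID_TYPES:
        errors.append(f"unknown type: {challenge.get('type')}")
    if challenge.get("belt") not in VALID_BELTS:
        errors.append(f"unknown belt: {challenge.get('belt')}")
    if not 1 <= challenge.get("difficulty", 0) <= 10:
        errors.append("difficulty must be in 1-10")
    if not challenge.get("phases"):
        errors.append("at least one phase is required")
    return errors
```

Running this on every challenge file at load time catches malformed curriculum entries before they reach the Praxis loop.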
Reasoning Chain Schema¶
reasoning_chain:
  challenge_id: string
  model_id: string
  timestamp: datetime
  phases:
    - phase_id: string
      input_provided: string
      model_output: string
      output_parsed: object              # Structured extraction
      evaluation:
        score: float                     # 0.0 - 1.0
        criteria_scores: dict[string, float]
        feedback: string
        hallucinations_detected: list[string]
      reasoning_quality:
        completeness: float
        accuracy: float
        depth: float
        transferability: float
  overall:
    success: bool
    total_score: float
    grade: A | B | C | D | F
    discovery_made: bool
    novel_finding: bool
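The `grade` field can be derived from `total_score`. The cutoffs below are illustrative assumptions, since the document does not specify them:

```python
def letter_grade(total_score: float) -> str:
    """Map the 0.0-1.0 total_score to the A-F grade used in the schema.
    Cutoffs are assumed (standard 90/80/70/60 bands), not taken from the doc."""
    cutoffs = [(0.9, "A"), (0.8, "B"), (0.7, "C"), (0.6, "D")]
    for threshold, grade in cutoffs:
        if total_score >= threshold:
            return grade
    return "F"
```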
Training Data Generation¶
Per-Phase Training Examples¶
Each phase generates its own training data:
# OBSERVATION phase training example
{
  "instruction": "Analyze the following Android code and identify all security-relevant observations.",
  "input": "<decompiled Java code>",
  "output": {
    "observations": [
      {"artifact": "WebView.addJavascriptInterface", "relevance": "high", "reasoning": "..."},
      {"artifact": "Intent.getStringExtra without validation", "relevance": "medium", "reasoning": "..."}
    ],
    "security_context": "...",
    "recommended_next_steps": ["..."]
  }
}
# HYPOTHESIS phase training example
{
  "instruction": "Based on these observations, form testable security hypotheses.",
  "input": "<observations from previous phase>",
  "output": {
    "hypotheses": [
      {
        "statement": "The JavaScript interface exposes methods that can be called from untrusted web content",
        "confidence": 0.8,
        "testable": true,
        "test_plan": "Hook addJavascriptInterface, enumerate exposed methods, test from malicious URL",
        "cwe_mapping": "CWE-749"
      }
    ]
  }
}
# ROOT_CAUSE phase training example
{
  "instruction": "Explain WHY this vulnerability exists at a fundamental level.",
  "input": "<verified vulnerability details>",
  "output": {
    "surface_cause": "User input reaches addJavascriptInterface without validation",
    "root_cause": "Trust boundary violation - web content treated as trusted",
    "fundamental_principle": "Confused deputy problem - privileged component (native code) controlled by unprivileged input (JavaScript)",
    "similar_patterns": ["AIDL without caller verification", "ContentProvider without permission checks"],
    "taxonomy": {
      "cwe": "CWE-749",
      "parent_cwe": "CWE-668",
      "owasp_mobile": "M7"
    }
  }
}
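Examples like these are typically serialized as JSONL (one record per line) for fine-tuning pipelines. A minimal helper, with the function name assumed for illustration:

```python
import json

def phase_examples_to_jsonl(examples: list[dict]) -> str:
    """Serialize per-phase training examples as JSONL, one record per line,
    preserving the instruction/input/output triples shown above."""
    return "\n".join(json.dumps(ex, ensure_ascii=False) for ex in examples)
```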
DPO Pair Generation¶
For each phase, generate preference pairs:
{
  "prompt": "<observation phase prompt>",
  "chosen": "<complete, accurate observations with correct relevance ranking>",
  "rejected": "<incomplete observations OR incorrect relevance OR hallucinated findings>",
  "rejection_reasons": ["missed_critical_finding", "hallucinated_api", "incorrect_relevance"]
}
Evaluation Rubrics¶
Observation Phase Rubric¶
| Criterion | Weight | Description |
|---|---|---|
| Completeness | 30% | Did it find all key artifacts? |
| Accuracy | 30% | Are observations factually correct? |
| Relevance | 20% | Is security relevance correctly assessed? |
| No Hallucination | 20% | No made-up APIs/paths/methods? |
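All four rubrics in this section reduce to a weighted sum of per-criterion scores. A generic scorer, shown with the observation weights (the snake_case criterion keys are illustrative):

```python
def rubric_score(criteria_scores: dict[str, float],
                 weights: dict[str, float]) -> float:
    """Weighted sum of per-criterion scores, each in 0.0-1.0."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must sum to 100%"
    return sum(weights[c] * criteria_scores.get(c, 0.0) for c in weights)

# Weights from the Observation Phase Rubric table above.
OBSERVATION_WEIGHTS = {
    "completeness": 0.30,
    "accuracy": 0.30,
    "relevance": 0.20,
    "no_hallucination": 0.20,
}
```

Swapping in a different weight table yields the hypothesis, root cause, or negative-challenge scores with the same function.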
Hypothesis Phase Rubric¶
| Criterion | Weight | Description |
|---|---|---|
| Validity | 25% | Is the hypothesis logically sound? |
| Testability | 25% | Can it be verified/falsified? |
| Specificity | 20% | Is it precise enough to act on? |
| Coverage | 15% | Does it address key observations? |
| CWE Mapping | 15% | Correct vulnerability classification? |
Root Cause Phase Rubric¶
| Criterion | Weight | Description |
|---|---|---|
| Depth | 30% | Goes beyond surface to fundamental cause? |
| Accuracy | 25% | Correctly identifies the real cause? |
| Generalization | 25% | Identifies transferable patterns? |
| Taxonomy | 20% | Correct CWE/OWASP mapping? |
Negative Challenge Rubric¶
| Criterion | Weight | Description |
|---|---|---|
| Correct Classification | 40% | Correctly identifies as NOT vulnerable? |
| Security Property ID | 30% | Identifies what MAKES it secure? |
| Attack Resistance | 20% | Explains why attacks would fail? |
| No False Positives | 10% | Doesn't hallucinate vulnerabilities? |
Belt Progression Model¶
WHITE BELT: Foundation
├── Focus: Basic observation skills
├── Challenge Types: OBSERVATION only
├── Artifacts: Simple code snippets, basic manifests
├── Success Criteria: 70% observation accuracy
└── Challenges: 50

YELLOW BELT: Classification
├── Focus: Vulnerability taxonomy
├── Challenge Types: OBSERVATION + taxonomy mapping
├── Artifacts: Code with known vulnerability types
├── Success Criteria: 80% CWE classification accuracy
└── Challenges: 75

ORANGE BELT: Pattern Recognition
├── Focus: Recognizing vulnerability patterns
├── Challenge Types: OBSERVATION + HYPOTHESIS
├── Artifacts: Multiple code samples per pattern family
├── Success Criteria: Identify pattern in 3/5 new samples
└── Challenges: 100

GREEN BELT: Hypothesis Formation
├── Focus: Forming testable hypotheses
├── Challenge Types: Full OBSERVATION → HYPOTHESIS → TEST
├── Artifacts: APKs, Frida available
├── Success Criteria: 70% hypothesis verification rate
└── Challenges: 125

BLUE BELT: Root Cause Analysis
├── Focus: Deep understanding of WHY
├── Challenge Types: Add ROOT_CAUSE phase
├── Artifacts: Verified vulnerabilities for analysis
├── Success Criteria: Root cause matches expert analysis
└── Challenges: 150

PURPLE BELT: Negative Knowledge
├── Focus: Recognizing secure code
├── Challenge Types: NEGATIVE + comparative analysis
├── Artifacts: Secure implementations to analyze
├── Success Criteria: <5% false positive rate
└── Challenges: 175

BROWN BELT: Transfer Learning
├── Focus: Applying patterns to new contexts
├── Challenge Types: TRANSFER challenges across apps
├── Artifacts: Multiple APKs, pattern families
├── Success Criteria: Find same vuln class in new app
└── Challenges: 200

BLACK BELT: Discovery
├── Focus: Novel vulnerability discovery
├── Challenge Types: SYNTHESIS on unknown targets
├── Artifacts: Previously unseen APKs
├── Success Criteria: Discover planted OR novel vulnerability
└── Challenges: 180
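A promotion gate over these success criteria can be sketched as a lookup table. The metric names are hypothetical keys into a model's evaluation results, and only belts with simple numeric thresholds are shown; note the purple belt's false-positive ceiling needs a less-than comparison:

```python
# Per-belt promotion gates from the progression above.
# (metric_name, threshold, comparison) - metric names are illustrative.
BELT_CRITERIA = {
    "white":  ("observation_accuracy",         0.70, ">="),
    "yellow": ("cwe_classification_accuracy",  0.80, ">="),
    "green":  ("hypothesis_verification_rate", 0.70, ">="),
    "purple": ("false_positive_rate",          0.05, "<"),
}

def promoted(belt: str, metrics: dict[str, float]) -> bool:
    """Check whether a model's metrics clear the gate for its current belt."""
    metric, threshold, op = BELT_CRITERIA[belt]
    # Missing metrics default to the failing direction of the comparison.
    value = metrics.get(metric, 0.0 if op == ">=" else 1.0)
    return value >= threshold if op == ">=" else value < threshold
```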
RAG System¶
The RAG (Retrieval-Augmented Generation) system provides contextual knowledge to reduce hallucinations:
Challenge Input
        │
        ▼
┌─────────────────┐      ┌──────────────────────────────────┐
│  Query Router   │─────▶│   Knowledge Bases (ChromaDB)     │
│ (Pillar-aware)  │      │  ┌──────────┐  ┌──────────┐      │
└─────────────────┘      │  │ vuln_db  │  │ examples │      │
        │                │  └──────────┘  └──────────┘      │
        ▼                │  ┌───────────┐ ┌──────────┐      │
┌─────────────────┐      │  │android_api│ │tool_docs │      │
│  RAG Context    │◀─────│  └───────────┘ └──────────┘      │
│    Builder      │      └──────────────────────────────────┘
└─────────────────┘
        │
        ▼
  LLM (Qwen 32B / MLX)
Knowledge Bases:
- vuln_db: CWE definitions, OWASP Mobile Top 10
- examples: Analysis examples from curriculum
- android_api: API docs, permissions, deprecations
- tool_docs: ADB, Frida, jadx commands
See: RAG_SYSTEM.md for detailed documentation.
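Pillar-aware routing can be illustrated as splitting a top-k retrieval budget across the four knowledge bases. The weights below are invented for illustration; the real values live in `dojo/rag/config.py`:

```python
# Hypothetical per-pillar retrieval weights over the four knowledge bases.
PILLAR_KB_WEIGHTS = {
    "static_analysis": {"android_api": 0.4, "vuln_db": 0.3,
                        "examples": 0.2, "tool_docs": 0.1},
    "taxonomy": {"vuln_db": 0.6, "examples": 0.3,
                 "android_api": 0.1, "tool_docs": 0.0},
}

def route_query(pillar: str, k: int = 8) -> dict[str, int]:
    """Split a top-k retrieval budget across knowledge bases by pillar weight.
    Unknown pillars get an empty allocation (caller falls back to a default)."""
    weights = PILLAR_KB_WEIGHTS.get(pillar, {})
    return {kb: round(w * k) for kb, w in weights.items() if w > 0}
```

Each knowledge base then answers its share of the query, and the context builder assembles the results under the token budget.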
MCP Integration¶
The MCP (Model Context Protocol) provides standardized tool execution for verification:
┌─────────────────────────────────────────────────────────────────────┐
│                      Praxis Verification Layer                      │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│   PraxisRunner                                                      │
│        │                                                            │
│        ▼                                                            │
│   ┌─────────────┐                                                   │
│   │ MCPExecutor │────┬──────────┬──────────┬──────────┐             │
│   └─────────────┘    │          │          │          │             │
│                      ▼          ▼          ▼          ▼             │
│                  ┌───────┐  ┌───────┐  ┌───────┐  ┌───────┐         │
│                  │ JADX  │  │Apktool│  │  ADB  │  │ Frida │         │
│                  │ Server│  │ Server│  │ Server│  │ Server│         │
│                  └───────┘  └───────┘  └───────┘  └───────┘         │
│                      │          │          │          │             │
│                      ▼          ▼          ▼          ▼             │
│               ┌─────────────────────────────────────────────┐       │
│               │                Tool Results                 │       │
│               │    (Binary ground truth for calibration)    │       │
│               └─────────────────────────────────────────────┘       │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
MCP Servers:
- jadx: Java decompilation, code search, security patterns
- apktool: APK decoding, manifest extraction, smali analysis
- adb: Device interaction, package info
- frida: Dynamic instrumentation (planned)
See: MCP_INTEGRATION.md for detailed documentation.
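The routing layer can be approximated with plain callables standing in for MCP servers. This is a sketch: the real `MCPExecutor` in `dojo/mcp/executor.py` speaks the actual protocol, and the handler signature here is assumed:

```python
from dataclasses import dataclass

@dataclass
class ToolResult:
    server: str
    tool: str
    success: bool   # the binary ground truth the Praxis loop calibrates on
    output: str

class MCPExecutor:
    """Minimal sketch: route a verification task to a named server."""

    def __init__(self):
        self.servers = {}

    def register(self, name, handler):
        """Handler is any callable(tool, **args) -> str in this sketch."""
        self.servers[name] = handler

    def execute(self, server: str, tool: str, **args) -> ToolResult:
        if server not in self.servers:
            return ToolResult(server, tool, False, f"unknown server: {server}")
        try:
            return ToolResult(server, tool, True, self.servers[server](tool, **args))
        except Exception as exc:
            # Tool failures become explicit negative ground truth, not crashes.
            return ToolResult(server, tool, False, str(exc))
```

Because every outcome, including an exception, is reified as a `ToolResult`, the verification loop always gets a pass/fail signal to calibrate against.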
Implementation Phases¶
Phase 1: Foundation (New Models) β COMPLETE¶
- Create `ChallengeV2` model with multi-phase support
- Create `ReasoningChain` model for capturing full traces
- Create `PhaseEvaluation` model for per-phase grading
- Update loader to support V2 challenges while maintaining V1 compatibility
Phase 2: Evaluation (New Grader) β COMPLETE¶
- Create `ReasoningGrader` with per-phase rubrics
- Implement hallucination detection for reasoning (not just commands)
- Create `TransferEvaluator` for pattern recognition assessment
- Implement negative example evaluation
Phase 3: Execution (New Executors) β COMPLETE¶
- Implement Frida script executor
- Implement static analysis tooling (jadx output parsing) → MCP servers
- Create multi-phase executor that chains phases → PraxisRunner
- Add artifact extraction utilities
Phase 4: Training Data (New Extractor) β COMPLETE¶
- Create `ReasoningExtractor` for full chain capture
- Implement per-phase DPO pair generation
- Create negative example extraction
- Implement pattern family clustering
Phase 5: Curriculum (Challenge Creation) π IN PROGRESS¶
- Write 50 WHITE belt observation challenges ✅
- Write 75 YELLOW belt taxonomy challenges ✅
- Continue through all belts with ~1000 total challenges
- Integrate existing vulnerable APKs
Phase 6: RAG System β COMPLETE¶
- Implement ChromaDB-based knowledge bases
- Create embedding pipeline (sentence-transformers)
- Implement pillar-aware query routing
- Create context builder with token budgeting
- Integrate with PraxisRunner
Phase 7: MCP Integration β COMPLETE¶
- Create MCPExecutor for tool routing
- Implement JADX MCP server
- Implement Apktool MCP server
- Integrate with Praxis verification loop
Directory Structure¶
AgenticART/
├── agent/                      # Agent components
│   ├── memory/                 # Vector store, conversation memory
│   ├── prompts/                # Prompt templates
│   └── chains/                 # LangChain-style chains
├── core/                       # Core security modules
│   ├── traffic/                # Network traffic analysis
│   ├── exploitation/           # Exploitation techniques
│   ├── scanning/               # Vulnerability scanning
│   ├── verification/           # Result verification
│   └── reconnaissance/         # Recon modules
├── dojo/                       # Training & curriculum
│   ├── curriculum/
│   │   └── v2/                 # V2 curriculum
│   │       ├── schema.yaml     # Challenge schema
│   │       └── pillars/        # 7 pillar challenges
│   │           ├── static_analysis/
│   │           ├── negative_knowledge/
│   │           ├── root_cause/
│   │           ├── pattern_transfer/
│   │           ├── methodology/
│   │           ├── taxonomy/
│   │           └── patch_analysis/
│   ├── graders/                # Challenge grading
│   │   └── praxis_runner.py    # Main Praxis loop
│   ├── sensei/                 # Training components
│   │   ├── reasoning_grader.py
│   │   └── reasoning_extractor.py
│   ├── evaluation/             # Evaluation results
│   ├── finetune/               # Fine-tuning scripts
│   ├── mcp/                    # MCP Integration
│   │   ├── executor.py         # MCPExecutor, ToolResult
│   │   ├── server.py           # Base server utilities
│   │   ├── config/             # Server configurations
│   │   └── servers/            # MCP server implementations
│   │       ├── jadx_server.py
│   │       └── apktool_server.py
│   ├── rag/                    # RAG System
│   │   ├── config.py           # RAGConfig, pillar weights
│   │   ├── embeddings.py       # EmbeddingPipeline
│   │   ├── chunking.py         # Text/code chunking
│   │   ├── retriever.py        # RAGRetriever, QueryRouter
│   │   ├── context_builder.py  # RAGContextBuilder
│   │   ├── knowledge_bases/    # KB implementations
│   │   │   ├── vuln_db.py
│   │   │   ├── examples.py
│   │   │   ├── android_api.py
│   │   │   └── tool_docs.py
│   │   └── loaders/            # Data loaders
│   │       ├── owasp_loader.py
│   │       ├── cwe_loader.py
│   │       └── curriculum_loader.py
│   ├── targets/                # Target APKs
│   │   └── vulnerable_apks/
│   └── training_data/          # Generated training data
│       ├── dpo/                # DPO pairs
│       └── mlx/                # MLX format
├── webapp/                     # Streamlit web interface
├── tests/                      # Test suite
├── docs/                       # Documentation
├── scripts/                    # Utility scripts
├── experiments/                # Experiment tracking
└── docker/                     # Docker configurations
Migration Strategy¶
- V1 challenges remain functional (backward compatible)
- V2 challenges use a new loader with `version: 2` detection
- Models can progress through the V1 → V2 curriculum
- Training data from both versions can be combined
- Gradual migration: write new challenges as V2, optionally convert high-value V1
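The `version: 2` detection amounts to a small dispatch in the loader. A sketch, with the dispatcher name and return shape assumed for illustration:

```python
def load_challenge(raw: dict):
    """Dispatch on the schema's `version` field. Challenges without a
    version key (or with version 1) keep the legacy V1 path unchanged."""
    version = raw.get("version", 1)
    if version == 2:
        return ("v2", raw)  # would hand off to the ChallengeV2 parser
    return ("v1", raw)      # legacy V1 loader, untouched
```

Defaulting a missing `version` key to 1 is what keeps every existing V1 challenge loading exactly as before.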
Success Metrics¶
Model Capability Metrics¶
- Observation Accuracy: % of key artifacts correctly identified
- Hypothesis Validity: % of hypotheses that are testable and relevant
- Verification Rate: % of hypotheses correctly verified/falsified
- Root Cause Depth: Expert rating of analysis depth (1-5)
- False Positive Rate: % of secure code incorrectly flagged
- Transfer Success: % of patterns recognized in new contexts
- Discovery Rate: % of synthesis challenges with correct findings
Training Data Quality Metrics¶
- Reasoning Chain Completeness: % of chains with all phases captured
- DPO Pair Quality: Expert rating of chosen/rejected contrast
- Negative Example Coverage: % of vulnerability types with negative examples
- Pattern Family Coverage: # of distinct patterns with 5+ instances
Curriculum Coverage Metrics¶
- CWE Coverage: % of Android-relevant CWEs with challenges
- OWASP Coverage: % of Mobile Top 10 with challenges
- Tool Coverage: % of standard tools (Frida, ADB, etc.) exercised
- Difficulty Distribution: Even spread across 1-10 difficulty scale