RAG System
Retrieval-Augmented Generation for Security Analysis
The RAG system provides contextual knowledge to the LLM during security analysis, reducing hallucinations and improving accuracy by grounding responses in verified security documentation.
Architecture
┌──────────────────────────────────────────────────────────────────────┐
│                             RAG Pipeline                             │
├──────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  Challenge Input                                                     │
│         │                                                            │
│         ▼                                                            │
│  ┌──────────────┐     ┌───────────────────────────────────────────┐  │
│  │ Query Router │────▶│        Knowledge Bases (ChromaDB)         │  │
│  │(Pillar-aware)│     │ ┌──────────┐ ┌──────────┐ ┌─────────────┐ │  │
│  └──────────────┘     │ │ vuln_db  │ │ examples │ │ android_api │ │  │
│         │             │ └──────────┘ └──────────┘ └─────────────┘ │  │
│         │             │ ┌──────────┐                              │  │
│         │             │ │tool_docs │                              │  │
│         ▼             │ └──────────┘                              │  │
│  ┌──────────────┐     └─────────────────────┬─────────────────────┘  │
│  │ RAG Context  │◀──────────────────────────┘                        │
│  │ Builder      │                                                    │
│  └──────────────┘                                                    │
│         │                                                            │
│         ▼                                                            │
│  ┌──────────────┐                                                    │
│  │ Augmented    │───▶ LLM (Qwen 32B / MLX)                           │
│  │ Prompt       │                                                    │
│  └──────────────┘                                                    │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘
Knowledge Bases
The RAG system maintains four specialized knowledge bases:
| Knowledge Base | Contents | Source | Purpose |
|---|---|---|---|
| vuln_db | CWE definitions, OWASP Mobile Top 10 | MITRE, OWASP | Vulnerability classification |
| examples | Analysis examples from curriculum | challenges.yaml | Pattern learning |
| android_api | API docs, permissions, deprecations | AOSP | API accuracy |
| tool_docs | ADB, Frida, jadx commands | Tool documentation | Command accuracy |
Pillar-Based Routing
The query router weights knowledge bases based on the challenge pillar:
# Each pillar's weights sum to 1.0
PILLAR_KB_WEIGHTS = {
    "static_analysis": {"android_api": 0.4, "vuln_db": 0.3, "examples": 0.2, "tool_docs": 0.1},
    "root_cause": {"vuln_db": 0.5, "examples": 0.3, "android_api": 0.2},
    "taxonomy": {"vuln_db": 0.6, "examples": 0.3, "android_api": 0.1},
    "methodology": {"examples": 0.4, "tool_docs": 0.3, "vuln_db": 0.2, "android_api": 0.1},
    "patch_analysis": {"vuln_db": 0.4, "android_api": 0.3, "examples": 0.2, "tool_docs": 0.1},
    "negative_knowledge": {"examples": 0.4, "android_api": 0.3, "vuln_db": 0.3},
    "pattern_transfer": {"examples": 0.5, "vuln_db": 0.3, "android_api": 0.2},
}
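How the router applies these weights is not spelled out here. One plausible reading is that each knowledge base's raw similarity scores are scaled by its pillar weight before the results are merged. The sketch below assumes exactly that; `merge_weighted` and the `(text, score)` hit format are hypothetical and not part of `dojo.rag`.

```python
from typing import Dict, List, Tuple

# Two rows copied from the table above so the example is self-contained.
PILLAR_KB_WEIGHTS: Dict[str, Dict[str, float]] = {
    "root_cause": {"vuln_db": 0.5, "examples": 0.3, "android_api": 0.2},
    "taxonomy": {"vuln_db": 0.6, "examples": 0.3, "android_api": 0.1},
}

# A hit is (document text, raw similarity score in [0, 1]).
Hit = Tuple[str, float]


def merge_weighted(
    pillar: str,
    hits_by_kb: Dict[str, List[Hit]],
    top_k: int = 5,
) -> List[Tuple[str, str, float]]:
    """Scale each hit by its knowledge base's pillar weight and keep the best top_k.

    Hypothetical helper: knowledge bases with no weight for the pillar are skipped.
    """
    weights = PILLAR_KB_WEIGHTS[pillar]
    scored = [
        (kb, text, raw * weights[kb])
        for kb, hits in hits_by_kb.items()
        if kb in weights
        for text, raw in hits
    ]
    scored.sort(key=lambda item: item[2], reverse=True)
    return scored[:top_k]


hits = {
    "vuln_db": [("CWE-798: Use of Hard-coded Credentials", 0.80)],
    "examples": [("Hardcoded API key in BuildConfig", 0.90)],
    "tool_docs": [("jadx decompilation commands", 0.95)],  # no root_cause weight
}
ranked = merge_weighted("root_cause", hits, top_k=2)
for kb, text, score in ranked:
    print(f"[{kb}] {score:.2f}: {text}")
```

Under this assumption the `tool_docs` hit is dropped for `root_cause` despite having the highest raw score, because that pillar assigns it no weight.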
Components
EmbeddingPipeline
Generates embeddings using sentence-transformers:
from dojo.rag import EmbeddingPipeline, EmbeddingConfig
config = EmbeddingConfig(
    model_name="all-MiniLM-L6-v2",  # 384 dimensions, fast
    device="cpu",                   # Keep GPU free for LLM
    normalize_embeddings=True,
)
pipeline = EmbeddingPipeline(config)
embedding = pipeline.embed("SQL injection in ContentProvider")
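Setting `normalize_embeddings=True` matters for retrieval: once vectors have unit length, cosine similarity reduces to a plain dot product. The following sketch illustrates this with toy 3-dimensional vectors in pure Python. It does not call sentence-transformers, and the helper names are illustrative only.

```python
import math
from typing import List


def l2_normalize(vec: List[float]) -> List[float]:
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return list(vec) if norm == 0 else [x / norm for x in vec]


def dot(a: List[float], b: List[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity computed the long way, for comparison."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy stand-ins for real 384-dimensional embeddings.
query_vec = [3.0, 4.0, 0.0]
doc_vec = [4.0, 3.0, 0.0]

similarity_fast = dot(l2_normalize(query_vec), l2_normalize(doc_vec))
similarity_full = cosine(query_vec, doc_vec)
print(f"{similarity_fast:.2f} {similarity_full:.2f}")
```

Both values agree (24/25 for these vectors), so a store holding normalized vectors can rank documents using only dot products.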
RAGRetriever
Retrieves relevant documents from knowledge bases:
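from dojo.rag import RAGRetriever, RAGConfig
# vector_store is an already-initialized vector store; its construction is not shown here
config = RAGConfig()  # default settings, assumed to be what RAGRetriever expects
retriever = RAGRetriever(vector_store, config)
results = retriever.retrieve(
    query="WebView JavaScript interface vulnerability",
    pillar="static_analysis",
    top_k=5,
)
for result in results:
    print(f"[{result.source}] {result.score:.3f}: {result.content[:100]}...")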
RAGContextBuilder
Builds formatted context with token budgeting:
from dojo.rag import RAGContextBuilder
# retriever and config are the objects used in the RAGRetriever example
builder = RAGContextBuilder(retriever, config)
context = builder.build_context(
    query="hardcoded credentials",
    pillar="root_cause",
    max_tokens=2000,
)
print(context.formatted_context)
print(f"Sources: {context.sources}")
print(f"Tokens used: {context.token_count}")
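Token budgeting can be pictured as greedy packing: take documents in score order and keep each one that still fits under `max_tokens`. The sketch below is an assumption about how such a builder might work, not the actual `RAGContextBuilder` logic. `pack_context`, `BuiltContext`, and the whitespace-based `estimate_tokens` are hypothetical; a real implementation would most likely count tokens with the model's tokenizer.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class BuiltContext:
    """Hypothetical result type mirroring the attributes printed above."""
    formatted_context: str
    sources: List[str] = field(default_factory=list)
    token_count: int = 0


def estimate_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def pack_context(docs: List[Tuple[str, str, float]], max_tokens: int) -> BuiltContext:
    """Greedily keep the highest-scoring documents that fit the budget.

    docs holds (source, content, score) tuples.
    """
    parts: List[str] = []
    sources: List[str] = []
    used = 0
    for source, content, _score in sorted(docs, key=lambda d: d[2], reverse=True):
        entry = f"[{source}] {content}"
        cost = estimate_tokens(entry)
        if used + cost > max_tokens:
            continue  # would overflow the budget; a later, smaller document may still fit
        parts.append(entry)
        used += cost
        if source not in sources:
            sources.append(source)
    return BuiltContext("\n\n".join(parts), sources, used)


docs = [
    ("vuln_db", "CWE-798 covers credentials embedded in source code", 0.91),
    ("examples", "The API key is stored in strings.xml and shipped in the APK " * 3, 0.85),
    ("android_api", "Android Keystore keeps key material out of the app process", 0.80),
]
packed = pack_context(docs, max_tokens=20)
print(packed.sources, packed.token_count)
```

With a 20-token budget the long `examples` entry is skipped, while the shorter `android_api` entry after it still fits.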
RAGSystem
Unified interface combining all components:
from pathlib import Path
from dojo.rag import RAGSystem, RAGConfig
rag = RAGSystem(
    config=RAGConfig(),
    persist_dir=Path(".rag_data"),
)
# Retrieve documents
results = rag.retrieve("insecure storage", pillar="static_analysis")
# Build context for a challenge (challenge is an already-loaded curriculum challenge)
context = rag.build_context_for_challenge(challenge, phase_id="observe")
# Augment a prompt
augmented_prompt = rag.augment_prompt(
    prompt="Analyze this code for vulnerabilities...",
    query="SQL injection ContentProvider",
    pillar="static_analysis",
)
Data Loaders
OWASPMobileLoader
Loads OWASP Mobile Top 10 2024 data:
from dojo.rag.loaders import OWASPMobileLoader
loader = OWASPMobileLoader()
# Get all OWASP categories
categories = loader.get_owasp_ids() # ['M1', 'M2', ..., 'M10']
# Get CWE mappings
mappings = loader.get_cwe_mappings()
# {'M1': ['CWE-798', 'CWE-312', ...], 'M2': [...], ...}
# Load into a knowledge base (vuln_db_kb is the vuln_db knowledge base, e.g. rag.knowledge_bases["vuln_db"])
count = loader.load_into_kb(vuln_db_kb)
CWELoader
Loads CWE definitions from MITRE:
from dojo.rag.loaders import CWELoader
loader = CWELoader()
loader.download_cwe_data() # Downloads XML from MITRE
count = loader.load_into_kb(vuln_db_kb)
CurriculumLoader
Extracts examples from challenge ground truth:
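from pathlib import Path
from dojo.rag.loaders import CurriculumLoader
loader = CurriculumLoader()
# examples_kb is the examples knowledge base, e.g. rag.knowledge_bases["examples"]
count = loader.load_into_kb(
    examples_kb,
    challenges_dir=Path("dojo/curriculum/v2/pillars"),
)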
Setup & Population
One-Time Setup
# Install dependencies
pip install sentence-transformers chromadb
# Populate knowledge bases
python scripts/populate_rag.py
Population Script
# scripts/populate_rag.py
from pathlib import Path
from dojo.rag import RAGSystem, RAGConfig
from dojo.rag.loaders import OWASPMobileLoader, CurriculumLoader
# Initialize RAG system
rag = RAGSystem(persist_dir=Path(".rag_data"))
# Load OWASP Mobile Top 10
owasp_loader = OWASPMobileLoader()
owasp_loader.load_into_kb(rag.knowledge_bases["vuln_db"])
# Load curriculum examples
curriculum_loader = CurriculumLoader()
curriculum_loader.load_into_kb(rag.knowledge_bases["examples"])
print(rag.get_stats())
Integration with PraxisRunner
The RAG system integrates with PraxisRunner to augment prompts during the Praxis Loop:
from pathlib import Path
from dojo.graders.praxis_runner import PraxisRunner
# client and executor are the already-configured LLM client and MCP executor
runner = PraxisRunner(
    llm_client=client,
    mcp_executor=executor,
    enable_rag=True,
    rag_persist_dir=Path(".rag_data"),
    rag_max_tokens=2000,
)
# RAG context is automatically injected into prompts
result = runner.run_challenge(challenge)
How It Works
- Query Extraction: The challenge context is converted into a search query.
- Pillar Routing: The query is routed to the relevant knowledge bases, weighted by the challenge pillar.
- Retrieval: The top-k documents are retrieved by semantic similarity.
- Context Building: The retrieved documents are formatted to fit within the token budget.
- Prompt Augmentation: The formatted context is injected into the LLM prompt.
- Verification: The LLM response is grounded in the retrieved knowledge.
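The first and fifth steps in the list above can be sketched as follows. This is a minimal illustration under stated assumptions: `ToyChallenge` is a hypothetical stand-in because the real challenge schema is not shown here, and both the query heuristic and the prompt layout are guesses rather than the behaviour of `dojo.rag`.

```python
from dataclasses import dataclass


@dataclass
class ToyChallenge:
    """Hypothetical stand-in for a curriculum challenge."""
    title: str
    pillar: str
    description: str


def extract_query(challenge: ToyChallenge, max_words: int = 12) -> str:
    """Query extraction: reduce the challenge text to a short search query."""
    words = f"{challenge.title} {challenge.description}".split()
    return " ".join(words[:max_words])


def augment_prompt(prompt: str, context: str) -> str:
    """Prompt augmentation: place retrieved context ahead of the task, clearly delimited."""
    if not context.strip():
        return prompt  # nothing was retrieved, so the prompt is left untouched
    return (
        "Reference material (retrieved):\n"
        f"{context}\n\n"
        "Task:\n"
        f"{prompt}"
    )


toy = ToyChallenge(
    title="Exported ContentProvider",
    pillar="static_analysis",
    description="Find the injection point in the query() implementation",
)
search_query = extract_query(toy)
augmented = augment_prompt(
    "Analyze this code for vulnerabilities...",
    "[vuln_db] CWE-89: SQL Injection",
)
print(search_query)
print(augmented)
```

Keeping the retrieved material in its own labelled section makes it easy to see, when debugging, which parts of a prompt came from the knowledge bases.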
Configuration
from pathlib import Path
from dojo.rag import RAGConfig, EmbeddingConfig, ChunkingConfig
config = RAGConfig(
    # Embedding settings
    embedding=EmbeddingConfig(
        model_name="all-MiniLM-L6-v2",
        device="cpu",
        max_seq_length=256,
        batch_size=32,
    ),
    # Chunking settings
    chunking=ChunkingConfig(
        chunk_size=512,
        chunk_overlap=50,
    ),
    # Retrieval settings
    top_k=10,
    context_budget_tokens=2000,
    # Persistence
    persist_dir=Path(".rag_data"),
)
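The `chunk_size` and `chunk_overlap` settings suggest a sliding-window splitter in which consecutive chunks share a small overlap, so that a sentence cut at a boundary still appears whole in one of them. The sketch below assumes both values are measured in characters, which the configuration does not state. `chunk_text` is a hypothetical helper, not the project's `TextChunker`.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 512, chunk_overlap: int = 50) -> List[str]:
    """Split text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # distance between consecutive window starts
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


# 1200 characters of cycling letters, so overlaps can be checked by content.
document = "".join(chr(97 + i % 26) for i in range(1200))
chunks = chunk_text(document)
print([len(c) for c in chunks])
```

With the defaults, windows start every 462 characters, so a 1200-character document yields three chunks and the last 50 characters of each chunk reappear at the start of the next.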
Testing
# Run RAG test suite
python scripts/test_rag.py
# Expected output:
# ✓ Core RAG imports successful
# ✓ Knowledge base imports successful
# ✓ Loader imports successful
# ✓ Default config created
# ✓ Pillar weights defined for 7 pillars
# ✓ Single embedding generated (384 dimensions)
# ✓ OWASP Mobile Top 10 data loaded
# ✓ RAG system created
# Total: 6 passed, 0 failed
Manual Retrieval Test
from pathlib import Path
from dojo.rag import RAGSystem
rag = RAGSystem(persist_dir=Path(".rag_data"))
# Test retrieval
results = rag.retrieve("SQL injection", top_k=3)
for r in results:
    print(f"[{r.source}] {r.score:.3f}: {r.content[:80]}...")
Directory Structure
dojo/rag/
├── __init__.py              # Public API, RAGSystem class
├── config.py                # RAGConfig, EmbeddingConfig, pillar weights
├── embeddings.py            # EmbeddingPipeline, ChromaDBEmbeddingFunction
├── chunking.py              # TextChunker, CodeChunker, DocumentChunker
├── retriever.py             # RAGRetriever, QueryRouter, RetrievalResult
├── context_builder.py       # RAGContextBuilder, RAGContextInjector
├── prompt_augmenter.py      # RAGPromptAugmenter
├── knowledge_bases/
│   ├── __init__.py
│   ├── base.py              # BaseKnowledgeBase
│   ├── android_api.py       # AndroidAPIKnowledgeBase
│   ├── vuln_db.py           # VulnDBKnowledgeBase
│   ├── examples.py          # ExamplesKnowledgeBase
│   └── tool_docs.py         # ToolDocsKnowledgeBase
└── loaders/
    ├── __init__.py
    ├── cwe_loader.py        # CWELoader
    ├── owasp_loader.py      # OWASPMobileLoader
    └── curriculum_loader.py # CurriculumLoader
Benefits
- Reduced Hallucinations: LLM responses grounded in verified documentation
- Accurate CWE Mapping: Retrieves correct vulnerability classifications
- Context-Aware: Pillar-based routing provides relevant context
- Efficient: Token budgeting prevents context overflow
- Extensible: Easy to add new knowledge bases and loaders