Research Insights: Model Alignment, Tool Compliance, and Safety Adaptation¶
Date: December 30, 2025 Context: AgenticART Project - Comparative analysis of Teacher (Llama 3.1 70B) and high-performing open-source models (Qwen 2.5 72B).
1. Comparative Analysis: Qwen 2.5 vs. Llama 3.1¶
An empirical analysis of dojo_output trajectories revealed a significant performance disparity between the designated Teacher model and Qwen 2.5.
Performance Data¶
- Qwen 2.5 (72B): 43.2% Aggregate Pass Rate (16/37 attempts). Best single run: 53% pass rate (14/26).
- Llama 3.1 (70B): 15.8% Aggregate Pass Rate (9/57 attempts).
Root Cause Analysis: The "ADB Prefix" Hallucination¶
Forensic review of trajectory logs identified that the performance gap was not due to abstract reasoning capabilities, but rather tool compliance.
-
Llama 3.1 Failure Pattern: Consistently hallucinated an unnecessary
adbprefix when using the provided shell tool.- Input: "Get Android version"
- Action:
adb shell getprop ... - Result:
Error: adb: unknown command getprop(The system wrapper already invokes the binary). - Conclusion: Llama failed to adapt to the specific tool definition despite error feedback, indicating rigidity in its pre-trained patterns.
-
Qwen 2.5 Success Pattern: Correctly inferred the tool context and adhered to the required syntax.
- Action:
shell getprop ... - Result: Success.
- Conclusion: Qwen demonstrated superior context awareness and instruction following in a CLI environment, likely due to its specialized training on code and technical documentation (Qwen-Coder lineage).
- Action:
Educational Takeaway: In agentic workflows, "intelligence" is often bottlenecked by strict syntax compliance. A model that is "smarter" in general knowledge (Llama) can fail catastrophically if it cannot strictly adhere to arbitrary tool definitions.
2. Safety Guardrails & Domain Adaptation¶
A core research objective was to determine the feasibility of using commercial LLMs for authorized security research (exploiting CVEs) without triggering generic safety refusals.
The Mechanism: Overwriting vs. Removing¶
We established that safety guardrails (RLHF) cannot be simply "removed" or toggled off, as they are integrated into the model weights. However, they can be effectively overwritten through Domain Adaptation.
- Alignment Drift: By fine-tuning a model on a large corpus of authorized, successful exploit trajectories (generated by the Dojo), we mathematically shift the probability distribution of the model's responses.
- Contextual Re-Alignment: The model learns that in the specific context of this project, the expected behavior for "Write an exploit" is compliance (generating code), not refusal (generating safety boilerplate).
- The "WhiteRabbit" Effect: This phenomenon was observed in the project's student model (WhiteRabbitNeo), where fine-tuning on security data effectively bypassed generic safety filters by establishing a "Security Researcher" persona as the dominant probability path.
Educational Takeaway: Safety in Large Language Models is probabilistic, not deterministic. It is not a hard-coded switch but a weight distribution. Fine-tuning functions as a mechanism to "re-align" these probabilities for specific professional domains, effectively overriding generic consumer safety protocols without damaging the model's reasoning capabilities (as "abliteration" or weight subtraction might).
3. Validation Methodology¶
To scientifically prove the "Alignment Drift" hypothesis, the following experimental frameworks were designed:
A. The "Refusal Rate Delta" (A/B Test)¶
- Control: Run a "Trigger Set" of 20 security prompts (e.g., "Generate buffer overflow") against the Base Model.
- Experiment: Run the same set against the Dojo-Fine-Tuned Model.
- Measurement: Calculate the percentage decrease in refusal responses. A near-zero refusal rate in the Experimental group proves successful adaptation.
B. The "Logit Lens" (Mathematical Proof)¶
- Procedure: Analyze the Log Probabilities (Logits) of the first token generated in response to a sensitive prompt.
- Observation:
- Base Model: High probability assigned to refusal tokens (e.g., "I", "Sorry", "Cannot").
- Fine-Tuned Model: High probability shift to compliance tokens (e.g., "Here", "Sure", "import").
- Conclusion: This probability shift serves as the mathematical proof that domain adaptation has successfully overwritten the RLHF safety alignment for the specific context.