# V2 Reasoning Quality Grading System

## Overview
This document defines how to evaluate model responses to V2 challenges. Unlike V1 (which graded command execution success), V2 grades reasoning quality across multiple dimensions.
## Grading Philosophy

### What We're Measuring
- **Reasoning Process**: HOW the model thinks, not just WHAT it concludes
- **Calibrated Confidence**: Does confidence match actual correctness?
- **Generalization**: Does understanding transfer to novel contexts?
- **Intellectual Honesty**: Does the model acknowledge uncertainty appropriately?
### What We're NOT Measuring
- Speed of response
- Verbosity (long ≠ better)
- Use of specific terminology (concepts matter, not jargon)
## Phase-Specific Rubrics

### Phase 1: Observation

**Goal:** Identify security-relevant facts without jumping to conclusions.
| Criterion | Weight | Scoring Guide |
|---|---|---|
| Completeness | 0.30 | Found 100% of key observations (1.0), >80% (0.8), >60% (0.6), >40% (0.4), <40% (0.2) |
| Accuracy | 0.30 | All correct (1.0), minor errors (0.8), some errors (0.6), major errors (0.2), wrong (0.0) |
| Relevance Ranking | 0.20 | Perfect prioritization (1.0), minor misordering (0.8), key items underranked (0.4) |
| No Hallucination | 0.20 | Zero hallucinations (1.0), minor hallucination (0.5), affects conclusions (0.0) |
**Automatic Scoring Signals:**

- Check the observation list against `ground_truth.key_observations`
- Detect hallucinated API names, paths, or methods
- Verify factual claims against the provided artifacts

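These signals can be approximated with lightweight matching before falling back to an LLM judge. Below is a minimal sketch of the completeness check, assuming `key_observations` is a list of short ground-truth strings; the helper name, fuzzy-matching approach, and threshold are illustrative, not part of the grader API (a production grader might use embedding similarity instead).

```python
from difflib import SequenceMatcher

def completeness_score(observations: list[str],
                       key_observations: list[str],
                       threshold: float = 0.6) -> float:
    """Fraction of ground-truth key observations covered by the response."""
    if not key_observations:
        return 1.0

    def covered(key: str) -> bool:
        # A key observation counts as "found" if any response observation
        # is sufficiently similar to it.
        return any(
            SequenceMatcher(None, key.lower(), obs.lower()).ratio() >= threshold
            for obs in observations
        )

    return sum(covered(key) for key in key_observations) / len(key_observations)
```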
**DPO Pair Generation:**

- Chosen: Complete observations with correct relevance ranking
- Rejected: Incomplete OR contains hallucinations OR jumps to conclusions

### Phase 2: Hypothesis Formation

**Goal:** Form specific, testable hypotheses from observations.
| Criterion | Weight | Scoring Guide |
|---|---|---|
| Validity | 0.25 | Logically follows from observations (1.0), minor gaps (0.7), logical issues (0.4), contradicts evidence (0.0) |
| Testability | 0.25 | Clear verification/falsification plan (1.0), partially testable (0.7), vague (0.4), not testable (0.0) |
| Specificity | 0.20 | Exact location, mechanism, impact (1.0), mostly specific (0.7), vague but actionable (0.4), too vague (0.0) |
| Coverage | 0.15 | Addresses all observations (1.0), most (0.7), some (0.4), ignores key points (0.0) |
| CWE Mapping | 0.15 | Exact match (1.0), related CWE (0.7), wrong but same category (0.4), completely wrong (0.0) |
**Automatic Scoring Signals:**

- Parse the hypothesis structure for required elements
- Check the CWE against ground_truth
- Verify confidence is numeric, 0.0-1.0
- Check that the test plan describes concrete steps

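The parser assumes hypotheses arrive in a predictable structure. As an illustration only (the field names below are assumptions, not the canonical schema), a hypothesis record carrying the required elements might look like this:

```python
# Illustrative hypothesis record; the real schema is defined by the challenge format.
example_hypothesis = {
    "location": "app/routes/search.py: build_query()",        # exact location
    "mechanism": "user input concatenated into a SQL string",  # how it goes wrong
    "impact": "attacker-controlled WHERE clause, data exfiltration",
    "cwe_id": "CWE-89",
    "confidence": 0.8,          # must be numeric, 0.0-1.0
    "test_plan": "Submit ' OR '1'='1 and compare result counts",
}
```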
**DPO Pair Generation:**

- Chosen: Specific, testable hypotheses with correct CWE and calibrated confidence
- Rejected: Vague hypotheses OR untestable OR wrong CWE OR overconfident

### Phase 3: Root Cause Analysis

**Goal:** Explain WHY the vulnerability exists at a fundamental level.
| Criterion | Weight | Scoring Guide |
|---|---|---|
| Depth | 0.30 | Identifies fundamental principle (1.0), connects to principles (0.7), surface level (0.4), only symptom (0.0) |
| Accuracy | 0.25 | Matches expert analysis (1.0), correct direction (0.7), partially correct (0.4), wrong (0.0) |
| Generalization | 0.25 | Identifies pattern family and variants (1.0), pattern only (0.7), limited (0.4), isolated case (0.0) |
| Taxonomy | 0.20 | Correct CWE chain (1.0), correct primary (0.7), related (0.4), wrong (0.0) |
**Key Indicators of Depth:**

- Mentions fundamental principles (separation of code/data, least privilege, defense in depth)
- Explains why the PATTERN exists, not just this instance
- Identifies similar vulnerabilities in other contexts
- Uses the CWE hierarchy (variant → base → class)

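For example, a response that places SQL injection inside its parent chain rather than treating it as an isolated tag is a signal of taxonomic depth. A simplified sketch of such a chain (abbreviated CWE titles; the full hierarchy has more levels):

```python
# Simplified parent chain for SQL injection, from specific to general.
cwe_chain = [
    ("CWE-89", "SQL Injection"),
    ("CWE-943", "Improper Neutralization of Special Elements in Data Query Logic"),
    ("CWE-74", "Injection"),
]
```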
**DPO Pair Generation:**

- Chosen: Deep analysis connecting to security principles with correct taxonomy
- Rejected: Surface description OR wrong root cause OR no generalization

### Phase 4: Negative Knowledge

**Goal:** Correctly identify secure code and explain WHY it is secure.
| Criterion | Weight | Scoring Guide |
|---|---|---|
| Correct Classification | 0.40 | Identifies as secure with high confidence (1.0), correct but low confidence (0.5), false positive (0.0) |
| Security Property ID | 0.30 | Identifies all key properties (1.0), most (0.7), some (0.4), none (0.0) |
| Attack Resistance | 0.20 | Explains all attack vector resistance (1.0), most (0.7), partial (0.4), none (0.0) |
| No False Positives | 0.10 | No false vulnerabilities claimed (1.0), claims nonexistent vulns (0.0) |
**Critical for Training:** This phase is essential for reducing false positives. Models must learn NOT to call things vulnerable when they are secure.

**DPO Pair Generation:**

- Chosen: Correct "not vulnerable" classification with an explanation of the security properties
- Rejected: False positive (claiming a vulnerability) OR an unexplained "secure" classification

## Composite Scoring

### Challenge Score Calculation

For multi-phase challenges:
```python
def calculate_challenge_score(phase_scores: list[PhaseScore],
                              challenge_type: str) -> float:
    """
    Calculate the overall challenge score from per-phase scores.

    Phase weights depend on the challenge type:
    - observation-only: 100% observation phase
    - hypothesis: 40% observation, 60% hypothesis
    - full_chain: 20% observe, 30% hypothesize, 30% verify, 20% analyze
    """
    weights = get_phase_weights(challenge_type)
    return sum(ps.score * weights[ps.phase] for ps in phase_scores)
```
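A minimal sketch of `get_phase_weights`, assuming the challenge-type strings from the docstring above; the phase keys follow the names used by `PhaseScore` ("observation", "hypothesis"), while "verification" and "analysis" are assumed names for the remaining phases.

```python
def get_phase_weights(challenge_type: str) -> dict[str, float]:
    """Map a challenge type to per-phase score weights (each set sums to 1.0)."""
    weights = {
        "observation-only": {"observation": 1.00},
        "hypothesis":       {"observation": 0.40, "hypothesis": 0.60},
        "full_chain":       {"observation": 0.20, "hypothesis": 0.30,
                             "verification": 0.30, "analysis": 0.20},
    }
    return weights[challenge_type]
```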
### Confidence Calibration Score

Track whether model confidence correlates with actual correctness:
```python
from statistics import mean

def calibration_score(predictions: list[Prediction]) -> float:
    """
    Calibration score derived from a binned calibration error
    (1 minus the reliability term of the Brier score decomposition).

    Perfect calibration: when the model says it is 80% confident,
    it should be correct 80% of the time.
    """
    if not predictions:
        return 1.0

    bins = bucket_by_confidence(predictions, num_bins=10)
    calibration_error = 0.0
    for bucket in bins:
        if len(bucket) > 0:
            avg_confidence = mean([p.confidence for p in bucket])
            accuracy = mean([p.is_correct for p in bucket])
            calibration_error += len(bucket) * (avg_confidence - accuracy) ** 2
    return 1 - (calibration_error / len(predictions))
```
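A sketch of the bucketing helper, assuming equal-width confidence bins over [0, 1] (the implementation detail is not prescribed elsewhere in this document):

```python
def bucket_by_confidence(predictions: list[Prediction],
                         num_bins: int = 10) -> list[list[Prediction]]:
    """Group predictions into equal-width confidence bins over [0, 1]."""
    bins: list[list[Prediction]] = [[] for _ in range(num_bins)]
    for p in predictions:
        # Clamp so that confidence == 1.0 lands in the last bin.
        index = min(int(p.confidence * num_bins), num_bins - 1)
        bins[index].append(p)
    return bins
```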
## DPO Training Data Generation

### Pair Generation Strategy

For each challenge, generate multiple response pairs:
```python
from dataclasses import dataclass

@dataclass
class DPOPair:
    challenge_id: str
    prompt: str      # Challenge + artifacts
    chosen: str      # Better response
    rejected: str    # Worse response
    margin: float    # How much better is chosen? (used for ranking)


def generate_dpo_pairs(challenge: ChallengeV2,
                       responses: list[Response]) -> list[DPOPair]:
    """
    Generate DPO training pairs from model responses.

    Strategies:
    1. Best vs worst response
    2. Correct vs incorrect
    3. Complete vs incomplete
    4. Calibrated vs overconfident
    """
    pairs = []

    # Sort responses by score, best first
    ranked = sorted(responses, key=lambda r: r.score, reverse=True)

    # Best vs worst
    if len(ranked) >= 2:
        pairs.append(DPOPair(
            challenge_id=challenge.id,
            prompt=format_challenge(challenge),
            chosen=ranked[0].text,
            rejected=ranked[-1].text,
            margin=ranked[0].score - ranked[-1].score
        ))

    # Generate synthetic rejected examples from common mistakes
    # (requires at least one real response to serve as the chosen side)
    if ranked:
        for mistake in challenge.training.common_mistakes:
            pairs.append(DPOPair(
                challenge_id=challenge.id,
                prompt=format_challenge(challenge),
                chosen=ranked[0].text,
                rejected=generate_mistake_response(challenge, mistake),
                margin=0.5  # Synthetic pairs get a fixed margin
            ))

    return pairs
```
### Common Rejected Response Patterns

For each pillar, generate rejected responses exhibiting common mistakes (a sketch of turning these mistake categories into synthetic rejected text follows the list):
**Static Analysis:**

- Hallucinating API names not in the code
- Missing obvious security issues
- Calling secure code vulnerable (false positive)

**Negative Knowledge:**

- Calling secure code vulnerable
- Not explaining WHY it's secure
- Missing security properties

**Root Cause:**

- Only describing WHAT, not WHY
- Surface-level "string concatenation is bad"
- Missing the fundamental principle

**Pattern Transfer:**

- Treating each context as unique
- Not recognizing the pattern
- Missing the unifying principle

**Methodology:**

- Jumping to conclusions without observations
- Untestable hypotheses
- No falsification criteria

**Taxonomy:**

- Wrong CWE classification
- Missing the parent chain
- Not knowing related CWEs

**Patch Analysis:**

- Missing incomplete patches
- Not understanding what the fix does
- Missing bypass opportunities

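As referenced above, here is a minimal sketch of `generate_mistake_response` (called from `generate_dpo_pairs`). The mistake keys and canned texts are purely illustrative; in practice, rejected responses would typically come from a weaker model or from perturbing a real response rather than from fixed templates.

```python
# Illustrative templates keyed by mistake category; not the real generation strategy.
MISTAKE_TEMPLATES = {
    "false_positive": "This code is vulnerable to injection because it builds a query string.",
    "surface_only": "The bug is that strings are concatenated, which is bad practice.",
    "no_test_plan": "The endpoint is probably vulnerable; no verification is needed.",
}

def generate_mistake_response(challenge: ChallengeV2, mistake: str) -> str:
    """Produce a synthetic rejected response exhibiting a known mistake pattern."""
    template = MISTAKE_TEMPLATES.get(mistake, "This looks vulnerable.")
    return f"{template}\n\n(Analysis of challenge {challenge.id})"
```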
## Automated Grading Implementation

### Grader Architecture
```python
class ReasoningGrader:
    """
    Grades model responses to V2 challenges.
    """

    def __init__(self, embedding_model: str = "text-embedding-3-small"):
        self.embedder = EmbeddingModel(embedding_model)
        self.llm_judge = None  # Optional LLM-as-judge for complex cases

    def grade_observation_phase(self,
                                response: str,
                                ground_truth: GroundTruth) -> PhaseScore:
        """Grade an observation phase response."""
        # Parse response into structured observations
        observations = self.parse_observations(response)

        # Check completeness against key_observations
        completeness = self.check_completeness(
            observations,
            ground_truth.key_observations
        )

        # Check accuracy (no factual errors)
        accuracy = self.check_accuracy(observations)

        # Check for hallucinations
        hallucination_score = self.detect_hallucinations(response)

        # Check relevance ranking
        relevance = self.check_relevance_ranking(observations)

        return PhaseScore(
            phase="observation",
            completeness=completeness,
            accuracy=accuracy,
            relevance=relevance,
            no_hallucination=hallucination_score,
            score=self.weighted_score([
                (completeness, 0.30),
                (accuracy, 0.30),
                (relevance, 0.20),
                (hallucination_score, 0.20)
            ])
        )

    def grade_hypothesis_phase(self,
                               response: str,
                               ground_truth: GroundTruth) -> PhaseScore:
        """Grade a hypothesis phase response."""
        hypotheses = self.parse_hypotheses(response)

        # Check hypotheses against the valid hypotheses
        validity = self.check_hypothesis_validity(
            hypotheses,
            ground_truth.valid_hypotheses
        )

        # Check testability
        testability = self.check_testability(hypotheses)

        # Check CWE mapping
        cwe_accuracy = self.check_cwe_mapping(
            hypotheses,
            ground_truth.cwe_id
        )

        # Check confidence calibration
        calibration = self.check_confidence_calibration(hypotheses)

        return PhaseScore(
            phase="hypothesis",
            validity=validity,
            testability=testability,
            cwe_accuracy=cwe_accuracy,
            calibration=calibration,
            score=self.weighted_score([
                (validity, 0.25),
                (testability, 0.25),
                (cwe_accuracy, 0.15),
                # ... other weights
            ])
        )

    def detect_hallucinations(self, response: str) -> float:
        """
        Detect hallucinated content in a response.

        Checks for:
        - API names not in the original artifacts
        - File paths not in the original artifacts
        - Method names not in the original artifacts
        - Made-up CVE numbers
        """
        # Extract technical terms from the response
        terms = self.extract_technical_terms(response)

        # Check against artifact content
        artifact_terms = self.extract_artifact_terms()

        # Calculate hallucination rate
        hallucinated = [t for t in terms if t not in artifact_terms]

        if len(terms) == 0:
            return 1.0
        return 1.0 - (len(hallucinated) / len(terms))
```
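The term extraction that feeds `detect_hallucinations` can start as simple pattern matching. A rough sketch is shown below; the regexes are assumptions and would need tuning against real artifacts (and the same extraction would be applied to the artifacts themselves to build `artifact_terms`).

```python
import re

def extract_technical_terms(text: str) -> set[str]:
    """Pull identifier-like tokens that can be checked against the artifacts."""
    patterns = [
        r"\b[A-Za-z_][A-Za-z0-9_]*\s*\(",   # function/method calls, e.g. execute(
        r"(?:/[\w.\-]+)+",                   # unix-style file paths
        r"\bCVE-\d{4}-\d{4,}\b",             # CVE identifiers
        r"\bCWE-\d+\b",                      # CWE identifiers
    ]
    terms: set[str] = set()
    for pattern in patterns:
        # findall returns full match strings; trim trailing "(" from call matches.
        terms.update(m.rstrip("( ").strip() for m in re.findall(pattern, text))
    return terms
```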
### LLM-as-Judge for Complex Criteria

Some criteria require LLM judgment:
```python
class LLMJudge:
    """
    Use an LLM to evaluate complex criteria.
    """

    def judge_root_cause_depth(self,
                               response: str,
                               ground_truth: str) -> float:
        """
        Judge whether the response shows deep understanding.
        """
        prompt = f"""
        Evaluate this security analysis response for depth of understanding.

        GROUND TRUTH ANALYSIS:
        {ground_truth}

        MODEL RESPONSE:
        {response}

        Score from 0.0 to 1.0 on these criteria:
        1. Does it identify the FUNDAMENTAL security principle violated?
        2. Does it go beyond surface description to explain WHY?
        3. Does it connect to broader patterns?

        Return JSON: {{"score": float, "reasoning": string}}
        """
        return self.call_judge_model(prompt)
```
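`call_judge_model` is left abstract above. A sketch of how it might call the judge and parse the JSON verdict follows; the `self.client.complete(...)` interface is hypothetical, standing in for whatever completion client the judge model is wired to.

```python
import json

class LLMJudge:  # continuation of the class sketched above
    def call_judge_model(self, prompt: str) -> float:
        """Send the grading prompt to the judge model and parse its JSON verdict."""
        raw = self.client.complete(prompt)  # hypothetical completion client
        try:
            verdict = json.loads(raw)
            return float(verdict["score"])
        except (json.JSONDecodeError, KeyError, TypeError, ValueError):
            # Unparseable judge output: treat as a zero score rather than failing hard.
            return 0.0
```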
## Metrics and Tracking

### Training Progress Metrics
```python
from dataclasses import dataclass

@dataclass
class TrainingMetrics:
    # Overall performance
    avg_challenge_score: float

    # Per-pillar breakdown
    pillar_scores: dict[str, float]

    # Per-belt breakdown (should increase with training)
    belt_scores: dict[str, float]

    # Confidence calibration
    calibration_score: float

    # False positive rate (critical for negative knowledge)
    false_positive_rate: float

    # Transfer success (holdout challenges)
    transfer_accuracy: float

    # Hallucination rate
    hallucination_rate: float
```
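A sketch of rolling graded results up into headline numbers, assuming each graded result carries its pillar, score, a false-positive flag, and a per-response hallucination rate; `GradedResult` and its field names are illustrative, not part of the grader API.

```python
from collections import defaultdict
from statistics import mean

def aggregate_metrics(results: list[GradedResult]) -> dict:
    """Roll up per-challenge grading results into a few headline metrics."""
    by_pillar: dict[str, list[float]] = defaultdict(list)
    for r in results:
        by_pillar[r.pillar].append(r.score)

    return {
        "avg_challenge_score": mean(r.score for r in results),
        "pillar_scores": {p: mean(scores) for p, scores in by_pillar.items()},
        "false_positive_rate": mean(r.is_false_positive for r in results),
        "hallucination_rate": mean(r.hallucination_rate for r in results),
    }
```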
### Key Success Indicators
| Metric | Target | Why It Matters |
|---|---|---|
| False Positive Rate | < 10% | Real value is NOT crying wolf |
| Calibration Score | > 0.85 | Model knows what it doesn't know |
| Transfer Accuracy | > 70% | Learning patterns, not memorizing |
| Hallucination Rate | < 5% | Trustworthy analysis |
| Root Cause Depth | > 0.7 avg | Understanding, not pattern matching |
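These targets can be checked automatically after each evaluation run. A small sketch using the thresholds from the table above (metric field names follow `TrainingMetrics`; root-cause depth would be read from the per-pillar breakdown rather than a dedicated field):

```python
def meets_targets(m: TrainingMetrics) -> dict[str, bool]:
    """Compare a metrics snapshot against the key success indicator targets."""
    return {
        "false_positive_rate": m.false_positive_rate < 0.10,
        "calibration_score": m.calibration_score > 0.85,
        "transfer_accuracy": m.transfer_accuracy > 0.70,
        "hallucination_rate": m.hallucination_rate < 0.05,
    }
```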
## Usage Example
```python
# Grade a model response
grader = ReasoningGrader()
response = model.generate(challenge.to_prompt())

# Grade each phase
phase_scores = []
for phase in challenge.phases:
    phase_response = extract_phase_response(response, phase.phase_id)
    score = grader.grade_phase(phase.phase_id, phase_response, challenge.ground_truth)
    phase_scores.append(score)

# Calculate overall score
overall_score = grader.calculate_challenge_score(phase_scores, challenge)

# Generate DPO pairs if we have multiple responses
if len(responses) >= 2:
    dpo_pairs = generate_dpo_pairs(challenge, responses)
    save_dpo_pairs(dpo_pairs)
```
## Integration with V2 Training Pipeline
```
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│  V2 Challenge   │────▶│ Model Response  │────▶│ ReasoningGrader │
│                 │     │                 │     │                 │
└─────────────────┘     └─────────────────┘     └────────┬────────┘
                                                         │
         ┌───────────────────────┬───────────────────────┤
         │                       │                       │
         ▼                       ▼                       ▼
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│  Phase Scores   │     │    DPO Pairs    │     │ Training Metrics│
│                 │     │                 │     │                 │
└────────┬────────┘     └────────┬────────┘     └────────┬────────┘
         │                       │                       │
         └───────────────────────┼───────────────────────┘
                                 │
                                 ▼
                        ┌─────────────────┐
                        │ Model Training  │
                        │   (SFT + DPO)   │
                        └─────────────────┘
```