CS292C-14: Computer-Aided Reasoning for Software
Lecture 14: Automated Program Repair

by Yu Feng

Program Repair Decomposition
Repair = Locate → Synthesize → Validate
  1. Fault Localization
    Precisely identify and prioritize suspicious code elements (statements, expressions, conditions) most likely responsible for observed failures, using statistical analysis and program spectra.
  2. Patch Generation (Synthesis)
    Systematically produce targeted code modifications at identified locations that restore correct behavior according to specifications without introducing regressions.
  3. Patch Validation
    Rigorously verify that the candidate patch satisfies all requirements through test suites, formal contracts, invariant checking, or formal verification.
We'll examine fault localization algorithms and synthesis techniques in depth, then explore how validation mechanisms complete the automated repair feedback loop with correctness guarantees.
Core Program Repair Pipeline
Locate
Identify suspicious code regions to focus repair efforts
Generate
Create candidate patches using templates or synthesis
Validate
Verify fixes pass tests, satisfy contracts, or formal specs
Rank
Order candidates by human-likeness or minimality
Real-World Example: Off-by-One Bug
Buggy Code
#include <string.h>

int index_of(char *s, char c) {
    int i, n = strlen(s);
    for (i = 0; i <= n; i++) {
        if (s[i] == c) return i;
    }
    return -1;
}
Failing Test
The test index_of("cat", '\0') == -1 fails.
Because the guard uses <= instead of <, the loop also examines the terminating '\0' at s[n], so the call returns 3 (the index of the terminator) rather than -1.
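To make the Locate → Generate → Validate pipeline concrete, here is a minimal, self-contained C sketch that repairs this bug by enumerating candidate loop guards and validating them against a tiny test suite (one passing test plus the failing test). The variant-selection parameter and the hard-coded tests are illustrative only; real tools patch the source and re-run the full suite.

#include <stdio.h>
#include <string.h>

/* index_of with the loop guard made selectable so candidate patches can be
 * spliced in: guard 0 is "i < n", guard 1 is the buggy "i <= n". */
static int index_of_variant(const char *s, char c, int guard) {
    int n = (int)strlen(s);
    for (int i = 0; guard ? (i <= n) : (i < n); i++) {
        if (s[i] == c) return i;
    }
    return -1;
}

/* Validation oracle: one passing test plus the failing test above. */
static int passes_tests(int guard) {
    if (index_of_variant("cat", 'a', guard) != 1) return 0;    /* passing test */
    if (index_of_variant("cat", '\0', guard) != -1) return 0;  /* failing test */
    return 1;
}

int main(void) {
    /* Locate: assume localization already flagged the loop guard.
     * Generate: enumerate the two candidate guards.
     * Validate: keep whichever candidate passes both tests. */
    for (int guard = 1; guard >= 0; guard--) {
        printf("guard %-7s -> %s\n", guard ? "i <= n" : "i < n",
               passes_tests(guard) ? "plausible patch" : "rejected");
    }
    return 0;
}

Running it reports the i <= n guard as rejected and the i < n guard as a plausible patch.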
PART I Fault Localization
Fault localization is the first key step in fixing programs automatically. It finds the parts of code most likely causing bugs or failures. This process uses data from how the program runs to pinpoint where the problems are.
The method looks at how code behaves during both successful and failed tests. It gives each piece of code a score based on how often it appears in failed tests. Higher scores mean that changing this code might fix the bug, helping repair systems focus on the most promising spots first.
By finding the most likely bug locations, good fault localization makes automated repair more efficient. This narrows down where to look, saving time and computing power when creating and testing possible fixes.
Fault Localization Techniques
Spectrum-Based (SBL)
Compares how code is covered in passing vs. failing tests, giving each line a score for how likely it is to contain the bug.
Tools: GZoltar, Tarantula
Dynamic Slicing & Causal Tracing
Follows error paths backward to find only the code parts that directly affect the failure.
Tools: LocFaults, Pan & Zeller slices
Predicate & Value Switching
Changes decision points or values during program runs; if a switch makes the failure disappear, that spot is likely the problem.
Tools: Delta-Debugging, STAR
Mutation-Based Suspicion
Makes small changes to the code; places where these changes make failing tests pass are likely bug spots.
Tools: Metallaxis, MUSE
Model & Invariant-Based Diagnosis
Checks if the program follows expected rules and patterns, flagging places where rules are broken.
Tools: Daikon-based analyzers, Baah's probabilistic PDG
Statistical & ML-Guided Ranking
Uses machine learning to predict where bugs might be, based on patterns from previous fixes.
Tools: DeepFL, Learn2Fix
Spectrum-Based Localization
A systematic approach to pinpoint likely bug locations in code:
Collect Coverage Information
Create a matrix where each test maps to a vector of executed code lines
Calculate Suspiciousness
Apply a formula such as Ochiai, Tarantula, or DStar to score each line (see the sketch after this list)
Prioritize Investigation
Rank lines by suspiciousness score and examine highest-scoring locations first
In our off-by-one example: the loop guard receives a high suspiciousness score of approximately 0.91
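For illustration, a small C sketch of the Ochiai score computed from per-line coverage counts (ef: failing tests that execute the line, nf: failing tests that do not, ep: passing tests that execute it). The formula ef / sqrt((ef + nf) * (ef + ep)) is standard; the coverage counts below are hypothetical, chosen so the loop guard scores roughly 0.91 as above.

#include <math.h>
#include <stdio.h>

/* Ochiai suspiciousness: ef / sqrt((ef + nf) * (ef + ep))
 *   ef: failing tests that execute the line
 *   nf: failing tests that do not execute the line
 *   ep: passing tests that execute the line */
static double ochiai(int ef, int nf, int ep) {
    double denom = sqrt((double)(ef + nf) * (double)(ef + ep));
    return denom > 0.0 ? ef / denom : 0.0;
}

int main(void) {
    /* Hypothetical spectrum: the loop guard is executed by every failing test. */
    printf("loop guard: %.2f\n", ochiai(5, 0, 1));  /* 5 / sqrt(5 * 6) ~= 0.91 */
    printf("return -1:  %.2f\n", ochiai(1, 4, 6));  /* rarely on failing runs  */
    return 0;
}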
Mutation-Based Suspicion
Mutation-based fault localization makes small changes (mutations) to the code and watches how tests respond. It works on a simple idea: if changing a piece of code makes a failing test pass, that code is likely where the bug is.
Core Principle
If changing a specific spot fixes a failing test, that spot is likely buggy. This method tracks how mutations affect both passing and failing tests to assign suspicion scores.
Mutation Process
Makes systematic changes like swapping variables, flipping conditions, and changing math operations to create different program versions (mutants). Each mutant is tested to see how its behavior changes.
Suspiciousness Calculation
Uses formulas that look at: (1) how many failing tests start passing when a location is changed, and (2) how many passing tests stay passing despite the change. Higher scores point to more likely bug spots.
Key Tools
Metallaxis and MUSE are the main tools for this approach. Metallaxis applies spectrum-style suspiciousness formulas to each mutant's test results, while MUSE weighs how many failing tests a mutation makes pass against how many passing tests it makes fail.
While effective, this method can be slow because it needs to create and test many program versions. Modern tools use sampling and parallel processing to speed things up.
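A minimal sketch of the scoring idea just described: for each mutated location, count the failing tests that start passing and the passing tests that keep passing. The weighting and the numbers are a simplified illustration, not the exact Metallaxis or MUSE formula.

#include <stdio.h>

/* Per-location mutation statistics (hypothetical numbers). */
typedef struct {
    const char *location;
    int fail_to_pass;   /* failing tests that pass under some mutant here    */
    int total_failing;
    int pass_to_pass;   /* passing tests that still pass under those mutants */
    int total_passing;
} MutStats;

/* Simplified score: reward mutants that fix failing tests and preserve
 * passing tests (not the exact MUSE or Metallaxis formula). */
static double suspicion(MutStats m) {
    double fixed  = (double)m.fail_to_pass / m.total_failing;
    double intact = (double)m.pass_to_pass / m.total_passing;
    return 0.5 * (fixed + intact);
}

int main(void) {
    MutStats locs[] = {
        {"loop guard (i <= n)", 1, 1,  9, 10},  /* mutating it fixes the failure */
        {"return -1",           0, 1, 10, 10},  /* mutating it never helps       */
    };
    for (int i = 0; i < 2; i++)
        printf("%-22s %.2f\n", locs[i].location, suspicion(locs[i]));
    return 0;
}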
Dynamic Slicing & Causal Tracing
Follows error paths backward to find only the code parts that directly affect the failure.
Dynamic Slicing
Identifies precisely which executed statements influenced a program's output or state at a specific point, creating a reduced program subset containing only statements relevant to the error.
Causal Tracing
Establishes cause-effect relationships by tracking how values propagate through execution, linking faulty outputs to their root causes through data and control dependencies.
Key Advantages
Produces significantly smaller code subsets to examine compared to static methods. Particularly effective for complex bugs where traditional coverage-based approaches identify too many suspicious locations.
These techniques excel at handling complex execution flows and can significantly reduce debugging time by focusing developer attention on the most relevant code regions.
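A tiny worked example, assuming the slicing criterion is the wrong value of avg printed on the failing run; the comments mark which executed statements a backward dynamic slice would keep. The program and the marked slice are illustrative only.

#include <stdio.h>

/* Slicing criterion: the value of avg printed on the failing run.
 * A backward dynamic slice keeps only the executed statements the
 * criterion depends on through data or control dependences. */
int main(void) {
    int data[3] = {2, 4, 6};              /* in slice: read by the updates       */
    int sum = 0;                          /* in slice: avg is computed from sum  */
    int calls = 0;                        /* NOT in slice: never reaches avg     */
    for (int i = 0; i < 3; i++) {         /* in slice: controls the updates      */
        sum -= data[i];                   /* in slice: buggy update ('-' vs '+') */
        calls++;                          /* NOT in slice                        */
    }
    printf("log: calls = %d\n", calls);   /* NOT in slice                        */
    int avg = sum / 3;                    /* in slice: defines the criterion     */
    printf("avg = %d\n", avg);            /* slicing criterion: prints -4, not 4 */
    return 0;
}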
Statistical & ML-Guided Ranking
Uses machine learning and statistical patterns from previous fixes to predict where bugs are most likely to be found.
Pattern Recognition
Analyzes thousands of historical bug fixes to identify code patterns and characteristics that frequently correlate with defects, creating statistical models of bug-prone code.
Feature-Based Ranking
Extracts features from code (complexity metrics, change history, developer information) and uses statistical models to rank suspicious locations based on their likelihood of containing bugs.
Learning From History
Leverages project-specific bug patterns and incorporates developer feedback on previous fixes to continuously improve prediction accuracy over time.
Hybrid Approaches
Combines traditional spectrum-based techniques with ML predictions to achieve more accurate rankings, especially for complex or previously unseen bug types.
Modern ML-guided approaches can reduce debugging effort by up to 90% compared to manual inspection by directing developers to the most likely fault locations first.
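A toy sketch of feature-based ranking: each suspicious location gets a feature vector (spectrum score, recent-change flag, complexity) and a learned model combines them into a fault probability. The features and weights here are invented for illustration and are not taken from DeepFL or Learn2Fix.

#include <math.h>
#include <stdio.h>

/* Hypothetical per-location features. */
typedef struct {
    const char *location;
    double spectrum_score;    /* e.g., Ochiai score from SBFL     */
    double recently_changed;  /* 1.0 if touched in recent commits */
    double complexity;        /* normalized cyclomatic complexity */
} Features;

/* Logistic model with made-up weights standing in for a trained model. */
static double fault_probability(Features f) {
    double z = -2.0 + 3.0 * f.spectrum_score
                    + 1.5 * f.recently_changed
                    + 0.8 * f.complexity;
    return 1.0 / (1.0 + exp(-z));
}

int main(void) {
    Features locs[] = {
        {"loop guard", 0.91, 1.0, 0.4},
        {"return -1",  0.17, 0.0, 0.1},
    };
    for (int i = 0; i < 2; i++)
        printf("%-12s P(fault) = %.2f\n", locs[i].location, fault_probability(locs[i]));
    return 0;
}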
PART II Patch Generation (Synthesis)
Once fault localization identifies suspicious code regions, patch generation creates potential fixes for the defect. This synthesis process transforms the identified problem into a solution by generating candidate patches that modify the original code.
Patch generation uses various techniques including template-based mutations, constraint solving, and semantic analysis to create fixes that both resolve the failing tests and preserve correct behavior on passing tests.
This critical second phase of automated program repair connects fault identification to validation, employing techniques like search-based exploration and semantic analysis to efficiently generate plausible fixes.
Patch Generation—One-Slide Panorama
Template Enumeration
Applies predefined edit patterns (operator changes, null-checks) at suspicious locations.
Tools: GenProg, PAR, SPR, TBar
Evolutionary Search
Explores template space using genetic programming with test-passing as fitness function.
Tools: Astor, GenProg-SE
Constraint Synthesis
Replaces faulty expressions with "holes" and uses symbolic execution to derive constraints for solvers.
Tools: SemFix, S3
Angelic Multi-Location Repair
Identifies values that would make failing runs succeed, then synthesizes edits to produce those values.
Tools: Angelix, Avatar
ML/LLM-Guided Repair
Leverages machine learning to generate or rank patches, validated through compilation and testing.
Tools: Prophet, DeepRepair, GiantRepair
Proof-Driven Synthesis
Uses formal verification techniques to generate patches guaranteed to be correct.
Tools: KLEE-Repair, Caper, SyGuS-based repair
Template-Driven Patch Generation
Uses a predefined library of common fix patterns derived from human-written patches to generate candidate repairs at suspicious locations.
Key repair templates include:
  • Expression modification (changing operators, e.g., <= to <)
  • Null/bounds checking insertion
  • Method replacement
  • Exception handling
Tools: GenProg (mutation operators), PAR (human-inspired templates), SPR (staged prioritization), TBar (template bank with 34 patterns)
Example: Null-Check Template
// Original buggy code
return data.getValue();

// Template-based patch
if (data != null) {
    return data.getValue();
} else {
    return 0; // default value
}
Template enumeration approaches search the repair space systematically by:
  • Ranking templates by historical success rate
  • Applying templates at suspicious locations identified during fault localization
  • Filtering candidates through test execution
While efficient for common bugs, template-based approaches may miss complex, multi-location defects requiring custom fixes.
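A minimal sketch of the enumerate-and-filter loop in the list above: templates carry hypothetical historical success rates, are tried in that order at the suspicious location, and a stub stands in for compiling and running the test suite. The template names and rates do not reflect the actual banks of GenProg, PAR, SPR, or TBar.

#include <stdio.h>
#include <string.h>

/* A template bank with hypothetical historical success rates,
 * already ordered by rate as the strategy above suggests. */
typedef struct {
    const char *name;
    double success_rate;   /* invented numbers, used only for ordering */
} Template;

/* Stub for "apply template at the suspicious location, recompile, and run
 * the test suite": only the operator-change template fixes the off-by-one. */
static int patch_passes_tests(const Template *t) {
    return strcmp(t->name, "operator change (<= -> <)") == 0;
}

int main(void) {
    Template bank[] = {
        {"null/bounds check insertion", 0.31},
        {"operator change (<= -> <)",   0.24},
        {"method replacement",          0.12},
        {"wrap with exception handler", 0.07},
    };
    int n = (int)(sizeof bank / sizeof bank[0]);
    for (int i = 0; i < n; i++) {
        printf("trying %-30s ", bank[i].name);
        if (patch_passes_tests(&bank[i])) { printf("-> plausible patch\n"); break; }
        printf("-> rejected\n");
    }
    return 0;
}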
Semantic (Constraint-Based) Patch Generation
How It Works
  • Identifies suspicious expressions through fault localization
  • Replaces faulty expression with synthesis "hole" (??)
  • Uses symbolic execution to extract path constraints from passing & failing tests
  • Employs SMT solvers to find expressions that satisfy all constraints
  • Synthesizes valid repairs over a component-based grammar
Key Tools: SemFix (first semantic repair), S3 (scales to complex expressions), JFIX (Java repair)
// Buggy code with off-by-one error
for (int i = 0; i <= n; i++) {
    array[i] = 0;   // Crashes when i == n
}

// SemFix repair process:
// 1. Replace condition with hole: for (int i = 0; ??; i++)
// 2. Extract expected behavior from tests
// 3. Synthesize condition: i < n

// Final repair:
for (int i = 0; i < n; i++) {
    array[i] = 0;   // Correct bound check
}
Unlike template-based approaches, semantic repair can synthesize complex expressions not seen in predefined patterns. SemFix synthesizes i < n by analyzing the program state during correct executions and deriving constraints that prevent the array bounds violation.
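One way to state the synthesis query handed to the solver (our notation, not SemFix's exact encoding): find an expression e for the hole ?? such that the patched program agrees with every test, passing and failing alike.

\exists e.\ \bigwedge_{t \in T_{\mathrm{pass}} \cup T_{\mathrm{fail}}} P[\,?? \mapsto e\,](\mathit{in}_t) = \mathit{out}_t

For the loop above, e = (i < n) satisfies the constraints extracted from all executions, while the original e = (i <= n) violates the failing test's constraint.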
Angelic Multi-Location Repair
Angelic repair is an advanced program repair technique that operates in two phases: first identifying values that would make failing tests pass ("angelic values"), then synthesizing code changes that produce those values.
Tools: Angelix performs symbolic analysis to identify angelic values and then synthesizes expressions that satisfy extracted constraints. Avatar extends this approach by combining template-based and constraint-based techniques for more complex repairs.
Identify Angelic Values
Uses symbolic execution to find values at suspicious locations that would make failing tests pass
Extract Path Conditions
Analyzes program state and control flow to understand relationships between variables
Synthesize Repairs
Generates code modifications across multiple locations that produce the angelic values
Validate Repairs
Verifies synthesized changes with test suite and ranks plausible repairs
// Buggy code with potential division by zero
int compute(int a, int b) {
    int result = 0;
    if (a > 0)
        result = 100 / b;   // May crash when b = 0
    return result;
}

// Angelix repair process:
// 1. Identifies b = 0 causes failure
// 2. Discovers guard condition needed
// 3. Synthesizes correct patch

// Final multi-location repair:
int compute(int a, int b) {
    int result = 0;
    if (a > 0 && b != 0)    // Added condition
        result = 100 / b;
    return result;
}
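To make the first phase concrete, here is a hedged sketch of the instrumentation idea: the suspicious guard is replaced by an unconstrained "angelic" value during symbolic analysis. The instrumentation style is illustrative and not Angelix's actual implementation.

#include <stdbool.h>
#include <stdio.h>

/* Phase 1 instrumentation (sketch): the suspicious condition is replaced
 * by an unconstrained angelic value supplied by the symbolic engine. */
static int compute_angelic(int a, int b, bool alpha) {
    int result = 0;
    if (alpha)               /* angelic placeholder for the original guard */
        result = 100 / b;
    return result;
}

int main(void) {
    /* Symbolic execution searches for per-test values of alpha that make every
     * failing test pass (e.g., alpha = false whenever b == 0) while keeping the
     * passing tests green; synthesis then derives a concrete expression over
     * program variables, a > 0 && b != 0, that reproduces those values. */
    printf("%d\n", compute_angelic(5, 0, false));  /* failing input: now prints 0 */
    printf("%d\n", compute_angelic(5, 4, true));   /* passing input: still 25     */
    return 0;
}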
ML-Guided Patch Ranking/Generation
ML-guided repair approaches learn P(Δ | context) from historical commits to prioritize patches that resemble human fixes.
This probabilistic model helps explore the most human-like patches first, significantly reducing the search space of potential repairs.
  • Prophet: learns a probabilistic model of correct code from successful human patches and uses it to rank candidate fixes (Long & Rinard, POPL 2016)
  • TBar: Combines templates with ML ranking
  • GiantRepair: Integrates ML prior with multiple template strategies
Concrete Example
// Bug: Missing null check
void process(String s) {
    int len = s.length();
    // ...
}

// Candidate patches ranked by ML:
// 1. if (s != null) { int len = s.length(); }
// 2. int len = s == null ? 0 : s.length();
// 3. try { int len = s.length(); } catch (...) { ... }
The ML model assigns higher probability to patch #1 based on patterns learned from thousands of similar fixes in repositories, prioritizing it for validation.
PART III Validation/Verification
Validation/verification determines whether a generated patch correctly fixes a bug without introducing new issues. This crucial step is challenging due to the oracle problem: how do we know if a patch is truly correct?
Test-Based Validation
Uses existing test suites as oracles to verify patches pass all tests, but limited by test suite quality and coverage.
Semantic Equivalence
Formal verification techniques establish correctness by proving properties about the repaired program using logical specifications.
Overfitting Challenge
Patches may pass all tests but fail to generalize to untested inputs, making validation fundamentally incomplete.
Human-in-the-Loop
Developer feedback serves as an ultimate oracle, combining automated validation with expert judgment on patch quality.
Most APR systems validate using regression test suites, but research increasingly incorporates additional oracles like semantic specifications, invariant checking, and statistical models to approximate human judgment of patch correctness.
The Oracle Problem in Program Repair
The greatest irony in program repair is that we don't know the ground truth: the correct implementation that would fix the bug.
Without access to an oracle that can determine if a patch is truly correct, repair systems must navigate a vast solution space with limited guidance.
This fundamental challenge leads to issues like patch overfitting, where fixes pass tests but fail to generalize to untested inputs, or incorrect patches that introduce new bugs while fixing original issues.
No Ground Truth
If we knew exactly how to fix the bug, we wouldn't need automated repair.
Limited Oracles
Test suites provide incomplete specifications of correct behavior.
Patch Quality Assessment
Difficult to automatically evaluate if a patch is "correct" versus merely "passing tests".
Verified Patch Synthesis

Proof-Carrying Patches
Patches with mathematical correctness guarantees
Formal Specifications
Preconditions, postconditions, and invariants
Verification Tools
KleeRepair combines KLEE and CBMC for repair+verification
These approaches produce higher assurance patches but require more formal specifications and longer computation time.
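As a small illustration of the "formal specifications" ingredient, here is a hedged sketch of a postcondition for the index_of example phrased as executable assertions; a verifier or bounded model checker would discharge such a contract for all inputs, whereas the main below only checks two concrete calls.

#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Patched index_of from the running example. */
static int index_of(const char *s, char c) {
    int n = (int)strlen(s);
    for (int i = 0; i < n; i++)
        if (s[i] == c) return i;
    return -1;
}

/* Postcondition phrased as runtime checks: either the result is -1 and c
 * does not occur in s[0..n), or s[result] == c and no earlier index matches. */
static void check_contract(const char *s, char c) {
    int n = (int)strlen(s);
    int r = index_of(s, c);
    if (r == -1) {
        for (int i = 0; i < n; i++) assert(s[i] != c);
    } else {
        assert(0 <= r && r < n && s[r] == c);
        for (int i = 0; i < r; i++) assert(s[i] != c);
    }
}

int main(void) {
    check_contract("cat", 'a');
    check_contract("cat", '\0');   /* the previously failing case */
    printf("contract holds on the checked inputs\n");
    return 0;
}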
Semantic Equivalence in Program Repair
Semantic equivalence refers to two program versions that produce identical outputs for all possible inputs, despite having different implementations. In program repair, this concept is crucial for evaluating whether a patch correctly preserves the intended program behavior.
Key Concepts
  • Two programs are semantically equivalent if they compute the same function
  • Different syntactic implementations can have identical behavior
  • Critical for validating that repairs fix bugs without introducing new ones
  • Enables verification that patches preserve correct functionality
Example: Buggy vs. Repaired Array Sum
// Original implementation (with bug)
int sum(int arr[], int n) {
    int total = 0;
    // Bug: off-by-one error
    for (int i = 0; i <= n; i++)
        total += arr[i];
    return total;
}

// Repaired implementation (semantically equivalent to the intended sum of arr[0..n-1])
int sum(int arr[], int n) {
    int total = 0;
    // Fixed bound condition
    for (int i = 0; i < n; i++)
        total += arr[i];
    return total;
}
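A hedged sketch of how such equivalence is checked in practice: compare the repaired function against an independent reference implementation on many inputs. Random differential testing only approximates all-inputs equivalence; a genuine guarantee requires an equivalence prover or bounded model checker.

#include <stdio.h>
#include <stdlib.h>

/* Repaired implementation under validation. */
static int sum_repaired(const int arr[], int n) {
    int total = 0;
    for (int i = 0; i < n; i++) total += arr[i];
    return total;
}

/* Independent reference implementation used as the equivalence oracle. */
static int sum_reference(const int arr[], int n) {
    int total = 0;
    while (n-- > 0) total += arr[n];
    return total;
}

int main(void) {
    srand(1);
    for (int trial = 0; trial < 100000; trial++) {
        int arr[8];
        int n = rand() % 8;
        for (int i = 0; i < n; i++) arr[i] = rand() % 100 - 50;
        if (sum_repaired(arr, n) != sum_reference(arr, n)) {
            printf("counterexample found at trial %d\n", trial);
            return 1;
        }
    }
    printf("no difference observed on sampled inputs (not a proof)\n");
    return 0;
}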
APR Overfitting Challenge
Automated Program Repair (APR) systems face a critical challenge: patches may satisfy all test cases yet fail to implement the correct semantic behavior for untested inputs.
Test-Based Overfitting
Patches that pass all tests but implement incorrect functionality for inputs not covered by the test suite, creating an illusion of correctness.
Specification-Based Overfitting
Repairs that satisfy formal specifications but violate unstated assumptions or intended program behaviors that weren't explicitly formalized.
Patch Quality Assessment
Evaluating if a patch is genuinely correct versus merely satisfying available oracles remains fundamentally difficult without complete specifications.
Research increasingly focuses on detecting and preventing overfitting through techniques like automatic test generation, multi-objective patch evaluation, and semantic similarity metrics that compare patch behavior against developer-written fixes.
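A concrete instance of test-based overfitting, reusing the division-by-zero example from Part II: the candidate patch below passes a small (hypothetical) test suite but encodes a stronger guard than the intended b != 0, so it silently changes behavior for inputs the tests never exercise.

#include <stdio.h>

/* Candidate patch: guards the division with b > 0 instead of the
 * intended b != 0 (an overfitted guard). */
static int compute_patched(int a, int b) {
    int result = 0;
    if (a > 0 && b > 0)
        result = 100 / b;
    return result;
}

int main(void) {
    /* Hypothetical test suite used for validation: the patch passes all of it. */
    printf("%d\n", compute_patched(5, 4));    /* expected 25 -> 25 (pass)           */
    printf("%d\n", compute_patched(-1, 7));   /* expected 0  -> 0  (pass)           */
    printf("%d\n", compute_patched(5, 0));    /* expected 0  -> 0  (pass, no crash) */

    /* Untested input exposes the overfit: the intended guard b != 0 would
     * return -25 here, but the overfitted patch returns 0. */
    printf("%d\n", compute_patched(5, -4));
    return 0;
}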
APR Limitations
Test Overfitting
Patches may pass tests but break untested functionality
Execution Cost
Hundreds of test runs per candidate patch
Localization Dependency
Requires accurate fault identification
Template Constraints
Limited by available repair patterns
Essential Reading & Key Takeaways
Foundational Papers
  • GenProg: Automatic Patch Generation (ICSE 2012)
  • SemFix: Program Repair via Semantic Analysis (ICSE 2013)
  • Angelix: Scalable Multiline Program Patch Synthesis via Symbolic Analysis (ICSE 2016)
  • Prophet: Automatic Patch Generation by Learning Correct Code (POPL 2016)
  • A Critical Review of Automatic Software Repair (TOSEM 2018)
Key Takeaways
Automated Program Repair is essentially CEGIS for bug-fixing. Success depends on four key factors:
  1. Fault localization precision
  2. Patch space coverage
  3. Search space pruning
  4. Candidate ranking
From simple bugs to security issues, APR remains an active research frontier with practical applications.