
Dataset Card - Rippler Training and Evaluation Data

Overview

Rippler's LLM-based impact analysis system operates primarily in a zero-shot or few-shot mode, leveraging pre-trained Large Language Models without fine-tuning on custom datasets. However, the system processes and analyzes various types of data during operation, and evaluation datasets are used to validate performance.

This document describes the data sources, data types, and evaluation datasets used by the Rippler system.

Data Sources

1. Runtime Input Data (Primary Operational Data)

GitHub Pull Request Data

Source: GitHub API via webhooks and REST API
Data Collection: Real-time, event-driven
Volume: Variable (depends on organization activity)
Retention: Not retained after analysis (processed in-memory only)

Data Fields:

{
  "repository": {
    "name": "string",
    "owner": "string",
    "url": "string",
    "description": "string"
  },
  "pull_request": {
    "number": "integer",
    "title": "string",
    "description": "string",
    "author": "string",
    "branch": "string",
    "base_branch": "string",
    "created_at": "datetime",
    "labels": ["string"]
  },
  "changes": [
    {
      "file": "string (file path)",
      "type": "string (added|modified|deleted)",
      "additions": "integer",
      "deletions": "integer",
      "diff": "string (unified diff format)",
      "language": "string"
    }
  ],
  "dependencies": {
    "direct": ["string (service names)"],
    "transitive": ["string (service names)"]
  }
}
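
For illustration, the sketch below maps such a payload onto typed Python containers before analysis. The class and field names simply mirror the schema above and are hypothetical; this is not the actual Rippler implementation.

# Minimal sketch (not Rippler code) of typed containers mirroring the payload schema.
from dataclasses import dataclass
from typing import List

@dataclass
class FileChange:
    file: str          # file path
    type: str          # "added" | "modified" | "deleted"
    additions: int
    deletions: int
    diff: str          # unified diff text
    language: str

@dataclass
class PullRequestPayload:
    repository: dict
    pull_request: dict
    changes: List[FileChange]
    dependencies: dict   # {"direct": [...], "transitive": [...]}

    @classmethod
    def from_json(cls, data: dict) -> "PullRequestPayload":
        return cls(
            repository=data["repository"],
            pull_request=data["pull_request"],
            changes=[FileChange(**c) for c in data["changes"]],
            dependencies=data["dependencies"],
        )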

Privacy & Ethics:

  • ✅ Only metadata and code diffs processed (no user personal data)
  • ✅ Data not stored permanently by Rippler
  • ✅ Sent to LLM providers (OpenAI/Anthropic) per their privacy policies
  • ✅ Local Ollama option available for complete data privacy
  • ⚠️ Users should review LLM provider terms for proprietary code

Data Quality:

  • Depends on quality of Git commit messages and PR descriptions
  • Accuracy limited by completeness of dependency graph
  • Diff quality depends on code formatting and Git configuration

Dependency Graph Data

Source: Dependency Graph Engine service (internal)
Data Collection: Continuous analysis of code repositories
Volume: Depends on repository size and complexity

Data Structure:

  • Service-to-service dependencies
  • API endpoint mappings
  • Database schema dependencies
  • Transitive dependency chains

Privacy: Internal organization data only
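
A minimal sketch of how such a graph can be represented and queried is shown below: an adjacency list of direct service-to-service dependencies plus a breadth-first walk to derive transitive chains. The service names are invented and the structure is an assumption, not the Dependency Graph Engine's actual format.

# Hypothetical dependency graph: direct edges plus derived transitive chains.
from collections import deque

direct = {                        # service -> services it calls (illustrative names)
    "checkout": ["payments", "inventory"],
    "payments": ["ledger"],
    "inventory": [],
    "ledger": [],
}

def transitive_dependencies(service: str, graph: dict) -> set:
    """Breadth-first walk collecting every service reachable from `service`."""
    seen, queue = set(), deque(graph.get(service, []))
    while queue:
        dep = queue.popleft()
        if dep not in seen:
            seen.add(dep)
            queue.extend(graph.get(dep, []))
    return seen

print(transitive_dependencies("checkout", direct))
# {'payments', 'inventory', 'ledger'}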

2. Evaluation and Testing Data

To validate the quality and accuracy of Rippler's impact analysis, the team has curated evaluation datasets from public and synthetic sources.

Public Repository PR Dataset

Source: Public GitHub repositories (open source projects)
Size: ~500 pull requests
License: Per individual repository licenses (MIT, Apache 2.0, GPL, etc.)
Purpose: Model evaluation and system testing

Composition:

  • Programming Languages:
    • JavaScript/TypeScript: 35%
    • Python: 30%
    • Java: 20%
    • Go: 10%
    • Other (Ruby, Rust, C++): 5%
  • Repository Types:
    • Microservices architectures: 40%
    • Monolithic applications: 30%
    • Libraries/frameworks: 20%
    • DevOps/infrastructure: 10%
  • PR Types:
    • Feature additions: 45%
    • Bug fixes: 30%
    • Refactoring: 15%
    • Documentation: 10%

Notable Source Repositories (examples):

  • Node.js Express applications
  • Python FastAPI/Django projects
  • Java Spring Boot microservices
  • Kubernetes configurations
  • React/Next.js web applications

Annotation Process:

  • Human expert annotations for ground truth
  • Three developers independently review each PR
  • Consensus labels for impact level, affected services, and risk
  • Inter-annotator agreement: κ = 0.82 (substantial agreement; see the sketch below)
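
The sketch below illustrates the consensus step under stated assumptions: three independent labels per PR are reduced to a majority vote, and pairwise Cohen's kappa gives a rough agreement figure. The PR IDs and labels are invented, and the team's actual tooling and agreement statistic may differ.

# Illustrative consensus labeling and pairwise Cohen's kappa (invented data).
from collections import Counter
from itertools import combinations

annotations = {                       # pr_id -> [annotator1, annotator2, annotator3]
    "pr-101": ["high", "high", "medium"],
    "pr-102": ["low", "low", "low"],
    "pr-103": ["medium", "high", "medium"],
}

def consensus(labels):
    """Majority vote; ties would be escalated to discussion in practice."""
    return Counter(labels).most_common(1)[0][0]

def cohen_kappa(a, b):
    """Cohen's kappa for two annotators labeling the same items."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    cats = set(a) | set(b)
    p_e = sum((a.count(c) / n) * (b.count(c) / n) for c in cats)
    return (p_o - p_e) / (1 - p_e) if p_e < 1 else 1.0

labels = {pr: consensus(votes) for pr, votes in annotations.items()}
columns = list(zip(*annotations.values()))        # one column of labels per annotator
kappas = [cohen_kappa(list(x), list(y)) for x, y in combinations(columns, 2)]
print(labels, sum(kappas) / len(kappas))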

Biases and Limitations:

  • ⚠️ Skewed toward well-documented open source projects
  • ⚠️ May not represent proprietary enterprise code patterns
  • ⚠️ Limited representation of less popular languages
  • ⚠️ Primarily English language comments/documentation

Synthetic Test Dataset

Source: Internally generated test cases
Size: ~200 synthetic PRs
Purpose: Edge case testing and regression testing

Composition:

  • Breaking API changes
  • Database schema migrations
  • Configuration changes
  • Security-sensitive modifications
  • Large-scale refactorings
  • Multi-service cascading changes

Generation Method:

  • Manual creation by senior engineers
  • Automated diff generation for specific patterns
  • Known failure cases reproduced from production incidents (see the diff-generation sketch below)
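
As a rough illustration of the automated diff-generation idea, the sketch below mutates a known code pattern into a breaking API change and emits a unified diff that could be replayed as a synthetic PR. The file names and code are invented.

# Generate a synthetic "breaking API change" diff from a before/after pair.
import difflib

before = [
    "def get_user(user_id):\n",
    "    return db.fetch(user_id)\n",
]
after = [
    "def get_user(user_id, include_deleted=False):\n",   # breaking signature change
    "    return db.fetch(user_id, include_deleted)\n",
]

diff = difflib.unified_diff(before, after,
                            fromfile="a/users/service.py",
                            tofile="b/users/service.py")
print("".join(diff))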

3. Model Training Data (Indirect)

Rippler does not train custom models, but relies on pre-trained LLMs. These models were trained by their providers on large corpora:

GPT-4o-mini Training Data (OpenAI)

Training Cutoff: October 2023
Data Sources (per OpenAI documentation):

  • Public internet text (web pages, books, articles)
  • Public code repositories (GitHub and similar)
  • Academic papers and technical documentation
  • Conversational data (with appropriate privacy controls)

Known Biases:

  • English language dominance
  • Western cultural perspectives
  • Overrepresentation of popular programming languages
  • Temporal bias (knowledge cutoff)

Unknown Composition:

  • OpenAI does not disclose exact training data composition
  • Proprietary data processing and filtering applied

Anthropic Claude Training Data

Training Cutoff: Early 2024 (varies by version)
Data Sources: Similar to GPT-4o-mini (per Anthropic documentation)
Focus: Enhanced safety training and constitutional AI principles

Ollama Models (CodeLlama, Llama 2, Mistral)

Training Data:

  • CodeLlama: Llama 2 variant further trained on code from public repositories
  • Llama 2: General text including code
  • Mistral: General text with strong technical capabilities

Open Weights: Model weights are publicly available, but the training data composition is only partially documented, much as with the commercial providers

Data Processing Pipeline

1. Data Ingestion

GitHub Webhook → API Gateway → Launchpad Service → Data Validation & Sanitization → Dependency Graph Enrichment
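
A hypothetical sketch of the validation and sanitization stage is shown below: it checks required top-level fields and caps oversized diffs before dependency-graph enrichment. The field names follow the schema above; the limits and function name are illustrative, not Rippler's actual configuration.

# Hypothetical payload validation/sanitization before enrichment.
MAX_DIFF_CHARS = 20_000
REQUIRED_KEYS = {"repository", "pull_request", "changes", "dependencies"}

def sanitize_payload(payload: dict) -> dict:
    missing = REQUIRED_KEYS - set(payload)
    if missing:
        raise ValueError(f"payload missing required fields: {sorted(missing)}")
    for change in payload["changes"]:
        # Truncate very large diffs so a single file cannot dominate the context.
        if len(change.get("diff", "")) > MAX_DIFF_CHARS:
            change["diff"] = change["diff"][:MAX_DIFF_CHARS] + "\n... [truncated]"
    return payload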

2. LLM Input Preparation

Structured Data → Prompt Engineering → Token Counting → Context Window Management → LLM API Call

Prompt Engineering:

  • System prompt defines task and output format
  • User prompt includes PR metadata, diffs, and dependency graph
  • Few-shot examples included for complex analysis tasks
  • Context limited to roughly 8K tokens to balance analysis quality against latency and cost (see the sketch below)
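
The sketch below illustrates the context-window management step: per-file diffs are appended to the prompt until a rough 8K-token budget is exhausted. The 4-characters-per-token estimate is a crude stand-in for a real tokenizer, and the prompt layout is an assumption, not Rippler's actual template.

# Illustrative prompt assembly with a rough token budget (not Rippler's real template).
TOKEN_BUDGET = 8_000

def rough_tokens(text: str) -> int:
    return len(text) // 4            # crude heuristic, good enough for budgeting

def build_prompt(pr: dict, changes: list, dependencies: dict) -> str:
    parts = [
        f"PR #{pr['number']}: {pr['title']}\n{pr['description']}\n",
        f"Direct dependencies: {', '.join(dependencies['direct'])}\n",
    ]
    used = sum(rough_tokens(p) for p in parts)
    for change in changes:
        block = f"--- {change['file']} ({change['type']}) ---\n{change['diff']}\n"
        if used + rough_tokens(block) > TOKEN_BUDGET:
            parts.append("[remaining diffs omitted to stay within the context budget]\n")
            break
        parts.append(block)
        used += rough_tokens(block)
    return "".join(parts)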

3. Output Processing

LLM Response → JSON Parsing → Validation → Confidence Scoring → Storage (Audit Log Only)
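
A minimal sketch of this stage, assuming a simple response schema (impacted_services, risk_level, confidence): the model's JSON is parsed, validated, and downgraded to low confidence when it is malformed. The schema and thresholds are illustrative, not Rippler's actual output contract.

# Hypothetical parsing and validation of the LLM's JSON response.
import json

EXPECTED_FIELDS = {"impacted_services", "risk_level", "confidence"}
VALID_RISK_LEVELS = {"high", "medium", "low"}

def parse_llm_response(raw: str) -> dict:
    try:
        result = json.loads(raw)
        if not isinstance(result, dict):
            raise ValueError("expected a JSON object")
    except (json.JSONDecodeError, ValueError):
        # Unparseable output is surfaced as an explicit low-confidence result.
        return {"impacted_services": [], "risk_level": "unknown", "confidence": 0.0}
    missing = EXPECTED_FIELDS - set(result)
    if missing or result.get("risk_level") not in VALID_RISK_LEVELS:
        result["confidence"] = min(result.get("confidence", 0.0), 0.3)
    return result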

Data Privacy and Security

Personal Data Handling

  • No Personal Data Collection: Rippler does not collect, store, or process personally identifiable information (PII)
  • Code Author Information: GitHub usernames are metadata only, not linked to personal profiles
  • Audit Logs: Anonymized IDs used for logging and monitoring

Code Data Privacy

  • ⚠️ Cloud LLMs: Code diffs sent to OpenAI/Anthropic servers for analysis
    • Subject to provider privacy policies
    • Not used for model training (per OpenAI/Anthropic terms for API customers)
    • Recommend reviewing provider terms for proprietary code
  • Local Ollama Option: Complete data privacy, all processing on-premises
  • No Rippler Storage: Code diffs not stored after analysis

Compliance

  • GDPR: No personal data processing (code is not personal data under GDPR)
  • SOC 2: Audit logging and access controls in place
  • Data Residency: Cloud LLMs may process data in multiple regions (per provider policies)

Dataset Limitations

Coverage Limitations

  • Limited to Git-based workflows (no support for other VCS)
  • Requires GitHub integration (no GitLab/Bitbucket support yet)
  • English-language optimized (other languages have reduced accuracy)
  • Limited to textual code analysis (no runtime/observability data)

Quality Limitations

  • Accuracy depends on quality of input (commit messages, PR descriptions)
  • Dependency graph completeness varies by repository
  • Historical context limited to PR scope (no access to full project history)
  • No access to external documentation or tribal knowledge

Representativeness Limitations

  • Evaluation data skewed toward open source patterns
  • May not generalize to all enterprise code styles
  • Limited representation of legacy systems and rare languages
  • Focused on microservices architectures

Evaluation Metrics

Quantitative Metrics (on Evaluation Dataset)

Metric | Definition | GPT-4o-mini | Claude | CodeLlama 13B
Impact Detection Precision | Correctly identified impacted services / All identified | 89% | 88% | 75%
Impact Detection Recall | Correctly identified impacted services / All actually impacted | 92% | 91% | 78%
Risk Level Accuracy | Correct risk classification (high/med/low) | 88% | 87% | 72%
Stakeholder ID Accuracy | Correct stakeholder identification | 85% | 84% | 68%
False Positive Rate | Incorrectly flagged as impacted / Total negatives | 8% | 9% | 18%
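
For a single PR, these detection metrics can be computed by treating impact analysis as set prediction over services, as in the worked sketch below; the service names and numbers are invented.

# Worked example of precision, recall, and false positive rate for one PR.
predicted = {"payments", "ledger", "notifications"}   # services the model flagged
actual    = {"payments", "ledger"}                    # ground-truth impacted services
all_services = {"payments", "ledger", "notifications", "inventory", "search"}

true_pos  = predicted & actual
false_pos = predicted - actual
negatives = all_services - actual                     # services not actually impacted

precision = len(true_pos) / len(predicted)            # 2/3 ≈ 0.67
recall    = len(true_pos) / len(actual)               # 2/2 = 1.0
fp_rate   = len(false_pos) / len(negatives)           # 1/3 ≈ 0.33

print(precision, recall, fp_rate)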

Qualitative Evaluation

  • Human expert review of random samples (5% of production traffic)
  • User satisfaction surveys (planned)
  • A/B testing between models (ongoing)

Future Dataset Improvements

Planned Additions

  1. Fine-tuning Dataset: Curate organization-specific dataset for fine-tuning
  2. Multi-language Support: Expand evaluation data to more programming languages
  3. Historical Accuracy Tracking: Build dataset of Rippler predictions vs actual impact
  4. Incident Correlation: Link analyses to post-deployment incidents for validation

Community Contributions

  • Open source evaluation dataset (in progress)
  • Community-submitted test cases
  • Benchmark suite for impact analysis systems

Data Access and Licensing

Evaluation Dataset

  • Availability: Contact team for research access
  • License: TBD (likely CC BY-SA 4.0 for public portions)
  • Format: JSON files with PR metadata and annotations

Requesting Access

For research or evaluation purposes, contact the Rippler team:

  • Email: See README.md for team contacts
  • GitHub Discussions: [Link to discussions page]

Citation

If you use Rippler's evaluation dataset or methods in research, please cite:

@software{rippler2024,
  title={Rippler: Ripple Impact Prediction and Propagation Logging for Engineering Resilience},
  author={Rippler Team},
  year={2024},
  url={https://github.com/hanisntsolo/rippler}
}

Last Updated: November 2024
Version: 1.0
Maintained By: Rippler Team