
Dataset Card - Rippler Training and Evaluation Data

Overview

Rippler's LLM-based impact analysis system operates primarily in a zero-shot or few-shot mode, leveraging pre-trained Large Language Models without fine-tuning on custom datasets. However, the system processes and analyzes various types of data during operation, and evaluation datasets are used to validate performance.

This document describes the data sources, data types, and evaluation datasets used by the Rippler system.

Data Sources

1. Runtime Input Data (Primary Operational Data)

GitHub Pull Request Data

Source: GitHub API via webhooks and REST API
Data Collection: Real-time, event-driven
Volume: Variable (depends on organization activity)
Retention: Not retained after analysis (processed in-memory only)

Data Fields:

{
  "repository": {
    "name": "string",
    "owner": "string",
    "url": "string",
    "description": "string"
  },
  "pull_request": {
    "number": "integer",
    "title": "string",
    "description": "string",
    "author": "string",
    "branch": "string",
    "base_branch": "string",
    "created_at": "datetime",
    "labels": ["string"]
  },
  "changes": [
    {
      "file": "string (file path)",
      "type": "string (added|modified|deleted)",
      "additions": "integer",
      "deletions": "integer",
      "diff": "string (unified diff format)",
      "language": "string"
    }
  ],
  "dependencies": {
    "direct": ["string (service names)"],
    "transitive": ["string (service names)"]
  }
}
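
For illustration, the sketch below maps such a payload onto typed Python containers before analysis. The class and field names simply mirror the schema above and are hypothetical; this is not the actual Rippler implementation.

# Minimal sketch (not Rippler code) of typed containers mirroring the payload schema.
from dataclasses import dataclass
from typing import List

@dataclass
class FileChange:
    file: str          # file path
    type: str          # "added" | "modified" | "deleted"
    additions: int
    deletions: int
    diff: str          # unified diff text
    language: str

@dataclass
class PullRequestPayload:
    repository: dict
    pull_request: dict
    changes: List[FileChange]
    dependencies: dict   # {"direct": [...], "transitive": [...]}

    @classmethod
    def from_json(cls, data: dict) -> "PullRequestPayload":
        return cls(
            repository=data["repository"],
            pull_request=data["pull_request"],
            changes=[FileChange(**c) for c in data["changes"]],
            dependencies=data["dependencies"],
        )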

Privacy & Ethics:

  • ✅ Only metadata and code diffs processed (no user personal data)
  • ✅ Data not stored permanently by Rippler
  • ✅ Sent to LLM providers (OpenAI/Anthropic) per their privacy policies
  • ✅ Local Ollama option available for complete data privacy
  • ⚠️ Users should review LLM provider terms for proprietary code

Data Quality:

  • Depends on quality of Git commit messages and PR descriptions
  • Accuracy limited by completeness of dependency graph
  • Diff quality depends on code formatting and Git configuration

Dependency Graph Data

Source: Dependency Graph Engine service (internal)
Data Collection: Continuous analysis of code repositories
Volume: Depends on repository size and complexity

Data Structure:

  • Service-to-service dependencies
  • API endpoint mappings
  • Database schema dependencies
  • Transitive dependency chains

Privacy: Internal organization data only
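
A minimal sketch of how such a graph can be represented and queried is shown below: an adjacency list of direct service-to-service dependencies plus a breadth-first walk to derive transitive chains. The service names are invented and the structure is an assumption, not the Dependency Graph Engine's actual format.

# Hypothetical dependency graph: direct edges plus derived transitive chains.
from collections import deque

direct = {                        # service -> services it calls (illustrative names)
    "checkout": ["payments", "inventory"],
    "payments": ["ledger"],
    "inventory": [],
    "ledger": [],
}

def transitive_dependencies(service: str, graph: dict) -> set:
    """Breadth-first walk collecting every service reachable from `service`."""
    seen, queue = set(), deque(graph.get(service, []))
    while queue:
        dep = queue.popleft()
        if dep not in seen:
            seen.add(dep)
            queue.extend(graph.get(dep, []))
    return seen

print(transitive_dependencies("checkout", direct))
# {'payments', 'inventory', 'ledger'}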

2. Evaluation and Testing Data

To validate the quality and accuracy of Rippler's impact analysis, the team has curated evaluation datasets from public and synthetic sources.

Public Repository PR Dataset

Source: Public GitHub repositories (open source projects)
Size: ~500 pull requests
License: Per individual repository licenses (MIT, Apache 2.0, GPL, etc.)
Purpose: Model evaluation and system testing

Composition:

  • Programming Languages:
    • JavaScript/TypeScript: 35%
    • Python: 30%
    • Java: 20%
    • Go: 10%
    • Other (Ruby, Rust, C++): 5%
  • Repository Types:
    • Microservices architectures: 40%
    • Monolithic applications: 30%
    • Libraries/frameworks: 20%
    • DevOps/infrastructure: 10%
  • PR Types:
    • Feature additions: 45%
    • Bug fixes: 30%
    • Refactoring: 15%
    • Documentation: 10%

Notable Source Repositories (examples):

  • Node.js Express applications
  • Python FastAPI/Django projects
  • Java Spring Boot microservices
  • Kubernetes configurations
  • React/Next.js web applications

Annotation Process:

  • Human expert annotations for ground truth
  • Three developers independently review each PR
  • Consensus labels for impact level, affected services, and risk
  • Inter-annotator agreement: κ = 0.82 (substantial agreement; see the sketch below)
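
The sketch below illustrates the consensus step under stated assumptions: three independent labels per PR are reduced to a majority vote, and pairwise Cohen's kappa gives a rough agreement figure. The PR IDs and labels are invented, and the team's actual tooling and agreement statistic may differ.

# Illustrative consensus labeling and pairwise Cohen's kappa (invented data).
from collections import Counter
from itertools import combinations

annotations = {                       # pr_id -> [annotator1, annotator2, annotator3]
    "pr-101": ["high", "high", "medium"],
    "pr-102": ["low", "low", "low"],
    "pr-103": ["medium", "high", "medium"],
}

def consensus(labels):
    """Majority vote; ties would be escalated to discussion in practice."""
    return Counter(labels).most_common(1)[0][0]

def cohen_kappa(a, b):
    """Cohen's kappa for two annotators labeling the same items."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    cats = set(a) | set(b)
    p_e = sum((a.count(c) / n) * (b.count(c) / n) for c in cats)
    return (p_o - p_e) / (1 - p_e) if p_e < 1 else 1.0

labels = {pr: consensus(votes) for pr, votes in annotations.items()}
columns = list(zip(*annotations.values()))        # one column of labels per annotator
kappas = [cohen_kappa(list(x), list(y)) for x, y in combinations(columns, 2)]
print(labels, sum(kappas) / len(kappas))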

Biases and Limitations:

  • ⚠️ Skewed toward well-documented open source projects
  • ⚠️ May not represent proprietary enterprise code patterns
  • ⚠️ Limited representation of less popular languages
  • ⚠️ Primarily English language comments/documentation

Synthetic Test Dataset

Source: Internally generated test cases
Size: ~200 synthetic PRs
Purpose: Edge case testing and regression testing

Composition:

  • Breaking API changes
  • Database schema migrations
  • Configuration changes
  • Security-sensitive modifications
  • Large-scale refactorings
  • Multi-service cascading changes

Generation Method:

  • Manual creation by senior engineers
  • Automated diff generation for specific patterns
  • Known failure cases reproduced from production incidents (see the diff-generation sketch below)
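
As a rough illustration of the automated diff-generation idea, the sketch below mutates a known code pattern into a breaking API change and emits a unified diff that could be replayed as a synthetic PR. The file names and code are invented.

# Generate a synthetic "breaking API change" diff from a before/after pair.
import difflib

before = [
    "def get_user(user_id):\n",
    "    return db.fetch(user_id)\n",
]
after = [
    "def get_user(user_id, include_deleted=False):\n",   # breaking signature change
    "    return db.fetch(user_id, include_deleted)\n",
]

diff = difflib.unified_diff(before, after,
                            fromfile="a/users/service.py",
                            tofile="b/users/service.py")
print("".join(diff))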

3. Model Training Data (Indirect)

Rippler does not train custom models, but relies on pre-trained LLMs. These models were trained by their providers on large corpora:

GPT-4o-mini Training Data (OpenAI)

Training Cutoff: October 2023
Data Sources (per OpenAI documentation):

  • Public internet text (web pages, books, articles)
  • Public code repositories (GitHub and similar)
  • Academic papers and technical documentation
  • Conversational data (with appropriate privacy controls)

Known Biases:

  • English language dominance
  • Western cultural perspectives
  • Overrepresentation of popular programming languages
  • Temporal bias (knowledge cutoff)

Unknown Composition:

  • OpenAI does not disclose exact training data composition
  • Proprietary data processing and filtering applied

Anthropic Claude Training Data

Training Cutoff: Early 2024 (varies by version)
Data Sources: Similar to GPT-4o-mini (per Anthropic documentation)
Focus: Enhanced safety training and constitutional AI principles

Ollama Models (CodeLlama, Llama 2, Mistral)

Training Data:

  • CodeLlama: Llama 2 variant further trained on code from public repositories
  • Llama 2: General text including code
  • Mistral: General text with strong technical capabilities

Open Weights: Model weights are publicly available, but the training data composition is only partially documented, much as with the commercial providers

Data Processing Pipeline

1. Data Ingestion

GitHub Webhook → API Gateway → Launchpad Service → Data Validation & Sanitization → Dependency Graph Enrichment
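
A hypothetical sketch of the validation and sanitization stage is shown below: it checks required top-level fields and caps oversized diffs before dependency-graph enrichment. The field names follow the schema above; the limits and function name are illustrative, not Rippler's actual configuration.

# Hypothetical payload validation/sanitization before enrichment.
MAX_DIFF_CHARS = 20_000
REQUIRED_KEYS = {"repository", "pull_request", "changes", "dependencies"}

def sanitize_payload(payload: dict) -> dict:
    missing = REQUIRED_KEYS - set(payload)
    if missing:
        raise ValueError(f"payload missing required fields: {sorted(missing)}")
    for change in payload["changes"]:
        # Truncate very large diffs so a single file cannot dominate the context.
        if len(change.get("diff", "")) > MAX_DIFF_CHARS:
            change["diff"] = change["diff"][:MAX_DIFF_CHARS] + "\n... [truncated]"
    return payload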

2. LLM Input Preparation

Structured Data → Prompt Engineering → Token Counting → Context Window Management → LLM API Call

Prompt Engineering:

  • System prompt defines task and output format
  • User prompt includes PR metadata, diffs, and dependency graph
  • Few-shot examples included for complex analysis tasks
  • Context limited to roughly 8K tokens to balance analysis quality against latency and cost (see the sketch below)
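
The sketch below illustrates the context-window management step: per-file diffs are appended to the prompt until a rough 8K-token budget is exhausted. The 4-characters-per-token estimate is a crude stand-in for a real tokenizer, and the prompt layout is an assumption, not Rippler's actual template.

# Illustrative prompt assembly with a rough token budget (not Rippler's real template).
TOKEN_BUDGET = 8_000

def rough_tokens(text: str) -> int:
    return len(text) // 4            # crude heuristic, good enough for budgeting

def build_prompt(pr: dict, changes: list, dependencies: dict) -> str:
    parts = [
        f"PR #{pr['number']}: {pr['title']}\n{pr['description']}\n",
        f"Direct dependencies: {', '.join(dependencies['direct'])}\n",
    ]
    used = sum(rough_tokens(p) for p in parts)
    for change in changes:
        block = f"--- {change['file']} ({change['type']}) ---\n{change['diff']}\n"
        if used + rough_tokens(block) > TOKEN_BUDGET:
            parts.append("[remaining diffs omitted to stay within the context budget]\n")
            break
        parts.append(block)
        used += rough_tokens(block)
    return "".join(parts)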

3. Output Processing

LLM Response → JSON Parsing → Validation → Confidence Scoring → Storage (Audit Log Only)
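
A minimal sketch of this stage, assuming a simple response schema (impacted_services, risk_level, confidence): the model's JSON is parsed, validated, and downgraded to low confidence when it is malformed. The schema and thresholds are illustrative, not Rippler's actual output contract.

# Hypothetical parsing and validation of the LLM's JSON response.
import json

EXPECTED_FIELDS = {"impacted_services", "risk_level", "confidence"}
VALID_RISK_LEVELS = {"high", "medium", "low"}

def parse_llm_response(raw: str) -> dict:
    try:
        result = json.loads(raw)
        if not isinstance(result, dict):
            raise ValueError("expected a JSON object")
    except (json.JSONDecodeError, ValueError):
        # Unparseable output is surfaced as an explicit low-confidence result.
        return {"impacted_services": [], "risk_level": "unknown", "confidence": 0.0}
    missing = EXPECTED_FIELDS - set(result)
    if missing or result.get("risk_level") not in VALID_RISK_LEVELS:
        result["confidence"] = min(result.get("confidence", 0.0), 0.3)
    return result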

Data Privacy and Security

Personal Data Handling

  • No Personal Data Collection: Rippler does not collect, store, or process personally identifiable information (PII)
  • Code Author Information: GitHub usernames are metadata only, not linked to personal profiles
  • Audit Logs: Anonymized IDs used for logging and monitoring

Code Data Privacy

  • ⚠️ Cloud LLMs: Code diffs sent to OpenAI/Anthropic servers for analysis
    • Subject to provider privacy policies
    • Not used for model training (per OpenAI/Anthropic terms for API customers)
    • Recommend reviewing provider terms for proprietary code
  • Local Ollama Option: Complete data privacy, all processing on-premises
  • No Rippler Storage: Code diffs not stored after analysis

Compliance

  • GDPR: No personal data processing (code is not personal data under GDPR)
  • SOC 2: Audit logging and access controls in place
  • Data Residency: Cloud LLMs may process data in multiple regions (per provider policies)

Dataset Limitations

Coverage Limitations

  • Limited to Git-based workflows (no support for other VCS)
  • Requires GitHub integration (no GitLab/Bitbucket support yet)
  • English-language optimized (other languages have reduced accuracy)
  • Limited to textual code analysis (no runtime/observability data)

Quality Limitations

  • Accuracy depends on quality of input (commit messages, PR descriptions)
  • Dependency graph completeness varies by repository
  • Historical context limited to PR scope (no access to full project history)
  • No access to external documentation or tribal knowledge

Representativeness Limitations

  • Evaluation data skewed toward open source patterns
  • May not generalize to all enterprise code styles
  • Limited representation of legacy systems and rare languages
  • Focused on microservices architectures

Evaluation Metrics

Quantitative Metrics (on Evaluation Dataset)

Metric | Definition | GPT-4o-mini | Claude | CodeLlama 13B
Impact Detection Precision | Correctly identified impacted services / All identified | 89% | 88% | 75%
Impact Detection Recall | Correctly identified impacted services / All actually impacted | 92% | 91% | 78%
Risk Level Accuracy | Correct risk classification (high/med/low) | 88% | 87% | 72%
Stakeholder ID Accuracy | Correct stakeholder identification | 85% | 84% | 68%
False Positive Rate | Incorrectly flagged as impacted / Total negatives | 8% | 9% | 18%
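
For a single PR, these detection metrics can be computed by treating impact analysis as set prediction over services, as in the worked sketch below; the service names and numbers are invented.

# Worked example of precision, recall, and false positive rate for one PR.
predicted = {"payments", "ledger", "notifications"}   # services the model flagged
actual    = {"payments", "ledger"}                    # ground-truth impacted services
all_services = {"payments", "ledger", "notifications", "inventory", "search"}

true_pos  = predicted & actual
false_pos = predicted - actual
negatives = all_services - actual                     # services not actually impacted

precision = len(true_pos) / len(predicted)            # 2/3 ≈ 0.67
recall    = len(true_pos) / len(actual)               # 2/2 = 1.0
fp_rate   = len(false_pos) / len(negatives)           # 1/3 ≈ 0.33

print(precision, recall, fp_rate)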

Qualitative Evaluation

  • Human expert review of random samples (5% of production traffic)
  • User satisfaction surveys (planned)
  • A/B testing between models (ongoing)

Future Dataset Improvements

Planned Additions

  1. Fine-tuning Dataset: Curate organization-specific dataset for fine-tuning
  2. Multi-language Support: Expand evaluation data to more programming languages
  3. Historical Accuracy Tracking: Build dataset of Rippler predictions vs actual impact
  4. Incident Correlation: Link analyses to post-deployment incidents for validation

Community Contributions

  • Open source evaluation dataset (in progress)
  • Community-submitted test cases
  • Benchmark suite for impact analysis systems

Data Access and Licensing

Evaluation Dataset

  • Availability: Contact team for research access
  • License: TBD (likely CC BY-SA 4.0 for public portions)
  • Format: JSON files with PR metadata and annotations

Requesting Access

For research or evaluation purposes, contact the Rippler team:

  • Email: See README.md for team contacts
  • GitHub Discussions: [Link to discussions page]

Citation

If you use Rippler's evaluation dataset or methods in research, please cite:

@software{rippler2024,
  title={Rippler: Ripple Impact Prediction and Propagation Logging for Engineering Resilience},
  author={Rippler Team},
  year={2024},
  url={https://github.com/hanisntsolo/rippler}
}

Last Updated: November 2024
Version: 1.0
Maintained By: Rippler Team