
ADR-002: Multi-Layered Repository Analysis Engine Design


id: 002-repository-analysis-engine
title: 'ADR-002: Repository Analysis Engine Design'
sidebar_label: 'ADR-2: Repository Analysis Engine Design'
sidebar_position: 2

Status

Accepted

Context

DocuMCP needs to understand repository characteristics to make intelligent recommendations about static site generators and documentation structure. The analysis must go beyond simple file counting to provide deep insights into project complexity, language ecosystems, existing documentation patterns, and development practices.

Key requirements:

  • Comprehensive project characterization
  • Language ecosystem detection
  • Documentation quality assessment
  • Project complexity evaluation
  • Performance optimization for large repositories
  • Extensible architecture for new analysis types

Decision

We will implement a multi-layered repository analysis engine that examines repositories from multiple perspectives to build comprehensive project profiles.

Analysis Layers:

1. File System Analysis Layer

  • Recursive directory traversal with intelligent filtering
  • File categorization by extension and content patterns
  • Metrics calculation: file counts, lines of code, directory depth, size distributions
  • Ignore pattern handling: .gitignore, common build artifacts, node_modules
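The traversal and filtering described above can be sketched as follows. This is a minimal illustration, not the engine's actual code: `DirEntry`, `ReadDir`, `WalkOptions`, and `walkRepository` are hypothetical names, and directory listing is injected as a callback so the sketch stays independent of any particular file system API. Parsing of .gitignore rules is left out; only a fixed set of ignored names is handled.

```typescript
// Hypothetical shapes for illustration; none of these names come from the ADR.
interface DirEntry {
  name: string;
  isDirectory: boolean;
}

// Injected directory reader, so the walk itself has no file system dependency.
type ReadDir = (dir: string) => DirEntry[];

interface WalkOptions {
  ignoredNames: Set<string>; // e.g. node_modules, .git, build output
  maxDepth: number; // entries directly under the root are at depth 1
}

interface WalkResult {
  files: string[];
  maxDepthSeen: number;
}

function walkRepository(root: string, readDir: ReadDir, options: WalkOptions): WalkResult {
  const result: WalkResult = { files: [], maxDepthSeen: 0 };
  // Explicit stack instead of recursion, so deep trees cannot overflow the call stack.
  const stack: Array<{ dir: string; depth: number }> = [{ dir: root, depth: 1 }];

  while (stack.length > 0) {
    const { dir, depth } = stack.pop()!;
    for (const entry of readDir(dir)) {
      // Filter before descending, so ignored subtrees are never read.
      if (options.ignoredNames.has(entry.name)) continue;
      const fullPath = `${dir}/${entry.name}`;
      if (entry.isDirectory) {
        // Directories at the depth limit are not entered.
        if (depth < options.maxDepth) stack.push({ dir: fullPath, depth: depth + 1 });
      } else {
        result.files.push(fullPath);
        result.maxDepthSeen = Math.max(result.maxDepthSeen, depth);
      }
    }
  }
  return result;
}
```

In a Node.js setting, the `readDir` callback could be backed by `fs.readdirSync(dir, { withFileTypes: true })`, mapping each returned entry to a `DirEntry`.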

2. Language Ecosystem Analysis Layer

  • Package manager detection: package.json, requirements.txt, Cargo.toml, go.mod, etc.
  • Dependency analysis: direct and transitive dependencies
  • Build tool identification: webpack, vite, gradle, maven, cargo, etc.
  • Version constraint analysis: compatibility requirements

3. Content Analysis Layer

  • Documentation quality assessment: README analysis, existing docs
  • Code comment analysis: inline documentation patterns
  • API surface detection: public interfaces, exported functions
  • Content gap identification: missing documentation areas
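As one possible illustration of content gap identification, the sketch below checks a README for a set of expected sections. The `EXPECTED_SECTIONS` list and the `findMissingSections` function are assumptions made for this example, not part of the ADR, and the heading detection is deliberately naive: it only recognizes ATX-style Markdown headings and would also count `#` lines inside fenced code blocks.

```typescript
// Hypothetical list of sections a README is expected to cover.
const EXPECTED_SECTIONS = ["installation", "usage", "contributing", "license"];

function findMissingSections(readme: string, expected: string[] = EXPECTED_SECTIONS): string[] {
  // Collect the text of every ATX heading ("# Title" through "###### Title").
  const headings = readme
    .split(/\r?\n/)
    .map((line) => /^#{1,6}\s+(.*)$/.exec(line))
    .filter((match): match is RegExpExecArray => match !== null)
    .map((match) => (match[1] || "").trim().toLowerCase());

  // A section counts as present when any heading contains its name.
  return expected.filter((section) => !headings.some((heading) => heading.includes(section)));
}
```

A real assessment would likely weigh section length and content as well; presence of a heading is only the cheapest signal.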

4. Project Metadata Analysis Layer

  • Git history patterns: commit frequency, contributor activity
  • Release management: tagging patterns, version schemes
  • Issue tracking: GitHub issues, project management indicators
  • Community engagement: contributor count, activity patterns

5. Complexity Assessment Layer

  • Architectural complexity: microservices, modular design patterns
  • Technical complexity: multi-language projects, advanced configurations
  • Maintenance indicators: test coverage, CI/CD presence, code quality metrics
  • Documentation sophistication needs: API complexity, user journey complexity

Alternatives Considered

Single-Pass Analysis

  • Pros: Simpler implementation, faster for small repositories
  • Cons: Limited depth, cannot build sophisticated project profiles
  • Decision: Rejected because a single pass cannot gather enough signal for quality recommendations

External Tool Integration (e.g., GitHub API, CodeClimate)

  • Pros: Rich metadata, established metrics
  • Cons: External dependencies, rate limiting, requires authentication
  • Decision: Rejected for core analysis; may integrate as optional enhancement

Machine Learning-Based Analysis

  • Pros: Could learn patterns from successful documentation projects
  • Cons: Training data requirements, model maintenance, unpredictable results
  • Decision: Deferred to future versions; start with rule-based analysis

Database-Backed Caching

  • Pros: Faster repeat analysis, could store learning patterns
  • Cons: Deployment complexity, staleness issues, synchronization problems
  • Decision: Rejected for initial version; implement in-memory caching only

Consequences

Positive

  • Intelligent Recommendations: Deep analysis enables sophisticated SSG matching
  • Extensible Architecture: Easy to add new analysis dimensions
  • Performance Optimization: Layered approach allows selective analysis depth
  • Quality Assessment: Can identify and improve existing documentation
  • Future-Proof: Architecture supports ML integration and advanced analytics

Negative

  • Analysis Time: Comprehensive analysis may be slower for large repositories
  • Complexity: Multi-layered architecture requires careful coordination
  • Memory Usage: Full repository analysis requires significant memory for large projects

Risks and Mitigations

  • Performance: Implement streaming analysis and configurable depth limits
  • Accuracy: Validate analysis results against known project types
  • Maintenance: Regular testing against diverse repository types

Implementation Details

Analysis Engine Structure

interface RepositoryAnalysis {
  fileSystem: FileSystemAnalysis;
  languageEcosystem: LanguageEcosystemAnalysis;
  content: ContentAnalysis;
  metadata: ProjectMetadataAnalysis;
  complexity: ComplexityAssessment;
}

interface AnalysisLayer {
  analyze(repositoryPath: string): Promise<LayerResult>;
  getMetrics(): AnalysisMetrics;
  validate(): ValidationResult;
}

Performance Optimizations

  • Parallel Analysis: Independent layers run concurrently
  • Intelligent Filtering: Skip irrelevant files and directories early
  • Progressive Analysis: Start with lightweight analysis, deepen as needed
  • Caching Strategy: Cache analysis results within session scope
  • Size Limits: Configurable limits for very large repositories
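The parallel analysis and session-scoped caching listed above might look roughly like this. The `Layer` interface is a trimmed-down stand-in for `AnalysisLayer`, the fields of `LayerResult` are invented for the example (the ADR does not define them), and `AnalysisSession` is a hypothetical name.

```typescript
// Hypothetical result shape; the ADR does not define LayerResult's fields.
interface LayerResult {
  layer: string;
  data: Record<string, unknown>;
}

// Trimmed-down stand-in for the AnalysisLayer interface.
interface Layer {
  name: string;
  analyze(repositoryPath: string): Promise<LayerResult>;
}

// Session-scoped cache: results live only as long as this object does.
class AnalysisSession {
  private readonly layers: Layer[];
  private readonly cache = new Map<string, Promise<LayerResult[]>>();

  constructor(layers: Layer[]) {
    this.layers = layers;
  }

  analyze(repositoryPath: string): Promise<LayerResult[]> {
    const cached = this.cache.get(repositoryPath);
    if (cached) return cached;

    // Independent layers start together; Promise.all keeps the input order.
    const pending = Promise.all(this.layers.map((layer) => layer.analyze(repositoryPath)));
    // Caching the promise (not the value) also deduplicates concurrent requests.
    this.cache.set(repositoryPath, pending);
    // Drop failed runs so a later call can retry.
    pending.catch(() => this.cache.delete(repositoryPath));
    return pending;
  }
}
```

This sketch assumes every layer is independent. A layer that consumes another layer's output, as the complexity assessment plausibly does, would need to run in a second stage after the first batch resolves.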

File Pattern Recognition

const FILE_PATTERNS = {
  documentation: ['*.md', '*.rst', '*.adoc', 'docs/', 'documentation/'],
  configuration: ['config/', '.config/', '*.json', '*.yaml', '*.toml'],
  source: ['src/', 'lib/', '*.js', '*.ts', '*.py', '*.go', '*.rs'],
  tests: ['test/', 'tests/', '__tests__/', '*.test.*', '*.spec.*'],
  build: ['build/', 'dist/', 'target/', 'bin/', '*.lock']
};
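A pattern table like this still needs a matcher and a precedence rule, because a single path can match several categories (for example, `src/utils/math.test.ts` matches both source and tests). The sketch below shows one way to resolve that, assuming three pattern forms: a trailing slash for directories, `*.ext` for extensions, and `*.infix.*` for infix matches. The `matchesPattern` and `categorize` functions and the category ordering are assumptions of this example.

```typescript
type Category = "documentation" | "configuration" | "source" | "tests" | "build";

// Same patterns as the table above, ordered by precedence (an assumption):
// the first matching category wins, so tests are checked before source.
const ORDERED_PATTERNS: Array<[Category, string[]]> = [
  ["tests", ["test/", "tests/", "__tests__/", "*.test.*", "*.spec.*"]],
  ["documentation", ["*.md", "*.rst", "*.adoc", "docs/", "documentation/"]],
  ["build", ["build/", "dist/", "target/", "bin/", "*.lock"]],
  ["source", ["src/", "lib/", "*.js", "*.ts", "*.py", "*.go", "*.rs"]],
  ["configuration", ["config/", ".config/", "*.json", "*.yaml", "*.toml"]],
];

function matchesPattern(filePath: string, pattern: string): boolean {
  const fileName = filePath.slice(filePath.lastIndexOf("/") + 1);
  if (pattern.endsWith("/")) {
    // Directory pattern: some parent segment equals the directory name.
    const parents = filePath.split("/").slice(0, -1);
    return parents.includes(pattern.slice(0, -1));
  }
  if (pattern.startsWith("*.") && pattern.endsWith(".*")) {
    // Infix pattern such as *.test.* : ".test." must appear in the file name.
    return fileName.includes(pattern.slice(1, -1));
  }
  if (pattern.startsWith("*.")) {
    // Extension pattern such as *.md
    return fileName.endsWith(pattern.slice(1));
  }
  return fileName === pattern;
}

function categorize(filePath: string): Category | undefined {
  for (const [category, patterns] of ORDERED_PATTERNS) {
    if (patterns.some((pattern) => matchesPattern(filePath, pattern))) return category;
  }
  return undefined; // no pattern matched
}
```

Paths are assumed to be repository-relative and to use forward slashes. A production matcher would more likely delegate to an established glob library than hand-roll these three cases.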

Language Ecosystem Detection

const ECOSYSTEM_INDICATORS = {
  javascript: ['package.json', 'node_modules/', 'yarn.lock', 'pnpm-lock.yaml'],
  python: ['requirements.txt', 'setup.py', 'pyproject.toml', 'Pipfile'],
  rust: ['Cargo.toml', 'Cargo.lock', 'src/main.rs'],
  go: ['go.mod', 'go.sum', 'main.go'],
  java: ['pom.xml', 'build.gradle', 'gradlew']
};
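To show how such an indicator table could be consumed, here is a minimal sketch that matches it against a list of repository-relative paths and ranks ecosystems by the amount of evidence found. `EcosystemMatch` and `detectEcosystems` are hypothetical names, and the sketch assumes directories are listed with a trailing slash, as in the table.

```typescript
// Copy of the indicator table above, typed for lookup.
const INDICATORS: Record<string, string[]> = {
  javascript: ["package.json", "node_modules/", "yarn.lock", "pnpm-lock.yaml"],
  python: ["requirements.txt", "setup.py", "pyproject.toml", "Pipfile"],
  rust: ["Cargo.toml", "Cargo.lock", "src/main.rs"],
  go: ["go.mod", "go.sum", "main.go"],
  java: ["pom.xml", "build.gradle", "gradlew"],
};

interface EcosystemMatch {
  ecosystem: string;
  evidence: string[]; // indicator files that were actually found
}

function detectEcosystems(repositoryPaths: string[]): EcosystemMatch[] {
  const present = new Set(repositoryPaths);
  const matches: EcosystemMatch[] = [];
  for (const [ecosystem, indicators] of Object.entries(INDICATORS)) {
    const evidence = indicators.filter((indicator) => present.has(indicator));
    if (evidence.length > 0) matches.push({ ecosystem, evidence });
  }
  // Most evidence first, so the likely primary ecosystem leads the list.
  return matches.sort((a, b) => b.evidence.length - a.evidence.length);
}
```

Returning the evidence alongside the ecosystem name keeps the result explainable, which helps when a multi-language repository produces several matches.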

Complexity Scoring Algorithm

interface ComplexityFactors {
  fileCount: number;
  languageCount: number;
  dependencyCount: number;
  directoryDepth: number;
  contributorCount: number;
  apiSurfaceSize: number;
}

type ComplexityScore = 'simple' | 'moderate' | 'complex' | 'enterprise';

function calculateComplexityScore(factors: ComplexityFactors): ComplexityScore {
  // Weighted scoring algorithm balancing multiple factors
  // Returns: 'simple' | 'moderate' | 'complex' | 'enterprise'
}
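The ADR leaves the scoring body unspecified. One way a weighted score could be realized is sketched below: each factor is normalized on a logarithmic scale against a saturation point, the normalized values are combined with weights that sum to 1, and the total is bucketed into the four labels. Every weight, saturation point, and threshold here is a placeholder chosen for illustration, not a tuned or documented value.

```typescript
interface ComplexityFactors {
  fileCount: number;
  languageCount: number;
  dependencyCount: number;
  directoryDepth: number;
  contributorCount: number;
  apiSurfaceSize: number;
}

type ComplexityScore = "simple" | "moderate" | "complex" | "enterprise";

// Hypothetical weights (summing to 1) and saturation points; placeholders only.
const FACTOR_CONFIG: Record<keyof ComplexityFactors, { weight: number; saturation: number }> = {
  fileCount: { weight: 0.25, saturation: 10000 },
  languageCount: { weight: 0.15, saturation: 8 },
  dependencyCount: { weight: 0.2, saturation: 500 },
  directoryDepth: { weight: 0.1, saturation: 12 },
  contributorCount: { weight: 0.1, saturation: 200 },
  apiSurfaceSize: { weight: 0.2, saturation: 2000 },
};

// Maps a raw count to [0, 1] on a log scale, capped at the saturation point.
function normalize(value: number, saturation: number): number {
  return Math.min(1, Math.log1p(Math.max(0, value)) / Math.log1p(saturation));
}

function calculateComplexityScore(factors: ComplexityFactors): ComplexityScore {
  let total = 0;
  for (const key of Object.keys(FACTOR_CONFIG) as Array<keyof ComplexityFactors>) {
    const { weight, saturation } = FACTOR_CONFIG[key];
    total += weight * normalize(factors[key], saturation);
  }
  // Hypothetical thresholds on the combined score.
  if (total < 0.4) return "simple";
  if (total < 0.6) return "moderate";
  if (total < 0.8) return "complex";
  return "enterprise";
}
```

The log scale is a design choice of this sketch: it keeps one very large factor, such as file count in a monorepo, from dominating the result. The accuracy validation against known project types would be the place to calibrate the actual numbers.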

Quality Assurance

Testing Strategy

  • Unit Tests: Each analysis layer tested independently
  • Integration Tests: Full analysis pipeline validation
  • Repository Fixtures: Test suite with diverse project types
  • Performance Tests: Analysis time benchmarks for various repository sizes
  • Accuracy Validation: Manual verification against known project characteristics

Monitoring and Metrics

  • Analysis execution time by repository size
  • Accuracy of complexity assessments
  • Cache hit rates and memory usage
  • Error rates and failure modes

Future Enhancements

Machine Learning Integration

  • Pattern recognition for project types
  • Automated documentation quality scoring
  • Predictive analysis for maintenance needs

Advanced Analytics

  • Historical trend analysis
  • Comparative analysis across similar projects
  • Community best practice identification

Performance Optimizations

  • WebAssembly modules for intensive analysis
  • Distributed analysis for very large repositories
  • Incremental analysis for updated repositories

Security Considerations

  • File System Access: Restricted to repository boundaries
  • Content Scanning: No sensitive data extraction or storage
  • Resource Limits: Prevent resource exhaustion attacks
  • Input Validation: Sanitize all repository paths and content
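Restricting file system access to repository boundaries usually comes down to resolving each candidate path and confirming it still sits under the repository root. A minimal sketch using Node's `path` module follows; the `resolveInsideRepository` function is a hypothetical helper, not an existing DocuMCP API.

```typescript
import * as path from "node:path";

// Returns the absolute path if it stays inside the repository, otherwise null.
function resolveInsideRepository(repositoryRoot: string, candidate: string): string | null {
  const root = path.resolve(repositoryRoot);
  // Resolving against the root collapses "." and ".." segments.
  const resolved = path.resolve(root, candidate);
  const relative = path.relative(root, resolved);
  const escapes =
    relative === ".." ||
    relative.startsWith(".." + path.sep) ||
    // On Windows, a path on another drive yields an absolute "relative" path.
    path.isAbsolute(relative);
  return escapes ? null : resolved;
}
```

This check is purely lexical and does not follow symbolic links. A symlink inside the repository can still point outside it, so a stricter implementation would also compare real paths (for example via `fs.realpath`) before reading.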
