Semantic Chunker Implementation

Our Semantic Chunker is an experimental implementation based on our study of how Claude CLI processes large codebases. This standalone component can be used to intelligently divide codebases into meaningful chunks while preserving semantic relationships.

Overview

The Semantic Chunker is designed to:

  1. Analyze code repositories to find semantically meaningful units
  2. Identify relationships between code components
  3. Select the most relevant code based on a query
  4. Optimize chunks for token efficiency

This implementation represents our current understanding of how Claude CLI might approach the problem of processing large codebases efficiently.

Architecture

The Semantic Chunker consists of several key components:

┌───────────────────────────┐
│    Repository Scanner     │
└─────────────┬─────────────┘

┌─────────────▼─────────────┐
│       AST Analyzer        │
└─────────────┬─────────────┘

┌─────────────▼─────────────┐
│     Reference Tracker     │
└─────────────┬─────────────┘

┌─────────────▼─────────────┐
│     Relevance Scorer      │
└─────────────┬─────────────┘

┌─────────────▼─────────────┐
│      Chunk Generator      │
└───────────────────────────┘

Key Components

1. Repository Scanner

The Repository Scanner traverses the codebase:

import os


class RepositoryScanner:
    def scan_repository(self, repo_path):
        all_files = []

        # Walk through the repository
        for root, dirs, files in os.walk(repo_path):
            # Skip hidden directories
            dirs[:] = [d for d in dirs if not d.startswith('.')]

            # Collect code files
            for file in files:
                if self._is_code_file(file):
                    file_path = os.path.join(root, file)
                    all_files.append(file_path)

        return all_files

    def _is_code_file(self, filename):
        # Check whether the file extension indicates source code
        code_extensions = {'.py', '.js', '.ts', '.java', '.c', '.cpp', '.go', '.rb'}
        return any(filename.endswith(ext) for ext in code_extensions)

2. AST Analyzer

The AST Analyzer parses code into abstract syntax trees:

import ast


class AstAnalyzer:
    def analyze_file(self, file_path):
        with open(file_path, 'r') as f:
            content = f.read()

        # Parse based on file type
        if file_path.endswith('.py'):
            return self._analyze_python(content)
        elif file_path.endswith('.js') or file_path.endswith('.ts'):
            return self._analyze_javascript(content)
        # Additional language support...

    def _analyze_python(self, content):
        # Parse Python code into an AST
        tree = ast.parse(content)

        # Extract semantic units
        semantic_units = []

        # Find classes and functions
        for node in ast.walk(tree):
            if isinstance(node, ast.ClassDef):
                semantic_units.append(self._extract_class(node, content))
            elif isinstance(node, ast.FunctionDef):
                if not self._is_method(node):  # Skip methods (they're part of classes)
                    semantic_units.append(self._extract_function(node, content))

        # Find imports
        imports = self._extract_imports(tree, content)

        return {
            'imports': imports,
            'semantic_units': semantic_units,
        }
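
The _is_method helper used above is not shown. One minimal sketch is to attach parent pointers to the tree right after parsing and treat any function whose enclosing node is a class as a method; the _attach_parents name below is our own illustration, not part of the original design:

def _attach_parents(self, tree):
    # Annotate every node with its parent so _is_method can inspect the
    # enclosing scope (call this once, immediately after ast.parse)
    for parent in ast.walk(tree):
        for child in ast.iter_child_nodes(parent):
            child.parent = parent

def _is_method(self, node):
    # A function whose immediate parent is a class definition is a method
    return isinstance(getattr(node, 'parent', None), ast.ClassDef)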

3. Reference Tracker

The Reference Tracker identifies relationships between code components:

import networkx as nx


class ReferenceTracker:
    def __init__(self):
        self.reference_graph = nx.DiGraph()  # Using NetworkX for the graph

    def build_reference_graph(self, files_ast):
        # Add all files as nodes
        for file_path, ast_info in files_ast.items():
            self.reference_graph.add_node(file_path)

        # Add edges for imports and references
        for file_path, ast_info in files_ast.items():
            # Process imports
            for imp in ast_info['imports']:
                imported_file = self._resolve_import(imp, file_path)
                if imported_file and imported_file in files_ast:
                    self.reference_graph.add_edge(file_path, imported_file)

            # Process references to other components
            for unit in ast_info['semantic_units']:
                references = unit.get('references', [])
                for ref in references:
                    ref_file = self._find_reference_file(ref, files_ast)
                    if ref_file:
                        self.reference_graph.add_edge(file_path, ref_file)
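
The _find_reference_file helper is not defined here (_resolve_import is shown later in this document). A minimal sketch, assuming each semantic unit carries the 'name' field produced by the AST Analyzer:

def _find_reference_file(self, ref_name, files_ast):
    # Return the first file that defines a semantic unit with the referenced name
    for file_path, ast_info in files_ast.items():
        for unit in ast_info.get('semantic_units', []):
            if unit.get('name') == ref_name:
                return file_path
    return None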

4. Relevance Scorer

The Relevance Scorer assesses the relevance of code to a query:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


class RelevanceScorer:
    def __init__(self):
        self.vectorizer = TfidfVectorizer()

    def score_files(self, files, query, file_contents):
        # Prepare the corpus
        file_paths = list(files)
        corpus = [file_contents[path] for path in file_paths]

        # Fit the vectorizer on the corpus
        self.vectorizer.fit(corpus)

        # Vectorize the query and documents
        query_vector = self.vectorizer.transform([query])
        doc_vectors = self.vectorizer.transform(corpus)

        # Calculate similarity scores
        similarity_scores = cosine_similarity(query_vector, doc_vectors)[0]

        # Create a scored file dictionary
        scored_files = {
            file_paths[i]: similarity_scores[i]
            for i in range(len(file_paths))
        }

        return scored_files

    def prioritize_files(self, scored_files, reference_graph):
        # Enhance scores based on the reference graph
        enhanced_scores = scored_files.copy()

        # Files with many incoming references are more important
        for file in scored_files:
            incoming = reference_graph.in_degree(file)
            enhanced_scores[file] *= (1 + 0.1 * incoming)  # Boost the score by 10% per incoming reference

        return enhanced_scores

5. Chunk Generator

The Chunk Generator creates optimized code chunks:

class ChunkGenerator:
    def generate_chunks(self, files_ast, scored_files, token_budget):
        # Sort files by score
        sorted_files = sorted(scored_files.items(), key=lambda x: x[1], reverse=True)

        chunks = []
        current_token_count = 0

        # Process files in order of relevance
        for file_path, score in sorted_files:
            ast_info = files_ast[file_path]

            # Always include imports first
            import_chunk = self._create_chunk(file_path, 'imports', ast_info['imports'])
            import_token_count = self._count_tokens(import_chunk['content'])

            if current_token_count + import_token_count <= token_budget:
                chunks.append(import_chunk)
                current_token_count += import_token_count

            # Then include semantic units in order of relevance
            for unit in self._sort_units_by_relevance(ast_info['semantic_units'], scored_files):
                unit_chunk = self._create_chunk(file_path, unit['type'], unit['content'])
                unit_token_count = self._count_tokens(unit_chunk['content'])

                if current_token_count + unit_token_count <= token_budget:
                    chunks.append(unit_chunk)
                    current_token_count += unit_token_count
                else:
                    # We've reached the token budget
                    break

        return chunks
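
The _create_chunk and _count_tokens helpers are assumed rather than shown. A minimal sketch that represents chunks as plain dictionaries (matching the keys used in the usage example later in this document) and approximates token counts with a characters-per-token heuristic:

def _create_chunk(self, file_path, chunk_type, content):
    # Package a piece of code together with metadata for downstream use
    if isinstance(content, list):  # imports may arrive as a list of statements
        content = '\n'.join(content)
    return {
        'file_path': file_path,
        'type': chunk_type,
        'content': content,
    }

def _count_tokens(self, text):
    # Rough heuristic: roughly four characters per token for source code;
    # a real tokenizer would give a more precise count
    return max(1, len(text) // 4)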

Implementation Details

AST Processing

Our implementation uses language-specific AST parsers:

  • Python: Uses the built-in ast module
  • JavaScript/TypeScript: Uses the esprima parser (a sketch follows this list)
  • Java: Uses javalang parser
  • C/C++: Uses pycparser
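
For comparison with the Python example below, here is a minimal sketch of how _analyze_javascript might use the esprima package (the Python port). It only collects top-level functions, classes, and imports, and leaves the references list empty:

import esprima


def _analyze_javascript(self, content):
    # Parse the module; the loc option attaches start/end line numbers to nodes
    tree = esprima.parseModule(content, {'loc': True})

    lines = content.splitlines()
    imports = []
    semantic_units = []

    for node in tree.body:
        if node.type in ('FunctionDeclaration', 'ClassDeclaration'):
            start, end = node.loc.start.line, node.loc.end.line
            semantic_units.append({
                'type': 'function' if node.type == 'FunctionDeclaration' else 'class',
                'name': node.id.name if node.id else '<anonymous>',
                'content': '\n'.join(lines[start - 1:end]),
                'references': [],
            })
        elif node.type == 'ImportDeclaration':
            imports.append(lines[node.loc.start.line - 1])

    return {'imports': imports, 'semantic_units': semantic_units}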

Example AST processing for Python:

def _extract_class(self, node, content):
    start_line = node.lineno
    end_line = self._find_end_line(node, content)

    methods = []
    for child in ast.iter_child_nodes(node):
        if isinstance(child, ast.FunctionDef):
            methods.append(child.name)

    # Extract the source code for the class
    source_lines = content.splitlines()[start_line - 1:end_line]
    class_source = '\n'.join(source_lines)

    return {
        'type': 'class',
        'name': node.name,
        'content': class_source,
        'methods': methods,
        'references': self._find_references(node, content),
    }
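
The _find_end_line helper is not shown. On Python 3.8+ the parser records end positions directly on each node, so a minimal sketch only needs a fallback for older trees:

def _find_end_line(self, node, content):
    # Python 3.8+ records the end line directly on the node
    if getattr(node, 'end_lineno', None) is not None:
        return node.end_lineno

    # Fallback: the last line number seen anywhere inside the node
    return max(
        (child.lineno for child in ast.walk(node) if hasattr(child, 'lineno')),
        default=node.lineno,
    )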

Reference Tracking

Reference tracking includes:

  1. Import Tracking: Follows import statements to build the dependency graph
  2. Symbol Resolution: Resolves symbol references across files
  3. Interface Implementation: Identifies implementations of interfaces/abstract classes
  4. Inheritance Tracking: Follows inheritance hierarchies

Example reference tracking for Python imports:

def _resolve_import(self, import_statement, importing_file):
    # Handle the different import formats
    if import_statement.startswith('from '):
        # from X import Y
        parts = import_statement.split(' import ')
        module_path = parts[0][5:]  # Remove 'from '
    else:
        # import X
        module_path = import_statement[7:]  # Remove 'import '

    # Convert the module path to candidate file paths
    possible_file_paths = self._module_to_file_paths(module_path, importing_file)

    # Return the first candidate path that exists
    for path in possible_file_paths:
        if os.path.exists(path):
            return path

    return None
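
Inheritance tracking (item 4 in the list above) can feed the same reference graph. A sketch of how base-class names might be extracted from an ast.ClassDef node and appended to a unit's references; the _extract_base_classes name is ours:

def _extract_base_classes(self, class_node):
    # Collect base-class names so they can be added to the unit's
    # 'references' and resolved like any other cross-file reference
    bases = []
    for base in class_node.bases:
        if isinstance(base, ast.Name):         # class Foo(Base)
            bases.append(base.id)
        elif isinstance(base, ast.Attribute):  # class Foo(pkg.Base)
            bases.append(base.attr)
    return bases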

Relevance Scoring

Our relevance scoring combines:

  1. TF-IDF Similarity: Measures content relevance to the query
  2. Reference Importance: Prioritizes files with many references
  3. Recency: Prioritizes recently modified files
  4. Structure Significance: Prioritizes key structural files (main, entry points)

Example scoring combination:

def combine_scores(self, tfidf_scores, reference_scores, recency_scores):
    combined_scores = {}

    for file in tfidf_scores:
        # Weighted combination of scores
        combined_scores[file] = (
            0.6 * tfidf_scores[file] +      # Content relevance is most important
            0.3 * reference_scores[file] +  # References are quite important
            0.1 * recency_scores[file]      # Recency is less important
        )

    return combined_scores
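
The recency component (item 3 above) is not shown in the scorer. A minimal sketch based on file modification times, using an assumed 30-day half-life:

import os
import time


def score_recency(self, files):
    # Files modified recently score close to 1.0; the score halves every 30 days
    now = time.time()
    scores = {}
    for file in files:
        age_days = (now - os.path.getmtime(file)) / 86400
        scores[file] = 0.5 ** (age_days / 30)
    return scores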

Token Optimization

Our token optimization includes:

  1. Whitespace Normalization: Reducing excessive whitespace
  2. Comment Handling: Preserving essential comments, removing others
  3. Duplication Avoidance: Skipping content that closely duplicates chunks already selected
  4. Structure Preservation: Ensuring semantic units stay intact

Example whitespace normalization:

def _normalize_whitespace(self, content):
    # Replace multiple blank lines with a single blank line
    normalized = re.sub(r'\n\s*\n', '\n\n', content)

    # Remove trailing whitespace
    normalized = re.sub(r'[ \t]+$', '', normalized, flags=re.MULTILINE)

    return normalized
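
Comment handling (item 2 above) can follow the same pattern. A rough, line-based sketch that drops full-line '#' comments unless they carry markers we assume are worth keeping; it deliberately leaves inline comments and docstrings alone:

def _strip_noncritical_comments(self, content):
    # Keep comments that usually carry meaning; drop the rest.
    # A line-based check is a heuristic and does not parse string literals.
    keep_markers = ('TODO', 'FIXME', 'NOTE', 'noqa', 'type:')
    kept_lines = []
    for line in content.splitlines():
        stripped = line.strip()
        if stripped.startswith('#') and not any(m in stripped for m in keep_markers):
            continue
        kept_lines.append(line)
    return '\n'.join(kept_lines)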

Usage Example

Using the Semantic Chunker in a project:

# Initialize components
scanner = RepositoryScanner()
analyzer = AstAnalyzer()
tracker = ReferenceTracker()
scorer = RelevanceScorer()
generator = ChunkGenerator()

# Scan the repository
repo_path = "/path/to/repository"
files = scanner.scan_repository(repo_path)

# Analyze files
files_ast = {}
file_contents = {}
for file_path in files:
    try:
        files_ast[file_path] = analyzer.analyze_file(file_path)
        with open(file_path, 'r') as f:
            file_contents[file_path] = f.read()
    except Exception as e:
        print(f"Error analyzing {file_path}: {e}")

# Build the reference graph
tracker.build_reference_graph(files_ast)

# Score files based on a query (only files that were analyzed successfully)
query = "How does the authentication system work?"
scored_files = scorer.score_files(list(file_contents), query, file_contents)
enhanced_scores = scorer.prioritize_files(scored_files, tracker.reference_graph)

# Generate chunks
token_budget = 50000
chunks = generator.generate_chunks(files_ast, enhanced_scores, token_budget)

# Use the chunks
for chunk in chunks:
    print(f"File: {chunk['file_path']}")
    print(f"Type: {chunk['type']}")
    print(f"Content length: {len(chunk['content'])}")

Performance Considerations

The Semantic Chunker has been optimized for performance:

  1. Lazy Evaluation: Files are only processed when needed
  2. Caching: Analysis results are cached to avoid redundant processing
  3. Parallel Processing: File analysis can be parallelized
  4. Incremental Updates: Only changed files are reprocessed

For large repositories, we recommend:

# For large repositories, analyze files in parallel
from concurrent.futures import ProcessPoolExecutor

with ProcessPoolExecutor() as executor:
    # Process files in parallel
    results = executor.map(analyzer.analyze_file, files)

    # Collect the results
    for file_path, result in zip(files, results):
        files_ast[file_path] = result
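
Caching (item 2 in the list above) can be layered on top of the analyzer. A minimal sketch that keys cache entries by a hash of the file contents and assumes analysis results are JSON-serializable; the AnalysisCache class and .chunker_cache directory are illustrative names:

import hashlib
import json
import os


class AnalysisCache:
    def __init__(self, cache_dir='.chunker_cache'):
        self.cache_dir = cache_dir
        os.makedirs(cache_dir, exist_ok=True)

    def _key(self, file_path):
        # Hash the file contents so any edit invalidates the cached entry
        with open(file_path, 'rb') as f:
            return hashlib.sha256(f.read()).hexdigest()

    def get_or_analyze(self, analyzer, file_path):
        cache_file = os.path.join(self.cache_dir, self._key(file_path) + '.json')
        if os.path.exists(cache_file):
            with open(cache_file, 'r') as f:
                return json.load(f)  # reuse the cached analysis
        result = analyzer.analyze_file(file_path)
        with open(cache_file, 'w') as f:
            json.dump(result, f)
        return result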

Limitations and Future Work

Current limitations include:

  1. Language Support: Limited to Python, JavaScript, Java, and C/C++
  2. Complex References: Some complex reference patterns may be missed
  3. Dynamic Features: Dynamic language features are challenging to analyze
  4. Large Repository Performance: Analysis of very large repositories can still be slow, even with the optimizations above

Future work includes:

  1. More Languages: Support for more programming languages
  2. Better Symbol Resolution: Improved cross-file symbol resolution
  3. Semantic Understanding: Deeper semantic understanding of code
  4. Machine Learning Enhancements: Using ML to improve relevance scoring

Conclusion

Our Semantic Chunker provides a practical implementation of the chunking patterns we've observed in Claude CLI. While not as sophisticated as Claude CLI's implementation, it demonstrates the core principles and can be a valuable tool for processing large codebases efficiently.