Semantic Chunker Implementation
Our Semantic Chunker is an experimental implementation based on our study of how Claude CLI processes large codebases. This standalone component can be used to intelligently divide codebases into meaningful chunks while preserving semantic relationships.
Overview
The Semantic Chunker is designed to:
- Analyze code repositories to find semantically meaningful units
- Identify relationships between code components
- Select the most relevant code based on a query
- Optimize chunks for token efficiency
This implementation represents our current understanding of how Claude CLI might approach the problem of processing large codebases efficiently.
Architecture
The Semantic Chunker consists of several key components:
```
┌───────────────────────────┐
│    Repository Scanner     │
└───────────┬───────────────┘
            │
┌───────────▼───────────────┐
│       AST Analyzer        │
└───────────┬───────────────┘
            │
┌───────────▼───────────────┐
│     Reference Tracker     │
└───────────┬───────────────┘
            │
┌───────────▼───────────────┐
│     Relevance Scorer      │
└───────────┬───────────────┘
            │
┌───────────▼───────────────┐
│      Chunk Generator      │
└───────────────────────────┘
```
Key Components
1. Repository Scanner
The Repository Scanner traverses the codebase:
```python
import os

class RepositoryScanner:
    def scan_repository(self, repo_path):
        all_files = []
        # Walk through the repository
        for root, dirs, files in os.walk(repo_path):
            # Skip hidden directories
            dirs[:] = [d for d in dirs if not d.startswith('.')]
            # Collect code files
            for file in files:
                if self._is_code_file(file):
                    file_path = os.path.join(root, file)
                    all_files.append(file_path)
        return all_files

    def _is_code_file(self, filename):
        # Check whether the file extension indicates source code
        code_extensions = {'.py', '.js', '.ts', '.java', '.c', '.cpp', '.go', '.rb'}
        return any(filename.endswith(ext) for ext in code_extensions)
```
2. AST Analyzer
The AST Analyzer parses code into abstract syntax trees:
```python
import ast

class AstAnalyzer:
    def analyze_file(self, file_path):
        with open(file_path, 'r') as f:
            content = f.read()
        # Parse based on file type
        if file_path.endswith('.py'):
            return self._analyze_python(content)
        elif file_path.endswith('.js') or file_path.endswith('.ts'):
            return self._analyze_javascript(content)
        # Additional language support...

    def _analyze_python(self, content):
        # Parse Python code into an AST
        tree = ast.parse(content)
        # Extract semantic units
        semantic_units = []
        # Find classes and top-level functions
        for node in ast.walk(tree):
            if isinstance(node, ast.ClassDef):
                semantic_units.append(self._extract_class(node, content))
            elif isinstance(node, ast.FunctionDef):
                if not self._is_method(node):  # Skip methods (they're part of classes)
                    semantic_units.append(self._extract_function(node, content))
        # Find imports
        imports = self._extract_imports(tree, content)
        return {
            'imports': imports,
            'semantic_units': semantic_units
        }
```
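The `_extract_imports` helper (along with `_is_method` and `_extract_function`) is left out above. A minimal sketch of `_extract_imports`, assuming Python 3.8+ so that `ast.get_source_segment` is available and that the Reference Tracker expects the raw source text of each import statement:

```python
def _extract_imports(self, tree, content):
    # Collect the source text of import statements, e.g. "import os"
    # or "from pkg.module import name" (the format _resolve_import expects)
    imports = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            source = ast.get_source_segment(content, node)
            if source:
                imports.append(source)
    return imports
```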
3. Reference Tracker
The Reference Tracker identifies relationships between code components:
```python
import networkx as nx

class ReferenceTracker:
    def __init__(self):
        self.reference_graph = nx.DiGraph()  # Using NetworkX for the graph

    def build_reference_graph(self, files_ast):
        # Add all files as nodes
        for file_path in files_ast:
            self.reference_graph.add_node(file_path)
        # Add edges for imports and references
        for file_path, ast_info in files_ast.items():
            # Process imports
            for imp in ast_info['imports']:
                imported_file = self._resolve_import(imp, file_path)
                if imported_file and imported_file in files_ast:
                    self.reference_graph.add_edge(file_path, imported_file)
            # Process references to other components
            for unit in ast_info['semantic_units']:
                references = unit.get('references', [])
                for ref in references:
                    ref_file = self._find_reference_file(ref, files_ast)
                    if ref_file:
                        self.reference_graph.add_edge(file_path, ref_file)
```
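`_resolve_import` is shown later under Implementation Details, but `_find_reference_file` is not. One possible sketch, assuming a reference is matched by name against the semantic units recorded for each file:

```python
def _find_reference_file(self, ref, files_ast):
    # Return the first file that defines a semantic unit (class or
    # function) whose name matches the referenced symbol, if any
    for file_path, ast_info in files_ast.items():
        for unit in ast_info['semantic_units']:
            if unit.get('name') == ref:
                return file_path
    return None
```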
4. Relevance Scorer
The Relevance Scorer assesses the relevance of code to a query:
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

class RelevanceScorer:
    def __init__(self):
        self.vectorizer = TfidfVectorizer()

    def score_files(self, files, query, file_contents):
        # Prepare the corpus
        file_paths = list(files)
        corpus = [file_contents[path] for path in file_paths]
        # Fit the vectorizer on the corpus
        self.vectorizer.fit(corpus)
        # Vectorize the query and the documents
        query_vector = self.vectorizer.transform([query])
        doc_vectors = self.vectorizer.transform(corpus)
        # Calculate cosine similarity scores
        similarity_scores = cosine_similarity(query_vector, doc_vectors)[0]
        # Create the scored file dictionary
        scored_files = {
            file_paths[i]: similarity_scores[i]
            for i in range(len(file_paths))
        }
        return scored_files

    def prioritize_files(self, scored_files, reference_graph):
        # Enhance scores based on the reference graph:
        # files with many incoming references are more important
        enhanced_scores = scored_files.copy()
        for file in scored_files:
            incoming = reference_graph.in_degree(file)
            enhanced_scores[file] *= (1 + 0.1 * incoming)  # Boost score by 10% per reference
        return enhanced_scores
```
5. Chunk Generator
The Chunk Generator creates optimized code chunks:
```python
class ChunkGenerator:
    def generate_chunks(self, files_ast, scored_files, token_budget):
        # Sort files by score, highest first
        sorted_files = sorted(scored_files.items(), key=lambda x: x[1], reverse=True)
        chunks = []
        current_token_count = 0
        # Process files in order of relevance
        for file_path, score in sorted_files:
            ast_info = files_ast[file_path]
            # Always include imports first
            import_chunk = self._create_chunk(file_path, 'imports', ast_info['imports'])
            import_token_count = self._count_tokens(import_chunk['content'])
            if current_token_count + import_token_count <= token_budget:
                chunks.append(import_chunk)
                current_token_count += import_token_count
            # Then include semantic units in order of relevance
            for unit in self._sort_units_by_relevance(ast_info['semantic_units'], scored_files):
                unit_chunk = self._create_chunk(file_path, unit['type'], unit['content'])
                unit_token_count = self._count_tokens(unit_chunk['content'])
                if current_token_count + unit_token_count <= token_budget:
                    chunks.append(unit_chunk)
                    current_token_count += unit_token_count
                else:
                    # This unit would exceed the token budget; move on to the next file
                    break
        return chunks
```
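The helpers `_create_chunk`, `_count_tokens`, and `_sort_units_by_relevance` are not shown. A rough sketch of the first two, assuming chunks are plain dictionaries (matching the keys used in the usage example below) and tokens are approximated at roughly four characters apiece rather than with a real tokenizer:

```python
def _create_chunk(self, file_path, chunk_type, content):
    # Package a piece of code with enough metadata for downstream use
    if isinstance(content, list):
        content = '\n'.join(str(item) for item in content)
    return {
        'file_path': file_path,
        'type': chunk_type,
        'content': content,
    }

def _count_tokens(self, text):
    # Rough approximation: ~4 characters per token; an exact tokenizer
    # could be swapped in here without touching the rest of the class
    return max(1, len(text) // 4)
```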
Implementation Details
AST Processing
Our implementation uses language-specific AST parsers:
- Python: the built-in `ast` module
- JavaScript/TypeScript: the `esprima` parser
- Java: the `javalang` parser
- C/C++: `pycparser`
Example AST processing for Python:
```python
def _extract_class(self, node, content):
    start_line = node.lineno
    end_line = self._find_end_line(node, content)
    # Collect the names of the class's methods
    methods = []
    for child in ast.iter_child_nodes(node):
        if isinstance(child, ast.FunctionDef):
            methods.append(child.name)
    # Extract the source code for the class
    source_lines = content.splitlines()[start_line - 1:end_line]
    class_source = '\n'.join(source_lines)
    return {
        'type': 'class',
        'name': node.name,
        'content': class_source,
        'methods': methods,
        'references': self._find_references(node, content)
    }
```
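The JavaScript/TypeScript path (`_analyze_javascript`) is omitted above. A rough sketch using the `esprima` Python port, under the assumption that only top-level functions, classes, and imports are extracted and TypeScript-specific syntax is not handled:

```python
import esprima

def _analyze_javascript(self, content):
    # Parse with esprima and pull out top-level functions and classes
    tree = esprima.parseModule(content, {'loc': True})
    semantic_units = []
    imports = []
    lines = content.splitlines()
    for node in tree.body:
        if node.type in ('FunctionDeclaration', 'ClassDeclaration'):
            start, end = node.loc.start.line, node.loc.end.line
            semantic_units.append({
                'type': 'function' if node.type == 'FunctionDeclaration' else 'class',
                'name': node.id.name if node.id else '<anonymous>',
                'content': '\n'.join(lines[start - 1:end]),
                'references': [],
            })
        elif node.type == 'ImportDeclaration':
            # Module specifier only; resolving JS imports would need its
            # own logic, distinct from the Python _resolve_import below
            imports.append(node.source.value)
    return {'imports': imports, 'semantic_units': semantic_units}
```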
Reference Tracking
Reference tracking includes:
- Import Tracking: Follows import statements to build the dependency graph
- Symbol Resolution: Resolves symbol references across files
- Interface Implementation: Identifies implementations of interfaces/abstract classes
- Inheritance Tracking: Follows inheritance hierarchies
Example reference tracking for Python imports:
```python
def _resolve_import(self, import_statement, importing_file):
    # Handle the two import formats
    if import_statement.startswith('from '):
        # from X import Y
        parts = import_statement.split(' import ')
        module_path = parts[0][5:]   # Remove 'from '
    else:
        # import X
        module_path = import_statement[7:]  # Remove 'import '
    # Convert the module path to candidate file paths
    possible_file_paths = self._module_to_file_paths(module_path, importing_file)
    # Return the first candidate that exists on disk
    for path in possible_file_paths:
        if os.path.exists(path):
            return path
    return None
```
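`_module_to_file_paths` is referenced above but not shown. A sketch under the assumption that the tracker knows the repository root (stored here as a hypothetical `self.repo_root`) and that modules map either to `.py` files or to packages:

```python
import os

def _module_to_file_paths(self, module_path, importing_file):
    # "package.module" -> candidates such as "package/module.py" or
    # "package/module/__init__.py", tried relative to the importing
    # file's directory and to the repository root
    relative = module_path.replace('.', os.sep)
    bases = [os.path.dirname(importing_file), self.repo_root]
    candidates = []
    for base in bases:
        candidates.append(os.path.join(base, relative + '.py'))
        candidates.append(os.path.join(base, relative, '__init__.py'))
    return candidates
```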
Relevance Scoring
Our relevance scoring combines:
- TF-IDF Similarity: Measures content relevance to the query
- Reference Importance: Prioritizes files with many references
- Recency: Prioritizes recently modified files
- Structure Significance: Prioritizes key structural files (main, entry points)
Example scoring combination:
```python
def combine_scores(self, tfidf_scores, reference_scores, recency_scores):
    combined_scores = {}
    for file in tfidf_scores:
        # Weighted combination of scores
        combined_scores[file] = (
            0.6 * tfidf_scores[file] +      # Content relevance is most important
            0.3 * reference_scores[file] +  # References are quite important
            0.1 * recency_scores[file]      # Recency is less important
        )
    return combined_scores
```
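The recency scores used above are not computed anywhere in these snippets. One possible helper (hypothetical, not part of the classes above), assuming scores decay smoothly with a file's modification age:

```python
import os
import time

def compute_recency_scores(files):
    # Map each file to a score in (0, 1]: recently modified files score
    # near 1, files untouched for a long time trend toward 0
    now = time.time()
    scores = {}
    for file_path in files:
        age_days = (now - os.path.getmtime(file_path)) / 86400
        scores[file_path] = 1.0 / (1.0 + age_days / 30.0)
    return scores
```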
Token Optimization
Our token optimization includes:
- Whitespace Normalization: Reducing excessive whitespace
- Comment Handling: Preserving essential comments, removing others
- Duplication Avoidance: Skipping content that has already been included
- Structure Preservation: Ensuring semantic units stay intact
Example whitespace normalization:
```python
import re

def _normalize_whitespace(self, content):
    # Collapse runs of blank lines into a single blank line
    normalized = re.sub(r'\n\s*\n', '\n\n', content)
    # Remove trailing whitespace on each line
    normalized = re.sub(r'[ \t]+$', '', normalized, flags=re.MULTILINE)
    return normalized
```
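Duplication avoidance is listed above but not illustrated. A minimal sketch, assuming near-identical units are detected by hashing their whitespace-normalized content (`_is_duplicate` and `seen_hashes` are hypothetical names):

```python
import hashlib

def _is_duplicate(self, content, seen_hashes):
    # Hash the normalized content; if an identical unit has already been
    # emitted, skip it rather than spend tokens on it twice
    normalized = self._normalize_whitespace(content)
    digest = hashlib.sha256(normalized.encode('utf-8')).hexdigest()
    if digest in seen_hashes:
        return True
    seen_hashes.add(digest)
    return False
```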
Usage Example
Using the Semantic Chunker in a project:
```python
# Initialize components
scanner = RepositoryScanner()
analyzer = AstAnalyzer()
tracker = ReferenceTracker()
scorer = RelevanceScorer()
generator = ChunkGenerator()

# Scan the repository
repo_path = "/path/to/repository"
files = scanner.scan_repository(repo_path)

# Analyze files
files_ast = {}
file_contents = {}
for file_path in files:
    try:
        files_ast[file_path] = analyzer.analyze_file(file_path)
        with open(file_path, 'r') as f:
            file_contents[file_path] = f.read()
    except Exception as e:
        print(f"Error analyzing {file_path}: {e}")

# Build the reference graph
tracker.build_reference_graph(files_ast)

# Score only the successfully analyzed files against a query
query = "How does the authentication system work?"
scored_files = scorer.score_files(files_ast.keys(), query, file_contents)
enhanced_scores = scorer.prioritize_files(scored_files, tracker.reference_graph)

# Generate chunks within a token budget
token_budget = 50000
chunks = generator.generate_chunks(files_ast, enhanced_scores, token_budget)

# Use the chunks
for chunk in chunks:
    print(f"File: {chunk['file_path']}")
    print(f"Type: {chunk['type']}")
    print(f"Content length: {len(chunk['content'])}")
```
Performance Considerations
The Semantic Chunker has been optimized for performance:
- Lazy Evaluation: Files are only processed when needed
- Caching: Analysis results are cached to avoid redundant processing
- Parallel Processing: File analysis can be parallelized
- Incremental Updates: Only changed files are reprocessed
For large repositories, we recommend:
```python
# For large repositories, analyze files in parallel
from concurrent.futures import ProcessPoolExecutor

with ProcessPoolExecutor() as executor:
    # Process files in parallel worker processes
    results = executor.map(analyzer.analyze_file, files)
    # Collect results in the same order as the input files
    for file_path, result in zip(files, results):
        files_ast[file_path] = result
```
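Caching and incremental updates are listed above but not shown. A simple sketch (the `CachedAnalyzer` wrapper is hypothetical), assuming analysis results are keyed by file path and modification time so only changed files are re-analyzed:

```python
import os

class CachedAnalyzer:
    # Hypothetical wrapper: reuses a previous analysis unless the file
    # has been modified since the result was cached
    def __init__(self, analyzer):
        self.analyzer = analyzer
        self._cache = {}  # file_path -> (mtime, analysis result)

    def analyze_file(self, file_path):
        mtime = os.path.getmtime(file_path)
        cached = self._cache.get(file_path)
        if cached and cached[0] == mtime:
            return cached[1]
        result = self.analyzer.analyze_file(file_path)
        self._cache[file_path] = (mtime, result)
        return result
```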
Limitations and Future Work
Current limitations include:
- Language Support: Limited to Python, JavaScript, Java, and C/C++
- Complex References: Some complex reference patterns may be missed
- Dynamic Features: Dynamic language features are challenging to analyze
- Large Repository Performance: Very large repositories may require optimization
Future work includes:
- More Languages: Support for more programming languages
- Better Symbol Resolution: Improved cross-file symbol resolution
- Semantic Understanding: Deeper semantic understanding of code
- Machine Learning Enhancements: Using ML to improve relevance scoring
Conclusion
Our Semantic Chunker provides a practical implementation of the chunking patterns we've observed in Claude CLI. While not as sophisticated as Claude CLI's implementation, it demonstrates the core principles and can be a valuable tool for processing large codebases efficiently.