Semantic Chunker Implementation

Our Semantic Chunker is an experimental implementation based on our study of how Claude CLI processes large codebases. This standalone component can be used to intelligently divide codebases into meaningful chunks while preserving semantic relationships.

Overview

The Semantic Chunker is designed to:

  1. Analyze code repositories to find semantically meaningful units
  2. Identify relationships between code components
  3. Select the most relevant code based on a query
  4. Optimize chunks for token efficiency

This implementation represents our current understanding of how Claude CLI might approach the problem of processing large codebases efficiently.

Architecture

The Semantic Chunker consists of several key components:

┌───────────────────────────┐
│    Repository Scanner     │
└─────────────┬─────────────┘

┌─────────────▼─────────────┐
│       AST Analyzer        │
└─────────────┬─────────────┘

┌─────────────▼─────────────┐
│     Reference Tracker     │
└─────────────┬─────────────┘

┌─────────────▼─────────────┐
│     Relevance Scorer      │
└─────────────┬─────────────┘

┌─────────────▼─────────────┐
│      Chunk Generator      │
└───────────────────────────┘

Key Components

1. Repository Scanner

The Repository Scanner traverses the codebase:

import os


class RepositoryScanner:
    def scan_repository(self, repo_path):
        all_files = []

        # Walk through the repository
        for root, dirs, files in os.walk(repo_path):
            # Skip hidden directories
            dirs[:] = [d for d in dirs if not d.startswith('.')]

            # Collect code files
            for file in files:
                if self._is_code_file(file):
                    file_path = os.path.join(root, file)
                    all_files.append(file_path)

        return all_files

    def _is_code_file(self, filename):
        # Check whether the file extension indicates source code
        code_extensions = {'.py', '.js', '.ts', '.java', '.c', '.cpp', '.go', '.rb'}
        return any(filename.endswith(ext) for ext in code_extensions)

2. AST Analyzer

The AST Analyzer parses code into abstract syntax trees:

import ast


class AstAnalyzer:
    def analyze_file(self, file_path):
        with open(file_path, 'r') as f:
            content = f.read()

        # Parse based on file type
        if file_path.endswith('.py'):
            return self._analyze_python(content)
        elif file_path.endswith('.js') or file_path.endswith('.ts'):
            return self._analyze_javascript(content)
        # Additional language support...

    def _analyze_python(self, content):
        # Parse Python code into an AST
        tree = ast.parse(content)

        # Extract semantic units
        semantic_units = []

        # Find classes and functions
        for node in ast.walk(tree):
            if isinstance(node, ast.ClassDef):
                semantic_units.append(self._extract_class(node, content))
            elif isinstance(node, ast.FunctionDef):
                if not self._is_method(node):  # Skip methods (they're part of classes)
                    semantic_units.append(self._extract_function(node, content))

        # Find imports
        imports = self._extract_imports(tree, content)

        return {
            'imports': imports,
            'semantic_units': semantic_units,
        }
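
The _is_method helper used above is not shown. One minimal sketch is to attach parent pointers to the tree right after parsing and treat any function whose enclosing node is a class as a method; the _attach_parents name below is our own illustration, not part of the original design:

def _attach_parents(self, tree):
    # Annotate every node with its parent so _is_method can inspect the
    # enclosing scope (call this once, immediately after ast.parse)
    for parent in ast.walk(tree):
        for child in ast.iter_child_nodes(parent):
            child.parent = parent

def _is_method(self, node):
    # A function whose immediate parent is a class definition is a method
    return isinstance(getattr(node, 'parent', None), ast.ClassDef)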

3. Reference Tracker

The Reference Tracker identifies relationships between code components:

import networkx as nx


class ReferenceTracker:
    def __init__(self):
        self.reference_graph = nx.DiGraph()  # Using NetworkX for the graph

    def build_reference_graph(self, files_ast):
        # Add all files as nodes
        for file_path, ast_info in files_ast.items():
            self.reference_graph.add_node(file_path)

        # Add edges for imports and references
        for file_path, ast_info in files_ast.items():
            # Process imports
            for imp in ast_info['imports']:
                imported_file = self._resolve_import(imp, file_path)
                if imported_file and imported_file in files_ast:
                    self.reference_graph.add_edge(file_path, imported_file)

            # Process references to other components
            for unit in ast_info['semantic_units']:
                references = unit.get('references', [])
                for ref in references:
                    ref_file = self._find_reference_file(ref, files_ast)
                    if ref_file:
                        self.reference_graph.add_edge(file_path, ref_file)
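
The _find_reference_file helper is not defined here (_resolve_import is shown later in this document). A minimal sketch, assuming each semantic unit carries the 'name' field produced by the AST Analyzer:

def _find_reference_file(self, ref_name, files_ast):
    # Return the first file that defines a semantic unit with the referenced name
    for file_path, ast_info in files_ast.items():
        for unit in ast_info.get('semantic_units', []):
            if unit.get('name') == ref_name:
                return file_path
    return None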

4. Relevance Scorer

The Relevance Scorer assesses the relevance of code to a query:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


class RelevanceScorer:
    def __init__(self):
        self.vectorizer = TfidfVectorizer()

    def score_files(self, files, query, file_contents):
        # Prepare the corpus
        file_paths = list(files)
        corpus = [file_contents[path] for path in file_paths]

        # Fit the vectorizer on the corpus
        self.vectorizer.fit(corpus)

        # Vectorize the query and documents
        query_vector = self.vectorizer.transform([query])
        doc_vectors = self.vectorizer.transform(corpus)

        # Calculate similarity scores
        similarity_scores = cosine_similarity(query_vector, doc_vectors)[0]

        # Create a scored file dictionary
        scored_files = {
            file_paths[i]: similarity_scores[i]
            for i in range(len(file_paths))
        }

        return scored_files

    def prioritize_files(self, scored_files, reference_graph):
        # Enhance scores based on the reference graph
        enhanced_scores = scored_files.copy()

        # Files with many incoming references are more important
        for file in scored_files:
            incoming = reference_graph.in_degree(file)
            enhanced_scores[file] *= (1 + 0.1 * incoming)  # Boost the score by 10% per incoming reference

        return enhanced_scores

5. Chunk Generator

The Chunk Generator creates optimized code chunks:

class ChunkGenerator:
    def generate_chunks(self, files_ast, scored_files, token_budget):
        # Sort files by score
        sorted_files = sorted(scored_files.items(), key=lambda x: x[1], reverse=True)

        chunks = []
        current_token_count = 0

        # Process files in order of relevance
        for file_path, score in sorted_files:
            ast_info = files_ast[file_path]

            # Always include imports first
            import_chunk = self._create_chunk(file_path, 'imports', ast_info['imports'])
            import_token_count = self._count_tokens(import_chunk['content'])

            if current_token_count + import_token_count <= token_budget:
                chunks.append(import_chunk)
                current_token_count += import_token_count

            # Then include semantic units in order of relevance
            for unit in self._sort_units_by_relevance(ast_info['semantic_units'], scored_files):
                unit_chunk = self._create_chunk(file_path, unit['type'], unit['content'])
                unit_token_count = self._count_tokens(unit_chunk['content'])

                if current_token_count + unit_token_count <= token_budget:
                    chunks.append(unit_chunk)
                    current_token_count += unit_token_count
                else:
                    # We've reached the token budget
                    break

        return chunks
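
The _create_chunk and _count_tokens helpers are assumed rather than shown. A minimal sketch that represents chunks as plain dictionaries (matching the keys used in the usage example later in this document) and approximates token counts with a characters-per-token heuristic:

def _create_chunk(self, file_path, chunk_type, content):
    # Package a piece of code together with metadata for downstream use
    if isinstance(content, list):  # imports may arrive as a list of statements
        content = '\n'.join(content)
    return {
        'file_path': file_path,
        'type': chunk_type,
        'content': content,
    }

def _count_tokens(self, text):
    # Rough heuristic: roughly four characters per token for source code;
    # a real tokenizer would give a more precise count
    return max(1, len(text) // 4)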

Implementation Details

AST Processing

Our implementation uses language-specific AST parsers:

  • Python: Uses the built-in ast module
  • JavaScript/TypeScript: Uses the esprima parser (a sketch follows this list)
  • Java: Uses javalang parser
  • C/C++: Uses pycparser
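
For comparison with the Python example below, here is a minimal sketch of how _analyze_javascript might use the esprima package (the Python port). It only collects top-level functions, classes, and imports, and leaves the references list empty:

import esprima


def _analyze_javascript(self, content):
    # Parse the module; the loc option attaches start/end line numbers to nodes
    tree = esprima.parseModule(content, {'loc': True})

    lines = content.splitlines()
    imports = []
    semantic_units = []

    for node in tree.body:
        if node.type in ('FunctionDeclaration', 'ClassDeclaration'):
            start, end = node.loc.start.line, node.loc.end.line
            semantic_units.append({
                'type': 'function' if node.type == 'FunctionDeclaration' else 'class',
                'name': node.id.name if node.id else '<anonymous>',
                'content': '\n'.join(lines[start - 1:end]),
                'references': [],
            })
        elif node.type == 'ImportDeclaration':
            imports.append(lines[node.loc.start.line - 1])

    return {'imports': imports, 'semantic_units': semantic_units}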

Example AST processing for Python:

def _extract_class(self, node, content):
    start_line = node.lineno
    end_line = self._find_end_line(node, content)

    methods = []
    for child in ast.iter_child_nodes(node):
        if isinstance(child, ast.FunctionDef):
            methods.append(child.name)

    # Extract the source code for the class
    source_lines = content.splitlines()[start_line - 1:end_line]
    class_source = '\n'.join(source_lines)

    return {
        'type': 'class',
        'name': node.name,
        'content': class_source,
        'methods': methods,
        'references': self._find_references(node, content),
    }
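
The _find_end_line helper is not shown. On Python 3.8+ the parser records end positions directly on each node, so a minimal sketch only needs a fallback for older trees:

def _find_end_line(self, node, content):
    # Python 3.8+ records the end line directly on the node
    if getattr(node, 'end_lineno', None) is not None:
        return node.end_lineno

    # Fallback: the last line number seen anywhere inside the node
    return max(
        (child.lineno for child in ast.walk(node) if hasattr(child, 'lineno')),
        default=node.lineno,
    )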

Reference Tracking

Reference tracking includes:

  1. Import Tracking: Follows import statements to build the dependency graph
  2. Symbol Resolution: Resolves symbol references across files
  3. Interface Implementation: Identifies implementations of interfaces/abstract classes
  4. Inheritance Tracking: Follows inheritance hierarchies

Example reference tracking for Python imports:

def _resolve_import(self, import_statement, importing_file):
    # Handle the different import formats
    if import_statement.startswith('from '):
        # from X import Y
        parts = import_statement.split(' import ')
        module_path = parts[0][5:]  # Remove 'from '
    else:
        # import X
        module_path = import_statement[7:]  # Remove 'import '

    # Convert the module path to candidate file paths
    possible_file_paths = self._module_to_file_paths(module_path, importing_file)

    # Return the first candidate path that exists
    for path in possible_file_paths:
        if os.path.exists(path):
            return path

    return None
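
Inheritance tracking (item 4 in the list above) can feed the same reference graph. A sketch of how base-class names might be extracted from an ast.ClassDef node and appended to a unit's references; the _extract_base_classes name is ours:

def _extract_base_classes(self, class_node):
    # Collect base-class names so they can be added to the unit's
    # 'references' and resolved like any other cross-file reference
    bases = []
    for base in class_node.bases:
        if isinstance(base, ast.Name):         # class Foo(Base)
            bases.append(base.id)
        elif isinstance(base, ast.Attribute):  # class Foo(pkg.Base)
            bases.append(base.attr)
    return bases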

Relevance Scoring

Our relevance scoring combines:

  1. TF-IDF Similarity: Measures content relevance to the query
  2. Reference Importance: Prioritizes files with many references
  3. Recency: Prioritizes recently modified files
  4. Structure Significance: Prioritizes key structural files (main, entry points)

Example scoring combination:

def combine_scores(self, tfidf_scores, reference_scores, recency_scores):
    combined_scores = {}

    for file in tfidf_scores:
        # Weighted combination of scores
        combined_scores[file] = (
            0.6 * tfidf_scores[file] +      # Content relevance is most important
            0.3 * reference_scores[file] +  # References are quite important
            0.1 * recency_scores[file]      # Recency is less important
        )

    return combined_scores
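
The recency component (item 3 above) is not shown in the scorer. A minimal sketch based on file modification times, using an assumed 30-day half-life:

import os
import time


def score_recency(self, files):
    # Files modified recently score close to 1.0; the score halves every 30 days
    now = time.time()
    scores = {}
    for file in files:
        age_days = (now - os.path.getmtime(file)) / 86400
        scores[file] = 0.5 ** (age_days / 30)
    return scores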

Token Optimization

Our token optimization includes:

  1. Whitespace Normalization: Reducing excessive whitespace
  2. Comment Handling: Preserving essential comments, removing others
  3. Duplication Avoidance: Skipping content that closely duplicates chunks already selected
  4. Structure Preservation: Ensuring semantic units stay intact

Example whitespace normalization:

def _normalize_whitespace(self, content):
    # Replace multiple blank lines with a single blank line
    normalized = re.sub(r'\n\s*\n', '\n\n', content)

    # Remove trailing whitespace
    normalized = re.sub(r'[ \t]+$', '', normalized, flags=re.MULTILINE)

    return normalized
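
Comment handling (item 2 above) can follow the same pattern. A rough, line-based sketch that drops full-line '#' comments unless they carry markers we assume are worth keeping; it deliberately leaves inline comments and docstrings alone:

def _strip_noncritical_comments(self, content):
    # Keep comments that usually carry meaning; drop the rest.
    # A line-based check is a heuristic and does not parse string literals.
    keep_markers = ('TODO', 'FIXME', 'NOTE', 'noqa', 'type:')
    kept_lines = []
    for line in content.splitlines():
        stripped = line.strip()
        if stripped.startswith('#') and not any(m in stripped for m in keep_markers):
            continue
        kept_lines.append(line)
    return '\n'.join(kept_lines)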

Usage Example

Using the Semantic Chunker in a project:

# Initialize components
scanner = RepositoryScanner()
analyzer = AstAnalyzer()
tracker = ReferenceTracker()
scorer = RelevanceScorer()
generator = ChunkGenerator()

# Scan the repository
repo_path = "/path/to/repository"
files = scanner.scan_repository(repo_path)

# Analyze files
files_ast = {}
file_contents = {}
for file_path in files:
    try:
        files_ast[file_path] = analyzer.analyze_file(file_path)
        with open(file_path, 'r') as f:
            file_contents[file_path] = f.read()
    except Exception as e:
        print(f"Error analyzing {file_path}: {e}")

# Build the reference graph
tracker.build_reference_graph(files_ast)

# Score files based on a query (only files that were analyzed successfully)
query = "How does the authentication system work?"
scored_files = scorer.score_files(list(file_contents), query, file_contents)
enhanced_scores = scorer.prioritize_files(scored_files, tracker.reference_graph)

# Generate chunks
token_budget = 50000
chunks = generator.generate_chunks(files_ast, enhanced_scores, token_budget)

# Use the chunks
for chunk in chunks:
    print(f"File: {chunk['file_path']}")
    print(f"Type: {chunk['type']}")
    print(f"Content length: {len(chunk['content'])}")

Performance Considerations

The Semantic Chunker has been optimized for performance:

  1. Lazy Evaluation: Files are only processed when needed
  2. Caching: Analysis results are cached to avoid redundant processing
  3. Parallel Processing: File analysis can be parallelized
  4. Incremental Updates: Only changed files are reprocessed

For large repositories, we recommend:

# For large repositories, analyze files in parallel
from concurrent.futures import ProcessPoolExecutor

with ProcessPoolExecutor() as executor:
    # Process files in parallel
    results = executor.map(analyzer.analyze_file, files)

    # Collect the results
    for file_path, result in zip(files, results):
        files_ast[file_path] = result
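
Caching (item 2 in the list above) can be layered on top of the analyzer. A minimal sketch that keys cache entries by a hash of the file contents and assumes analysis results are JSON-serializable; the AnalysisCache class and .chunker_cache directory are illustrative names:

import hashlib
import json
import os


class AnalysisCache:
    def __init__(self, cache_dir='.chunker_cache'):
        self.cache_dir = cache_dir
        os.makedirs(cache_dir, exist_ok=True)

    def _key(self, file_path):
        # Hash the file contents so any edit invalidates the cached entry
        with open(file_path, 'rb') as f:
            return hashlib.sha256(f.read()).hexdigest()

    def get_or_analyze(self, analyzer, file_path):
        cache_file = os.path.join(self.cache_dir, self._key(file_path) + '.json')
        if os.path.exists(cache_file):
            with open(cache_file, 'r') as f:
                return json.load(f)  # reuse the cached analysis
        result = analyzer.analyze_file(file_path)
        with open(cache_file, 'w') as f:
            json.dump(result, f)
        return result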

Limitations and Future Work

Current limitations include:

  1. Language Support: Limited to Python, JavaScript, Java, and C/C++
  2. Complex References: Some complex reference patterns may be missed
  3. Dynamic Features: Dynamic language features are challenging to analyze
  4. Large Repository Performance: Analysis of very large repositories can still be slow, even with the optimizations above

Future work includes:

  1. More Languages: Support for more programming languages
  2. Better Symbol Resolution: Improved cross-file symbol resolution
  3. Semantic Understanding: Deeper semantic understanding of code
  4. Machine Learning Enhancements: Using ML to improve relevance scoring

Conclusion

Our Semantic Chunker provides a practical implementation of the chunking patterns we've observed in Claude CLI. While not as sophisticated as Claude CLI's implementation, it demonstrates the core principles and can be a valuable tool for processing large codebases efficiently.