Claude 3.7 Sonnet Demo: Understanding Large Codebases

This interactive demo showcases how Claude 3.7 Sonnet can comprehend and analyze large codebases, providing developers with powerful tools for code navigation, understanding, and documentation.

Introduction

Claude 3.7 Sonnet can analyze and reason about entire codebases at a scale that was previously impractical. This demo will guide you through:

  1. How to prepare your codebase for analysis
  2. Effective semantic chunking techniques
  3. Crafting prompts for code understanding
  4. Real-world examples and use cases

Why Claude 3.7 Sonnet Excels at Code Understanding

Claude 3.7 Sonnet offers several advantages for code analysis:

  • 200K context window: Able to process large portions of code at once (see the token-estimate sketch after this list)
  • Advanced semantic understanding: Comprehends relationships between different code components
  • Language-agnostic: Works across multiple programming languages
  • Excellent code generation: Can create, modify, and explain code with high accuracy
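
A quick way to gauge whether a codebase will fit in the context window is a rough token estimate. The sketch below uses the common "about 4 characters per token" heuristic, which is an approximation rather than Claude's actual tokenizer; the helper names and extension list are illustrative:

import os

def estimate_tokens(text: str) -> int:
    """Rough token count using the ~4 characters-per-token heuristic."""
    return len(text) // 4

def estimate_repo_tokens(repo_path: str, extensions=('.py', '.js', '.ts')) -> int:
    """Approximate total tokens for all source files under repo_path."""
    total = 0
    for root, _, files in os.walk(repo_path):
        for name in files:
            if name.endswith(extensions):
                path = os.path.join(root, name)
                with open(path, 'r', encoding='utf-8', errors='ignore') as f:
                    total += estimate_tokens(f.read())
    return total

print(estimate_repo_tokens('.'))  # Compare this against the 200K-token window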

Demo 1: Setting Up Your Environment

Before diving into code analysis, you'll need to set up your environment:

# Install required packages
pip install anthropic requests beautifulsoup4 scikit-learn numpy

# Set your API key
export ANTHROPIC_API_KEY=your_api_key_here

# Create a project directory
mkdir claude-cli-study-demo
cd claude-cli-study-demo

# Create initial files
touch semantic_chunker.py code_analyzer.py sample_code.py requirements.txt
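
The setup above leaves requirements.txt empty; you can record the dependencies there so the environment is reproducible. The version pins below are illustrative, not required:

# requirements.txt (illustrative version pins)
anthropic>=0.18
requests>=2.31
beautifulsoup4>=4.12
scikit-learn>=1.3
numpy>=1.24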

Demo 2: Implementing Semantic Chunking

The key to effective code analysis with Claude 3.7 Sonnet is proper semantic chunking. Let's implement a basic chunker:

import os
import re
import ast
from typing import List, Dict, Any

class SemanticChunker:
    """A class for semantically chunking code files."""

    def __init__(self):
        self.language_handlers = {
            '.py': self._chunk_python,
            '.js': self._chunk_javascript,
            '.ts': self._chunk_typescript,
            '.java': self._chunk_java,
            '.cpp': self._chunk_cpp,
            '.c': self._chunk_c
        }

    def chunk_file(self, file_path):
        """Chunk a file based on its extension."""
        _, ext = os.path.splitext(file_path)

        if ext not in self.language_handlers:
            # Default basic chunking for unsupported languages
            return self._basic_chunk(file_path)

        return self.language_handlers[ext](file_path)

    def _chunk_python(self, file_path):
        """Semantically chunk a Python file using the ast module."""
        with open(file_path, 'r', encoding='utf-8') as f:
            content = f.read()

        try:
            tree = ast.parse(content)
            chunks = []

            # Extract imports as a chunk
            imports = []
            for node in ast.walk(tree):
                if isinstance(node, (ast.Import, ast.ImportFrom)):
                    imports.append(ast.get_source_segment(content, node))

            if imports:
                chunks.append({
                    'file_path': file_path,
                    'type': 'imports',
                    'content': '\n'.join(imports)
                })

            # Extract classes and functions
            for node in tree.body:
                if isinstance(node, ast.ClassDef):
                    chunks.append({
                        'file_path': file_path,
                        'type': 'class',
                        'name': node.name,
                        'content': ast.get_source_segment(content, node)
                    })
                elif isinstance(node, ast.FunctionDef):
                    chunks.append({
                        'file_path': file_path,
                        'type': 'function',
                        'name': node.name,
                        'content': ast.get_source_segment(content, node)
                    })

            return chunks

        except SyntaxError:
            # Fall back to basic chunking if parsing fails
            return self._basic_chunk(file_path)

    def _basic_chunk(self, file_path):
        """Basic chunking for files that can't be parsed."""
        with open(file_path, 'r', encoding='utf-8') as f:
            content = f.read()

        return [{
            'file_path': file_path,
            'type': 'content',
            'content': content
        }]

    # Other language handlers would be implemented similarly
    def _chunk_javascript(self, file_path):
        return self._basic_chunk(file_path)

    def _chunk_typescript(self, file_path):
        return self._basic_chunk(file_path)

    def _chunk_java(self, file_path):
        return self._basic_chunk(file_path)

    def _chunk_cpp(self, file_path):
        return self._basic_chunk(file_path)

    def _chunk_c(self, file_path):
        return self._basic_chunk(file_path)

def process_codebase(repo_path):
    """Process a codebase using semantic chunking."""
    chunker = SemanticChunker()
    chunks = []

    for file_path in get_source_files(repo_path):
        file_chunks = chunker.chunk_file(file_path)
        chunks.extend(file_chunks)

    return chunks

def get_source_files(repo_path):
    """Get all source code files in the repository."""
    source_files = []

    for root, _, files in os.walk(repo_path):
        for file in files:
            # Skip hidden files and directories
            if file.startswith('.'):
                continue

            # Skip common non-source directories
            if any(part.startswith('.') or part in ['node_modules', 'venv', 'build', 'dist']
                   for part in root.split(os.sep)):
                continue

            # Check file extensions for code files
            _, ext = os.path.splitext(file)
            if ext in ['.py', '.js', '.ts', '.jsx', '.tsx', '.java', '.c', '.cpp', '.h', '.hpp', '.cs', '.go', '.rb']:
                file_path = os.path.join(root, file)
                source_files.append(file_path)

    return source_files
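
A quick way to sanity-check the chunker is to run it over a directory and print a summary of each chunk it produced (assuming the code above is saved as semantic_chunker.py; the repository path is a placeholder):

from semantic_chunker import process_codebase

chunks = process_codebase("path/to/your/repo")
for chunk in chunks:
    # Each chunk records its source file, kind, and (for class/function chunks) a name
    name = chunk.get('name', '-')
    print(f"{chunk['file_path']}: {chunk['type']} {name} ({len(chunk['content'])} chars)")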

The Importance of Semantic Boundaries

Unlike naive chunking based on character count, semantic chunking respects code structure:

Naive Chunking

// Chunk 1
class UserService {
  constructor(private userRepo: UserRepository) {}

  async getUserById(id: string): Promise<User | null> {
    return this.userRepo.findById(id);
  }

  async createUser(userData: UserInput): Promise<User> {
    // Validate user data
    if (!userData.email || !userData.name) {
      throw new Error('Invalid us
// Chunk 2
er data');
    }

    const existingUser = await this.userRepo.findByEmail(userData.email);
    if (existingUser) {
      throw new Error('User with this email already exists');
    }

    return this.userRepo.create(userData);
  }
}

Semantic Chunking

// Chunk 1
class UserService {
  constructor(private userRepo: UserRepository) {}

  async getUserById(id: string): Promise<User | null> {
    return this.userRepo.findById(id);
  }
}

// Chunk 2
class UserService {
  // ... constructor and other methods ...

  async createUser(userData: UserInput): Promise<User> {
    // Validate user data
    if (!userData.email || !userData.name) {
      throw new Error('Invalid user data');
    }

    const existingUser = await this.userRepo.findByEmail(userData.email);
    if (existingUser) {
      throw new Error('User with this email already exists');
    }

    return this.userRepo.create(userData);
  }
}
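
Note that the _chunk_python implementation above emits one chunk per top-level class, while this example shows per-method chunks. If large classes blow past your chunk budget, one option is to split them method by method; the helper below is a sketch along those lines (chunk_class_methods is hypothetical, not part of the demo chunker):

import ast

def chunk_class_methods(content: str, class_node: ast.ClassDef, file_path: str) -> list:
    """Split a class into one chunk per method, keeping the class header for context."""
    chunks = []
    for node in class_node.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            chunks.append({
                'file_path': file_path,
                'type': 'method',
                'name': f"{class_node.name}.{node.name}",
                # Prepend the class header so each chunk retains its context
                'content': f"class {class_node.name}:\n    # ... other members ...\n"
                           + ast.get_source_segment(content, node)
            })
    return chunks

You would call this from _chunk_python when a class's source segment exceeds your size threshold.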

Demo 3: Interacting with Claude 3.7 Sonnet

Now that we have our chunked codebase, let's see how to interact with Claude 3.7 Sonnet:

import anthropic
import os
from typing import List, Dict, Any
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# process_codebase comes from the Demo 2 chunker (saved as semantic_chunker.py)
from semantic_chunker import process_codebase

def select_relevant_chunks(chunks: List[Dict[str, Any]], question: str, top_n: int = 5) -> List[Dict[str, Any]]:
    """Select the most relevant chunks based on the question."""
    # Create a combined list of all chunks and the question
    texts = [chunk['content'] for chunk in chunks]
    texts.append(question)

    # Create TF-IDF vectors
    vectorizer = TfidfVectorizer(stop_words='english')
    tfidf_matrix = vectorizer.fit_transform(texts)

    # Calculate similarity between question and each chunk
    question_vector = tfidf_matrix[-1]
    chunk_vectors = tfidf_matrix[:-1]
    similarities = cosine_similarity(question_vector, chunk_vectors).flatten()

    # Get indices of top N most similar chunks
    top_indices = np.argsort(similarities)[-top_n:][::-1]

    return [chunks[i] for i in top_indices]

def create_sample_code():
    """Create a sample code file to demonstrate the chunking."""
    sample_code = '''
import os
import json
from datetime import datetime
from typing import Dict, List, Optional, Union

class UserAuthenticator:
    """Handles user authentication and session management."""

    def __init__(self, config_path: str):
        """Initialize the authenticator with configuration."""
        self.config = self._load_config(config_path)
        self.active_sessions = {}
        self.token_expiry = self.config.get("token_expiry", 3600)  # Default 1 hour

    def _load_config(self, config_path: str) -> Dict:
        """Load configuration from a JSON file."""
        try:
            with open(config_path, 'r') as f:
                return json.load(f)
        except (FileNotFoundError, json.JSONDecodeError) as e:
            print(f"Error loading config: {e}")
            return {"database_path": "users.db", "token_expiry": 3600}

    def authenticate(self, username: str, password: str) -> Optional[str]:
        """Authenticate a user and return a session token."""
        user = self._verify_credentials(username, password)
        if not user:
            return None

        # Generate session token
        token = self._generate_token(user["id"])
        self.active_sessions[token] = {
            "user_id": user["id"],
            "expires_at": datetime.now().timestamp() + self.token_expiry
        }

        return token

    def _verify_credentials(self, username: str, password: str) -> Optional[Dict]:
        """Verify user credentials against the database."""
        # In a real implementation, this would check against a secure database
        # This is a simplified example
        if username == "admin" and password == "secure_password":
            return {"id": "usr_123", "username": "admin", "role": "admin"}
        return None

    def _generate_token(self, user_id: str) -> str:
        """Generate a unique session token."""
        # In a real implementation, this would use a secure method
        # This is a simplified example
        timestamp = datetime.now().timestamp()
        return f"token_{user_id}_{timestamp}"

    def validate_token(self, token: str) -> bool:
        """Validate a session token."""
        if token not in self.active_sessions:
            return False

        session = self.active_sessions[token]
        if session["expires_at"] < datetime.now().timestamp():
            # Token has expired
            del self.active_sessions[token]
            return False

        return True

    def logout(self, token: str) -> bool:
        """Log out a user by invalidating their session token."""
        if token in self.active_sessions:
            del self.active_sessions[token]
            return True
        return False

class UserManager:
    """Manages user accounts and profiles."""

    def __init__(self, database_path: str):
        """Initialize the user manager with a database path."""
        self.database_path = database_path

    def create_user(self, username: str, password: str, email: str) -> bool:
        """Create a new user account."""
        # In a real implementation, this would add to a database
        # This is a simplified example
        print(f"Creating user: {username}, {email}")
        return True

    def update_profile(self, user_id: str, profile_data: Dict) -> bool:
        """Update a user's profile information."""
        # In a real implementation, this would update a database
        # This is a simplified example
        print(f"Updating profile for user {user_id}: {profile_data}")
        return True

    def get_user(self, user_id: str) -> Optional[Dict]:
        """Get user information by ID."""
        # In a real implementation, this would query a database
        # This is a simplified example
        if user_id == "usr_123":
            return {
                "id": "usr_123",
                "username": "admin",
                "email": "admin@example.com",
                "created_at": "2023-01-01T00:00:00Z"
            }
        return None
'''

    with open('sample_code.py', 'w') as f:
        f.write(sample_code)

def get_language_from_path(file_path: str) -> str:
    """Determine language based on file extension."""
    ext = file_path.split('.')[-1].lower()
    language_map = {
        'py': 'python',
        'js': 'javascript',
        'ts': 'typescript',
        'jsx': 'jsx',
        'tsx': 'tsx',
        'java': 'java',
        'cpp': 'cpp',
        'c': 'c',
        'cs': 'csharp',
        'go': 'go',
        'rb': 'ruby',
        'php': 'php',
        'rs': 'rust'
    }
    return language_map.get(ext, '')

def analyze_codebase(chunks: List[Dict[str, Any]], question: str) -> str:
    """Send code chunks to Claude 3.7 Sonnet and get analysis."""
    client = anthropic.Anthropic(api_key=os.environ.get("ANTHROPIC_API_KEY"))

    context = ""
    for chunk in select_relevant_chunks(chunks, question):
        context += f"File: {chunk['file_path']}\n"
        context += f"```{get_language_from_path(chunk['file_path'])}\n"
        context += chunk['content'] + "\n```\n\n"

    message = client.messages.create(
        model="claude-3-7-sonnet-20250219",
        max_tokens=4000,
        temperature=0.0,
        system="You are a senior software engineer with expertise in analyzing and explaining code.",
        messages=[
            {
                "role": "user",
                "content": f"I'm analyzing a codebase and need your help understanding it. Here are relevant parts of the code:\n\n{context}\n\n{question}\n\nPlease analyze the code and provide a detailed answer."
            }
        ]
    )

    return message.content[0].text

def main():
    """Main function to run the demo."""
    print("Claude 3.7 Sonnet Code Understanding Demo")
    print("----------------------------------------")

    # Create sample code file
    print("\nCreating sample code file...")
    create_sample_code()
    print("Sample code created in sample_code.py")

    # Process the codebase
    print("\nProcessing codebase...")
    chunks = process_codebase(".")
    print(f"Found {len(chunks)} code chunks")

    # Ask questions about the code
    questions = [
        "Explain the authentication flow in this code. How are user credentials validated?",
        "What security concerns exist in the current implementation?",
        "How does the token expiration mechanism work?",
        "What improvements would you suggest to make this code more secure?"
    ]

    for i, question in enumerate(questions, 1):
        print(f"\n\nQuestion {i}: {question}")
        print("-" * 80)

        try:
            answer = analyze_codebase(chunks, question)
            print("\nClaude's Analysis:")
            print(answer)
        except Exception as e:
            print(f"Error getting analysis: {e}")
            print("Make sure you've set the ANTHROPIC_API_KEY environment variable correctly.")

# Example usage
if __name__ == "__main__":
    main()
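
Assuming the Demo 2 code is saved as semantic_chunker.py and the block above as code_analyzer.py, you can run the end-to-end demo from the shell:

# Requires ANTHROPIC_API_KEY to be set in the environment
python code_analyzer.py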

Demo 4: Real-World Use Cases

Let's explore some practical applications of Claude 3.7 Sonnet for code understanding.

Architecture Discovery

When joining a new project, Claude can help you understand the overall architecture:

questions = [
"What's the high-level architecture of this application?",
"Identify the main components and how they interact with each other.",
"Where is the entry point of this application?",
"Explain the data flow from user request to database and back."
]

for question in questions:
print(f"\n--- {question} ---\n")
print(analyze_codebase(chunks, question))

Bug Investigation

When tracking down a bug, Claude can help analyze the relevant code:

bug_description = """
We're seeing an intermittent error where users get logged out unexpectedly.
The error seems to occur when multiple requests are made simultaneously.
Error message: "Token validation failed: signature expired"
"""

questions = [
f"Based on this bug description, what could be causing the issue? {bug_description}",
"How is the authentication token being validated in the codebase?",
"Are there any race conditions or thread safety issues in the token validation code?",
"How is token expiration handled in this application?"
]

for question in questions:
print(f"\n--- {question} ---\n")
print(analyze_codebase(chunks, question))

Code Refactoring

Claude can help identify areas for improvement and suggest refactoring strategies:

refactoring_questions = [
"Which parts of this codebase would benefit most from refactoring?",
"Are there any code smells or anti-patterns in the authentication service?",
"How could we improve error handling throughout the application?",
"Suggest ways to make the UserService more testable."
]

for question in refactoring_questions:
print(f"\n--- {question} ---\n")
print(analyze_codebase(chunks, question))

Demo 5: Interactive Web Interface

For a more user-friendly experience, let's create a simple web interface using Streamlit:

# Save this as code_analysis_app.py
import streamlit as st
import os
import sys
import json
import time
import tempfile
import shutil
from pathlib import Path
import anthropic
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import ast
from typing import List, Dict, Any, Optional

# Import our code analysis functions
# Make sure to save the semantic chunker and analysis code first
from semantic_chunker import SemanticChunker, process_codebase, select_relevant_chunks
from sample_code import generate_sample_code

# Set page config
st.set_page_config(
    page_title="Claude 3.7 Sonnet Code Understanding",
    page_icon="🧠",
    layout="wide"
)

st.title("Code Understanding with Claude 3.7 Sonnet")
st.markdown("Analyze and understand code with Claude 3.7 Sonnet's semantic code understanding capabilities")

# Create tabs for different modes
tab1, tab2 = st.tabs(["Analyze Sample Code", "Analyze Custom Code"])

with tab1:
    st.header("Analyze Sample Authentication System")

    # API key input
    api_key = st.text_input("Anthropic API Key", type="password", key="api_key1")

    if not api_key:
        st.warning("Please enter your Anthropic API key to proceed")
    else:
        os.environ["ANTHROPIC_API_KEY"] = api_key

        # Generate sample code if needed
        if st.button("Load Sample Code"):
            with st.spinner("Generating sample code..."):
                # Create a temp directory for sample code
                temp_dir = tempfile.mkdtemp()
                sample_file = Path(temp_dir) / "auth_system.py"

                # Write sample code
                with open(sample_file, "w") as f:
                    f.write(generate_sample_code())

                # Process the codebase
                chunks = process_codebase(temp_dir)
                st.session_state.sample_chunks = chunks
                st.session_state.sample_code_path = temp_dir

                st.success(f"Sample code loaded with {len(chunks)} semantic chunks!")

        # Display sample code
        with st.expander("View Sample Code"):
            st.code(generate_sample_code(), language="python")

        # Only show analysis options if code is loaded
        if "sample_chunks" in st.session_state:
            # Predefined questions
            questions = [
                "Explain the authentication flow in this code",
                "What security concerns exist in the current implementation?",
                "How does the token expiration mechanism work?",
                "What improvements would you suggest for this code?",
                "Custom question..."
            ]

            selected_question = st.selectbox("Select a question", questions)

            if selected_question == "Custom question...":
                question = st.text_area("Enter your question about the code")
            else:
                question = selected_question

            if question and st.button("Analyze Code", key="analyze1"):
                with st.spinner("Claude is analyzing the code..."):
                    try:
                        # Analyze the code
                        client = anthropic.Anthropic(api_key=api_key)

                        # Prepare context
                        context = ""
                        for chunk in select_relevant_chunks(st.session_state.sample_chunks, question):
                            context += f"File: {chunk['file_path']}\n"
                            context += f"```python\n{chunk['content']}\n```\n\n"

                        # Generate response
                        message = client.messages.create(
                            model="claude-3-7-sonnet-20250219",
                            max_tokens=4000,
                            temperature=0.0,
                            system="You are a senior software engineer with expertise in analyzing and explaining code.",
                            messages=[
                                {
                                    "role": "user",
                                    "content": f"I'm analyzing a codebase and need your help understanding it. Here are relevant parts of the code:\n\n{context}\n\n{question}\n\nPlease analyze the code and provide a detailed answer."
                                }
                            ]
                        )

                        st.markdown("### Claude's Analysis")
                        st.markdown(message.content[0].text)

                    except Exception as e:
                        st.error(f"Error analyzing code: {str(e)}")
                        st.info("Make sure your API key is correct and you have the necessary permissions.")

with tab2:
    st.header("Analyze Your Own Code")

    # API key input
    api_key = st.text_input("Anthropic API Key", type="password", key="api_key2")

    if not api_key:
        st.warning("Please enter your Anthropic API key to proceed")
    else:
        os.environ["ANTHROPIC_API_KEY"] = api_key

        # Code input
        user_code = st.text_area("Paste your code here", height=300, placeholder="Paste your Python code here...", key="user_code")

        if user_code and st.button("Process Code"):
            with st.spinner("Processing code..."):
                # Create a temp directory for user code
                temp_dir = tempfile.mkdtemp()
                user_file = Path(temp_dir) / "user_code.py"

                # Write user code
                with open(user_file, "w") as f:
                    f.write(user_code)

                # Process the codebase
                chunks = process_codebase(temp_dir)
                st.session_state.user_chunks = chunks
                st.session_state.user_code_path = temp_dir

                st.success(f"Code processed with {len(chunks)} semantic chunks!")

                # Show chunks
                with st.expander("View Semantic Chunks"):
                    for i, chunk in enumerate(chunks):
                        st.code(chunk['content'], language="python")
                        st.write(f"Chunk Type: {chunk.get('type', 'Unknown')}")
                        st.write("---")

        # Only show analysis options if code is processed
        if "user_chunks" in st.session_state:
            question = st.text_area("What would you like to know about this code?",
                                    "Explain what this code does and how it works.")

            if st.button("Analyze", key="analyze2"):
                with st.spinner("Claude is analyzing the code..."):
                    try:
                        # Analyze the code
                        client = anthropic.Anthropic(api_key=api_key)

                        # Prepare context
                        context = ""
                        for chunk in select_relevant_chunks(st.session_state.user_chunks, question):
                            context += f"File: {chunk['file_path']}\n"
                            context += f"```python\n{chunk['content']}\n```\n\n"

                        # Generate response
                        message = client.messages.create(
                            model="claude-3-7-sonnet-20250219",
                            max_tokens=4000,
                            temperature=0.0,
                            system="You are a senior software engineer with expertise in analyzing and explaining code.",
                            messages=[
                                {
                                    "role": "user",
                                    "content": f"I'm analyzing a codebase and need your help understanding it. Here are relevant parts of the code:\n\n{context}\n\n{question}\n\nPlease analyze the code and provide a detailed answer."
                                }
                            ]
                        )

                        st.markdown("### Claude's Analysis")
                        st.markdown(message.content[0].text)

                    except Exception as e:
                        st.error(f"Error analyzing code: {str(e)}")
                        st.info("Make sure your API key is correct and you have the necessary permissions.")

# Clean up temp directories on exit
def cleanup():
    if "sample_code_path" in st.session_state and os.path.exists(st.session_state.sample_code_path):
        shutil.rmtree(st.session_state.sample_code_path)

    if "user_code_path" in st.session_state and os.path.exists(st.session_state.user_code_path):
        shutil.rmtree(st.session_state.user_code_path)

# Register cleanup
import atexit
atexit.register(cleanup)

To create the supporting modules, save the following files:

semantic_chunker.py:

import os
import re
import ast
from typing import List, Dict, Any
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

class SemanticChunker:
    """A class for semantically chunking code files."""

    def __init__(self):
        self.language_handlers = {
            '.py': self._chunk_python,
            '.js': self._chunk_javascript,
            '.ts': self._chunk_typescript,
            '.java': self._chunk_java,
            '.cpp': self._chunk_cpp,
            '.c': self._chunk_c
        }

    def chunk_file(self, file_path):
        """Chunk a file based on its extension."""
        _, ext = os.path.splitext(file_path)

        if ext not in self.language_handlers:
            # Default basic chunking for unsupported languages
            return self._basic_chunk(file_path)

        return self.language_handlers[ext](file_path)

    def _chunk_python(self, file_path):
        """Semantically chunk a Python file using the ast module."""
        with open(file_path, 'r', encoding='utf-8') as f:
            content = f.read()

        try:
            tree = ast.parse(content)
            chunks = []

            # Extract imports as a chunk
            imports = []
            for node in ast.walk(tree):
                if isinstance(node, (ast.Import, ast.ImportFrom)):
                    imports.append(ast.get_source_segment(content, node))

            if imports:
                chunks.append({
                    'file_path': file_path,
                    'type': 'imports',
                    'content': '\n'.join(imports)
                })

            # Extract classes and functions
            for node in tree.body:
                if isinstance(node, ast.ClassDef):
                    chunks.append({
                        'file_path': file_path,
                        'type': 'class',
                        'name': node.name,
                        'content': ast.get_source_segment(content, node)
                    })
                elif isinstance(node, ast.FunctionDef):
                    chunks.append({
                        'file_path': file_path,
                        'type': 'function',
                        'name': node.name,
                        'content': ast.get_source_segment(content, node)
                    })

            return chunks

        except SyntaxError:
            # Fall back to basic chunking if parsing fails
            return self._basic_chunk(file_path)

    def _basic_chunk(self, file_path):
        """Basic chunking for files that can't be parsed."""
        with open(file_path, 'r', encoding='utf-8') as f:
            content = f.read()

        return [{
            'file_path': file_path,
            'type': 'content',
            'content': content
        }]

    # Other language handlers would be implemented similarly
    def _chunk_javascript(self, file_path):
        return self._basic_chunk(file_path)

    def _chunk_typescript(self, file_path):
        return self._basic_chunk(file_path)

    def _chunk_java(self, file_path):
        return self._basic_chunk(file_path)

    def _chunk_cpp(self, file_path):
        return self._basic_chunk(file_path)

    def _chunk_c(self, file_path):
        return self._basic_chunk(file_path)

def process_codebase(repo_path):
    """Process a codebase using semantic chunking."""
    chunker = SemanticChunker()
    chunks = []

    for file_path in get_source_files(repo_path):
        file_chunks = chunker.chunk_file(file_path)
        chunks.extend(file_chunks)

    return chunks

def get_source_files(repo_path):
    """Get all source code files in the repository."""
    source_files = []

    for root, _, files in os.walk(repo_path):
        for file in files:
            # Skip hidden files and directories
            if file.startswith('.'):
                continue

            # Skip common non-source directories
            if any(part.startswith('.') or part in ['node_modules', 'venv', 'build', 'dist']
                   for part in root.split(os.sep)):
                continue

            # Check file extensions for code files
            _, ext = os.path.splitext(file)
            if ext in ['.py', '.js', '.ts', '.jsx', '.tsx', '.java', '.c', '.cpp', '.h', '.hpp', '.cs', '.go', '.rb']:
                file_path = os.path.join(root, file)
                source_files.append(file_path)

    return source_files

def select_relevant_chunks(chunks: List[Dict[str, Any]], question: str, top_n: int = 5) -> List[Dict[str, Any]]:
    """Select the most relevant chunks based on the question."""
    if not chunks:
        return []

    # Create a combined list of all chunks and the question
    texts = [chunk['content'] for chunk in chunks]
    texts.append(question)

    # Create TF-IDF vectors
    vectorizer = TfidfVectorizer(stop_words='english')
    tfidf_matrix = vectorizer.fit_transform(texts)

    # Calculate similarity between question and each chunk
    question_vector = tfidf_matrix[-1]
    chunk_vectors = tfidf_matrix[:-1]
    similarities = cosine_similarity(question_vector, chunk_vectors).flatten()

    # Get indices of top N most similar chunks
    top_n = min(top_n, len(chunks))  # Make sure we don't ask for more chunks than exist
    top_indices = np.argsort(similarities)[-top_n:][::-1]

    return [chunks[i] for i in top_indices]

sample_code.py:

def generate_sample_code():
    """Generate sample authentication code for demonstration."""
    return '''
import os
import json
import hashlib
import secrets
from datetime import datetime
from typing import Dict, List, Optional, Union

class UserAuthenticator:
    """Handles user authentication and session management."""

    def __init__(self, config_path: str):
        """Initialize the authenticator with configuration."""
        self.config = self._load_config(config_path)
        self.active_sessions = {}
        self.token_expiry = self.config.get("token_expiry", 3600)  # Default 1 hour
        self.max_failed_attempts = self.config.get("max_failed_attempts", 5)
        self.lockout_period = self.config.get("lockout_period", 3600)  # Default 1 hour
        self.failed_attempts = {}

    def _load_config(self, config_path: str) -> Dict:
        """Load configuration from a JSON file."""
        try:
            with open(config_path, 'r') as f:
                return json.load(f)
        except (FileNotFoundError, json.JSONDecodeError) as e:
            print(f"Error loading config: {e}")
            return {"database_path": "users.db", "token_expiry": 3600}

    def authenticate(self, username: str, password: str) -> Optional[str]:
        """Authenticate a user and return a session token."""
        # Check if user is locked out
        if self._is_locked_out(username):
            print(f"User {username} is locked out due to too many failed attempts")
            return None

        user = self._verify_credentials(username, password)
        if not user:
            self._record_failed_attempt(username)
            return None

        # Reset failed attempts on successful login
        if username in self.failed_attempts:
            del self.failed_attempts[username]

        # Generate session token
        token = self._generate_token(user["id"])
        self.active_sessions[token] = {
            "user_id": user["id"],
            "expires_at": datetime.now().timestamp() + self.token_expiry,
            "user_agent": os.environ.get("HTTP_USER_AGENT", "unknown")
        }

        return token

    def _is_locked_out(self, username: str) -> bool:
        """Check if a user is locked out due to too many failed attempts."""
        if username not in self.failed_attempts:
            return False

        attempts = self.failed_attempts[username]
        if len(attempts) < self.max_failed_attempts:
            return False

        # Check if the lockout period has expired for the oldest attempt
        oldest_attempt = min(attempts)
        lockout_expiry = oldest_attempt + self.lockout_period

        return datetime.now().timestamp() < lockout_expiry

    def _record_failed_attempt(self, username: str) -> None:
        """Record a failed authentication attempt."""
        if username not in self.failed_attempts:
            self.failed_attempts[username] = []

        # Add current timestamp
        self.failed_attempts[username].append(datetime.now().timestamp())

        # Keep only the most recent max_failed_attempts
        if len(self.failed_attempts[username]) > self.max_failed_attempts:
            self.failed_attempts[username] = self.failed_attempts[username][-self.max_failed_attempts:]

    def _verify_credentials(self, username: str, password: str) -> Optional[Dict]:
        """Verify user credentials against the database."""
        # In a real implementation, this would check against a secure database
        # This is a simplified example
        if username == "admin" and password == "secure_password":
            return {"id": "usr_123", "username": "admin", "role": "admin"}
        return None

    def _generate_token(self, user_id: str) -> str:
        """Generate a unique session token."""
        # In a real implementation, this would use a secure method
        # This is a simplified example using the secrets module
        random_bytes = secrets.token_bytes(32)
        timestamp = str(datetime.now().timestamp())
        token_data = f"{user_id}-{timestamp}-{random_bytes.hex()}"
        return hashlib.sha256(token_data.encode()).hexdigest()

    def validate_token(self, token: str) -> bool:
        """Validate a session token."""
        if token not in self.active_sessions:
            return False

        session = self.active_sessions[token]
        if session["expires_at"] < datetime.now().timestamp():
            # Token has expired
            del self.active_sessions[token]
            return False

        return True

    def logout(self, token: str) -> bool:
        """Log out a user by invalidating their session token."""
        if token in self.active_sessions:
            del self.active_sessions[token]
            return True
        return False

class UserManager:
    """Manages user accounts and profiles."""

    def __init__(self, database_path: str):
        """Initialize the user manager with a database path."""
        self.database_path = database_path

    def create_user(self, username: str, password: str, email: str) -> bool:
        """Create a new user account."""
        # In a real implementation, this would:
        # 1. Hash the password securely with a salt
        # 2. Store the user in a database
        # 3. Validate email format
        # 4. Check for existing users with same username/email

        # This is a simplified example
        print(f"Creating user: {username}, {email}")

        # Hash password - in a real system, use a proper password hashing library
        password_hash = hashlib.sha256(password.encode()).hexdigest()
        print(f"Password hash: {password_hash[:10]}...")

        return True

    def update_profile(self, user_id: str, profile_data: Dict) -> bool:
        """Update a user's profile information."""
        # In a real implementation, this would update a database
        # This is a simplified example
        print(f"Updating profile for user {user_id}: {profile_data}")
        return True

    def get_user(self, user_id: str) -> Optional[Dict]:
        """Get user information by ID."""
        # In a real implementation, this would query a database
        # This is a simplified example
        if user_id == "usr_123":
            return {
                "id": "usr_123",
                "username": "admin",
                "email": "admin@example.com",
                "created_at": "2023-01-01T00:00:00Z"
            }
        return None
'''

# For testing
if __name__ == "__main__":
    print(generate_sample_code())

Run this with streamlit run code_analysis_app.py to launch the web interface.

Best Practices

Here are some tips for getting the most out of Claude 3.7 Sonnet for code understanding:

1. Be Specific in Your Questions

Instead of asking "How does this code work?", ask "How does the authentication middleware validate JWT tokens in this Express.js application?"

2. Provide Relevant Context

Tell Claude about the purpose of the application and any specific terminology or patterns used in the codebase.
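For example, you might fold project background and domain terminology into the system prompt. The sketch below reuses the client from Demo 3; the project details are purely illustrative:

system_prompt = (
    "You are a senior software engineer analyzing an internal codebase. "
    "Context: this is a Flask-based billing service; 'ledger' refers to the "
    "double-entry accounting module, and all monetary amounts are stored in cents."
)

message = client.messages.create(
    model="claude-3-7-sonnet-20250219",
    max_tokens=4000,
    system=system_prompt,
    messages=[{"role": "user", "content": f"{context}\n\n{question}"}]
)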

3. Break Down Complex Codebases

For very large codebases, focus on specific subsystems or features rather than trying to analyze everything at once.
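With the chunker from Demo 2 you can do this mechanically by filtering chunks by path before analysis (the src/auth prefix below is a hypothetical example):

# Restrict the analysis to one subsystem by filtering chunk paths
auth_chunks = [c for c in chunks if 'src/auth' in c['file_path']]
print(analyze_codebase(auth_chunks, "How is session state managed in the auth subsystem?"))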

4. Use Follow-up Questions

Start with high-level questions about architecture, then drill down into specific components or functions.
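The Messages API supports this directly: keep earlier turns in the messages list so Claude retains what it has already explained. A minimal sketch, reusing the client and context from Demo 3:

history = [
    {"role": "user", "content": f"{context}\n\nWhat's the high-level architecture of this code?"}
]
first = client.messages.create(
    model="claude-3-7-sonnet-20250219",
    max_tokens=4000,
    system="You are a senior software engineer with expertise in analyzing and explaining code.",
    messages=history,
)
history.append({"role": "assistant", "content": first.content[0].text})

# Drill down with a follow-up that builds on the previous answer
history.append({"role": "user", "content": "Now explain how token validation fits into that architecture."})
second = client.messages.create(
    model="claude-3-7-sonnet-20250219",
    max_tokens=4000,
    system="You are a senior software engineer with expertise in analyzing and explaining code.",
    messages=history,
)
print(second.content[0].text)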

5. Augment with Code Visualization

Combine Claude's text analysis with visual tools like code graphs to get a complete understanding of complex codebases.

Conclusion

Claude 3.7 Sonnet represents a major advancement in AI-assisted code understanding. By leveraging its 200K context window and advanced language capabilities, developers can now comprehend complex codebases more quickly and thoroughly than ever before.

This demo has shown you:

  • How to prepare code for analysis
  • Techniques for effective semantic chunking
  • Strategies for prompting Claude for code insights
  • Real-world applications for code understanding

The complete code for this demo is available in the files described above. Try it out with your own projects and see how Claude 3.7 Sonnet can transform your code understanding workflow!

Running the Demo

To run the complete demo:

  1. Create the required files:

    • semantic_chunker.py: The semantic chunking implementation
    • sample_code.py: The sample code generator
    • code_analysis_app.py: The Streamlit web application
  2. Install dependencies:

    pip install anthropic streamlit scikit-learn numpy
  3. Run the web application:

    streamlit run code_analysis_app.py
  4. Use the web interface:

    • Enter your Anthropic API key
    • Choose between analyzing sample code or your own code
    • Ask questions about the code and see Claude 3.7 Sonnet's analysis

Additional Resources