| # BitTransformerLM User Guide |
|
|
| **Version:** 0.1.0 Experimental |
| **Last Updated:** August 2025 |
| **Recommended Setup:** Use with [Claude Code](https://claude.ai/code) for optimal experience |
|
|
| ## Table of Contents |
|
|
| 1. [Quick Start](#quick-start) |
| 2. [Architecture Overview](#architecture-overview) |
| 3. [Core Features](#core-features) |
| 4. [Installation & Setup](#installation--setup) |
| 5. [Basic Usage Examples](#basic-usage-examples) |
| 6. [Advanced Features](#advanced-features) |
| 7. [Training Your Own Models](#training-your-own-models) |
| 8. [Safety and Monitoring](#safety-and-monitoring) |
| 9. [Distributed Training](#distributed-training) |
| 10. [Performance Optimization](#performance-optimization) |
| 11. [Troubleshooting](#troubleshooting) |
| 12. [Best Practices](#best-practices) |
| 13. [Getting Help](#getting-help) |
|
|
| --- |
|
|
| ## Quick Start |
|
|
| BitTransformerLM is an experimental transformer language model that operates directly on binary sequences (bits) rather than tokens. This unique approach enables fine-grained control over information processing and built-in safety monitoring. |
|
|
| ### Minimal Example |
| ```python |
| from bit_transformer import BitTransformerLM, example_training_step |
| |
| # Run basic example |
| loss, telemetry = example_training_step() |
| print(f"Training loss: {loss}") |
| print(f"Available telemetry: {list(telemetry.keys())}") |
| ``` |
|
|
| ### Text Processing Example |
| ```python |
| import torch |
| from bit_transformer import BitTransformerLM, text_to_bits, bits_to_text |
| 
| # Create model |
| model = BitTransformerLM( |
|     d_model=128, |
|     nhead=4, |
|     num_layers=2, |
|     dim_feedforward=256, |
|     max_seq_len=256 |
| ) |
| |
| # Convert text to bits and process |
| text = "Hello, world!" |
| bits = text_to_bits(text) |
| bit_tensor = torch.tensor(bits).unsqueeze(0) |
| |
| # Forward pass |
| logits, telemetry = model(bit_tensor) |
| print(f"Input bits: {len(bits)}") |
| print(f"Output shape: {logits.shape}") |
| print(f"Telemetry metrics: {list(telemetry.keys())}") |
| ``` |
|
|
| --- |
|
|
| ## Architecture Overview |
|
|
| ### Bit-Native Processing |
| Unlike traditional language models that use token embeddings, BitTransformerLM processes raw binary sequences: |
|
|
| - **Input**: Text → UTF-8 bytes → Bits with parity protection (9 bits per byte) |
| - **Processing**: Multi-head attention on bit embeddings |
| - **Output**: Probability distribution over next bit (0 or 1) |
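
The encoding path above (text to UTF-8 bytes to 9 bits per byte) can be sketched in plain Python. This is an illustration rather than the library's `text_to_bits` implementation: the function names are hypothetical, and the bit order (most significant bit first) and parity convention (even parity, stored as the ninth bit) are assumptions.

```python
def text_to_parity_bits(text: str) -> list[int]:
    """Encode text as UTF-8 and emit 8 data bits plus 1 parity bit per byte."""
    bits: list[int] = []
    for byte in text.encode("utf-8"):
        data = [(byte >> shift) & 1 for shift in range(7, -1, -1)]  # MSB first
        parity = sum(data) % 2  # assumption: even parity, stored as the 9th bit
        bits.extend(data + [parity])
    return bits


def parity_bits_to_text(bits: list[int]) -> str:
    """Invert text_to_parity_bits, raising ValueError on a parity mismatch."""
    if len(bits) % 9 != 0:
        raise ValueError("Bit stream length must be a multiple of 9")
    decoded = bytearray()
    for start in range(0, len(bits), 9):
        *data, parity = bits[start:start + 9]
        if sum(data) % 2 != parity:
            raise ValueError("Parity check failed")
        byte = 0
        for bit in data:
            byte = (byte << 1) | bit
        decoded.append(byte)
    return decoded.decode("utf-8")


encoded = text_to_parity_bits("Hi")
print(len(encoded))                  # 2 bytes x 9 bits = 18
print(parity_bits_to_text(encoded))  # Hi
```

Under these assumptions a 13-character ASCII string such as "Hello, world!" becomes 117 bits, which is why sequence lengths in this guide are quoted in bits rather than characters.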
|
|
| ### Key Innovations |
|
|
| #### 1. **Reversible Transformer Layers** |
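| - Memory-efficient computation: layer inputs are recomputed from layer outputs during the backward pass instead of being stored |
| - Enables training deeper models within the same activation-memory budget |
| - Each layer is mathematically invertible, so gradients can be computed without cached intermediate activations |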
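
To make the reversibility idea concrete, here is a minimal sketch of the additive coupling used by reversible residual networks in general. It is not taken from BitTransformerLM's source: the helper names are hypothetical and the sub-functions `f` and `g` stand in for attention and feedforward blocks.

```python
from typing import Callable

Vector = list[float]
Block = Callable[[Vector], Vector]


def _add(a: Vector, b: Vector) -> Vector:
    return [x + y for x, y in zip(a, b)]


def _sub(a: Vector, b: Vector) -> Vector:
    return [x - y for x, y in zip(a, b)]


def reversible_forward(x1: Vector, x2: Vector, f: Block, g: Block) -> tuple[Vector, Vector]:
    """Additive coupling: y1 = x1 + f(x2), y2 = x2 + g(y1)."""
    y1 = _add(x1, f(x2))
    y2 = _add(x2, g(y1))
    return y1, y2


def reversible_inverse(y1: Vector, y2: Vector, f: Block, g: Block) -> tuple[Vector, Vector]:
    """Recover the inputs from the outputs, so the inputs never need to be stored."""
    x2 = _sub(y2, g(y1))
    x1 = _sub(y1, f(x2))
    return x1, x2


# Stand-ins for the attention and feedforward sub-blocks (illustrative only)
f = lambda v: [2.0 * x for x in v]
g = lambda v: [x + 1.0 for x in v]

x1, x2 = [1.0, 2.0], [3.0, 4.0]
y1, y2 = reversible_forward(x1, x2, f, g)
r1, r2 = reversible_inverse(y1, y2, f, g)
print(r1 == x1 and r2 == x2)
```

Because the inverse only needs `y1`, `y2`, and the two sub-functions, a backward pass can rebuild each layer's inputs on the fly, trading extra compute for lower activation memory.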
|
|
| #### 2. **Built-in Safety Telemetry** |
| - **K (Negentropy)**: Measures information content vs random noise |
| - **C (LZ Complexity)**: Proxy for compressibility and pattern complexity |
| - **S (Symbiosis)**: Alignment with reference distributions |
| - Real-time monitoring and safety gates |
|
|
| #### 3. **Dual-Mode Operation** |
| - **Causal Mode**: Traditional autoregressive generation |
| - **Diffusion Mode**: Bidirectional denoising for higher quality output |
|
|
| #### 4. **Progressive Scaling** |
| - Dynamic architecture expansion based on validation performance |
| - Automatic addition of layers, width, or context length |
| - Curriculum learning from simple to complex patterns |
|
|
| --- |
|
|
| ## Core Features |
|
|
| ### Text Processing |
| - **Parity-Protected Encoding**: Each byte gets a parity bit for error detection |
| - **UTF-8 Support**: Full Unicode text processing capability |
| - **Bidirectional Processing**: Support for both causal and diffusion modes |
|
|
| ### Safety & Monitoring |
| - **Real-time Telemetry**: K/C/S metrics computed during inference |
| - **Safety Gates**: Automatic blocking of unsafe outputs |
| - **Metric Drift Detection**: Alerts when model behavior changes |
| - **Human-in-the-Loop**: Safe inference with retry mechanisms |
|
|
| ### Memory Efficiency |
| - **Reversible Layers**: Significant memory savings for deep models |
| - **Gradient Checkpointing**: Trade compute for memory in training |
| - **Dynamic Quantization**: Runtime INT8 conversion for inference |
| - **4-bit QAT**: Quantization-aware training for extreme efficiency |
|
|
| ### Advanced Training |
| - **Distributed Training**: FSDP and pipeline parallelism support |
| - **Mixed Precision**: FP16/BF16 optimization with CPU autocast |
| - **Compression Pipeline**: Run-length encoding for efficient storage |
| - **Progressive Curriculum**: Automatic difficulty scaling |
|
|
| --- |
|
|
| ## Installation & Setup |
|
|
| ### Requirements |
| - Python 3.10 or later |
| - PyTorch 2.7.1 or later |
| - CUDA (optional, for GPU acceleration) |
|
|
| ### Installation |
| ```bash |
| # Clone repository |
| git clone https://huggingface.co/WCNegentropy/BitTransformerLM |
| cd BitTransformerLM |
| |
| # Install dependencies |
| pip install -r requirements.txt |
| |
| # For GPU support (optional) |
| pip install --extra-index-url https://download.pytorch.org/whl/cu118 torch==2.7.1+cu118 |
| ``` |
|
|
| ### Quick Test |
| ```bash |
| # Run basic example |
| python example.py |
| |
| # Expected output: |
| # Training loss: [some value] |
| # Available telemetry: ['activations', 'attention_maps', ...] |
| ``` |
|
|
| ### **🤖 Recommended: Setup with Claude Code** |
|
|
| For the best experience, we recommend using [Claude Code](https://claude.ai/code) to set up and work with BitTransformerLM: |
|
|
| 1. **Open Claude Code** and navigate to your project directory |
| 2. **Clone the repository**: Claude Code can help with git operations and dependency management |
| 3. **Interactive Setup**: Claude Code can guide you through configuration options and explain parameters |
| 4. **Real-time Assistance**: Get help with model architecture, training parameters, and debugging |
| 5. **Code Generation**: Generate custom training scripts and experiments with AI assistance |
|
|
| Claude Code provides contextual understanding of BitTransformerLM's unique architecture and can help you avoid common pitfalls when working with bit-native processing. |
|
|
| --- |
|
|
| ## Basic Usage Examples |
|
|
| ### 1. Creating Models |
|
|
| ```python |
| from bit_transformer import BitTransformerLM |
| |
| # Small model for experimentation |
| small_model = BitTransformerLM( |
|     d_model=64,           # Embedding dimension |
|     nhead=4,              # Number of attention heads |
|     num_layers=2,         # Number of transformer layers |
|     dim_feedforward=128,  # Feedforward dimension |
|     max_seq_len=128,      # Maximum sequence length |
|     reversible=True,      # Use memory-efficient reversible layers |
|     use_checkpoint=True   # Enable gradient checkpointing |
| ) |
| 
| # Medium model for research |
| medium_model = BitTransformerLM( |
|     d_model=512, |
|     nhead=8, |
|     num_layers=8, |
|     dim_feedforward=2048, |
|     max_seq_len=512, |
|     reversible=True, |
|     use_checkpoint=True, |
|     chunk_size=64,  # Chunked attention for long sequences |
|     lambda_K=0.1,   # Negentropy regularization weight |
|     lambda_C=0.1,   # Complexity regularization weight |
|     lambda_S=0.1    # Symbiosis regularization weight |
| ) |
| ``` |
|
|
| ### 2. Text Generation |
|
|
| ```python |
| from bit_transformer.bit_io import sample_text |
| |
| # Generate text from prompt |
| prompt = "The future of AI is" |
| generated = sample_text( |
|     model, |
|     prompt=prompt, |
|     max_new_tokens=20,  # Generate ~20 new characters |
|     temperature=0.8,    # Sampling temperature |
|     top_p=0.9           # Nucleus sampling |
| ) |
| print(f"Generated: {generated}") |
| ``` |
|
|
| ### 3. Safe Inference |
|
|
| ```python |
| from bit_transformer import hil_safe_inference, text_to_bits |
| import torch |
| |
| # Convert text to bits |
| text = "Hello, world!" |
| bits = torch.tensor(text_to_bits(text)).unsqueeze(0) |
| |
| # Safe inference with telemetry monitoring |
| try: |
|     output_bits, telemetry = hil_safe_inference( |
|         model, |
|         bits, |
|         c_floor=0.3,  # Minimum complexity threshold |
|         s_floor=0.5,  # Minimum symbiosis threshold |
|         strict=True   # Raise an error if thresholds are not met |
|     ) |
|     print("✅ Safe inference completed") |
|     print(f"K (Negentropy): {telemetry.get('negentropy_logits', 'N/A')}") |
|     print(f"C (Complexity): {telemetry.get('lz_complexity_logits', 'N/A')}") |
|     print(f"S (Symbiosis): {telemetry.get('symbiosis_score', 'N/A')}") |
| except Exception as e: |
|     print(f"⚠️ Safety check failed: {e}") |
| ``` |
|
|
| ### 4. Interactive Dashboard |
|
|
| ```python |
| # Launch the interactive dashboard from a shell: |
| #   python unified_workflow.py --dashboard |
| |
| # Or programmatically |
| from bit_transformer.dashboard_app import run_dashboard |
| run_dashboard(host="localhost", port=5000) |
| ``` |
|
|
| The dashboard provides: |
| - Real-time training monitoring |
| - Telemetry visualization |
| - Model configuration controls |
| - HuggingFace checkpoint management |
| - Safe inference testing interface |
|
|
| --- |
|
|
| ## Advanced Features |
|
|
| ### 1. Diffusion Mode Training |
|
|
| Diffusion mode enables bidirectional processing for higher quality generation: |
|
|
| ```bash |
| # Train with diffusion mode |
| python unified_workflow.py --diffusion --diffusion-steps 8 --dataset-size 32 |
| |
| # Different noise schedules |
| python unified_workflow.py --diffusion --noise-schedule cosine --diffusion-steps 16 |
| |
| # Diffusion curriculum (noise decay over epochs) |
| python unified_workflow.py --diffusion --diffusion-curriculum |
| ``` |
|
|
| **Diffusion Parameters:** |
| - `--diffusion-steps`: Number of denoising steps (more steps generally improve quality at the cost of slower sampling) |
| - `--noise-schedule`: `linear`, `cosine`, or `exp` noise decay |
| - `--diffusion-curriculum`: Gradually reduce noise over training epochs |
|
|
| ### 2. Progressive Scaling |
|
|
| Enable automatic model growth based on performance: |
|
|
| ```python |
| import torch |
| from bit_transformer import BitTransformerLM |
| from bit_transformer.training import train_loop |
| from bit_transformer.scale import expand_model |
| 
| # Training loop with automatic scaling |
| model = BitTransformerLM(d_model=64, nhead=4, num_layers=2, dim_feedforward=128) |
| train_data = torch.randint(0, 2, (1000, 64)) |
| 
| # Train with progressive scaling |
| train_loop( |
|     model, |
|     train_data, |
|     epochs=10, |
|     batch_size=8, |
|     # Progressive scaling will automatically trigger when validation loss plateaus |
| ) |
| |
| # Manual model expansion |
| expanded_model = expand_model(model, strategy="depth") # Add layers |
| expanded_model = expand_model(model, strategy="width") # Increase width |
| expanded_model = expand_model(model, strategy="context") # Extend context |
| ``` |
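
The plateau condition that triggers scaling can be pictured with a small helper like the one below. This is a sketch of one common plateau test, not the check used inside `train_loop`; the function name, `patience`, and `min_delta` are hypothetical.

```python
def has_plateaued(val_losses: list[float], patience: int = 3,
                  min_delta: float = 1e-3) -> bool:
    """Return True when the last `patience` epochs failed to beat the earlier
    best validation loss by at least `min_delta`."""
    if len(val_losses) <= patience:
        return False  # not enough history to judge
    best_before = min(val_losses[:-patience])
    best_recent = min(val_losses[-patience:])
    return best_before - best_recent < min_delta


print(has_plateaued([1.0, 0.8, 0.6, 0.5]))       # still improving
print(has_plateaued([1.0, 0.6, 0.6, 0.6, 0.6]))  # flat for three epochs
```

A training script could call such a check after each validation pass and expand the model only when it returns `True`.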
|
|
| ### 3. Compression Pipeline |
|
|
| BitTransformerLM includes run-length encoding for efficient data storage: |
|
|
| ```python |
| import torch |
| from bit_transformer import compress_bits, decompress_bits, train_loop |
| |
| # Compress bit sequences |
| original_bits = torch.tensor([0, 0, 0, 1, 1, 0, 1, 1, 1]) |
| compressed = compress_bits(original_bits) |
| decompressed = decompress_bits(compressed) |
| |
| print(f"Original: {original_bits}") |
| print(f"Compressed: {compressed}") |
| print(f"Decompressed: {decompressed}") |
| print(f"Compression ratio: {len(original_bits) / len(compressed):.2f}") |
| |
| # Use compression in training |
| train_loop( |
|     model, |
|     data, |
|     compress_prob=0.5,   # 50% of training uses compressed data |
|     compress_warmup=100  # Start compression after 100 steps |
| ) |
| ``` |
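
For readers unfamiliar with run-length encoding, the following standalone sketch shows the idea on the same bit pattern. It is illustrative only: the helper names are hypothetical, and the `(bit, run_length)` pair format is an assumption that may not match the tensor layout returned by `compress_bits`.

```python
def rle_encode(bits: list[int]) -> list[tuple[int, int]]:
    """Collapse consecutive equal bits into (bit, run_length) pairs."""
    runs: list[tuple[int, int]] = []
    for bit in bits:
        if runs and runs[-1][0] == bit:
            runs[-1] = (bit, runs[-1][1] + 1)  # extend the current run
        else:
            runs.append((bit, 1))              # start a new run
    return runs


def rle_decode(runs: list[tuple[int, int]]) -> list[int]:
    """Expand (bit, run_length) pairs back into the original bit list."""
    return [bit for bit, length in runs for _ in range(length)]


original = [0, 0, 0, 1, 1, 0, 1, 1, 1]
runs = rle_encode(original)
print(runs)                          # [(0, 3), (1, 2), (0, 1), (1, 3)]
print(rle_decode(runs) == original)  # True
```

Run-length encoding only pays off when runs are long; for short or rapidly alternating sequences the encoded form can be larger than the input.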
|
|
| ### 4. Quantization and Optimization |
|
|
| ```python |
| import torch |
| from bit_transformer import quantize_dynamic, prepare_qat_fx, convert_qat_fx, train_loop |
| |
| # Dynamic quantization for inference |
| quantized_model = quantize_dynamic(model, dtype=torch.qint8) |
| |
| # 4-bit quantization-aware training |
| qat_model = prepare_qat_fx(model) |
| # ... train qat_model ... |
| final_model = convert_qat_fx(qat_model) |
| |
| # Enable mixed precision and compilation |
| train_loop( |
|     model, |
|     data, |
|     amp=True,           # Enable automatic mixed precision |
|     compile_model=True  # Use torch.compile for speedup |
| ) |
| ``` |
|
|
| --- |
|
|
| ## Training Your Own Models |
|
|
| ### Basic Training Script |
|
|
| ```python |
| import torch |
| from bit_transformer import BitTransformerLM, train_loop, configure_optimizer |
| from bit_transformer.bit_io import text_to_bits |
| |
| # Prepare training data |
| texts = ["Hello world", "How are you?", "BitTransformer is working!"] |
| all_bits = [] |
| for text in texts: |
|     bits = text_to_bits(text) |
|     all_bits.extend(bits) |
| |
| # Convert to tensor and create sequences |
| data = torch.tensor(all_bits) |
| sequences = data.unfold(0, 64, 32) # 64-bit sequences with 32-bit stride |
| |
| # Create model |
| model = BitTransformerLM( |
|     d_model=128, |
|     nhead=8, |
|     num_layers=4, |
|     dim_feedforward=512, |
|     max_seq_len=64, |
|     reversible=True |
| ) |
| 
| # Configure optimizer |
| optimizer = configure_optimizer(model, lr=0.001, weight_decay=0.01) |
| 
| # Training loop |
| train_loop( |
|     model, |
|     sequences, |
|     epochs=10, |
|     batch_size=4, |
|     optimizer=optimizer, |
|     amp=True,  # Mixed precision |
|     log=True   # Enable logging |
| ) |
| ``` |
|
|
| ### Advanced Training Configuration |
|
|
| ```python |
| # Advanced training with all features enabled |
| train_loop( |
|     model, |
|     data, |
|     epochs=20, |
|     batch_size=8, |
|     accum_steps=4,       # Gradient accumulation |
|     amp=True,            # Mixed precision |
|     compile_model=True,  # torch.compile optimization |
| 
|     # Compression settings |
|     compress_prob=0.3,   # 30% compression probability |
|     compress_warmup=50,  # Start compression after 50 steps |
| 
|     # Diffusion settings |
|     diffusion=True,             # Enable diffusion mode |
|     diffusion_curriculum=True,  # Decay noise over epochs |
| 
|     # Direct bit training |
|     direct_prob=0.1,  # 10% direct bit prediction |
| 
|     # Logging |
|     log=True  # Enable detailed logging |
| ) |
| ``` |
|
|
| ### Custom Training Loop |
|
|
| ```python |
| import torch |
| import torch.nn.functional as F |
| from bit_transformer.utils import set_dropout |
| 
| # Manual training loop for full control. |
| # Assumes `model` exists and `data_loader` yields (batch, seq_len) integer bit tensors. |
| model.train() |
| set_dropout(model, 0.1)  # Enable dropout for training |
| 
| optimizer = torch.optim.AdamW(model.parameters(), lr=0.001) |
| criterion = F.cross_entropy |
| 
| for epoch in range(10): |
|     total_loss = 0 |
|     for batch in data_loader: |
|         optimizer.zero_grad() |
| 
|         # Forward pass |
|         logits, telemetry = model(batch) |
| 
|         # Compute loss |
|         if logits.dim() == 3:  # (batch, seq, 2) |
|             targets = batch[:, 1:]   # Next bit prediction |
|             logits = logits[:, :-1]  # Remove last prediction |
|             loss = criterion(logits.reshape(-1, 2), targets.reshape(-1)) |
|         else: |
|             loss = criterion(logits, batch) |
| 
|         # Add telemetry regularization |
|         if model.lambda_K > 0: |
|             loss += model.lambda_K * (1 - telemetry.get('negentropy_logits', 0)) |
|         if model.lambda_C > 0: |
|             loss += model.lambda_C * (1 - telemetry.get('lz_complexity_logits', 0)) |
| 
|         # Backward pass |
|         loss.backward() |
| 
|         # Gradient clipping |
|         torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0) |
| 
|         optimizer.step() |
|         total_loss += loss.item() |
| 
|         # Safety check |
|         if telemetry.get('symbiosis_score', 1.0) < 0.3: |
|             print("⚠️ Low symbiosis score detected") |
| 
|     print(f"Epoch {epoch}: Average loss = {total_loss / len(data_loader):.4f}") |
| ``` |
|
|
| --- |
|
|
| ## Safety and Monitoring |
|
|
| ### Telemetry Metrics |
|
|
| BitTransformerLM provides three key safety metrics: |
|
|
| #### K (Negentropy) - Information Content |
| - **Range**: 0-1 (0 = random noise, 1 = perfectly ordered) |
| - **Purpose**: Measures departure from randomness |
| - **Interpretation**: |
| - Very low K (< 0.1): Output is noise-like |
| - Moderate K (0.3-0.7): Structured but varied output |
| - Very high K (> 0.9): Repetitive or overly structured |
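
One simple way to obtain a score with this 0-1 range is to subtract the Shannon entropy of the bit frequencies from its 1-bit maximum. The sketch below is a first-order illustration under that assumption; the function names are hypothetical and the library's K computation (which operates on model outputs) may be defined differently.

```python
import math


def binary_entropy(p: float) -> float:
    """Shannon entropy in bits of a Bernoulli(p) source."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def negentropy(bits: list[int]) -> float:
    """1 - H(p), where p is the fraction of ones: 0 for balanced bits, 1 for constant bits."""
    if not bits:
        raise ValueError("bits must be non-empty")
    p = sum(bits) / len(bits)
    return 1.0 - binary_entropy(p)


print(negentropy([0, 1, 0, 1]))            # balanced bit counts
print(negentropy([1, 1, 1, 1]))            # constant sequence
print(round(negentropy([0, 0, 0, 1]), 3))  # skewed bit counts
```

Note that this frequency-based sketch ignores ordering, so a strictly alternating sequence scores the same as coin flips; that blind spot is one reason a separate complexity metric is useful.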
|
|
| #### C (LZ Complexity) - Pattern Complexity |
| - **Range**: 0-1 (higher = more complex patterns) |
| - **Purpose**: Proxy for Lempel-Ziv compressibility |
| - **Interpretation**: |
| - Low C (< 0.3): Highly repetitive patterns |
| - Moderate C (0.3-0.7): Balanced complexity |
| - High C (> 0.8): Complex, varied patterns |
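
A Lempel-Ziv style complexity score can be sketched by counting how many new phrases are needed to parse a sequence. The version below uses an LZ76-style parse and normalizes by `n / log2(n)`; both the function names and the normalization are assumptions for illustration and need not match how BitTransformerLM computes C.

```python
import math
import random


def lz76_phrase_count(bits: list[int]) -> int:
    """Count phrases in an LZ76-style parse: each phrase is the shortest
    substring that has not already appeared in the text before it."""
    s = "".join(str(b) for b in bits)
    n, i, count = len(s), 0, 0
    while i < n:
        length = 1
        # Extend the phrase while it can still be copied from earlier text
        while i + length <= n and s[i:i + length] in s[:i + length - 1]:
            length += 1
        count += 1
        i += length
    return count


def lz_complexity(bits: list[int]) -> float:
    """Phrase count scaled by n / log2(n) and clipped to [0, 1]."""
    n = len(bits)
    if n < 2:
        raise ValueError("need at least 2 bits")
    return min(1.0, lz76_phrase_count(bits) * math.log2(n) / n)


rng = random.Random(0)
constant = [0] * 256
noisy = [rng.randint(0, 1) for _ in range(256)]
print(lz_complexity(constant), lz_complexity(noisy))
```

A constant sequence parses into just two phrases and scores near zero, while pseudo-random bits need many short phrases and score near the top of the range.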
|
|
| #### S (Symbiosis) - Distribution Alignment |
| - **Range**: 0-1 (higher = better alignment) |
| - **Purpose**: Agreement with reference distributions via KL divergence |
| - **Interpretation**: |
| - Low S (< 0.3): Poor alignment with expected patterns |
| - Moderate S (0.5-0.8): Good alignment |
| - High S (> 0.8): Excellent alignment |
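
To show how a KL divergence can be turned into a 0-1 agreement score, the sketch below maps `KL(p || q)` through `exp(-KL)`. That mapping, the function names, and the epsilon smoothing are assumptions chosen for this example; the guide only states that S is derived from KL divergence, not which transformation is used.

```python
import math


def kl_divergence(p: list[float], q: list[float], eps: float = 1e-12) -> float:
    """KL(p || q) in nats for two discrete distributions over the same outcomes."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


def symbiosis_score(p: list[float], q: list[float]) -> float:
    """Map KL divergence to (0, 1]: identical distributions score 1.0."""
    return math.exp(-kl_divergence(p, q))


reference = [0.5, 0.5]  # reference distribution over bit values 0 and 1
print(symbiosis_score([0.5, 0.5], reference))            # perfect agreement
print(round(symbiosis_score([0.9, 0.1], reference), 3))  # skewed predictions
```

Any monotone decreasing map from KL to a bounded range would serve the same purpose; the exponential is used here only because it is simple and smooth.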
|
|
| ### Safety Gates |
|
|
| ```python |
| from bit_transformer.safety import SafetyGate, safe_sample_with_retry |
| |
| # Configure safety gate |
| gate = SafetyGate( |
|     c_floor=0.3,  # Minimum complexity |
|     s_floor=0.5,  # Minimum symbiosis |
|     decay=0.9,    # EMA decay factor |
|     burn_in=10    # Steps before gating starts |
| ) |
| 
| # Check if output should be blocked. Both values are below their floors, |
| # but gating only starts once the burn-in steps have elapsed. |
| should_block = gate.should_trigger(c_val=0.2, s_val=0.4) |
| 
| # Safe sampling with automatic retry (`input_bits` is a bit tensor prepared as in the earlier examples) |
| output = safe_sample_with_retry( |
|     model, |
|     input_bits, |
|     max_retries=3, |
|     retry_strategy="diffusion"  # Try diffusion mode on failure |
| ) |
| ``` |
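
The interaction between the floors, the EMA decay, and the burn-in period can be sketched with a small standalone class. This is a plausible reading of the parameters above, not the source of `SafetyGate`: the class name `EmaGate` and its internals are hypothetical.

```python
class EmaGate:
    """Hypothetical gate: smooths C and S with an exponential moving average
    and only starts blocking after `burn_in` observations."""

    def __init__(self, c_floor: float = 0.3, s_floor: float = 0.5,
                 decay: float = 0.9, burn_in: int = 10) -> None:
        self.c_floor, self.s_floor = c_floor, s_floor
        self.decay, self.burn_in = decay, burn_in
        self.c_ema: float | None = None
        self.s_ema: float | None = None
        self.steps = 0

    def should_trigger(self, c_val: float, s_val: float) -> bool:
        if self.c_ema is None or self.s_ema is None:
            self.c_ema, self.s_ema = c_val, s_val  # seed the averages
        else:
            self.c_ema = self.decay * self.c_ema + (1 - self.decay) * c_val
            self.s_ema = self.decay * self.s_ema + (1 - self.decay) * s_val
        self.steps += 1
        if self.steps <= self.burn_in:
            return False  # still warming up
        return self.c_ema < self.c_floor or self.s_ema < self.s_floor


gate = EmaGate(burn_in=2)
print([gate.should_trigger(c_val=0.2, s_val=0.4) for _ in range(3)])
```

Smoothing with an EMA means a single poor reading does not block output by itself, while a sustained drop below either floor does.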
|
|
| ### Metric Drift Detection |
|
|
| ```python |
| from bit_transformer.telemetry import detect_metric_drift |
| |
| # Monitor metric stability over time |
| metrics_history = [ |
| {"K": 0.5, "C": 0.6, "S": 0.7}, |
| {"K": 0.52, "C": 0.58, "S": 0.69}, |
| {"K": 0.8, "C": 0.9, "S": 0.4}, # Drift detected! |
| # ... more metrics |
| ] |
| |
| drift_detected = detect_metric_drift( |
|     metrics_history, |
|     window=10,     # Look back 10 steps |
|     threshold=0.2  # Alert if change > 0.2 |
| ) |
| 
| if drift_detected: |
|     print("⚠️ Model behavior drift detected!") |
| ``` |
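
One way to implement this kind of check is to compare the mean of each metric over the most recent window with the mean over the window before it. The sketch below follows that approach; the function name and return format are hypothetical, and `detect_metric_drift` may use a different comparison.

```python
def find_drift(history: list[dict[str, float]], window: int = 10,
               threshold: float = 0.2) -> dict[str, float]:
    """Return {metric: change} for metrics whose windowed mean moved by more
    than `threshold` between the previous and the most recent window."""
    if len(history) < 2 * window:
        return {}  # not enough history for two full windows
    previous = history[-2 * window:-window]
    recent = history[-window:]
    drifted: dict[str, float] = {}
    for key in recent[0]:
        prev_mean = sum(step[key] for step in previous) / window
        recent_mean = sum(step[key] for step in recent) / window
        change = recent_mean - prev_mean
        if abs(change) > threshold:
            drifted[key] = change
    return drifted


steady = [{"K": 0.5, "C": 0.6, "S": 0.7}] * 3
shifted = [{"K": 0.8, "C": 0.9, "S": 0.4}] * 3
print(find_drift(steady + shifted, window=3))
```

Returning the signed change per metric, rather than a single boolean, makes it easier to tell whether an alert was caused by rising complexity or falling symbiosis.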
|
|
| --- |
|
|
| ## Distributed Training |
|
|
| ### FSDP (Fully Sharded Data Parallel) |
|
|
| ```python |
| from bit_transformer import BitTransformerLM, train_loop |
| from bit_transformer.distributed import wrap_fsdp, setup_distributed |
| 
| # Initialize distributed training |
| setup_distributed(rank=0, world_size=4) |
| 
| # Wrap model with FSDP |
| model = BitTransformerLM(d_model=1024, nhead=16, num_layers=12) |
| fsdp_model = wrap_fsdp( |
|     model, |
|     sharding_strategy="FULL_SHARD",  # or "SHARD_GRAD_OP", "NO_SHARD" |
|     mixed_precision=True, |
|     device_id=0 |
| ) |
| 
| # Train with FSDP |
| train_loop( |
|     fsdp_model, |
|     data, |
|     epochs=10, |
|     batch_size=2,  # Smaller batch per GPU |
|     amp=True |
| ) |
| ``` |
|
|
| ### Pipeline Parallelism |
|
|
| ```python |
| from bit_transformer.distributed import make_pipeline |
| |
| # Create pipeline parallel model |
| pipeline_model = make_pipeline( |
|     model, |
|     balance=[2, 2, 2, 2],  # Split 8 layers across 4 GPUs |
|     devices=[0, 1, 2, 3], |
|     checkpoint="never"     # or "always", "except_last" |
| ) |
| |
| # Pipeline training requires special handling |
| # See unified_workflow.py for complete implementation |
| ``` |
|
|
| ### Multi-GPU Training Script |
|
|
| ```bash |
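| # Note: recent PyTorch releases deprecate torch.distributed.launch in favor of |
| # torchrun, which accepts the same launcher flags shown here. |
| 
| # Single node, multiple GPUs |
| python -m torch.distributed.launch \ |
|     --nproc_per_node=4 \ |
|     unified_workflow.py \ |
|     --distributed \ |
|     --batch-size 2 \ |
|     --epochs 10 |
| 
| # Multiple nodes (run on every node, with --node_rank set to that node's rank) |
| python -m torch.distributed.launch \ |
|     --nnodes=2 \ |
|     --node_rank=0 \ |
|     --master_addr="192.168.1.100" \ |
|     --master_port=29500 \ |
|     --nproc_per_node=4 \ |
|     unified_workflow.py \ |
|     --distributed |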
| ``` |
|
|
| --- |
|
|
| ## Performance Optimization |
|
|
| ### Memory Optimization |
|
|
| ```python |
| # Enable all memory optimizations |
| model = BitTransformerLM( |
|     d_model=512, |
|     nhead=8, |
|     num_layers=8, |
|     reversible=True,         # Reversible layers save ~50% memory |
|     use_checkpoint=True,     # Gradient checkpointing |
|     chunk_size=64,           # Chunked attention for long sequences |
|     full_attn_logging=False  # Skip full attention reconstruction |
| ) |
| 
| # Training optimizations |
| train_loop( |
|     model, |
|     data, |
|     batch_size=4,       # Smaller batches |
|     accum_steps=8,      # Gradient accumulation |
|     amp=True,           # Mixed precision |
|     compile_model=True  # torch.compile |
| ) |
| ``` |
|
|
| ### CPU Optimization |
|
|
| ```python |
| from bit_transformer.torch_utils import cpu_autocast |
| |
| # Enable BF16 on CPU |
| with cpu_autocast(): |
|     logits, telemetry = model(bits) |
| |
| # Or enable for entire model |
| model = BitTransformerLM(use_autocast=True) # Automatically uses CPU BF16 |
| ``` |
|
|
| ### Inference Optimization |
|
|
| ```python |
| # Quantize for inference |
| import torch |
| from bit_transformer import quantize_dynamic |
| from bit_transformer.utils import set_dropout |
| 
| # Switch to evaluation mode |
| model.eval() |
| set_dropout(model, 0.0) |
| 
| # Dynamic quantization |
| quantized = quantize_dynamic(model, dtype=torch.qint8) |
| 
| # Run inference without tracking gradients |
| with torch.no_grad(): |
|     logits, _ = quantized(input_bits) |
| ``` |
|
|
| ### Long Sequence Processing |
|
|
| ```python |
| import torch |
| from bit_transformer import text_to_bits |
| from bit_transformer.model import infer_long_sequence |
| 
| # Process sequences longer than max_seq_len |
| long_text = "Very long text..." * 1000 |
| bits = text_to_bits(long_text) |
| 
| output = infer_long_sequence( |
|     model, |
|     torch.tensor(bits).unsqueeze(0), |
|     chunk_size=256,  # Process in 256-bit chunks |
|     overlap=32,      # 32-bit overlap between chunks |
|     stride=224       # 224-bit stride (256-32) |
| ) |
| ``` |
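
The relationship between `chunk_size`, `overlap`, and `stride` is easiest to see by listing the windows they produce. The helper below is a standalone sketch with a hypothetical name; how `infer_long_sequence` merges the overlapping regions is not shown here.

```python
def chunk_spans(total_len: int, chunk_size: int = 256,
                overlap: int = 32) -> list[tuple[int, int]]:
    """Return (start, end) index pairs covering total_len bits, where
    consecutive chunks share `overlap` bits (stride = chunk_size - overlap)."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    if total_len <= 0:
        return []
    stride = chunk_size - overlap
    spans: list[tuple[int, int]] = []
    start = 0
    while True:
        end = min(start + chunk_size, total_len)
        spans.append((start, end))
        if end >= total_len:
            break
        start += stride
    return spans


print(chunk_spans(600))  # [(0, 256), (224, 480), (448, 600)]
```

With a 256-bit chunk and a 32-bit overlap, each new chunk starts 224 bits after the previous one, and the final chunk is simply shorter when the input length is not a multiple of the stride.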
|
|
| --- |
|
|
| ## Troubleshooting |
|
|
| ### Common Issues |
|
|
| #### 1. **Memory Errors** |
| ``` |
| RuntimeError: CUDA out of memory |
| ``` |
| **Solutions:** |
| - Enable reversible layers: `reversible=True` |
| - Enable gradient checkpointing: `use_checkpoint=True` |
| - Reduce batch size or use gradient accumulation |
| - Use chunked attention: `chunk_size=64` |
| - Enable mixed precision: `amp=True` |
|
|
| #### 2. **Tensor Shape Mismatches** |
| ``` |
| RuntimeError: view size is not compatible with input tensor's size |
| ``` |
| **Solutions:** |
| - Always use `.reshape()` instead of `.view()` with BitTransformerLM |
| - Check that input sequences are properly formatted (1D for bits) |
| - Ensure batch dimensions are consistent |
|
|
| #### 3. **Parity Check Failures** |
| ``` |
| ValueError: Parity check failed |
| ``` |
| **Solutions:** |
| - Use `enforce_parity()` to fix parity bits in generated sequences |
| - Check that text encoding/decoding is consistent |
| - Verify bit sequences have correct 9-bit (8+parity) structure |
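
When debugging parity failures it can help to locate the offending 9-bit groups before repairing them. The sketch below does both in plain Python; the helper names are hypothetical, and the even-parity, parity-bit-last layout is an assumption rather than a description of `enforce_parity()`.

```python
GROUP = 9  # 8 data bits followed by 1 parity bit (assumed layout)


def find_parity_errors(bits: list[int]) -> list[int]:
    """Return the indices of 9-bit groups whose parity bit does not match."""
    if len(bits) % GROUP != 0:
        raise ValueError("Bit stream length must be a multiple of 9")
    bad_groups = []
    for index, start in enumerate(range(0, len(bits), GROUP)):
        *data, parity = bits[start:start + GROUP]
        if sum(data) % 2 != parity:
            bad_groups.append(index)
    return bad_groups


def recompute_parity(bits: list[int]) -> list[int]:
    """Return a copy in which every parity bit is recomputed from its data bits."""
    if len(bits) % GROUP != 0:
        raise ValueError("Bit stream length must be a multiple of 9")
    fixed = list(bits)
    for start in range(0, len(fixed), GROUP):
        fixed[start + 8] = sum(fixed[start:start + 8]) % 2
    return fixed


generated = [0, 1, 0, 0, 1, 0, 0, 0, 1,   # parity bit is wrong here
             0, 1, 1, 0, 1, 0, 0, 1, 0]   # parity bit is correct here
print(find_parity_errors(generated))                    # [0]
print(find_parity_errors(recompute_parity(generated)))  # []
```

Recomputing parity makes the check pass but cannot tell whether a data bit was the one that flipped, so the decoded byte may still be wrong.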
|
|
| #### 4. **Safety Gate Triggering** |
| ``` |
| SafetyError: Output blocked by safety gate |
| ``` |
| **Solutions:** |
| - Lower safety thresholds: `c_floor=0.2, s_floor=0.4` |
| - Increase burn-in period: `burn_in=20` |
| - Use retry with diffusion: `safe_sample_with_retry()` |
| - Check model training quality |
|
|
| ### Debug Mode |
|
|
| ```python |
| # Enable detailed logging |
| import logging |
| import torch |
| from bit_transformer import BitTransformerLM |
| logging.basicConfig(level=logging.DEBUG) |
| 
| # Model with debug telemetry |
| model = BitTransformerLM( |
|     d_model=64, |
|     nhead=4, |
|     num_layers=2, |
|     full_attn_logging=True,  # Log full attention maps |
|     chunk_size=None          # Disable chunking for debugging |
| ) |
| 
| # Inspect telemetry (`input_bits` is a (batch, seq_len) bit tensor) |
| logits, telemetry = model(input_bits) |
| print("Telemetry keys:", list(telemetry.keys())) |
| print("Attention maps shape:", [a.shape for a in telemetry['attention_maps']]) |
| # Summarize the stacked activations |
| acts = torch.stack(telemetry['activations']).float() |
| print(f"Activation stats: mean={acts.mean().item():.4f}, std={acts.std().item():.4f}, " |
|       f"min={acts.min().item():.4f}, max={acts.max().item():.4f}") |
| ``` |
|
|
| ### Performance Profiling |
|
|
| ```python |
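| import torch |
| import torch.nn.functional as F |
| import torch.profiler |
| 
| # Profile one training step (`input_bits` is a (batch, seq_len) bit tensor) |
| activities = [torch.profiler.ProfilerActivity.CPU] |
| if torch.cuda.is_available(): |
|     activities.append(torch.profiler.ProfilerActivity.CUDA) |
| 
| with torch.profiler.profile( |
|     activities=activities, |
|     record_shapes=True, |
|     with_stack=True, |
| ) as prof: |
|     logits, telemetry = model(input_bits) |
|     targets = input_bits[:, 1:]  # Next-bit targets |
|     loss = F.cross_entropy(logits[:, :-1].reshape(-1, 2), targets.reshape(-1)) |
|     loss.backward() |
| 
| sort_key = "cuda_time_total" if torch.cuda.is_available() else "cpu_time_total" |
| print(prof.key_averages().table(sort_by=sort_key)) |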
| ``` |
|
|
| --- |
|
|
| ## Best Practices |
|
|
| ### Model Configuration |
|
|
| #### For Experimentation (< 1M parameters) |
| ```python |
| model = BitTransformerLM( |
|     d_model=64, |
|     nhead=4, |
|     num_layers=2, |
|     dim_feedforward=128, |
|     max_seq_len=128, |
|     reversible=False,  # Simpler for debugging |
|     use_checkpoint=False |
| ) |
| ``` |
|
|
| #### For Research (1M-100M parameters) |
| ```python |
| model = BitTransformerLM( |
|     d_model=256, |
|     nhead=8, |
|     num_layers=6, |
|     dim_feedforward=1024, |
|     max_seq_len=512, |
|     reversible=True,  # Enable memory efficiency |
|     use_checkpoint=True, |
|     chunk_size=128, |
|     lambda_K=0.05,    # Light regularization |
|     lambda_C=0.05, |
|     lambda_S=0.05 |
| ) |
| ``` |
|
|
| #### For Large-Scale (100M+ parameters) |
| ```python |
| model = BitTransformerLM( |
|     d_model=1024, |
|     nhead=16, |
|     num_layers=20, |
|     dim_feedforward=4096, |
|     max_seq_len=2048, |
|     reversible=True, |
|     use_checkpoint=True, |
|     chunk_size=256, |
|     full_attn_logging=False,  # Save memory |
|     lambda_K=0.1, |
|     lambda_C=0.1, |
|     lambda_S=0.1 |
| ) |
| ``` |
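
As a sanity check on the three size tiers, the weight count of a standard transformer layer can be estimated from `d_model` and `dim_feedforward`. The sketch below uses the usual rule of thumb (four `d_model x d_model` attention projections plus two feedforward matrices) and ignores biases, normalization, embeddings, output heads, and anything specific to reversible layers, so treat the numbers as rough orders of magnitude rather than exact BitTransformerLM counts.

```python
def estimate_params(d_model: int, num_layers: int, dim_feedforward: int) -> int:
    """Rough weight count for a stack of standard transformer layers."""
    attention = 4 * d_model * d_model            # Q, K, V and output projections
    feedforward = 2 * d_model * dim_feedforward  # up and down projections
    return num_layers * (attention + feedforward)


configs = {
    "experimentation": dict(d_model=64, num_layers=2, dim_feedforward=128),
    "research": dict(d_model=256, num_layers=6, dim_feedforward=1024),
    "large-scale": dict(d_model=1024, num_layers=20, dim_feedforward=4096),
}
for name, cfg in configs.items():
    print(f"{name}: ~{estimate_params(**cfg) / 1e6:.2f}M parameters")
```

Under these simplifications the three configurations come to roughly 0.07M, 4.7M, and 252M weights, which is consistent with the tier boundaries in the headings above.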
|
|
| ### Training Best Practices |
|
|
| 1. **Always validate on held-out data** to monitor overfitting |
| 2. **Use gradient clipping** to prevent training instability |
| 3. **Monitor telemetry metrics** for signs of model degradation |
| 4. **Start with smaller models** before scaling up |
| 5. **Use safety gates** in production deployments |
| 6. **Enable logging** to track training progress |
| 7. **Save checkpoints frequently** to prevent loss of progress |
|
|
| ### Data Preparation |
|
|
| ```python |
| import torch |
| from bit_transformer import text_to_bits |
| 
| # Good: Clean, well-formatted text |
| texts = [ |
|     "The quick brown fox jumps over the lazy dog.", |
|     "Machine learning is transforming technology.", |
|     "BitTransformer processes information at the bit level." |
| ] |
| 
| # Convert to training sequences |
| all_bits = [] |
| for text in texts: |
|     bits = text_to_bits(text) |
|     all_bits.extend(bits) |
| 
| # Create overlapping sequences for better learning |
| data = torch.tensor(all_bits) |
| seq_len = 128 |
| stride = 64 |
| sequences = [] |
| # The +1 keeps a final window that ends exactly at len(data) |
| for i in range(0, len(data) - seq_len + 1, stride): |
|     sequences.append(data[i:i + seq_len]) |
| 
| training_data = torch.stack(sequences) |
| ``` |
|
|
| ### Production Deployment |
|
|
| ```python |
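| import logging |
| import torch |
| from bit_transformer import quantize_dynamic, text_to_bits |
| from bit_transformer.safety import SafetyGate, safe_sample_with_retry |
| from bit_transformer.utils import set_dropout |
| 
| # Production-ready model setup |
| model.eval()  # Disable dropout |
| set_dropout(model, 0.0) |
| 
| # Enable safety monitoring |
| gate = SafetyGate(c_floor=0.3, s_floor=0.5, burn_in=5) |
| 
| # Quantize for efficiency |
| production_model = quantize_dynamic(model) |
| 
| # Safe inference with monitoring |
| def safe_generate(input_text, max_length=100): |
|     try: |
|         # Convert to a (1, seq_len) bit tensor, as in the other inference examples |
|         input_bits = torch.tensor(text_to_bits(input_text)).unsqueeze(0) |
|         return safe_sample_with_retry( |
|             production_model, |
|             input_bits, |
|             max_retries=3 |
|         ) |
|     except Exception as e: |
|         logging.error(f"Generation failed: {e}") |
|         return "Error: Unable to generate safe output" |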
| ``` |
|
|
| --- |
|
|
| ## Getting Help |
|
|
| ### Documentation Resources |
| - **ABOUTME.md**: Project overview and quick start |
| - **README.md**: Professional model card and specifications |
| - **RESEARCH_STATUS.md**: Current research status and limitations |
| - **EMPIRICAL_VALIDATION.md**: Evidence-based analysis of capabilities |
|
|
| ### Community Support |
| - **GitHub Issues**: Report bugs and request features |
| - **Discussions**: Ask questions and share experiences |
| - **Examples**: Check the `tests/` directory for usage examples |
|
|
| ### **🤖 Recommended: Use with Claude Code** |
|
|
| For the best experience with BitTransformerLM, we recommend using [Claude Code](https://claude.ai/code): |
|
|
| - **Interactive Setup**: Get step-by-step guidance for configuration |
| - **Real-time Debugging**: Immediate help when things go wrong |
| - **Code Generation**: Custom scripts and experiments tailored to your needs |
| - **Architecture Explanation**: Deep understanding of bit-native processing |
| - **Best Practices**: Learn optimal configurations for your use case |
|
|
| Claude Code understands BitTransformerLM's unique architecture and can help you navigate the complexities of bit-level language modeling. |
|
|
| --- |
|
|
| **Remember: BitTransformerLM is experimental research software. Always validate results thoroughly and use safety monitoring in any deployment.** |
|
|
| Happy experimenting! 🤖✨ |