	# BitTransformerLM User Guide

	Version: 0.1.0 Experimental
	Last Updated: August 2025
	Recommended Setup: Use with [Claude Code](https://claude.ai/code) for optimal experience

	## Table of Contents

	1. [Quick Start](#quick-start)
	2. [Architecture Overview](#architecture-overview)
	3. [Core Features](#core-features)
	4. [Installation & Setup](#installation--setup)
	5. [Basic Usage Examples](#basic-usage-examples)
	6. [Advanced Features](#advanced-features)
	7. [Training Your Own Models](#training-your-own-models)
	8. [Safety and Monitoring](#safety-and-monitoring)
	9. [Distributed Training](#distributed-training)
	10. [Performance Optimization](#performance-optimization)
	11. [Troubleshooting](#troubleshooting)
	12. [Best Practices](#best-practices)

	---

	## Quick Start

	BitTransformerLM is an experimental transformer language model that operates directly on binary sequences (bits) rather than tokens. This unique approach enables fine-grained control over information processing and built-in safety monitoring.

	### Minimal Example
	```python
	from bit_transformer import BitTransformerLM, example_training_step

	# Run basic example
	loss, telemetry = example_training_step()
	print(f"Training loss: {loss}")
	print(f"Available telemetry: {list(telemetry.keys())}")
	```

	### Text Processing Example
	```python
	import torch
	from bit_transformer import BitTransformerLM, text_to_bits, bits_to_text

	# Create model
	model = BitTransformerLM(
	    d_model=128,
	    nhead=4,
	    num_layers=2,
	    dim_feedforward=256,
	    max_seq_len=256
	)

	# Convert text to bits and process
	text = "Hello, world!"
	bits = text_to_bits(text)
	bit_tensor = torch.tensor(bits).unsqueeze(0)

	# Forward pass
	logits, telemetry = model(bit_tensor)
	print(f"Input bits: {len(bits)}")
	print(f"Output shape: {logits.shape}")
	print(f"Telemetry metrics: {list(telemetry.keys())}")
	```

	---

	## Architecture Overview

	### Bit-Native Processing
	Unlike traditional language models that use token embeddings, BitTransformerLM processes raw binary sequences:

	- Input: Text → UTF-8 bytes → Bits with parity protection (9 bits per byte)
	- Processing: Multi-head attention on bit embeddings
	- Output: Probability distribution over next bit (0 or 1)
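The input path listed above can be sketched in plain Python. This is a minimal illustration, not the library's `text_to_bits`: it assumes most-significant-bit-first ordering and an even-parity bit appended after each byte's 8 data bits, and both function names are hypothetical.

```python
def encode_with_parity(text: str) -> list[int]:
    """Encode text as UTF-8 bytes, emitting 8 data bits plus 1 parity bit per byte."""
    bits: list[int] = []
    for byte in text.encode("utf-8"):
        data = [(byte >> shift) & 1 for shift in range(7, -1, -1)]  # MSB first (assumed)
        parity = sum(data) % 2  # even parity (assumed): the 9-bit group has an even number of 1s
        bits.extend(data + [parity])
    return bits


def decode_with_parity(bits: list[int]) -> str:
    """Invert encode_with_parity, raising ValueError on a parity mismatch."""
    if len(bits) % 9 != 0:
        raise ValueError("bit length must be a multiple of 9")
    out = bytearray()
    for i in range(0, len(bits), 9):
        data, parity = bits[i:i + 8], bits[i + 8]
        if sum(data) % 2 != parity:
            raise ValueError(f"parity check failed in byte {i // 9}")
        out.append(int("".join(map(str, data)), 2))
    return out.decode("utf-8")


encoded = encode_with_parity("Hi")
print(len(encoded))  # 18: 2 bytes * 9 bits
print(decode_with_parity(encoded))  # Hi
```

Because every byte costs 9 bits, a `max_seq_len` of 256 holds roughly 28 ASCII characters under this layout.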

	### Key Innovations

	#### 1. Reversible Transformer Layers
	- Memory-efficient computation that doesn't store intermediate activations
	- Enables training of deeper models with same memory footprint
	- Mathematically reversible operations for gradient computation
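To make the idea of reversibility concrete, here is a sketch of the additive coupling used by RevNet-style reversible blocks. Whether BitTransformerLM uses exactly this coupling is an assumption; `f` and `g` stand in for the attention and feed-forward sublayers, and all names are hypothetical.

```python
from typing import Callable, List, Tuple

Vector = List[float]
Sublayer = Callable[[Vector], Vector]


def add(a: Vector, b: Vector) -> Vector:
    return [x + y for x, y in zip(a, b)]


def sub(a: Vector, b: Vector) -> Vector:
    return [x - y for x, y in zip(a, b)]


def reversible_forward(x1: Vector, x2: Vector, f: Sublayer, g: Sublayer) -> Tuple[Vector, Vector]:
    """Compute y1 = x1 + f(x2), then y2 = x2 + g(y1)."""
    y1 = add(x1, f(x2))
    y2 = add(x2, g(y1))
    return y1, y2


def reversible_inverse(y1: Vector, y2: Vector, f: Sublayer, g: Sublayer) -> Tuple[Vector, Vector]:
    """Recover the inputs from the outputs alone, so they need not be stored."""
    x2 = sub(y2, g(y1))
    x1 = sub(y1, f(x2))
    return x1, x2


# Stand-in sublayers; real ones would be attention and feed-forward networks.
f = lambda v: [2.0 * x for x in v]
g = lambda v: [x * x for x in v]

y1, y2 = reversible_forward([1.0, 2.0], [3.0, 4.0], f, g)
print(reversible_inverse(y1, y2, f, g))  # ([1.0, 2.0], [3.0, 4.0])
```

The inverse only needs `f` and `g` to be deterministic, not invertible themselves, which is why inputs can be recomputed during the backward pass instead of cached.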

	#### 2. Built-in Safety Telemetry
	- K (Negentropy): Measures information content vs random noise
	- C (LZ Complexity): Proxy for compressibility and pattern complexity
	- S (Symbiosis): Alignment with reference distributions
	- Real-time monitoring and safety gates

	#### 3. Dual-Mode Operation
	- Causal Mode: Traditional autoregressive generation
	- Diffusion Mode: Bidirectional denoising for higher quality output

	#### 4. Progressive Scaling
	- Dynamic architecture expansion based on validation performance
	- Automatic addition of layers, width, or context length
	- Curriculum learning from simple to complex patterns

	---

	## Core Features

	### Text Processing
	- Parity-Protected Encoding: Each byte gets a parity bit for error detection
	- UTF-8 Support: Full Unicode text processing capability
	- Bidirectional Processing: Support for both causal and diffusion modes

	### Safety & Monitoring
	- Real-time Telemetry: K/C/S metrics computed during inference
	- Safety Gates: Automatic blocking of unsafe outputs
	- Metric Drift Detection: Alerts when model behavior changes
	- Human-in-the-Loop: Safe inference with retry mechanisms

	### Memory Efficiency
	- Reversible Layers: Significant memory savings for deep models
	- Gradient Checkpointing: Trade compute for memory in training
	- Dynamic Quantization: Runtime INT8 conversion for inference
	- 4-bit QAT: Quantization-aware training for extreme efficiency

	### Advanced Training
	- Distributed Training: FSDP and pipeline parallelism support
	- Mixed Precision: FP16/BF16 optimization with CPU autocast
	- Compression Pipeline: Run-length encoding for efficient storage
	- Progressive Curriculum: Automatic difficulty scaling

	---

	## Installation & Setup

	### Requirements
	- Python 3.10 or later
	- PyTorch 2.7.1 or later
	- CUDA (optional, for GPU acceleration)

	### Installation
	```bash
	# Clone repository
	git clone https://huggingface.co/WCNegentropy/BitTransformerLM
	cd BitTransformerLM

	# Install dependencies
	pip install -r requirements.txt

	# For GPU support (optional)
	pip install --extra-index-url https://download.pytorch.org/whl/cu118 torch==2.7.1+cu118
	```

	### Quick Test
	```bash
	# Run basic example
	python example.py

	# Expected output:
	# Training loss: [some value]
	# Available telemetry: ['activations', 'attention_maps', ...]
	```

	### 🤖 Recommended: Setup with Claude Code

	For the best experience, we recommend using [Claude Code](https://claude.ai/code) to set up and work with BitTransformerLM:

	1. Open Claude Code and navigate to your project directory
	2. Clone the repository: Claude Code can help with git operations and dependency management
	3. Interactive Setup: Claude Code can guide you through configuration options and explain parameters
	4. Real-time Assistance: Get help with model architecture, training parameters, and debugging
	5. Code Generation: Generate custom training scripts and experiments with AI assistance

	Claude Code provides contextual understanding of BitTransformerLM's unique architecture and can help you avoid common pitfalls when working with bit-native processing.

	---

	## Basic Usage Examples

	### 1. Creating Models

	```python
	from bit_transformer import BitTransformerLM

	# Small model for experimentation
	small_model = BitTransformerLM(
	    d_model=64,            # Embedding dimension
	    nhead=4,               # Number of attention heads
	    num_layers=2,          # Number of transformer layers
	    dim_feedforward=128,   # Feedforward dimension
	    max_seq_len=128,       # Maximum sequence length
	    reversible=True,       # Use memory-efficient reversible layers
	    use_checkpoint=True    # Enable gradient checkpointing
	)

	# Medium model for research
	medium_model = BitTransformerLM(
	    d_model=512,
	    nhead=8,
	    num_layers=8,
	    dim_feedforward=2048,
	    max_seq_len=512,
	    reversible=True,
	    use_checkpoint=True,
	    chunk_size=64,   # Chunked attention for long sequences
	    lambda_K=0.1,    # Negentropy regularization weight
	    lambda_C=0.1,    # Complexity regularization weight
	    lambda_S=0.1     # Symbiosis regularization weight
	)
	```

	### 2. Text Generation

	```python
	from bit_transformer.bit_io import sample_text

	# Generate text from prompt
	prompt = "The future of AI is"
	generated = sample_text(
	    model,
	    prompt=prompt,
	    max_new_tokens=20,  # Generate ~20 new characters
	    temperature=0.8,    # Sampling temperature
	    top_p=0.9           # Nucleus sampling
	)
	print(f"Generated: {generated}")
	```

	### 3. Safe Inference

	```python
	from bit_transformer import hil_safe_inference, text_to_bits
	import torch

	# Convert text to bits
	text = "Hello, world!"
	bits = torch.tensor(text_to_bits(text)).unsqueeze(0)

	# Safe inference with telemetry monitoring
	try:
	    output_bits, telemetry = hil_safe_inference(
	        model,
	        bits,
	        c_floor=0.3,  # Minimum complexity threshold
	        s_floor=0.5,  # Minimum symbiosis threshold
	        strict=True   # Raise an error if thresholds are not met
	    )
	    print("✅ Safe inference completed")
	    print(f"K (Negentropy): {telemetry.get('negentropy_logits', 'N/A')}")
	    print(f"C (Complexity): {telemetry.get('lz_complexity_logits', 'N/A')}")
	    print(f"S (Symbiosis): {telemetry.get('symbiosis_score', 'N/A')}")
	except Exception as e:
	    print(f"⚠️ Safety check failed: {e}")
	```

	### 4. Interactive Dashboard

	```python
	# Launch the interactive dashboard from a shell:
	#   python unified_workflow.py --dashboard

	# Or programmatically
	from bit_transformer.dashboard_app import run_dashboard
	run_dashboard(host="localhost", port=5000)
	```

	The dashboard provides:
	- Real-time training monitoring
	- Telemetry visualization
	- Model configuration controls
	- HuggingFace checkpoint management
	- Safe inference testing interface

	---

	## Advanced Features

	### 1. Diffusion Mode Training

	Diffusion mode enables bidirectional processing for higher quality generation:

	```bash
	# Train with diffusion mode
	python unified_workflow.py --diffusion --diffusion-steps 8 --dataset-size 32

	# Different noise schedules
	python unified_workflow.py --diffusion --noise-schedule cosine --diffusion-steps 16

	# Diffusion curriculum (noise decay over epochs)
	python unified_workflow.py --diffusion --diffusion-curriculum
	```

	Diffusion Parameters:
	- `--diffusion-steps`: Number of denoising steps (more steps take longer but can improve quality)
	- `--noise-schedule`: `linear`, `cosine`, or `exp` noise decay
	- `--diffusion-curriculum`: Gradually reduce noise over training epochs
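The three schedule names above can be illustrated with a small function that maps training or denoising progress to a noise level. The exact formulas used by `unified_workflow.py` are not documented here, so the curves below (and the decay rate of 5.0) are assumptions chosen only to show the typical shapes; `noise_level` is a hypothetical name.

```python
import math


def noise_level(step: int, total_steps: int, schedule: str = "linear") -> float:
    """Return a noise level in [0, 1] that decays as step goes from 0 to total_steps."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    t = min(max(step / total_steps, 0.0), 1.0)  # progress in [0, 1]
    if schedule == "linear":
        return 1.0 - t
    if schedule == "cosine":
        return 0.5 * (1.0 + math.cos(math.pi * t))  # slow start, slow finish
    if schedule == "exp":
        return math.exp(-5.0 * t)  # 5.0 is an arbitrary illustrative decay rate
    raise ValueError(f"unknown schedule: {schedule}")


for name in ("linear", "cosine", "exp"):
    print(name, [round(noise_level(s, 8, name), 3) for s in range(0, 9, 2)])
```

In a bit-level model the noise level would most naturally be read as the probability of flipping each bit, though that interpretation is also an assumption.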

	### 2. Progressive Scaling

	Enable automatic model growth based on performance:

	```python
	import torch
	from bit_transformer import BitTransformerLM
	from bit_transformer.training import train_loop
	from bit_transformer.scale import expand_model

	# Training loop with automatic scaling
	model = BitTransformerLM(d_model=64, nhead=4, num_layers=2, dim_feedforward=128)
	train_data = torch.randint(0, 2, (1000, 64))

	# Train with progressive scaling
	train_loop(
	    model,
	    train_data,
	    epochs=10,
	    batch_size=8,
	    # Progressive scaling will automatically trigger when validation loss plateaus
	)

	# Manual model expansion
	expanded_model = expand_model(model, strategy="depth") # Add layers
	expanded_model = expand_model(model, strategy="width") # Increase width
	expanded_model = expand_model(model, strategy="context") # Extend context
	```
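The plateau trigger mentioned in the example can be pictured as a patience check on validation loss. The sketch below is one common way to write such a check, not the logic inside `train_loop`; the function name, `patience`, and `min_delta` are all hypothetical.

```python
def has_plateaued(val_losses: list[float], patience: int = 3, min_delta: float = 1e-3) -> bool:
    """True when the last `patience` losses fail to beat the earlier best by min_delta."""
    if len(val_losses) <= patience:
        return False  # not enough history to judge
    best_before = min(val_losses[:-patience])
    recent_best = min(val_losses[-patience:])
    return best_before - recent_best < min_delta


print(has_plateaued([1.0, 0.8, 0.6, 0.5]))       # False: still improving
print(has_plateaued([1.0, 0.5, 0.5, 0.5, 0.5]))  # True: no gain in the last 3 epochs
```

A training loop could call a check like this after each validation pass and expand the model only when it returns `True`.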

	### 3. Compression Pipeline

	BitTransformerLM includes run-length encoding for efficient data storage:

	```python
	import torch
	from bit_transformer import compress_bits, decompress_bits, train_loop

	# Compress bit sequences
	original_bits = torch.tensor([0, 0, 0, 1, 1, 0, 1, 1, 1])
	compressed = compress_bits(original_bits)
	decompressed = decompress_bits(compressed)

	print(f"Original: {original_bits}")
	print(f"Compressed: {compressed}")
	print(f"Decompressed: {decompressed}")
	print(f"Compression ratio: {len(original_bits) / len(compressed):.2f}")

	# Use compression in training (assumes `model` and `data` are already defined)
	train_loop(
	    model,
	    data,
	    compress_prob=0.5,   # 50% of training uses compressed data
	    compress_warmup=100  # Start compression after 100 steps
	)
	```
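For readers unfamiliar with run-length encoding, the following plain-Python sketch shows the idea on the same nine-bit input. The storage format that `compress_bits` actually produces is not described in this guide, so the `(bit, run_length)` pairs and the function names here are illustrative assumptions.

```python
def rle_encode(bits: list[int]) -> list[tuple[int, int]]:
    """Collapse consecutive equal bits into (bit, run_length) pairs."""
    runs: list[tuple[int, int]] = []
    for bit in bits:
        if runs and runs[-1][0] == bit:
            runs[-1] = (bit, runs[-1][1] + 1)  # extend the current run
        else:
            runs.append((bit, 1))  # start a new run
    return runs


def rle_decode(runs: list[tuple[int, int]]) -> list[int]:
    """Expand (bit, run_length) pairs back into the original bit list."""
    return [bit for bit, length in runs for _ in range(length)]


original = [0, 0, 0, 1, 1, 0, 1, 1, 1]
runs = rle_encode(original)
print(runs)  # [(0, 3), (1, 2), (0, 1), (1, 3)]
print(rle_decode(runs) == original)  # True
```

Run-length encoding only pays off when runs are long; a sequence that alternates every bit produces one pair per bit and grows instead of shrinking.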

	### 4. Quantization and Optimization

	```python
	import torch
	from bit_transformer import quantize_dynamic, prepare_qat_fx, convert_qat_fx, train_loop

	# Dynamic quantization for inference
	quantized_model = quantize_dynamic(model, dtype=torch.qint8)

	# 4-bit quantization-aware training
	qat_model = prepare_qat_fx(model)
	# ... train qat_model ...
	final_model = convert_qat_fx(qat_model)

	# Enable mixed precision and compilation (assumes `model` and `data` are already defined)
	train_loop(
	    model,
	    data,
	    amp=True,           # Enable automatic mixed precision
	    compile_model=True  # Use torch.compile for speedup
	)
	```

	---

	## Training Your Own Models

	### Basic Training Script

	```python
	import torch
	from bit_transformer import BitTransformerLM, train_loop, configure_optimizer
	from bit_transformer.bit_io import text_to_bits

	# Prepare training data
	texts = ["Hello world", "How are you?", "BitTransformer is working!"]
	all_bits = []
	for text in texts:
	    bits = text_to_bits(text)
	    all_bits.extend(bits)

	# Convert to tensor and create sequences
	data = torch.tensor(all_bits)
	sequences = data.unfold(0, 64, 32) # 64-bit sequences with 32-bit stride

	# Create model
	model = BitTransformerLM(
	    d_model=128,
	    nhead=8,
	    num_layers=4,
	    dim_feedforward=512,
	    max_seq_len=64,
	    reversible=True
	)

	# Configure optimizer
	optimizer = configure_optimizer(model, lr=0.001, weight_decay=0.01)

	# Training loop
	train_loop(
	    model,
	    sequences,
	    epochs=10,
	    batch_size=4,
	    optimizer=optimizer,
	    amp=True,  # Mixed precision
	    log=True   # Enable logging
	)
	```
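The `unfold(0, 64, 32)` call in the script keeps only full-length windows, so a tensor of `n` bits yields `(n - 64) // 32 + 1` sequences. The plain-Python sketch below reproduces that arithmetic without PyTorch; the helper names are hypothetical.

```python
def window_count(n: int, seq_len: int, stride: int) -> int:
    """Number of full windows of length seq_len taken every `stride` elements."""
    return 0 if n < seq_len else (n - seq_len) // stride + 1


def sliding_windows(data: list[int], seq_len: int, stride: int) -> list[list[int]]:
    """Materialize the same windows as a list of lists."""
    return [data[i:i + seq_len] for i in range(0, len(data) - seq_len + 1, stride)]


# 49 ASCII characters at 9 bits each (8 data bits + parity) give 441 bits.
print(window_count(441, 64, 32))  # 12
print(len(sliding_windows(list(range(441)), 64, 32)))  # 12
```

Any bits left over after the last full window are dropped, so very small corpora can lose a noticeable share of their data.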

	### Advanced Training Configuration

	```python
	# Advanced training with all features enabled
	train_loop(
	    model,
	    data,
	    epochs=20,
	    batch_size=8,
	    accum_steps=4,       # Gradient accumulation
	    amp=True,            # Mixed precision
	    compile_model=True,  # torch.compile optimization

	    # Compression settings
	    compress_prob=0.3,   # 30% compression probability
	    compress_warmup=50,  # Start compression after 50 steps

	    # Diffusion settings
	    diffusion=True,             # Enable diffusion mode
	    diffusion_curriculum=True,  # Decay noise over epochs

	    # Direct bit training
	    direct_prob=0.1,  # 10% direct bit prediction

	    # Logging
	    log=True  # Enable detailed logging
	)
	```

	### Custom Training Loop

	```python
	import torch
	import torch.nn.functional as F
	from bit_transformer.utils import set_dropout

	# Manual training loop for full control
	# (assumes `model` and an iterable `data_loader` of bit tensors already exist)
	model.train()
	set_dropout(model, 0.1)  # Enable dropout for training

	optimizer = torch.optim.AdamW(model.parameters(), lr=0.001)
	criterion = F.cross_entropy

	for epoch in range(10):
	    total_loss = 0
	    for batch in data_loader:
	        optimizer.zero_grad()

	        # Forward pass
	        logits, telemetry = model(batch)

	        # Compute loss
	        if logits.dim() == 3:  # (batch, seq, 2)
	            targets = batch[:, 1:]   # Next bit prediction
	            logits = logits[:, :-1]  # Remove last prediction
	            loss = criterion(logits.reshape(-1, 2), targets.reshape(-1))
	        else:
	            loss = criterion(logits, batch)

	        # Add telemetry regularization
	        if model.lambda_K > 0:
	            loss = loss + model.lambda_K * (1 - telemetry.get('negentropy_logits', 0))
	        if model.lambda_C > 0:
	            loss = loss + model.lambda_C * (1 - telemetry.get('lz_complexity_logits', 0))

	        # Backward pass
	        loss.backward()

	        # Gradient clipping
	        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

	        optimizer.step()
	        total_loss += loss.item()

	        # Safety check
	        if telemetry.get('symbiosis_score', 1.0) < 0.3:
	            print("⚠️ Low symbiosis score detected")

	    print(f"Epoch {epoch}: Average loss = {total_loss / len(data_loader):.4f}")
	```

	---

	## Safety and Monitoring

	### Telemetry Metrics

	BitTransformerLM provides three key safety metrics:

	#### K (Negentropy) - Information Content
	- Range: 0-1 (0 = random noise, 1 = perfectly ordered)
	- Purpose: Measures departure from randomness
	- Interpretation:
	  - Very low K (< 0.1): Output is noise-like
	  - Moderate K (0.3-0.7): Structured but varied output
	  - Very high K (> 0.9): Repetitive or overly structured
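One simple quantity with the stated range is one minus the binary Shannon entropy of the fraction of ones. The sketch below uses that definition as an assumption; the library's `negentropy_logits` may be computed differently, and `binary_negentropy` is a hypothetical name.

```python
import math


def binary_negentropy(bits: list[int]) -> float:
    """Return 1 - H(p), where p is the fraction of ones and H is entropy in bits."""
    if not bits:
        raise ValueError("bits must be non-empty")
    p = sum(bits) / len(bits)
    if p in (0.0, 1.0):
        return 1.0  # a constant sequence has zero entropy
    entropy = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
    return 1.0 - entropy


print(binary_negentropy([1] * 16))            # 1.0
print(binary_negentropy([0, 1] * 8))          # 0.0
print(round(binary_negentropy([0, 0, 0, 1] * 4), 3))
```

Note that this first-order estimate only sees the ratio of ones to zeros, so the perfectly predictable sequence `0101...` still scores 0; capturing that kind of structure is the job of a complexity measure.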

	#### C (LZ Complexity) - Pattern Complexity
	- Range: 0-1 (higher = more complex patterns)
	- Purpose: Proxy for Lempel-Ziv compressibility
	- Interpretation:
	  - Low C (< 0.3): Highly repetitive patterns
	  - Moderate C (0.3-0.7): Balanced complexity
	  - High C (> 0.8): Complex, varied patterns
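A Lempel-Ziv style complexity can be approximated by counting the phrases in an LZ78-style parse, where each phrase is the shortest chunk not seen before. The sketch below normalizes by sequence length to keep the score in (0, 1]; both the parse variant and the normalization are assumptions, and the library's `lz_complexity_logits` may use another scheme.

```python
def lz_phrase_count(bits: list[int]) -> int:
    """Count phrases in an LZ78-style parse of the bit sequence."""
    seen: set[tuple[int, ...]] = set()
    phrase: tuple[int, ...] = ()
    count = 0
    for bit in bits:
        phrase += (bit,)
        if phrase not in seen:  # shortest chunk not seen so far ends a phrase
            seen.add(phrase)
            count += 1
            phrase = ()
    if phrase:  # trailing chunk that repeats an earlier phrase
        count += 1
    return count


def lz_complexity(bits: list[int]) -> float:
    """Phrase count divided by sequence length (assumed normalization)."""
    if not bits:
        raise ValueError("bits must be non-empty")
    return lz_phrase_count(bits) / len(bits)


constant = [0] * 16
mixed = [0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1]
print(lz_complexity(constant))  # 0.375
print(lz_complexity(mixed))     # 0.5
```

Repetitive input produces a few long phrases and a low score, while varied input keeps introducing new short phrases and scores higher.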

	#### S (Symbiosis) - Distribution Alignment
	- Range: 0-1 (higher = better alignment)
	- Purpose: Agreement with reference distributions via KL divergence
	- Interpretation:
	  - Low S (< 0.3): Poor alignment with expected patterns
	  - Moderate S (0.5-0.8): Good alignment
	  - High S (> 0.8): Excellent alignment
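A KL divergence is unbounded, so it has to be mapped into the 0-1 range somehow. The sketch below uses `exp(-KL)` between two Bernoulli distributions as one plausible mapping; this choice, and the function names, are assumptions rather than the library's definition of `symbiosis_score`.

```python
import math


def bernoulli_kl(p: float, q: float, eps: float = 1e-9) -> float:
    """KL(Bernoulli(p) || Bernoulli(q)) in nats, clamped to avoid log(0)."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def symbiosis_score(p: float, q: float) -> float:
    """Map KL divergence into (0, 1]; identical distributions score 1.0."""
    return math.exp(-bernoulli_kl(p, q))


print(symbiosis_score(0.5, 0.5))            # 1.0
print(round(symbiosis_score(0.6, 0.5), 3))  # close to 1: small mismatch
print(round(symbiosis_score(0.9, 0.1), 3))  # much lower: large mismatch
```

Here `p` would be the model's predicted probability of a 1 bit and `q` the reference probability, averaged or evaluated per position as the application requires.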

	### Safety Gates

	```python
	from bit_transformer.safety import SafetyGate, safe_sample_with_retry

	# Configure safety gate
	gate = SafetyGate(
	    c_floor=0.3,  # Minimum complexity
	    s_floor=0.5,  # Minimum symbiosis
	    decay=0.9,    # EMA decay factor
	    burn_in=10    # Steps before gating starts
	)

	# Check if output should be blocked
	should_block = gate.should_trigger(c_val=0.2, s_val=0.4)  # True - below thresholds

	# Safe sampling with automatic retry
	output = safe_sample_with_retry(
	    model,
	    input_bits,
	    max_retries=3,
	    retry_strategy="diffusion"  # Try diffusion mode on failure
	)
	```
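The `decay` and `burn_in` parameters suggest a gate that smooths the metrics with an exponential moving average and stays silent for the first few steps. The class below is a sketch of that pattern under those assumptions; `EmaSafetyGate` is a hypothetical name and its behavior may differ from the real `SafetyGate`, particularly during burn-in.

```python
from typing import Optional


class EmaSafetyGate:
    """Trigger when the smoothed C or S metric falls below its floor after burn-in."""

    def __init__(self, c_floor: float = 0.3, s_floor: float = 0.5,
                 decay: float = 0.9, burn_in: int = 10) -> None:
        self.c_floor = c_floor
        self.s_floor = s_floor
        self.decay = decay
        self.burn_in = burn_in
        self.c_ema: Optional[float] = None
        self.s_ema: Optional[float] = None
        self.steps = 0

    def should_trigger(self, c_val: float, s_val: float) -> bool:
        self.steps += 1
        if self.c_ema is None or self.s_ema is None:
            self.c_ema, self.s_ema = c_val, s_val  # seed the averages
        else:
            self.c_ema = self.decay * self.c_ema + (1 - self.decay) * c_val
            self.s_ema = self.decay * self.s_ema + (1 - self.decay) * s_val
        if self.steps <= self.burn_in:
            return False  # never block while the averages are warming up
        return self.c_ema < self.c_floor or self.s_ema < self.s_floor


gate = EmaSafetyGate(c_floor=0.3, s_floor=0.5, decay=0.9, burn_in=2)
decisions = [gate.should_trigger(0.2, 0.4) for _ in range(4)]
print(decisions)  # [False, False, True, True]
```

Smoothing means a single bad reading rarely trips the gate, at the cost of reacting a few steps late to a genuine drop.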

	### Metric Drift Detection

	```python
	from bit_transformer.telemetry import detect_metric_drift

	# Monitor metric stability over time
	metrics_history = [
	    {"K": 0.5, "C": 0.6, "S": 0.7},
	    {"K": 0.52, "C": 0.58, "S": 0.69},
	    {"K": 0.8, "C": 0.9, "S": 0.4},  # Drift detected!
	    # ... more metrics
	]

	drift_detected = detect_metric_drift(
	    metrics_history,
	    window=10,     # Look back 10 steps
	    threshold=0.2  # Alert if change > 0.2
	)

	if drift_detected:
	    print("⚠️ Model behavior drift detected!")
	```
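One straightforward way to implement a windowed drift check is to compare each metric's mean over the latest window with its mean over the window before it. The sketch below follows that approach as an assumption; it is not the implementation of `detect_metric_drift`, and `mean_shift_drift` is a hypothetical name.

```python
def mean_shift_drift(history: list[dict[str, float]], window: int = 10,
                     threshold: float = 0.2) -> bool:
    """True if any metric's windowed mean moved by more than `threshold`."""
    if len(history) < 2 * window:
        return False  # need two full windows to compare
    recent = history[-window:]
    previous = history[-2 * window:-window]
    for key in recent[0]:
        recent_mean = sum(m[key] for m in recent) / window
        previous_mean = sum(m[key] for m in previous) / window
        if abs(recent_mean - previous_mean) > threshold:
            return True
    return False


stable = [{"K": 0.5, "C": 0.6, "S": 0.7}] * 4
drifted = stable[:2] + [{"K": 0.8, "C": 0.9, "S": 0.4}] * 2
print(mean_shift_drift(stable, window=2))   # False
print(mean_shift_drift(drifted, window=2))  # True
```

Averaging over a window keeps one noisy step from raising an alert, while a sustained shift in any of K, C, or S still does.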

	---

	## Distributed Training

	### FSDP (Fully Sharded Data Parallel)

	```python
	from bit_transformer.distributed import wrap_fsdp, setup_distributed
	import torch.distributed as dist

	# Initialize distributed training
	setup_distributed(rank=0, world_size=4)

	# Wrap model with FSDP
	model = BitTransformerLM(d_model=1024, nhead=16, num_layers=12)
	fsdp_model = wrap_fsdp(
	    model,
	    sharding_strategy="FULL_SHARD",  # or "SHARD_GRAD_OP", "NO_SHARD"
	    mixed_precision=True,
	    device_id=0
	)

	# Train with FSDP
	train_loop(
	    fsdp_model,
	    data,
	    epochs=10,
	    batch_size=2,  # Smaller batch per GPU
	    amp=True
	)
	```

	### Pipeline Parallelism

	```python
	from bit_transformer.distributed import make_pipeline

	# Create pipeline parallel model
	pipeline_model = make_pipeline(
	    model,
	    balance=[2, 2, 2, 2],  # Split 8 layers across 4 GPUs
	    devices=[0, 1, 2, 3],
	    checkpoint="never"     # or "always", "except_last"
	)

	# Pipeline training requires special handling
	# See unified_workflow.py for complete implementation
	```

	### Multi-GPU Training Script

	```bash
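	# Note: torch.distributed.launch is deprecated in recent PyTorch releases;
	# torchrun accepts the same launcher options shown here.

	# Single node, multiple GPUs
	python -m torch.distributed.launch \
	    --nproc_per_node=4 \
	    unified_workflow.py \
	    --distributed \
	    --batch-size 2 \
	    --epochs 10

	# Multiple nodes (run on each node with its own --node_rank)
	python -m torch.distributed.launch \
	    --nnodes=2 \
	    --node_rank=0 \
	    --master_addr="192.168.1.100" \
	    --master_port=29500 \
	    --nproc_per_node=4 \
	    unified_workflow.py \
	    --distributed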
	```

	---

	## Performance Optimization

	### Memory Optimization

	```python
	# Enable all memory optimizations
	model = BitTransformerLM(
	    d_model=512,
	    nhead=8,
	    num_layers=8,
	    reversible=True,         # Reversible layers save ~50% memory
	    use_checkpoint=True,     # Gradient checkpointing
	    chunk_size=64,           # Chunked attention for long sequences
	    full_attn_logging=False  # Skip full attention reconstruction
	)

	# Training optimizations
	train_loop(
	    model,
	    data,
	    batch_size=4,       # Smaller batches
	    accum_steps=8,      # Gradient accumulation
	    amp=True,           # Mixed precision
	    compile_model=True  # torch.compile
	)
	```

	### CPU Optimization

	```python
	from bit_transformer.torch_utils import cpu_autocast

	# Enable BF16 on CPU
	with cpu_autocast():
	    logits, telemetry = model(bits)

	# Or enable for entire model
	model = BitTransformerLM(use_autocast=True)  # Automatically uses CPU BF16
	```

	### Inference Optimization

	```python
	# Quantize for inference
	import torch
	from bit_transformer import quantize_dynamic
	from bit_transformer.utils import set_dropout

	# Switch to evaluation mode
	model.eval()
	set_dropout(model, 0.0)

	# Dynamic quantization
	quantized = quantize_dynamic(model, dtype=torch.qint8)

	# Run inference without tracking gradients
	with torch.no_grad():
	    logits, _ = quantized(input_bits)
	```

	### Long Sequence Processing

	```python
	import torch
	from bit_transformer import text_to_bits
	from bit_transformer.model import infer_long_sequence

	# Process sequences longer than max_seq_len
	long_text = "Very long text..." * 1000
	bits = text_to_bits(long_text)

	output = infer_long_sequence(
	    model,
	    torch.tensor(bits).unsqueeze(0),
	    chunk_size=256,  # Process in 256-bit chunks
	    overlap=32,      # 32-bit overlap between chunks
	    stride=224       # 224-bit stride (256-32)
	)
	```
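The relationship between `chunk_size`, `overlap`, and `stride` is easiest to see by listing the index ranges each chunk covers. The helper below is a plain-Python sketch of that bookkeeping, not the internals of `infer_long_sequence`; `chunk_spans` is a hypothetical name.

```python
def chunk_spans(total_len: int, chunk_size: int = 256, overlap: int = 32) -> list[tuple[int, int]]:
    """Return (start, end) index pairs covering [0, total_len) with the given overlap."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    stride = chunk_size - overlap  # 224 for the defaults
    spans: list[tuple[int, int]] = []
    start = 0
    while start < total_len:
        end = min(start + chunk_size, total_len)
        spans.append((start, end))
        if end == total_len:
            break  # the final chunk may be shorter than chunk_size
        start += stride
    return spans


print(chunk_spans(600))  # [(0, 256), (224, 480), (448, 600)]
```

Each chunk re-reads the last 32 bits of the previous one, which gives the model some left context at every chunk boundary.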

	---

	## Troubleshooting

	### Common Issues

	#### 1. Memory Errors
	```
	RuntimeError: CUDA out of memory
	```
	Solutions:
	- Enable reversible layers: `reversible=True`
	- Enable gradient checkpointing: `use_checkpoint=True`
	- Reduce batch size or use gradient accumulation
	- Use chunked attention: `chunk_size=64`
	- Enable mixed precision: `amp=True`

	#### 2. Tensor Shape Mismatches
	```
	RuntimeError: view size is not compatible with input tensor's size
	```
	Solutions:
	- Always use `.reshape()` instead of `.view()` with BitTransformerLM
	- Check that input sequences are properly formatted (1D for bits)
	- Ensure batch dimensions are consistent

	#### 3. Parity Check Failures
	```
	ValueError: Parity check failed
	```
	Solutions:
	- Use `enforce_parity()` to fix parity bits in generated sequences
	- Check that text encoding/decoding is consistent
	- Verify bit sequences have correct 9-bit (8+parity) structure
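When tracking down a parity failure it helps to know which 9-bit group is at fault. The sketch below locates failing groups and recomputes their parity bits; it assumes even parity stored in the ninth bit of each group, and the function names are hypothetical rather than part of the library (whose `enforce_parity()` may behave differently).

```python
def find_parity_errors(bits: list[int]) -> list[int]:
    """Return the indices of 9-bit groups whose parity bit disagrees with the data bits."""
    if len(bits) % 9 != 0:
        raise ValueError("bit length must be a multiple of 9")
    return [
        i // 9
        for i in range(0, len(bits), 9)
        if sum(bits[i:i + 8]) % 2 != bits[i + 8]
    ]


def repair_parity(bits: list[int]) -> list[int]:
    """Recompute every parity bit from its 8 data bits, leaving the data bits unchanged."""
    fixed = list(bits)
    for i in range(0, len(fixed) - 8, 9):
        fixed[i + 8] = sum(fixed[i:i + 8]) % 2
    return fixed


good = [0, 1, 0, 0, 1, 0, 0, 0, 0]  # 0x48 ("H") followed by even parity
bad = good[:8] + [1]                # same byte with a flipped parity bit
print(find_parity_errors(bad + good))  # [0]
print(repair_parity(bad) == good)      # True
```

Repairing parity makes a generated sequence decodable, but it cannot tell whether the data bits themselves are what the model intended.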

	#### 4. Safety Gate Triggering
	```
	SafetyError: Output blocked by safety gate
	```
	Solutions:
	- Lower safety thresholds: `c_floor=0.2, s_floor=0.4`
	- Increase burn-in period: `burn_in=20`
	- Use retry with diffusion: `safe_sample_with_retry()`
	- Check model training quality

	### Debug Mode

	```python
	# Enable detailed logging
	import logging
	import torch
	logging.basicConfig(level=logging.DEBUG)

	# Model with debug telemetry
	model = BitTransformerLM(
	    d_model=64,
	    nhead=4,
	    num_layers=2,
	    full_attn_logging=True,  # Log full attention maps
	    chunk_size=None          # Disable chunking for debugging
	)

	# Inspect telemetry
	logits, telemetry = model(input_bits)
	print("Telemetry keys:", list(telemetry.keys()))
	print("Attention maps shape:", [a.shape for a in telemetry['attention_maps']])
	# torch.Tensor has no .describe(); report summary statistics explicitly
	acts = torch.stack(telemetry['activations']).float()
	print(f"Activation stats: mean={acts.mean().item():.4f}, std={acts.std().item():.4f}")
	```

	### Performance Profiling

	```python
	import torch
	import torch.nn.functional as F
	import torch.profiler

	# Profile training step
	# (assumes `model`, `input_bits`, and matching `targets` are already defined)
	with torch.profiler.profile(
	    activities=[
	        torch.profiler.ProfilerActivity.CPU,
	        torch.profiler.ProfilerActivity.CUDA,
	    ],
	    record_shapes=True,
	    with_stack=True,
	) as prof:
	    logits, telemetry = model(input_bits)
	    loss = F.cross_entropy(logits.reshape(-1, 2), targets.reshape(-1))
	    loss.backward()

	print(prof.key_averages().table(sort_by="cuda_time_total"))
	```

	---

	## Best Practices

	### Model Configuration

	#### For Experimentation (< 1M parameters)
	```python
	model = BitTransformerLM(
	    d_model=64,
	    nhead=4,
	    num_layers=2,
	    dim_feedforward=128,
	    max_seq_len=128,
	    reversible=False,  # Simpler for debugging
	    use_checkpoint=False
	)
	```

	#### For Research (1M-100M parameters)
	```python
	model = BitTransformerLM(
	    d_model=256,
	    nhead=8,
	    num_layers=6,
	    dim_feedforward=1024,
	    max_seq_len=512,
	    reversible=True,  # Enable memory efficiency
	    use_checkpoint=True,
	    chunk_size=128,
	    lambda_K=0.05,    # Light regularization
	    lambda_C=0.05,
	    lambda_S=0.05
	)
	```

	#### For Large-Scale (100M+ parameters)
	```python
	model = BitTransformerLM(
	    d_model=1024,
	    nhead=16,
	    num_layers=20,
	    dim_feedforward=4096,
	    max_seq_len=2048,
	    reversible=True,
	    use_checkpoint=True,
	    chunk_size=256,
	    full_attn_logging=False,  # Save memory
	    lambda_K=0.1,
	    lambda_C=0.1,
	    lambda_S=0.1
	)
	```

	### Training Best Practices

	1. Always validate on held-out data to monitor overfitting
	2. Use gradient clipping to prevent training instability
	3. Monitor telemetry metrics for signs of model degradation
	4. Start with smaller models before scaling up
	5. Use safety gates in production deployments
	6. Enable logging to track training progress
	7. Save checkpoints frequently to prevent loss of progress

	### Data Preparation

	```python
	import torch
	from bit_transformer import text_to_bits

	# Good: Clean, well-formatted text
	texts = [
	    "The quick brown fox jumps over the lazy dog.",
	    "Machine learning is transforming technology.",
	    "BitTransformer processes information at the bit level."
	]

	# Convert to training sequences
	all_bits = []
	for text in texts:
	    bits = text_to_bits(text)
	    all_bits.extend(bits)

	# Create overlapping sequences for better learning
	data = torch.tensor(all_bits)
	seq_len = 128
	stride = 64
	sequences = []
	# "+ 1" keeps a final window that ends exactly at the last bit
	for i in range(0, len(data) - seq_len + 1, stride):
	    sequences.append(data[i:i + seq_len])

	training_data = torch.stack(sequences)
	```

	### Production Deployment

	```python
	import logging
	from bit_transformer import quantize_dynamic, text_to_bits
	from bit_transformer.safety import SafetyGate, safe_sample_with_retry
	from bit_transformer.utils import set_dropout

	# Production-ready model setup
	model.eval()  # Disable dropout
	set_dropout(model, 0.0)

	# Enable safety monitoring
	gate = SafetyGate(c_floor=0.3, s_floor=0.5, burn_in=5)

	# Quantize for efficiency
	production_model = quantize_dynamic(model)

	# Safe inference with monitoring
	def safe_generate(input_text, max_length=100):
	    try:
	        return safe_sample_with_retry(
	            production_model,
	            text_to_bits(input_text),
	            max_retries=3
	        )
	    except Exception as e:
	        logging.error(f"Generation failed: {e}")
	        return "Error: Unable to generate safe output"
	```

	---

	## Getting Help

	### Documentation Resources
	- ABOUTME.md: Project overview and quick start
	- README.md: Professional model card and specifications
	- RESEARCH_STATUS.md: Current research status and limitations
	- EMPIRICAL_VALIDATION.md: Evidence-based analysis of capabilities

	### Community Support
	- GitHub Issues: Report bugs and request features
	- Discussions: Ask questions and share experiences
	- Examples: Check the `tests/` directory for usage examples

	### 🤖 Recommended: Use with Claude Code

	For the best experience with BitTransformerLM, we recommend using [Claude Code](https://claude.ai/code):

	- Interactive Setup: Get step-by-step guidance for configuration
	- Real-time Debugging: Immediate help when things go wrong
	- Code Generation: Custom scripts and experiments tailored to your needs
	- Architecture Explanation: Deep understanding of bit-native processing
	- Best Practices: Learn optimal configurations for your use case

	Claude Code understands BitTransformerLM's unique architecture and can help you navigate the complexities of bit-level language modeling.

	---

	Remember: BitTransformerLM is experimental research software. Always validate results thoroughly and use safety monitoring in any deployment.

	Happy experimenting! 🤖✨