Activity Feed

Recent Activity

codelion posted an update 4 days ago
Scaling Pedagogical Pre-training to 10 Billion Tokens

New blog post exploring what happens when you take optimal data mixing insights and scale up the data generation itself.

We built Sutra, a multi-stage framework for generating pedagogical pre-training data guided by a knowledge graph of ~2,000 concepts across 9 domains. The pipeline includes structured content generation, six-dimension quality evaluation, diversity management across 20 content styles, and a cleaning stage to prevent collapse.
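As a rough illustration of how such a staged pipeline fits together, here is a minimal Python sketch. The concept graph, style names, scoring, and thresholds below are all stand-ins invented for this example — the real Sutra code and hyperparameters are described in the linked blog post, not here.

```python
from dataclasses import dataclass, field
import random

# Hypothetical stand-ins for the knowledge graph, quality dimensions,
# and content styles described in the post.
QUALITY_DIMENSIONS = ["accuracy", "clarity", "depth", "coherence", "pedagogy", "coverage"]
CONTENT_STYLES = [f"style_{i}" for i in range(20)]

@dataclass
class Concept:
    name: str
    domain: str
    prerequisites: list = field(default_factory=list)

def generate_entry(concept, style):
    """Stage 1: structured content generation (stubbed with placeholder text)."""
    return {"text": f"[{style}] Lesson on {concept.name} ({concept.domain})",
            "domain": concept.domain, "style": style,
            "prerequisites": concept.prerequisites}

def score_entry(entry, rng):
    """Stage 2: six-dimension quality evaluation (random stand-in scores)."""
    entry["quality"] = {d: rng.uniform(0, 1) for d in QUALITY_DIMENSIONS}
    return entry

def passes_cleaning(entry, seen_texts, threshold=0.4):
    """Stage 4: cleaning — drop low-quality or duplicate entries to prevent collapse."""
    mean_q = sum(entry["quality"].values()) / len(QUALITY_DIMENSIONS)
    return mean_q >= threshold and entry["text"] not in seen_texts

def run_pipeline(concepts, n_entries, seed=0):
    rng = random.Random(seed)
    out, seen = [], set()
    for _ in range(n_entries * 100):  # cap attempts so the loop always terminates
        if len(out) >= n_entries:
            break
        concept = rng.choice(concepts)
        style = rng.choice(CONTENT_STYLES)  # Stage 3: diversity across styles
        entry = score_entry(generate_entry(concept, style), rng)
        if passes_cleaning(entry, seen):
            seen.add(entry["text"])
            out.append(entry)
    return out
```

The point of the sketch is the control flow: generation, scoring, style diversification, and cleaning are separate stages, so each can be tuned or swapped independently.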

The result is codelion/sutra-10B, a 10.2 billion token pedagogical dataset with rich metadata (domain, complexity, prerequisites, quality scores) on every entry.
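Per-entry metadata makes it easy to carve out curriculum-style subsets. The toy snippet below shows the idea on hand-written dicts; the field names mirror the post (domain, complexity, prerequisites, quality scores) but are assumptions, not the dataset's exact schema.

```python
# Toy entries with sutra-10B-style metadata (field names are illustrative).
entries = [
    {"text": "Intro to sorting", "domain": "cs", "complexity": 1,
     "prerequisites": [], "quality_score": 0.92},
    {"text": "Amortized analysis", "domain": "cs", "complexity": 3,
     "prerequisites": ["big-O"], "quality_score": 0.81},
    {"text": "Cell biology basics", "domain": "biology", "complexity": 1,
     "prerequisites": [], "quality_score": 0.64},
]

def select(entries, domain=None, max_complexity=None, min_quality=0.0):
    """Return entries matching the given metadata constraints."""
    out = []
    for e in entries:
        if domain is not None and e["domain"] != domain:
            continue
        if max_complexity is not None and e["complexity"] > max_complexity:
            continue
        if e["quality_score"] < min_quality:
            continue
        out.append(e)
    return out

# e.g. a beginner-level, high-quality CS slice:
cs_easy = select(entries, domain="cs", max_complexity=2, min_quality=0.9)
```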

We trained codelion/SmolLM2-70M on it for 3 full epochs (30.6B tokens) on a single A10 GPU in ~78 hours.

Key finding: perplexity kept improving across epochs, but benchmark gains plateaued fast. At 70M parameters, the model hits a representational ceiling that more data alone can't break through.

Full writeup with comparisons against 7 other datasets, detailed benchmark breakdowns, and connections to recent work on synthetic data scaling, curriculum learning, and data mixing laws: https://huggingface.co/blog/codelion/scaling-pedagogical-pretraining-10-billion-tokens

All datasets at multiple scales (10M, 100M, 1B, 10B) plus seed concepts and an SFT variant are in the Sutra Pedagogical Datasets collection.

ajibawa-2023 posted an update 8 days ago
Cpp-Code-Large
Dataset: ajibawa-2023/Cpp-Code-Large

Cpp-Code-Large is a large-scale corpus of C++ source code comprising more than 5 million lines of C++ code. The dataset is designed to support research in large language model (LLM) pretraining, code intelligence, software engineering automation, and static program analysis for the C++ ecosystem.

By providing a high-volume, language-specific corpus, Cpp-Code-Large enables systematic experimentation in C++-focused model training, domain adaptation, and downstream code understanding tasks.

Cpp-Code-Large addresses the need for a dedicated C++-only dataset at substantial scale, enabling focused research across systems programming, performance-critical applications, embedded systems, game engines, and large-scale native software projects.
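Since the corpus is sized in lines of code, a simple LOC counter is a natural first tool when working with it. This is a generic sketch, not code from the dataset: it skips blank lines and `//`-only comment lines, and deliberately ignores block comments and string literals for brevity.

```python
def count_loc(source: str) -> int:
    """Rough LOC count for a C++ snippet: skip blank lines and lines
    that are only a // comment. (Block comments and string literals
    are not handled — this is a deliberate simplification.)"""
    return sum(
        1
        for line in source.splitlines()
        if line.strip() and not line.strip().startswith("//")
    )

sample = """\
#include <iostream>

// entry point
int main() {
    std::cout << "hello\\n";
    return 0;
}
"""
```

Running `count_loc(sample)` counts the five substantive lines and skips the blank line and the comment.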

ajibawa-2023 posted an update 13 days ago
Python-Code-Large
Dataset: ajibawa-2023/Python-Code-Large

Python-Code-Large is a large-scale corpus of Python source code comprising more than 2 million rows of Python code. The dataset is designed to support research in large language model (LLM) pretraining, code intelligence, software engineering automation, and program analysis for the Python ecosystem.

By providing a high-volume, language-specific corpus, Python-Code-Large enables systematic experimentation in Python-focused model training, domain adaptation, and downstream code understanding tasks.

Python-Code-Large addresses the need for a dedicated Python-only dataset at substantial scale, enabling focused research across data science, backend systems, automation, scientific computing, and AI-driven Python environments.

ajibawa-2023 posted an update 17 days ago
PHP-Code-Large

Dataset: ajibawa-2023/PHP-Code-Large

PHP-Code-Large is a large-scale corpus of PHP source code comprising more than 12 million lines of PHP code. The dataset is designed to support research in large language model (LLM) pretraining, code intelligence, software engineering automation, and static program analysis for the PHP ecosystem.

By providing a high-volume, language-specific corpus, PHP-Code-Large enables systematic experimentation in PHP-focused model training, domain adaptation, and downstream code understanding tasks.

PHP-Code-Large addresses the need for a dedicated PHP-only dataset at substantial scale, enabling focused research across backend systems, CMS platforms, APIs, and full-stack PHP environments.

ajibawa-2023 posted an update 22 days ago
JavaScript-Code-Large
Dataset: ajibawa-2023/JavaScript-Code-Large

JavaScript-Code-Large is a large-scale corpus of JavaScript source code comprising around 5 million JavaScript files. The dataset is designed to support research in large language model (LLM) pretraining, code intelligence, software engineering automation, and program analysis for the JavaScript ecosystem.

By providing a high-volume, language-specific corpus, JavaScript-Code-Large enables systematic experimentation in JavaScript-focused model training, domain adaptation, and downstream code understanding tasks.

JavaScript-Code-Large addresses the need for a dedicated JavaScript-only dataset at substantial scale, enabling focused research across frontend, backend, and full-stack JavaScript environments.

ajibawa-2023 posted an update 24 days ago
Java-Code-Large

Dataset: ajibawa-2023/Java-Code-Large

Java-Code-Large is a large-scale corpus of publicly available Java source code comprising more than 15 million Java code samples. The dataset is designed to support research in large language model (LLM) pretraining, code intelligence, software engineering automation, and program analysis.

By providing a high-volume, language-specific corpus, Java-Code-Large enables systematic experimentation in Java-focused model training, domain adaptation, and downstream code understanding tasks.

codelion posted an update about 2 months ago
Reverse Engineering a $500M Mystery: From HashHop to Memory-Augmented Language Models

I wrote a deep dive into how Magic AI's 100M token context window might work, starting from their HashHop benchmark and building up to MALM - a Memory-Augmented Language Model.

Key insight: treating each key as a single token enables perfect retrieval at unlimited context lengths.
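The mechanics of that insight can be shown in a few lines. If every hash key maps to exactly one vocabulary token, "retrieval" collapses to an exact dictionary lookup, so accuracy stays perfect regardless of how many pairs sit in context. This is a toy model of HashHop-style hash chains, not Magic's actual system or MALM's implementation.

```python
import random

def build_chain_memory(n_pairs, rng):
    """Build a hash chain where each key/value is a single unique token id."""
    tokens = rng.sample(range(1_000_000), n_pairs + 1)
    memory = {tokens[i]: tokens[i + 1] for i in range(n_pairs)}
    return memory, tokens

def hop(memory, start_token, n_hops):
    """Follow the chain n_hops times via exact lookups — no attention,
    no approximation, so accuracy is 100% at any context length."""
    t = start_token
    for _ in range(n_hops):
        t = memory[t]
    return t

rng = random.Random(0)
memory, tokens = build_chain_memory(100_000, rng)  # "context" of 100k pairs
```

With 100,000 pairs in "context", `hop(memory, tokens[0], k)` still lands on `tokens[k]` exactly, which is why perfect HashHop accuracy alone is a suspicious signal, as the article argues.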

The article covers:

- How HashHop works and why its perfect accuracy is suspicious
- Building a tokenized solver that achieves 100% accuracy
- Scaling to MALM for real code search tasks
- Why this approach could handle 100M+ tokens

Read the full article: https://huggingface.co/blog/codelion/reverse-engineering-magic-hashhop

Try the model: codelion/malm-165m

Code: https://github.com/codelion/hash-hop

codelion posted an update 3 months ago
Introducing Dhara-70M: A diffusion language model that achieves 3.8x higher throughput than autoregressive models!

Key findings from our research on optimal architectures for small language models:

→ Depth beats width: a 32-layer model outperforms a 12-layer one at the same parameter count
→ Best-in-class factuality: 47.5% on TruthfulQA
→ 10x training efficiency using WSD (Warmup-Stable-Decay) conversion
→ Canon layers add only 0.13% parameters but improve reasoning

We trained on 1B tokens using the optimal 50-30-20 dataset mix (PDFs + filtered web + educational content), then converted to diffusion with just 100M additional tokens.
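For readers unfamiliar with WSD, here is a minimal sketch of the schedule shape: linear warmup, a flat stable plateau, then a decay to a floor. The fractions and the linear decay below are illustrative defaults, not the hyperparameters used for Dhara-70M.

```python
def wsd_lr(step, total_steps, peak_lr, warmup_frac=0.1, decay_frac=0.2, min_lr=0.0):
    """Warmup-Stable-Decay learning-rate schedule (illustrative parameters).
    The long stable phase is what makes mid-training conversion cheap:
    you can branch off the plateau and run only the decay on new data."""
    warmup = int(total_steps * warmup_frac)
    decay_start = int(total_steps * (1 - decay_frac))
    if step < warmup:
        return peak_lr * (step + 1) / warmup          # linear warmup
    if step < decay_start:
        return peak_lr                                # stable plateau
    frac = (step - decay_start) / max(1, total_steps - decay_start)
    return peak_lr + (min_lr - peak_lr) * frac        # linear decay
```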

Blog: https://huggingface.co/blog/codelion/optimal-model-architecture
Model: codelion/dhara-70m

codelion posted an update 3 months ago
Introducing PTS Visualizer - an interactive tool for exploring how language models reason!

Visualize pivotal tokens, thought anchors, and reasoning circuits. See which tokens and sentences significantly impact success probability, explore embedding clusters, and trace reasoning step-by-step.

Try it: codelion/pts-visualizer

Explore PTS datasets:
- Qwen3-0.6B: codelion/Qwen3-0.6B-pts
- DeepSeek-R1: codelion/DeepSeek-R1-Distill-Qwen-1.5B-pts

Or upload your own JSONL files!

GitHub: https://github.com/codelion/pts

flozi00 posted an update 3 months ago
We have covered Tensor Parallelism for slicing matrices and Pipeline Parallelism for stacking layers. But what if your model isn't just deep or wide—it's a sprawling Mixture-of-Experts (MoE) architecture like Mixtral or DeepSeek, with trillions of parameters that are mostly idle per token?

Replicating those experts wastes VRAM. Slicing them with TP wastes bandwidth. The solution is Expert Parallelism (EP), which distributes the experts themselves across GPUs and routes tokens to wherever their "chosen" expert lives.

The hardware catch? It is not matrix splitting or pipeline bubbles—it's the "Router's Dilemma." You must shuffle massive volumes of tokens across the cluster using All-to-All communication, and any imbalance can leave expensive GPUs idle.

My latest guide dives into the mechanics of EP and why the interconnect becomes the ultimate bottleneck.

In this breakdown, we explore:

The Token Routing Lifecycle
A four-step hardware flow: Local routing to pick experts, Dispatch (All-to-All shuffle), Expert computation on the "home" GPU, and Combine (another All-to-All to return results).
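The four steps can be simulated in a single process to make the data movement concrete. Real implementations perform the dispatch and combine with `torch.distributed.all_to_all_single` across GPUs; here plain Python lists stand in for the cross-device shuffles, and the "experts" are toy functions.

```python
def expert_parallel_step(tokens, n_experts, router):
    # 1. Local routing: each token picks its expert.
    assignments = [router(t) for t in tokens]
    # 2. Dispatch (first All-to-All): group tokens by the expert's home "GPU".
    inboxes = {e: [] for e in range(n_experts)}
    for idx, expert in enumerate(assignments):
        inboxes[expert].append((idx, tokens[idx]))
    # 3. Expert computation on the home GPU (toy expert: scale by id + 1).
    processed = {e: [(idx, tok * (e + 1)) for idx, tok in box]
                 for e, box in inboxes.items()}
    # 4. Combine (second All-to-All): scatter results back into source order.
    out = [None] * len(tokens)
    for box in processed.values():
        for idx, val in box:
            out[idx] = val
    return out

tokens = [1.0, 2.0, 3.0, 4.0]
result = expert_parallel_step(tokens, n_experts=2, router=lambda t: int(t) % 2)
```

Note that steps 2 and 4 are pure data movement: in the toy version they are free, but on real hardware they are the All-to-All transfers that dominate EP's cost.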

The All-to-All Primitive
Unlike the ring-based syncs in TP, All-to-All creates a dense mesh of personalized data transfers. We compare it to All-Reduce and show why uneven token distribution (load imbalance) causes network congestion and compute skew.

Load Balancing: The Hardware Nightmare
If one expert gets 90% of the tokens, its GPU bottlenecks while others stall. We discuss mitigation strategies like token dropping and auxiliary losses to keep utilization high.
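Capacity-based token dropping is easy to sketch: each expert accepts at most `capacity` tokens per step and overflow tokens are dropped (in practice they typically bypass the expert through the residual stream rather than vanishing). This is a generic illustration of the mitigation named above, not code from the guide.

```python
def drop_to_capacity(assignments, n_experts, capacity):
    """assignments: list of (token_idx, expert) pairs in arrival order.
    Keep at most `capacity` tokens per expert; drop the overflow."""
    kept, dropped, load = [], [], [0] * n_experts
    for token_idx, expert in assignments:
        if load[expert] < capacity:
            load[expert] += 1
            kept.append((token_idx, expert))
        else:
            dropped.append(token_idx)
    return kept, dropped, load

# 6 tokens, expert 0 is "hot": it receives 5 of the 6 assignments.
assignments = [(0, 0), (1, 0), (2, 0), (3, 1), (4, 0), (5, 0)]
kept, dropped, load = drop_to_capacity(assignments, n_experts=2, capacity=3)
```

The final `load` list shows the imbalance directly: the hot expert saturates its capacity while the other sits mostly idle, which is exactly the utilization problem auxiliary load-balancing losses try to prevent.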

The article includes a raw PyTorch implementation of an EP layer using torch.distributed.all_to_all_single to reveal exactly how the data shuffles and where the stalls happen.

Read the full hardware-centric guide here:
https://flozi.net/en/guides/ai/scaling/expert_parallel