leonardlin's Collections: data
A Pretrainer's Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity
Paper
• 2305.13169
• Published
• 3
A Survey on Data Selection for Language Models
Paper
• 2402.16827
• Published
• 4
HuggingFaceFW/fineweb-edu
Viewer
• Updated
• 3.5B • 224k
• 965
Updated
• 199k
• 156
Viewer
• Updated
• 7.18B • 39.4k
• 591
Best Practices and Lessons Learned on Synthetic Data for Language Models
Paper
• 2404.07503
• Published
• 31
Scaling Synthetic Data Creation with 1,000,000,000 Personas
Paper
• 2406.20094
• Published
• 104
DDK: Distilling Domain Knowledge for Efficient Large Language Models
Paper
• 2407.16154
• Published
• 22
Unleashing the Power of Data Tsunami: A Comprehensive Survey on Data Assessment and Selection for Instruction Tuning of Language Models
Paper
• 2408.02085
• Published
• 19
Better Alignment with Instruction Back-and-Forth Translation
Paper
• 2408.04614
• Published
• 15
The ShareLM Collection and Plugin: Contributing Human-Model Chats for the Benefit of the Community
Paper
• 2408.08291
• Published
• 11