Books from the Survivor Library (mostly ~1920s & earlier) OCR'd with recent VLMs
BEEspoke Data
community
AI & ML interests
'an LLM is only as good as the dataset it was trained on' - Sun Tzu
Organization Card
🚧 "raw" pretrained smol_llama checkpoints - WIP 🚧
BEE-spoke-data/smol_llama-101M-GQA
Text Generation • 0.1B • Updated • 1.83k • 32
BEE-spoke-data/smol_llama-81M-tied
Text Generation • 81.3M • Updated • 795 • 9
BEE-spoke-data/smol_llama-220M-GQA
Text Generation • 0.2B • Updated • 2.19k • 13
BEE-spoke-data/verysmol_llama-v11-KIx2
Text Generation • 58.1M • Updated • 863 • 4
models 58
BEE-spoke-data/NVIDIA-Nemotron-Parse-v1.2
Image-Text-to-Text • 0.9B • Updated • 74
BEE-spoke-data/neobert-100k-test
Fill-Mask • 0.1B • Updated
BEE-spoke-data/tiny-random-MPNetForMaskedLM
Fill-Mask • 237k • Updated • 3
BEE-spoke-data/bpe-tokenizer-32k-smolNeoX
Updated
BEE-spoke-data/wordpiece-tokenizer-32k-en_code-orig
Updated
BEE-spoke-data/wordpiece-tokenizer-32k-en_code-msp
Updated
BEE-spoke-data/pegasus-x-base-synthsumm_open-16k
Summarization • 0.3B • Updated • 28 • 2
BEE-spoke-data/tFINE-680m-e32-d16-infinity_instruct-L2
Text Generation • 0.7B • Updated
BEE-spoke-data/tFINE-680m-e32-d16-gqa-flan
0.7B • Updated • 1
BEE-spoke-data/tFINE-900m-instruct-orpo
0.9B • Updated • 3
datasets 82
BEE-spoke-data/SurvivorLib-Nanonets-OCR-s
Updated • 14.4k • 22 • 2
BEE-spoke-data/SurvivorLib-rolmOCR
Updated • 14.6k • 27 • 1
BEE-spoke-data/govdocs1-pdf-source
Updated • 235k • 2.43k • 4
BEE-spoke-data/napierone-pdf-nanonets-s
Updated • 9.96k • 7
BEE-spoke-data/napierone-pdf-olmOCR
Updated • 19k • 23
BEE-spoke-data/LONGCOT-merged-1M
Updated • 1.7M • 26 • 2
BEE-spoke-data/cosmopedia-v2-mincols
Updated • 39.1M • 39 • 1
BEE-spoke-data/reddit-title-body-hf
Updated • 251M • 107 • 4
BEE-spoke-data/bigpatent-all
Updated • 2.43M • 237
BEE-spoke-data/google_wellformed_query-hf
Updated • 25.1k • 12