DARE: Distribution-Aware Retrieval for R Functions

DARE (Distribution-Aware Retrieval Embedding) is a specialized bi-encoder model that retrieves statistical and data analysis tools (R functions) conditioned on both the user query and a profile of the data.

It is fine-tuned from sentence-transformers/all-MiniLM-L6-v2 to serve as a high-precision tool retrieval module for Large Language Model (LLM) Agents in automated data science workflows.

Model Details

Architecture: Bi-encoder (Sentence Transformer)
Base Model: sentence-transformers/all-MiniLM-L6-v2 (22.7M parameters)
Task: Dense Retrieval for Tool-Augmented LLMs
Performance: State of the art (SoTA) on R package retrieval tasks.
Domain: R programming language, Data Science, Statistical Analysis functions
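
To make the bi-encoder setup concrete: the query and each candidate function are embedded independently, and candidates are then ranked by vector similarity. The following is a minimal sketch of that ranking step, assuming cosine similarity as the scoring function; it is not the model's actual code. The 3-dimensional vectors are made-up stand-ins for real embeddings, and `cosine` and `rank_by_cosine` are hypothetical helper names.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)


def rank_by_cosine(query_vec, candidates, top_k=3):
    """Return the top_k (name, score) pairs, highest similarity first."""
    scored = [(name, cosine(query_vec, vec)) for name, vec in candidates.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy vectors for illustration only; real embeddings come from the model.
toy_index = {
    "stats::t.test": [0.9, 0.1, 0.0],
    "stats::lm": [0.1, 0.9, 0.1],
    "stats::kmeans": [0.0, 0.2, 0.9],
}
toy_query = [0.8, 0.2, 0.1]

ranking = rank_by_cosine(toy_query, toy_index, top_k=2)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

In practice the vector store performs this ranking, so a helper like this is mainly useful for inspecting a handful of candidates by hand.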

Usage (Sentence-Transformers)

First, install the sentence-transformers library:

pip install -U sentence-transformers

Usage with our RPKB (optional but recommended)

# Requires: pip install -U huggingface_hub chromadb
from huggingface_hub import snapshot_download
import chromadb

# 1. Download the database folder from Hugging Face
db_path = snapshot_download(
    repo_id="Stephen-SMJ/RPKB", 
    repo_type="dataset",
    allow_patterns="RPKB/*"  # Adjust this if your folder name is different
)

# 2. Connect to the local ChromaDB instance
client = chromadb.PersistentClient(path=f"{db_path}/RPKB")

# 3. Access the specific collection
collection = client.get_collection(name="inference")

print(f"✅ Loaded {collection.count()} R functions ready for conditional retrieval!")

Then load the DARE model and run retrieval:

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

# 1. Load the DARE model
model = SentenceTransformer("Stephen-SMJ/DARE-R-Retriever")

# 2. Define the exact input format: Query + Data Profile
query = (
    "I have a high-dimensional genomic dataset named hidra_ex_1_2000.csv in my environment. "
    "I need to identify driver elements by estimating regulatory scores based on the counts provided "
    "in the data. Please set the random seed to 123 at the start. I need to filter for fragment "
    "lengths between 150 and 600 bp and use a DNA count filter of 5. For my evaluation, please print the "
    "first value of the estimated scores (est_a) for the very first region identified."
)

# 3. Generate embedding
query_embedding = model.encode(query).tolist()

# 4. Search the database (pass a `where` metadata filter to apply hard filters)
results = collection.query(
    query_embeddings=[query_embedding],
    n_results=3,
    include=["metadatas", "distances", "documents"]
)

# Display Top-1 Result
print("Top-1 Function:", results["metadatas"][0][0]["package_name"], "::", results["metadatas"][0][0]["function_name"])
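
The query call above returns nested lists, with one inner list per query embedding, and it does not yet pass a metadata filter. Below is a small sketch of two helpers that may be convenient here: one builds a `where` filter for ChromaDB, the other flattens the results into readable pairs. `build_where` and `summarize_results` are hypothetical names, not part of DARE or RPKB. The `where` argument and the `$eq` / `$and` operators belong to ChromaDB's query API. The sample results are invented, and filtering on `package_name` is only an assumption about what may be useful.

```python
def build_where(**equals):
    """Build a ChromaDB `where` filter requiring each metadata field to equal its value."""
    clauses = [{key: {"$eq": value}} for key, value in equals.items()]
    if not clauses:
        return None  # no filter
    if len(clauses) == 1:
        return clauses[0]  # `$and` needs at least two clauses
    return {"$and": clauses}


def summarize_results(results):
    """Flatten a single-query ChromaDB result into (package::function, distance) pairs."""
    metadatas = results["metadatas"][0]
    distances = results["distances"][0]
    return [
        (f'{meta["package_name"]}::{meta["function_name"]}', dist)
        for meta, dist in zip(metadatas, distances)
    ]


# Hand-written stand-in for what `collection.query(...)` returns (illustrative values).
fake_results = {
    "metadatas": [[
        {"package_name": "pkgA", "function_name": "fit_model"},
        {"package_name": "pkgB", "function_name": "estimate_scores"},
    ]],
    "distances": [[0.12, 0.34]],
}

for name, distance in summarize_results(fake_results):
    print(f"{name}  (distance={distance:.2f})")
```

With a live collection, the filter would be passed as `collection.query(query_embeddings=[query_embedding], n_results=3, where=build_where(package_name="pkgA"))`, where `"pkgA"` is a placeholder for a real package name.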
