Prompt-Code Similarity Model

This model predicts similarity between problem prompts and Ballerina code completions.

Use Case: Given a coding problem prompt, evaluate how well a Ballerina solution matches the requirements.

Training

  • Passing code (test_reward > 0.25) → HIGH similarity (target = 1.0)
  • Failing code (test_reward ≤ 0.25) → LOW similarity (target = 0.0)
  • Only compiled code used (error_reward ≥ 0.25)
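
The filtering and labeling rules above can be sketched as a small helper; the field names (`error_reward`, `test_reward`) are taken from this card, but the record format is an assumption:

```python
# Hypothetical sketch of the filtering/labeling rules; record layout is assumed.
def label_example(record):
    """Return a similarity target for one logged completion, or None to drop it."""
    if record["error_reward"] < 0.25:   # did not compile: excluded from training
        return None
    # Passing code -> target 1.0, failing (but compiling) code -> target 0.0
    return 1.0 if record["test_reward"] > 0.25 else 0.0
```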

Architecture

Based on CodeRankEmbed with a binary similarity head.

  • Base Model: nomic-ai/CodeRankEmbed
  • Task: Binary similarity prediction
  • Loss: BCELoss (Binary Cross-Entropy)
  • Output: Similarity score [0, 1]
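
The `SimilarityModel` class referenced under Usage is not shown on this page. The following is a hypothetical sketch of one way such a bi-encoder with a binary similarity head could look, assuming mean pooling over token embeddings and a small MLP head with a sigmoid output; the actual trained class may differ:

```python
import torch
import torch.nn as nn

class SimilarityModel(nn.Module):
    """Hypothetical sketch: bi-encoder with a binary similarity head.

    `encoder` is any module mapping (input_ids, attention_mask) to token
    embeddings of shape (batch, seq_len, hidden_size), e.g. CodeRankEmbed.
    """
    def __init__(self, encoder, hidden_size=768):
        super().__init__()
        self.encoder = encoder
        # Head takes the concatenated pooled embeddings of both inputs
        self.head = nn.Sequential(
            nn.Linear(hidden_size * 2, 256),
            nn.ReLU(),
            nn.Linear(256, 1),
            nn.Sigmoid(),  # similarity score in [0, 1], trained with BCELoss
        )

    @staticmethod
    def mean_pool(token_embeddings, attention_mask):
        # Average token embeddings, ignoring padding positions
        mask = attention_mask.unsqueeze(-1).float()
        return (token_embeddings * mask).sum(1) / mask.sum(1).clamp(min=1e-9)

    def forward(self, ids1, mask1, ids2, mask2):
        emb1 = self.mean_pool(self.encoder(ids1, mask1), mask1)
        emb2 = self.mean_pool(self.encoder(ids2, mask2), mask2)
        return self.head(torch.cat([emb1, emb2], dim=-1)).squeeze(-1)
```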

Usage

import torch
from transformers import AutoTokenizer

# Load the tokenizer of the base model
tokenizer = AutoTokenizer.from_pretrained("nomic-ai/CodeRankEmbed", trust_remote_code=True)

# Load your trained model
# (You'll need to load the SimilarityModel class and state dict)

# Example inference: a problem prompt and a candidate Ballerina solution
prompt = "..."
code = "..."

# Tokenize both inputs
enc1 = tokenizer(prompt, return_tensors='pt', padding=True, truncation=True, max_length=512)
enc2 = tokenizer(code, return_tensors='pt', padding=True, truncation=True, max_length=512)

# Compute similarity
with torch.no_grad():
    similarity = model(enc1['input_ids'], enc1['attention_mask'],
                       enc2['input_ids'], enc2['attention_mask'])

print(f"Similarity: {similarity.item():.3f}")
# > 0.7: high similarity (likely to pass)
# < 0.3: low similarity (likely to fail)

Training Details

  • Dataset: Ballerina code completion logs
  • Filtering: error_reward ≥ 0.25 (compiled successfully)
  • Labeling: test_reward > 0.25 → Pass, else Fail
  • Optimizer: AdamW
  • Learning Rate: 2e-5
  • Batch Size: 8
  • Epochs: 5
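
With the hyperparameters listed above, a minimal training-loop sketch might look like this; `model` stands in for the trained similarity model and `loader` for a DataLoader yielding tokenized pairs with float targets (both assumed, not shown on this page):

```python
import torch
import torch.nn as nn

# Hypothetical training-loop sketch using the listed hyperparameters
# (AdamW, lr 2e-5, BCELoss, 5 epochs); batching is handled by `loader`.
def train(model, loader, epochs=5, lr=2e-5):
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    criterion = nn.BCELoss()
    model.train()
    for _ in range(epochs):
        for ids1, mask1, ids2, mask2, targets in loader:
            optimizer.zero_grad()
            scores = model(ids1, mask1, ids2, mask2)  # similarity in [0, 1]
            loss = criterion(scores, targets)          # targets are 0.0 / 1.0
            loss.backward()
            optimizer.step()
    return model
```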

Limitations

  • Trained specifically on Ballerina code
  • Performance may vary on unseen problem types
  • Requires compiled code (syntax errors will affect predictions)

Citation

@misc{coderankembed2024,
  title={CodeRankEmbed: Code Similarity Learning},
  author={Nomic AI},
  year={2024},
  url={https://huggingface.co/nomic-ai/CodeRankEmbed}
}