Evaluating In-Context Learning Ability
Hello,
First of all, thank you for this amazing project. I am evaluating Chameleon's in-context learning ability, but I think I am missing something about the inference process. In the zero-shot setting, the model outputs are normal. In the few-shot setting, however, the responses are awkward: the model sometimes avoids answering and occasionally outputs irrelevant characters. Below is the code I used.
def load_model(self, args) -> None:
    """
    Load the Chameleon model and processor.

    Parameters:
    - args: The arguments to load the model.

    Returns:
    None
    """
    from transformers import ChameleonForConditionalGeneration, ChameleonProcessor, BitsAndBytesConfig

    print('Loading Chameleon!!!')
    self.model = ChameleonForConditionalGeneration.from_pretrained(
        args.hf_path,
        device_map="cuda:0",
        torch_dtype=torch.bfloat16,
    ).to(args.device).eval()
    self.processor = ChameleonProcessor.from_pretrained(args.hf_path)

    self.generation_cfg = {
        'do_sample': True,
        'temperature': 0.7,
        'top_p': 0.9,
        'repetition_penalty': 1.2,
    }
    if args.is_zero_cot_active or args.is_few_cot_active:
        self.generation_cfg['max_new_tokens'] = 512
    else:
        self.generation_cfg['max_new_tokens'] = 50
    print('Chameleon loaded!!!')

def calculate_generated_text(self, prompt, vision_x):
    """
    Calculate generated text given a prompt and vision data.

    Parameters:
    - prompt (str): The input prompt.
    - vision_x (list[PIL Images]): List of PIL Images containing vision data.

    Returns:
    Tuple[str, str]: Tuple containing the raw and salt answer text.
    """
    """
    Example Prompt:
    In zero-shot: "<image> <Question> <Options> Answer: "
    In few-shot: "<image> <Question> <Options> Answer: <Answer> <image> <Question> <Options> Answer: "
    """
    if self.model is None or self.processor is None:
        raise AttributeError('Model or processor is not initialized. Call load_model first!')

    inputs = self.processor(prompt, images=vision_x, padding=True, return_tensors="pt").to(device=self.model.device, dtype=torch.bfloat16)
    out = self.model.generate(**inputs, **self.generation_cfg)
    generated_text = self.processor.decode(out[0], skip_special_tokens=True)

    salt_prompt = prompt.replace("<image>", "")
    salt_answer = generated_text[len(salt_prompt):]
    return generated_text, salt_answer
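For readers trying to reproduce the prompt layout described in the docstring above, here is a minimal sketch of how such interleaved prompts could be assembled. The helper names `build_few_shot_prompt` and `check_image_count` are hypothetical and not part of the original code or of Transformers; the sketch only assumes the `<image>` placeholder and the `Answer: ` suffix shown in the example prompts.

```python
def build_few_shot_prompt(shots, query, image_token="<image>"):
    """Assemble an interleaved prompt with one image placeholder per example.

    shots: list of (question, options, answer) tuples for the solved examples.
    query: (question, options) tuple for the item the model should answer.
    Passing an empty `shots` list yields the zero-shot layout.
    """
    parts = [
        f"{image_token} {question} {options} Answer: {answer}"
        for question, options, answer in shots
    ]
    question, options = query
    # The final item ends with "Answer: " so the model completes the answer.
    parts.append(f"{image_token} {question} {options} Answer: ")
    return " ".join(parts)


def check_image_count(prompt, images, image_token="<image>"):
    """Raise if the number of placeholders differs from the number of images."""
    n_tokens = prompt.count(image_token)
    if n_tokens != len(images):
        raise ValueError(
            f"Prompt has {n_tokens} '{image_token}' placeholders "
            f"but {len(images)} images were given."
        )
    return n_tokens


# Hypothetical one-shot example; the questions and options are made up.
prompt = build_few_shot_prompt(
    shots=[("Which animal is shown?", "(A) cat (B) dog", "A")],
    query=("Which fruit is shown?", "(A) apple (B) pear"),
)
print(prompt)
```

Checking the placeholder count before calling the processor is a cheap way to rule out a mismatch between images and prompt as a cause of odd few-shot outputs. It does not by itself explain the behaviour reported here.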
Hi @mustafaa. I'm very interested in trying this model, but there are no clear instructions yet, so I tried your code. How do you deal with prompt length? I got a ValueError when executing inputs=processor(...), and the error persists after I set generation_cfg.max_length and generation_cfg.max_new_tokens=2048. My prompt is "<image> Briefly describe the image. ". The image alone has a length of more than 1000, so I can't really reduce the input length.
ValueError: Input length of input_ids is 1029, but `max_length` is set to 20. This can lead to unexpected behavior. You should consider increasing `max_length` or, better yet, setting `max_new_tokens`.
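The message mentions `max_length`, which suggests the length check runs at generation time and that the setting did not reach `generate`. The sketch below is a simplified model of that check as I understand it, not the library's actual code. `resolve_max_length` is a hypothetical helper, and the default of 20 is taken from the error message itself.

```python
DEFAULT_MAX_LENGTH = 20  # the value quoted in the error message above


def resolve_max_length(input_len, max_length=DEFAULT_MAX_LENGTH, max_new_tokens=None):
    """Simplified model of the generation length budget.

    max_length counts prompt tokens plus generated tokens, while
    max_new_tokens counts generated tokens only. In this sketch,
    max_new_tokens takes precedence when it is given.
    """
    if max_new_tokens is not None:
        max_length = input_len + max_new_tokens
    if input_len >= max_length:
        raise ValueError(
            f"Input length of input_ids is {input_len}, "
            f"but `max_length` is set to {max_length}."
        )
    return max_length


# With a 1029-token prompt, the default budget of 20 is exceeded immediately.
try:
    resolve_max_length(1029)
except ValueError as err:
    print(err)

# A budget for new tokens is added on top of the prompt length.
print(resolve_max_length(1029, max_new_tokens=50))  # → 1079
```

If this reading is right, passing `max_new_tokens` as a keyword argument to `generate`, as the first post does with `self.model.generate(**inputs, **self.generation_cfg)`, should avoid the error regardless of how many tokens the image occupies.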