This is a text-based Voice Activity Detection (VAD) model that determines whether a given speech fragment is complete enough for a smart speaker assistant to process. It lets smart speakers move away from fixed, time-based pauses (300–1000 ms) for detecting the end of voice input and instead ask the model whether the voice input is complete.
Example:
- "Hey" -> no
- "Hey Juno" -> no
- "Hey Juno can you" -> no
- "Hey Juno can you set" -> no
- "Hey Juno can you set the" -> no
- "Hey Juno can you set the temperature" -> no
- "Hey Juno can you set the temperature to" -> no
- "Hey Juno can you set the temperature to 65" -> yes
Model prompting requirements:
- Required system prompt:
"You are a Voice Activity Detection system. Determine if the given speech fragment is complete enough for processing. Answer with only 'yes' if complete or 'no' if incomplete." - Required user prompt:
"Is this sentence fragment complete for processing: '{fragment}'"
To use with the transformers `pipeline` API:
```python
from transformers import pipeline

pipe = pipeline("text-generation", model="juno-labs/gemma-text-vad")

SYSTEM_PROMPT = (
    "You are a Voice Activity Detection system. "
    "Determine if the given speech fragment is complete enough for processing. "
    "Answer with only 'yes' if complete or 'no' if incomplete."
)
SENTENCE = "Hey Juno can you set the temperature to"

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": f"Is this sentence fragment complete for processing: '{SENTENCE}'"},
]

generated = pipe(messages)
# generated_text contains the full conversation: system (0), user (1),
# and the assistant reply (2).
classification = generated[0]["generated_text"][2]["content"]
print(f"Classification: {classification}")  # "yes" or "no"
```
To use with transformers directly:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "juno-labs/gemma-text-vad"

# Load model + tokenizer
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

SYSTEM_PROMPT = (
    "You are a Voice Activity Detection system. "
    "Determine if the given speech fragment is complete enough for processing. "
    "Answer with only 'yes' if complete or 'no' if incomplete."
)
SENTENCE = "Set the temperature to 68"

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": f"Is this sentence fragment complete for processing: '{SENTENCE}'"},
]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=1,  # the answer is a single token
        do_sample=False,   # greedy decoding
        pad_token_id=tokenizer.eos_token_id,
    )

# Decode only the newly generated token, skipping the prompt.
decoded = tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
print(f"Classification: {decoded}")  # "yes" or "no"
```