Video Inference - TypeError: process_vision_info() got an unexpected keyword argument 'return_video_kwargs'

by hmanju - opened Jan 30, 2025

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "file:///path/to/video1.mp4",  # local file path for a video
                "max_pixels": 360 * 420,
                "fps": 1.0,
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]


# In Qwen2.5-VL, frame rate information is also passed to the model to align with absolute time.
# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    fps=fps,
    padding=True,
    return_tensors="pt",
    **video_kwargs,
)
inputs = inputs.to("cuda")

# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)

TypeError                                 Traceback (most recent call last)
/tmp/ipykernel_213370/404839378.py in <module>
     34     messages, tokenize=False, add_generation_prompt=True
     35 )
---> 36 image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)
     37 inputs = processor(
     38     text=[text],

TypeError: process_vision_info() got an unexpected keyword argument 'return_video_kwargs'

usbphone

Jan 30, 2025

edited Jan 30, 2025

I get the same error with the video example, but I had success just reusing the image example (which already includes the video parameters) and changing the content type from image to video.

Something like:

messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "file:///path/to/video1.mp4"},
            {"type": "text", "text": "Describe this video."},
        ]
    }
]

text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# etc

There are also far more detailed "cookbooks" on their GitHub, including one for video.
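
If you want one snippet that works with either call style, here is a minimal sketch. It assumes the error means the installed process_vision_info simply has no return_video_kwargs parameter (the thread does not confirm the cause). The helper name call_vision_info is hypothetical and not part of qwen_vl_utils; it only inspects the function's signature and falls back to the two-value call shown above.

```python
import inspect


def call_vision_info(process_fn, messages):
    """Call process_fn with return_video_kwargs=True only if it accepts it.

    Hypothetical helper, not part of qwen_vl_utils. Always returns a
    3-tuple (image_inputs, video_inputs, video_kwargs); video_kwargs is
    an empty dict when the function only supports the two-value form.
    """
    params = inspect.signature(process_fn).parameters
    if "return_video_kwargs" in params:
        # Signature that accepts the flag: returns three values.
        return process_fn(messages, return_video_kwargs=True)
    # Signature without the flag: returns (image_inputs, video_inputs).
    image_inputs, video_inputs = process_fn(messages)
    return image_inputs, video_inputs, {}


# Intended usage (assumes qwen_vl_utils is installed):
#   from qwen_vl_utils import process_vision_info
#   image_inputs, video_inputs, video_kwargs = call_vision_info(
#       process_vision_info, messages
#   )
#   inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
#                      padding=True, return_tensors="pt", **video_kwargs)
```

With the two-value form, video_kwargs is empty, so the processor call reduces to the one in the snippet above.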

bluenevus

Jan 30, 2025
