Add pipeline tag, library name, and sample usage to model card
This PR enhances the model card by adding essential metadata and a practical sample usage section:
- **`pipeline_tag: image-text-to-text`**: This tag accurately reflects the model's functionality, which involves processing both images and text queries to generate text responses. This will improve discoverability on the Hugging Face Hub.
- **`library_name: transformers`**: Evidence from `config.json` (`transformers_version: "4.47.1"`, `model_type: "qwen2"`) and `tokenizer_config.json` (`tokenizer_class: "Qwen2Tokenizer"`) indicates compatibility with the Hugging Face `transformers` library. This addition enables the automated "how to use" widget on the model page, giving users a quick, standardized way to run inference (see the hedged loading sketch after this list).
- **Sample Usage**: The "🛠️ QuickStart" section, including installation instructions and example code snippets, has been directly extracted from the official GitHub repository. This provides users with immediate, runnable examples to get started with the model. Literal new line characters (`\n`) in code snippets have been preserved as per guidelines.
These updates aim to make the model more discoverable, easier to use, and better documented for the community.
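As a concrete illustration of what the `library_name: transformers` metadata enables, here is a minimal, hedged sketch of programmatic usage. It assumes this checkpoint keeps the InternVL3-style remote-code interface of its base model (`AutoModel` plus a `model.chat` method); the QuickStart commands in the diff below remain the authoritative usage instructions.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Assumption: the checkpoint exposes the InternVL3-style remote-code API
# (AutoModel + model.chat), like its base model OpenGVLab/InternVL3-8B.
path = "sensenova/SenseNova-SI-1.1-InternVL3-8B"

model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)

# Image-free "hello world" query, mirroring the QuickStart section below;
# pixel_values=None means a pure-text turn in the InternVL chat interface.
response = model.chat(tokenizer, None, "Hello", dict(max_new_tokens=64))
print(response)
```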
README.md:

```diff
@@ -1,7 +1,9 @@
 ---
-license: apache-2.0
 base_model:
 - OpenGVLab/InternVL3-8B
+license: apache-2.0
+pipeline_tag: image-text-to-text
+library_name: transformers
 ---
 
 **EN** | [中文](README_CN.md)
@@ -131,4 +133,131 @@ which achieve state-of-the-art performance among open-source models of comparabl
 <td>GPT-5-2025-08-07</td><td>55.0</td><td>41.8</td><td>56.3</td><td>45.5</td><td>61.8</td>
 </tr>
 </tbody>
-</table>
+</table>
+
+## 🛠️ QuickStart
+
+### Installation
+
+We recommend using [uv](https://docs.astral.sh/uv/) to manage the environment.
+
+> uv installation guide: <https://docs.astral.sh/uv/getting-started/installation/#installing-uv>
+
+```bash
+git clone git@github.com:OpenSenseNova/SenseNova-SI.git
+cd SenseNova-SI/
+uv sync --extra cu124 # or one of [cu118|cu121|cu124|cu126|cu128|cu129], depending on your CUDA version
+uv sync
+source .venv/bin/activate
+```
+
+#### Hello World
+
+A simple image-free test to verify environment setup and download the model.
+
+```bash
+python example.py \
+--question "Hello" \
+--model_path sensenova/SenseNova-SI-1.1-InternVL3-8B
+```
+
+### Examples
+
+#### Example 1
+
+This example is from the `Pos-Obj-Obj` subset of [MMSI-Bench](https://github.com/InternRobotics/MMSI-Bench):
+
+```bash
+python example.py \
+--image_paths examples/Q1_1.png examples/Q1_2.png \
+--question "<image><image>
+You are standing in front of the dice pattern and observing it. Where is the desk lamp approximately located relative to you?
+Options: A: 90 degrees counterclockwise, B: 90 degrees clockwise, C: 135 degrees counterclockwise, D: 135 degrees clockwise" \
+--model_path sensenova/SenseNova-SI-1.1-InternVL3-8B
+# --model_path OpenGVLab/InternVL3-8B
+```
+
+<!-- Example 1 -->
+<details open>
+<summary><strong>Details of Example 1</strong></summary>
+<p><strong>Q:</strong> <image><image>
+You are standing in front of the dice pattern and observing it. Where is the desk lamp approximately located relative to you?
+Options: A: 90 degrees counterclockwise, B: 90 degrees clockwise, C: 135 degrees counterclockwise, D: 135 degrees clockwise</p>
+<table>
+<tr>
+<td align="center" width="50%" style="padding:4px;">
+<img src="./examples/Q1_1.png" alt="First image" width="100%">
+</td>
+<td align="center" width="50%" style="padding:4px;">
+<img src="./examples/Q1_2.png" alt="Second image" width="100%">
+</td>
+</tr>
+</table>
+<p><strong>GT: C</strong></p>
+</details>
+
+
+#### Example 2
+
+This example is from the `Rotation` subset of [MindCube](https://mind-cube.github.io/):
+
+```bash
+python example.py \
+--image_paths examples/Q2_1.png examples/Q2_2.png \
+--question "<image><image>
+Based on these two views showing the same scene: in which direction did I move from the first view to the second view?
+A. Directly left B. Directly right C. Diagonally forward and right D. Diagonally forward and left" \
+--model_path sensenova/SenseNova-SI-1.1-InternVL3-8B
+# --model_path OpenGVLab/InternVL3-8B
+```
+
+<!-- Example 2 -->
+<details open>
+<summary><strong>Details of Example 2</strong></summary>
+<p><strong>Q:</strong> Based on these two views showing the same scene: in which direction did I move from the first view to the second view?
+A. Directly left B. Directly right C. Diagonally forward and right D. Diagonally forward and left</p>
+<table>
+<tr>
+<td align="center" width="50%" style="padding:4px;">
+<img src="./examples/Q2_1.png" alt="First image" width="100%">
+</td>
+<td align="center" width="50%" style="padding:4px;">
+<img src="./examples/Q2_2.png" alt="Second image" width="100%">
+</td>
+</tr>
+</table>
+<p><strong>GT: D</strong></p>
+</details>
+
+
+#### Test Multiple Questions in a Single Run
+
+Prepare a file similar to [examples/examples.jsonl](examples/examples.jsonl), where each line represents a single question.
+
+The model is loaded once and processes questions sequentially. The questions remain independent of each other.
+
+> For more details on the `jsonl` format, refer to the documentation for [Single-Image Data](https://internvl.readthedocs.io/en/latest/get_started/chat_data_format.html#single-image-data) and [Multi-Image Data](https://internvl.readthedocs.io/en/latest/get_started/chat_data_format.html#multi-image-data).
+
+
+```bash
+python example.py \
+--jsonl_path examples/examples.jsonl \
+--model_path sensenova/SenseNova-SI-1.1-InternVL3-8B
+# --model_path OpenGVLab/InternVL3-8B
+```
+
+### Evaluation
+
+To reproduce the benchmark results above, please refer to [EASI](https://github.com/EvolvingLMMs-Lab/EASI) to evaluate SenseNova-SI on mainstream spatial intelligence benchmarks.
+
+
+## 🖊️ Citation
+
+```bib
+@article{sensenova-si,
+  title = {Scaling Spatial Intelligence with Multimodal Foundation Models},
+  author = {Cai, Zhongang and Wang, Ruisi and Gu, Chenyang and Pu, Fanyi and Xu, Junxiang and Wang, Yubo and Yin, Wanqi and Yang, Zhitao and Wei, Chen and Sun, Qingping and Zhou, Tongxi and Li, Jiaqi and Pang, Hui En and Qian, Oscar and Wei, Yukun and Lin, Zhiqian and Shi, Xuanke and Deng, Kewang and Han, Xiaoyang and Chen, Zukai and Fan, Xiangyu and Deng, Hanming and Lu, Lewei and Pan, Liang and Li, Bo and Liu, Ziwei and Wang, Quan and Lin, Dahua and Yang, Lei},
+  journal = {arXiv preprint arXiv:2511.13719},
+  year = {2025}
+}
+```
```
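For the "Test Multiple Questions in a Single Run" subsection added above, the exact schema of `examples/examples.jsonl` is defined by the repository and not reproduced here. The sketch below is only a hypothetical illustration of how one line following the linked InternVL multi-image chat data format could be written; the field names are assumptions taken from that documentation, not from `example.py`.

```python
import json

# Hypothetical record following the InternVL multi-image chat data format
# (see https://internvl.readthedocs.io/en/latest/get_started/chat_data_format.html);
# the exact fields expected by example.py may differ.
record = {
    "id": 0,
    "image": ["examples/Q1_1.png", "examples/Q1_2.png"],
    "conversations": [
        {
            "from": "human",
            "value": "<image><image>\nWhere is the desk lamp approximately "
                     "located relative to you?",
        }
    ],
}

# Append one question per line to the jsonl file.
with open("examples/examples.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")
```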