Transformers loading broken - MoE weights MISSING after recent update

#23
by davi0600 - opened

Hi Upstage team,
I was successfully using Solar-Open-100B with transformers just 3 days ago, but after a recent repository update, the model no longer loads correctly.

Environment:

transformers==5.0.0
torch==2.10.0
accelerate==1.12.0
kernels==0.12.1
CUDA 12.8

Code:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "upstage/Solar-Open-100B"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path=MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

Error (LOAD REPORT):

SolarOpenForCausalLM LOAD REPORT from: upstage/Solar-Open-100B
Key                                                          | Status     |
-------------------------------------------------------------+------------+
model.layers.{0...47}.mlp.experts.gate_up_proj               | UNEXPECTED |
model.layers.{0...47}.mlp.experts.down_proj                  | UNEXPECTED |
model.layers.{0...47}.mlp.experts.{0...127}.up_proj.weight   | MISSING    |
model.layers.{0...47}.mlp.experts.{0...127}.gate_proj.weight | MISSING    |
model.layers.{0...47}.mlp.experts.{0...127}.down_proj.weight | MISSING    |

The model loads without errors but generates garbage output because the MoE expert weights are randomly initialized instead of being loaded from the checkpoint.

The issue: It appears the checkpoint stores expert weights in a fused format (gate_up_proj, down_proj) but modeling_solar_open.py expects individual expert weights (experts.{0...127}.up_proj.weight, etc.).
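For reference, my rough understanding of the mismatch looks like this (a minimal sketch; the tensor shapes and the gate/up split order are my assumptions, not taken from the actual checkpoint):

import torch

def split_fused_experts(gate_up_proj: torch.Tensor, down_proj: torch.Tensor) -> dict:
    # gate_up_proj: assumed [num_experts, hidden_size, 2 * intermediate_size]
    # down_proj:    assumed [num_experts, intermediate_size, hidden_size]
    # The split order (gate first, then up) is a guess; the real checkpoint may differ.
    per_expert = {}
    num_experts = gate_up_proj.shape[0]
    for e in range(num_experts):
        gate, up = gate_up_proj[e].chunk(2, dim=-1)
        per_expert[f"experts.{e}.gate_proj.weight"] = gate.T.contiguous()
        per_expert[f"experts.{e}.up_proj.weight"] = up.T.contiguous()
        per_expert[f"experts.{e}.down_proj.weight"] = down_proj[e].T.contiguous()
    return per_expert

Something along those lines would map the fused tensors back to the per-expert names that modeling_solar_open.py reports as MISSING.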

My use case: I need to fine-tune this model using transformers + accelerate, not just run inference via vLLM.
Could you please look into this? This worked perfectly 3 days ago before the recent commits.
Thank you!

upstage org

transformers v5 was released a few days ago, and the Solar Open model is now integrated into it.
To accelerate MoE inference, transformers v5 introduces some breaking changes, including new MoE layer implementations, which can cause problems like the one you are seeing.
We recommend using v5 (check out this PR) or installing transformers v4 if you want to stick to the existing implementation.
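If you are not sure which path your environment takes, a quick check like this (plain Python, nothing Solar-specific) shows the installed major version before loading:

import transformers
from packaging import version

# transformers v5 ships the integrated Solar Open implementation (fused MoE tensors),
# while v4 falls back to the repo's modeling_solar_open.py via trust_remote_code.
if version.parse(transformers.__version__).major >= 5:
    print("transformers v5+ detected: use the integrated implementation")
else:
    print("transformers v4 detected: keep using the existing remote-code implementation")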

Thanks for using our model πŸ€—
@davi0600

@SSON9 It works for me. transformers==5.0.0 is working with the model. But I ran into another problem with vLLM... The custom vLLM for this model is built against transformers==4.56.0 and vllm==0.12.1, so that transformers version cannot find the model type 'solar_open'. When I upgrade transformers to v5.0.0, it responds with:

ImportError: cannot import name 'ALLOWED_LAYER_TYPES' from 'transformers.configuration_utils' (/opt/conda/envs/SolarVLLM/lib/python3.12/site-packages/transformers/configuration_utils.py). Did you mean: 'ALLOWED_MLP_LAYER_TYPES'?

This is the vLLM error output. I used conda to create a virtual environment with python==3.12 and just followed the instructions for installing the custom vLLM for Solar-Open-100B.
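For anyone else hitting this, here is the small check I used to confirm the version mismatch (the 4.56.0 pin is just what the custom build's install instructions state, so treat it as an assumption):

from importlib.metadata import version
from packaging.version import Version

tf_version = Version(version("transformers"))
print(f"vllm={version('vllm')}, transformers={tf_version}")

# transformers v5 no longer exposes ALLOWED_LAYER_TYPES in configuration_utils
# (hence the ImportError above), so the custom vLLM build needs transformers 4.56.0.
if tf_version >= Version("5.0.0"):
    print("Mismatch: downgrade transformers for the custom Solar vLLM build.")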

WARNING 01-30 06:18:06 [argparse_utils.py:354] Found duplicate keys --logits-processors
(APIServer pid=81861) INFO 01-30 06:18:06 [api_server.py:1772] vLLM API server version 0.12.1.dev1+solaropen1.0.1.g423e702f7
(APIServer pid=81861) INFO 01-30 06:18:06 [utils.py:253] non-default args: {'model_tag': 'upstage/Solar-Open-100B', 'enable_auto_tool_choice': True, 'tool_call_parser': 'solar_open', 'model': 'upstage/Solar-Open-100B', 'trust_remote_code': True, 'logits_processors': ['vllm.model_executor.models.solar_open_logits_processor:SolarOpenTemplateLogitsProcessor'], 'reasoning_parser': 'solar_open', 'tensor_parallel_size': 8}
(APIServer pid=81861) The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
(APIServer pid=81861) Traceback (most recent call last):
(APIServer pid=81861)   File "/opt/conda/envs/SolarVLLM/bin/vllm", line 7, in <module>
(APIServer pid=81861)     sys.exit(main())
(APIServer pid=81861)              ^^^^^^
(APIServer pid=81861)   File "/opt/conda/envs/SolarVLLM/lib/python3.12/site-packages/vllm/entrypoints/cli/main.py", line 73, in main
(APIServer pid=81861)     args.dispatch_function(args)
(APIServer pid=81861)   File "/opt/conda/envs/SolarVLLM/lib/python3.12/site-packages/vllm/entrypoints/cli/serve.py", line 60, in cmd
(APIServer pid=81861)     uvloop.run(run_server(args))
(APIServer pid=81861)   File "/opt/conda/envs/SolarVLLM/lib/python3.12/site-packages/uvloop/__init__.py", line 96, in run
(APIServer pid=81861)     return __asyncio.run(
(APIServer pid=81861)            ^^^^^^^^^^^^^^
(APIServer pid=81861)   File "/opt/conda/envs/SolarVLLM/lib/python3.12/asyncio/runners.py", line 194, in run
(APIServer pid=81861)     return runner.run(main)
(APIServer pid=81861)            ^^^^^^^^^^^^^^^^
(APIServer pid=81861)   File "/opt/conda/envs/SolarVLLM/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=81861)     return self._loop.run_until_complete(task)
(APIServer pid=81861)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=81861)   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=81861)   File "/opt/conda/envs/SolarVLLM/lib/python3.12/site-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=81861)     return await main
(APIServer pid=81861)            ^^^^^^^^^^
(APIServer pid=81861)   File "/opt/conda/envs/SolarVLLM/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 1819, in run_server
(APIServer pid=81861)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=81861)   File "/opt/conda/envs/SolarVLLM/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 1838, in run_server_worker
(APIServer pid=81861)     async with build_async_engine_client(
(APIServer pid=81861)   File "/opt/conda/envs/SolarVLLM/lib/python3.12/contextlib.py", line 204, in __aenter__
(APIServer pid=81861)     return await anext(self.gen)
(APIServer pid=81861)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=81861)   File "/opt/conda/envs/SolarVLLM/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 183, in build_async_engine_client
(APIServer pid=81861)     async with build_async_engine_client_from_engine_args(
(APIServer pid=81861)   File "/opt/conda/envs/SolarVLLM/lib/python3.12/contextlib.py", line 204, in __aenter__
(APIServer pid=81861)     return await anext(self.gen)
(APIServer pid=81861)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=81861)   File "/opt/conda/envs/SolarVLLM/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 209, in build_async_engine_client_from_engine_args
(APIServer pid=81861)     vllm_config = engine_args.create_engine_config(usage_context=usage_context)
(APIServer pid=81861)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=81861)   File "/opt/conda/envs/SolarVLLM/lib/python3.12/site-packages/vllm/engine/arg_utils.py", line 1365, in create_engine_config
(APIServer pid=81861)     model_config = self.create_model_config()
(APIServer pid=81861)                    ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=81861)   File "/opt/conda/envs/SolarVLLM/lib/python3.12/site-packages/vllm/engine/arg_utils.py", line 1221, in create_model_config
(APIServer pid=81861)     return ModelConfig(
(APIServer pid=81861)            ^^^^^^^^^^^^
(APIServer pid=81861)   File "/opt/conda/envs/SolarVLLM/lib/python3.12/site-packages/pydantic/_internal/_dataclasses.py", line 121, in __init__
(APIServer pid=81861)     s.__pydantic_validator__.validate_python(ArgsKwargs(args, kwargs), self_instance=s)
(APIServer pid=81861) pydantic_core._pydantic_core.ValidationError: 1 validation error for ModelConfig
(APIServer pid=81861)   Value error, The checkpoint you are trying to load has model type `solar_open` but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date.
(APIServer pid=81861) 
(APIServer pid=81861) You can update Transformers with the command `pip install --upgrade transformers`. If this does not work, and the checkpoint is very new, then there may not be a release version that supports this model yet. In this case, you can get the most up-to-date code by installing Transformers from source with the command `pip install git+https://github.com/huggingface/transformers.git` [type=value_error, input_value=ArgsKwargs((), {'model': ...rocessor_plugin': None}), input_type=ArgsKwargs]
(APIServer pid=81861)     For further information visit https://errors.pydantic.dev/2.12/v/value_error
