
GptOssDense


GptOssDense is a dense variant of the GptOss model architecture. Where GptOss uses a Mixture-of-Experts (MoE) layer with a learned router in each transformer block, GptOssDense replaces it with a single standard dense feed-forward network (FFN).

✅ Verified to work with trust_remote_code=True on stable transformers (v4.40+)

Model Architecture

  • Attention: Same as GptOss with sliding window attention and sink tokens
  • MLP: Dense FFN with GLU activation (instead of MoE with router)
  • Activation: Same GLU activation as GptOss experts: (up + 1) * gate * sigmoid(gate * alpha) where alpha=1.702
  • Normalization: RMSNorm
  • RoPE: YaRN (Yet another RoPE extensioN)
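
As a concrete illustration, the GLU activation above can be sketched in PyTorch. The clamping to a limit is an assumption based on the "SwiGLU limit: 7.0" entry in the configuration below:

```python
import torch

def gpt_oss_glu(gate: torch.Tensor, up: torch.Tensor,
                alpha: float = 1.702, limit: float = 7.0) -> torch.Tensor:
    """GLU activation from the architecture notes: (up + 1) * gate * sigmoid(gate * alpha).

    Clamping both branches to `limit` is an assumption based on the
    "SwiGLU limit: 7.0" configuration entry.
    """
    gate = gate.clamp(max=limit)           # cap the gate branch from above
    up = up.clamp(min=-limit, max=limit)   # cap the linear branch on both sides
    return (up + 1) * gate * torch.sigmoid(gate * alpha)
```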

Key Differences from GptOss

| Feature    | GptOss                         | GptOssDense              |
|------------|--------------------------------|--------------------------|
| MLP type   | Mixture-of-Experts             | Dense FFN                |
| Router     | Yes                            | No                       |
| Experts    | Multiple (128)                 | Single                   |
| Parameters | More (due to multiple experts) | Fewer                    |
| Inference  | Routes tokens to top-k experts | Single FFN for all tokens |
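
To make the contrast concrete, here is a minimal sketch of the dense MLP block, using the activation described above; the layer names and the clamping limit are assumptions, not the exact implementation:

```python
import torch
import torch.nn as nn

class DenseMLP(nn.Module):
    """Sketch of the dense FFN that replaces the MoE block: one FFN for all tokens."""

    def __init__(self, hidden_size: int = 2880, intermediate_size: int = 2880,
                 alpha: float = 1.702, limit: float = 7.0):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size)
        self.up_proj = nn.Linear(hidden_size, intermediate_size)
        self.down_proj = nn.Linear(intermediate_size, hidden_size)
        self.alpha, self.limit = alpha, limit

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate = self.gate_proj(x).clamp(max=self.limit)
        up = self.up_proj(x).clamp(min=-self.limit, max=self.limit)
        # Same GLU activation as the GptOss experts, but with no router:
        return self.down_proj((up + 1) * gate * torch.sigmoid(gate * self.alpha))
```

Every token passes through this single FFN, so there is no routing step and no load-balancing objective during training.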

Usage

Quick Start - Random Initialization

Try the model with randomly initialized weights (outputs will be random):

from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer
import torch

# Load config and tokenizer
config = AutoConfig.from_pretrained("marksverdhei/gpt-oss-dense", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("marksverdhei/gpt-oss-dense")

# Initialize model with random weights
model = AutoModelForCausalLM.from_config(config, trust_remote_code=True)
model.eval()

# Generate text (will be random since model is not trained)
prompt = "Hello, how are you?"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model.generate(
        inputs.input_ids,
        max_new_tokens=20,
        do_sample=True,
        temperature=1.0,
        top_k=50,
        pad_token_id=tokenizer.pad_token_id
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# Example output: "Hello, how are you? pronunci bhithCiudadstdafxipseігlanders導 conveyoruviainn"
# (random tokens since model is not trained)

Loading Pre-trained Weights (when available)

Once model weights are uploaded to the repository:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model with weights
model = AutoModelForCausalLM.from_pretrained(
    "marksverdhei/gpt-oss-dense",
    trust_remote_code=True
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("marksverdhei/gpt-oss-dense")

# Generate text
inputs = tokenizer("Hello, how are you?", return_tensors="pt")
outputs = model.generate(**inputs, max_length=50)
print(tokenizer.decode(outputs[0]))

With transformers fork

Using the marksverdhei/transformers fork where GptOssDense is registered:

# Install the fork first:
#   pip install git+https://github.com/marksverdhei/transformers.git
from transformers import GptOssDenseForCausalLM, GptOssDenseConfig

config = GptOssDenseConfig()
model = GptOssDenseForCausalLM(config)

Model Configuration

Matches openai/gpt-oss-20b configuration (dense variant):

  • Hidden size: 2880
  • Intermediate size: 2880
  • Number of layers: 24
  • Number of attention heads: 64
  • Number of key-value heads: 8
  • Head dimension: 64
  • Vocabulary size: 201,088
  • Max position embeddings: 131,072
  • Initial context length: 4,096
  • Sliding window: 128
  • RoPE type: YaRN with factor 32.0
  • SwiGLU limit: 7.0
  • Total parameters: ~2.4B
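
The ~2.4B figure can be sanity-checked from the numbers above. The sketch below counts only the major weight matrices (biases and norm weights omitted) and assumes untied input and output embeddings:

```python
hidden, intermediate, layers = 2880, 2880, 24
heads, kv_heads, head_dim = 64, 8, 64
vocab = 201_088

# Attention projections per layer (grouped-query attention with 8 KV heads)
attn = hidden * heads * head_dim          # q_proj
attn += 2 * hidden * kv_heads * head_dim  # k_proj and v_proj
attn += heads * head_dim * hidden         # o_proj

# Dense GLU MLP per layer: gate, up, and down projections
mlp = 2 * hidden * intermediate + intermediate * hidden

embeddings = vocab * hidden  # input embedding; counted again for the LM head
total = layers * (attn + mlp) + 2 * embeddings
print(f"~{total / 1e9:.2f}B parameters")  # ~2.39B
```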

License

Apache 2.0

Citation

If you use this model, please cite the original GptOss work and acknowledge this dense variant.
