Can you add inference code and instructions to use the model?

#1
by LH-Tech-AI - opened

Hey there!
I would really like to check out your models, but all your model repos show "README.md exists but content is empty. Use the Edit model card button to edit it."

Could you add inference code and instructions and more information about your models like this one?
Thanks 😀

And: do you have the training code for your Linny and other Chat Models like in this model here: https://huggingface.co/spaces/Lucien-shark/Linny-Working-Real ?
Would be sooo nice.
I'm also developing small language models: https://huggingface.co/LH-Tech-AI

Hey there, just checked out some of your work! Really love it! As a solo ML model trainer myself, it is interesting to see the models other solo trainers make. I run inference on a MacBook Air, so depending on what OS you run, this code might not work exactly out of the box, but this is the inference code I use to run my standard Gen1.5 model.

All of my models are based on the LSTM (long short-term memory), an architecture that ML developers abandoned in favor of transformers. All of them support a maximum generation length of up to 3000 tokens (from what I have tested) and have reasoning capabilities using start and end tokens. My Gen1.5 models were sadly not trained to support multi-turn chat or system message role following. However, the Gen2.0 model I am actively training is being trained to support those, as well as tool calling.

The base Gen1.5 model is around 75M params (with the sizes seen in the config section of the following code), Gen1.5 Pro is around 154M params, Gen1.5 Mini is around 20M params, and Gen1.5 Nano is around 8M params. The Gen2.0 model I am working on is going to be roughly 610M params and is being trained on roughly 8.3B tokens of system message, prompt, and response examples. The Gen1.5 models were all trained on the same dataset of 2.3B tokens of prompt-response pairs.

My models are general-purpose assistants that can help with question answering, code generation, CoT reasoning, and basic general chat. While they were trained on math-centered data as well, they do poorly on math because they do not have the size needed to properly digest how to do math. Both the model and the tokenizer were trained from scratch by me.

While I do have the training code, I am not willing to share it yet, as it struggles to keep larger LSTM models stable and I am still refining it. When I feel it is ready, I will be more than happy to share it.
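
The single-turn prompt format and the reasoning start/end tokens mentioned above can be sketched roughly as follows. This is only an illustration: the tag strings are copied from the CONFIG section and generation code of the script later in this post, while `build_prompt` and `split_reasoning` are hypothetical helper names that do not exist in that script.

```python
import re

# Tag strings copied from the script in this thread.
USER_TAG = "### Instruction:"
BOT_TAG = "### Response:"


def build_prompt(user_message: str) -> str:
    """Format one single-turn prompt the same way the script builds `formatted`."""
    return f"{USER_TAG}\n{user_message}\n\n{BOT_TAG}\n"


def split_reasoning(reply: str) -> tuple[str, str]:
    """Split a reply into (reasoning, answer) around <think>...</think>.

    Hypothetical helper for illustration; not part of the original script.
    """
    match = re.search(r"<think>(.*?)</think>", reply, flags=re.DOTALL)
    if match is None:
        # No reasoning block: treat the whole reply as the answer.
        return "", reply.strip()
    return match.group(1).strip(), reply[match.end():].strip()
```

Because the Gen1.5 models are single-turn, each request would carry exactly one instruction block and one response block, with no chat history in between.
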
I don't have a README.md file up yet for my models because I never thought anybody would be interested in actually using them, but I may now consider it. Thank you for taking the time out of your day to check out my models, and sorry for the late response; as I said, I didn't expect anybody to really care. Anyway, I hope this is what you were looking for, and again, thank you so much for checking out my work. Here is the code:

"""
Linny Local Server – simplified (only keep generating on token limit, no reason longer)
Run: python linny_server.py
Open: http://localhost:8080
"""

import json
from pathlib import Path
from http.server import HTTPServer, BaseHTTPRequestHandler
from collections import deque
import torch
import torch.nn as nn

# ═══════════════════════════════════════════════
# ⚙️ CONFIG
# ═══════════════════════════════════════════════

DEVICE = "cpu"

MODEL_PATH = ""
TOKENIZER_PATH = "" #Gen1.5 models use Linny1 tokenizer

HIDDEN_LAYERS = 6
NEURONS = 1024
EMBED_SIZE = 384
DROPOUT = 0.2

USER_TAG = "### Instruction:"
BOT_TAG = "### Response:"
EOS_TOKEN = "<|end|>"

DEFAULT_TEMP = 0.8
DEFAULT_PENALTY = 1.10
DEFAULT_PENALTY_WINDOW = 110
DEFAULT_TOP_P = 0.4
DEFAULT_TOP_K = 65
DEFAULT_MAX_LEN = 3000

REASONING_MODE = "response_prefix"
MAX_REASONING_TOKENS = 2500
REASONING_START = False

# For "Keep generating" when token limit is hit
KEEP_GENERATING_EXTRA_TOKENS = 3000

HOST = "localhost"
PORT = 8080

# ═══════════════════════════════════════════════
# 🧠 Model
# ═══════════════════════════════════════════════

class LSTMTokenLM(nn.Module):
    """Token-level LSTM language model: embedding -> stacked LSTM -> linear head."""

    def __init__(self, vocab_size, embed_size, hidden_size, num_layers, dropout=0.2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, num_layers=num_layers,
                            batch_first=True, dropout=dropout)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, x, hidden=None):
        out, hidden = self.lstm(self.embed(x), hidden)
        return self.fc(out), hidden

# ═══════════════════════════════════════════════
# 🔤 GPT-2 byte decoder
# ═══════════════════════════════════════════════

def _build_byte_decoder():
    # Inverse of GPT-2's byte-to-unicode table: maps each stand-in character
    # used in token strings back to the raw byte it represents.
    bs = (list(range(ord('!'), ord('~')+1)) +
          list(range(ord('¡'), ord('¬')+1)) +
          list(range(ord('®'), ord('ÿ')+1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256+n)
            n += 1
    return {chr(c): b for b, c in zip(bs, cs)}

_BYTE_DECODER = _build_byte_decoder()

def _tok_to_bytes(tok_str):
    try:
        return bytes([_BYTE_DECODER[c] for c in tok_str])
    except KeyError:
        # Token contains a character outside the byte alphabet (e.g. a special token)
        return tok_str.encode('utf-8', errors='replace')

# ═══════════════════════════════════════════════
# 📥 Load
# ═══════════════════════════════════════════════

print("Loading tokenizer...")
from tokenizers import Tokenizer as HFTokenizer
tokenizer = HFTokenizer.from_file(TOKENIZER_PATH)
vocab_size = tokenizer.get_vocab_size()
print(f"✅ Tokenizer: {vocab_size:,} tokens")

print("Loading model...")
ckpt = torch.load(MODEL_PATH, map_location="cpu", weights_only=False)
arch = ckpt.get('config', {})
layers = arch.get('hidden_layers', HIDDEN_LAYERS)
neurons = arch.get('neurons', NEURONS)
embed = arch.get('embed_size', EMBED_SIZE)
dropout = arch.get('dropout', DROPOUT)
epoch = ckpt.get('epoch', '?')

device = torch.device(DEVICE)
model = LSTMTokenLM(vocab_size, embed, neurons, layers, dropout).to(device)
model.load_state_dict(ckpt['model_state'])
model.eval()
params = sum(p.numel() for p in model.parameters())
print(f"✅ Epoch {epoch} | {layers}L×{neurons}N | {params/1e6:.1f}M params | {device}")

# ═══════════════════════════════════════════════
# 🌊 Generation core (with forced response after </think>)
# ═══════════════════════════════════════════════

def generate_stream(prompt, temperature, max_len, top_p, top_k,
                    force_thinking, send_chunk,
                    repeat_penalty=DEFAULT_PENALTY,
                    penalty_window=DEFAULT_PENALTY_WINDOW,
                    max_reasoning_tokens=None,
                    min_response_tokens=3,
                    prefix_text="",
                    penalize_prefix=False):
    """
    Generation with repetition penalty and forced minimum response after </think>.
    If prefix_text is given, it is fed to the model (without streaming) before generating new tokens.
    """
    actual_prompt = prompt
    if force_thinking and REASONING_MODE == "prompt_suffix":
        if not prompt.strip().endswith("/think"):
            actual_prompt = prompt.strip() + " /think"

    formatted = f"{USER_TAG}\n{actual_prompt}\n\n{BOT_TAG}\n"

    hidden = None
    generated = ""
    recent_tokens = deque(maxlen=penalty_window)

    with torch.no_grad():
        # Encode the conversation prefix (user + bot start)
        ids = tokenizer.encode(formatted).ids
        t = torch.tensor([ids], dtype=torch.long, device=device)
        _, hidden = model(t, hidden)

        # If we have existing assistant response (prefix_text), feed it
        prefix_ids = []
        if prefix_text:
            prefix_ids = tokenizer.encode(prefix_text).ids
            if prefix_ids:
                pt = torch.tensor([prefix_ids], dtype=torch.long, device=device)
                _, hidden = model(pt, hidden)
                generated = prefix_text
                input_token = torch.tensor([[prefix_ids[-1]]], dtype=torch.long, device=device)
                if penalize_prefix:
                    recent_tokens.extend(prefix_ids)
            else:
                input_token = torch.tensor([[ids[-1]]], dtype=torch.long, device=device)
        else:
            input_token = torch.tensor([[ids[-1]]], dtype=torch.long, device=device)

        # Prefill <think> if in response_prefix mode (only if no prefix)
        if not prefix_text and force_thinking and REASONING_MODE == "response_prefix":
            think_id = tokenizer.token_to_id("<think>")
            if think_id is not None:
                tt = torch.tensor([[think_id]], dtype=torch.long, device=device)
                _, hidden = model(tt, hidden)
                input_token = tt
                generated = "<think>"
                send_chunk("<think>")
            if REASONING_START and not prefix_text:
                prefix = f"I need to think about this. The user said '{prompt}'"
                pids = tokenizer.encode(prefix).ids
                pt = torch.tensor([pids], dtype=torch.long, device=device)
                _, hidden = model(pt, hidden)
                input_token = torch.tensor([[pids[-1]]], dtype=torch.long, device=device)
                generated += prefix
                send_chunk(prefix)

        eos_id = tokenizer.token_to_id(EOS_TOKEN)
        think_open_id = tokenizer.token_to_id("<think>")
        think_close_id = tokenizer.token_to_id("</think>")
        byte_buf = b""

        # State tracking for forced response
        in_reasoning = (force_thinking and REASONING_MODE == "response_prefix") or (prefix_text and "<think>" in prefix_text and "</think>" not in prefix_text)
        think_closed = "</think>" in prefix_text if prefix_text else False
        awaiting_response = think_closed
        response_token_count = 0
        reasoning_toks = 0

        for step in range(max_len):
            logits, hidden = model(input_token, hidden)
            lf = logits[0, -1].float() / max(temperature, 1e-8)

            # Repetition penalty
            if repeat_penalty != 1.0 and len(recent_tokens) > 0:
                penalized_ids = set(recent_tokens)
                for token_id in penalized_ids:
                    if token_id < lf.size(0):
                        lf[token_id] /= repeat_penalty

            # Top-K
            if top_k > 0:
                tv, _ = torch.topk(lf, min(top_k, lf.size(-1)))
                lf[lf < tv[-1]] = float("-inf")
            # Top-P
            if top_p < 1.0:
                sl, si = torch.sort(lf, descending=True)
                cp = torch.cumsum(torch.softmax(sl, dim=-1), dim=-1)
                rm = cp > top_p
                rm[..., 1:] = rm[..., :-1].clone()
                rm[..., 0] = False
                lf[si[rm]] = float("-inf")

            nxt = torch.multinomial(torch.softmax(lf, dim=-1), 1).item()

            # EOS handling with forced response
            if nxt == eos_id:
                if awaiting_response and response_token_count < min_response_tokens:
                    continue  # not enough response yet, skip EOS
                else:
                    break

            recent_tokens.append(nxt)

            # Update state
            if nxt == think_open_id:
                in_reasoning = True
                reasoning_toks = 0
            if nxt == think_close_id:
                in_reasoning = False
                think_closed = True
                awaiting_response = True
                response_token_count = 0
            if in_reasoning and not think_closed:
                reasoning_toks += 1
                if max_reasoning_tokens and max_reasoning_tokens > 0 and reasoning_toks >= max_reasoning_tokens:
                    # Force close think
                    if byte_buf:
                        send_chunk(byte_buf.decode('utf-8', errors='replace'))
                        byte_buf = b""
                    send_chunk("</think>")
                    in_reasoning = False
                    think_closed = True
                    awaiting_response = True
                    response_token_count = 0
                    ct = torch.tensor([[think_close_id]], dtype=torch.long, device=device)
                    _, hidden = model(ct, hidden)
                    input_token = ct
                    recent_tokens.append(think_close_id)
                    continue

            if awaiting_response and nxt != think_close_id:
                response_token_count += 1
            elif not in_reasoning and not think_closed:
                response_token_count += 1

            # Send token
            tok_str = tokenizer.id_to_token(nxt) or ""
            byte_buf += _tok_to_bytes(tok_str)
            try:
                decoded = byte_buf.decode('utf-8')
                generated += decoded
                send_chunk(decoded)
                byte_buf = b""
            except UnicodeDecodeError:
                pass
            input_token = torch.tensor([[nxt]], dtype=torch.long, device=device)

        if byte_buf:
            leftover = byte_buf.decode('utf-8', errors='replace')
            generated += leftover
            send_chunk(leftover)

        # Return whether generation stopped due to max_len (needed for UI)
        return step >= max_len - 1  # True if hit token limit

# ═══════════════════════════════════════════════
# 🌐 HTML (with simplified buttons: regenerate + keep generating only when limit hit)
# ═══════════════════════════════════════════════

HTML_TEMPLATE = r"""

Linny
loading…
🌊
Start a conversation with Linny
hi
what are you?
explain recursion /think
Temp TMPL_TEMP_D
Max
Top-P TMPL_TOPP_D
Top-K
"""

def build_html():
    # Replace the longer *_D placeholders first: 'TMPL_TEMP' is a prefix of
    # 'TMPL_TEMP_D' (same for 'TMPL_TOPP'), so the short names would clobber them.
    return (HTML_TEMPLATE
        .replace('TMPL_TEMP_D', f'{DEFAULT_TEMP:.2f}')
        .replace('TMPL_TEMP', str(DEFAULT_TEMP))
        .replace('TMPL_MAXLEN', str(DEFAULT_MAX_LEN))
        .replace('TMPL_TOPP_D', f'{DEFAULT_TOP_P:.2f}')
        .replace('TMPL_TOPP', str(DEFAULT_TOP_P))
        .replace('TMPL_TOPK', str(DEFAULT_TOP_K))
        .replace('KEEP_GENERATING_EXTRA_TOKENS', str(KEEP_GENERATING_EXTRA_TOKENS))
    )

# ═══════════════════════════════════════════════
# 🌐 HTTP Handler
# ═══════════════════════════════════════════════

class Handler(BaseHTTPRequestHandler):
    def log_message(self, fmt, *args): pass

    def do_GET(self):
        if self.path == '/':
            body = build_html().encode('utf-8')
            self.send_response(200)
            self.send_header('Content-Type', 'text/html; charset=utf-8')
            self.end_headers()
            self.wfile.write(body)

        elif self.path == '/info':
            body = json.dumps({'epoch':epoch,'layers':layers,'neurons':neurons}).encode()
            self.send_response(200)
            self.send_header('Content-Type', 'application/json')
            self.end_headers()
            self.wfile.write(body)

        elif self.path.startswith('/generate'):
            from urllib.parse import urlparse, parse_qs
            q = parse_qs(urlparse(self.path).query)
            def g(k, d): return q.get(k,[str(d)])[0]

            prompt         = g('prompt', '')
            temperature    = float(g('temperature', DEFAULT_TEMP))
            max_len        = int(g('max_len',       DEFAULT_MAX_LEN))
            top_p          = float(g('top_p',       DEFAULT_TOP_P))
            top_k          = int(g('top_k',         DEFAULT_TOP_K))
            force_thinking = g('force_thinking','0') == '1'
            prefix_text    = g('prefix_text', '')
            is_continue    = g('continue', '0') == '1'

            self.send_response(200)
            self.send_header('Content-Type',  'text/event-stream')
            self.send_header('Cache-Control', 'no-cache')
            self.send_header('X-Accel-Buffering', 'no')
            self.end_headers()

            def send_chunk(text):
                # Each chunk is sent as one server-sent event with a JSON string payload
                try:
                    msg = 'data:' + json.dumps(text) + '\n\n'
                    self.wfile.write(msg.encode('utf-8'))
                    self.wfile.flush()
                except Exception:
                    pass  # ignore write errors (e.g. client disconnected)

            try:
                generate_stream(prompt, temperature, max_len, top_p, top_k,
                                force_thinking, send_chunk,
                                repeat_penalty=DEFAULT_PENALTY,
                                penalty_window=DEFAULT_PENALTY_WINDOW,
                                max_reasoning_tokens=MAX_REASONING_TOKENS if not is_continue else None,
                                min_response_tokens=3,
                                prefix_text=prefix_text,
                                penalize_prefix=is_continue)  # penalize existing tokens to avoid loops
            except Exception as e:
                send_chunk(f'\n⚠️ {e}')

            try:
                self.wfile.write(b'data:[DONE]\n\n')
                self.wfile.flush()
            except Exception:
                pass  # ignore write errors (e.g. client disconnected)

        else:
            self.send_response(404)
            self.end_headers()

# ═══════════════════════════════════════════════
# 🚀 Launch
# ═══════════════════════════════════════════════

if __name__ == '__main__':
    server = HTTPServer((HOST, PORT), Handler)
    print(f"\n{'='*48}")
    print(f" 🌊 Linny — http://{HOST}:{PORT}")
    print(f" Model : {Path(MODEL_PATH).name} (epoch {epoch})")
    print(f" Params: {params/1e6:.1f}M | {device}")
    print(f" Max reasoning tokens: {MAX_REASONING_TOKENS or 'unlimited'}")
    print(f" Repetition penalty: {DEFAULT_PENALTY} over last {DEFAULT_PENALTY_WINDOW} tokens")
    print(f"{'='*48}\n Ctrl+C to stop\n")
    try:
        server.serve_forever()
    except KeyboardInterrupt:
        print("\n👋 Stopped.")
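
If you would rather call the server from a script than from the browser page, the `/generate` endpoint above can be read as a plain server-sent event stream. The following is a minimal client sketch, assuming the server is running with the default `HOST` and `PORT` from the CONFIG section; `build_generate_url`, `iter_sse_chunks`, and `ask` are hypothetical names that are not part of the script.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://localhost:8080"  # HOST and PORT defaults from the CONFIG section


def build_generate_url(prompt, force_thinking=False, **params):
    """Build a /generate URL; params may include temperature, max_len, top_p, top_k."""
    query = {"prompt": prompt, "force_thinking": "1" if force_thinking else "0"}
    query.update({key: str(value) for key, value in params.items()})
    return f"{BASE_URL}/generate?{urlencode(query)}"


def iter_sse_chunks(lines):
    """Yield decoded text chunks from an iterable of raw SSE lines (bytes)."""
    for raw in lines:
        line = raw.decode("utf-8").rstrip("\r\n")
        if not line.startswith("data:"):
            continue  # blank separator line between events
        payload = line[len("data:"):]
        if payload == "[DONE]":
            return  # the handler sends this marker when generation is finished
        yield json.loads(payload)  # each chunk is a JSON-encoded string


def ask(prompt, **params):
    """Stream a reply from a running server and return it as one string."""
    with urlopen(build_generate_url(prompt, **params)) as resp:
        return "".join(iter_sse_chunks(resp))
```

With the server running, `print(ask("what are you?", max_len=200))` should print the whole reply once the stream ends. Iterating over `iter_sse_chunks(resp)` directly gives the chunks as they arrive.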

By the way, my models are separated into two categories. My token-based models, as the name suggests, are based on the Linny 1 BPE tokenizer, which has a 20K vocab size (or merge size, whatever you want to call it), and my base "Pro" models are character-based models with an average vocab size of around 1900. The inference code I gave you above only works with my token-based models. For the character-based models, generation on the HF space is fast enough that you don't need to run inference locally, whereas for the token-based models you could argue that you do. I will post the character-based model code later, but for the time being I hope the token-based inference code will be enough to satisfy your curiosity about my models.
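
To illustrate why the token-based script needs its byte decoder and `byte_buf`: with a GPT-2 style byte-level BPE vocabulary, token strings use stand-in characters for raw bytes, and a multi-byte UTF-8 character can be split across two tokens. Below is a small standalone sketch of that decoding step, mirroring the script's logic. The example token strings are made up for illustration and may not match how the Linny 1 tokenizer actually splits text.

```python
def build_byte_decoder():
    """Inverse of GPT-2's byte-to-unicode table: stand-in character -> raw byte."""
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return {chr(c): b for b, c in zip(bs, cs)}


BYTE_DECODER = build_byte_decoder()


def stream_decode(token_strings):
    """Return the text chunks that would be emitted, buffering incomplete UTF-8."""
    chunks, buf = [], b""
    for tok in token_strings:
        buf += bytes(BYTE_DECODER[c] for c in tok)
        try:
            chunks.append(buf.decode("utf-8"))
            buf = b""
        except UnicodeDecodeError:
            pass  # a multi-byte character is split across tokens; wait for the rest
    if buf:
        chunks.append(buf.decode("utf-8", errors="replace"))
    return chunks


# "Ġ" stands for a space; "Ã" + "©" are the two UTF-8 bytes of "é".
print(stream_decode(["Hello", "Ġworld"]))  # ['Hello', ' world']
print(stream_decode(["ĠcafÃ", "©"]))       # [' café']
```

A character-based model would not need this step, since each of its tokens already corresponds to one character.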

Also, the config in the token-based code only works for the base Gen1.5 model:

Lucien-shark/Linny-TokenBased-Gen1.5

But I will later post the config variables for my other models.
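
As a quick sanity check on that config, the parameter count of an embedding, stacked LSTM, and linear head can be worked out by hand. The sketch below assumes the layer layout of the `LSTMTokenLM` class in the script and a vocabulary of exactly 20,000 tokens, which is an assumption based on the "20K" figure; the real tokenizer size may differ slightly. `lstm_lm_param_count` is a hypothetical helper name.

```python
def lstm_lm_param_count(vocab_size, embed_size, hidden_size, num_layers):
    """Count parameters of Embedding -> stacked LSTM -> Linear (PyTorch layout)."""
    embedding = vocab_size * embed_size
    lstm = 0
    for layer in range(num_layers):
        in_size = embed_size if layer == 0 else hidden_size
        # 4 gates, each with input weights, recurrent weights and two bias vectors
        lstm += 4 * (in_size * hidden_size + hidden_size * hidden_size + 2 * hidden_size)
    head = hidden_size * vocab_size + vocab_size  # weight matrix + bias
    return embedding + lstm + head


# vocab_size=20_000 is an assumption; the other values come from the CONFIG section.
total = lstm_lm_param_count(vocab_size=20_000, embed_size=384,
                            hidden_size=1024, num_layers=6)
print(f"{total / 1e6:.1f}M parameters")  # 75.9M parameters
```

This lands at roughly 75.9M, consistent with the "around 75M" quoted for the base Gen1.5 model. The script itself prints the exact figure for a loaded checkpoint at startup.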

Great, Thanks 😀

No problem! If you would like to, I would love it if you could give me some feedback on my Gen1.5 models. I am mostly done training them, but if there is anything you would like to ask or anything you wanna tell me about that might be good to change, or any other feedback, feel free to share it!

Want to cooperate on discord?
Share your discord username and ima invite you 😊

Sorry, I don't have Discord. I am open to feedback on my model, though it would have to stay within the Hugging Face community discussions, because I rarely use external messaging apps, if at all. Thanks for the offer though!

Okay. 😊

LH-Tech-AI changed discussion status to closed
