Managing memory when trying to process multiple files

Hey all,

I’m trying to explore making a code inspector (Mythos at home) with Hugging Face models. I’m currently working with Gemma 3, and while I can load the smaller versions just fine, when I try to add a bunch of source code to a prompt I get errors saying I don’t have enough memory. One was trying to allocate ~1.7 TB :joy:
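For context, here’s roughly how I’m loading the model and processor; the exact checkpoint and dtype are just what I happen to be using, nothing special:

import torch
from transformers import AutoProcessor, Gemma3ForConditionalGeneration

model_id = "google/gemma-3-4b-it"  # placeholder: one of the smaller variants

model = Gemma3ForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)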

I’ve made a function:

def query_llm(system_message, user_message, assistant_message):
    # model and processor are loaded globally (see above)
    messages = [
        {"role": "system", "content": system_message},
        {"role": "user", "content": user_message},
        {"role": "assistant", "content": assistant_message},
    ]

    text = processor.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=False,  # template kwarg; ignored by chat templates that don't reference it
    )

    inputs = processor(text=text, return_tensors="pt").to(model.device)
    input_len = inputs["input_ids"].shape[-1]
    # Generate, then decode only the newly generated tokens
    outputs = model.generate(**inputs, max_new_tokens=1024)
    response = processor.decode(outputs[0][input_len:], skip_special_tokens=True)
    return response

And I’m passing the source code in as the assistant message. Is this just the wrong approach? Is there any wisdom/guidance on how to go about doing local code analysis?
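In case it helps frame the question, the direction I’ve been considering is batching files under a token budget before building the prompt. A rough sketch (the budget number and helper name are just placeholders):

MAX_PROMPT_TOKENS = 8192  # arbitrary budget; tune to available memory

def batch_files_by_tokens(file_paths, budget=MAX_PROMPT_TOKENS):
    # Group source files into batches whose combined token count stays under budget
    batches, current, used = [], [], 0
    for path in file_paths:
        with open(path, encoding="utf-8") as f:
            source = f.read()
        # processor.tokenizer is the underlying text tokenizer
        n_tokens = len(processor.tokenizer(source)["input_ids"])
        if current and used + n_tokens > budget:
            batches.append(current)
            current, used = [], 0
        current.append((path, source))
        used += n_tokens
    if current:
        batches.append(current)
    return batches

Each batch would then go through query_llm as its own call rather than stuffing everything into one prompt.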


Sorry, but I was kinda hoping to get experiential feedback. Have you attempted any code scanning with an LLM?