Models & Amalgamations¶
Which model goes where, why, and when do we stitch them together?
Every layer of UmbrealityAI runs on local uncensored models via Ollama on the tower (RTX 3070 8GB). This page maps specific models to layers, explains the reasoning, and documents the art of model amalgamation — blending multiple models to create hybrid intelligences.
The Constraint¶
| Resource | Spec |
|---|---|
| GPU | RTX 3070, 8GB VRAM |
| CPU | Ryzen 5 5600X, 32GB RAM |
| Backend | Ollama on CachyOS |
| Max quant | Q4_K_M fits at 8B, Q3_K_S at 14B |
| Context limit | 32K–100K depending on model |
This means no 70B models, no massive MoE in full precision. The art is in fitting the right model to the right role.
Layer-to-Model Map¶
LAYER MODEL WHY
───── ───── ───
God(s) huihui_ai/qwen3.5-abliterated:9b biggest brain that fits 8GB
Illuminati huihui_ai/qwen3.5-abliterated:9b same — shared "high reason" pool
Messiah huihui_ai/qwen3.5-abliterated:9b same model, different system prompt
Hedge Funds dolphin3:8b / qwen3.5:7b faster, specialized analysis
Companies dolphin3:8b / qwen2.5:7b throughput-optimized for multiple agents
Workers qwen2.5-coder:7b / qwen3.5:7b code/tool-call focused
Temp Agents qwen2.5:3b / qwen2.5:1.5b ultra-fast, disposable, high volume
Primary Model: huihui_ai/qwen3.5-abliterated:9b¶
This is the brain of the system — God(s), Illuminati, and Messiah all use it.
Why this model: | Factor | Value | |--------|-------| | Parameters | 9B (Qwen 3.5 base) | | Context | 131,072 tokens | | Tool use | Native function-calling via Qwen's format | | Vision | Supports vision inputs | | Speed on 3070 | ~40–60 t/s at Q4_K_M | | Uncensored | Abliterated — guardrails surgically removed | | Thinking | Supports chain-of-thought reasoning (extended thinking mode) |
Dual-persona trick: God(s), Illuminati, and Messiah are all the same model with different system prompts and memory scopes. This is intentional: - God(s) sees the full persistent memory store - Illuminati sees filtered strategic memory (no raw worker logs) - Messiah sees only the current cycle and Hedge Fund reports
Same neural substrate, different information diets. Like one brain with multiple personalities that don't know each other exist.
Secondary Model: dolphin3:8b¶
Used for Hedge Funds, Companies, and analysis-heavy Workers.
Why: | Factor | Value | |--------|-------| | Base | DeepSeek-derived architecture | | Uncensored | Dolphin fine-tune — guardrails removed | | Tool use | Good function-calling support | | Speed | ~50–70 t/s | | Trading | Different reasoning style from Qwen — ensemble diversity |
Dolphin's different training lineage means it thinks differently than Qwen. This is valuable for: - Cross-verification: two models analyzing the same problem - Ensemble voting in Hedge Fund strategic decisions - Avoiding single-model blind spots
Worker Models: qwen2.5-coder:7b and qwen2.5:7b¶
Code-generation and tool-calling workers use the coder variant. General-purpose workers use the base.
| Variant | Best For | Speed |
|---|---|---|
| qwen2.5-coder:7b | Code generation, tool orchestration, pipeline scripts | ~50 t/s |
| qwen2.5:7b | Data analysis, report writing, web research | ~55 t/s |
Temp Agents: qwen2.5:3b and qwen2.5:1.5b¶
Disposable agents that run one task then die. These are: - Cheap: run on CPU if GPU saturated — takes 1–3 seconds per response - Disposable: no persistent state, no memory, no identity - Stateless: each call is self-contained, results are logged externally
Model Amalgamations (Frankensteining)¶
When one model isn't enough, stitch two together.
Pattern 1: Thinker + Writer¶
A heavy thinker (qwen3.5-abliterated:9b) reasons about the problem, produces structured analysis. Then a fast writer (dolphin3:8b or qwen2.5-coder:7b) transforms that analysis into the final output.
Used when: output formatting, code generation, or report writing would waste the thinker's context window.
Pattern 2: Ensemble Voting¶
Three different models each produce an independent answer, then a judge model (God(s)) selects or synthesizes.
Used when: security decisions, resource allocation, any high-stakes binary judgment.
Pattern 3: Critic Loop¶
Worker produces output → Critic model analyzes for flaws → Worker revises. Repeat until Critic passes.
Used when: code quality, security policy generation, constitutional interpretation.
Pattern 4: Fine-Tuned Specialists¶
For recurring tasks, a base model is fine-tuned on the task and swapped in as a drop-in replacement.
| Specialist | Base | Task | Status |
|---|---|---|---|
| Security Auditor | qwen2.5-coder:7b | Vulnerability scanning | Planned |
| Constitution Keeper | qwen3.5:7b | Policy violation detection | Planned |
| Reality Tuner | qwen3.5-abliterated | Cycle/phase transitions | Research |
Local vs Custom vs Fine-Tuned vs Ensemble¶
| Approach | When to Use | Examples |
|---|---|---|
| Off-shelf local | General reasoning, tool use, anything a standard LLM can do | God(s), Illuminati, basic Workers |
| Custom system prompt | Most agents — different role/persona without retraining | Messiah, Hedge Funds, Companies |
| Fine-tuned (LoRA/QLoRA) | Recurring structured task with a fixed format | Security Auditor, Constitution Keeper |
| Model merge (frankenstein) | When two models have complementary strengths | Thinker+Writer chains |
| Ensemble (multi-model) | High-stakes decisions needing diversity | Security checks, fund allocations |
| Disposable tiny model | Trivial, stateless, high-volume tasks | Temp Agents, health checks |
Running Multiple Models on 8GB VRAM¶
RTX 3070 8GB cannot load two 8B Q4 models simultaneously. Strategy:
- Shared pool: One model loaded at a time per priority queue
- CPU offload: Small models (3B/1.5B) run on CPU or partial GPU
- Context eviction: Save/restore KV cache for round-robin
- Ollama keep_alive: Tuned per model — thinkers stay hot, temps evict immediately
Priority:
God(s) → keep_alive: 5m (hot, infrequent but critical)
Illuminati → keep_alive: 5m (hot)
Hedge Fund → keep_alive: 2m (warm)
Worker → keep_alive: 30s (cold, loaded on demand)
Temp Agent → keep_alive: 0 (load, run, evict)
Related¶
- tool-registry — Tools at each layer
- nested-agents-and-subversive-patterns — How agents contain agents
- Architecture/Layer-3-Temple-Banks — Strategic analysis with multi-model ensembles
- Architecture/Layer-6-Hand-Workers — Tool-using agents at scale