<header>
<span class="tag">technical · architecture</span>
<h1>How soma works: spectral traces, fixed memory, and byte-level learning.</h1>
<p class="subtitle">A technical walkthrough of soma's architecture for readers who want to understand what's actually happening under the hood — before or after reading the paper.</p>
<div class="meta">
<span>James Blight</span>
<span>logossoma.com</span>
<span>AAAI 2026</span>
</div>
</header>
<h2 id="design-goals">Design goals and constraints</h2>
<p>Soma was designed around a specific set of requirements that existing architectures don't simultaneously satisfy:</p>
<ul style="list-style:none;padding:0;margin:0 0 22px;">
<li style="padding:6px 0;padding-left:20px;position:relative;color:var(--text-dim);">
<span style="position:absolute;left:0;color:var(--accent);">1.</span> Memory must be fixed-size regardless of how long the model has been trained
</li>
<li style="padding:6px 0;padding-left:20px;position:relative;color:var(--text-dim);">
<span style="position:absolute;left:0;color:var(--accent);">2.</span> Inference cost must be constant per input — not per sequence length
</li>
<li style="padding:6px 0;padding-left:20px;position:relative;color:var(--text-dim);">
<span style="position:absolute;left:0;color:var(--accent);">3.</span> Learning must be continuous — the model updates from every byte it sees
</li>
<li style="padding:6px 0;padding-left:20px;position:relative;color:var(--text-dim);">
<span style="position:absolute;left:0;color:var(--accent);">4.</span> The architecture must handle any byte stream without tokenisation or preprocessing
</li>
<li style="padding:6px 0;padding-left:20px;position:relative;color:var(--text-dim);">
<span style="position:absolute;left:0;color:var(--accent);">5.</span> Long-range structure must be preserved without catastrophic forgetting
</li>
</ul>
<p>These constraints rule out transformers (quadratic cost, static weights), standard RNNs (forgetting over long range), and progressive networks (unbounded memory). The architecture has to be designed from scratch to satisfy all five simultaneously.</p>
<div class="callout">
<p><strong>The core insight:</strong> time is the only thing every data stream has in common. If you can represent information across timescales efficiently, you can learn from anything — without knowing in advance what it is.</p>
</div>
<h2 id="byte-level">Why byte-level</h2>
<p>Most language models operate on tokens — subword units produced by a tokeniser trained on a specific corpus. Tokenisation has real advantages: it compresses the sequence, it aligns with linguistic structure, and it reduces the vocabulary size the model needs to predict over.</p>
<p>It also has costs. A tokeniser is a preprocessing step that has to be designed for a specific data type. Text tokenisers don't handle audio. Audio encoders don't handle sensor telemetry. Every new modality requires a new tokenisation scheme, which requires a new model family.</p>
<p>Bytes are universal. Every file — text, code, audio, image, binary sensor log — is a sequence of bytes. A model that learns from raw bytes doesn't need to know what it's looking at. It learns the statistical structure of whatever stream you give it.</p>
<p>The vocabulary is 256 — one entry per possible byte value. The model predicts the next byte given everything it's seen. This is a harder problem than predicting tokens (predictions are made at finer granularity), but it's a uniform problem across all data types.</p>
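<p>To make this concrete, here is the byte view of two very different inputs. Nothing below is soma-specific; it only shows that text and binary data arrive in the same 256-value alphabet.</p>

```python
# Any file or stream is already a sequence of integers in [0, 255] --
# the model's entire input and output vocabulary.
text = "hello".encode("utf-8")            # text -> bytes
blob = bytes([0x89, 0x50, 0x4E, 0x47])    # first four bytes of a PNG header

print(list(text))   # [104, 101, 108, 108, 111]
print(list(blob))   # [137, 80, 78, 71]

for stream in (text, blob):
    assert all(0 <= b <= 255 for b in stream)
```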
<div class="callout blue">
<p><strong>Practical implication:</strong> you don't need a different version of soma for text vs. sensor data vs. audio. The same model, the same script, the same checkpoint format. Modality is a property of your training data, not your architecture.</p>
</div>
<h2 id="traces">The spectral trace bank</h2>
<p>The central data structure in soma is the trace bank — a matrix of shape <code>(256, K)</code> stored in float64, where K is the number of traces (set by <code>n_bands</code>).</p>
<p>Each column in the trace bank is a trace — a running exponential average of the byte values seen so far, decaying at a specific rate. When the model sees a new byte, every trace updates:</p>
<div class="formula">
<div class="formula-label">trace update rule</div>
<code>trace_k ← α_k · trace_k + (1 - α_k) · one_hot(byte)</code>
<p>where α_k is the decay rate for trace k, and one_hot(byte) is a 256-dimensional indicator vector.</p>
</div>
<p>Each trace is a weighted average of the byte history, with recent bytes weighted more heavily. A trace with α close to 1 decays slowly — it remembers far back. A trace with α close to 0 decays quickly — it remembers only the most recent bytes.</p>
<p>The full trace bank is the feature representation passed to the rest of the network. It's a 256×K matrix encoding what the model has seen across K different timescales simultaneously.</p>
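<p>The update rule can be sketched in a few lines. This is an illustrative NumPy version of the rule as written, not soma's implementation, and the decay rates here are arbitrary example values.</p>

```python
import numpy as np

def update_traces(traces, alphas, byte):
    """One step of: trace_k <- alpha_k * trace_k + (1 - alpha_k) * one_hot(byte).

    traces: (256, K) float64 bank; alphas: (K,) decay rates.
    """
    one_hot = np.zeros(256)
    one_hot[byte] = 1.0
    # Broadcasting applies each trace's own decay rate down its column.
    return traces * alphas + np.outer(one_hot, 1.0 - alphas)

K = 4
traces = np.zeros((256, K), dtype=np.float64)
alphas = np.array([0.1, 0.5, 0.9, 0.99])    # fast -> slow example rates

for b in b"abc":
    traces = update_traces(traces, alphas, b)

# From a zero start, after n updates each column sums to 1 - alpha**n.
print(traces[:, 1].sum())   # 0.875 == 1 - 0.5**3
```

The same O(K) amount of work happens once per incoming byte, which is where the constant per-input cost comes from.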
<p>The decay rates α_k are spaced so that the corresponding timescales grow geometrically — each timescale is the previous one multiplied by a fixed base (soma uses the golden ratio, φ ≈ 1.618).</p>
<p>Geometric spacing is not arbitrary. It's motivated by the structure of information in natural sequences: relevant patterns occur across many timescales simultaneously, and the density of information across timescales is approximately log-uniform. Geometric spacing allocates equal representational capacity per octave — the same number of traces covers each order of magnitude of timescale.</p>
<div class="formula">
<div class="formula-label">geometric decay spacing</div>
<code>α_k = 1 - base^(-k) for k = 1, 2, ..., K</code>
<p>With base φ ≈ 1.618 (golden ratio). This produces decay rates that are evenly distributed in log-space, covering timescales from single bytes to hundreds of thousands of bytes within a fixed number of traces.</p>
</div>
<p>The golden ratio base is a specific choice motivated by its optimal packing properties — φ minimises redundancy between adjacent timescales in a geometric sequence. Whether φ is the globally optimal base for this purpose is an open question; it is a principled choice, not an arbitrary one.</p>
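<p>The spacing formula is easy to tabulate. The snippet below evaluates α_k = 1 - base^(-k) with base φ and converts each rate into the standard effective window of an exponential average, 1/(1 - α_k) = φ^k. The K = 46 matches the example configuration used later in this post; the "effective window" interpretation is a standard EMA heuristic, not something the paper defines.</p>

```python
import math

PHI = (1 + math.sqrt(5)) / 2   # golden ratio base, per the formula above
K = 46                         # n_bands from the example config in this post

alphas = [1 - PHI ** (-k) for k in range(1, K + 1)]
windows = [1 / (1 - a) for a in alphas]   # effective memory length = PHI**k bytes

print(f"fastest trace: alpha={alphas[0]:.3f}, window ~{windows[0]:.2f} bytes")
print(f"slowest trace: alpha={alphas[-1]:.12f}, window ~{windows[-1]:.3g} bytes")
```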
<h2 id="forward-pass">The forward pass</h2>
<p>Given the trace bank state after seeing all previous bytes, the forward pass to predict the next byte works as follows:</p>
<div class="component">
<div class="component-header">
<div class="component-name">U · flatten(traces)</div>
<div class="component-type">256×K → hidden_dim</div>
</div>
<div class="component-body">
<p>The trace bank (256 × K matrix) is flattened to a vector of length 256·K and projected to the hidden dimension via a learned weight matrix U. This is the feature extraction step — U learns which combinations of timescale-resolved byte statistics are predictively useful.</p>
</div>
</div>
<div class="component">
<div class="component-header">
<div class="component-name">nonlinearity → W → logits</div>
<div class="component-type">hidden_dim → 256</div>
</div>
<div class="component-body">
<p>The hidden representation passes through a nonlinearity and then a second learned matrix W to produce 256 logits — one per possible next byte. Standard cross-entropy loss against the actual next byte drives learning.</p>
</div>
</div>
<div class="component">
<div class="component-header">
<div class="component-name">Wd (optional direct readout)</div>
<div class="component-type">256·K → 256, added to logits</div>
</div>
<div class="component-body">
<p>When <code>direct_readout=True</code>, a third matrix Wd maps directly from the flattened traces to logits, bypassing the hidden layer. This skip connection improves performance on certain corpora and adds a residual pathway for short-range patterns that don't benefit from the full hidden representation.</p>
</div>
</div>
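<p>Putting the three components together, the whole forward pass is a handful of matrix operations. The sketch below uses NumPy and tiny dimensions to show the shapes; the choice of tanh is an assumption (the post doesn't name the nonlinearity), and this is illustrative, not soma's actual code.</p>

```python
import numpy as np

# Tiny dimensions for illustration; the text's example config uses
# n_bands=46 and hidden_dim=8192. Names U, W, Wd follow the prose.
rng = np.random.default_rng(0)
K, hidden_dim = 4, 32
U  = rng.normal(scale=0.01, size=(256 * K, hidden_dim))
W  = rng.normal(scale=0.01, size=(hidden_dim, 256))
Wd = rng.normal(scale=0.01, size=(256 * K, 256))   # optional direct readout

traces = np.zeros((256, K))          # current trace bank state

x = traces.flatten()                 # (256*K,)
h = np.tanh(x @ U)                   # hidden representation (tanh assumed)
logits = h @ W + x @ Wd              # 256 logits, one per candidate next byte
probs = np.exp(logits - logits.max())
probs /= probs.sum()                 # softmax over the byte vocabulary
```

Every array here has a fixed size, which is the whole point: the cost of this pass does not depend on how many bytes produced the trace state.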
<p>The total parameter count is dominated by U and W. With <code>hidden_dim=8192</code> and <code>n_bands=46</code>, U has shape (256×46, 8192) and W has shape (8192, 256) — roughly 100M parameters in those two matrices. This is the learnable content of a checkpoint.</p>
<div class="callout">
<p><strong>Why inference is constant cost:</strong> the forward pass processes one byte at a time. The trace update is O(K). The matrix multiplications are fixed-size regardless of how many bytes have been seen previously. There is no attention over prior context — the traces are the context, and they're always the same size.</p>
</div>
<h2 id="checkpoint">What a checkpoint contains</h2>
<p>A soma checkpoint is a single <code>.pt</code> file. It bundles:</p>
<ul style="list-style:none;padding:0;margin:0 0 22px;">
<li style="padding:6px 0;padding-left:20px;position:relative;color:var(--text-dim);">
<span style="position:absolute;left:0;color:var(--accent);">→</span> <strong style="color:#c8c8c8;">Weight matrices</strong> U, W, and optionally Wd
</li>
<li style="padding:6px 0;padding-left:20px;position:relative;color:var(--text-dim);">
<span style="position:absolute;left:0;color:var(--accent);">→</span> <strong style="color:#c8c8c8;">Trace bank</strong> — the current (256, K) state in float64
</li>
<li style="padding:6px 0;padding-left:20px;position:relative;color:var(--text-dim);">
<span style="position:absolute;left:0;color:var(--accent);">→</span> <strong style="color:#c8c8c8;">Hyperparameters</strong> — n_bands, hidden_dim, base, direct_readout, and training settings
</li>
<li style="padding:6px 0;padding-left:20px;position:relative;color:var(--text-dim);">
<span style="position:absolute;left:0;color:var(--accent);">→</span> <strong style="color:#c8c8c8;">bytes_seen</strong> — total training bytes consumed
</li>
<li style="padding:6px 0;padding-left:20px;position:relative;color:var(--text-dim);">
<span style="position:absolute;left:0;color:var(--accent);">→</span> <strong style="color:#c8c8c8;">checkpoint_id</strong> — SHA256 over weights + traces + fixed config. Deterministic and tamper-evident.
</li>
<li style="padding:6px 0;padding-left:20px;position:relative;color:var(--text-dim);">
<span style="position:absolute;left:0;color:var(--accent);">→</span> <strong style="color:#c8c8c8;">checkpoint_history</strong> — declared ancestor IDs, oldest first
</li>
</ul>
<p>Two checkpoints are compatible — can be continued from one or merged — if and only if their n_bands, hidden_dim, base, and direct_readout match exactly. These define the architecture; everything else is state.</p>
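<p>The compatibility rule is mechanical enough to express directly. This sketch assumes the four architecture-defining fields are stored under the hyperparameter names used above; the key names in a real checkpoint may differ.</p>

```python
# Compatibility rule as stated: two checkpoints can be continued from
# or merged iff these four architecture-defining fields match exactly.
ARCH_FIELDS = ("n_bands", "hidden_dim", "base", "direct_readout")

def compatible(cfg_a: dict, cfg_b: dict) -> bool:
    return all(cfg_a[f] == cfg_b[f] for f in ARCH_FIELDS)

a = {"n_bands": 46, "hidden_dim": 8192, "base": 1.618,
     "direct_readout": True, "bytes_seen": 10**9}
b = {"n_bands": 46, "hidden_dim": 8192, "base": 1.618,
     "direct_readout": True, "bytes_seen": 5}
c = dict(a, n_bands=32)

print(compatible(a, b))  # True  -- bytes_seen is state, not architecture
print(compatible(a, c))  # False -- n_bands differs
```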
<div class="code-label">inspect a checkpoint</div>
<pre><code>import torch

# Field names follow the checkpoint contents listed above; the path is
# illustrative and exact key names may differ between soma versions.
ckpt = torch.load("soma_checkpoint.pt", map_location="cpu")

print(ckpt["checkpoint_id"])
print(ckpt["bytes_seen"])
print(ckpt["checkpoint_history"])</code></pre>
<h2 id="limitations">Honest limitations</h2>
<p>Soma is real and peer-reviewed. It also has limitations worth being clear about.</p>
<p><strong style="color:#c8c8c8;">Context is implicit, not explicit.</strong> Transformers can be prompted with specific context because they attend to it directly. Soma's context is the trace state — accumulated history. You can't inject a specific document into a soma model's "attention" the way you can with a transformer. This makes soma better for stream-learning and worse for retrieval-style tasks.</p>
<p><strong style="color:#c8c8c8;">Scale is uncharted.</strong> Transformer scaling laws are well-studied. Soma's are not. The architecture runs usefully at modest scale — the Mac mini claim is real. Whether it exhibits the same capability-with-scale behaviour as transformers is an open research question. We think it does. We don't have the empirical evidence at frontier scale to prove it.</p>
<p><strong style="color:#c8c8c8;">The community is new.</strong> There are two checkpoints on logOS right now. The ecosystem of tooling, evaluations, and shared benchmarks that the transformer world has built over years doesn't exist yet for soma. You're early. That means opportunity and roughness in equal measure.</p>
<p>None of these limit the core claims: fixed memory, constant cost, continual learning, byte-level universality. Those hold. The paper was peer-reviewed. The code runs. The architecture does what it says.</p>
<div class="cta-block">
<h3>Read the full paper or download the code</h3>
<p>Time Is All You Need — AAAI 2026. The full architecture specification is also at logossoma.com/specs.</p>
<div class="cta-row">
<a href="https://logossoma.com/paper" class="cta-btn">read the paper →</a>
<a href="https://logossoma.com/specs" class="cta-btn-ghost">technical spec</a>
</div>
</div>
<title>What Is Catastrophic Forgetting in AI? | logOS</title>
<meta name="description" content="Catastrophic forgetting is why most AI models can't learn new things without losing what they knew. Here's what it is, why it happens, and what solving it actually requires.">
<meta name="keywords" content="catastrophic forgetting, catastrophic forgetting AI, continual learning, neural network forgetting, AI learning problem, stability plasticity dilemma, what is catastrophic forgetting">
<link rel="canonical" href="https://logossoma.com/blog/what-is-catastrophic-forgetting">
<meta property="og:title" content="What Is Catastrophic Forgetting in AI?">
<meta property="og:description" content="Why most AI models can't learn new things without losing what they knew — and what solving it actually requires.">
<meta property="og:type" content="article">
<meta property="og:url" content="https://logossoma.com/blog/what-is-catastrophic-forgetting">
<header>
<span class="tag">explainer · continual learning</span>
<h1>What is catastrophic forgetting — and why has it taken thirty years to solve?</h1>
<p class="subtitle">Every neural network that learns something new risks destroying what it already knew. This isn't a bug. It's a consequence of how the weights work. Here's what that means and why it matters.</p>
<div class="meta">
<span>James Blight</span>
<span>logossoma.com</span>
<span>AAAI 2026</span>
</div>
</header>
<article>
<h2>The basic problem</h2>
<p>Imagine you train a neural network to recognise cats. It learns well — high accuracy, generalises cleanly. Now you train it on dogs. When you test it on cats again, it's forgotten most of what it knew. The dog training overwrote the cat training. That's catastrophic forgetting.</p>
<p>The word "catastrophic" is accurate. It's not gradual degradation. Performance on the old task can collapse nearly completely after even a small amount of new training. The network doesn't blend the old and new knowledge — it replaces it.</p>
<div class="callout bad">
<p><strong>Why it happens:</strong> neural network weights are shared across tasks. When you update them to learn dogs, you're modifying the same parameters that encoded cats. There's no mechanism that says "protect this part of the network — it's doing something important."</p>
</div>
<p>This was first described in the late 1980s by McCloskey and Cohen, who called it "catastrophic interference." The term has stuck in various forms for thirty years because the problem has stuck. It's not solved — it's managed, imperfectly, through a collection of workarounds.</p>
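<p>The effect is easy to reproduce with nothing more than a shared linear model and plain gradient descent: train it on task A, then on task B, and watch task A's error come back. This is a toy, but the mechanism is exactly the one described above — one set of weights serving two tasks.</p>

```python
import numpy as np

# Two regression tasks share one weight vector, the way weights are
# shared across tasks in a neural network. Train on A, then on B,
# with nothing protecting A's weights. (Toy illustration, not a benchmark.)
rng = np.random.default_rng(0)

X_a = rng.normal(size=(50, 5)); y_a = X_a @ np.array([1.0, 2.0, 0.0, 0.0, 0.0])
X_b = rng.normal(size=(50, 5)); y_b = X_b @ np.array([0.0, 0.0, 0.0, 3.0, -1.0])

def train(w, X, y, lr=0.1, steps=200):
    for _ in range(steps):
        w = w - lr * X.T @ (X @ w - y) / len(y)
    return w

def mse(w, X, y):
    return float(np.mean((X @ w - y) ** 2))

w = train(np.zeros(5), X_a, y_a)
before = mse(w, X_a, y_a)        # task A learned: error near zero
w = train(w, X_b, y_b)
after = mse(w, X_a, y_a)         # task A collapsed after training on B

print(f"task A error after learning A: {before:.6f}")
print(f"task A error after learning B: {after:.2f}")
```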
<h2>Why it's harder than it looks</h2>
<p>The obvious fix is: don't let new training change the weights that matter for old tasks. The problem is that you don't know which weights matter, and even if you did, the weights that encode knowledge about cats are often the same weights that would encode knowledge about dogs most efficiently. The representations overlap.</p>
<div class="analogy">
<div class="analogy-label">analogy</div>
<p>Think of it like writing on a whiteboard in a shared office. You come in early and fill the board with something important. The next person comes in, needs the space, and erases part of it to write their own thing. They weren't being malicious — they just needed the board. There was no way to know which parts you needed to keep.</p>
<p>Neural network weights are that whiteboard. Training is always writing on shared space.</p>
</div>
<p>This is sometimes called the stability-plasticity dilemma. A network that's very stable (protects old weights strongly) can't learn new things efficiently — plasticity is low. A network that's very plastic (learns new things easily) forgets old things — stability is low. Getting both at once is the hard problem.</p>
<h2>Thirty years of workarounds</h2>
<p>The research literature on catastrophic forgetting is vast. Here are the main families of approaches and what they actually achieve:</p>
<div class="approaches">
<div class="approach">
<div>
<div class="approach-name">Elastic Weight Consolidation (EWC)</div>
<div class="approach-verdict">→ reduces forgetting, doesn't eliminate it</div>
</div>
<div class="approach-desc">Identify which weights were most important for old tasks (using Fisher information) and penalise changing them during new training. Works partially — the penalty slows forgetting but the underlying shared-weight problem remains. Doesn't scale well to many sequential tasks.</div>
</div>
<div class="approach">
<div>
<div class="approach-name">Replay buffers</div>
<div class="approach-verdict">→ effective but expensive</div>
</div>
<div class="approach-desc">Store a sample of old training data and mix it into new training batches. This works — the network stays exposed to old tasks. The cost is storing data, which grows with the number of tasks. Also raises privacy questions for personal or sensitive data.</div>
</div>
<div class="approach">
<div>
<div class="approach-name">Progressive networks</div>
<div class="approach-verdict">→ no forgetting, but memory grows without bound</div>
</div>
<div class="approach-desc">Freeze old network columns when learning new tasks and add new columns. Old knowledge is perfectly preserved — but the network grows linearly with tasks. Not viable for continuous deployment.</div>
</div>
<div class="approach">
<div>
<div class="approach-name">Fine-tuning on frozen base</div>
<div class="approach-verdict">→ avoids the problem by ignoring it</div>
</div>
<div class="approach-desc">Freeze the base model weights, only train a small adapter or head. Fast and cheap, but the base model never learns anything new. What it knew at training time is the ceiling of what it can ever know.</div>
</div>
<div class="approach current">
<div>
<div class="approach-name">Architecture-level solution</div>
<div class="approach-verdict">→ what soma does</div>
</div>
<div class="approach-desc">Design an architecture where learning is continuous by construction — fixed-size memory, no shared-weight collision, constant cost. The problem doesn't need workarounds because the architecture doesn't have the problem in the same form.</div>
</div>
</div>
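<p>As a concrete example of the first family above, the EWC penalty itself is one line of arithmetic: weigh each weight's movement by its estimated importance. The Fisher values here are made up for illustration; computing them properly requires the old task's data.</p>

```python
import numpy as np

# Sketch of the EWC idea: penalise moving weights that carried high
# Fisher information for the old task. Numbers are illustrative only.
def ewc_loss(task_loss, w, w_old, fisher, lam=1.0):
    """task_loss + (lam/2) * sum_i F_i * (w_i - w_old_i)^2"""
    return task_loss + 0.5 * lam * np.sum(fisher * (w - w_old) ** 2)

w_old  = np.array([1.0, -2.0, 0.5])
fisher = np.array([10.0, 0.01, 0.01])   # first weight mattered for the old task
w      = np.array([1.5, 0.0, 0.0])      # new training has moved all three

# Moving the important weight by 0.5 dominates the penalty; the other
# two moved further but barely count.
print(ewc_loss(0.0, w, w_old, fisher))
```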
<h2>Why the workarounds don't fully work</h2>
<p>Every workaround above is fighting the same thing: the architecture assumes learning happens once, offline, on a fixed dataset. Catastrophic forgetting is what you get when you try to violate that assumption without changing the architecture.</p>
<p>EWC and replay and progressive networks are impressive engineering. They make transformers and other static architectures more survivable in continual learning settings. But they're all paying a cost — in memory, in compute, in complexity — to compensate for a design that wasn't built for this.</p>
<div class="pull">
<p>The stability-plasticity dilemma isn't a fundamental law of neural networks. It's a consequence of a specific design choice: shared weights updated by gradient descent on a fixed objective. Change the design, and the dilemma changes shape.</p>
</div>
<h2>What an architectural solution looks like</h2>
<p>Soma approaches this differently. Instead of shared weights that get overwritten, soma uses a bank of geometrically spaced spectral traces — a fixed-size memory that represents information across multiple timescales simultaneously.</p>
<p>The key property: traces at different timescales decay at different rates. Fast traces capture recent patterns. Slow traces preserve long-range structure. Learning new things updates all timescales — but slow traces change slowly, so long-range structure is preserved by the architecture, not by an external penalty or replay mechanism.</p>
<p>This is why fixed memory is a feature rather than a constraint. The memory isn't a buffer that fills up — it's a spectral basis that continuously represents the stream it's seen. New data updates the representation without erasing the old one, because the old representation lives in the slow traces and the new one lives in the fast ones.</p>
<div class="callout">
<p><strong>The result:</strong> catastrophic forgetting in the classic sense doesn't occur, because there's no single shared weight being overwritten. Long-range structure is architecturally protected. Short-range adaptation is architecturally enabled. The stability-plasticity tradeoff has a different geometry.</p>
</div>
<h2>Why this took thirty years</h2>
<p>The honest answer is that catastrophic forgetting wasn't the bottleneck for most of the last thirty years. The bigger problem was building models capable enough to be worth preserving. Once you have a model that knows something valuable, forgetting becomes the thing you care about.</p>
<p>Now that frontier models are genuinely capable, the deployment problem is real. You have something worth preserving, and you need it to keep learning without destroying itself. The research attention has followed.</p>
<p>The other reason is that the transformer paradigm arrived and was so successful that most research energy went into scaling it rather than questioning its assumptions. Catastrophic forgetting is an assumption-level problem. It required stepping back from the paradigm to ask whether the shared-weight design was load-bearing.</p>
<p>It is — for attention. It isn't — for learning.</p>
<div class="cta-block">
<h3>An architecture without the problem</h3>
<p>Soma was built from scratch around continual learning as a first principle. One script, fixed memory, constant cost. Read the paper or download it and try it yourself.</p>
<a href="https://logossoma.com/paper" class="cta-btn">read the paper →</a>
</div>
<title>Why Large-Scale AI Converges Toward Continual Learning | logOS</title>
<meta name="description" content="The transformer paradigm has real structural limits. Here's an honest argument for why every serious AI application eventually needs what transformers can't provide — and what that means.">
<meta name="keywords" content="transformer limitations, continual learning AI, beyond transformers, AI architecture future, quadratic attention problem, static model problem, AI convergence">
<link rel="canonical" href="https://logossoma.com/blog/why-ai-converges-continual-learning">
<meta property="og:title" content="Why Large-Scale AI Converges Toward Continual Learning">
<meta property="og:description" content="The transformer paradigm has real structural limits. An honest argument for where serious AI applications eventually have to go.">
<meta property="og:type" content="article">
<meta property="og:url" content="https://logossoma.com/blog/why-ai-converges-continual-learning">
<header>
<span class="tag">perspective · architecture · long read</span>
<h1>Why I think every serious AI application eventually needs what transformers can't give it.</h1>
<p class="subtitle">This is an argument, not a product pitch. Transformers are remarkable. They also have structural limits that aren't engineering problems — they're architecture problems. Here's why those limits matter and where the logic points.</p>
<div class="meta">
<span>James Blight</span>
<span>logossoma.com</span>
<span>AAAI 2026</span>
</div>
</header>
<article>
<h2>A few things I need to say upfront</h2>
<p>I built soma. I have an obvious stake in arguing that transformers have limits. You should weight that appropriately.</p>
<p>What I'm going to try to do here is make the honest version of this argument — one that grants transformers everything they deserve, identifies the specific structural problems that can't be engineered away, and explains why those problems push toward continual learning as a matter of logic rather than preference.</p>
<p>If the argument is wrong, I'd genuinely like to know. The paper is at <a href="https://logossoma.com/paper">logossoma.com/paper</a>. The architecture is open. Push back.</p>
<h2>What transformers actually solved</h2>
<p>It's worth being precise about why transformers took over. Before attention mechanisms, sequence models had a specific failure mode: they forgot. RNNs and LSTMs compressed history into a fixed hidden state, which meant long-range dependencies degraded with distance. Attention solved this by letting every token directly attend to every other token — no compression, no degradation, full access to the sequence.</p>
<p>This was genuinely important. It's why transformers work so well on language, where long-range structure matters enormously — a pronoun at position 800 might refer to a noun at position 12.</p>
<p>The cost of this design is quadratic. Attending to everything means O(n²) computation in sequence length. At the scales transformers are now deployed, this is managed through enormous parallel hardware and careful engineering — but it's not free and it doesn't go away.</p>
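<p>The scaling difference is plain arithmetic. The snippet below counts pairwise attention interactions per layer against per-byte trace updates for K = 46 traces (soma's example configuration); the constants are rough, but the growth rates are the point.</p>

```python
K = 46   # number of traces in soma's example configuration

for n in (1_000, 10_000, 100_000):
    attention_pairs = n * n      # full attention: every token attends to every token
    trace_work = n * K           # fixed O(K) trace update per byte
    print(f"n={n:>7}: attention pairs {attention_pairs:.1e}, trace updates {trace_work:.1e}")
```

Each tenfold increase in sequence length multiplies the attention work by a hundred and the trace work by ten — the gap is the architecture, not the implementation.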
<div class="callout">
<p>The transformer's power and its cost come from the same place: full attention across the sequence. You can't have one without the other. This isn't an implementation detail — it's the architecture.</p>
</div>
<h2>The three structural limits</h2>
<div class="argument">
<div class="argument-num">limit <span>01</span></div>
<h3>Static weights after training</h3>
<p>A trained transformer doesn't learn. Its weights are frozen. Everything it knows was encoded during the training run — a single, expensive, offline process. This is fine if the world is static. The world is not static. Deploying a frozen model into a changing environment means its knowledge decays from the moment of deployment. Fine-tuning is a workaround: it requires new data, new compute, new deployment cycles. It doesn't solve the problem — it manages it at cost.</p>
</div>
<div class="argument">
<div class="argument-num">limit <span>02</span></div>
<h3>The context window is a ceiling, not a feature</h3>
<p>Context windows have grown dramatically — from thousands to millions of tokens in recent years. This is presented as progress, and in some ways it is. But the context window is also an acknowledgement that the model has no persistent memory: everything it needs to know about a conversation or document must fit within the window, right now, at inference time. When the window closes, it forgets. This is a fundamental property of the architecture, not a parameter you can tune away. Models with infinite context windows would have infinite inference cost.</p>
</div>
<div class="argument">
<div class="argument-num">limit <span>03</span></div>
<h3>Generalisation requires enormous scale</h3>
<p>Transformers learn by exposure. To generalise reliably, they need to see enormous amounts of data — which requires enormous training runs on enormous hardware. This is why frontier model training costs hundreds of millions of dollars. The scale isn't ambition; it's a requirement of the learning mechanism. A smaller transformer trained on less data is categorically less capable, not just quantitatively. This creates a structural barrier: building a useful transformer is only possible for organisations with serious compute budgets.</p>
</div>
<div class="pull">
<p>"Attention is all you need" was right for what transformers were solving. The question is whether it's still the right frame for what we need AI to do next.</p>
</div>
<h2>Why these limits aren't engineering problems</h2>
<p>The important distinction is between problems that can be solved with more engineering and problems that are load-bearing properties of the architecture. The three limits above are the second kind.</p>
<p>Quadratic attention cost can be reduced — sparse attention, linear attention, various approximations — but these all trade away the full-attention property that makes transformers powerful in the first place. Static weights can be worked around through retrieval augmentation, fine-tuning, or tool use — but none of these give the model the ability to actually learn from deployment. Scale requirements can be reduced through distillation and quantisation — but a distilled model's knowledge ceiling is the original model's ceiling.</p>
<p>These are approximations and workarounds. They don't change the underlying geometry.</p>
<div class="concession">
<div class="concession-label">the fair counterargument</div>
<p>Transformers may scale to capabilities where these limits don't matter in practice. A sufficiently capable frozen model might be useful enough that continual adaptation is unnecessary. Long context windows might approximate persistent memory well enough for most applications.</p>
<p>This is a legitimate position. My response is: "for most applications" does a lot of work in that sentence. There are large and growing classes of problems — robotics, edge AI, personalised systems, real-time adaptation — where these workarounds structurally cannot close the gap.</p>
</div>
<h2>What the logic requires</h2>
<p>If you want a model that:</p>
<ul style="list-style:none;padding:0;margin:0 0 22px;">
<li style="padding:6px 0;padding-left:20px;position:relative;color:var(--text-dim);">
<span style="position:absolute;left:0;color:var(--accent);">→</span> Learns continuously from deployment data without retraining
</li>
<li style="padding:6px 0;padding-left:20px;position:relative;color:var(--text-dim);">
<span style="position:absolute;left:0;color:var(--accent);">→</span> Runs at constant cost regardless of history length
</li>
<li style="padding:6px 0;padding-left:20px;position:relative;color:var(--text-dim);">
<span style="position:absolute;left:0;color:var(--accent);">→</span> Operates usefully on modest hardware without scale requirements
</li>
<li style="padding:6px 0;padding-left:20px;position:relative;color:var(--text-dim);">
<span style="position:absolute;left:0;color:var(--accent);">→</span> Maintains persistent memory without a context window ceiling
</li>
</ul>
<p>...then you need a different architecture. Not a better transformer. A different design that makes different tradeoffs from the ground up.</p>
<p>The architecture that satisfies these properties is one where learning is continuous, memory is fixed-size, and cost per step is constant. This is not a specific architecture — it's a set of constraints that any architecture satisfying these requirements must meet. Soma is one concrete instantiation. There will be others. The point is that the constraints themselves follow logically from the requirements.</p>
<h2>What this means for where AI goes</h2>
<p>I'm not predicting transformers disappear. They're too useful and too entrenched for that. What I'm predicting is that the applications where transformers hit structural walls — edge deployment, real-time adaptation, personalised systems, anything that needs to run without a data center — will converge toward continual architectures, because the alternative is perpetually engineering around limitations that are load-bearing.</p>
<p>This is already beginning to happen in the research literature. Mamba, RWKV, and related recurrent and state-space architectures are exploring the design space of fixed-cost sequence models. The research community is circling the same problem from different directions.</p>
<p>Soma's approach — byte-level learning, geometrically spaced spectral traces, fixed memory — is a specific, peer-reviewed answer to the same question. The paper was accepted at AAAI 2026. The mathematics is principled rather than ad hoc. The architecture runs on a Mac mini.</p>
<div class="callout">
<p><strong>The bet logOS is making</strong> is that the community of people who build with continual architectures will need a place to share what they've built — checkpoints, lineages, forks, experiments. We're building that place now, while the community is still small enough that being early matters.</p>
</div>
<h2>Why "time is all you need"</h2>
<p>The transformer paper is called "Attention Is All You Need." The title makes a claim about what the fundamental operation of intelligence is: attending to the right things in a sequence.</p>
<p>Soma's paper is called "Time Is All You Need." The claim is different: the fundamental operation is learning over time. Not attending to a fixed context, but accumulating understanding from a continuous stream. Not a snapshot of knowledge frozen at training, but a process that unfolds alongside experience.</p>
<p>This isn't a marketing distinction. It's a different theory of what intelligence requires. If the transformer title is right, then scaling attention is the path forward. If the soma title is right, then the path forward is architectures that learn continuously, remember efficiently, and operate at constant cost.</p>
<p>I think the second one is right. The argument above is why. The architecture is open. You can download it, train something on it, and see for yourself.</p>
<div class="cta-block">
<h3>Read the paper. Try the architecture.</h3>
<p>Time Is All You Need — AAAI 2026. One script, PyTorch only, runs on hardware you already have.</p>
<div class="cta-row">
<a href="https://logossoma.com/paper" class="cta-btn">read the paper →</a>
<a href="https://logossoma.com/api/download/soma.zip" class="cta-btn-ghost">download soma</a>
</div>
</div>
```
<title>Continual Learning for Robotics and Edge AI | logOS</title>
<meta name="description" content="Static models break in the real world. Here's why continual learning isn't a research curiosity — it's the only architecture that makes sense for robotics and edge deployment.">
<meta name="keywords" content="continual learning robotics, edge AI training, on-device learning, AI sensor data, train AI on sensor streams, local AI robotics, catastrophic forgetting">
<link rel="canonical" href="https://logossoma.com/blog/continual-learning-robotics-edge-ai">
<meta property="og:title" content="Continual Learning for Robotics and Edge AI">
<meta property="og:description" content="Static models break in the real world. Here's why continual learning is the only architecture that makes sense at the edge.">
<meta property="og:type" content="article">
<meta property="og:url" content="https://logossoma.com/blog/continual-learning-robotics-edge-ai">
<header>
<span class="tag">guide · robotics · edge AI</span>
<h1>Static models break in the real world.<br>Continual learning doesn't.</h1>
<p class="subtitle">Every robotics and edge AI deployment eventually hits the same wall: the world changes, the model doesn't. Here's why that's an architecture problem — and what solving it actually looks like.</p>
<div class="meta">
<span>James Blight</span>
<span>logossoma.com</span>
<span>AAAI 2026</span>
</div>
</header>
<article>
```
<h2>The wall every edge deployment hits</h2>
<p>You train a model. You deploy it to a robot, a sensor node, an embedded device. For a while it works. Then the environment shifts — new lighting conditions, a different operator, seasonal drift in sensor readings, hardware wear. The model was trained on the world as it was. It has no mechanism to adapt to the world as it is.</p>
<p>The standard solution is to periodically retrain on new data and redeploy. This works in a data center. It doesn't work at the edge, where connectivity is unreliable, compute is constrained, and the whole point was to operate independently.</p>
<div class="callout">
<p><strong>The fundamental mismatch:</strong> transformer-based models are trained once and frozen. The real world is not frozen. Every edge deployment is a bet that the gap between training distribution and deployment reality stays small enough to tolerate. Often it doesn't.</p>
</div>
<p>The standard response to this in the research literature is "continual learning" — techniques that allow a model to incorporate new information without forgetting old information. The problem is that most approaches to continual learning are bolt-ons: elastic weight consolidation, progressive neural networks, replay buffers. They fight against the static-weight assumption of the underlying architecture.</p>
<p>The result is a tradeoff. Learn too aggressively and the model overwrites what it knew; learn too conservatively and it never adapts. The failure mode at the aggressive end is catastrophic forgetting, and it has been an open problem for over thirty years precisely because the architectures weren't designed for learning to be ongoing.</p>
<p>The cost in production isn't just performance degradation. It's unpredictability. A model that sometimes forgets critical behaviour is worse than a model that never adapts — you can't reason about its failure modes.</p>
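<p>The tradeoff is visible even in the simplest possible online learner. As an illustration (a toy, not soma's mechanism): an exponential moving average tracking a signal whose distribution shifts. A large step size adapts within a few samples but is equally quick to overwrite; a small step size is stable but barely moves.</p>
<div class="code-label">toy illustration: the plasticity/stability tradeoff</div>

```python
# Toy illustration of the plasticity/stability tradeoff (not soma's mechanism).
# An exponential moving average tracks a signal whose mean jumps from 0.0 to 1.0.

def track(values, alpha):
    """Online mean estimate with step size alpha; one constant-cost update per value."""
    estimate = 0.0
    for v in values:
        estimate += alpha * (v - estimate)
    return estimate

old_regime = [0.0] * 100  # the environment before the shift
new_regime = [1.0] * 5    # a brief glimpse of the environment after it

fast = track(old_regime + new_regime, alpha=0.9)   # plastic: adapts in a few steps
slow = track(old_regime + new_regime, alpha=0.01)  # stable: barely moves

# fast ends near 1.0, but five noisy samples would have dragged it just as far;
# slow ends near 0.0: it remembers, but it hasn't adapted.
```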
<h2>What the edge actually needs</h2>
<p>Strip away the research framing and the requirements are clear:</p>
<div class="timeline">
<div class="timeline-item">
<div class="timeline-label">requirement 01</div>
<h3>Fixed memory footprint</h3>
<p>Edge devices have constrained RAM. A model that grows with data is a model that eventually fails. Memory must be bounded at architecture level, not managed around.</p>
</div>
<div class="timeline-item">
<div class="timeline-label">requirement 02</div>
<h3>Constant inference cost</h3>
<p>Quadratic attention cost is fine in a data center. At the edge, every watt and every millisecond matters. Inference cost must not scale with context or history length.</p>
</div>
<div class="timeline-item">
<div class="timeline-label">requirement 03</div>
<h3>Learning from any byte stream</h3>
<p>Sensor data isn't tokenised text. It's raw bytes — telemetry, control logs, audio, IMU readings. The model should learn directly from whatever the device produces, without a preprocessing pipeline.</p>
</div>
<div class="timeline-item">
<div class="timeline-label">requirement 04</div>
<h3>No connectivity dependency</h3>
<p>The model must learn and adapt on-device. A continual learning system that requires cloud sync for weight updates isn't continual — it's periodic fine-tuning with extra steps.</p>
</div>
</div>
<p>These aren't aspirational. They're the minimum viable properties for an AI system that genuinely operates at the edge without babysitting.</p>
<h2>Where soma fits</h2>
<p>soma was designed around exactly these constraints. It's a byte-level continual learning architecture — one Python script, PyTorch only — that learns from raw byte streams, maintains a fixed-size memory regardless of training duration, and runs inference at constant cost per byte.</p>
<p>The key architectural idea: a bank of geometrically spaced spectral traces provides a basis for representing information across timescales. This is what allows the model to retain long-range structure without growing its memory — the traces compress temporal relationships rather than storing them explicitly.</p>
<div class="scenario-grid">
<div class="scenario">
<div class="scenario-label">scenario</div>
<h3>Robot arm calibration</h3>
<p>Train on control logs. The model learns the arm's dynamics. As hardware wears, it adapts continuously — no redeployment cycle.</p>
</div>
<div class="scenario">
<div class="scenario-label">scenario</div>
<h3>Sensor anomaly detection</h3>
<p>Train on telemetry streams. Normal behaviour is learned continuously. Anomalies surface as prediction failure — no labelled dataset required.</p>
</div>
<div class="scenario">
<div class="scenario-label">scenario</div>
<h3>Edge audio processing</h3>
<p>Train on raw audio bytes from the deployment environment. The model learns the acoustic profile of that specific space, not a generalised one.</p>
</div>
<div class="scenario">
<div class="scenario-label">scenario</div>
<h3>Personal device adaptation</h3>
<p>Train on a user's own data — typing patterns, usage logs, documents. A model that genuinely knows one person, running locally, never leaving the device.</p>
</div>
</div>
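<p>The anomaly-detection scenario can be made concrete with a toy stand-in for the model. In the sketch below, a simple online next-byte predictor (add-one smoothed counts, not soma's architecture) assigns each byte a surprisal score: familiar transitions score low, never-seen ones score high. That score is the "prediction failure" the scenario describes.</p>
<div class="code-label">toy sketch: anomalies as prediction failure</div>

```python
import math
from collections import defaultdict

class ByteSurprise:
    """Toy online next-byte predictor (not soma): anomaly score = surprisal."""

    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))

    def observe(self, prev, nxt):
        """Score the transition prev to nxt, then learn from it."""
        ctx = self.counts[prev]
        total = sum(ctx.values())
        p = (ctx[nxt] + 1) / (total + 256)  # add-one smoothing over 256 byte values
        ctx[nxt] += 1
        return -math.log2(p)                # surprisal in bits: high means anomalous

det = ByteSurprise()
stream = b"abababab" * 50                   # "normal" telemetry: a strict ab pattern
for i in range(len(stream) - 1):
    det.observe(stream[i], stream[i + 1])

familiar = det.observe(ord("a"), ord("b"))  # a transition seen hundreds of times
novel = det.observe(ord("a"), ord("z"))     # a transition never seen before
# familiar scores low, novel scores high: no labelled dataset required
```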
<h2>In practice</h2>
<p>Training on a sensor stream looks like this:</p>
<div class="code-label">train on raw bytes from any source</div>
<pre><code>python soma.py train --input sensor_log.bin</code></pre>
<p>The model doesn't care what the bytes represent. It learns the statistical structure of whatever you feed it. Run inference the same way:</p>
<div class="code-label">inference at constant cost</div>
<pre><code>python soma.py chat --resume your_model.pt</code></pre>
<p>Memory stays fixed. Cost stays constant. The checkpoint is a single <code>.pt</code> file you can move to any device — a Mac mini, a Raspberry Pi with enough RAM, whatever you're deploying to.</p>
<div class="callout">
<p><strong>The checkpoint is the deployment artifact.</strong> No serving infrastructure, no container, no model registry. One file. Runs wherever PyTorch runs.</p>
</div>
<h2>Why this matters beyond robotics</h2>
<p>The edge AI framing is useful because the constraints are legible — limited memory, limited compute, no connectivity. But the same constraints apply anywhere you want AI to operate close to data rather than far from it.</p>
<p>A model running on your machine, learning from your data, never sending anything anywhere — that's not just an edge AI architecture. It's a different relationship between people and the models they use. Ownership instead of subscription. Specificity instead of generality. A model that gets better at your problem, not everyone's problem.</p>
<p>We think that's where this goes. Not because it's philosophically appealing, but because the economics and the architecture point there. Continual learning at fixed cost, running on hardware you already own, is strictly better than the alternative for a large class of problems. The alternative just happened to arrive first.</p>
<div class="cta-block">
<h3>Download soma and start training</h3>
<p>One script. PyTorch only. Runs on a Mac mini or your existing gaming PC. Browse community checkpoints or start from scratch.</p>
<a href="https://logossoma.com/api/download/soma.zip" class="cta-btn">download soma →</a>
</div>
```
<title>How to Train Your Own AI Model Locally | logOS</title>
<meta name="description" content="You don't need a data center. Here's how to train a real AI model on hardware you already own — and why the architecture matters more than the compute.">
<meta name="keywords" content="train AI locally, how to train AI, train AI on Mac, local AI training, continual learning AI, train your own AI model">
<link rel="canonical" href="https://logossoma.com/blog/train-ai-locally">
<!-- OG -->
<meta property="og:title" content="How to Train Your Own AI Model Locally">
<meta property="og:description" content="You don't need a data center. Here's how to train a real AI model on hardware you already own.">
<meta property="og:type" content="article">
<meta property="og:url" content="https://logossoma.com/blog/train-ai-locally">
<header>
<span class="tag">guide · local AI training</span>
<h1>You can train your own AI.<br>Here's what that actually means.</h1>
<p class="subtitle">Most people think training AI requires a data center. It doesn't — if the architecture is right. Here's an honest look at what's involved, what the bottlenecks really are, and why they're not what you think.</p>
<div class="meta">
<span>James Blight</span>
<span>logossoma.com</span>
<span>AAAI 2026</span>
</div>
</header>
<article>
```
<h2>The assumption everyone makes</h2>
<p>When people ask "can I train my own AI?", they usually mean: can I fine-tune a model someone else built? Run LoRA on a quantized checkpoint? That's not training — that's decoration.</p>
<p>Real training means the model learns from data you give it, building internal representations from scratch. The reason people assume this requires racks of A100s is that the dominant architecture — the transformer — genuinely does. Not because compute is inherently expensive, but because the transformer's design makes it so.</p>
<div class="callout">
<p><strong>The actual bottleneck isn't compute. It's architecture.</strong> Transformers have quadratic attention cost, static weights after training, and require enormous batches to learn efficiently. Fix those and the compute story changes completely.</p>
</div>
<h2>Why transformers need so much hardware</h2>
<p>Transformer attention is O(n²) in sequence length. Every token attends to every other token. This is what makes them powerful — and what makes them expensive. As context grows, cost explodes.</p>
<p>More importantly: once a transformer is trained, it's frozen. The weights don't update during inference. This means everything the model will ever know has to be baked in upfront, during a single expensive training run on an enormous dataset. The scale isn't optional — it's load-bearing.</p>
<p>If you want to teach it something new, you retrain. Or fine-tune, which is a workaround, not a solution.</p>
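<p>The scaling difference is easy to make concrete. A back-of-envelope comparison in relative units (work per step, not measured FLOPs):</p>
<div class="code-label">back-of-envelope: quadratic vs constant cost</div>

```python
# Back-of-envelope scaling comparison (relative units, not measured FLOPs).

def attention_work(n):
    """Per-step work for full self-attention over n tokens: every token
    attends to every other token, so work grows as n squared."""
    return n * n

def continual_work(n):
    """Per-step work for a fixed-state continual model: constant,
    independent of how much history has been seen."""
    return 1

base = attention_work(1_000)
for n in (1_000, 2_000, 4_000, 8_000):
    print(f"context {n}: attention {attention_work(n) // base}x, continual {continual_work(n)}x")
# context 1000: attention 1x, continual 1x
# context 2000: attention 4x, continual 1x
# context 4000: attention 16x, continual 1x
# context 8000: attention 64x, continual 1x
```

Doubling the context quadruples attention cost while the continual column never moves, and the gap only widens from there.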
<div class="compare">
<div class="compare-box">
<div class="compare-label">Transformer</div>
<ul>
<li>Quadratic attention cost</li>
<li>Static after training</li>
<li>Requires massive datasets upfront</li>
<li>Context window is a hard limit</li>
<li>Fine-tuning is expensive and lossy</li>
</ul>
</div>
<div class="compare-box highlight">
<div class="compare-label">Continual architecture</div>
<ul>
<li>Constant cost per byte</li>
<li>Learns continuously from any stream</li>
<li>Useful with modest data</li>
<li>No context window concept</li>
<li>Every inference updates the model</li>
</ul>
</div>
</div>
<h2>What continual learning actually changes</h2>
<p>A model that learns continuously isn't just cheaper to run. It's a fundamentally different kind of object. It's not a static artifact you query — it's a process that unfolds alongside whatever it's learning from.</p>
<p>This matters for training on a personal machine because the economics flip. You're not trying to compress the internet into weights in a weekend. You're feeding your model a stream of data — your data — and it accumulates understanding over time. Like you do.</p>
<p>The practical upshot: useful models don't require enormous training runs. They require sustained, focused learning on data that actually matters for your use case.</p>
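<p>Concretely, training-as-a-process has the shape below. This is a minimal sketch with a hypothetical stand-in model, not soma's actual code: the point is the loop, where every byte triggers one constant-cost update and memory never grows.</p>
<div class="code-label">sketch: the continual training loop</div>

```python
# Minimal shape of continual byte-level training. OnlineByteModel is a
# hypothetical stand-in (a smoothed frequency model), not soma's class.

class OnlineByteModel:
    """Fixed memory: 256 counts, no matter how many bytes are seen."""

    def __init__(self):
        self.counts = [1] * 256  # add-one smoothed byte frequencies
        self.total = 256

    def update(self, byte):
        """One constant-cost learning step; returns the probability the
        model assigned to this byte before seeing it."""
        p = self.counts[byte] / self.total
        self.counts[byte] += 1
        self.total += 1
        return p

def stream_bytes(path, chunk_size=4096):
    """Yield a file's bytes one at a time: the 'stream of data' in the text."""
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            yield from chunk

model = OnlineByteModel()
# for b in stream_bytes("your_data.txt"):  # any file at all
#     model.update(b)                      # the model improves as it reads
```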
<h2>What you actually need</h2>
<table>
<thead>
<tr>
<th>Hardware</th>
<th>What it gets you</th>
<th>Realistic use</th>
</tr>
</thead>
<tbody>
<tr>
<td>Mac mini (M-series)</td>
<td>Unified memory, efficient inference</td>
<td>Training on text, code, small sensor streams</td>
</tr>
<tr>
<td>Gaming PC (RTX 3090+)</td>
<td>VRAM for larger hidden dims</td>
<td>Faster training, larger models</td>
</tr>
<tr>
<td>Any modern laptop</td>
<td>CPU inference at constant cost</td>
<td>Inference, light training</td>
</tr>
</tbody>
</table>
<p>The key property you want: <strong>constant inference cost per byte</strong>. With a transformer, cost grows with context. With a continual architecture, it doesn't. This is what makes local deployment viable — not just technically, but economically.</p>
<h2>What to train on</h2>
<p>This is where it gets interesting. Because the model learns from raw bytes — not tokenised, preprocessed, formatted text — the input can be almost anything: source code, personal notes and documents, server logs, telemetry, raw audio.</p>
<p>The discipline isn't feeding it everything — it's feeding it the right things. A model trained on a narrow, high-quality corpus often outperforms one trained broadly on noise.</p>
<h2>Getting started with soma</h2>
<p>soma is a continual-learning architecture — one Python script, PyTorch only — built around these principles. It learns byte-by-byte, maintains a fixed-size memory, and runs inference at constant cost. No context window. No transformer attention.</p>
<p>It was presented at AAAI 2026. The mathematics behind it — geometrically spaced spectral traces as a basis for temporal memory — is motivated and peer-reviewed, not a hack.</p>
<div class="code-label">train on your data</div>
<pre><code>python soma.py train --input your_data.txt</code></pre>
<div class="code-label">chat with what you built</div>
<pre><code>python soma.py chat --resume your_model.pt</code></pre>
<p>That's it. One script. A Mac mini is enough to train something genuinely useful.</p>
<div class="callout">
<p><strong>Checkpoints are portable.</strong> A trained soma model is a single <code>.pt</code> file. You can share it, fork it, build on someone else's — each with a deterministic ID so lineage is always verifiable.</p>
</div>
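<p>One natural way such a deterministic ID can be constructed (an assumption for illustration, not necessarily the scheme logOS uses) is a content hash of the checkpoint file: the same bytes always produce the same ID, so a lineage claim can be re-verified by re-hashing.</p>
<div class="code-label">sketch: a content-hash checkpoint ID</div>

```python
import hashlib

def checkpoint_id(path):
    """Deterministic checkpoint ID as a truncated SHA-256 of the file's bytes.

    Illustrative, not necessarily the scheme logOS uses: identical bytes
    always hash to the same ID on any machine, so a lineage claim can be
    checked by re-hashing the checkpoint.
    """
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            h.update(chunk)
    return h.hexdigest()[:16]
```

Truncating to 16 hex characters keeps IDs readable while leaving collisions vanishingly unlikely at community scale; verifying a fork's claimed parent is then a one-line comparison of stored and recomputed IDs.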
<h2>The bigger picture</h2>
<p>The question "can I train my own AI?" has always had a hidden assumption: that AI training requires infrastructure only large organisations can afford. That assumption is architecture-dependent, not fundamental.</p>
<p>Most of that infrastructure exists to service the transformer's design: quadratic attention, frozen weights, one-shot training at enormous scale. A continual architecture with constant memory and constant cost removes a significant fraction of that overhead, even before accounting for what continual learning unlocks for applications.</p>
<p>We think every large-scale AI application will eventually converge on something like this. Not as an article of faith, but as a consequence of the underlying logic. Fixed memory. Constant cost. Learning that never stops.</p>
<p>The hardware you already own is enough to be part of that.</p>
<div class="cta-block">
<h3>Browse soma checkpoints on logOS</h3>
<p>Download a trained model, continue training it, share what you build. Free to browse. No compute required beyond your own machine.</p>
<a href="https://logossoma.com/checkpoints" class="cta-btn">explore checkpoints →</a>
</div>
```