<header>
<span class="tag">technical · architecture</span>
<h1>How soma works: spectral traces, fixed memory, and byte-level learning.</h1>
<p class="subtitle">A technical walkthrough of soma's architecture for readers who want to understand what's actually happening under the hood — before or after reading the paper.</p>
<div class="meta">
<span>James Blight</span>
<span>logossoma.com</span>
<span>AAAI 2026</span>
</div>
</header>
<h2 id="design-goals">Design goals and constraints</h2>
<p>Soma was designed around a specific set of requirements that existing architectures don't simultaneously satisfy:</p>
<ul style="list-style:none;padding:0;margin:0 0 22px;">
<li style="padding:6px 0;padding-left:20px;position:relative;color:var(--text-dim);">
<span style="position:absolute;left:0;color:var(--accent);">1.</span> Memory must be fixed-size regardless of how long the model has been trained
</li>
<li style="padding:6px 0;padding-left:20px;position:relative;color:var(--text-dim);">
<span style="position:absolute;left:0;color:var(--accent);">2.</span> Inference cost must be constant per input — not per sequence length
</li>
<li style="padding:6px 0;padding-left:20px;position:relative;color:var(--text-dim);">
<span style="position:absolute;left:0;color:var(--accent);">3.</span> Learning must be continuous — the model updates from every byte it sees
</li>
<li style="padding:6px 0;padding-left:20px;position:relative;color:var(--text-dim);">
<span style="position:absolute;left:0;color:var(--accent);">4.</span> The architecture must handle any byte stream without tokenisation or preprocessing
</li>
<li style="padding:6px 0;padding-left:20px;position:relative;color:var(--text-dim);">
<span style="position:absolute;left:0;color:var(--accent);">5.</span> Long-range structure must be preserved without catastrophic forgetting
</li>
</ul>
<p>These constraints rule out transformers (quadratic cost, static weights), standard RNNs (forgetting over long range), and progressive networks (unbounded memory). The architecture has to be designed from scratch to satisfy all five simultaneously.</p>
<div class="callout">
<p><strong>The core insight:</strong> time is the only thing every data stream has in common. If you can represent information across timescales efficiently, you can learn from anything — without knowing in advance what it is.</p>
</div>
<h2 id="byte-level">Why byte-level</h2>
<p>Most language models operate on tokens — subword units produced by a tokeniser trained on a specific corpus. Tokenisation has real advantages: it compresses the sequence, it aligns with linguistic structure, and it reduces the vocabulary size the model needs to predict over.</p>
<p>It also has costs. A tokeniser is a preprocessing step that has to be designed for a specific data type. Text tokenisers don't handle audio. Audio encoders don't handle sensor telemetry. Every new modality requires a new tokenisation scheme, which requires a new model family.</p>
<p>Bytes are universal. Every file — text, code, audio, image, binary sensor log — is a sequence of bytes. A model that learns from raw bytes doesn't need to know what it's looking at. It learns the statistical structure of whatever stream you give it.</p>
<p>The vocabulary is 256 — one entry per possible byte value. The model predicts the next byte given everything it's seen. This is a harder problem than predicting tokens (predictions are made at finer granularity), but it's a uniform problem across all data types.</p>
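<p>To make this concrete, here is the byte view of two very different inputs. Nothing below is soma-specific; it only shows that text and binary data arrive in the same 256-value alphabet.</p>

```python
# Any file or stream is already a sequence of integers in [0, 255] --
# the model's entire input and output vocabulary.
text = "hello".encode("utf-8")            # text -> bytes
blob = bytes([0x89, 0x50, 0x4E, 0x47])    # first four bytes of a PNG header

print(list(text))   # [104, 101, 108, 108, 111]
print(list(blob))   # [137, 80, 78, 71]

for stream in (text, blob):
    assert all(0 <= b <= 255 for b in stream)
```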
<div class="callout blue">
<p><strong>Practical implication:</strong> you don't need a different version of soma for text vs. sensor data vs. audio. The same model, the same script, the same checkpoint format. Modality is a property of your training data, not your architecture.</p>
</div>
<h2 id="traces">The spectral trace bank</h2>
<p>The central data structure in soma is the trace bank — a matrix of shape <code>(256, K)</code> stored in float64, where K is the number of traces (set by <code>n_bands</code>).</p>
<p>Each column in the trace bank is a trace — a running exponential average of the byte values seen so far, decaying at a specific rate. When the model sees a new byte, every trace updates:</p>
<div class="formula">
<div class="formula-label">trace update rule</div>
<code>trace_k ← α_k · trace_k + (1 - α_k) · one_hot(byte)</code>
<p>where α_k is the decay rate for trace k, and one_hot(byte) is a 256-dimensional indicator vector.</p>
</div>
<p>Each trace is a weighted average of the byte history, with recent bytes weighted more heavily. A trace with α close to 1 decays slowly — it remembers far back. A trace with α close to 0 decays quickly — it remembers only the most recent bytes.</p>
<p>The full trace bank is the feature representation passed to the rest of the network. It's a 256×K matrix encoding what the model has seen across K different timescales simultaneously.</p>
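<p>The update rule can be sketched in a few lines. This is an illustrative NumPy version of the rule as written, not soma's implementation, and the decay rates here are arbitrary example values.</p>

```python
import numpy as np

def update_traces(traces, alphas, byte):
    """One step of: trace_k <- alpha_k * trace_k + (1 - alpha_k) * one_hot(byte).

    traces: (256, K) float64 bank; alphas: (K,) decay rates.
    """
    one_hot = np.zeros(256)
    one_hot[byte] = 1.0
    # Broadcasting applies each trace's own decay rate down its column.
    return traces * alphas + np.outer(one_hot, 1.0 - alphas)

K = 4
traces = np.zeros((256, K), dtype=np.float64)
alphas = np.array([0.1, 0.5, 0.9, 0.99])    # fast -> slow example rates

for b in b"abc":
    traces = update_traces(traces, alphas, b)

# From a zero start, after n updates each column sums to 1 - alpha**n.
print(traces[:, 1].sum())   # 0.875 == 1 - 0.5**3
```

The same O(K) amount of work happens once per incoming byte, which is where the constant per-input cost comes from.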
<p>The decay rates α_k are spaced so that the corresponding timescales grow geometrically — each timescale is the previous one multiplied by a fixed base (soma uses the golden ratio, φ ≈ 1.618).</p>
<p>Geometric spacing is not arbitrary. It's motivated by the structure of information in natural sequences: relevant patterns occur across many timescales simultaneously, and the density of information across timescales is approximately log-uniform. Geometric spacing allocates equal representational capacity per octave — the same number of traces covers each order of magnitude of timescale.</p>
<div class="formula">
<div class="formula-label">geometric decay spacing</div>
<code>α_k = 1 - base^(-k) for k = 1, 2, ..., K</code>
<p>With base φ ≈ 1.618 (golden ratio). This produces decay rates that are evenly distributed in log-space, covering timescales from single bytes to hundreds of thousands of bytes within a fixed number of traces.</p>
</div>
<p>The golden ratio base is a specific choice motivated by its optimal packing properties — φ minimises redundancy between adjacent timescales in a geometric sequence. Whether φ is the globally optimal base for this purpose is an open question; it is a principled choice, not an arbitrary one.</p>
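<p>The spacing formula is easy to tabulate. The snippet below evaluates α_k = 1 - base^(-k) with base φ and converts each rate into the standard effective window of an exponential average, 1/(1 - α_k) = φ^k. The K = 46 matches the example configuration used later in this post; the "effective window" interpretation is a standard EMA heuristic, not something the paper defines.</p>

```python
import math

PHI = (1 + math.sqrt(5)) / 2   # golden ratio base, per the formula above
K = 46                         # n_bands from the example config in this post

alphas = [1 - PHI ** (-k) for k in range(1, K + 1)]
windows = [1 / (1 - a) for a in alphas]   # effective memory length = PHI**k bytes

print(f"fastest trace: alpha={alphas[0]:.3f}, window ~{windows[0]:.2f} bytes")
print(f"slowest trace: alpha={alphas[-1]:.12f}, window ~{windows[-1]:.3g} bytes")
```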
<h2 id="forward-pass">The forward pass</h2>
<p>Given the trace bank state after seeing all previous bytes, the forward pass to predict the next byte works as follows:</p>
<div class="component">
<div class="component-header">
<div class="component-name">U · flatten(traces)</div>
<div class="component-type">256×K → hidden_dim</div>
</div>
<div class="component-body">
<p>The trace bank (256 × K matrix) is flattened to a vector of length 256·K and projected to the hidden dimension via a learned weight matrix U. This is the feature extraction step — U learns which combinations of timescale-resolved byte statistics are predictively useful.</p>
</div>
</div>
<div class="component">
<div class="component-header">
<div class="component-name">nonlinearity → W → logits</div>
<div class="component-type">hidden_dim → 256</div>
</div>
<div class="component-body">
<p>The hidden representation passes through a nonlinearity and then a second learned matrix W to produce 256 logits — one per possible next byte. Standard cross-entropy loss against the actual next byte drives learning.</p>
</div>
</div>
<div class="component">
<div class="component-header">
<div class="component-name">Wd (optional direct readout)</div>
<div class="component-type">256·K → 256, added to logits</div>
</div>
<div class="component-body">
<p>When <code>direct_readout=True</code>, a third matrix Wd maps directly from the flattened traces to logits, bypassing the hidden layer. This skip connection improves performance on certain corpora and adds a residual pathway for short-range patterns that don't benefit from the full hidden representation.</p>
</div>
</div>
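<p>Putting the three components together, the whole forward pass is a handful of matrix operations. The sketch below uses NumPy and tiny dimensions to show the shapes; the choice of tanh is an assumption (the post doesn't name the nonlinearity), and this is illustrative, not soma's actual code.</p>

```python
import numpy as np

# Tiny dimensions for illustration; the text's example config uses
# n_bands=46 and hidden_dim=8192. Names U, W, Wd follow the prose.
rng = np.random.default_rng(0)
K, hidden_dim = 4, 32
U  = rng.normal(scale=0.01, size=(256 * K, hidden_dim))
W  = rng.normal(scale=0.01, size=(hidden_dim, 256))
Wd = rng.normal(scale=0.01, size=(256 * K, 256))   # optional direct readout

traces = np.zeros((256, K))          # current trace bank state

x = traces.flatten()                 # (256*K,)
h = np.tanh(x @ U)                   # hidden representation (tanh assumed)
logits = h @ W + x @ Wd              # 256 logits, one per candidate next byte
probs = np.exp(logits - logits.max())
probs /= probs.sum()                 # softmax over the byte vocabulary
```

Every array here has a fixed size, which is the whole point: the cost of this pass does not depend on how many bytes produced the trace state.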
<p>The total parameter count is dominated by U and W. With <code>hidden_dim=8192</code> and <code>n_bands=46</code>, U has shape (256×46, 8192) and W has shape (8192, 256) — roughly 100M parameters in those two matrices. This is the learnable content of a checkpoint.</p>
<div class="callout">
<p><strong>Why inference is constant cost:</strong> the forward pass processes one byte at a time. The trace update is O(K). The matrix multiplications are fixed-size regardless of how many bytes have been seen previously. There is no attention over prior context — the traces are the context, and they're always the same size.</p>
</div>
<h2 id="checkpoint">What a checkpoint contains</h2>
<p>A soma checkpoint is a single <code>.pt</code> file. It bundles:</p>
<ul style="list-style:none;padding:0;margin:0 0 22px;">
<li style="padding:6px 0;padding-left:20px;position:relative;color:var(--text-dim);">
<span style="position:absolute;left:0;color:var(--accent);">→</span> <strong style="color:#c8c8c8;">Weight matrices</strong> U, W, and optionally Wd
</li>
<li style="padding:6px 0;padding-left:20px;position:relative;color:var(--text-dim);">
<span style="position:absolute;left:0;color:var(--accent);">→</span> <strong style="color:#c8c8c8;">Trace bank</strong> — the current (256, K) state in float64
</li>
<li style="padding:6px 0;padding-left:20px;position:relative;color:var(--text-dim);">
<span style="position:absolute;left:0;color:var(--accent);">→</span> <strong style="color:#c8c8c8;">Hyperparameters</strong> — n_bands, hidden_dim, base, direct_readout, and training settings
</li>
<li style="padding:6px 0;padding-left:20px;position:relative;color:var(--text-dim);">
<span style="position:absolute;left:0;color:var(--accent);">→</span> <strong style="color:#c8c8c8;">bytes_seen</strong> — total training bytes consumed
</li>
<li style="padding:6px 0;padding-left:20px;position:relative;color:var(--text-dim);">
<span style="position:absolute;left:0;color:var(--accent);">→</span> <strong style="color:#c8c8c8;">checkpoint_id</strong> — SHA256 over weights + traces + fixed config. Deterministic and tamper-evident.
</li>
<li style="padding:6px 0;padding-left:20px;position:relative;color:var(--text-dim);">
<span style="position:absolute;left:0;color:var(--accent);">→</span> <strong style="color:#c8c8c8;">checkpoint_history</strong> — declared ancestor IDs, oldest first
</li>
</ul>
<p>Two checkpoints are compatible — can be continued from one or merged — if and only if their n_bands, hidden_dim, base, and direct_readout match exactly. These define the architecture; everything else is state.</p>
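<p>The compatibility rule is mechanical enough to express directly. This sketch assumes the four architecture-defining fields are stored under the hyperparameter names used above; the key names in a real checkpoint may differ.</p>

```python
# Compatibility rule as stated: two checkpoints can be continued from
# or merged iff these four architecture-defining fields match exactly.
ARCH_FIELDS = ("n_bands", "hidden_dim", "base", "direct_readout")

def compatible(cfg_a: dict, cfg_b: dict) -> bool:
    return all(cfg_a[f] == cfg_b[f] for f in ARCH_FIELDS)

a = {"n_bands": 46, "hidden_dim": 8192, "base": 1.618,
     "direct_readout": True, "bytes_seen": 10**9}
b = {"n_bands": 46, "hidden_dim": 8192, "base": 1.618,
     "direct_readout": True, "bytes_seen": 5}
c = dict(a, n_bands=32)

print(compatible(a, b))  # True  -- bytes_seen is state, not architecture
print(compatible(a, c))  # False -- n_bands differs
```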
<div class="code-label">inspect a checkpoint</div>
<pre><code>import torch

# Field names follow the checkpoint contents listed above; the path is
# illustrative and exact key names may differ between soma versions.
ckpt = torch.load("soma_checkpoint.pt", map_location="cpu")

print(ckpt["checkpoint_id"])
print(ckpt["bytes_seen"])
print(ckpt["checkpoint_history"])</code></pre>
<h2 id="limitations">Honest limitations</h2>
<p>Soma is real and peer-reviewed. It also has limitations worth being clear about.</p>
<p><strong style="color:#c8c8c8;">Context is implicit, not explicit.</strong> Transformers can be prompted with specific context because they attend to it directly. Soma's context is the trace state — accumulated history. You can't inject a specific document into a soma model's "attention" the way you can with a transformer. This makes soma better for stream-learning and worse for retrieval-style tasks.</p>
<p><strong style="color:#c8c8c8;">Scale is uncharted.</strong> Transformer scaling laws are well-studied. Soma's are not. The architecture runs usefully at modest scale — the Mac mini claim is real. Whether it exhibits the same capability-with-scale behaviour as transformers is an open research question. We think it does. We don't have the empirical evidence at frontier scale to prove it.</p>
<p><strong style="color:#c8c8c8;">The community is new.</strong> There are two checkpoints on logOS right now. The ecosystem of tooling, evaluations, and shared benchmarks that the transformer world has built over years doesn't exist yet for soma. You're early. That means opportunity and roughness in equal measure.</p>
<p>None of these limit the core claims: fixed memory, constant cost, continual learning, byte-level universality. Those hold. The paper was peer-reviewed. The code runs. The architecture does what it says.</p>
<div class="cta-block">
<h3>Read the full paper or download the code</h3>
<p>Time Is All You Need — AAAI 2026. The full architecture specification is also at logossoma.com/specs.</p>
<div class="cta-row">
<a href="https://logossoma.com/paper" class="cta-btn">read the paper →</a>
<a href="https://logossoma.com/specs" class="cta-btn-ghost">technical spec</a>
</div>
</div>
<title>What Is Catastrophic Forgetting in AI? | logOS</title>
<meta name="description" content="Catastrophic forgetting is why most AI models can't learn new things without losing what they knew. Here's what it is, why it happens, and what solving it actually requires.">
<meta name="keywords" content="catastrophic forgetting, catastrophic forgetting AI, continual learning, neural network forgetting, AI learning problem, stability plasticity dilemma, what is catastrophic forgetting">
<link rel="canonical" href="https://logossoma.com/blog/what-is-catastrophic-forgetting">
<meta property="og:title" content="What Is Catastrophic Forgetting in AI?">
<meta property="og:description" content="Why most AI models can't learn new things without losing what they knew — and what solving it actually requires.">
<meta property="og:type" content="article">
<meta property="og:url" content="https://logossoma.com/blog/what-is-catastrophic-forgetting">
<header>
<span class="tag">explainer · continual learning</span>
<h1>What is catastrophic forgetting — and why has it taken thirty years to solve?</h1>
<p class="subtitle">Every neural network that learns something new risks destroying what it already knew. This isn't a bug. It's a consequence of how the weights work. Here's what that means and why it matters.</p>
<div class="meta">
<span>James Blight</span>
<span>logossoma.com</span>
<span>AAAI 2026</span>
</div>
</header>
<article>
<h2>The basic problem</h2>
<p>Imagine you train a neural network to recognise cats. It learns well — high accuracy, generalises cleanly. Now you train it on dogs. When you test it on cats again, it's forgotten most of what it knew. The dog training overwrote the cat training. That's catastrophic forgetting.</p>
<p>The word "catastrophic" is accurate. It's not gradual degradation. Performance on the old task can collapse nearly completely after even a small amount of new training. The network doesn't blend the old and new knowledge — it replaces it.</p>
<div class="callout bad">
<p><strong>Why it happens:</strong> neural network weights are shared across tasks. When you update them to learn dogs, you're modifying the same parameters that encoded cats. There's no mechanism that says "protect this part of the network — it's doing something important."</p>
</div>
<p>This was first described in the late 1980s by McCloskey and Cohen, who called it "catastrophic interference." The term has stuck in various forms for thirty years because the problem has stuck. It's not solved — it's managed, imperfectly, through a collection of workarounds.</p>
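<p>The effect is easy to reproduce with nothing more than a shared linear model and plain gradient descent: train it on task A, then on task B, and watch task A's error come back. This is a toy, but the mechanism is exactly the one described above — one set of weights serving two tasks.</p>

```python
import numpy as np

# Two regression tasks share one weight vector, the way weights are
# shared across tasks in a neural network. Train on A, then on B,
# with nothing protecting A's weights. (Toy illustration, not a benchmark.)
rng = np.random.default_rng(0)

X_a = rng.normal(size=(50, 5)); y_a = X_a @ np.array([1.0, 2.0, 0.0, 0.0, 0.0])
X_b = rng.normal(size=(50, 5)); y_b = X_b @ np.array([0.0, 0.0, 0.0, 3.0, -1.0])

def train(w, X, y, lr=0.1, steps=200):
    for _ in range(steps):
        w = w - lr * X.T @ (X @ w - y) / len(y)
    return w

def mse(w, X, y):
    return float(np.mean((X @ w - y) ** 2))

w = train(np.zeros(5), X_a, y_a)
before = mse(w, X_a, y_a)        # task A learned: error near zero
w = train(w, X_b, y_b)
after = mse(w, X_a, y_a)         # task A collapsed after training on B

print(f"task A error after learning A: {before:.6f}")
print(f"task A error after learning B: {after:.2f}")
```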
<h2>Why it's harder than it looks</h2>
<p>The obvious fix is: don't let new training change the weights that matter for old tasks. The problem is that you don't know which weights matter, and even if you did, the weights that encode knowledge about cats are often the same weights that would encode knowledge about dogs most efficiently. The representations overlap.</p>
<div class="analogy">
<div class="analogy-label">analogy</div>
<p>Think of it like writing on a whiteboard in a shared office. You come in early and fill the board with something important. The next person comes in, needs the space, and erases part of it to write their own thing. They weren't being malicious — they just needed the board. There was no way to know which parts you needed to keep.</p>
<p>Neural network weights are that whiteboard. Training is always writing on shared space.</p>
</div>
<p>This is sometimes called the stability-plasticity dilemma. A network that's very stable (protects old weights strongly) can't learn new things efficiently — plasticity is low. A network that's very plastic (learns new things easily) forgets old things — stability is low. Getting both at once is the hard problem.</p>
<h2>Thirty years of workarounds</h2>
<p>The research literature on catastrophic forgetting is vast. Here are the main families of approaches and what they actually achieve:</p>
<div class="approaches">
<div class="approach">
<div>
<div class="approach-name">Elastic Weight Consolidation (EWC)</div>
<div class="approach-verdict">→ reduces forgetting, doesn't eliminate it</div>
</div>
<div class="approach-desc">Identify which weights were most important for old tasks (using Fisher information) and penalise changing them during new training. Works partially — the penalty slows forgetting but the underlying shared-weight problem remains. Doesn't scale well to many sequential tasks.</div>
</div>
<div class="approach">
<div>
<div class="approach-name">Replay buffers</div>
<div class="approach-verdict">→ effective but expensive</div>
</div>
<div class="approach-desc">Store a sample of old training data and mix it into new training batches. This works — the network stays exposed to old tasks. The cost is storing data, which grows with the number of tasks. Also raises privacy questions for personal or sensitive data.</div>
</div>
<div class="approach">
<div>
<div class="approach-name">Progressive networks</div>
<div class="approach-verdict">→ no forgetting, but memory grows without bound</div>
</div>
<div class="approach-desc">Freeze old network columns when learning new tasks and add new columns. Old knowledge is perfectly preserved — but the network grows linearly with tasks. Not viable for continuous deployment.</div>
</div>
<div class="approach">
<div>
<div class="approach-name">Fine-tuning on frozen base</div>
<div class="approach-verdict">→ avoids the problem by ignoring it</div>
</div>
<div class="approach-desc">Freeze the base model weights, only train a small adapter or head. Fast and cheap, but the base model never learns anything new. What it knew at training time is the ceiling of what it can ever know.</div>
</div>
<div class="approach current">
<div>
<div class="approach-name">Architecture-level solution</div>
<div class="approach-verdict">→ what soma does</div>
</div>
<div class="approach-desc">Design an architecture where learning is continuous by construction — fixed-size memory, no shared-weight collision, constant cost. The problem doesn't need workarounds because the architecture doesn't have the problem in the same form.</div>
</div>
</div>
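<p>As a concrete example of the first family above, the EWC penalty itself is one line of arithmetic: weigh each weight's movement by its estimated importance. The Fisher values here are made up for illustration; computing them properly requires the old task's data.</p>

```python
import numpy as np

# Sketch of the EWC idea: penalise moving weights that carried high
# Fisher information for the old task. Numbers are illustrative only.
def ewc_loss(task_loss, w, w_old, fisher, lam=1.0):
    """task_loss + (lam/2) * sum_i F_i * (w_i - w_old_i)^2"""
    return task_loss + 0.5 * lam * np.sum(fisher * (w - w_old) ** 2)

w_old  = np.array([1.0, -2.0, 0.5])
fisher = np.array([10.0, 0.01, 0.01])   # first weight mattered for the old task
w      = np.array([1.5, 0.0, 0.0])      # new training has moved all three

# Moving the important weight by 0.5 dominates the penalty; the other
# two moved further but barely count.
print(ewc_loss(0.0, w, w_old, fisher))
```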
<h2>Why the workarounds don't fully work</h2>
<p>Every workaround above is fighting the same thing: the architecture assumes learning happens once, offline, on a fixed dataset. Catastrophic forgetting is what you get when you try to violate that assumption without changing the architecture.</p>
<p>EWC and replay and progressive networks are impressive engineering. They make transformers and other static architectures more survivable in continual learning settings. But they're all paying a cost — in memory, in compute, in complexity — to compensate for a design that wasn't built for this.</p>
<div class="pull">
<p>The stability-plasticity dilemma isn't a fundamental law of neural networks. It's a consequence of a specific design choice: shared weights updated by gradient descent on a fixed objective. Change the design, and the dilemma changes shape.</p>
</div>
<h2>What an architectural solution looks like</h2>
<p>Soma approaches this differently. Instead of shared weights that get overwritten, soma uses a bank of geometrically spaced spectral traces — a fixed-size memory that represents information across multiple timescales simultaneously.</p>
<p>The key property: traces at different timescales decay at different rates. Fast traces capture recent patterns. Slow traces preserve long-range structure. Learning new things updates all timescales — but slow traces change slowly, so long-range structure is preserved by the architecture, not by an external penalty or replay mechanism.</p>
<p>This is why fixed memory is a feature rather than a constraint. The memory isn't a buffer that fills up — it's a spectral basis that continuously represents the stream it's seen. New data updates the representation without erasing the old one, because the old representation lives in the slow traces and the new one lives in the fast ones.</p>
<div class="callout">
<p><strong>The result:</strong> catastrophic forgetting in the classic sense doesn't occur, because there's no single shared weight being overwritten. Long-range structure is architecturally protected. Short-range adaptation is architecturally enabled. The stability-plasticity tradeoff has a different geometry.</p>
</div>
<h2>Why this took thirty years</h2>
<p>The honest answer is that catastrophic forgetting wasn't the bottleneck for most of the last thirty years. The bigger problem was building models capable enough to be worth preserving. Once you have a model that knows something valuable, forgetting becomes the thing you care about.</p>
<p>Now that frontier models are genuinely capable, the deployment problem is real. You have something worth preserving, and you need it to keep learning without destroying itself. The research attention has followed.</p>
<p>The other reason is that the transformer paradigm arrived and was so successful that most research energy went into scaling it rather than questioning its assumptions. Catastrophic forgetting is an assumption-level problem. It required stepping back from the paradigm to ask whether the shared-weight design was load-bearing.</p>
<p>It is — for attention. It isn't — for learning.</p>
<div class="cta-block">
<h3>An architecture without the problem</h3>
<p>Soma was built from scratch around continual learning as a first principle. One script, fixed memory, constant cost. Read the paper or download it and try it yourself.</p>
<a href="https://logossoma.com/paper" class="cta-btn">read the paper →</a>
</div>
<title>Why Large-Scale AI Converges Toward Continual Learning | logOS</title>
<meta name="description" content="The transformer paradigm has real structural limits. Here's an honest argument for why every serious AI application eventually needs what transformers can't provide — and what that means.">
<meta name="keywords" content="transformer limitations, continual learning AI, beyond transformers, AI architecture future, quadratic attention problem, static model problem, AI convergence">
<link rel="canonical" href="https://logossoma.com/blog/why-ai-converges-continual-learning">
<meta property="og:title" content="Why Large-Scale AI Converges Toward Continual Learning">
<meta property="og:description" content="The transformer paradigm has real structural limits. An honest argument for where serious AI applications eventually have to go.">
<meta property="og:type" content="article">
<meta property="og:url" content="https://logossoma.com/blog/why-ai-converges-continual-learning">
<header>
<span class="tag">perspective · architecture · long read</span>
<h1>Why I think every serious AI application eventually needs what transformers can't give it.</h1>
<p class="subtitle">This is an argument, not a product pitch. Transformers are remarkable. They also have structural limits that aren't engineering problems — they're architecture problems. Here's why those limits matter and where the logic points.</p>
<div class="meta">
<span>James Blight</span>
<span>logossoma.com</span>
<span>AAAI 2026</span>
</div>
</header>
<article>
<h2>A few things I need to say upfront</h2>
<p>I built soma. I have an obvious stake in arguing that transformers have limits. You should weight that appropriately.</p>
<p>What I'm going to try to do here is make the honest version of this argument — one that grants transformers everything they deserve, identifies the specific structural problems that can't be engineered away, and explains why those problems push toward continual learning as a matter of logic rather than preference.</p>
<p>If the argument is wrong, I'd genuinely like to know. The paper is at <a href="https://logossoma.com/paper">logossoma.com/paper</a>. The architecture is open. Push back.</p>
<h2>What transformers actually solved</h2>
<p>It's worth being precise about why transformers took over. Before attention mechanisms, sequence models had a specific failure mode: they forgot. RNNs and LSTMs compressed history into a fixed hidden state, which meant long-range dependencies degraded with distance. Attention solved this by letting every token directly attend to every other token — no compression, no degradation, full access to the sequence.</p>
<p>This was genuinely important. It's why transformers work so well on language, where long-range structure matters enormously — a pronoun at position 800 might refer to a noun at position 12.</p>
<p>The cost of this design is quadratic. Attending to everything means O(n²) computation in sequence length. At the scales transformers are now deployed, this is managed through enormous parallel hardware and careful engineering — but it's not free and it doesn't go away.</p>
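<p>The scaling difference is plain arithmetic. The snippet below counts pairwise attention interactions per layer against per-byte trace updates for K = 46 traces (soma's example configuration); the constants are rough, but the growth rates are the point.</p>

```python
K = 46   # number of traces in soma's example configuration

for n in (1_000, 10_000, 100_000):
    attention_pairs = n * n      # full attention: every token attends to every token
    trace_work = n * K           # fixed O(K) trace update per byte
    print(f"n={n:>7}: attention pairs {attention_pairs:.1e}, trace updates {trace_work:.1e}")
```

Each tenfold increase in sequence length multiplies the attention work by a hundred and the trace work by ten — the gap is the architecture, not the implementation.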
<div class="callout">
<p>The transformer's power and its cost come from the same place: full attention across the sequence. You can't have one without the other. This isn't an implementation detail — it's the architecture.</p>
</div>
<h2>The three structural limits</h2>
<div class="argument">
<div class="argument-num">limit <span>01</span></div>
<h3>Static weights after training</h3>
<p>A trained transformer doesn't learn. Its weights are frozen. Everything it knows was encoded during the training run — a single, expensive, offline process. This is fine if the world is static. The world is not static. Deploying a frozen model into a changing environment means its knowledge decays from the moment of deployment. Fine-tuning is a workaround: it requires new data, new compute, new deployment cycles. It doesn't solve the problem — it manages it at cost.</p>
</div>
<div class="argument">
<div class="argument-num">limit <span>02</span></div>
<h3>The context window is a ceiling, not a feature</h3>
<p>Context windows have grown dramatically — from thousands to millions of tokens in recent years. This is presented as progress, and in some ways it is. But the context window is also an acknowledgement that the model has no persistent memory: everything it needs to know about a conversation or document must fit within the window, right now, at inference time. When the window closes, it forgets. This is a fundamental property of the architecture, not a parameter you can tune away. Models with infinite context windows would have infinite inference cost.</p>
</div>
<div class="argument">
<div class="argument-num">limit <span>03</span></div>
<h3>Generalisation requires enormous scale</h3>
<p>Transformers learn by exposure. To generalise reliably, they need to see enormous amounts of data — which requires enormous training runs on enormous hardware. This is why frontier model training costs hundreds of millions of dollars. The scale isn't ambition; it's a requirement of the learning mechanism. A smaller transformer trained on less data is categorically less capable, not just quantitatively. This creates a structural barrier: building a useful transformer is only possible for organisations with serious compute budgets.</p>
</div>
<div class="pull">
<p>"Attention is all you need" was right for what transformers were solving. The question is whether it's still the right frame for what we need AI to do next.</p>
</div>
<h2>Why these limits aren't engineering problems</h2>
<p>The important distinction is between problems that can be solved with more engineering and problems that are load-bearing properties of the architecture. The three limits above are the second kind.</p>
<p>Quadratic attention cost can be reduced — sparse attention, linear attention, various approximations — but these all trade away the full-attention property that makes transformers powerful in the first place. Static weights can be worked around through retrieval augmentation, fine-tuning, or tool use — but none of these give the model the ability to actually learn from deployment. Scale requirements can be reduced through distillation and quantisation — but a distilled model's knowledge ceiling is the original model's ceiling.</p>
<p>These are approximations and workarounds. They don't change the underlying geometry.</p>
<div class="concession">
<div class="concession-label">the fair counterargument</div>
<p>Transformers may scale to capabilities where these limits don't matter in practice. A sufficiently capable frozen model might be useful enough that continual adaptation is unnecessary. Long context windows might approximate persistent memory well enough for most applications.</p>
<p>This is a legitimate position. My response is: "for most applications" does a lot of work in that sentence. There are large and growing classes of problems — robotics, edge AI, personalised systems, real-time adaptation — where these workarounds structurally cannot close the gap.</p>
</div>
<h2>What the logic requires</h2>
<p>If you want a model that:</p>
<ul style="list-style:none;padding:0;margin:0 0 22px;">
<li style="padding:6px 0;padding-left:20px;position:relative;color:var(--text-dim);">
<span style="position:absolute;left:0;color:var(--accent);">→</span> Learns continuously from deployment data without retraining
</li>
<li style="padding:6px 0;padding-left:20px;position:relative;color:var(--text-dim);">
<span style="position:absolute;left:0;color:var(--accent);">→</span> Runs at constant cost regardless of history length
</li>
<li style="padding:6px 0;padding-left:20px;position:relative;color:var(--text-dim);">
<span style="position:absolute;left:0;color:var(--accent);">→</span> Operates usefully on modest hardware without scale requirements
</li>
<li style="padding:6px 0;padding-left:20px;position:relative;color:var(--text-dim);">
<span style="position:absolute;left:0;color:var(--accent);">→</span> Maintains persistent memory without a context window ceiling
</li>
</ul>
<p>...then you need a different architecture. Not a better transformer. A different design that makes different tradeoffs from the ground up.</p>
<p>The architecture that satisfies these properties is one where learning is continuous, memory is fixed-size, and cost per step is constant. This is not a specific architecture — it's a set of constraints that any architecture satisfying these requirements must meet. Soma is one concrete instantiation. There will be others. The point is that the constraints themselves follow logically from the requirements.</p>
<h2>What this means for where AI goes</h2>
<p>I'm not predicting transformers disappear. They're too useful and too entrenched for that. What I'm predicting is that the applications where transformers hit structural walls — edge deployment, real-time adaptation, personalised systems, anything that needs to run without a data center — will converge toward continual architectures, because the alternative is perpetually engineering around limitations that are load-bearing.</p>
<p>This is already beginning to happen in the research literature. Mamba, RWKV, and related recurrent and state-space architectures are exploring the design space of fixed-cost sequence models. The research community is circling the same problem from different directions.</p>
<p>Soma's approach — byte-level learning, geometrically spaced spectral traces, fixed memory — is a specific, peer-reviewed answer to the same question. The paper was accepted at AAAI 2026. The mathematics is principled rather than ad hoc. The architecture runs on a Mac mini.</p>
<div class="callout">
<p><strong>The bet logOS is making</strong> is that the community of people who build with continual architectures will need a place to share what they've built — checkpoints, lineages, forks, experiments. We're building that place now, while the community is still small enough that being early matters.</p>
</div>
<h2>Why "time is all you need"</h2>
<p>The transformer paper is called "Attention Is All You Need." The title makes a claim about what the fundamental operation of intelligence is: attending to the right things in a sequence.</p>
<p>Soma's paper is called "Time Is All You Need." The claim is different: the fundamental operation is learning over time. Not attending to a fixed context, but accumulating understanding from a continuous stream. Not a snapshot of knowledge frozen at training, but a process that unfolds alongside experience.</p>
<p>This isn't a marketing distinction. It's a different theory of what intelligence requires. If the transformer title is right, then scaling attention is the path forward. If the soma title is right, then the path forward is architectures that learn continuously, remember efficiently, and operate at constant cost.</p>
<p>I think the second one is right. The argument above is why. The architecture is open. You can download it, train something on it, and see for yourself.</p>
<div class="cta-block">
<h3>Read the paper. Try the architecture.</h3>
<p>Time Is All You Need — AAAI 2026. One script, PyTorch only, runs on hardware you already have.</p>
<div class="cta-row">
<a href="https://logossoma.com/paper" class="cta-btn">read the paper →</a>
<a href="https://logossoma.com/api/download/soma.zip" class="cta-btn-ghost">download soma</a>
</div>
</div>
```
<title>Continual Learning for Robotics and Edge AI | logOS</title>
<meta name="description" content="Static models break in the real world. Here's why continual learning isn't a research curiosity — it's the only architecture that makes sense for robotics and edge deployment.">
<meta name="keywords" content="continual learning robotics, edge AI training, on-device learning, AI sensor data, train AI on sensor streams, local AI robotics, catastrophic forgetting">
<link rel="canonical" href="https://logossoma.com/blog/continual-learning-robotics-edge-ai">
<meta property="og:title" content="Continual Learning for Robotics and Edge AI">
<meta property="og:description" content="Static models break in the real world. Here's why continual learning is the only architecture that makes sense at the edge.">
<meta property="og:type" content="article">
<meta property="og:url" content="https://logossoma.com/blog/continual-learning-robotics-edge-ai">
<header>
<span class="tag">guide · robotics · edge AI</span>
<h1>Static models break in the real world.<br>Continual learning doesn't.</h1>
<p class="subtitle">Every robotics and edge AI deployment eventually hits the same wall: the world changes, the model doesn't. Here's why that's an architecture problem — and what solving it actually looks like.</p>
<div class="meta">
<span>James Blight</span>
<span>logossoma.com</span>
<span>AAAI 2026</span>
</div>
</header>
<article>
```
<h2>The wall every edge deployment hits</h2>
<p>You train a model. You deploy it to a robot, a sensor node, an embedded device. For a while it works. Then the environment shifts — new lighting conditions, a different operator, seasonal drift in sensor readings, hardware wear. The model was trained on the world as it was. It has no mechanism to adapt to the world as it is.</p>
<p>The standard solution is to periodically retrain on new data and redeploy. This works in a data center. It doesn't work at the edge, where connectivity is unreliable, compute is constrained, and the whole point was to operate independently.</p>
<div class="callout">
<p><strong>The fundamental mismatch:</strong> transformer-based models are trained once and frozen. The real world is not frozen. Every edge deployment is a bet that the gap between training distribution and deployment reality stays small enough to tolerate. Often it doesn't.</p>
</div>
<p>The standard response to this in the research literature is "continual learning" — techniques that allow a model to incorporate new information without forgetting old information. The problem is that most approaches to continual learning are bolt-ons: elastic weight consolidation, progressive neural networks, replay buffers. They fight against the static-weight assumption of the underlying architecture.</p>
<p>The result is a tradeoff. Learn too aggressively and the model overwrites what it knew; learn too conservatively and it never adapts. The failure mode at the aggressive end is catastrophic forgetting, and it has been an open problem for over thirty years precisely because the architectures weren't designed for learning to be ongoing.</p>
<p>The cost in production isn't just performance degradation. It's unpredictability. A model that sometimes forgets critical behaviour is worse than a model that never adapts — you can't reason about its failure modes.</p>
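<p>The tradeoff is visible even in the simplest possible online learner. As an illustration (a toy, not soma's mechanism): an exponential moving average tracking a signal whose distribution shifts. A large step size adapts within a few samples but is equally quick to overwrite; a small step size is stable but barely moves.</p>
<div class="code-label">toy illustration: the plasticity/stability tradeoff</div>

```python
# Toy illustration of the plasticity/stability tradeoff (not soma's mechanism).
# An exponential moving average tracks a signal whose mean jumps from 0.0 to 1.0.

def track(values, alpha):
    """Online mean estimate with step size alpha; one constant-cost update per value."""
    estimate = 0.0
    for v in values:
        estimate += alpha * (v - estimate)
    return estimate

old_regime = [0.0] * 100  # the environment before the shift
new_regime = [1.0] * 5    # a brief glimpse of the environment after it

fast = track(old_regime + new_regime, alpha=0.9)   # plastic: adapts in a few steps
slow = track(old_regime + new_regime, alpha=0.01)  # stable: barely moves

# fast ends near 1.0, but five noisy samples would have dragged it just as far;
# slow ends near 0.0: it remembers, but it hasn't adapted.
```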
<h2>What the edge actually needs</h2>
<p>Strip away the research framing and the requirements are clear:</p>
<div class="timeline">
<div class="timeline-item">
<div class="timeline-label">requirement 01</div>
<h3>Fixed memory footprint</h3>
<p>Edge devices have constrained RAM. A model that grows with data is a model that eventually fails. Memory must be bounded at architecture level, not managed around.</p>
</div>
<div class="timeline-item">
<div class="timeline-label">requirement 02</div>
<h3>Constant inference cost</h3>
<p>Quadratic attention cost is fine in a data center. At the edge, every watt and every millisecond matters. Inference cost must not scale with context or history length.</p>
</div>
<div class="timeline-item">
<div class="timeline-label">requirement 03</div>
<h3>Learning from any byte stream</h3>
<p>Sensor data isn't tokenised text. It's raw bytes — telemetry, control logs, audio, IMU readings. The model should learn directly from whatever the device produces, without a preprocessing pipeline.</p>
</div>
<div class="timeline-item">
<div class="timeline-label">requirement 04</div>
<h3>No connectivity dependency</h3>
<p>The model must learn and adapt on-device. A continual learning system that requires cloud sync for weight updates isn't continual — it's periodic fine-tuning with extra steps.</p>
</div>
</div>
<p>These aren't aspirational. They're the minimum viable properties for an AI system that genuinely operates at the edge without babysitting.</p>
<h2>Where soma fits</h2>
<p>soma was designed around exactly these constraints. It's a byte-level continual learning architecture — one Python script, PyTorch only — that learns from raw byte streams, maintains a fixed-size memory regardless of training duration, and runs inference at constant cost per byte.</p>
<p>The key architectural idea: a bank of geometrically spaced spectral traces provides a basis for representing information across timescales. This is what allows the model to retain long-range structure without growing its memory — the traces compress temporal relationships rather than storing them explicitly.</p>
<div class="scenario-grid">
<div class="scenario">
<div class="scenario-label">scenario</div>
<h3>Robot arm calibration</h3>
<p>Train on control logs. The model learns the arm's dynamics. As hardware wears, it adapts continuously — no redeployment cycle.</p>
</div>
<div class="scenario">
<div class="scenario-label">scenario</div>
<h3>Sensor anomaly detection</h3>
<p>Train on telemetry streams. Normal behaviour is learned continuously. Anomalies surface as prediction failure — no labelled dataset required.</p>
</div>
<div class="scenario">
<div class="scenario-label">scenario</div>
<h3>Edge audio processing</h3>
<p>Train on raw audio bytes from the deployment environment. The model learns the acoustic profile of that specific space, not a generalised one.</p>
</div>
<div class="scenario">
<div class="scenario-label">scenario</div>
<h3>Personal device adaptation</h3>
<p>Train on a user's own data — typing patterns, usage logs, documents. A model that genuinely knows one person, running locally, never leaving the device.</p>
</div>
</div>
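<p>The anomaly-detection scenario can be made concrete with a toy stand-in for the model. In the sketch below, a simple online next-byte predictor (add-one smoothed counts, not soma's architecture) assigns each byte a surprisal score: familiar transitions score low, never-seen ones score high. That score is the "prediction failure" the scenario describes.</p>
<div class="code-label">toy sketch: anomalies as prediction failure</div>

```python
import math
from collections import defaultdict

class ByteSurprise:
    """Toy online next-byte predictor (not soma): anomaly score = surprisal."""

    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))

    def observe(self, prev, nxt):
        """Score the transition prev to nxt, then learn from it."""
        ctx = self.counts[prev]
        total = sum(ctx.values())
        p = (ctx[nxt] + 1) / (total + 256)  # add-one smoothing over 256 byte values
        ctx[nxt] += 1
        return -math.log2(p)                # surprisal in bits: high means anomalous

det = ByteSurprise()
stream = b"abababab" * 50                   # "normal" telemetry: a strict ab pattern
for i in range(len(stream) - 1):
    det.observe(stream[i], stream[i + 1])

familiar = det.observe(ord("a"), ord("b"))  # a transition seen hundreds of times
novel = det.observe(ord("a"), ord("z"))     # a transition never seen before
# familiar scores low, novel scores high: no labelled dataset required
```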
<h2>In practice</h2>
<p>Training on a sensor stream looks like this:</p>
<div class="code-label">train on raw bytes from any source</div>
<pre><code>python soma.py train --input sensor_log.bin</code></pre>
<p>The model doesn't care what the bytes represent. It learns the statistical structure of whatever you feed it. Run inference the same way:</p>
<div class="code-label">inference at constant cost</div>
<pre><code>python soma.py chat --resume your_model.pt</code></pre>
<p>Memory stays fixed. Cost stays constant. The checkpoint is a single <code>.pt</code> file you can move to any device — a Mac mini, a Raspberry Pi with enough RAM, whatever you're deploying to.</p>
<div class="callout">
<p><strong>The checkpoint is the deployment artifact.</strong> No serving infrastructure, no container, no model registry. One file. Runs wherever PyTorch runs.</p>
</div>
<h2>Why this matters beyond robotics</h2>
<p>The edge AI framing is useful because the constraints are legible — limited memory, limited compute, no connectivity. But the same constraints apply anywhere you want AI to operate close to data rather than far from it.</p>
<p>A model running on your machine, learning from your data, never sending anything anywhere — that's not just an edge AI architecture. It's a different relationship between people and the models they use. Ownership instead of subscription. Specificity instead of generality. A model that gets better at your problem, not everyone's problem.</p>
<p>We think that's where this goes. Not because it's philosophically appealing, but because the economics and the architecture point there. Continual learning at fixed cost, running on hardware you already own, is strictly better than the alternative for a large class of problems. The alternative just happened to arrive first.</p>
<div class="cta-block">
<h3>Download soma and start training</h3>
<p>One script. PyTorch only. Runs on a Mac mini or your existing gaming PC. Browse community checkpoints or start from scratch.</p>
<a href="https://logossoma.com/api/download/soma.zip" class="cta-btn">download soma →</a>
</div>
```
<title>How to Train Your Own AI Model Locally | logOS</title>
<meta name="description" content="You don't need a data center. Here's how to train a real AI model on hardware you already own — and why the architecture matters more than the compute.">
<meta name="keywords" content="train AI locally, how to train AI, train AI on Mac, local AI training, continual learning AI, train your own AI model">
<link rel="canonical" href="https://logossoma.com/blog/train-ai-locally">
<!-- OG -->
<meta property="og:title" content="How to Train Your Own AI Model Locally">
<meta property="og:description" content="You don't need a data center. Here's how to train a real AI model on hardware you already own.">
<meta property="og:type" content="article">
<meta property="og:url" content="https://logossoma.com/blog/train-ai-locally">
<header>
<span class="tag">guide · local AI training</span>
<h1>You can train your own AI.<br>Here's what that actually means.</h1>
<p class="subtitle">Most people think training AI requires a data center. It doesn't — if the architecture is right. Here's an honest look at what's involved, what the bottlenecks really are, and why they're not what you think.</p>
<div class="meta">
<span>James Blight</span>
<span>logossoma.com</span>
<span>AAAI 2026</span>
</div>
</header>
<article>
```
<h2>The assumption everyone makes</h2>
<p>When people ask "can I train my own AI?", they usually mean: can I fine-tune a model someone else built? Run LoRA on a quantized checkpoint? That's not training — that's decoration.</p>
<p>Real training means the model learns from data you give it, building internal representations from scratch. The reason people assume this requires racks of A100s is that the dominant architecture — the transformer — genuinely does. Not because compute is inherently expensive, but because the transformer's design makes it so.</p>
<div class="callout">
<p><strong>The actual bottleneck isn't compute. It's architecture.</strong> Transformers have quadratic attention cost, static weights after training, and require enormous batches to learn efficiently. Fix those and the compute story changes completely.</p>
</div>
<h2>Why transformers need so much hardware</h2>
<p>Transformer attention is O(n²) in sequence length. Every token attends to every other token. This is what makes them powerful — and what makes them expensive. As context grows, cost explodes.</p>
<p>More importantly: once a transformer is trained, it's frozen. The weights don't update during inference. This means everything the model will ever know has to be baked in upfront, during a single expensive training run on an enormous dataset. The scale isn't optional — it's load-bearing.</p>
<p>If you want to teach it something new, you retrain. Or fine-tune, which is a workaround, not a solution.</p>
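<p>The scaling difference is easy to make concrete. A back-of-envelope comparison in relative units (work per step, not measured FLOPs):</p>
<div class="code-label">back-of-envelope: quadratic vs constant cost</div>

```python
# Back-of-envelope scaling comparison (relative units, not measured FLOPs).

def attention_work(n):
    """Per-step work for full self-attention over n tokens: every token
    attends to every other token, so work grows as n squared."""
    return n * n

def continual_work(n):
    """Per-step work for a fixed-state continual model: constant,
    independent of how much history has been seen."""
    return 1

base = attention_work(1_000)
for n in (1_000, 2_000, 4_000, 8_000):
    print(f"context {n}: attention {attention_work(n) // base}x, continual {continual_work(n)}x")
# context 1000: attention 1x, continual 1x
# context 2000: attention 4x, continual 1x
# context 4000: attention 16x, continual 1x
# context 8000: attention 64x, continual 1x
```

Doubling the context quadruples attention cost while the continual column never moves, and the gap only widens from there.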
<div class="compare">
<div class="compare-box">
<div class="compare-label">Transformer</div>
<ul>
<li>Quadratic attention cost</li>
<li>Static after training</li>
<li>Requires massive datasets upfront</li>
<li>Context window is a hard limit</li>
<li>Fine-tuning is expensive and lossy</li>
</ul>
</div>
<div class="compare-box highlight">
<div class="compare-label">Continual architecture</div>
<ul>
<li>Constant cost per byte</li>
<li>Learns continuously from any stream</li>
<li>Useful with modest data</li>
<li>No context window concept</li>
<li>Every inference updates the model</li>
</ul>
</div>
</div>
<h2>What continual learning actually changes</h2>
<p>A model that learns continuously isn't just cheaper to run. It's a fundamentally different kind of object. It's not a static artifact you query — it's a process that unfolds alongside whatever it's learning from.</p>
<p>This matters for training on a personal machine because the economics flip. You're not trying to compress the internet into weights in a weekend. You're feeding your model a stream of data — your data — and it accumulates understanding over time. Like you do.</p>
<p>The practical upshot: useful models don't require enormous training runs. They require sustained, focused learning on data that actually matters for your use case.</p>
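<p>Concretely, training-as-a-process has the shape below. This is a minimal sketch with a hypothetical stand-in model, not soma's actual code: the point is the loop, where every byte triggers one constant-cost update and memory never grows.</p>
<div class="code-label">sketch: the continual training loop</div>

```python
# Minimal shape of continual byte-level training. OnlineByteModel is a
# hypothetical stand-in (a smoothed frequency model), not soma's class.

class OnlineByteModel:
    """Fixed memory: 256 counts, no matter how many bytes are seen."""

    def __init__(self):
        self.counts = [1] * 256  # add-one smoothed byte frequencies
        self.total = 256

    def update(self, byte):
        """One constant-cost learning step; returns the probability the
        model assigned to this byte before seeing it."""
        p = self.counts[byte] / self.total
        self.counts[byte] += 1
        self.total += 1
        return p

def stream_bytes(path, chunk_size=4096):
    """Yield a file's bytes one at a time: the 'stream of data' in the text."""
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            yield from chunk

model = OnlineByteModel()
# for b in stream_bytes("your_data.txt"):  # any file at all
#     model.update(b)                      # the model improves as it reads
```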
<h2>What you actually need</h2>
<table>
<thead>
<tr>
<th>Hardware</th>
<th>What it gets you</th>
<th>Realistic use</th>
</tr>
</thead>
<tbody>
<tr>
<td>Mac mini (M-series)</td>
<td>Unified memory, efficient inference</td>
<td>Training on text, code, small sensor streams</td>
</tr>
<tr>
<td>Gaming PC (RTX 3090+)</td>
<td>VRAM for larger hidden dims</td>
<td>Faster training, larger models</td>
</tr>
<tr>
<td>Any modern laptop</td>
<td>CPU inference at constant cost</td>
<td>Inference, light training</td>
</tr>
</tbody>
</table>
<p>The key property you want: <strong>constant inference cost per byte</strong>. With a transformer, cost grows with context. With a continual architecture, it doesn't. This is what makes local deployment viable — not just technically, but economically.</p>
<h2>What to train on</h2>
<p>This is where it gets interesting. Because the model learns from raw bytes — not tokenised, preprocessed, formatted text — the input can be almost anything: source code, personal notes and documents, server logs, telemetry, raw audio.</p>
<p>The discipline isn't feeding it everything — it's feeding it the right things. A model trained on a narrow, high-quality corpus often outperforms one trained broadly on noise.</p>
<h2>Getting started with soma</h2>
<p>soma is a continual-learning architecture — one Python script, PyTorch only — built around these principles. It learns byte-by-byte, maintains a fixed-size memory, and runs inference at constant cost. No context window. No transformer attention.</p>
<p>It was presented at AAAI 2026. The mathematics behind it — geometrically spaced spectral traces as a basis for temporal memory — is motivated and peer-reviewed, not a hack.</p>
<div class="code-label">train on your data</div>
<pre><code>python soma.py train --input your_data.txt</code></pre>
<div class="code-label">chat with what you built</div>
<pre><code>python soma.py chat --resume your_model.pt</code></pre>
<p>That's it. One script. A Mac mini is enough to train something genuinely useful.</p>
<div class="callout">
<p><strong>Checkpoints are portable.</strong> A trained soma model is a single <code>.pt</code> file. You can share it, fork it, build on someone else's — each with a deterministic ID so lineage is always verifiable.</p>
</div>
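<p>One natural way such a deterministic ID can be constructed (an assumption for illustration, not necessarily the scheme logOS uses) is a content hash of the checkpoint file: the same bytes always produce the same ID, so a lineage claim can be re-verified by re-hashing.</p>
<div class="code-label">sketch: a content-hash checkpoint ID</div>

```python
import hashlib

def checkpoint_id(path):
    """Deterministic checkpoint ID as a truncated SHA-256 of the file's bytes.

    Illustrative, not necessarily the scheme logOS uses: identical bytes
    always hash to the same ID on any machine, so a lineage claim can be
    checked by re-hashing the checkpoint.
    """
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            h.update(chunk)
    return h.hexdigest()[:16]
```

Truncating to 16 hex characters keeps IDs readable while leaving collisions vanishingly unlikely at community scale; verifying a fork's claimed parent is then a one-line comparison of stored and recomputed IDs.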
<h2>The bigger picture</h2>
<p>The question "can I train my own AI?" has always had a hidden assumption: that AI training requires infrastructure only large organisations can afford. That assumption is architecture-dependent, not fundamental.</p>
<p>Most of that infrastructure exists to service the transformer's design: quadratic attention, frozen weights, one-shot training at enormous scale. A continual architecture with constant memory and constant cost removes a significant fraction of that overhead, even before accounting for what continual learning unlocks for applications.</p>
<p>We think every large-scale AI application will eventually converge on something like this. Not as an article of faith, but as a consequence of the underlying logic. Fixed memory. Constant cost. Learning that never stops.</p>
<p>The hardware you already own is enough to be part of that.</p>
<div class="cta-block">
<h3>Browse soma checkpoints on logOS</h3>
<p>Download a trained model, continue training it, share what you build. Free to browse. No compute required beyond your own machine.</p>
<a href="https://logossoma.com/checkpoints" class="cta-btn">explore checkpoints →</a>
</div>
```