<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
<channel>
<title>Irvine Afri Dwicahya</title>
<link>https://irvineafri.com</link>
<description>Backend engineer. Payments, lending, and the boring parts of distributed systems.</description>
<atom:link href="https://irvineafri.com/rss.xml" rel="self" type="application/rss+xml"/>
<item>
  <title>Porting a Next.js app to React Native overnight, with an agent loop</title>
  <link>https://irvineafri.com/blog/porting-to-react-native-overnight-with-an-agent-loop</link>
  <guid isPermaLink="true">https://irvineafri.com/blog/porting-to-react-native-overnight-with-an-agent-loop</guid>
  <pubDate>Tue, 12 May 2026 00:00:00 +0000</pubDate>
  <description>How a single-threaded fleet of Claude Code agents walked a 115-task queue and shipped 50 tasks (5 of 12 milestones) to main while I slept, without ever touching git.</description>
  <content:encoded><![CDATA[<blockquote>
<p>How I ran a single-threaded fleet of Claude Code agents against a 115-task queue and got 50 tasks (5 of 12 milestones) shipped to <code>main</code> while I slept, without ever letting an agent touch git.</p>
</blockquote>
<p>This is a working journal, not a polished pitch. The system is still running as I write. The numbers below are accurate as of the most recent milestone push.</p>
<h2 id="the-setup">The setup</h2>
<p>I run KelasJenius, an interactive learning platform for Indonesian SMP/SMA students. The web app (Next.js 14 + Fastify + Postgres) is mature: auth, subscriptions, quizzes, duels over WebSocket, AI tutor, parent portal, the works.</p>
<p>The mobile app needs to ship to the App Store and Google Play with 1-for-1 parity to the web student experience, plus native Apple IAP and Google Play Billing. That's a ~15-week solo project at full-time pace. I do not have 15 weeks. I have nights and weekends.</p>
<p>So I built a system that lets a swarm of one agent at a time, working in sequence, walk a dependency-ordered task queue while I sleep.</p>
<p>In the last ~36 hours of wall-clock time the loop has shipped:</p>
<ul>
<li>M0 (backend prep), 7 tasks: bearer-token auth, WS auth via query param, <code>/api/version/check</code>, device registration, IAP migration scaffolding, CORS for native, refresh-in-body</li>
<li>M1 (mobile foundation), 11 tasks: Expo SDK 54 + RN 0.81.5 scaffold, monorepo wiring, NativeWind, providers, MMKV+TanStack, <code>apiFetch</code> with sliding renewal, smoke test</li>
<li>M2 (design system <code>packages/ui-mobile</code>), 9 tasks: tokens, primitives (KjButton/Pressable/Card/Screen/Text/Input/Skeleton), 506 SVG icons ported (30 by hand, 476 by codemod), theme provider, toast, motion (KjXpPopup/KjStreakFlame), data badges, dev gallery</li>
<li>M3 (auth shell), 8 tasks: login, register, forgot/reset/verify-email, <code>useCurrentUser</code>, profile tab + settings sheet, force-upgrade gate</li>
<li>M4 (core content), 9 tasks: KaTeX-via-WebView, lesson reader, dashboard, subjects tree, paywall placeholder, offline cache, offline banner</li>
<li>M5 (quiz), in progress, 4 of 10 done: state machine + 32 tests, session screen UI, confirm-before-submit, reveal animations next</li>
</ul>
<p>That's 50 of 115 tasks, with 1,565+ tests green at the most recent milestone gate.</p>
<p><img src="/posts/porting-to-react-native-overnight-with-an-agent-loop/proof2.png" alt="TASKS.md status counts at the M4 milestone push, timestamped"></p>
<p>The shipping isn't the interesting part. The interesting part is how few design choices it took to make the work boring enough to ship overnight without me at the keyboard.</p>
<h2 id="the-shape-of-the-problem">The shape of the problem</h2>
<p>Long-horizon agent work fails in three predictable places.</p>
<p><strong>Drift across tasks.</strong> Agent #4 builds a thing on top of Agent #2's misunderstanding of the spec. The error compounds.</p>
<p><strong>Untracked state.</strong> &quot;Which tasks are done? Which are blocked? What did the last agent change?&quot; If the answers live in chat scrollback you've already lost.</p>
<p><strong>Git becomes the contention point.</strong> Twelve agents force-pushing over each other, or one agent amending a commit a downstream agent already pulled. The repo's history is the single most valuable artifact, and touching it carelessly destroys the run.</p>
<p>The system I'll describe addresses each of those head-on. The architecture is boring on purpose.</p>
<h2 id="the-three-artifacts-that-run-everything">The three artifacts that run everything</h2>
<p>Everything the loop needs lives in three places under <code>apps/mobile/plans/</code>:</p>
<pre><code>apps/mobile/plans/
├── TASKS.md                          ← the queue (115 rows, 12 milestones)
├── logs/agents.log                   ← append-only audit log
├── 00-README.md                      ← the master plan
├── 01-architecture-decisions.md      ← non-negotiables, locked
├── 02-phase-0-backend-prep.md        ← preconditions + concrete tasks
├── 03-phase-1-foundation.md
├── 04-phase-2-design-system.md
├── 05-phase-3-auth-shell.md
├── 06-phase-4-core-content.md
├── 07-phase-5-quiz-daily.md
├── 08-phase-6-duel-realtime.md
├── 09-phase-7-social-leaderboard.md
└── 10-phase-8-advanced.md
</code></pre>
<h3 id="tasksmd-the-queue">TASKS.md, the queue</h3>
<p>The queue is a flat markdown table per phase. Every row is one task with: ID, title, link to a spec section in the phase doc, deps, status, commit SHA, last-updated timestamp.</p>
<pre><code>| ID    | Title                       | Spec link                              | Deps        | Status      | Commit  | Updated              |
| T4.6  | Lesson reader screen        | 06-phase-4-core-content.md#task-46     | T4.2, T4.5  | done        | 3fd8fa6 | 2026-05-12T12:30:00Z |
| T5.1a | Quiz state machine module   | 07-phase-5-quiz-daily.md#task-51       | T4.6        | done        |         | 2026-05-12T15:27:00Z |
| T5.1b | Quiz session screen UI      | 07-phase-5-quiz-daily.md#task-51       | T5.1a, T2.7 | done        |         | 2026-05-12T16:45:00Z |
| T5.1c | Confirm-before-submit flow  | 07-phase-5-quiz-daily.md#task-51       | T5.1b       | done        |         | 2026-05-12T17:30:00Z |
| T5.1d | Reveal animations + haptics | 07-phase-5-quiz-daily.md#task-51       | T5.1c       | in_progress |         | 2026-05-12T18:00:00Z |
</code></pre>
<p>Four statuses: <code>todo</code>, <code>in_progress</code>, <code>done</code>, <code>blocked</code> (external).</p>
<p>Status counts live at the top of the file and must equal working-tree truth, not git history. The orchestrator and any watchdog reconcile against the working-tree file:</p>
<pre><code>- todo: 51
- in_progress: 1
- done: 53
- blocked (external): 10
- Total: 115
</code></pre>
<p>This is the only shared state between the orchestrator and any agent. There is no database, no task service, no Jira sync. The file is the queue.</p>
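<p>A minimal TypeScript sketch of that reconciliation, assuming the seven-column row layout shown above. <code>parseRows</code> and <code>countStatuses</code> are hypothetical helper names for illustration, not functions from the repo. The point is only that the counts at the top of the file can always be recomputed from the rows and compared.</p>
<pre><code>// Hypothetical helpers: recount statuses from the TASKS.md table rows.
type Status = "todo" | "in_progress" | "done" | "blocked";

interface TaskRow {
  id: string;
  deps: string[];
  status: Status;
}

const STATUSES: Status[] = ["todo", "in_progress", "done", "blocked"];

function parseRows(markdown: string): TaskRow[] {
  const rows: TaskRow[] = [];
  for (const line of markdown.split("\n")) {
    const cells = line.split("|").map((c) => c.trim());
    // 7 columns plus the empty strings outside the edge pipes.
    if (cells.length !== 9) continue;
    const id = cells[1];
    const status = STATUSES.find((s) => cells[5].indexOf(s) === 0);
    // Skips the header row and anything else that is not a task row.
    if (!/^T\d/.test(id) || status === undefined) continue;
    const deps = cells[4].split(",").map((d) => d.trim()).filter((d) => d !== "");
    rows.push({ id, deps, status });
  }
  return rows;
}

function countStatuses(rows: TaskRow[]) {
  const counts = { todo: 0, in_progress: 0, done: 0, blocked: 0 };
  for (const row of rows) counts[row.status] += 1;
  return counts;
}

// Three rows copied from the table above, plus its header.
const sampleTable = [
  "| ID    | Title | Spec link | Deps | Status | Commit | Updated |",
  "| T4.6  | Lesson reader screen | 06-phase-4-core-content.md#task-46 | T4.2, T4.5 | done | 3fd8fa6 | 2026-05-12T12:30:00Z |",
  "| T5.1a | Quiz state machine module | 07-phase-5-quiz-daily.md#task-51 | T4.6 | done |  | 2026-05-12T15:27:00Z |",
  "| T5.1d | Reveal animations + haptics | 07-phase-5-quiz-daily.md#task-51 | T5.1c | in_progress |  | 2026-05-12T18:00:00Z |",
].join("\n");

const sampleCounts = countStatuses(parseRows(sampleTable));
console.log(sampleCounts);
</code></pre>
<p>Comparing the recomputed counts with the numbers written at the top of the file is the whole check. A mismatch means a row was edited without the header being updated.</p>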
<h3 id="agentslog-the-audit-trail">agents.log, the audit trail</h3>
<p>Each time an agent returns, the orchestrator appends one structured block:</p>
<pre><code>## 2026-05-12T08:25:00Z — T5.1a done (milestone-pending)
- agent: a8ae91c3e6ac30d62
- duration: 12m 11s
- files: apps/mobile/lib/quiz/quizMachine.ts (new — pure reducer + useQuizMachine hook),
         apps/mobile/lib/quiz/__tests__/quizMachine.test.ts (new, 32 tests),
         apps/mobile/plans/TASKS.md
- tests: 32 new tests added; workspace total now 57
- summary: Quiz state machine shipped. Pattern: useReducer + pure reducer + custom hook.
  ... [paragraph of substance: what changed, what was decided, why] ...
  Critical API correction: spec mentioned `GET /sessions/:id/next-question` but that
  endpoint does NOT exist in the live API. I verified against `apps/api/src/routes/sessions.ts`.
  The actual web flow loads all questions upfront via
  `GET /subjects/:s/topics/:t/subtopics/:st/questions`. The machine loads the full
  question array at startSession and advances client-side.
  Bug fixed during review: `questionsAnswered` was off-by-one; corrected to length.
- notes: M5 status 1/7 done. Handoff to T5.1b: consume useQuizMachine, call startSession
  on mount, observe state, drive selection via selectAnswer(optionId), submission via
  submitAnswer()...
</code></pre>
<p><code>milestone-pending</code> is a placeholder. When the milestone pushes, the orchestrator rewrites these to the real short SHA in a follow-up commit.</p>
<p>The <code>notes:</code> field at the bottom is the most important part. Every agent ends its entry with a handoff to the next agent: the precondition the next task can rely on, the path the previous agent actually used (not what the spec said), and any decision the next agent does not need to re-litigate.</p>
<p>This is how drift gets contained. The next agent reads the last 1–3 log entries before claiming, so it inherits a precise mental model of what is true in the working tree right now instead of reasoning from the spec alone.</p>
<p><img src="/posts/porting-to-react-native-overnight-with-an-agent-loop/proof3.png" alt="A tail of the real agents.log, timestamps showing the overnight run"></p>
<h3 id="the-phase-docs">The phase docs</h3>
<p>Each phase doc is self-contained. It states:</p>
<ul>
<li>Preconditions (&quot;M3 must be green; bearer auth must exist&quot;)</li>
<li>Concrete file paths (&quot;create <code>app/(dashboard)/subjects/[slug]/[topicSlug]/[subtopicSlug]/index.tsx</code>&quot;)</li>
<li>Code sketches, just enough to anchor the structure, never enough to copy-paste</li>
<li>Acceptance checklist (&quot;done when: WebView renders KaTeX block math correctly, light/dark theme switches without remount lag, no console errors&quot;)</li>
</ul>
<p>The locked-in <code>01-architecture-decisions.md</code> is the bedrock: Expo SDK 54, New Architecture on, NativeWind v4, TanStack Query + MMKV, expo-secure-store for JWT, KaTeX via WebView. An agent that proposes Zustand or AsyncStorage for tokens gets reverted.</p>
<h2 id="the-execution-model-and-how-it-evolved">The execution model (and how it evolved)</h2>
<p>I tried three workflows in the first 24 hours. The third one stuck.</p>
<h3 id="attempt-1-branch--pr--merge-bot-per-task">Attempt 1: branch + PR + merge-bot per task</h3>
<p>Each task spawns a worktree, the agent works, opens a PR, a merge-bot watches CI and merges.</p>
<p>Why it failed: per-task PRs created 100+ tiny PRs and a merge queue. Cross-task drift surfaced in PR review, which the agent then had to relitigate. Cognitive cost per task was too high to be worth the audit trail.</p>
<h3 id="attempt-2-main-only-direct-push-per-task">Attempt 2: main-only direct push per task</h3>
<p>Each agent works directly on <code>main</code>, runs <code>pnpm verify</code>, commits and pushes if green.</p>
<p>Why it failed: two problems. Rollback granularity was per-task, which is fine if a single agent broke something but useless if a <em>sequence</em> of agents had compounded a subtle error. And the git log was an unreadable wall of 50+ commits per evening, with the actual feature unit (e.g. &quot;M4 core content&quot;) spread across nine commits and three days of intermediate state.</p>
<p>There was also an integrity issue: agents occasionally forgot to flip TASKS.md to <code>done</code>, and the orchestrator's bookkeeping had to chase the agent's git history instead of the agent's reported state.</p>
<h3 id="attempt-3-main-only-milestone-batched-no-git-agents-current">Attempt 3: main-only, milestone-batched, no-git agents (current)</h3>
<p>This is the model that's been running. The rules:</p>
<ol>
<li>Agents never write to git: no commits, no pushes, no branches. They edit code, run <code>pnpm verify</code>, edit <code>TASKS.md</code> in the working tree, and return.</li>
<li>The only git command an agent may run is the initial <code>git switch main &amp;&amp; git pull --rebase origin main</code> at start.</li>
<li>When all tasks in a milestone group reach <code>status=done</code> in the working tree, the orchestrator captures one squash-style commit per milestone and pushes.</li>
<li>A separate <code>chore(plans): record M&lt;n&gt; SHA</code> follow-up commit backfills the SHA into the <code>Commit</code> column of each row in that milestone.</li>
</ol>
<p>The orchestrator loop is about a dozen lines:</p>
<pre><code>loop:
  assert in_progress == 0                # working-tree TASKS.md, not origin/main
  next = lowest-numbered todo with all Deps == done
  if next is None:
    if a complete milestone is unpushed:   push milestone (see below), goto loop
    elif any blocked exist:                surface blocker, halt loop
    else:                                  all done, exit loop
  spawn 1 agent on `next`                 # agent does NOT touch git beyond initial pull
  wait for agent to return with status=done in working-tree TASKS.md
  append entry to apps/mobile/plans/logs/agents.log (working tree, no commit)
  if this task is the last in its milestone group: push milestone
  goto loop
</code></pre>
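<p>The selection step in that loop can be sketched in TypeScript as follows. This is an illustration under assumptions, not the orchestrator's actual code: <code>pickNext</code> and <code>compareIds</code> are hypothetical names, the sample tasks are simplified, and "lowest-numbered" is assumed to mean ordering by phase number, then task number, then letter suffix.</p>
<pre><code>// Hypothetical sketch of "lowest-numbered todo with all Deps == done".
type TaskStatus = "todo" | "in_progress" | "done" | "blocked";

interface Task {
  id: string;
  deps: string[];
  status: TaskStatus;
}

// "T5.1a" becomes { phase: 5, task: 1, suffix: "a" } so T5.10 sorts after T5.9.
function sortKey(id: string) {
  const m = /^T(\d+)\.(\d+)([a-z]*)$/.exec(id);
  if (m === null) throw new Error("unexpected task id: " + id);
  return { phase: Number(m[1]), task: Number(m[2]), suffix: m[3] };
}

function compareIds(a: string, b: string): number {
  const ka = sortKey(a);
  const kb = sortKey(b);
  if (ka.phase !== kb.phase) return ka.phase - kb.phase;
  if (ka.task !== kb.task) return ka.task - kb.task;
  return ka.suffix.localeCompare(kb.suffix);
}

function pickNext(tasks: Task[]): Task | null {
  // Mirrors the loop's precondition: never select while a row is in flight.
  if (tasks.some((t) => t.status === "in_progress")) {
    throw new Error("precondition failed: a row is still in_progress");
  }
  const done = new Set(tasks.filter((t) => t.status === "done").map((t) => t.id));
  const ready = tasks
    .filter((t) => t.status === "todo")
    .filter((t) => t.deps.every((d) => done.has(d)))
    .sort((a, b) => compareIds(a.id, b.id));
  return ready.length > 0 ? ready[0] : null;
}

// Simplified sample queue; T5.10 is made up to show numeric ordering.
const sampleTasks: Task[] = [
  { id: "T2.7", deps: [], status: "done" },
  { id: "T4.6", deps: [], status: "done" },
  { id: "T5.1a", deps: ["T4.6"], status: "done" },
  { id: "T5.10", deps: ["T4.6"], status: "todo" },
  { id: "T5.1b", deps: ["T5.1a", "T2.7"], status: "todo" },
  { id: "T5.1c", deps: ["T5.1b"], status: "todo" },
];

const picked = pickNext(sampleTasks);
console.log(picked === null ? "nothing ready" : picked.id);
</code></pre>
<p>A <code>null</code> result is what sends the loop into its milestone, blocker, or exit branch.</p>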
<p>The milestone push itself:</p>
<pre><code>1. git status                            # confirm working tree contains only milestone code + TASKS + log
2. git add -A
3. pnpm verify                           # one more time — catches integration drift between tasks
4. git commit -m &quot;&lt;prefix&gt;: &lt;name&gt; — Tx.y..Tx.z [M&lt;n&gt;]&quot; with co-author trailer
5. capture short SHA
6. backfill SHA into TASKS.md Commit columns; replace milestone-pending in agents.log
   chore(plans): record M&lt;n&gt; SHA (&lt;sha&gt;)
7. git push origin main
</code></pre>
<p>This trades rollback granularity (you can only revert a whole milestone) for shippable units (every commit on <code>main</code> is a complete, tested, parity-checked feature group). Per-task <code>pnpm verify</code> is still the per-task quality gate; the per-milestone re-verify catches anything that snuck between tasks.</p>
<p>The per-milestone re-verify has caught a real bug exactly once so far: a type drift between T2.3c and T2.4a where a primitive's prop signature shifted while icons were being ported. The orchestrator fixed it inline as part of the milestone commit (no separate task) and noted the drift in the log.</p>
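<p>The SHA backfill in step 6 is mechanical enough to sketch. The TypeScript below is a hypothetical illustration, not the orchestrator's code: <code>backfillSha</code> is an invented name, the SHA is a placeholder, and it assumes the seven-column row layout shown earlier with task IDs that start with <code>T</code> followed by the milestone number and a dot.</p>
<pre><code>// Hypothetical sketch: fill the empty Commit cell of every done row in one milestone.
function backfillSha(markdown: string, milestone: number, sha: string): string {
  const prefix = "T" + milestone + ".";
  return markdown
    .split("\n")
    .map((line) => {
      const cells = line.split("|");
      // 7 columns plus the empty strings outside the edge pipes.
      if (cells.length !== 9) return line;
      const isTarget = cells[1].trim().indexOf(prefix) === 0;
      const isDone = cells[5].trim() === "done";
      const isEmpty = cells[6].trim() === "";
      if (!isTarget || !isDone || !isEmpty) return line;
      cells[6] = " " + sha + " ";
      return cells.join("|");
    })
    .join("\n");
}

const beforeBackfill = [
  "| T4.6  | Lesson reader screen | 06-phase-4-core-content.md#task-46 | T4.2, T4.5 | done | 3fd8fa6 | 2026-05-12T12:30:00Z |",
  "| T5.1a | Quiz state machine module | 07-phase-5-quiz-daily.md#task-51 | T4.6 | done |  | 2026-05-12T15:27:00Z |",
  "| T5.1d | Reveal animations + haptics | 07-phase-5-quiz-daily.md#task-51 | T5.1c | in_progress |  | 2026-05-12T18:00:00Z |",
].join("\n");

// "abc1234" stands in for the captured short SHA.
const afterBackfill = backfillSha(beforeBackfill, 5, "abc1234");
console.log(afterBackfill);
</code></pre>
<p>Only the done M5 row changes. The M4 row keeps its existing SHA and the in-progress row stays empty. Column padding is not preserved, which is harmless in rendered markdown.</p>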
<p><video controls muted playsinline preload="metadata" src="/posts/porting-to-react-native-overnight-with-an-agent-loop/video-ai-agent.mp4" aria-label="A timelapse of the agent loop running overnight"></video></p>
<h2 id="the-pnpm-verify-gate-the-only-quality-contract">The <code>pnpm verify</code> gate, the only quality contract</h2>
<p>There is no PR review. There are no human checkpoints during the loop. <code>pnpm verify</code> is the single quality contract. It runs:</p>
<table>
<thead>
<tr>
<th>Gate</th>
<th>Mechanism</th>
<th>Scope</th>
</tr>
</thead>
<tbody>
<tr>
<td>Emoji ban</td>
<td>grep over Unicode ranges, allowlisted paths</td>
<td>every UI workspace (incl. <code>apps/mobile</code>)</td>
</tr>
<tr>
<td>type-check</td>
<td><code>pnpm -r --if-present run type-check</code> (<code>tsc --noEmit</code>)</td>
<td>every workspace</td>
</tr>
<tr>
<td>lint</td>
<td><code>pnpm -r --if-present run lint</code> (eslint / next lint / expo lint)</td>
<td>every workspace</td>
</tr>
<tr>
<td>unit tests</td>
<td><code>pnpm --filter &lt;ws&gt; test</code></td>
<td>14 workspaces</td>
</tr>
<tr>
<td>build</td>
<td><code>pnpm -r --if-present run build</code> (skipped pre-commit, runs on verify / pre-push)</td>
<td>every workspace</td>
</tr>
</tbody>
</table>
<p>It runs automatically on <code>git commit</code> via <code>core.hooksPath=.githooks</code>. Emergency override is <code>VERIFY_SKIP=1</code>. The rule: use it only for genuine fires, and fix the root cause as the very next piece of work.</p>
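<p>As an illustration of the first gate in the table, here is a small TypeScript sketch of an emoji scan with an allowlist. It is not the contents of <code>scripts/verify.sh</code>: the function name, the Unicode ranges, and the sample paths are all assumptions chosen for the example.</p>
<pre><code>// Assumed ranges: U+2600..U+27BF (misc symbols, dingbats) plus the
// surrogate pairs that encode U+1F000..U+1FBFF.
const EMOJI = /[\u2600-\u27BF]|[\uD83C-\uD83E][\uDC00-\uDFFF]/;

interface SourceFile {
  path: string;
  text: string;
}

// Returns "path:line" for every line containing a character in the assumed ranges.
function emojiViolations(files: SourceFile[], allowlist: string[]): string[] {
  const hits: string[] = [];
  for (const file of files) {
    // Allowlisted path prefixes are skipped entirely.
    if (allowlist.some((prefix) => file.path.indexOf(prefix) === 0)) continue;
    file.text.split("\n").forEach((line, index) => {
      if (EMOJI.test(line)) hits.push(file.path + ":" + (index + 1));
    });
  }
  return hits;
}

// Hypothetical files; the emoji are written as escapes to keep this listing plain ASCII.
const sampleFiles: SourceFile[] = [
  { path: "apps/mobile/app/index.tsx", text: "const title = 'Hello';\nconst bad = '\uD83D\uDE00';" },
  { path: "apps/mobile/fixtures/emoji.txt", text: "\u2705 allowed here" },
];

const emojiHits = emojiViolations(sampleFiles, ["apps/mobile/fixtures/"]);
console.log(emojiHits);
</code></pre>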
<p>Two design choices that pay rent.</p>
<p>A per-workspace <code>test:unit</code> / <code>test:integration</code> split. <code>apps/api</code> and <code>packages/db</code> own DB-backed integration suites that need a live Postgres on port 14002. The unit slice runs everywhere (fresh clone, sandbox, CI), and the per-task agent loop only runs <code>test:unit</code>. The orchestrator runs <code>test:integration</code> once at milestone boundaries on a machine with the test DB up. This split is the difference between a 6-second per-task gate and a 90-second one.</p>
<p>A self-asserting matrix. <code>packages/types/src/__tests__/verify-coverage.test.ts</code> declares which workspaces are expected to participate in which gates. If a new workspace is added without being wired into <code>scripts/verify.sh</code>, that test fails. The gate audits itself.</p>
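<p>A sketch of what such a self-audit might check, in TypeScript. The real test's contents are not shown here, so everything in it is an assumption: the function names, the workspace filter names, and the idea that <code>scripts/verify.sh</code> wires each unit-test workspace with a <code>pnpm --filter</code> line like the one in the table above.</p>
<pre><code>// Hypothetical: list the workspaces that verify.sh actually runs tests for.
function wiredWorkspaces(verifyScript: string): string[] {
  // Ignore commented-out lines so a disabled gate does not count as wired.
  const live = verifyScript
    .split("\n")
    .filter((line) => line.trim().charAt(0) !== "#")
    .join("\n");
  const pattern = /pnpm\s+--filter\s+(\S+)\s+test\b/g;
  const found: string[] = [];
  let match = pattern.exec(live);
  while (match !== null) {
    found.push(match[1]);
    match = pattern.exec(live);
  }
  return found;
}

// Expected workspaces that have no matching line in the script.
function missingFromVerify(expected: string[], verifyScript: string): string[] {
  const wired = new Set(wiredWorkspaces(verifyScript));
  return expected.filter((ws) => !wired.has(ws));
}

// Made-up script body and filter names.
const sampleScript = [
  "pnpm --filter api test:unit",
  "# pnpm --filter db test:unit",
  "pnpm --filter ui-mobile test",
].join("\n");

const unwired = missingFromVerify(["api", "db", "ui-mobile"], sampleScript);
console.log(unwired);
</code></pre>
<p>A test built on this would simply assert that the returned list is empty.</p>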
<h2 id="lessons-the-hard-won-kind">Lessons (the hard-won kind)</h2>
<h3 id="1-specs-are-guidance-code-is-truth">1. Specs are guidance. Code is truth.</h3>
<p>The single most common failure mode across 53 completed tasks was spec drift. The phase docs were written upfront, the code evolved, the agents trusted the spec. Examples from the log:</p>
<ul>
<li>T5.1a: spec said <code>GET /sessions/:id/next-question</code>. That endpoint does not exist in the live API. The agent verified against <code>apps/api/src/routes/sessions.ts</code>, found that the actual web flow loads questions upfront via a different endpoint, and built the state machine around the real shape. The handoff note to T5.1b documented the correction so the next agent didn't re-discover it.</li>
<li>T0.1: spec said <code>apps/api/src/middleware/auth.ts</code> but the live code path was <code>apps/api/src/plugins/auth.ts</code>. Agent updated the real file. Handoff noted the canonical path so T0.7 didn't trip on the same thing.</li>
<li>T4.3: spec pseudocode used <code>KjScreen onRefresh/refreshing</code> props. The real component takes <code>refreshControl={&lt;RefreshControl/&gt;}</code>. Multiple primitive prop drifts caught in one task.</li>
</ul>
<p>The rule I baked into every agent's spawn prompt:</p>
<blockquote>
<p>If the spec disagrees with the live code, the live code wins. Update the spec section's path/shape if you're sure, and document the correction in your handoff note.</p>
</blockquote>
<p>The cost is one extra <code>grep</code> per task. The benefit is that every subsequent agent inherits a corrected model.</p>
<h3 id="2-force-the-handoff-dont-trust-the-agent-to-volunteer-it">2. Force the handoff. Don't trust the agent to volunteer it.</h3>
<p>Half the value of the <code>agents.log</code> entries is the bottom <code>notes:</code> field. The first dozen agents barely filled it in. So the spawn prompt became explicit:</p>
<blockquote>
<p>Your final report MUST include a <code>Handoff</code> paragraph for the next dependent task: the precondition it can rely on, the path you actually used (not what the spec said), and any decision it does not need to re-litigate.</p>
</blockquote>
<p>After this change, every entry has a usable handoff. The pattern is so reliable I caught one bug just by reading the previous entry's handoff against the current task's spec. They disagreed, the previous agent had been right, and the spec was stale.</p>
<h3 id="3-agents-bail-mid-investigation-make-them-flip-the-row-before-they-exit">3. Agents bail mid-investigation. Make them flip the row before they exit.</h3>
<p>This was the most expensive failure mode. An agent finishes the code, runs <code>pnpm verify</code>, sees green, and then, instead of flipping the <code>TASKS.md</code> row to <code>done</code>, drops out of the loop with &quot;Let me check the actual component interfaces&quot; as its final line. The work is done. The bookkeeping is not.</p>
<p>When the orchestrator goes to spawn the next agent, it finds the previous row still <code>in_progress</code> and refuses (the precondition is <code>in_progress == 0</code>). The orchestrator has to absorb the bookkeeping by hand.</p>
<p>The fix in the spawn prompt:</p>
<blockquote>
<p>Before reporting, you MUST: (1) run <code>pnpm verify</code> to completion, (2) flip your row in TASKS.md to <code>done</code>, (3) decrement <code>in_progress</code> and increment <code>done</code> in the status counts. Report only after these three things are visible in the working tree.</p>
</blockquote>
<p>Plus an explicit confirmation line in the report:</p>
<blockquote>
<p>&quot;TASKS.md flipped to <code>done</code>, counts updated.&quot;</p>
</blockquote>
<p>After this change the bail rate dropped to roughly zero. Two agents that did bail in M4 (T4.4, T4.7) were caught by the orchestrator at the spawn precondition check and the row was finalized in seconds, with the agent's actual work intact in the working tree.</p>
<h3 id="4-codex-review-as-a-cheap-second-opinion">4. Codex review as a cheap second opinion</h3>
<p>After any non-trivial implementation, I run:</p>
<pre tabindex="0" style="color:#e5e5e5;background-color:#000;"><code><span style="display:flex;"><span>codex <span style="color:#fff;font-weight:bold">exec</span> --sandbox read-only <span style="color:#0ff;font-weight:bold">&#34;Review for bugs and logic errors&#34;</span>
</span></span></code></pre><p>It's a different model with a fresh context window reading the diff cold. It catches things the implementing agent missed because the implementing agent was deep inside its own assumptions.</p>
<p>The KjLessonWebView task (T4.2) is a clean example. The implementing agent shipped it. Codex flagged two real issues: (1) <code>onHeightChange</code> presence was incorrectly switching the WebView to content-height layout mode, and (2) <code>DOM_READY_JS</code> was running twice (once inside <code>buildKatexDoc</code>'s <code>DOMContentLoaded</code> handler, again via <code>injectedJavaScript</code>). Both got fixed in the same commit before the milestone push.</p>
<p>I treat Codex as a peer reviewer with zero relationship to the agent that wrote the code. The cost is one tool call per task. The catch rate is meaningful.</p>
<h3 id="5-codemod-what-you-can">5. Codemod what you can</h3>
<p><code>packages/ui/src/icons/icon-renderers.tsx</code> has 519 named SVG icons used across the web app. The naive approach (hand-port each to <code>react-native-svg</code>) was budgeted at three days.</p>
<p>Instead, T2.4a hand-ported the first 30 to establish the pattern: default export function, <code>react-native-svg</code> elements, <code>SvgComponentProps</code> props. Then T2.4b ran a codemod at <code>packages/ui-mobile/scripts/port-icons.mjs</code> over the remaining 489. 476 ported cleanly. The other 13 could not be converted mechanically, because they use <code>&lt;text&gt;</code> SVG elements or <code>.map()</code> in their renderers. They are recorded in a skip list at <code>packages/ui-mobile/src/icons/skipped.ts</code> so the parity test can prove every web icon is either ported or explicitly skipped.</p>
<p>T2.4c ran a parity gate test: walk every icon in the web registry, assert it exists in the mobile registry or in the skip list. If a new web icon ships, the mobile gate fails until the icon is either ported or skipped. That gate runs as part of <code>pnpm verify</code>.</p>
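<p>The parity gate reduces to a set comparison. A minimal TypeScript sketch, with hypothetical icon names and a hypothetical <code>parityGaps</code> helper (the actual test in the repo is not shown here). The stale-skip check is an optional extra, not something the gate above is stated to do.</p>
<pre><code>// Hypothetical sketch of the icon parity check.
function parityGaps(webIcons: string[], mobileIcons: string[], skipped: string[]) {
  const web = new Set(webIcons);
  const mobile = new Set(mobileIcons);
  const skip = new Set(skipped);
  // Web icons that are neither ported nor explicitly skipped: the gate fails on these.
  const unported = webIcons
    .filter((name) => !mobile.has(name))
    .filter((name) => !skip.has(name));
  // Optional extra: skip entries that are already ported or gone from the web registry.
  const staleSkips = skipped.filter((name) => mobile.has(name) || !web.has(name));
  return { unported, staleSkips };
}

// Made-up registries: IconNew has just shipped on web and is not handled on mobile yet.
const gaps = parityGaps(
  ["IconBook", "IconFlame", "IconChart", "IconNew"],
  ["IconBook", "IconFlame"],
  ["IconChart"]
);
console.log(gaps);
</code></pre>
<p>With these inputs <code>unported</code> contains only <code>IconNew</code>, which is the situation where the gate should go red until the icon is ported or added to the skip list.</p>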
<p>The whole sub-phase shipped in under three hours of wall clock, including the codemod write itself. Three days saved.</p>
<h3 id="6-three-file-env-var-rule">6. Three-file env-var rule</h3>
<p>Whenever any service reads <code>process.env.X</code>, the rule is:</p>
<ol>
<li>Add the var with a safe default to <code>.env.example</code></li>
<li>Add a <code>VAR=${VAR}</code> placeholder to <code>.env.dokploy</code></li>
<li>Set the real value in the Dokploy production env config</li>
</ol>
<p>Miss any of the three and the next deploy silently breaks. I shipped two regressions by breaking this rule before automating it. Both took longer to debug than the rule takes to follow. The <code>deploy-supervisor</code> skill now scans <code>process.env.X</code> references against <code>.env.dokploy</code> at push time and refuses to deploy if any var is missing.</p>
<p>The same principle applies to the mobile build: every new env var consumed by <code>apps/mobile</code> (currently <code>EXPO_PUBLIC_API_BASE_URL</code>, <code>APP_VARIANT</code>) goes through all three steps. If a future agent tries to read a new var without registering it, <code>deploy-supervisor</code> blocks the push.</p>
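<p>The scan itself can be small. Below is a hedged TypeScript sketch of the idea, not the <code>deploy-supervisor</code> implementation: the function names and <code>NEW_API_KEY</code> are invented, and it only catches the literal <code>process.env.NAME</code> form (bracket access and destructuring would slip through).</p>
<pre><code>// Hypothetical sketch: find process.env references that the env file does not declare.
function referencedVars(source: string): string[] {
  const pattern = /process\.env\.([A-Za-z_][A-Za-z0-9_]*)/g;
  const names: string[] = [];
  let match = pattern.exec(source);
  while (match !== null) {
    names.push(match[1]);
    match = pattern.exec(source);
  }
  return names;
}

// Reads NAME=value lines, ignoring blanks and comments.
function declaredVars(envFile: string): string[] {
  return envFile
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line !== "")
    .filter((line) => line.charAt(0) !== "#")
    .filter((line) => line.indexOf("=") > 0)
    .map((line) => line.slice(0, line.indexOf("=")).trim());
}

function undeclaredVars(sources: string[], envFile: string): string[] {
  const declared = new Set(declaredVars(envFile));
  const missing: string[] = [];
  for (const source of sources) {
    for (const name of referencedVars(source)) {
      if (declared.has(name)) continue;
      if (missing.indexOf(name) === -1) missing.push(name);
    }
  }
  return missing.sort();
}

const sampleSources = [
  "const base = process.env.EXPO_PUBLIC_API_BASE_URL;",
  "const variant = process.env.APP_VARIANT; const key = process.env.NEW_API_KEY;",
];
const sampleEnv = [
  "# mobile",
  "EXPO_PUBLIC_API_BASE_URL=${EXPO_PUBLIC_API_BASE_URL}",
  "APP_VARIANT=${APP_VARIANT}",
].join("\n");

const unregistered = undeclaredVars(sampleSources, sampleEnv);
console.log(unregistered);
</code></pre>
<p>A non-empty result is the signal to refuse the push.</p>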
<h3 id="7-plan-up-front-execute-without-thinking">7. Plan up front. Execute without thinking.</h3>
<p>The 19 plan documents (<code>00-README.md</code> through <code>10-phase-8-advanced.md</code> plus parity matrix and conventions) total roughly 130 KB of markdown. They were written before any code was. They include:</p>
<ul>
<li>Locked architecture decisions (no agent may re-litigate)</li>
<li>Concrete file paths per task</li>
<li>Code sketches just detailed enough to anchor structure</li>
<li>&quot;Done when&quot; checklists</li>
<li>A glossary</li>
</ul>
<p>Writing this upfront felt slow. It's the highest-leverage decision I've made on this project. Every minute spent writing a clear &quot;Done when&quot; line in T4.6 saved an hour of agent thrashing during execution. Agents that hit ambiguity stall and start asking the orchestrator questions, which means I get paged in the middle of the night.</p>
<p>The phase docs are written for &quot;an autonomous coding agent (or human engineer) picking up cold.&quot; That framing forces self-containment.</p>
<h2 id="what-the-math-actually-looks-like">What the math actually looks like</h2>
<p>Wall clock over the recent two-day window:</p>
<ul>
<li>M0 (backend prep): 7 tasks, ~1.5 hours</li>
<li>M1 (foundation): 11 tasks, ~3.5 hours including dependency churn</li>
<li>M2 (design system): 9 tasks, ~6 hours (the codemod sub-phase compressed what was budgeted as 3 days)</li>
<li>M3 (auth shell): 8 tasks, ~3 hours</li>
<li>M4 (core content): 9 tasks, ~4 hours including the KaTeX prototype and offline cache</li>
<li>M5 (quiz): 4 of 10 tasks shipped so far, ~1 hour</li>
</ul>
<p>Total: ~19 hours of agent wall-clock for what the original plan estimated as ~7 weeks of solo founder calendar time. Not all of that was overnight, but most of M2–M4 ran while I was asleep. The orchestrator sent push notifications on milestone completions and on blocker surfacing; I woke up to a working <code>(dashboard)/subjects/[slug]/[topicSlug]/[subtopicSlug]</code> lesson reader I had not touched.</p>
<p>Things the system has not had to deal with yet:</p>
<ul>
<li>Native module integration (Apple IAP, Google Play Billing, Phase 9)</li>
<li>Real device testing (currently sim-only; release pipeline is Phase 11)</li>
<li>A merge conflict (single-threaded execution + <code>git pull --rebase</code> at agent start prevents this entirely)</li>
</ul>
<p>I expect Phase 9 (IAP) to be the model's first real stress test, because eight of those tasks are <code>blocked</code> on external Apple/Google account state that no agent can resolve.</p>
<h2 id="what-id-tell-someone-setting-this-up-tomorrow">What I'd tell someone setting this up tomorrow</h2>
<ol>
<li>Write the plan docs first. All of them. Before any code. The plan docs are the spec the agents read. If they're vague, the agents will fight the same battle three times across three tasks.</li>
<li>The queue is one markdown file. Not a database, not a task service. Drift between the file and the system breaks everything. Make the file the system.</li>
<li>Agents must not touch git. Let them code. Let them test. Let them flip the tracker. Push from one place, one time per milestone group. Audit log is append-only.</li>
<li>The pre-commit hook is your QA team. <code>pnpm verify</code> runs every gate every time. If it can't catch a class of bug, harden it once. Don't review by hand.</li>
<li>Force the handoff in the spawn prompt. The next agent's success depends on the previous agent's last paragraph. Make that paragraph contractual.</li>
<li>A second model reviews everything. Codex (or any agent with a fresh context window and read-only access) catches assumption-blindness from the implementing agent. It's the cheapest review you'll ever do.</li>
<li>Specs are guidance. Code is truth. Bake this into the spawn prompt verbatim. Agents that trust the spec over the code will compound errors.</li>
<li>Plan for the bail. Agents will exit mid-task. Make the orchestrator's precondition (<code>in_progress == 0</code>) self-healing: if a row is stuck <code>in_progress</code>, finalize it from the working-tree state and move on. Do not block the loop on a bail.</li>
<li>Milestone-batch the commits. Per-task commits are unreadable. Per-milestone commits are shippable units. The trade-off (coarser rollback granularity) is worth it for clean history and a clear push contract.</li>
<li>Push notifications on milestone completion and on blockers. Otherwise you wake up to a system that paused at 3 a.m. waiting for a question you could have answered in 30 seconds.</li>
</ol>
<h2 id="the-bits-i-havent-solved-yet">The bits I haven't solved yet</h2>
<p>Honest list:</p>
<ul>
<li>Phase 9 (IAP) has 8 externally-blocked tasks. Apple Developer enrollment, Small Business Program, App Store Connect product setup, Google Play Console, User Choice Billing application. The loop walks around them via the dependency graph, but the eventual unblocking is a sequence of two-day-each turnaround items that no automation can compress.</li>
<li>Real device testing. The smoke test passes on iOS Simulator and Android Emulator. Real-device QA on a TestFlight build is currently a manual gate scheduled for Phase 11.</li>
<li>Spec drift detection. Agents flag drift in their handoff notes, but the spec doc itself is never auto-updated. After M5 closes I plan a sweep agent that ingests every <code>agents.log</code> <code>Spec drift:</code> note and proposes corrections to the phase docs.</li>
<li>Long-form lessons learned never propagate back to the spawn prompt. The seven lessons in the Lessons section above live in this blog post and in my head. They should live in a <code>CONTRIBUTING-FOR-AGENTS.md</code> that every spawn loads. That refactor is on the list.</li>
</ul>
<h2 id="closing">Closing</h2>
<p>None of this is novel. Every individual ingredient (append-only audit logs, single-threaded queues, pre-commit verification gates, milestone-batched commits, codemods for boring transforms, second-model review) is engineering practice from before LLMs existed.</p>
<p>What changed is that the things you used to need a team for now run on a laptop with an agent that you brief like a junior engineer. There's no clever prompt to copy. The work is writing a plan boring enough to execute mechanically, building a pre-commit gate strict enough to be the only reviewer, and refusing to let an agent touch the git history.</p>
<p>The hard part of solo engineering used to be doing the work. Now the hard part is deciding what work to do, and writing it down clearly enough that the agent doesn't have to ask.</p>
<hr>
<p><em>Sources: <code>apps/mobile/plans/TASKS.md</code>, <code>apps/mobile/plans/logs/agents.log</code>, <code>apps/mobile/plans/00-README.md</code>, <code>apps/mobile/plans/01-architecture-decisions.md</code>, <code>apps/mobile/CLAUDE.md</code>, <code>scripts/verify.sh</code>. All numbers and quotes are from the actual files; nothing has been edited for narrative effect.</em></p>
]]></content:encoded>
</item>
<item>
  <title>Autonomous agents inside an Indonesian company</title>
  <link>https://irvineafri.com/blog/autonomous-agents-in-an-indonesian-company</link>
  <guid isPermaLink="true">https://irvineafri.com/blog/autonomous-agents-in-an-indonesian-company</guid>
  <pubDate>Sat, 09 May 2026 00:00:00 +0000</pubDate>
  <description>A year of running autonomous agents in production at an Indonesian fintech. What actually breaks (orchestration, memory, permissions, reliability, observability, cost), and the writeup I wish someone had handed me on day one.</description>
  <content:encoded><![CDATA[<blockquote>
<p>Numbers are real but rounded. Rupiah figures use IDR 16,000/USD as
the lazy exchange anchor I keep in my head. Calibrated against a
2026 Q1 production run on GCP <code>asia-southeast2</code>, hitting OpenAI
via Azure Singapore, Anthropic in <code>us-west</code>, and a self-hosted
Llama 3.3 70B for the cheap stuff.</p>
</blockquote>
<p>Most &quot;agent&quot; articles pretend the loop is solved. Call the LLM,
parse the tool call, run it, feed the result back. Done. That's the
demo loop. The production loop is a different animal, and once you
ship one of these for an Indonesian company with rupiah on the line
and an OJK auditor on speed-dial, the differences stop being
academic.</p>
<p>I've been running autonomous agents inside that kind of company for
about a year. This is the writeup I wish somebody had handed me on
day one. The audience is engineers who already know what an MCP
server is, what a tool-call schema looks like, and roughly what an
<code>o1</code>-style reasoning trace costs per token. I'm skipping the
marketing layer.</p>
<h2 id="what-agent-means-here">What &quot;agent&quot; means here</h2>
<p>A long-running process that takes a goal, plans, calls tools,
watches the world, retries, escalates when it gets stuck, and
produces a durable artifact. Not a chatbot. Not a single LLM call
in a retry loop. Something with state that survives a process
restart, and a coordinator that decides when the work is done.</p>
<p>The agent we run most often does collections triage. Given a
delinquent borrower, it pulls the loan history, checks the WhatsApp
engagement, drafts a tailored outreach, fires the first contact,
watches the response, and either escalates to a human collector or
schedules a follow-up. End to end: 40 to 90 seconds wall-clock,
20 to 50 LLM calls, 6 to 12 tool calls. Runs about 12,000 times a
day at peak.</p>
<p>That's the shape. Now the parts.</p>
<h2 id="1-orchestration">1. Orchestration</h2>
<p>First decision: graph framework or hand-rolled. We tried both.
LangGraph, BAML, Inngest are all wonderful for the walkthrough
demo. They become a tax the moment your control flow stops being
a DAG. And real agent control flow is <em>not</em> a DAG. It has loops,
dynamic branches based on tool output, and at-least-once retries
that need state-machine guarantees the framework's abstractions
weren't built to express. We spent more time fighting the
framework than we saved.</p>
<p>So we wrote our own. State machine over Postgres + RabbitMQ. The
shape:</p>
<pre><code>[pending]
   │
   ▼
[running]  ◄────┐
   │            │  resumed after
   ▼            │  tool callback
[awaiting_tool]─┘
   │
   ▼
[completed | failed | escalated]
</code></pre>
<p>Every transition writes a row to <code>agent_runs.events</code> (append-only)
and updates <code>agent_runs.state</code> atomically, in the same transaction.
That single decision is load-bearing. Every model call, every tool
call, every external observation lands in the database as an event.
If a worker dies mid-run, and they do, often, because Indonesian
data centres lose power in ways AWS post-mortems don't capture,
another worker reads the log and resumes from the last consistent
state.</p>
<p>The pseudocode that earns its keep:</p>
<pre tabindex="0" style="color:#e5e5e5;background-color:#000;"><code><span style="display:flex;"><span><span style="color:#fff;font-weight:bold">def</span> step(run_id):
</span></span><span style="display:flex;"><span>    <span style="color:#fff;font-weight:bold">with</span> txn():
</span></span><span style="display:flex;"><span>        run = lock_run(run_id)
</span></span><span style="display:flex;"><span>        <span style="color:#fff;font-weight:bold">if</span> run.state == <span style="color:#0ff;font-weight:bold">&#39;awaiting_tool&#39;</span>:
</span></span><span style="display:flex;"><span>            <span style="color:#fff;font-weight:bold">return</span>  <span style="color:#007f7f"># someone else&#39;s problem</span>
</span></span><span style="display:flex;"><span>        events = load_events(run_id)
</span></span><span style="display:flex;"><span>        next_action = plan(run, events)  <span style="color:#007f7f"># an LLM call</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>        <span style="color:#fff;font-weight:bold">if</span> next_action.kind == <span style="color:#0ff;font-weight:bold">&#39;tool&#39;</span>:
</span></span><span style="display:flex;"><span>            event = emit(<span style="color:#0ff;font-weight:bold">&#39;tool_call.requested&#39;</span>, next_action)
</span></span><span style="display:flex;"><span>            run.state = <span style="color:#0ff;font-weight:bold">&#39;awaiting_tool&#39;</span>
</span></span><span style="display:flex;"><span>            run.save()
</span></span><span style="display:flex;"><span>            enqueue_tool(event)         <span style="color:#007f7f"># RabbitMQ delayed-message exchange</span>
</span></span><span style="display:flex;"><span>        <span style="color:#fff;font-weight:bold">elif</span> next_action.kind == <span style="color:#0ff;font-weight:bold">&#39;finish&#39;</span>:
</span></span><span style="display:flex;"><span>            run.state = <span style="color:#0ff;font-weight:bold">&#39;completed&#39;</span>
</span></span><span style="display:flex;"><span>            run.save()
</span></span><span style="display:flex;"><span>            emit(<span style="color:#0ff;font-weight:bold">&#39;run.completed&#39;</span>, next_action.result)
</span></span></code></pre><p>The trick is that <code>awaiting_tool</code> is a real, stable state with its
own timeout. Tools are <em>jobs</em>, not function calls. Calling a tool
means publishing a message. A callback later delivers the result.
That's what makes a 90-second agent run with three external HTTP
hops survivable when one of those hops takes 12 seconds because
some upstream API is having a moment.</p>
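<p>A minimal sketch of the transition rules implied by the diagram, assuming a plain in-memory table. The <code>ALLOWED</code> map and the <code>transition</code> helper are hypothetical names, and the timeout edge from <code>awaiting_tool</code> to <code>failed</code> is my reading of "its own timeout", not something lifted from the production code. The real thing writes the event row and the state update to Postgres in one transaction.</p>
<pre><code>ALLOWED = {
    "pending": {"running"},
    "running": {"awaiting_tool", "completed", "failed", "escalated"},
    # Resume on tool callback, or fail when the awaiting_tool timeout fires.
    "awaiting_tool": {"running", "failed"},
}


def transition(state, new_state, events):
    """Append a state-change event and return the new state, or raise on an illegal move."""
    if new_state not in ALLOWED.get(state, set()):
        raise ValueError(f"illegal transition {state} to {new_state}")
    events.append({"type": "state.changed", "from": state, "to": new_state})
    return new_state


events = []
state = "pending"
for target in ["running", "awaiting_tool", "running", "completed"]:
    state = transition(state, target, events)
</code></pre>
<p>Terminal states have no outgoing edges in the table, so a stray redelivery against a finished run raises instead of quietly planning again.</p>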
<h2 id="2-memory">2. Memory</h2>
<p>There are three kinds, and they have nothing to do with each
other. Pretending they're one thing (a &quot;memory layer,&quot; a vector
store) is the most common mistake I see.</p>
<p><strong>Run-local memory</strong> is the scratchpad inside one agent run.
Everything the agent has seen so far, including its own intermediate
reasoning. We store it as the event log on <code>agent_runs</code>. Replaying
the events deterministically reconstructs the prompt for the next
step. Token budget: 32k before we summarise.</p>
<p><strong>Episodic memory</strong> is what this agent remembers about <em>this
borrower</em> across past runs. We tried vector stores: <code>pgvector</code>,
Weaviate, Qdrant. Burned three months chasing retrieval relevance.
What actually shipped was a structured episodic table:</p>
<pre tabindex="0" style="color:#e5e5e5;background-color:#000;"><code><span style="display:flex;"><span><span style="color:#fff;font-weight:bold">CREATE</span> <span style="color:#fff;font-weight:bold">TABLE</span> borrower_episodes (
</span></span><span style="display:flex;"><span>  borrower_id   <span style="color:#fff;font-weight:bold">bigint</span>,
</span></span><span style="display:flex;"><span>  episode_at    timestamptz,
</span></span><span style="display:flex;"><span>  channel       <span style="color:#fff;font-weight:bold">text</span>,        <span style="color:#007f7f">-- &#39;wa&#39;, &#39;voice&#39;, &#39;sms&#39;
</span></span></span><span style="display:flex;"><span><span style="color:#007f7f"></span>  outcome       <span style="color:#fff;font-weight:bold">text</span>,        <span style="color:#007f7f">-- &#39;paid&#39;, &#39;pkpu&#39;, &#39;no_answer&#39;, ...
</span></span></span><span style="display:flex;"><span><span style="color:#007f7f"></span>  notes         <span style="color:#fff;font-weight:bold">text</span>,
</span></span><span style="display:flex;"><span>  vector        vector(<span style="color:#ff0;font-weight:bold">768</span>)  <span style="color:#007f7f">-- mE5, multilingual
</span></span></span><span style="display:flex;"><span><span style="color:#007f7f"></span>);
</span></span></code></pre><p>Retrieval is <code>WHERE borrower_id = $1 ORDER BY episode_at DESC LIMIT 20</code>.
The vector column is reserved for the rare &quot;find episodes
semantically like this one&quot; query that shows up maybe once a week.
The vector index is the cherry on top, not the cake. People keep
flipping that around.</p>
<p><strong>Procedural memory</strong> is the prompt. We version every system prompt
with <code>git</code>, hash it, and stamp the hash on every run. When somebody
&quot;fixes&quot; a regression by editing the prompt, we can replay the
offending run against both versions and see which one it was born
under. Sounds boring. Will save you a sprint the first time a
quality drop bisects to a four-word edit.</p>
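<p>The hash-and-stamp step is small enough to sketch. This assumes SHA-256 over the normalised prompt text and a 12-character prefix; both are illustrative choices, and <code>prompt_hash</code> and <code>start_run</code> are hypothetical helper names.</p>
<pre><code>import hashlib


def prompt_hash(prompt_text):
    """Short, stable fingerprint of a system prompt (normalised line endings, UTF-8)."""
    normalised = prompt_text.replace("\r\n", "\n").strip()
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()[:12]


def start_run(run_id, prompt_text):
    # The hash travels with the run, so a later replay knows which prompt it was born under.
    return {"run_id": run_id, "prompt_hash": prompt_hash(prompt_text), "events": []}


v1 = "You are a collections assistant. Be firm and lawful."
v2 = "You are a collections assistant. Be firm, lawful and brief."
run = start_run("run-001", v1)
</code></pre>
<p>Bisecting a quality drop then becomes a group-by on <code>prompt_hash</code> rather than an archaeology session in chat history.</p>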
<h2 id="3-tools">3. Tools</h2>
<p>The mistake is one big tool with a hundred arguments. The shape
that survives is many small tools, each with a tight, validated
input schema, each idempotent.</p>
<p>Every tool gets:</p>
<ul>
<li>A Zod-style schema for inputs.</li>
<li>A canonical idempotency key derived from inputs + run id.</li>
<li>A timeout. <code>p99</code> of normal latency × 3, capped at 30 seconds
for the synchronous request, longer for the async job.</li>
<li>A circuit breaker per downstream system.</li>
<li>An audit row in <code>agent_tool_calls</code> with the full request and
response payloads, encrypted at rest.</li>
</ul>
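<p>Two of those bullets reduce to a few lines each. A sketch, assuming the key is a SHA-256 over canonical JSON and that latencies are tracked in seconds; <code>idempotency_key</code> and <code>sync_timeout_seconds</code> are hypothetical names, not the production helpers.</p>
<pre><code>import hashlib
import json


def idempotency_key(run_id, tool_name, inputs):
    """Same run, tool and inputs always yield the same key, whatever the dict ordering."""
    canonical = json.dumps(
        {"run_id": run_id, "tool": tool_name, "inputs": inputs},
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def sync_timeout_seconds(p99_seconds, multiplier=3.0, cap=30.0):
    """p99 of normal latency times three, capped at 30 seconds for the synchronous request."""
    return min(p99_seconds * multiplier, cap)


key_a = idempotency_key("run-001", "wa.send", {"borrower_id": 42, "template": "reminder"})
key_b = idempotency_key("run-001", "wa.send", {"template": "reminder", "borrower_id": 42})
</code></pre>
<p>Sorting the keys before hashing is what makes the key canonical: <code>key_a</code> and <code>key_b</code> are equal even though the arguments arrived in a different order.</p>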
<p>The audit table isn't optional. Indonesian fintechs have auditors,
and when an auditor asks &quot;what did this agent do on borrower xyz?&quot;,
the answer needs to be one query. I've watched a peer team scramble
for two days reconstructing this from logs after the fact. Don't be
that team.</p>
<p>A failure that quietly costs you: the LLM hallucinates a tool name
that doesn't exist, or hallucinates an argument with the
slightly-wrong type. The framework most tutorials show you swallows
this and feeds a string error back to the model, hoping it
self-corrects. In production you want the orchestrator to detect
&quot;hallucinated tool / schema&quot; as a <em>category</em> of failure, count it,
alert when it spikes, and fall back to a smaller, stricter model
for the next attempt. We've watched <code>gpt-5</code> regress on a Wednesday
afternoon because of a quiet upstream model update. That's where
this metric earns its keep.</p>
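<p>A sketch of treating hallucinations as a counted category instead of a string error. The tool registry, the category names and the in-process <code>Counter</code> are all assumptions for illustration; a real orchestrator would push these counts to its metrics backend.</p>
<pre><code>from collections import Counter

# Hypothetical registry: tool name mapped to its argument names and expected types.
TOOLS = {
    "borrower.read": {"borrower_id": int},
    "wa.send": {"borrower_id": int, "template": str},
}

failure_counts = Counter()


def classify_tool_call(name, args):
    """Return 'ok' or a named failure category for a tool call proposed by the model."""
    schema = TOOLS.get(name)
    if schema is None:
        category = "hallucinated_tool"
    elif set(args) != set(schema):
        category = "hallucinated_schema"
    elif any(not isinstance(args[key], kind) for key, kind in schema.items()):
        category = "hallucinated_schema"
    else:
        return "ok"
    failure_counts[category] += 1  # the number you alert on when it spikes
    return category
</code></pre>
<p>The caller can branch on the returned category, for example to route the next attempt to a stricter model, while <code>failure_counts</code> feeds the alert.</p>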
<h2 id="4-permissions">4. Permissions</h2>
<p>The dangerous question: what is the agent allowed to do?</p>
<p>The lazy answer is &quot;whatever its tools let it do.&quot; That answer
ships exactly once. After that, compliance puts a hold on every
agent project for six months. I've seen it happen.</p>
<p>What works:</p>
<ol>
<li>Tools declare a <em>capability</em> (<code>payment.disburse</code>,
<code>borrower.send_wa</code>, <code>borrower.read_pii</code>).</li>
<li>Each agent run is bound to an <em>actor</em>, not a service account.
For autonomous runs, the actor is a synthetic identity tied to
the workflow definition (<code>agent:collections-tier-1</code>).</li>
<li>The orchestrator enforces capability scoping <em>before</em> the tool
is dispatched, against a per-actor policy table.</li>
<li>Capabilities have soft and hard caps. <code>payment.disburse</code> for
<code>agent:collections-tier-1</code> has a hard cap of IDR 0 (it cannot
move money), and its soft cap stays at zero in every policy revision.
Going past a soft cap requires a human approver recorded in the event
log, full stop.</li>
</ol>
<p>The enforcement point matters. Don't put it in the tool. Put it in
the dispatcher. Tools assume their inputs are already authorised.
That's one audit boundary. Putting the check in N tools means N
audit boundaries, written by N engineers, each of whom forgot
something different. I learned this the slow way.</p>
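<p>A sketch of dispatcher-side enforcement, assuming the policy table fits in a dict keyed by actor and capability. <code>POLICY</code>, <code>Denied</code>, <code>authorise</code> and <code>dispatch</code> are hypothetical names, and soft caps with human approval are left out to keep it short.</p>
<pre><code>POLICY = {
    # (actor, capability): hard cap in IDR, or None when no amount is involved.
    ("agent:collections-tier-1", "borrower.read_pii"): None,
    ("agent:collections-tier-1", "borrower.send_wa"): None,
    ("agent:collections-tier-1", "payment.disburse"): 0,
}


class Denied(Exception):
    """Raised by the dispatcher before any tool code runs."""


def authorise(actor, capability, amount_idr=0):
    """Raise Denied unless the actor holds the capability and the amount fits the hard cap."""
    if (actor, capability) not in POLICY:
        raise Denied(f"{actor} lacks {capability}")
    hard_cap = POLICY[(actor, capability)]
    if hard_cap is not None and amount_idr > hard_cap:
        raise Denied(f"{capability} amount {amount_idr} exceeds hard cap {hard_cap}")


def dispatch(actor, capability, tool, amount_idr=0, **kwargs):
    # The single audit boundary: the tool itself never re-checks permissions.
    authorise(actor, capability, amount_idr)
    return tool(**kwargs)
</code></pre>
<p>Because every tool goes through <code>dispatch</code>, adding a new tool cannot forget the check; it can only forget to declare a capability, which fails closed.</p>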
<h2 id="5-reliability">5. Reliability</h2>
<p>LLM endpoints are not reliable infrastructure. Treat them like
flaky third-party APIs, because that is what they are.</p>
<p>Production reliability budget for a single agent run, last quarter:</p>
<table>
<thead>
<tr>
<th>Source</th>
<th>Per-call failure rate (Q1 2026)</th>
<th>Mitigation</th>
</tr>
</thead>
<tbody>
<tr>
<td>OpenAI 5xx</td>
<td>0.4%</td>
<td>retry × 2 with jitter</td>
</tr>
<tr>
<td>Anthropic 5xx</td>
<td>0.6%</td>
<td>retry × 2 with jitter</td>
</tr>
<tr>
<td>OpenAI rate-limit</td>
<td>1.1%</td>
<td>model-level priority queue</td>
</tr>
<tr>
<td>Tool timeout</td>
<td>0.9%</td>
<td>per-tool circuit breaker</td>
</tr>
<tr>
<td>Hallucinated schema</td>
<td>0.3%</td>
<td>strict-mode reattempt</td>
</tr>
<tr>
<td>Indo network</td>
<td>0.2%</td>
<td>connection pool warming + retry</td>
</tr>
</tbody>
</table>
<p>Compose those naively and you get a 3.5% per-call failure rate.
Across a 30-LLM-call run, the unmitigated joint failure probability
is around 65%. Mitigations bring it under 2%. The gap between &quot;demo
works&quot; and &quot;demo works on Friday afternoon when GPT is degraded&quot; is
exactly this list.</p>
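<p>The arithmetic behind those two numbers, as a sketch. It assumes the naive reading used in the text: the six sources are summed as if mutually exclusive, and every call in the run fails independently at that combined rate.</p>
<pre><code># Per-call failure rates from the table, in percent.
RATES_PERCENT = {
    "openai_5xx": 0.4,
    "anthropic_5xx": 0.6,
    "openai_rate_limit": 1.1,
    "tool_timeout": 0.9,
    "hallucinated_schema": 0.3,
    "indo_network": 0.2,
}


def per_call_failure(rates_percent):
    """Naive composition: add the rates, treating the sources as mutually exclusive."""
    return sum(rates_percent.values()) / 100


def run_failure(per_call, calls):
    """Chance that at least one of `calls` independent calls fails."""
    return 1 - (1 - per_call) ** calls


per_call = per_call_failure(RATES_PERCENT)
unmitigated = run_failure(per_call, calls=30)
print(f"per call: {per_call:.1%}, per 30-call run: {unmitigated:.1%}")
</code></pre>
<p>That lands at about 3.5% per call and between 65% and 66% per run. The per-run figure is what makes retries and circuit breakers non-negotiable: a small per-call rate compounds quickly over 30 calls.</p>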
<p>Two patterns I keep coming back to:</p>
<p><strong>Idempotent at the agent level, not just the tool level.</strong> If a
worker crashes mid-step and another resumes, the resumer should
produce the same effects, not duplicate ones. The event log is what
enforces this. The resumer reads &quot;tool X was already requested with
idempotency key K&quot; and skips re-emitting. The resume is silent.</p>
<p><strong>A <code>resume</code> is not a <code>retry</code>.</strong> Resume picks up after the last
durable state. Retry replays the last step. Both are needed, in
different scenarios. Conflating them is how you send a borrower
the same WhatsApp twice.</p>
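<p>A sketch of the silent resume, assuming events are dicts in a list and that publishing is a callable. The event type <code>tool_call.requested</code> comes from the pseudocode earlier; <code>requested_keys</code> and <code>request_tool</code> are hypothetical helpers.</p>
<pre><code>def requested_keys(events):
    """Idempotency keys of every tool call already recorded in the run's event log."""
    return {e["key"] for e in events if e["type"] == "tool_call.requested"}


def request_tool(events, key, publish):
    """Emit the tool request once; a resumer that finds the key in the log stays silent."""
    if key in requested_keys(events):
        return False  # resume: the effect already exists, do nothing
    events.append({"type": "tool_call.requested", "key": key})
    publish(key)
    return True


sent = []
log = []
request_tool(log, "K1", sent.append)  # first worker, then it crashes
request_tool(log, "K1", sent.append)  # second worker resumes from the log
</code></pre>
<p>After both calls <code>sent</code> holds a single <code>"K1"</code>. A retry, by contrast, would deliberately publish again under the same key and rely on the tool's own idempotency.</p>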
<h2 id="6-observability">6. Observability</h2>
<p>Tracing an agent is harder than tracing a microservice. A single
run has dozens of LLM calls, up to a dozen tool calls, branching
reasoning, prompt-version changes, and a result that may not be
&quot;success&quot; or &quot;failure&quot; but &quot;escalated to human.&quot;</p>
<p>What worked for us: OpenTelemetry for transport, Langfuse for the
agent-aware UI, and a custom trace structure where every event in
the agent's event log emits its own span.</p>
<pre><code>run.collections_triage  74,231 ms
├─ plan.step.0           1,482 ms   gpt-5  · 2.4k/0.3k tok
├─ tool.borrower.read      210 ms
├─ plan.step.1           1,623 ms   gpt-5  · 4.1k/0.5k tok
├─ tool.wa.history       1,820 ms
├─ plan.step.2             842 ms   haiku  · 1.2k/0.1k tok
├─ tool.outreach.draft   3,118 ms
├─ tool.outreach.send   12,344 ms   ← retry × 2
├─ plan.step.3           1,099 ms   gpt-5
└─ run.completed
</code></pre>
<p>That view puts model timing, tool timing, per-step token counts,
and retries onto one screen. When a teammate Slacks me &quot;this run
was slow,&quot; I can answer in under 30 seconds.</p>
<p>The metric that earned its keep: <strong>escalation rate per
sub-workflow</strong>. Not per agent. Not per model. Per <em>named step in
the workflow</em>. When a particular step starts escalating more
often, it almost always points to a model regression, a prompt
edit, or a downstream tool returning a new error shape. None of
those show up on a top-level success metric.</p>
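<p>Computing that metric is a group-by. A sketch, assuming each executed step is available as a <code>(step_name, outcome)</code> pair; the step names in the sample are made up.</p>
<pre><code>from collections import Counter


def escalation_rate_by_step(step_outcomes):
    """Map each named workflow step to the share of its executions that escalated."""
    totals = Counter()
    escalated = Counter()
    for step, outcome in step_outcomes:
        totals[step] += 1
        if outcome == "escalated":
            escalated[step] += 1
    return {step: escalated[step] / totals[step] for step in totals}


sample = [
    ("outreach.draft", "ok"),
    ("outreach.draft", "escalated"),
    ("outreach.send", "ok"),
    ("outreach.send", "ok"),
]
rates = escalation_rate_by_step(sample)
</code></pre>
<p>Here <code>rates</code> is 0.5 for <code>outreach.draft</code> and 0.0 for <code>outreach.send</code>, a split that a single run-level success rate would average away.</p>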
<h2 id="7-scaling">7. Scaling</h2>
<p>The bottleneck is rarely compute. It's almost always one of: rate
limits at the model provider, latency at a downstream tool, or
worker concurrency tuned wrong.</p>
<p>Cost shape for our collections triage agent at 12,000 runs/day:</p>
<table>
<thead>
<tr>
<th>Component</th>
<th>Per run</th>
<th>Daily</th>
</tr>
</thead>
<tbody>
<tr>
<td>GPT-5 plan steps</td>
<td>$0.014</td>
<td>$168</td>
</tr>
<tr>
<td>Haiku 4.5 sub-steps</td>
<td>$0.002</td>
<td>$24</td>
</tr>
<tr>
<td>Self-hosted Llama 3.3</td>
<td>$0.0008</td>
<td>$9.60</td>
</tr>
<tr>
<td>Postgres / RMQ infra</td>
<td>(amortised)</td>
<td>$42</td>
</tr>
<tr>
<td>Observability stack</td>
<td>(amortised)</td>
<td>$18</td>
</tr>
<tr>
<td><strong>Total</strong></td>
<td><strong>$0.022</strong></td>
<td><strong>$261.60 / day</strong></td>
</tr>
</tbody>
</table>
<p>In rupiah that's about IDR 4.2M/day, or IDR ~350 per run. The
human collector who would otherwise make the first call costs
roughly IDR 14,000 per touch, all-in. Unit economics work, but
only because we keep the planner cheap (Haiku on the easy steps,
GPT-5 only when the plan branches into something nontrivial) and
we cap out-of-budget runs at the orchestrator level. Before we had
that cap, the first model spike ran us at 4× the budget for ten
hours straight.</p>
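<p>A sketch of what an orchestrator-level cap can look like, assuming cost is tracked per run in USD. <code>RunBudget</code> and <code>BudgetExceeded</code> are hypothetical names, and the 0.05 limit is an illustrative number, not our production setting.</p>
<pre><code>class BudgetExceeded(Exception):
    """Raised before a model call that would push the run over its limit."""


class RunBudget:
    def __init__(self, limit_usd):
        self.limit_usd = limit_usd
        self.spent_usd = 0.0

    def charge(self, cost_usd):
        """Record the cost of the next call, or refuse it if the run would go over budget."""
        if self.spent_usd + cost_usd > self.limit_usd:
            raise BudgetExceeded(
                f"spent {self.spent_usd:.4f} + {cost_usd:.4f} exceeds {self.limit_usd:.4f}"
            )
        self.spent_usd += cost_usd


budget = RunBudget(limit_usd=0.05)
budget.charge(0.02)
budget.charge(0.02)
</code></pre>
<p>A third <code>budget.charge(0.02)</code> raises. The orchestrator would catch that and move the run to a recovery state instead of letting it keep spending.</p>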
<p>The scaling lever that mattered most was <em>moving inference to
<code>asia-southeast</code></em>. Cross-region calls to OpenAI's US endpoints
were adding 180-220 ms median per call. On a 30-call run that's
about 6 seconds of pure latency tax. Once we routed bulk traffic
through Azure OpenAI in Singapore and kept Anthropic in <code>us-west</code>
only for the long-context steps, p99 dropped from ~118 seconds to
~71. That is the difference between a borrower picking up the
phone and not.</p>
<h2 id="8-failure-recovery">8. Failure recovery</h2>
<p>Every agent run is a finite state machine; failures land in named
recovery states; each recovery state has a manual override.</p>
<p>The states that matter beyond <code>failed</code>:</p>
<ul>
<li><code>stuck</code>: three consecutive plan steps failed to produce a
recognisable next action. Push to a queue read by a human
triager. Replay-friendly.</li>
<li><code>escalated</code>: agent returned &quot;hand off.&quot; A human picks up the
full event log inside our internal ops UI and continues from
the last state with a <code>human_resume</code> event.</li>
<li><code>quarantined</code>: schema-validation failures that look adversarial
(e.g., the agent kept emitting tool-calls with <code>borrower_id</code> set
to the <em>coordinator's</em> user ID). These don't replay. They alert
on PagerDuty.</li>
</ul>
<p>A specific lesson, paid in production: <strong>don't auto-resume
<code>escalated</code>.</strong> If a human said this needs eyes, an automatic
resume two hours later because of a queue redelivery will surprise
that human in the worst possible way. Resume only on explicit
human action. Ask me how I learned this. Actually, don't.</p>
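<p>A sketch of the guard this implies, assuming resume attempts carry a trigger label. The trigger names and the <code>RESUME_TRIGGERS</code> table are hypothetical; the three-failure threshold for <code>stuck</code> is the one described in the list above.</p>
<pre><code>RESUME_TRIGGERS = {
    "awaiting_tool": {"tool_callback", "scheduler", "redelivery"},
    "stuck": {"human"},
    "escalated": {"human"},  # a queue redelivery must never wake these up
}


def may_resume(state, trigger):
    """Only triggers listed for the state may move the run forward again."""
    return trigger in RESUME_TRIGGERS.get(state, set())


def is_stuck(plan_outcomes, threshold=3):
    """True when the last `threshold` plan steps all failed to yield a recognisable action."""
    tail = plan_outcomes[-threshold:]
    return len(tail) == threshold and all(outcome == "unrecognised" for outcome in tail)
</code></pre>
<p>States missing from the table, terminal ones included, resume on nothing, so the default is to stay put.</p>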
<h2 id="9-agent-coordination">9. Agent coordination</h2>
<p>Multi-agent setups are oversold and undersold at the same time.
Most &quot;multi-agent&quot; systems are one orchestrator plus a few
narrow-skill agents. We have three:</p>
<ul>
<li>A <strong>planner</strong> that owns the run and chooses sub-tasks.</li>
<li>A <strong>researcher</strong> that does retrieval and summarisation against
episodic memory and the loan/transaction history.</li>
<li>A <strong>drafter</strong> that writes outbound messages in Bahasa Indonesia,
fine-tuned for collections tone (firm, lawful, never threatening).
The fine-tune mattered. The off-the-shelf model wrote messages
that read as condescending in formal Bahasa Indonesia.</li>
</ul>
<p>Coordination is just the planner calling the others as tools. They
have their own tool-call surfaces, their own audit trails, their
own per-task token budgets. They don't share memory directly. They
share it through the orchestrator's event log.</p>
<p>The &quot;agents that talk to other agents in an ad-hoc swarm&quot; pattern
sounds clever and produces remarkable demos. In production it's a
debugging nightmare. Replays are non-deterministic, blame is
diffuse, unit tests are basically impossible. We don't run it.
Maybe in 2027 the tooling catches up.</p>
<h2 id="10-long-running-execution">10. Long-running execution</h2>
<p>Some workflows take days. Our loan-restructuring agent runs as a
saga: it waits for the borrower to respond, escalates internally,
schedules a callback for next Monday, and so on. The agent run can
be alive for two weeks of wall-clock time across maybe 90 seconds
of compute.</p>
<p>This works because the orchestrator is the durable state, not the
process. Workers are stateless; they grab a run, advance it one
step, release it. A <code>cron</code>-style scheduler nudges runs whose
<code>next_check_at</code> is in the past. The runs themselves don't sit in
memory waiting; they sit in Postgres.</p>
<p>The thing that kept biting us: <strong>wall-clock timeouts inside
prompts.</strong> &quot;If you haven't received a response in 24 hours,
escalate&quot; worked great until daylight saving time. Jakarta doesn't
observe DST, but our customers' phones sync from carriers that
sometimes report the wrong time, and the agent's notion of &quot;24 hours&quot; was
inferred from the prompt, not the clock. We pulled every time
calculation out of the model and into the orchestrator. The agent
only sees <code>time_since_last_contact: 26h13m</code> as a structured input,
never raw timestamps. Days got easier.</p>
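<p>A sketch of that orchestrator-side calculation, assuming both timestamps are stored timezone-aware. <code>time_since</code> and <code>should_escalate</code> are hypothetical helper names; the output format mirrors the <code>26h13m</code> shape above.</p>
<pre><code>from datetime import datetime, timedelta, timezone


def time_since(last_contact, now):
    """Render elapsed time like '26h13m' from two timezone-aware datetimes."""
    minutes = int((now - last_contact).total_seconds() // 60)
    return f"{minutes // 60}h{minutes % 60:02d}m"


def should_escalate(last_contact, now, after=timedelta(hours=24)):
    """The 24-hour rule lives here, computed from the clock, not inferred by the model."""
    return now - last_contact >= after


last = datetime(2026, 3, 1, 2, 0, tzinfo=timezone.utc)
now = datetime(2026, 3, 2, 4, 13, tzinfo=timezone.utc)
model_input = {"time_since_last_contact": time_since(last, now)}
</code></pre>
<p>Subtracting aware datetimes compares absolute instants, so a device or carrier reporting a shifted local offset cannot stretch or shrink the interval.</p>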
<h2 id="what-you-actually-buy">What you actually buy</h2>
<p>When the system works, the agent isn't smarter than a junior
collector. It's <em>more consistent</em>. Available at 02:00. Doesn't
forget the borrower's last interaction. Doesn't let an inflammatory
message slip through. Triages 12,000 cases a day without burnout.
That's the value. The model is a small part of it.</p>
<p>The infrastructure (the durable orchestrator, the event log, the
permission enforcement, the observability) is what makes it real.
You can swap GPT-5 for Claude tomorrow and the system keeps
running. You can't swap the orchestrator without rewriting the
company.</p>
<p>If you're building one of these for an Indonesian company, three
things land harder than the tutorials suggest:</p>
<ol>
<li><strong>Data residency.</strong> Pin inference to <code>asia-southeast</code>. The
latency wins are real and the OJK conversation gets shorter.</li>
<li><strong>Bahasa drafting tone.</strong> Off-the-shelf models produce outputs that
read as condescending in formal Bahasa Indonesia. You will fine-tune.</li>
<li><strong>WhatsApp.</strong> Every workflow ends at WhatsApp. Build the WA
tool first, and treat its quirks (Cloud API rate limits,
template approvals, the 24-hour service window) as first-class
infra constraints. They are.</li>
</ol>
<p>The rest is engineering.</p>
]]></content:encoded>
</item>
<item>
  <title>How I cut a lending app&apos;s API latency by ~30%</title>
  <link>https://irvineafri.com/blog/cutting-api-latency-with-a-data-transfer-layer</link>
  <guid isPermaLink="true">https://irvineafri.com/blog/cutting-api-latency-with-a-data-transfer-layer</guid>
  <pubDate>Fri, 16 Jan 2026 00:00:00 +0000</pubDate>
  <description>How I cut average API latency by ~30% on the Kredit Pintar lending backend. No clever caching trick, just a slow audit of a transfer layer nobody had looked at in a while.</description>
  <content:encoded><![CDATA[<p>Most &quot;I made the API faster&quot; posts read like magic-trick demos.
Clever caching layer in act two, latency graph drops in act three,
applause. The Kredit Pintar transfer-layer work didn't feel like
that. It felt like a slow, deliberate audit that paid off because
nobody had done one in a while.</p>
<p>This is what actually happened.</p>
<h2 id="where-we-started">Where we started</h2>
<p>Kredit Pintar is a lending app with more than five million monthly
active users. The backend is mostly Java on Spring Boot, MySQL
underneath, a busy mesh of services on Kubernetes with Argo CD
shipping changes. The data-transfer layer (the code that takes a
request, talks to whatever systems we depend on, and shapes a
response back to the caller) had grown organically. That's the
polite way of saying every owner who'd touched it had added the
field they needed and left.</p>
<p>The symptom showed up on the graphs. P50 and P95 on a handful of
hot endpoints had been creeping up. Nothing dramatic, nothing
pager-worthy, just enough that on-call kept flagging it in weekly
reviews.</p>
<h2 id="two-weeks-of-reading">Two weeks of reading</h2>
<p>The first two weeks I didn't write any new code. I read code.
Then I read traces. Then I read more code. Looking back, I wish
I'd spent two days up front on better profiling tooling. By the
time I had the picture clear, I'd already half-formed the wrong
hypothesis twice.</p>
<p>Two patterns surfaced once I'd done enough of that:</p>
<ol>
<li><strong>Redundant serialisation.</strong> The same payload was being
serialised, sent across a hop, deserialised, then re-serialised
one or two hops downstream. Fields nobody ever read travelled
the whole way for free.</li>
<li><strong>Chatty round trips.</strong> A surprising number of &quot;one logical
request&quot; flows were actually three sequential calls under the
hood. Each cheap on its own. The latencies stacked.</li>
</ol>
<p>A token-bucket rate limiter is the kind of thing every fintech
backend grows somewhere on the hot path. The shape below is the
same one that lives behind <code>/api/lab/latency</code> on this site —
<code>/labs/latency</code> runs it live against three handler variants:</p>
<div data-runnable-go="cGFja2FnZSBtYWluCgppbXBvcnQgKAoJImZtdCIKCSJ0aW1lIgopCgp0eXBlIGJ1Y2tldCBzdHJ1Y3QgewoJdG9rZW5zICAgZmxvYXQ2NAoJY2FwICAgICAgZmxvYXQ2NAoJcmF0ZSAgICAgZmxvYXQ2NCAvLyB0b2tlbnMgcGVyIHNlY29uZAoJbGFzdFRpY2sgdGltZS5UaW1lCn0KCmZ1bmMgKGIgKmJ1Y2tldCkgdGFrZShub3cgdGltZS5UaW1lKSBib29sIHsKCWIudG9rZW5zICs9IG5vdy5TdWIoYi5sYXN0VGljaykuU2Vjb25kcygpICogYi5yYXRlCglpZiBiLnRva2VucyA+IGIuY2FwIHsKCQliLnRva2VucyA9IGIuY2FwCgl9CgliLmxhc3RUaWNrID0gbm93CglpZiBiLnRva2VucyA8IDEgewoJCXJldHVybiBmYWxzZQoJfQoJYi50b2tlbnMtLQoJcmV0dXJuIHRydWUKfQoKZnVuYyBtYWluKCkgewoJYiA6PSAmYnVja2V0e2NhcDogNSwgcmF0ZTogMiwgdG9rZW5zOiA1LCBsYXN0VGljazogdGltZS5Ob3coKX0KCWZvciBpIDo9IDA7IGkgPCAxMDsgaSsrIHsKCQlpZiBiLnRha2UodGltZS5Ob3coKSkgewoJCQlmbXQuUHJpbnRmKCJyZXEgJTJkOiBva1xuIiwgaSkKCQl9IGVsc2UgewoJCQlmbXQuUHJpbnRmKCJyZXEgJTJkOiByYXRlLWxpbWl0ZWRcbiIsIGkpCgkJfQoJCXRpbWUuU2xlZXAoMTUwICogdGltZS5NaWxsaXNlY29uZCkKCX0KfQo="></div>
<h2 id="what-i-actually-changed">What I actually changed</h2>
<p>There was no single magical change. The win was the cumulative
effect of small ones:</p>
<ul>
<li>A clearer contract between the API surface and the systems
behind it. One round trip per logical operation where it used
to be two or three.</li>
<li>Tighter request shapes. Fields nobody downstream consumed
stopped travelling the wire.</li>
<li>Backwards-compatible adapters at the seams, so the rewrite
could ship in chunks and reach production traffic gradually
instead of one terrifying cutover.</li>
</ul>
<p>The unglamorous list is the win. The graph dropped because of the
list, not because of any one item on it.</p>
<h2 id="keeping-myself-honest">Keeping myself honest</h2>
<p>Two things kept me honest, and both saved me at least once.</p>
<p><strong>Traffic mirror in staging.</strong> I replayed real production
requests against the new and old paths side by side and diffed
the responses. The first time I caught a regression I was sure
wasn't there (a one-character bug in a default-value fallback),
that diff was the only reason I caught it before customers did.</p>
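<p>The diffing half of that mirror is simple enough to sketch. This assumes both paths return JSON that has already been parsed into dicts; <code>diff_responses</code> is a hypothetical helper and the sample payloads are invented. The production stack is Java; Python is used here only to keep the sketch short.</p>
<pre><code>def diff_responses(old, new, path=""):
    """Yield (path, old_value, new_value) for every leaf that differs between two responses."""
    if isinstance(old, dict) and isinstance(new, dict):
        for key in sorted(set(old) | set(new)):
            yield from diff_responses(old.get(key), new.get(key), f"{path}/{key}")
    elif old != new:
        yield (path, old, new)


old_path = {"limit": 5000000, "currency": "IDR", "meta": {"tier": "A"}}
new_path = {"limit": 500000, "currency": "IDR", "meta": {"tier": "A"}}
differences = list(diff_responses(old_path, new_path))
</code></pre>
<p>One caveat with this version: a missing key and an explicit null both show up as <code>None</code>, so it cannot tell them apart. For catching a wrong default value, as in the example, that is good enough.</p>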
<p><strong>Slow rollout.</strong> Small percentage of real traffic at first, with
the old path still hot enough to fall back to. Boring. Effective.
The day the new path emitted a malformed response under one
specific timezone offset, rollback was a single config flip.</p>
<h2 id="the-result">The result</h2>
<p>Average API latency on the rewritten paths dropped by roughly
<strong>30%</strong>. P95 followed it down. The team shipped seven major
features in the same window without slipping the rewrite or each
other.</p>
<h2 id="what-id-do-differently">What I'd do differently</h2>
<p>Spend more of the early days on profiling tooling. The instinct
on a project like this is to start writing the new layer right
away. The higher-leverage move is to make it cheap to know where
time is actually being spent, and <em>then</em> start writing.</p>
<p>The other lesson, which I keep relearning: the boring, careful
audit is almost always faster than the clever rewrite. Most
performance wins at scale aren't hidden. They're sitting in the
code, waiting for somebody to read it slowly. The hard part
isn't the change. The hard part is taking two weeks to read
first.</p>
]]></content:encoded>
</item>
<item>
  <title>A backend engineer&apos;s cheatsheet for Indonesian payment rails</title>
  <link>https://irvineafri.com/blog/indonesian-payment-rails-cheatsheet</link>
  <guid isPermaLink="true">https://irvineafri.com/blog/indonesian-payment-rails-cheatsheet</guid>
  <pubDate>Wed, 24 Dec 2025 00:00:00 +0000</pubDate>
  <description>The map I wish someone had handed me on day one. BI-FAST, QRIS, GPN, SKN, RTGS, OVO, GoPay, DANA, virtual accounts. What each rail is for, what its latency actually is, and which idempotency story to trust.</description>
  <content:encoded><![CDATA[<blockquote>
<p>Working notes from three years of wiring Indonesian payment rails into
bank and lending backends. The companion lab is at
<a href="/labs/rails">/labs/rails</a> — same data, sortable.</p>
</blockquote>
<p>There's a moment, the first time you wire up Indonesian payments,
when you realise the question &quot;how do I take payment?&quot; has a dozen
different answers. Each has its own latency story, idempotency
contract, and refund path. The overseas tutorials don't help. They
explain Stripe and Adyen, and <em>neither of those is the rail</em>. The
rail is BI-FAST or QRIS or GPN, sitting underneath an acquirer or
a wallet that may also be the rail.</p>
<p>This is the map I wish somebody had handed me on day one.</p>
<h2 id="the-five-families-that-matter">The five families that matter</h2>
<p>You can group every domestic rail into five families:</p>
<ol>
<li><strong>Instant inter-bank</strong>: BI-FAST. Real-time, 24/7, capped at IDR 250M.<sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup>
The default rail for retail transfers since 2021.</li>
<li><strong>QR</strong>: QRIS. One QR code reads in every wallet and every bank app.
Interoperability is the whole point. Speed is incidental.</li>
<li><strong>Domestic switching</strong>: GPN. Routes domestic debit-card
transactions through Indonesian switches. Cheaper than international
schemes, slower to dispute.</li>
<li><strong>Clearing and high-value</strong>: SKN (batch clearing) and BI-RTGS (high
value, real-time gross). Different shapes, different occasions.
Payroll goes on SKN, treasury goes on RTGS.</li>
<li><strong>Closed-loop wallets</strong>: OVO, GoPay, DANA, ShopeePay, LinkAja. Each
is its own network, plus a QRIS interface, plus an in-app SDK.</li>
</ol>
<p>Cards (Visa/Mastercard) sit slightly outside this taxonomy. Still
ubiquitous for cross-border and high-AOV, still the only rail with a
real chargeback story, still the most expensive.</p>
<h2 id="latency-but-honest">Latency, but honest</h2>
<p>The lab page sorts by latency, and that's misleading without context.
&quot;Latency&quot; here is end-to-end, from &quot;I called the API&quot; to &quot;the
counterparty sees the money&quot;. Within that bar:</p>
<ul>
<li>Wallet APIs (OVO, GoPay, DANA) are fast, typically 2 to 3 seconds,
because both legs sit inside the wallet's perimeter.</li>
<li>BI-FAST is also fast, typically 5 seconds, but p99 climbs into the
tens of seconds when the receiving bank drags its feet.</li>
<li>QRIS <em>acks</em> in 3 seconds, but merchant settlement is T+1.</li>
<li>SKN is <em>batch</em>. Four windows per business day. The &quot;latency&quot; is
effectively the wait until the next window.</li>
<li>RTGS is real-time, but business hours only.</li>
</ul>
<p>Two practical implications:</p>
<ol>
<li>If your customer is staring at a screen, you want a wallet, QRIS,
or BI-FAST. SKN is for things they don't watch land.</li>
<li>If your reconciliation runs daily, the difference between 3 seconds
and 30 seconds is invisible. Pick on cost and on idempotency
semantics, not on raw speed.</li>
</ol>
<h2 id="idempotency-stories--read-these-carefully">Idempotency stories — read these carefully</h2>
<p>Every rail says &quot;we're idempotent,&quot; and <em>every rail means a different
thing by it</em>.</p>
<ul>
<li><strong>BI-FAST</strong>: unique transaction ID per request. Reuse returns the
prior result, including the prior error. The sender bank is the
source of truth.</li>
<li><strong>QRIS</strong>: one QR string is one transaction. Double-scan is blocked
at the PJP layer. Your job is to not reuse the QR.</li>
<li><strong>GPN cards</strong>: the ARN (Acquirer Reference Number) is the
fingerprint. If your retry doesn't carry the same merchant
transaction reference, the issuer treats it as a brand new
authorization.</li>
<li><strong>OVO / GoPay / DANA</strong>: partner-supplied idempotency key on a custom
header. The wallet's API stores the key and replays the prior
response on retry. The retention window varies. Assume 24 hours and
verify in the API docs.</li>
<li><strong>SKN</strong>: batch + reference. Reverse clearing is your only out.</li>
<li><strong>VA</strong>: the VA <em>number</em> is the idempotency token. Once a VA is paid,
paying it again either bounces or creates a duplicate at the
acquirer's discretion. Not a contract you want to lean on.</li>
</ul>
<p>The rule I've internalised: <strong>carry an idempotency key on every
external call, whether the rail demands one or not.</strong> Even when the
rail enforces uniqueness for you, your code reaches the rail through
wrappers and middleware, and the wrappers will retry. If the wrapper
retries silently and the rail accepts the retry as new, your ledger
is wrong. Fix that on your side.</p>
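<p>A minimal sketch of that rule, not taken from any rail's SDK. It
assumes a hypothetical <code>send_to_rail</code> callable and uses an
in-memory dict where a real system would use a durable table. The
names <code>IdempotentClient</code>, <code>new_key</code>, and
<code>fake_rail</code> are all illustrative:</p>
<pre><code class="language-python">import uuid


class IdempotentClient:
    """Replays the first stored result for a key instead of re-sending."""

    def __init__(self, send_to_rail):
        # send_to_rail is a hypothetical callable taking (key, payload).
        self._send = send_to_rail
        # Assumption: a dict stands in for a durable table here.
        self._results = {}

    def call(self, key, payload):
        if key in self._results:
            # A silent retry from any wrapper or middleware lands here.
            return self._results[key]
        result = self._send(key, payload)
        self._results[key] = result
        return result


def new_key():
    # Generate once per business operation, then reuse on every retry.
    return str(uuid.uuid4())


sent = []


def fake_rail(key, payload):
    # Stand-in for a rail that would accept a retry as a new transaction.
    sent.append((key, payload))
    return {"status": "ACCEPTED", "attempt": len(sent)}


client = IdempotentClient(fake_rail)
key = new_key()
first = client.call(key, {"amount_idr": 47500})
retry = client.call(key, {"amount_idr": 47500})
</code></pre>
<p>The retry returns the stored result and the rail is only called
once. The sketch ignores the crash window between sending and storing;
a sturdier version would record the key as in-flight before the call
goes out.</p>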
<h2 id="refund-paths--the-unsexy-column">Refund paths — the unsexy column</h2>
<p>This is the one that bites in production. The lab has the row-by-row
detail; the headline is:</p>
<ul>
<li><strong>Wallets</strong> have proper refund APIs. Use them.<sup id="fnref:2"><a href="#fn:2" class="footnote-ref" role="doc-noteref">2</a></sup></li>
<li><strong>VA, BI-FAST, SKN</strong> have no scheme refund. You fire a <em>new</em>
counter-transfer, and your accounting reflects it as such.</li>
<li><strong>Cards</strong> have the strongest dispute story (90-day chargebacks) and
the weakest refund-to-customer-satisfaction ratio.</li>
<li><strong>QRIS</strong> sits awkwardly in between. In-session reversal works.
Later reversals go through the PJP, which means support tickets.</li>
</ul>
<p>If you're building a customer-facing product, refund-path quality is
the single biggest reason to prefer wallets over VAs, even when the
MDR looks worse on paper.</p>
<h2 id="the-simulator-companion">The simulator companion</h2>
<p>The <a href="/labs/payments">payment-flow simulator</a> is the live-coding
companion to all of this. It encodes the same patterns: idempotent
debits, double-delivered webhooks, timeout-with-retry,
partial-failure reconciliation. It doesn't pick a specific rail.
Pair the two: the cheatsheet for &quot;what shape is this rail?&quot;, the
simulator for &quot;what does it do under failure?&quot;.</p>
<h2 id="what-this-isnt">What this isn't</h2>
<p>Not a regulatory primer. Cite Bank Indonesia and OJK directly for
that. Not a contract. The numbers are public-range estimates and
will be wrong for the largest merchants. Not exhaustive. The
e-money schemes (LinkAja, ShopeePay) and the corporate rails
(CMS, host-to-host) sit alongside this list and don't fit on one
screen.</p>
<p>What it is: the page I wish I could have shown my past self the
week before I started writing the M-Syariah Payment API. If you're
that engineer right now, this is for you.</p>
<div class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1">
<p>The IDR 250M cap is the scheme-level ceiling per BI's PADG
23/25/PADG/2021. Sending banks can apply tighter caps; check your
issuer.&#160;<a href="#fnref:1" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:2">
<p>Test the refund path on day one of integration, not
week three. Most outages I've seen on payment integrations were
refund-shaped, not authorization-shaped.&#160;<a href="#fnref:2" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
</ol>
</div>
]]></content:encoded>
</item>
<item>
  <title>Integrating OVO, GoPay, and DANA into a Sharia core banking system</title>
  <link>https://irvineafri.com/blog/integrating-ovo-gopay-dana-into-syariah-banking</link>
  <guid isPermaLink="true">https://irvineafri.com/blog/integrating-ovo-gopay-dana-into-syariah-banking</guid>
  <pubDate>Sun, 16 Mar 2025 00:00:00 +0000</pubDate>
  <description>Notes from wiring three Indonesian e-wallets into the Bank Mega Syariah core. Idempotency, reconciliation, why the UI is the easy part, and the things I&apos;d still do the same way.</description>
  <content:encoded><![CDATA[<p>If you live in Indonesia, you probably moved money through OVO,
GoPay, or DANA this morning without thinking about it. That
&quot;without thinking about it&quot; is the whole game in payments. Inside
the bank that connects to those wallets, it's also the part that
eats the most engineering time.</p>
<p>This is what I picked up designing the Payment API at Bank Mega
Syariah, the one that wired our core banking platform into all
three wallets.</p>
<h2 id="why-this-is-hard-before-you-write-any-code">Why this is hard before you write any code</h2>
<p>The first time someone says &quot;let's integrate three e-wallets,&quot; it
sounds like roughly three times the work of integrating one. It
isn't. Each wallet has:</p>
<ul>
<li>Its own dialect for requests and responses.</li>
<li>Its own webhook model — when it fires, how it retries, what it
guarantees about delivery.</li>
<li>Its own reconciliation cadence and statement format.</li>
<li>Its own definition of success and failure, and a wider gap than
you'd expect between &quot;we accepted your message&quot; and &quot;the money
moved&quot;.</li>
</ul>
<p>Multiply that by the bank side. The core banking system is the
source of truth. Ledger postings have to be exact. A lost message
means a real person is missing real money. What looked like one
project becomes three half-projects plus the glue that joins them.</p>
<p>The glue is the project. Most of my time was the glue.</p>
<h2 id="the-shape-i-ended-up-with">The shape I ended up with</h2>
<p>One unified Payment API in front of the core, with thin adapters
per wallet behind it. The internal contract is one shape; each
wallet's dialect lives in its adapter and doesn't leak inward.
That sentence is the whole architecture. Everything else was
details.</p>
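<p>A minimal sketch of that spine, not the production code. The two
wallet dialects, the field names, and the <code>X-Idempotency-Key</code>
header are hypothetical, chosen only to show two different wire shapes
hanging off one internal request:</p>
<pre><code class="language-python">from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class PaymentRequest:
    # The one internal shape the rest of the bank sees.
    idempotency_key: str
    account_id: str
    amount_idr: int


class WalletAdapter(Protocol):
    def to_wire(self, req: PaymentRequest): ...


class WalletAAdapter:
    # Hypothetical dialect: flat body, key carried in a header.
    def to_wire(self, req: PaymentRequest):
        return {
            "headers": {"X-Idempotency-Key": req.idempotency_key},
            "body": {"customer": req.account_id, "amount": req.amount_idr},
        }


class WalletBAdapter:
    # Hypothetical dialect: nested body, amount sent as a string.
    def to_wire(self, req: PaymentRequest):
        return {
            "headers": {},
            "body": {
                "ref": req.idempotency_key,
                "payer": {"id": req.account_id},
                "value": str(req.amount_idr),
            },
        }


ADAPTERS = {"wallet_a": WalletAAdapter(), "wallet_b": WalletBAdapter()}


def build_wire_request(wallet: str, req: PaymentRequest):
    # The unified API picks an adapter; callers never see a dialect.
    return ADAPTERS[wallet].to_wire(req)


req = PaymentRequest("key-123", "acct-9", 47500)
wire_a = build_wire_request("wallet_a", req)
wire_b = build_wire_request("wallet_b", req)
</code></pre>
<p>Adding another wallet in this shape means one more adapter class
and one more entry in <code>ADAPTERS</code>. Nothing upstream of
<code>build_wire_request</code> changes.</p>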
<p>The pieces I'd call out, in order of how badly each one bites if
you skimp on it:</p>
<p><strong>1. Strong idempotency keys on every external call.</strong> A network
blip should never end with the user double-charged. Getting this
right at the start is cheap. Getting it wrong is a regulator
asking why, three months in, two specific accounts are out by
IDR 47,500.</p>
<p><strong>2. Webhooks: separate &quot;did the message arrive&quot; from &quot;is the
ledger consistent&quot;.</strong> It's tempting to do both in one handler.
Don't. You'll lose either reliability or correctness, and you'll
find out which one at 3am.</p>
<p><strong>3. A daily reconciliation job that proves the ledger.</strong> The
unglamorous, schedule-driven thing that catches the cases your
live code missed. Treat it as a first-class part of the product,
not a clean-up phase you'll add later when there's time. There's
never time.</p>
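<p>Piece 2 is the easiest one to blur, so here is a minimal sketch of
the split. It assumes SQLite as a stand-in for whatever durable store
sits behind the handler. The table names, the
<code>receive_webhook</code> and <code>process_inbox</code> functions,
and the bare-amount payload are all hypothetical:</p>
<pre><code class="language-python">import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE inbox ("
    "event_id TEXT PRIMARY KEY, body TEXT, processed INTEGER DEFAULT 0)"
)
db.execute("CREATE TABLE ledger (event_id TEXT PRIMARY KEY, amount_idr INTEGER)")


def receive_webhook(event_id, body):
    # Step 1: durably record arrival and acknowledge. No ledger logic here.
    # INSERT OR IGNORE makes a double-delivered webhook a no-op.
    with db:
        db.execute(
            "INSERT OR IGNORE INTO inbox (event_id, body) VALUES (?, ?)",
            (event_id, body),
        )
    return 200


def process_inbox():
    # Step 2: a separate worker posts to the ledger, once per event.
    rows = db.execute(
        "SELECT event_id, body FROM inbox WHERE processed = 0"
    ).fetchall()
    for event_id, body in rows:
        with db:  # posting and marking commit together or not at all
            db.execute(
                "INSERT OR IGNORE INTO ledger (event_id, amount_idr) VALUES (?, ?)",
                (event_id, int(body)),
            )
            db.execute(
                "UPDATE inbox SET processed = 1 WHERE event_id = ?", (event_id,)
            )


receive_webhook("evt-1", "47500")
receive_webhook("evt-1", "47500")  # same webhook, delivered twice
process_inbox()
process_inbox()  # running the worker again posts nothing new
ledger_rows = db.execute("SELECT event_id, amount_idr FROM ledger").fetchall()
</code></pre>
<p>The handler can acknowledge quickly because it does one insert. The
worker can be slow, retried, or restarted, and the ledger still ends up
with a single posting for <code>evt-1</code>.</p>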
<h2 id="what-surprised-me">What surprised me</h2>
<p>How much of the work is naming things. &quot;Pending&quot; in OVO's world is
not &quot;pending&quot; in your ledger's world. &quot;Failed&quot; might be retryable
or it might be terminal. Different wallet, different answer. The
discipline of writing the <em>internal</em> contract, the names and
states the rest of the bank's code sees, mattered more than any
one integration.</p>
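<p>To make the naming problem concrete, here is a small sketch of an
internal vocabulary with a per-wallet mapping. The wallet names and raw
status strings are hypothetical placeholders, not the real values any
of the three wallets return:</p>
<pre><code class="language-python">from enum import Enum


class PaymentState(Enum):
    PENDING = "pending"
    SETTLED = "settled"
    FAILED_RETRYABLE = "failed_retryable"
    FAILED_TERMINAL = "failed_terminal"


# Hypothetical raw statuses per wallet. Real values differ.
STATUS_MAP = {
    "wallet_a": {
        "PENDING": PaymentState.PENDING,
        "SUCCESS": PaymentState.SETTLED,
        "FAILED": PaymentState.FAILED_RETRYABLE,
    },
    "wallet_b": {
        "PROCESSING": PaymentState.PENDING,
        "PAID": PaymentState.SETTLED,
        "FAILED": PaymentState.FAILED_TERMINAL,
    },
}


def normalize(wallet, raw_status):
    # Unknown statuses raise instead of guessing, so a new wallet value
    # is seen by a person before it reaches the ledger.
    try:
        return STATUS_MAP[wallet][raw_status]
    except KeyError:
        raise ValueError(f"unmapped status {raw_status!r} from {wallet}") from None


a = normalize("wallet_a", "FAILED")
b = normalize("wallet_b", "FAILED")
</code></pre>
<p>The same raw string, <code>FAILED</code>, lands on two different
internal states depending on who sent it. Code further in only ever
sees <code>PaymentState</code>.</p>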
<p>Once we had a clean internal vocabulary, adding a fourth wallet
would have taken a week, not a quarter. We never did add a fourth,
but the hypothetical was the proof that the design worked.</p>
<h2 id="the-thing-nobody-tells-you-about-payment-integrations">The thing nobody tells you about payment integrations</h2>
<p>The UI is the easy part. The first time the M-Syariah app showed
a green tick that said &quot;transfer successful,&quot; it was thrilling.
The real work was making that tick <em>not lie</em>. Under packet loss.
Under timeouts. When the wallet is briefly down on a Saturday
afternoon. When their webhook arrives twice, fifteen minutes
apart. When their webhook never arrives at all, and your
reconciliation job has to figure it out the next morning.</p>
<p>If the green tick is honest, you've done the hard work. If it's
optimistic, you're a support ticket waiting to happen. There's no
third option.</p>
<h2 id="lessons">Lessons</h2>
<ul>
<li>Treat reconciliation as a product feature, not an operational
afterthought. Design it on day one. It's the only thing that
catches what live code missed.</li>
<li>The internal contract is the most important part of any
multi-provider integration. The adapters are mechanical; the
contract is the design.</li>
<li>&quot;Idempotent&quot; is a property of <em>the system</em>, not just <em>the call</em>.
It only holds when storage, retries, and consumers all
cooperate. Any one of them silently retrying breaks the property.</li>
<li>Test the refund path on day one of integration, not week three.
Most of the production outages I saw on payment work were
refund-shaped, not authorization-shaped.</li>
</ul>
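<p>For the reconciliation lesson, the core comparison is small enough
to sketch. This assumes both sides have already been reduced to a map
of reference to amount in IDR. The function name, the report keys, and
the sample data are illustrative:</p>
<pre><code class="language-python">def reconcile(ledger, statement):
    """Compare reference-to-amount maps from the ledger and the wallet."""
    report = {
        "missing_in_ledger": [],
        "missing_in_statement": [],
        "amount_mismatch": [],
    }
    for ref in sorted(set(ledger) | set(statement)):
        if ref not in ledger:
            # The wallet moved money we never posted, e.g. a lost webhook.
            report["missing_in_ledger"].append(ref)
        elif ref not in statement:
            # We posted something the wallet has no record of.
            report["missing_in_statement"].append(ref)
        elif ledger[ref] != statement[ref]:
            report["amount_mismatch"].append((ref, ledger[ref], statement[ref]))
    return report


ledger = {"tx-1": 47500, "tx-2": 120000, "tx-4": 10000}
statement = {"tx-1": 47500, "tx-2": 125000, "tx-3": 30000}
report = reconcile(ledger, statement)
</code></pre>
<p>Each bucket is a different kind of follow-up: a missing ledger entry
usually means a late posting, a missing statement entry means chasing
the wallet, and a mismatch means someone has to look at both.</p>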
<p>If I were doing the same work today on a greenfield stack, the
shape would still be this one. Different language, different
cloud, maybe an event-sourced ledger instead of the postings
model. But the unified-API-with-thin-adapters spine, strong
idempotency, reconciliation as a feature, those would be on the
wall on day one.</p>
]]></content:encoded>
</item>
<item>
  <title>The train that taught me distributed systems</title>
  <link>https://irvineafri.com/blog/the-train-that-taught-me-distributed-systems</link>
  <guid isPermaLink="true">https://irvineafri.com/blog/the-train-that-taught-me-distributed-systems</guid>
  <pubDate>Sun, 15 Jan 2023 00:00:00 +0000</pubDate>
  <description>My favourite project is still a model train I helped wire up at UGM in 2022. Here&apos;s what a Raspberry Pi and a pair of rails taught me before I knew the word &quot;microservice&quot;.</description>
  <content:encoded><![CDATA[<p>When someone asks me about distributed systems, the example I keep
reaching for is a model train. People look at me funny. Fair enough.
But the project I keep coming back to in interviews, in my head, and
every time I draw a state machine on a whiteboard, is a miniature
railway I helped build at UGM in 2022.</p>
<p>So here's what that train taught me, in software-people words.</p>
<h2 id="what-it-actually-was">What it actually was</h2>
<p>A model train you could drive over the web. We sat in a small lab
in the Faculty of Engineering. There were rails on a desk and a
Raspberry Pi acting as the brain. You opened a web app, picked a
train, set a speed, switched a light on or off. Down at the rails,
a protocol called Digital Command Control encoded those
instructions onto the same pair of wires that carried the power.</p>
<p>The Raspberry Pi was the whole stack. Backend in Go, frontend in
Flask + Python, hardware loop running off the same board. My
Bachelor's thesis later pushed the work further with a Python
prototype of the DCC controller proper, hitting millisecond
precision on the wire. There's a tiny simulation of the same idea
on this site at <a href="/labs/train">/labs/train</a>. It's a toy. The
original was a slightly bigger toy.</p>
<h2 id="the-lessons-that-travelled">The lessons that travelled</h2>
<p>I didn't know the term &quot;distributed systems&quot; yet. Looking back,
the lab project was a small, complete one:</p>
<p><strong>Latency budgets are real.</strong> DCC cares about timing in
milliseconds. If your encoder slips, the decoder on the train gets
confused, and the locomotive sits there blinking. That's a debugger
you can hear. Years later when an SRE complained that p99 was up by
20ms, I knew exactly what he meant. I'd stood next to a train that
went silent because of less.</p>
<p><strong>State machines beat ad-hoc logic.</strong> A train can be moving,
stopped, accelerating, switching tracks, or in an error state.
The moment I drew that as a graph and made the transitions
explicit, the bugs almost stopped. On every backend project
since, I draw the state machine first. It's the cheapest
debugging investment I know.</p>
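<p>A minimal sketch of what "explicit transitions" can look like, using
the five states named above. This is not the lab code, and the allowed
edges are illustrative guesses rather than the real railway's rules:</p>
<pre><code class="language-python">from enum import Enum, auto


class TrainState(Enum):
    STOPPED = auto()
    ACCELERATING = auto()
    MOVING = auto()
    SWITCHING_TRACKS = auto()
    ERROR = auto()


# Allowed transitions, written out explicitly. The edges are assumptions.
TRANSITIONS = {
    TrainState.STOPPED: {
        TrainState.ACCELERATING,
        TrainState.SWITCHING_TRACKS,
        TrainState.ERROR,
    },
    TrainState.ACCELERATING: {
        TrainState.MOVING,
        TrainState.STOPPED,
        TrainState.ERROR,
    },
    TrainState.MOVING: {TrainState.STOPPED, TrainState.ERROR},
    TrainState.SWITCHING_TRACKS: {TrainState.STOPPED, TrainState.ERROR},
    TrainState.ERROR: {TrainState.STOPPED},
}


class Train:
    def __init__(self):
        self.state = TrainState.STOPPED

    def transition(self, target):
        if target not in TRANSITIONS[self.state]:
            # An illegal move fails loudly instead of corrupting state.
            raise ValueError(
                f"illegal transition {self.state.name} to {target.name}"
            )
        self.state = target


train = Train()
train.transition(TrainState.ACCELERATING)
train.transition(TrainState.MOVING)
</code></pre>
<p>With the table in place, asking a moving train to switch tracks
raises a <code>ValueError</code> at the call site, which is a much
shorter debugging session than finding out at the rails.</p>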
<p><strong>The frontend is also distributed.</strong> A web app that talks to a
Pi running a hardware loop is not the same as a web app that talks
to a database. The browser doesn't know that. Figuring out what to
show the user when the train is <em>probably</em> fine but you haven't
heard back yet is a small version of every distributed-systems
problem you'll hit later. &quot;Probably fine&quot; turned out to be the
whole job.</p>
<p><strong>Hardware tells the truth.</strong> Software lies all the time. A
misbehaving distributed system can hide behind retries and logs.
A locomotive that sits there blinking can't hide behind anything.
There's something honest about building systems where
the failure mode is visible. I miss it sometimes, working in
fintech, where most failures stay invisible until reconciliation
day.</p>
<h2 id="why-i-keep-talking-about-it">Why I keep talking about it</h2>
<p>These days I work in fintech. I think about ledgers, idempotency,
timeouts, reconciliation. None of that is very far from a model
train and a state machine on a Pi.</p>
<p>The path from &quot;model train on a desk&quot; to &quot;five-million-MAU lending
app&quot; is shorter than it sounds, if you stay curious about the
seams.</p>
]]></content:encoded>
</item>
</channel>
</rss>
