Irvine Afri Dwicahya

Porting a Next.js app to React Native overnight, with an agent loop

Tue, 12 May 2026 00:00:00 +0000

How I ran a single-threaded fleet of Claude Code agents against a 115-task queue and got 50 tasks (5 of 12 milestones) shipped to main while I slept, without ever letting an agent touch git.

This is a working journal, not a polished pitch. The system is still running as I write. The numbers below are accurate as of the most recent milestone push.

The setup

I run KelasJenius, an interactive learning platform for Indonesian SMP/SMA students. The web app (Next.js 14 + Fastify + Postgres) is mature: auth, subscriptions, quizzes, duels over WebSocket, AI tutor, parent portal, the works.

The mobile app needs to ship to the App Store and Google Play with 1-for-1 parity to the web student experience, plus native Apple IAP and Google Play Billing. That's a ~15-week solo project at full-time pace. I do not have 15 weeks. I have nights and weekends.

So I built a system that lets a swarm of one agent at a time, working in sequence, walk a dependency-ordered task queue while I sleep.

In the last ~36 hours of wall-clock time the loop has shipped:

M0 (backend prep), 7 tasks: bearer-token auth, WS auth via query param, /api/version/check, device registration, IAP migration scaffolding, CORS for native, refresh-in-body
M1 (mobile foundation), 11 tasks: Expo SDK 54 + RN 0.81.5 scaffold, monorepo wiring, NativeWind, providers, MMKV+TanStack, apiFetch with sliding renewal, smoke test
M2 (design system packages/ui-mobile), 9 tasks: tokens, primitives (KjButton/Pressable/Card/Screen/Text/Input/Skeleton), 506 SVG icons ported by codemod, theme provider, toast, motion (KjXpPopup/KjStreakFlame), data badges, dev gallery
M3 (auth shell), 8 tasks: login, register, forgot/reset/verify-email, useCurrentUser, profile tab + settings sheet, force-upgrade gate
M4 (core content), 9 tasks: KaTeX-via-WebView, lesson reader, dashboard, subjects tree, paywall placeholder, offline cache, offline banner
M5 (quiz), in progress, 4 of 10 done: state machine + 32 tests, session screen UI, confirm-before-submit, reveal animations next

That's 50 of 115 tasks, with 1,565+ tests green at the most recent milestone gate.

The shipping isn't the interesting part. The interesting part is how small the set of design choices is that made it boring enough to ship overnight without me at the keyboard.

The shape of the problem

Long-horizon agent work fails in three predictable places.

Drift across tasks. Agent #4 builds a thing on top of Agent #2's misunderstanding of the spec. The error compounds.

Untracked state. "Which tasks are done? Which are blocked? What did the last agent change?" If the answers live in chat scrollback you've already lost.

Git becomes the contention point. Twelve agents force-pushing over each other, or one agent amending a commit a downstream agent already pulled. The repo's history is the single most valuable artifact, and touching it carelessly destroys the run.

The system I'll describe addresses each of those head-on. The architecture is boring on purpose.

The three artifacts that run everything

Everything the loop needs lives in three places under apps/mobile/plans/:

apps/mobile/plans/
├── TASKS.md                          ← the queue (115 rows, 12 milestones)
├── logs/agents.log                   ← append-only audit log
├── 00-README.md                      ← the master plan
├── 01-architecture-decisions.md      ← non-negotiables, locked
├── 02-phase-0-backend-prep.md        ← preconditions + concrete tasks
├── 03-phase-1-foundation.md
├── 04-phase-2-design-system.md
├── 05-phase-3-auth-shell.md
├── 06-phase-4-core-content.md
├── 07-phase-5-quiz-daily.md
├── 08-phase-6-duel-realtime.md
├── 09-phase-7-social-leaderboard.md
└── 10-phase-8-advanced.md

TASKS.md, the queue

The queue is a flat markdown table per phase. Every row is one task with: ID, title, link to a spec section in the phase doc, deps, status, commit SHA, last-updated timestamp.

| ID    | Title                       | Spec link                              | Deps        | Status      | Commit  | Updated              |
| T4.6  | Lesson reader screen        | 06-phase-4-core-content.md#task-46     | T4.2, T4.5  | done        | 3fd8fa6 | 2026-05-12T12:30:00Z |
| T5.1a | Quiz state machine module   | 07-phase-5-quiz-daily.md#task-51       | T4.6        | done        |         | 2026-05-12T15:27:00Z |
| T5.1b | Quiz session screen UI      | 07-phase-5-quiz-daily.md#task-51       | T5.1a, T2.7 | done        |         | 2026-05-12T16:45:00Z |
| T5.1c | Confirm-before-submit flow  | 07-phase-5-quiz-daily.md#task-51       | T5.1b       | done        |         | 2026-05-12T17:30:00Z |
| T5.1d | Reveal animations + haptics | 07-phase-5-quiz-daily.md#task-51       | T5.1c       | in_progress |         | 2026-05-12T18:00:00Z |

Four statuses: todo, in_progress, done, blocked (external).

Status counts live at the top of the file and must equal working-tree truth, not git history. The orchestrator and any watchdog reconcile against the working-tree file:

- todo: 51
- in_progress: 1
- done: 53
- blocked (external): 10
- Total: 115

This is the only shared state between the orchestrator and any agent. There is no database, no task service, no Jira sync. The file is the queue.
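To make the "counts must equal working-tree truth" rule concrete, here is a minimal Python sketch of a reconciliation check. It assumes the seven-column row layout shown above and the "- status: N" count lines; the function names (parse_rows, declared_counts, count_mismatches) are hypothetical, and how the real file labels blocked rows is not something I can confirm from the excerpt.

```python
from collections import Counter

FIELDS = ["id", "title", "spec", "deps", "status", "commit", "updated"]

def parse_rows(text):
    """Return one dict per task row; the header row and any separator row are skipped."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue
        cells = [c.strip() for c in line.strip("|").split("|")]
        if len(cells) != len(FIELDS) or cells[0] in ("", "ID") or set(cells[0]) <= set("-:"):
            continue
        row = dict(zip(FIELDS, cells))
        row["deps"] = [d.strip() for d in row["deps"].split(",") if d.strip()]
        rows.append(row)
    return rows

def declared_counts(text):
    """Read the '- status: N' lines at the top of the file (the Total line is ignored)."""
    counts = {}
    for line in text.splitlines():
        if line.startswith("- ") and ":" in line:
            key, _, value = line[2:].partition(":")
            if value.strip().isdigit() and key.strip().lower() != "total":
                counts[key.strip()] = int(value)
    return counts

def count_mismatches(text):
    """Map status -> (declared, actual) for every status whose counts disagree."""
    actual = Counter(row["status"] for row in parse_rows(text))
    return {
        status: (declared, actual[status])
        for status, declared in declared_counts(text).items()
        if actual[status] != declared
    }

# Abbreviated sample in the same shape as the rows above (spec links shortened).
SAMPLE = """\
- todo: 0
- in_progress: 1
- done: 2

| ID    | Title                       | Spec link | Deps        | Status      | Commit  | Updated              |
| T4.6  | Lesson reader screen        | 06.md     | T4.2, T4.5  | done        | 3fd8fa6 | 2026-05-12T12:30:00Z |
| T5.1a | Quiz state machine module   | 07.md     | T4.6        | done        |         | 2026-05-12T15:27:00Z |
| T5.1d | Reveal animations + haptics | 07.md     | T5.1c       | in_progress |         | 2026-05-12T18:00:00Z |
"""
```

An empty dict from count_mismatches(SAMPLE) means the header counts and the rows agree. Anything else is a bookkeeping bug to fix before the next spawn.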

agents.log, the audit trail

Every agent return appends one structured block:

## 2026-05-12T08:25:00Z — T5.1a done (milestone-pending)
- agent: a8ae91c3e6ac30d62
- duration: 12m 11s
- files: apps/mobile/lib/quiz/quizMachine.ts (new — pure reducer + useQuizMachine hook),
         apps/mobile/lib/quiz/__tests__/quizMachine.test.ts (new, 32 tests),
         apps/mobile/plans/TASKS.md
- tests: 32 new tests added; workspace total now 57
- summary: Quiz state machine shipped. Pattern: useReducer + pure reducer + custom hook.
  ... [paragraph of substance: what changed, what was decided, why] ...
  Critical API correction: spec mentioned `GET /sessions/:id/next-question` but that
  endpoint does NOT exist in the live API. I verified against `apps/api/src/routes/sessions.ts`.
  The actual web flow loads all questions upfront via
  `GET /subjects/:s/topics/:t/subtopics/:st/questions`. The machine loads the full
  question array at startSession and advances client-side.
  Bug fixed during review: `questionsAnswered` was off-by-one; corrected to length.
- notes: M5 status 1/7 done. Handoff to T5.1b: consume useQuizMachine, call startSession
  on mount, observe state, drive selection via selectAnswer(optionId), submission via
  submitAnswer()...

milestone-pending is a placeholder. When the milestone pushes, the orchestrator rewrites these to the real short SHA in a follow-up commit.

The notes: field at the bottom is the most important part. Every agent ends its entry with a handoff to the next agent: the precondition the next task can rely on, the path the previous agent actually used (not what the spec said), and any decision the next agent does not need to re-litigate.

This is how drift gets contained. The next agent reads the last 1–3 log entries before claiming, so it inherits a precise mental model of what is true in the working tree right now instead of reasoning from the spec alone.
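Reading "the last 1–3 log entries" can be done mechanically. The sketch below is one way to do it in Python, assuming entries start with a "## " heading and the handoff lives in a "- notes:" field with indented continuation lines, as in the entry above. The helper names are hypothetical, the sample headings are abbreviated, and the second sample entry is made up for the example.

```python
def last_entries(log_text, n=3):
    """Split an agents.log body on '## ' headings and return the last n entries."""
    entries, current = [], []
    for line in log_text.splitlines():
        if line.startswith("## "):
            if current:
                entries.append("\n".join(current))
            current = [line]
        elif current:
            current.append(line)
    if current:
        entries.append("\n".join(current))
    return entries[-n:]

def handoff_notes(entry):
    """Return the '- notes:' field as one string, including continuation lines."""
    lines = entry.splitlines()
    for i, line in enumerate(lines):
        if line.startswith("- notes:"):
            parts = [line[len("- notes:"):].strip()]
            for cont in lines[i + 1:]:
                if cont.startswith("- ") or cont.startswith("## "):
                    break  # next field or next entry
                parts.append(cont.strip())
            return " ".join(p for p in parts if p)
    return ""

SAMPLE_LOG = """\
## 2026-05-12T08:25:00Z T5.1a done (milestone-pending)
- agent: a8ae91c3e6ac30d62
- summary: Quiz state machine shipped.
- notes: M5 status 1/7 done. Handoff to T5.1b: consume useQuizMachine,
  call startSession on mount.
## (made-up timestamp) T5.1b done (milestone-pending)
- agent: hypothetical-agent-id
- notes: Handoff to T5.1c: session screen renders from machine state.
"""

for entry in last_entries(SAMPLE_LOG, n=2):
    print(handoff_notes(entry))
```

Feeding only the handoff paragraphs into the spawn prompt keeps the context small while still carrying the corrections the previous agents made.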

The phase docs

Each phase doc is self-contained. It states:

Preconditions ("M3 must be green; bearer auth must exist")
Concrete file paths ("create app/(dashboard)/subjects/[slug]/[topicSlug]/[subtopicSlug]/index.tsx")
Code sketches, just enough to anchor the structure, never enough to copy-paste
Acceptance checklist ("done when: WebView renders KaTeX block math correctly, light/dark theme switches without remount lag, no console errors")

The locked-in 01-architecture-decisions.md is the bedrock: Expo SDK 54, New Architecture on, NativeWind v4, TanStack Query + MMKV, expo-secure-store for JWT, KaTeX via WebView. An agent that proposes Zustand or AsyncStorage for tokens gets reverted.

The execution model (and how it evolved)

I tried three workflows in the first 24 hours. The third one stuck.

Attempt 1: branch + PR + merge-bot per task

Each task spawns a worktree, the agent works, opens a PR, a merge-bot watches CI and merges.

Why it failed: per-task PRs created 100+ tiny PRs and a merge queue. Cross-task drift surfaced in PR review, which the agent then had to relitigate. Cognitive cost per task was too high to be worth the audit trail.

Attempt 2: main-only direct push per task

Each agent works directly on main, runs pnpm verify, commits and pushes if green.

Why it failed: two problems. Rollback granularity was per-task, which is fine if a single agent broke something but useless if a sequence of agents had compounded a subtle error. And the git log was an unreadable wall of 50+ commits per evening, with the actual feature unit (e.g. "M4 core content") spread across nine commits and three days of intermediate state.

There was also an integrity issue: agents occasionally forgot to flip TASKS.md to done, and the orchestrator's bookkeeping had to chase the agent's git history instead of the agent's reported state.

Attempt 3: main-only, milestone-batched, no-git agents (current)

This is the model that's been running. The rules:

Agents do not touch git. At all. They edit code, run pnpm verify, edit TASKS.md in the working tree, and return.
The only git command an agent may run is the initial git switch main && git pull --rebase origin main at start.
When all tasks in a milestone group reach status=done in the working tree, the orchestrator captures one squash-style commit per milestone and pushes.
A separate chore(plans): record M SHA follow-up commit backfills the SHA into the Commit column of each row in that milestone.

The orchestrator loop fits in a dozen lines:

loop:
  assert in_progress == 0                # working-tree TASKS.md, not origin/main
  next = lowest-numbered todo with all Deps == done
  if next is None:
    if an unpushed working-tree milestone is complete: push milestone (see below), goto loop
    elif any blocked exist:                surface blocker, halt loop
    else:                                  all done, exit loop
  spawn 1 agent on `next`                 # agent does NOT touch git beyond initial pull
  wait for agent to return with status=done in working-tree TASKS.md
  append entry to apps/mobile/plans/logs/agents.log (working tree, no commit)
  if this task is the last in its milestone group: push milestone
  goto loop
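The selection step ("lowest-numbered todo with all Deps == done") is the only part of the loop with real logic in it. A minimal Python sketch follows, assuming IDs look like T5.1a and that tasks have already been parsed into a dict. task_key and pick_next are hypothetical names, not the orchestrator's actual code.

```python
import re

def task_key(task_id):
    """Sort key for IDs like 'T5.1a' -> (5, 1, 'a'), so T5.2 sorts before T5.10."""
    match = re.fullmatch(r"T(\d+)\.(\d+)([a-z]?)", task_id)
    if match is None:
        raise ValueError(f"unrecognised task id: {task_id}")
    return int(match.group(1)), int(match.group(2)), match.group(3)

def pick_next(tasks):
    """tasks: {id: {"status": str, "deps": [ids]}}. Returns the next runnable id or None."""
    if any(t["status"] == "in_progress" for t in tasks.values()):
        raise RuntimeError("precondition failed: in_progress must be 0")
    ready = [
        task_id
        for task_id, task in tasks.items()
        if task["status"] == "todo"
        # A dependency that is missing from the queue counts as not done.
        and all(tasks.get(dep, {}).get("status") == "done" for dep in task["deps"])
    ]
    return min(ready, key=task_key, default=None)
```

The numeric sort key matters: a plain string sort would run T5.10 before T5.2. Returning None leaves the "push, halt on blocker, or exit" decision to the caller, as in the pseudocode above.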

The milestone push itself:

1. git status                            # confirm working tree contains only milestone code + TASKS + log
2. git add -A
3. pnpm verify                           # one more time — catches integration drift between tasks
4. git commit -m ":  — Tx.y..Tx.z [M]" with co-author trailer
5. capture short SHA
6. backfill SHA into TASKS.md Commit columns; replace milestone-pending in agents.log
   chore(plans): record M SHA ()
7. git push origin main

This trades rollback granularity (you can only revert a whole milestone) for shippable units (every commit on main is a complete, tested, parity-checked feature group). Per-task pnpm verify is still the per-task quality gate; the per-milestone re-verify catches anything that snuck between tasks.

The per-milestone re-verify has caught a real bug exactly once so far: a type drift between T2.3c and T2.4a where a primitive's prop signature shifted while icons were being ported. The orchestrator fixed it inline as part of the milestone commit (no separate task) and noted the drift in the log.

The `pnpm verify` gate, the only quality contract

There is no PR review. There are no human checkpoints during the loop. pnpm verify is the single quality contract. It runs:

| Gate       | Mechanism                                                                          | Scope                                    |
| Emoji ban  | grep over Unicode ranges, allowlisted paths                                        | every UI workspace (incl. `apps/mobile`) |
| type-check | `pnpm -r --if-present run type-check` (`tsc --noEmit`)                             | every workspace                          |
| lint       | `pnpm -r --if-present run lint` (eslint / next lint / expo lint)                   | every workspace                          |
| unit tests | `pnpm --filter test`                                                               | 14 workspaces                            |
| build      | `pnpm -r --if-present run build` (skipped pre-commit, runs on verify / pre-push)   | every workspace                          |

It runs automatically on git commit via core.hooksPath=.githooks. Emergency override is VERIFY_SKIP=1. The rule: only for genuine fires, fix the root cause next.

Two design choices that pay rent.

A per-workspace test:unit / test:integration split. apps/api and packages/db own DB-backed integration suites that need a live Postgres on port 14002. The unit slice runs everywhere (fresh clone, sandbox, CI), and the per-task agent loop only runs test:unit. The orchestrator runs test:integration once at milestone boundaries on a machine with the test DB up. This split is the difference between a 6-second per-task gate and a 90-second one.

A self-asserting matrix. packages/types/src/__tests__/verify-coverage.test.ts declares which workspaces are expected to participate in which gates. If a new workspace is added without being wired into scripts/verify.sh, that test fails. The gate audits itself.

Lessons (the hard-won kind)

1. Specs are guidance. Code is truth.

The single most common failure mode across 53 completed tasks was spec drift. The phase docs were written upfront, the code evolved, the agents trusted the spec. Examples from the log:

T5.1a: spec said GET /sessions/:id/next-question. That endpoint does not exist in the live API. The agent verified against apps/api/src/routes/sessions.ts, found that the actual web flow loads questions upfront via a different endpoint, and built the state machine around the real shape. The handoff note to T5.1b documented the correction so the next agent didn't re-discover it.
T0.1: spec said apps/api/src/middleware/auth.ts but the live code path was apps/api/src/plugins/auth.ts. Agent updated the real file. Handoff noted the canonical path so T0.7 didn't trip on the same thing.
T4.3: spec pseudocode used KjScreen onRefresh/refreshing props. The real component takes refreshControl={}. Multiple primitive prop drifts caught in one task.

The rule I baked into every agent's spawn prompt:

If the spec disagrees with the live code, the live code wins. Update the spec section's path/shape if you're sure, and document the correction in your handoff note.

The cost is one extra grep per task. The benefit is that every subsequent agent inherits a corrected model.

2. Force the handoff. Don't trust the agent to volunteer it.

Half the value of the agents.log entries is the bottom notes: field. The first dozen agents barely filled it in. So the spawn prompt became explicit:

Your final report MUST include a Handoff paragraph for the next dependent task: the precondition it can rely on, the path you actually used (not what the spec said), and any decision it does not need to re-litigate.

After this change, every entry has a usable handoff. The pattern is so reliable I caught one bug just by reading the previous entry's handoff against the current task's spec. They disagreed, the previous agent had been right, and the spec was stale.

3. Agents bail mid-investigation. Make them flip the row before they exit.

This was the most expensive failure mode. An agent finishes the code, runs pnpm verify, sees green, then, instead of flipping the TASKS.md row to done, drops out of the loop with "Let me check the actual component interfaces" as its final line. The work is done. The bookkeeping is not.

When the orchestrator goes to claim the next row, it finds the previous row still in_progress and refuses to spawn (the precondition is in_progress == 0). The orchestrator has to absorb the bookkeeping by hand.

The fix in the spawn prompt:

Before reporting, you MUST: (1) run pnpm verify to completion, (2) flip your row in TASKS.md to done, (3) decrement in_progress and increment done in the status counts. Report only after these three things are visible in the working tree.

Plus an explicit confirmation line in the report:

"TASKS.md flipped to done, counts updated."

After this change the bail rate dropped to roughly zero. Two agents that did bail in M4 (T4.4, T4.7) were caught by the orchestrator at the spawn precondition check and the row was finalized in seconds, with the agent's actual work intact in the working tree.

4. Codex review as a cheap second opinion

After any non-trivial implementation, I run:

codex exec --sandbox read-only "Review for bugs and logic errors"

It's a different model with a fresh context window reading the diff cold. It catches things the implementing agent missed because the implementing agent was deep inside its own assumptions.

The KjLessonWebView task (T4.2) is a clean example. The implementing agent shipped it. Codex flagged two real issues: (1) onHeightChange presence was incorrectly switching the WebView to content-height layout mode, and (2) DOM_READY_JS was running twice (once inside buildKatexDoc's DOMContentLoaded handler, again via injectedJavaScript). Both got fixed in the same commit before the milestone push.

I treat Codex as a peer reviewer with zero relationship to the agent that wrote the code. The cost is one tool call per task. The catch rate is meaningful.

5. Codemod what you can

packages/ui/src/icons/icon-renderers.tsx has 519 named SVG icons used across the web app. The naive approach (hand-port each to react-native-svg) was budgeted at three days.

Instead, T2.4a hand-ported the first 30 to establish the pattern: default export function, react-native-svg elements, SvgComponentProps props. Then T2.4b ran a codemod at packages/ui-mobile/scripts/port-icons.mjs over the remaining 489. 476 ported cleanly. The other 13 needed hand-porting because they use SVG elements or .map() in their renderers, and the skip list lives at packages/ui-mobile/src/icons/skipped.ts so the parity test can prove every web icon is either ported or explicitly skipped.

T2.4c ran a parity gate test: walk every icon in the web registry, assert it exists in the mobile registry or in the skip list. If a new web icon ships, the mobile gate fails until the icon is either ported or skipped. That gate runs as part of pnpm verify.
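The parity gate is set arithmetic. The real test lives in the TypeScript workspace; the Python sketch below only illustrates the check, with hypothetical icon names and a hypothetical parity_gaps helper. The two extra buckets (stale and unknown skips) are my own additions rather than something the post describes.

```python
def parity_gaps(web_icons, mobile_icons, skipped):
    """Compare registries; every web icon must be ported or explicitly skipped."""
    web, mobile, skip = set(web_icons), set(mobile_icons), set(skipped)
    return {
        "missing": sorted(web - mobile - skip),   # neither ported nor skipped: gate fails
        "stale_skips": sorted(skip & mobile),     # listed as skipped but actually ported
        "unknown_skips": sorted(skip - web),      # skip entry for an icon the web no longer has
    }

# Hypothetical registries for illustration only.
gaps = parity_gaps(
    web_icons=["arrow-left", "flame", "trophy", "chart"],
    mobile_icons=["arrow-left", "flame"],
    skipped=["chart"],
)
assert gaps["missing"] == ["trophy"]  # the gate would fail until 'trophy' is ported or skipped
```

Checking the skip list in both directions keeps it from rotting into a second source of drift.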

The whole sub-phase shipped in under three hours of wall clock, including writing the codemod itself. Three days saved.

6. Three-file env-var rule

Whenever any service reads process.env.X, the rule is:

Add the var with a safe default to .env.example
Add a VAR=${VAR} placeholder to .env.dokploy
Set the real value in the Dokploy production env config

Miss any of the three and the next deploy silently breaks. I shipped two regressions against this rule before automating it. Both took longer to debug than the rule takes to follow. The deploy-supervisor skill now scans process.env.X references against .env.dokploy at push time and refuses to deploy if any var is missing.

Same principle applied to the mobile build: every new env var consumed by apps/mobile (currently EXPO_PUBLIC_API_BASE_URL, APP_VARIANT) goes through all three files. If a future agent tries to read a new var without registering it, deploy-supervisor blocks the push.
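A scan like the one deploy-supervisor performs can be approximated in a few lines. This is a sketch under assumptions, not the skill's actual implementation: it only matches the literal process.env.NAME form (bracket access and destructuring would slip through), and the function names are hypothetical.

```python
import re

ENV_REF = re.compile(r"process\.env\.([A-Z][A-Z0-9_]*)")

def referenced_vars(source_text):
    """Names read via the literal process.env.NAME form."""
    return set(ENV_REF.findall(source_text))

def declared_vars(env_file_text):
    """Names on the left of '=' in a dotenv-style file; comments and blanks ignored."""
    names = set()
    for line in env_file_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        names.add(line.split("=", 1)[0].strip())
    return names

def missing_vars(sources, env_files):
    """env_files: {filename: text}. Returns {filename: sorted names used but not declared}."""
    used = set()
    for source_text in sources:
        used |= referenced_vars(source_text)
    return {
        filename: sorted(used - declared_vars(body))
        for filename, body in env_files.items()
    }

report = missing_vars(
    sources=["const url = process.env.EXPO_PUBLIC_API_BASE_URL; const v = process.env.APP_VARIANT;"],
    env_files={
        ".env.example": "EXPO_PUBLIC_API_BASE_URL=http://localhost:3000\nAPP_VARIANT=development\n",
        ".env.dokploy": "# production placeholders\nEXPO_PUBLIC_API_BASE_URL=${EXPO_PUBLIC_API_BASE_URL}\n",
    },
)
```

Here report flags APP_VARIANT as missing from .env.dokploy, which is the case that would block the push. The third leg of the rule (the real value in the Dokploy config) cannot be checked from the repo alone.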

7. Plan up front. Execute without thinking.

The 19 plan documents (00-README.md through 10-phase-8-advanced.md plus parity matrix and conventions) total roughly 130 KB of markdown. They were written before any code was. They include:

Locked architecture decisions (no agent may re-litigate)
Concrete file paths per task
Code sketches just detailed enough to anchor structure
"Done when" checklists
A glossary

Writing this upfront felt slow. It's the highest-leverage decision I've made on this project. Every minute spent writing a clear "Done when" line in T4.6 saved an hour of agent thrashing during execution. Agents that hit ambiguity stall and start asking the orchestrator questions, which means I get paged in the middle of the night.

The phase docs are written for "an autonomous coding agent (or human engineer) picking up cold." That framing forces self-containment.

What the math actually looks like

Wall clock over the recent two-day window:

M0 (backend prep): 7 tasks, ~1.5 hours
M1 (foundation): 11 tasks, ~3.5 hours including dependency churn
M2 (design system): 9 tasks, ~6 hours (the codemod sub-phase compressed what was budgeted as 3 days)
M3 (auth shell): 8 tasks, ~3 hours
M4 (core content): 9 tasks, ~4 hours including the KaTeX prototype and offline cache
M5 (quiz): 4 of 10 tasks shipped so far, ~1 hour

Total: ~19 hours of agent wall-clock for what the original plan estimated as ~7 weeks of solo founder calendar time. Not all of that was overnight, but most of M2–M4 ran while I was asleep. The orchestrator sent push notifications on milestone completions and on blocker surfacing; I woke up to a working (dashboard)/subjects/[slug]/[topicSlug]/[subtopicSlug] lesson reader I had not touched.

Things the system has not had to deal with yet:

Native module integration (Apple IAP, Google Play Billing, Phase 9)
Real device testing (currently sim-only; release pipeline is Phase 11)
A merge conflict (single-threaded execution + git pull --rebase at agent start prevents this entirely)

I expect Phase 9 (IAP) to be the model's first real stress test, because eight of those tasks are blocked on external Apple/Google account state that no agent can resolve.

What I'd tell someone setting this up tomorrow

Write the plan docs first. All of them. Before any code. The plan docs are the spec the agents read. If they're vague, the agents will fight the same battle three times across three tasks.
The queue is one markdown file. Not a database, not a task service. Drift between the file and the system breaks everything. Make the file the system.
Agents must not touch git. Let them code. Let them test. Let them flip the tracker. Push from one place, one time per milestone group. Audit log is append-only.
The pre-commit hook is your QA team. pnpm verify runs every gate every time. If it can't catch a class of bug, harden it once. Don't review by hand.
Force the handoff in the spawn prompt. The next agent's success depends on the previous agent's last paragraph. Make that paragraph contractual.
A second model reviews everything. Codex (or any agent with a fresh context window and read-only access) catches assumption-blindness from the implementing agent. It's the cheapest review you'll ever do.
Specs are guidance. Code is truth. Bake this into the spawn prompt verbatim. Agents that trust the spec over the code will compound errors.
Plan for the bail. Agents will exit mid-task. Make the orchestrator's precondition (in_progress == 0) self-healing: if a row is stuck in_progress, finalize it from the working-tree state and move on. Do not block the loop on a bail.
Milestone-batch the commits. Per-task commits are unreadable. Per-milestone commits are shippable units. The trade-off (coarser rollback granularity) is worth it for clean history and a clear push contract.
Push notifications on milestone completion and on blockers. Otherwise you wake up to a system that paused at 3 a.m. waiting for a question you could have answered in 30 seconds.

The bits I haven't solved yet

Honest list:

Phase 9 (IAP) has 8 externally-blocked tasks. Apple Developer enrollment, Small Business Program, App Store Connect product setup, Google Play Console, User Choice Billing application. The loop walks around them via the dependency graph, but the eventual unblocking is a sequence of two-day-each turnaround items that no automation can compress.
Real device testing. The smoke test passes on iOS Simulator and Android Emulator. Real-device QA on a TestFlight build is currently a manual gate scheduled for Phase 11.
Spec drift detection. Agents flag drift in their handoff notes, but the spec doc itself is never auto-updated. After M5 closes I plan a sweep agent that ingests every agents.log Spec drift: note and proposes corrections to the phase docs.
Long-form lessons learned never propagate back to the spawn prompt. The seven lessons from the Lessons section live in this blog post and in my head. They should live in a CONTRIBUTING-FOR-AGENTS.md that every spawn loads. That refactor is on the list.

Closing

None of this is novel. Every individual ingredient (append-only audit logs, single-threaded queues, pre-commit verification gates, milestone-batched commits, codemods for boring transforms, second-model review) is engineering practice from before LLMs existed.

What changed is that the things you used to need a team for now run on a laptop with an agent that you brief like a junior engineer. There's no clever prompt to copy. The work is writing a plan boring enough to execute mechanically, building a pre-commit gate strict enough to be the only reviewer, and refusing to let an agent touch the git history.

The hard part of solo engineering used to be doing the work. Now the hard part is deciding what work to do, and writing it down clearly enough that the agent doesn't have to ask.

Sources: apps/mobile/plans/TASKS.md, apps/mobile/plans/logs/agents.log, apps/mobile/plans/00-README.md, apps/mobile/plans/01-architecture-decisions.md, apps/mobile/CLAUDE.md, scripts/verify.sh. All numbers and quotes are from the actual files; nothing has been edited for narrative effect.

Autonomous agents inside an Indonesian company

Sat, 09 May 2026 00:00:00 +0000

Numbers are real but rounded. Rupiah figures use IDR 16,000/USD as the lazy exchange anchor I keep in my head. Calibrated against a 2026 Q1 production run on GCP asia-southeast2, hitting OpenAI via Azure Singapore, Anthropic in us-west, and a self-hosted Llama 3.3 70B for the cheap stuff.

Most "agent" articles pretend the loop is solved. Call the LLM, parse the tool call, run it, feed the result back. Done. That's the demo loop. The production loop is a different animal, and once you ship one of these for an Indonesian company with rupiah on the line and an OJK auditor on speed-dial, the differences stop being academic.

I've been running autonomous agents inside that kind of company for about a year. This is the writeup I wish somebody had handed me on day one. The audience is engineers who already know what an MCP server is, what a tool-call schema looks like, and roughly what an o1-style reasoning trace costs per token. I'm skipping the marketing layer.

What "agent" means here

A long-running process that takes a goal, plans, calls tools, watches the world, retries, escalates when it gets stuck, and produces a durable artifact. Not a chatbot. Not a single LLM call in a retry loop. Something with state that survives a process restart, and a coordinator that decides when the work is done.

The agent we run most often does collections triage. Given a delinquent borrower, it pulls the loan history, checks the WhatsApp engagement, drafts a tailored outreach, fires the first contact, watches the response, and either escalates to a human collector or schedules a follow-up. End to end: 40 to 90 seconds wall-clock, 20 to 50 LLM calls, 6 to 12 tool calls. Runs about 12,000 times a day at peak.

That's the shape. Now the parts.

1. Orchestration

First decision: graph framework or hand-rolled. We tried both. LangGraph, BAML, Inngest are all wonderful for the walkthrough demo. They become a tax the moment your control flow stops being a DAG. And real agent control flow is not a DAG. It has loops, dynamic branches based on tool output, and at-least-once retries that need state-machine guarantees the framework's abstractions weren't built to express. We spent more time fighting the framework than we saved.

So we wrote our own. State machine over Postgres + RabbitMQ. The shape:

[pending]
   │
   ▼
[running]  ◄────┐
   │            │  resumed after
   ▼            │  tool callback
[awaiting_tool]─┘
   │
   ▼
[completed | failed | escalated]

Every transition writes a row to agent_runs.events (append-only) and updates agent_runs.state atomically, in the same transaction. That single decision is load-bearing. Every model call, every tool call, every external observation lands in the database as an event. If a worker dies mid-run, and they do, often, because Indonesian data centres lose power in ways AWS post-mortems don't capture, another worker reads the log and resumes from the last consistent state.

The pseudocode that earns its keep:

def step(run_id):
    with txn():
        run = lock_run(run_id)
        if run.state == 'awaiting_tool':
            return  # someone else's problem
        events = load_events(run_id)
        next_action = plan(run, events)  # an LLM call

        if next_action.kind == 'tool':
            event = emit('tool_call.requested', next_action)
            run.state = 'awaiting_tool'
            run.save()
            enqueue_tool(event)         # RabbitMQ delayed-message exchange
        elif next_action.kind == 'finish':
            run.state = 'completed'
            run.save()
            emit('run.completed', next_action.result)

The trick is that awaiting_tool is a real, stable state with its own timeout. Tools are jobs, not function calls. Calling a tool means publishing a message. A callback later delivers the result. That's what makes a 90-second agent run with three external HTTP hops survivable when one of those hops takes 12 seconds because some upstream API is having a moment.
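For readers who want something they can run, here is a minimal sketch of the "event row and state update in the same transaction" idea. It swaps Postgres for an in-memory SQLite database and leaves RabbitMQ out entirely. The table layout, the ALLOWED transition map, and the function names are my assumptions for illustration, not the production schema.

```python
import json
import sqlite3

# Assumed transition map, loosely following the diagram above.
ALLOWED = {
    "pending": {"running"},
    "running": {"awaiting_tool", "completed", "failed", "escalated"},
    "awaiting_tool": {"running", "failed"},
}

def open_db():
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE agent_runs (id INTEGER PRIMARY KEY, state TEXT NOT NULL)")
    db.execute(
        "CREATE TABLE agent_run_events ("
        "seq INTEGER PRIMARY KEY AUTOINCREMENT, run_id INTEGER, kind TEXT, payload TEXT)"
    )
    return db

def transition(db, run_id, new_state, kind, payload):
    """Append the event and update the state in one transaction, or do neither."""
    with db:  # commits on success, rolls back if an exception escapes the block
        (state,) = db.execute(
            "SELECT state FROM agent_runs WHERE id = ?", (run_id,)
        ).fetchone()
        if new_state not in ALLOWED.get(state, set()):
            raise ValueError(f"illegal transition {state} -> {new_state}")
        db.execute(
            "INSERT INTO agent_run_events (run_id, kind, payload) VALUES (?, ?, ?)",
            (run_id, kind, json.dumps(payload)),
        )
        db.execute("UPDATE agent_runs SET state = ? WHERE id = ?", (new_state, run_id))

def current_state(db, run_id):
    return db.execute("SELECT state FROM agent_runs WHERE id = ?", (run_id,)).fetchone()[0]

db = open_db()
db.execute("INSERT INTO agent_runs (id, state) VALUES (1, 'pending')")
db.commit()
transition(db, 1, "running", "run.started", {})
transition(db, 1, "awaiting_tool", "tool_call.requested", {"tool": "hypothetical_tool"})
```

After these two calls the run sits in awaiting_tool with two events behind it. A worker that crashes at this point loses nothing, because the state and the log were written together. In Postgres the SELECT would also take a row lock, which is the job lock_run does in the pseudocode.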

2. Memory

There are three kinds, and they have nothing to do with each other. Pretending they're one thing (a "memory layer," a vector store) is the most common mistake I see.

Run-local memory is the scratchpad inside one agent run. Everything the agent has seen so far, including its own intermediate reasoning. We store it as the event log on agent_runs. Replaying the events deterministically reconstructs the prompt for the next step. Token budget: 32k before we summarise.

Episodic memory is what this agent remembers about this borrower across past runs. We tried vector stores: pgvector, Weaviate, Qdrant. Burned three months chasing retrieval relevance. What actually shipped was a structured episodic table:

CREATE TABLE borrower_episodes (
  borrower_id   bigint,
  episode_at    timestamptz,
  channel       text,        -- 'wa', 'voice', 'sms'
  outcome       text,        -- 'paid', 'pkpu', 'no_answer', ...
  notes         text,
  vector        vector(768)  -- mE5, multilingual
);

Retrieval is WHERE borrower_id = $1 ORDER BY episode_at DESC LIMIT 20. The vector column is reserved for the rare "find episodes semantically like this one" query that shows up maybe once a week. The vector index is the cherry on top, not the cake. People keep flipping that around.

Procedural memory is the prompt. We version every system prompt with git, hash it, and stamp the hash on every run. When somebody "fixes" a regression by editing the prompt, we can replay the offending run against both versions and see which one it was born under. Sounds boring. Will save you a sprint the first time a quality drop bisects to a four-word edit.
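Stamping a prompt hash on every run takes very little code. A minimal sketch, assuming SHA-256 over the UTF-8 prompt text with line endings normalised; the 12-character truncation and the function names are arbitrary choices for the example, not what we necessarily run.

```python
import hashlib

def prompt_hash(prompt_text):
    """Short, stable fingerprint of a system prompt."""
    normalised = prompt_text.replace("\r\n", "\n").encode("utf-8")
    return hashlib.sha256(normalised).hexdigest()[:12]

def stamp_run(run, prompt_text):
    """Return a copy of the run record carrying the hash of the prompt it ran under."""
    return {**run, "prompt_hash": prompt_hash(prompt_text)}

before = "You are a collections assistant. Be polite."
after = "You are a collections assistant. Be polite and very brief."  # a four-word edit

run_a = stamp_run({"run_id": 1}, before)
run_b = stamp_run({"run_id": 2}, after)
assert run_a["prompt_hash"] != run_b["prompt_hash"]
```

With the hash on every run, "which prompt was this run born under" becomes a column lookup instead of a dig through git history.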

3. Tools

The mistake is one big tool with a hundred arguments. The shape that survives is many small tools, each with a tight, validated input schema, each idempotent.

Every tool gets:

A Zod-style schema for inputs.
A canonical idempotency key derived from inputs + run id.
A timeout. p99 of normal latency × 3, capped at 30 seconds for the synchronous request, longer for the async job.
A circuit breaker per downstream system.
An audit row in agent_tool_calls with the full request and response payloads, encrypted at rest.
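Two items from that list reduce to small pure functions. The sketch below shows one plausible derivation of the idempotency key (a hash over run id, tool name, and canonically serialised inputs) and the timeout rule. The exact key recipe is my assumption; the post only says it is derived from inputs plus run id.

```python
import hashlib
import json

def idempotency_key(run_id, tool_name, inputs):
    """Same run + same tool + same inputs -> same key, regardless of dict ordering."""
    canonical = json.dumps(inputs, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    raw = f"{run_id}:{tool_name}:{canonical}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()

def sync_timeout_seconds(p99_latency_seconds, multiplier=3.0, cap=30.0):
    """p99 of normal latency x 3, capped at 30 seconds for the synchronous request."""
    return min(p99_latency_seconds * multiplier, cap)

key_a = idempotency_key("run-42", "send_whatsapp", {"borrower_id": 7, "message": "Halo"})
key_b = idempotency_key("run-42", "send_whatsapp", {"message": "Halo", "borrower_id": 7})
assert key_a == key_b  # a retry with reordered arguments must not double-send
```

Sorting keys before hashing is what makes the key canonical. Without it, two serialisations of the same call would look like two different requests to the downstream dedupe check.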

The audit table isn't optional. Indonesian fintechs have auditors, and when an auditor asks "what did this agent do on borrower xyz?", the answer needs to be one query. I've watched a peer team scramble for two days reconstructing this from logs after the fact. Don't be that team.

A failure that quietly costs you: the LLM hallucinates a tool name that doesn't exist, or hallucinates an argument with the slightly-wrong type. The framework most tutorials show you swallows this and feeds a string error back to the model, hoping it self-corrects. In production you want the orchestrator to detect "hallucinated tool / schema" as a category of failure, count it, alert when it spikes, and fall back to a smaller, stricter model for the next attempt. We've watched gpt-5 regress on a Wednesday afternoon because of a quiet upstream model update. That's where this metric earns its keep.
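Treating hallucinated calls as their own failure category can look roughly like this. The tool registry, the model names, and the strict type check are all hypothetical stand-ins; a real dispatcher would validate against the Zod-style schemas mentioned earlier and emit the counts to a metrics system.

```python
from collections import Counter

# Hypothetical registry: tool name -> {argument name: expected type}.
REGISTRY = {
    "send_whatsapp": {"borrower_id": int, "message": str},
    "get_loan_history": {"borrower_id": int},
}

failure_counts = Counter()  # stand-in for a real metrics counter

def classify_tool_call(name, args, registry=REGISTRY):
    """Return 'ok', 'hallucinated_tool', or 'hallucinated_schema'."""
    if name not in registry:
        return "hallucinated_tool"
    schema = registry[name]
    if not isinstance(args, dict) or set(args) != set(schema):
        return "hallucinated_schema"
    # Exact type match, so "7" (str) or True (bool) is rejected where int is expected.
    if any(type(args[field]) is not expected for field, expected in schema.items()):
        return "hallucinated_schema"
    return "ok"

def handle(name, args):
    """Classify, count failures, and pick the model for the next attempt."""
    outcome = classify_tool_call(name, args)
    if outcome != "ok":
        failure_counts[outcome] += 1
        return outcome, "strict-fallback-model"  # hypothetical smaller, stricter model
    return outcome, "primary-model"

handle("send_whatsapp", {"borrower_id": 7, "message": "Halo"})
handle("send_telegram", {"borrower_id": 7})
handle("get_loan_history", {"borrower_id": "7"})
```

After these three calls failure_counts holds one hallucinated_tool and one hallucinated_schema. Alerting on the rate of those two keys is what surfaces a quiet upstream model change.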

4. Permissions

The dangerous question: what is the agent allowed to do?

The lazy answer is "whatever its tools let it do." That answer ships exactly once. After that, compliance puts a hold on every agent project for six months. I've seen it happen.

What works:

Tools declare a capability (payment.disburse, borrower.send_wa, borrower.read_pii).
Each agent run is bound to an actor, not a service account. For autonomous runs, the actor is a synthetic identity tied to the workflow definition (agent:collections-tier-1).
The orchestrator enforces capability scoping before the tool is dispatched, against a per-actor policy table.
Capabilities have soft and hard caps. payment.disburse for agent:collections-tier-1 has a hard cap of IDR 0 (it cannot move money) and a soft cap of zero in any policy revision. Escalating beyond it requires a human approver in the event log, full stop.

The enforcement point matters. Don't put it in the tool. Put it in the dispatcher. Tools assume their inputs are already authorised. That's one audit boundary. Putting the check in N tools means N audit boundaries, written by N engineers, each of whom forgot something different. I learned this the slow way.
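Here is a sketch of what dispatcher-side enforcement might look like. The policy table layout, the tool dictionary shape and the exception name are assumptions. Only the capability and actor names come from the text above.

```python
class CapabilityDenied(Exception):
    """Raised by the dispatcher before a tool is ever called."""

# Hypothetical per-actor policy table: actor -> capability -> hard cap in IDR.
# A missing capability is not granted at all. None means "no amount cap".
POLICY = {
    "agent:collections-tier-1": {
        "borrower.read_pii": None,
        "borrower.send_wa": None,
        "payment.disburse": 0,  # present, but capped at IDR 0
    },
}

def dispatch(actor: str, tool: dict, args: dict, amount_idr: int = 0):
    """Check the policy, then call the tool. Tools never check for themselves."""
    grants = POLICY.get(actor, {})
    capability = tool["capability"]
    if capability not in grants:
        raise CapabilityDenied(f"{actor} lacks {capability}")
    cap = grants[capability]
    if cap is not None and amount_idr > cap:
        raise CapabilityDenied(f"{capability} capped at IDR {cap} for {actor}")
    return tool["handler"](**args)
```

The handler stays a plain function with no authorisation logic in it, which is what keeps the audit boundary in one place.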

5. Reliability

LLM endpoints are not reliable infrastructure. Treat them like flaky third-party APIs, because that is what they are.

Production reliability budget for a single agent run, last quarter:

Source	Failure rate (Q1 2026)	Mitigation
OpenAI 5xx	0.4%	retry × 2 with jitter
Anthropic 5xx	0.6%	retry × 2 with jitter
OpenAI rate-limit	1.1%	model-level priority queue
Tool timeout	0.9%	per-tool circuit breaker
Hallucinated schema	0.3%	strict-mode reattempt
Indo network	0.2%	connection pool warming + retry

Compose those naively and you get a 3.5% per-call failure rate. Across a 30-LLM-call run, the unmitigated joint failure probability is around 65%. Mitigations bring it under 2%. The gap between "demo works" and "demo works on Friday afternoon when GPT is degraded" is exactly this list.
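The arithmetic behind those two figures, as a small sketch. It assumes the failure sources simply add up per call and that the 30 calls fail independently, which is the "naive" composition and not a claim about how real failures correlate.

```python
# Per-call failure rates from the table above, as fractions.
failure_rates = {
    "openai_5xx": 0.004,
    "anthropic_5xx": 0.006,
    "openai_rate_limit": 0.011,
    "tool_timeout": 0.009,
    "hallucinated_schema": 0.003,
    "indo_network": 0.002,
}

def joint_failure(per_call: float, calls: int) -> float:
    """Probability that at least one of `calls` independent calls fails."""
    return 1 - (1 - per_call) ** calls

per_call = sum(failure_rates.values())  # naive composition: add the rates
unmitigated = joint_failure(per_call, 30)
```

Here per_call comes out at 0.035 and unmitigated at roughly 0.657, which is where the 3.5% and "around 65%" figures come from.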

Two patterns I keep coming back to:

Idempotent at the agent level, not just the tool level. If a worker crashes mid-step and another resumes, the resumer should produce the same effects, not duplicate ones. The event log is what enforces this. The resumer reads "tool X was already requested with idempotency key K" and skips re-emitting. The resume is silent.

A resume is not a retry. Resume picks up after the last durable state. Retry replays the last step. Both are needed, in different scenarios. Conflating them is how you send a borrower the same WhatsApp twice.
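The silent resume can be sketched like this. The event shape and the function signature are assumptions for illustration, and the log here is a plain list where the real one is durable storage.

```python
def resume(event_log: list, planned_calls: list, execute) -> None:
    """Continue a run after a crash without duplicating side effects.

    event_log: events already persisted, e.g. {"type": "tool_requested", "key": "K1"}
    planned_calls: (idempotency_key, payload) pairs in step order
    execute: callable that performs one tool call
    """
    already_requested = {
        event["key"] for event in event_log if event["type"] == "tool_requested"
    }
    for key, payload in planned_calls:
        if key in already_requested:
            continue  # already emitted by the crashed worker: skip silently
        # Record the request before acting, so the next resumer can see it.
        event_log.append({"type": "tool_requested", "key": key})
        execute(payload)
```

A retry is the other path: it replays one step with the same idempotency key and leans on the tool to deduplicate. This sketch only covers the resume.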

6. Observability

Tracing an agent is harder than tracing a microservice. A single run has dozens of LLM calls, dozens of tool calls, branching reasoning, prompt-version changes, and a result that may not be "success" or "failure" but "escalated to human."

What worked for us: OpenTelemetry for transport, Langfuse for the agent-aware UI, and a custom trace structure where every event in the agent's event log emits its own span.

run.collections_triage  74,231 ms
├─ plan.step.0           1,482 ms   gpt-5  · 2.4k/0.3k tok
├─ tool.borrower.read      210 ms
├─ plan.step.1           1,623 ms   gpt-5  · 4.1k/0.5k tok
├─ tool.wa.history       1,820 ms
├─ plan.step.2             842 ms   haiku  · 1.2k/0.1k tok
├─ tool.outreach.draft   3,118 ms
├─ tool.outreach.send   12,344 ms   ← retry × 2
├─ plan.step.3           1,099 ms   gpt-5
└─ run.completed

That view puts model timing, tool timing, per-step token cost, and retries onto one screen. When a teammate Slacks me "this run was slow," I can answer in under 30 seconds.

The metric that earned its keep: escalation rate per sub-workflow. Not per agent. Not per model. Per named step in the workflow. When a particular step starts escalating more often, it almost always points to a model regression, a prompt edit, or a downstream tool returning a new error shape. None of those show up on a top-level success metric.
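Computing that metric is a small aggregation. In this sketch the event fields ("step", "outcome") are assumed names, not our actual schema.

```python
from collections import defaultdict

def escalation_rate_by_step(events: list) -> dict:
    """Return {step name: fraction of executions that ended in escalation}."""
    totals = defaultdict(int)
    escalated = defaultdict(int)
    for event in events:
        totals[event["step"]] += 1
        if event["outcome"] == "escalated":
            escalated[event["step"]] += 1
    return {step: escalated[step] / totals[step] for step in totals}
```

Grouping by named step rather than by run is the whole point: a regression in one step stays visible instead of being averaged away.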

7. Scaling

The bottleneck is rarely compute. It's almost always one of: rate limits at the model provider, latency at a downstream tool, or worker concurrency tuned wrong.

Cost shape for our collections triage agent at 12,000 runs/day:

Component	Per run	Daily
GPT-5 plan steps	$0.014	$168
Haiku 4.5 sub-steps	$0.002	$24
Self-hosted Llama 3.3	$0.0008	$9.60
Postgres / RMQ infra	(amortised)	$42
Observability stack	(amortised)	$18
Total	$0.018	$216 / day

In rupiah that's about IDR 3.5M/day, or IDR ~290 per run. The human collector who would otherwise make the first call costs roughly IDR 14,000 per touch, all-in. Unit economics work, but only because we keep the planner cheap (Haiku on the easy steps, GPT-5 only when the plan branches into something nontrivial) and we cap out-of-budget runs at the orchestrator level. Without that cap, the first model spike caught us at 4× the budget for ten hours straight.
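The orchestrator-level cap can be as small as the sketch below. The class name is hypothetical, and so is the choice to count in integer micro-dollars to avoid float drift.

```python
class BudgetExceeded(Exception):
    """Raised when a run would spend past its cap."""

class RunBudget:
    def __init__(self, limit_micro_usd: int):
        self.limit = limit_micro_usd
        self.spent = 0

    def charge(self, cost_micro_usd: int) -> None:
        """Record a model call's cost, refusing it if the cap would be crossed."""
        # Check before spending, so a run can never overshoot its cap.
        if self.spent + cost_micro_usd > self.limit:
            raise BudgetExceeded(
                f"{self.spent} + {cost_micro_usd} exceeds {self.limit} micro-USD"
            )
        self.spent += cost_micro_usd
```

The planner would call charge() before each model call and move the run to a recovery state when it raises, rather than letting it keep spending.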

The scaling lever that mattered most was moving inference to asia-southeast. Cross-region calls to OpenAI's US endpoints were adding 180-220 ms median per call. On a 30-call run that's about 6 seconds of pure latency tax. Once we routed bulk traffic through Azure OpenAI in Singapore and kept Anthropic in us-west only for the long-context steps, p99 dropped from ~118 seconds to ~71. That is the difference between a borrower picking up the phone and not.

8. Failure recovery

Every agent run is a finite state machine; failures land in named recovery states; each recovery state has a manual override.

The states that matter beyond failed:

stuck: three consecutive plan steps failed to produce a recognisable next action. Push to a queue read by a human triager. Replay-friendly.
escalated: agent returned "hand off." A human picks up the full event log inside our internal ops UI and continues from the last state with a human_resume event.
quarantined: schema-validation failures that look adversarial (e.g., the agent kept emitting tool-calls with borrower_id set to the coordinator's user ID). These don't replay. They alert on PagerDuty.

A specific lesson, paid in production: don't auto-retry escalated. If a human said this needs eyes, an automatic resume two hours later because of a queue redelivery will surprise that human in the worst possible way. Resume only on explicit human action. Ask me how I learned this. Actually, don't.
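A sketch of that state machine with the "no auto-resume from escalated" rule built in. The state names follow the list above. The event names and the transition table are assumptions for illustration.

```python
# (current state, event) -> next state. Anything not listed is illegal.
ALLOWED = {
    ("running", "step_ok"): "running",
    ("running", "done"): "completed",
    ("running", "no_next_action_x3"): "stuck",
    ("running", "hand_off"): "escalated",
    ("running", "adversarial_schema"): "quarantined",
    ("stuck", "human_resume"): "running",
    ("escalated", "human_resume"): "running",
    # "quarantined" has no outgoing transitions: those runs do not replay.
}

def transition(state: str, event: str) -> str:
    """Return the next state, or raise if the event is not allowed here."""
    try:
        return ALLOWED[(state, event)]
    except KeyError:
        raise ValueError(f"illegal transition: {event!r} in state {state!r}") from None
```

Because a queue redelivery is not a human_resume event, it simply has no entry for an escalated run, and the surprise auto-resume becomes an error instead of a side effect.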

9. Agent coordination

Multi-agent setups are oversold and undersold at the same time. Most "multi-agent" systems are one orchestrator plus a few narrow-skill agents. We have three:

A planner that owns the run and chooses sub-tasks.
A researcher that does retrieval and summarisation against episodic memory and the loan/transaction history.
A drafter that writes outbound messages in Bahasa Indonesia, fine-tuned for collections tone (firm, lawful, never threatening). The fine-tune mattered. The off-the-shelf model wrote outputs that read as condescending in Bahasa formal.

Coordination is just the planner calling the others as tools. They have their own tool-call surfaces, their own audit trails, their own per-task token budgets. They don't share memory directly. They share it through the orchestrator's event log.

The "agents that talk to other agents in an ad-hoc swarm" pattern sounds clever and produces remarkable demos. In production it's a debugging nightmare. Replays are non-deterministic, blame is diffuse, unit tests are basically impossible. We don't run it. Maybe in 2027 the tooling catches up.

10. Long-running execution

Some workflows take days. Our loan-restructuring agent runs as a saga: it waits for the borrower to respond, escalates internally, schedules a callback for next Monday, and so on. A single run can be alive for two weeks of wall-clock time across maybe 90 seconds of compute.

This works because the orchestrator is the durable state, not the process. Workers are stateless; they grab a run, advance it one step, release it. A cron-style scheduler nudges runs whose next_check_at is in the past. The runs themselves don't sit in memory waiting; they sit in Postgres.
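The scheduler's nudge is essentially one query. The sketch below uses SQLite from the standard library so it runs anywhere. The table name, the column names other than next_check_at, and the 'waiting' state are assumptions. On Postgres, concurrent workers would typically also claim rows with FOR UPDATE SKIP LOCKED.

```python
import sqlite3

def due_runs(conn: sqlite3.Connection, now_epoch: int, limit: int = 10) -> list:
    """Return ids of waiting runs whose next_check_at is in the past, oldest first."""
    rows = conn.execute(
        "SELECT run_id FROM agent_runs "
        "WHERE state = 'waiting' AND next_check_at <= ? "
        "ORDER BY next_check_at LIMIT ?",
        (now_epoch, limit),
    ).fetchall()
    return [row[0] for row in rows]
```

Nothing in the worker holds the run between nudges. The row is the run.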

The thing that kept biting us: wall-clock timeouts inside prompts. "If you haven't received a response in 24 hours, escalate" worked great until daylight saving time. Jakarta doesn't observe DST, but our customers' phones sync from carriers that sometimes report the wrong time, and the agent's notion of "24 hours" was inferred from the prompt, not the clock. We pulled every time calculation out of the model and into the orchestrator. The agent only sees time_since_last_contact: 26h13m as a structured input, never raw timestamps. Days got easier.
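A sketch of the orchestrator-side calculation that produces a value like 26h13m. The function name and the decision to reject naive datetimes are assumptions. The output format follows the example in the text.

```python
from datetime import datetime, timezone

def time_since_last_contact(last_contact: datetime, now: datetime) -> str:
    """Format the elapsed time as e.g. '26h13m', using absolute time only."""
    # Require timezone-aware values so the subtraction is unambiguous.
    if last_contact.tzinfo is None or now.tzinfo is None:
        raise ValueError("timestamps must be timezone-aware")
    total_minutes = int((now - last_contact).total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours}h{minutes:02d}m"

# Usage: the model is handed the string, never the raw timestamps.
last = datetime(2026, 1, 1, 8, 0, tzinfo=timezone.utc)
now = datetime(2026, 1, 2, 10, 13, tzinfo=timezone.utc)
elapsed = time_since_last_contact(last, now)  # "26h13m"
```

The "escalate after 24 hours" rule then becomes a comparison in the orchestrator, not a judgment call by the model.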

What you actually buy

When the system works, the agent isn't smarter than a junior collector. It's more consistent. Available at 02:00. Doesn't forget the borrower's last interaction. Doesn't let an inflammatory message slip through. Triages 12,000 cases a day without burnout. That's the value. The model is a small part of it.

The infrastructure (the durable orchestrator, the event log, the permission enforcement, the observability) is what makes it real. You can swap GPT-5 for Claude tomorrow and the system keeps running. You can't swap the orchestrator without rewriting the company.

If you're building one of these for an Indonesian company, three things land harder than the tutorials suggest:

Data residency. Pin inference to asia-southeast. The latency wins are real and the OJK conversation gets shorter.
Bahasa drafting tone. Off-the-shelf produces outputs that read as condescending in Bahasa formal. You will fine-tune.
WhatsApp. Every workflow ends at WhatsApp. Build the WA tool first, and treat its quirks (Cloud API rate limits, template approvals, the 24-hour service window) as first-class infra constraints. They are.

The rest is engineering.

How I cut a lending app's API latency by ~30%

Fri, 16 Jan 2026 00:00:00 +0000

Most "I made the API faster" posts read like magic-trick demos. Clever caching layer in act two, latency graph drops in act three, applause. The Kredit Pintar transfer-layer work didn't feel like that. It felt like a slow, deliberate audit that paid off because nobody had done one in a while.

This is what actually happened.

Where we started

Kredit Pintar is a lending app with more than five million monthly active users. The backend is mostly Java on Spring Boot, MySQL underneath, a busy mesh of services on Kubernetes with Argo CD shipping changes. The data-transfer layer (the code that takes a request, talks to whatever systems we depend on, and shapes a response back to the caller) had grown organically. That's the polite way of saying every owner who'd touched it had added the field they needed and left.

The symptom showed up on the graphs. P50 and P95 on a handful of hot endpoints had been creeping up. Nothing dramatic, nothing pager-worthy, just enough that on-call kept flagging it in weekly reviews.

Two weeks of reading

The first two weeks I didn't write any new code. I read code. Then I read traces. Then I read more code. Looking back, I wish I'd spent two days up front on better profiling tooling. By the time I had the picture clear, I'd already half-formed the wrong hypothesis twice.

Two patterns surfaced once I'd done enough of that:

Redundant serialisation. The same payload was being serialised, sent across a hop, deserialised, then re-serialised one or two hops downstream. Fields nobody ever read travelled the whole way for free.
Chatty round trips. A surprising number of "one logical request" flows were actually three sequential calls under the hood. Each cheap on its own. The latencies stacked.

A token-bucket rate limiter is the kind of thing every fintech backend grows somewhere on the hot path. The same shape lives behind /api/lab/latency on this site, and /labs/latency runs it live against three handler variants.
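For reference, a generic token bucket looks roughly like the sketch below. It is not the code behind the lab endpoint. The class shape, the parameter names and the injectable clock are assumptions chosen to keep it testable.

```python
import time

class TokenBucket:
    """Allow bursts up to `capacity`, refilled at a steady rate."""

    def __init__(self, capacity: float, refill_per_second: float, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.clock = clock
        self.tokens = float(capacity)  # start full
        self.updated_at = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Spend `cost` tokens if available and report whether the call may proceed."""
        now = self.clock()
        elapsed = now - self.updated_at
        self.updated_at = now
        # Refill lazily based on elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

Refilling lazily inside allow() means there is no background timer to run, and a monotonic clock keeps wall-clock adjustments out of the hot path.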

What I actually changed

There was no single magical change. The win was the cumulative effect of small ones:

A clearer contract between the API surface and the systems behind it. One round trip per logical operation where it used to be two or three.
Tighter request shapes. Fields nobody downstream consumed stopped travelling the wire.
Backwards-compatible adapters at the seams, so the rewrite could ship in chunks and reach production traffic gradually instead of one terrifying cutover.

The unglamorous list is the win. The graph dropped because of the list, not because of any one item on it.

Keeping myself honest

Two things kept me honest, and both saved me at least once.

Traffic mirror in staging. I replayed real production requests against the new and old paths side by side and diffed the responses. The first time I caught a regression I was sure wasn't there (a one-character bug in a default-value fallback), that diff was the only reason I caught it before customers did.

Slow rollout. Small percentage of real traffic at first, with the old path still hot enough to fall back to. Boring. Effective. The day the new path emitted a malformed response under one specific timezone offset, rollback was a single config flip.
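The side-by-side diff from the traffic mirror can be sketched as a recursive comparison of two JSON-like responses. This is an illustration in Python rather than the Java we ran. The function name and the path notation are assumptions.

```python
def diff_responses(old, new, path="$"):
    """Return a list of human-readable differences between two JSON-like values."""
    if isinstance(old, dict) and isinstance(new, dict):
        diffs = []
        for key in sorted(set(old) | set(new)):
            if key not in old:
                diffs.append(f"{path}.{key}: missing in old")
            elif key not in new:
                diffs.append(f"{path}.{key}: missing in new")
            else:
                diffs.extend(diff_responses(old[key], new[key], f"{path}.{key}"))
        return diffs
    if isinstance(old, list) and isinstance(new, list) and len(old) == len(new):
        diffs = []
        for index, (a, b) in enumerate(zip(old, new)):
            diffs.extend(diff_responses(a, b, f"{path}[{index}]"))
        return diffs
    # Scalars, mismatched types, or lists of different length.
    return [] if old == new else [f"{path}: {old!r} != {new!r}"]
```

An empty list means the new path answered exactly like the old one for that request. Anything else names the field to go and look at.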

The result

Average API latency on the rewritten paths dropped by roughly 30%. P95 followed it down. The team shipped seven major features in the same window without slipping the rewrite or each other.

What I'd do differently

Spend more of the early days on profiling tooling. The instinct on a project like this is to start writing the new layer right away. The higher-leverage move is to make it cheap to know where time is actually being spent, and then start writing.

The other lesson, which I keep relearning: the boring, careful audit is almost always faster than the clever rewrite. Most performance wins at scale aren't hidden. They're sitting in the code, waiting for somebody to read it slowly. The hard part isn't the change. The hard part is taking two weeks to read first.

A backend engineer's cheatsheet for Indonesian payment rails

Wed, 24 Dec 2025 00:00:00 +0000

Working notes from three years of wiring Indonesian payment rails into bank and lending backends. The companion lab is at /labs/rails — same data, sortable.

There's a moment, the first time you wire up Indonesian payments, when you realise the question "how do I take payment?" has a dozen different answers. Each has its own latency story, idempotency contract, and refund path. The overseas tutorials don't help. They explain Stripe and Adyen, and neither of those is the rail. The rail is BI-FAST or QRIS or GPN, sitting underneath an acquirer or a wallet that may also be the rail.

This is the map I wish somebody had handed me on day one.

The five families that matter

You can group every domestic rail into five families:

Instant inter-bank: BI-FAST. Real-time, 24/7, capped at IDR 250M.¹ The default rail for retail transfers since 2021.
QR: QRIS. One QR code reads in every wallet and every bank app. Interoperability is the whole point. Speed is incidental.
Domestic switching: GPN. Routes domestic debit-card transactions through Indonesian switches. Cheaper than international schemes, slower to dispute.
Clearing and high-value: SKN (batch clearing) and BI-RTGS (high value, real-time gross). Different shapes, different occasions. Payroll goes on SKN, treasury goes on RTGS.
Closed-loop wallets: OVO, GoPay, DANA, ShopeePay, LinkAja. Each is its own network, plus a QRIS interface, plus an in-app SDK.

Cards (Visa/Mastercard) sit slightly outside this taxonomy. Still ubiquitous for cross-border and high-AOV, still the only rail with a real chargeback story, still the most expensive.

Latency, but honest

The lab page sorts by latency, and that's misleading without context. "Latency" here is end-to-end, from "I called the API" to "the counterparty sees the money". Within that bar:

Wallet APIs (OVO, GoPay, DANA) are fast, typically 2 to 3 seconds, because both legs sit inside the wallet's perimeter.
BI-FAST is also fast, typically around 5 seconds, but p99 climbs into the tens of seconds when the receiving bank drags its feet.
QRIS acks in 3 seconds, but merchant settlement is T+1.
SKN is batch. Four windows per business day. The "latency" is effectively the wait until the next window.
RTGS is real-time, but business hours only.

Two practical implications:

If your customer is staring at a screen, you want a wallet, QRIS, or BI-FAST. SKN is for things they don't watch land.
If your reconciliation runs daily, the difference between 3 seconds and 30 seconds is invisible. Pick on cost and on idempotency semantics, not on raw speed.

Idempotency stories — read these carefully

Every rail says "we're idempotent," and every rail means a different thing by it.

BI-FAST: unique transaction ID per request. Reuse returns the prior result, including the prior error. The sender bank is the source of truth.
QRIS: one QR string is one transaction. Double-scan is blocked at the PJP (payment service provider) layer. Your job is to not reuse the QR.
GPN cards: the ARN (Acquirer Reference Number) is the fingerprint. If your retry doesn't carry the same merchant transaction reference, the issuer treats it as a brand new authorization.
OVO / GoPay / DANA: partner-supplied idempotency key on a custom header. The wallet's API stores the key and replays the prior response on retry. The retention window varies. Assume 24 hours and verify in the API docs.
SKN: batch + reference. Reverse clearing is your only out.
VA (virtual account): the VA number is the idempotency token. Once a VA is paid, paying it again either bounces or creates a duplicate at the acquirer's discretion. Not a contract you want to lean on.

The rule I've internalised: carry an idempotency key on every external call, whether the rail demands one or not. Even when the rail enforces uniqueness for you, your code reaches the rail through wrappers and middleware, and the wrappers will retry. If the wrapper retries silently and the rail accepts the retry as new, your ledger is wrong. Fix that on your side.
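One way to hold that line on your own side is a thin client that replays the stored result for a key it has already seen. The class and its in-memory dict are assumptions for illustration. In a real integration the store has to be durable and shared across workers.

```python
class IdempotentClient:
    """Wrap an outbound call so a repeated key never reaches the rail twice."""

    def __init__(self, send):
        self._send = send    # callable(key, payload) -> result
        self._results = {}   # key -> stored result (in-memory for the sketch)

    def call(self, key: str, payload: dict):
        if key in self._results:
            # A wrapper or middleware retried: replay the earlier outcome.
            return self._results[key]
        result = self._send(key, payload)
        self._results[key] = result
        return result
```

If the rail also deduplicates on the key, you get two layers of protection. If it doesn't, this is the only one.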

Refund paths — the unsexy column

This is the one that bites in production. The lab has the row-by-row detail; the headline is:

Wallets have proper refund APIs. Use them.²
VA, BI-FAST, SKN have no scheme refund. You fire a new counter-transfer, and your accounting reflects it as such.
Cards have the strongest dispute story (90-day chargebacks) and the weakest refund-to-customer-satisfaction ratio.
QRIS sits awkwardly in between. In-session reversal works. Later reversals go through the PJP, which means support tickets.

If you're building a customer-facing product, refund-path quality is the single biggest reason to prefer wallets over VAs, even when the MDR (merchant discount rate) looks worse on paper.

The simulator companion

The payment-flow simulator is the live-coding companion to all of this. It encodes the same patterns: idempotent debits, double-delivered webhooks, timeout-with-retry, partial-failure reconciliation. It doesn't pick a specific rail. Pair the two: the cheatsheet for "what shape is this rail?", the simulator for "what does it do under failure?".

What this isn't

Not a regulatory primer. Cite Bank Indonesia and OJK directly for that. Not a contract. The numbers are public-range estimates and will be wrong for the largest merchants. Not exhaustive. The e-money schemes (LinkAja, ShopeePay) and the corporate rails (CMS, host-to-host) sit alongside this list and don't fit on one screen.

What it is: the page I wish I could have shown my past self the week before I started writing the M-Syariah Payment API. If you're that engineer right now, this is for you.

¹ The IDR 250M cap is the scheme-level ceiling per BI's PADG 23/25/PADG/2021. Sending banks can apply tighter caps; check your issuer.
² Test the refund path on day one of integration, not week three. Most outages I've seen on payment integrations were refund-shaped, not authorization-shaped.

Integrating OVO, GoPay, and DANA into a Sharia core banking system

Sun, 16 Mar 2025 00:00:00 +0000

If you live in Indonesia, you probably moved money through OVO, GoPay, or DANA this morning without thinking about it. That "without thinking about it" is the whole game in payments. Inside the bank that connects to those wallets, it's also the part that eats the most engineering time.

This is what I picked up designing the Payment API at Bank Mega Syariah, the one that wired our core banking platform into all three wallets.

Why this is hard before you write any code

The first time someone says "let's integrate three e-wallets," it sounds like roughly three times the work of integrating one. It isn't. Each wallet has:

Its own dialect for requests and responses.
Its own webhook model — when it fires, how it retries, what it guarantees about delivery.
Its own reconciliation cadence and statement format.
Its own definition of success and failure, and a wider gap than you'd expect between "we accepted your message" and "the money moved".

Multiply that by the bank side. The core banking system is the source of truth. Ledger postings have to be exact. A lost message means a real person is missing real money. What looked like one project becomes three half-projects plus the glue that joins them.

The glue is the project. Most of my time was the glue.

The shape I ended up with

One unified Payment API in front of the core, with thin adapters per biller behind it. The internal contract is one shape; each wallet's dialect lives in its adapter and doesn't leak inward. That sentence is the whole architecture. Everything else was details.

The pieces I'd call out, in order of how badly each one bites if you skimp on it:

1. Strong idempotency keys on every external call. A network blip should never end with the user double-charged. Getting this right at the start is cheap. Getting it wrong is a regulator asking why, three months in, two specific accounts are out by IDR 47,500.

2. Webhooks: separate "did the message arrive" from "is the ledger consistent". It's tempting to do both in one handler. Don't. You'll lose either reliability or correctness, and you'll find out which one at 3am.

3. A daily reconciliation job that proves the ledger. The unglamorous, schedule-driven thing that catches the cases your live code missed. Treat it as a first-class part of the product, not a clean-up phase you'll add later when there's time. There's never time.
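The core of the reconciliation job in point 3 is a three-way comparison. In this sketch both sides are reduced to a mapping of transaction reference to amount in IDR. That shape is an assumption, and real statements need parsing and normalising first.

```python
def reconcile(ledger: dict, statement: dict) -> dict:
    """Compare ledger postings with a wallet statement, keyed by reference."""
    ledger_refs = set(ledger)
    statement_refs = set(statement)
    return {
        # The wallet says money moved, but we never posted it.
        "missing_in_ledger": sorted(statement_refs - ledger_refs),
        # We posted it, but the wallet has no record.
        "missing_in_statement": sorted(ledger_refs - statement_refs),
        # Both sides know the reference but disagree on the amount.
        "amount_mismatch": sorted(
            ref for ref in ledger_refs & statement_refs
            if ledger[ref] != statement[ref]
        ),
    }
```

Three empty lists is the proof that the ledger is right for that day. Anything else is a work item for a human.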

What surprised me

How much of the work is naming things. "Pending" in OVO's world is not "pending" in your ledger's world. "Failed" might be retryable or it might be terminal. Different wallet, different answer. The discipline of writing the internal contract, the names and states the rest of the bank's code sees, mattered more than any one integration.

Once we had a clean internal vocabulary, adding a fourth wallet would have taken a week, not a quarter. We never did add a fourth, but the hypothetical was the proof that the design worked.
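A sketch of what that internal vocabulary looks like at an adapter seam. The wallet status strings here are invented placeholders, not any real wallet's API values, and the internal state names are illustrative too.

```python
from enum import Enum

class PaymentState(Enum):
    """The only states the rest of the bank's code is allowed to see."""
    PENDING = "pending"
    SETTLED = "settled"
    FAILED_RETRYABLE = "failed_retryable"
    FAILED_TERMINAL = "failed_terminal"

# Hypothetical dialect for one wallet. Each adapter owns a table like this.
WALLET_A_STATUS = {
    "PROCESSING": PaymentState.PENDING,
    "SUCCESS": PaymentState.SETTLED,
    "TIMEOUT": PaymentState.FAILED_RETRYABLE,
    "REJECTED": PaymentState.FAILED_TERMINAL,
}

def to_internal(mapping: dict, wallet_status: str) -> PaymentState:
    """Translate a wallet status, failing loudly on anything unmapped."""
    try:
        return mapping[wallet_status]
    except KeyError:
        raise ValueError(f"unmapped wallet status: {wallet_status!r}") from None
```

Failing on an unknown status is deliberate: a new wallet state should stop at the adapter, not leak into the ledger as a guess.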

The thing nobody tells you about payment integrations

The UI is the easy part. The first time the M-Syariah app showed a green tick that said "transfer successful," it was thrilling. The real work was making that tick not lie. Under packet loss. Under timeouts. When the wallet is briefly down on a Saturday afternoon. When their webhook arrives twice, fifteen minutes apart. When their webhook never arrives at all, and your reconciliation job has to figure it out the next morning.

If the green tick is honest, you've done the hard work. If it's optimistic, you're a support ticket waiting to happen. There's no third option.

Lessons

Treat reconciliation as a product feature, not an operational afterthought. Design it on day one. It's the only thing that catches what live code missed.
The internal contract is the most important part of any multi-provider integration. The adapters are mechanical; the contract is the design.
"Idempotent" is a property of the system, not just the call. It only holds when storage, retries, and consumers all cooperate. Any one of them silently retrying breaks the property.
Test the refund path on day one of integration, not week three. Most of the production outages I saw on payment work were refund-shaped, not authorization-shaped.

If I were doing the same work today on a greenfield stack, the shape would still be this one. Different language, different cloud, maybe an event-sourced ledger instead of the postings model. But the unified-API-with-thin-adapters spine, strong idempotency, reconciliation as a feature, those would be on the wall on day one.

The train that taught me distributed systems

Sun, 15 Jan 2023 00:00:00 +0000

When someone asks me about distributed systems, the example I keep reaching for is a model train. People look at me funny, fair enough. But the project I keep coming back to in interviews, in my head, and every time I draw a state machine on a whiteboard, is a miniature railway I helped build at UGM in 2022.

So here's what that train taught me, in software-people words.

What it actually was

A model train you could drive over the web. We sat in a small lab in the Faculty of Engineering. There were rails on a desk and a Raspberry Pi acting as the brain. You opened a web app, picked a train, set a speed, switched a light on or off. Down at the rails, a protocol called Digital Command Control encoded those instructions onto the same pair of wires that carried the power.

The Raspberry Pi was the whole stack. Backend in Go, frontend in Flask + Python, hardware loop running off the same board. My Bachelor's thesis later pushed the work further with a Python prototype of the DCC controller proper, hitting millisecond precision on the wire. There's a tiny simulation of the same idea on this site at /labs/train. It's a toy. The original was a slightly bigger toy.

The lessons that travelled

I didn't know the term "distributed systems" yet. Looking back, the lab project was a small, complete one:

Latency budgets are real. DCC cares about timing in milliseconds. If your encoder slips, the decoder on the train gets confused, and the locomotive sits there blinking. That's a debugger you can hear. Years later when an SRE complained that p99 was up by 20ms, I knew exactly what he meant. I'd stood next to a train that went silent because of less.

State machines beat ad-hoc logic. A train can be moving, stopped, accelerating, switching tracks, or in an error state. The moment I drew that as a graph and made the transitions explicit, the bugs almost stopped. On every backend project since, I draw the state machine first. It's the cheapest debugging investment I know.

The frontend is also distributed. A web app that talks to a Pi running a hardware loop is not the same as a web app that talks to a database. The browser doesn't know that. Figuring out what to show the user when the train is probably fine but you haven't heard back yet is a small version of every distributed-systems problem you'll hit later. "Probably fine" turned out to be the whole job.

Hardware tells the truth. Software lies all the time. A misbehaving distributed system can hide behind retries and logs. A locomotive that sits there blinking can't hide behind anything. There's something honest about building systems where the failure mode is visible. I miss it sometimes, working in fintech, where most failures stay invisible until reconciliation day.
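The state-machine lesson is the easiest of these to show in code. The states come from the list above. The transition graph is a guess made for illustration, not the one from the lab.

```python
from enum import Enum

class TrainState(Enum):
    STOPPED = "stopped"
    ACCELERATING = "accelerating"
    MOVING = "moving"
    SWITCHING = "switching_tracks"
    ERROR = "error"

# Hypothetical graph: each state lists the states it may move to next.
TRANSITIONS = {
    TrainState.STOPPED: {TrainState.ACCELERATING, TrainState.SWITCHING, TrainState.ERROR},
    TrainState.ACCELERATING: {TrainState.MOVING, TrainState.STOPPED, TrainState.ERROR},
    TrainState.MOVING: {TrainState.STOPPED, TrainState.ERROR},
    TrainState.SWITCHING: {TrainState.STOPPED, TrainState.ERROR},
    TrainState.ERROR: {TrainState.STOPPED},
}

def can_move(current: TrainState, target: TrainState) -> bool:
    """True only if the graph has an explicit edge from current to target."""
    return target in TRANSITIONS[current]
```

In this version a moving train cannot switch tracks, and that rule lives in one table instead of in scattered if-statements.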

Why I keep talking about it

Years later I work in fintech. I think about ledgers, idempotency, timeouts, reconciliation. None of that is very far from a model train and a state machine on a Pi.

The path from "model train on a desk" to "five-million-MAU lending app" is shorter than it sounds, if you stay curious about the seams.