When I started building multiplayer, I was not trying to bolt networking onto an existing single-player prototype at the last minute. A 4X game has a large world state, hidden information, turn submission, AI players, long reconnect windows, and plenty of tiny UI-only interactions that should never touch the server. I needed a multiplayer architecture that treated the server as the source of truth while still letting the Flutter client feel responsive.
This post walks through the architecture I ended up with: repository layout, protocol contracts, command dispatch, event log, snapshots, WebSocket replay, fog-of-war projection, deployment, observability, and tests.
Design Goals
Before writing the server path, I wrote down the constraints I wanted the architecture to satisfy:
- The server is authoritative for gameplay mutations.
- The client may still execute UI-only commands locally.
- Every accepted or rejected server command gets a monotonically increasing event_offset.
- The latest snapshot is the fast bootstrap path.
- The event log is the catch-up and replay contract.
- WebSocket is a notification channel, not the write path.
- Fog of war is enforced on the server for both events and snapshots.
- Deploys should drain gracefully instead of silently dropping live matches.
- Reconnect should be a normal flow, not an exceptional rescue path.
That led me to a hybrid model: HTTP commands + WebSocket notifications + PostgreSQL snapshots/events.
High-Level Architecture
flowchart TB
subgraph Client["Flutter / Flame Client"]
UI["HUD + Lobby Screens"]
Providers["Riverpod Providers"]
Transport["Network Transport Adapters"]
Renderer["Flame Renderer"]
Cache["Local Snapshot Cache"]
end
subgraph Shared["Shared Core Package"]
Commands["GameCommand"]
Events["GameEvent"]
Protocol["Wire DTOs v1"]
Rules["Rules, AI, Fog, State"]
end
subgraph Server["Dart Server"]
HTTP["shelf HTTP Routes"]
WS["WebSocket Stream"]
Dispatch["ServerCommandTransport"]
Reducer["Server Reducers"]
Visibility["Fog Filters + Snapshot Projector"]
Jobs["AI + Timeout Loops"]
end
subgraph Data["PostgreSQL"]
Users["users"]
Matches["matches"]
Players["match_players"]
EventLog["match_events"]
Snapshots["match_snapshots"]
end
UI --> Providers
Providers --> Transport
Transport --> HTTP
Transport --> WS
HTTP --> Dispatch
Dispatch --> Reducer
Dispatch --> EventLog
Dispatch --> Snapshots
Dispatch --> WS
Jobs --> Dispatch
WS --> Visibility
Server --> Data
Repository Map
aonw/
├── lib/
│ ├── api/
│ │ ├── client/
│ │ ├── protocol/
│ │ ├── session/
│ │ └── transport/
│ └── game/
│ ├── application/
│ ├── domain/
│ ├── infrastructure/
│ └── presentation/
├── packages/
│ └── aonw_core/
│ └── lib/
│ ├── ai/
│ ├── game/domain/
│ └── protocol/
├── server/
│ ├── bin/server.dart
│ ├── lib/
│ │ ├── auth/
│ │ ├── domain/
│ │ ├── http/
│ │ ├── matchmaking/
│ │ ├── observability/
│ │ ├── persistence/
│ │ └── websocket/
│ └── test/
├── deploy/
└── compose.yml
Command Lifecycle
sequenceDiagram
participant Player as Player
participant Client as Flutter Client
participant API as MatchRoutes
participant Transport as ServerCommandTransport
participant Reducer as Server Reducer
participant DB as PostgreSQL
participant WS as EventBroadcaster
participant Other as Other Client
Player->>Client: Click tile / choose action
Client->>Client: Convert UI intent to GameCommand
Client->>API: POST /matches/{id}/commands
API->>API: JWT, match status, actor check, rate limit
API->>Transport: dispatch(matchId, actor, tick, turn, command)
Transport->>Transport: Acquire per-match lock
Transport->>DB: Load latest snapshot
Transport->>DB: Check duplicate/stale tick
Transport->>Reducer: Reduce snapshot + command
Reducer-->>Transport: accepted/rejected + new state + events
Transport->>DB: Transaction append event + save snapshot
Transport->>WS: Broadcast filtered event envelope
Transport-->>API: WireCommandAck
API-->>Client: Ack + snapshot + events
WS-->>Other: Event notification
Other->>API: GET /matches/{id}/snapshot
API-->>Other: Projected snapshot
I deliberately use HTTP for commands: each command gets a clean ack, and auth, retries, and rate limiting fit the request/response model. WebSocket remains the push channel. It tells other clients that something changed and gives them enough offset information to catch up safely.
Wire Protocol
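class WireCommand {
  const WireCommand({
    required this.v,
    required this.matchId,
    required this.tick,
    this.turn,
    required this.actorPlayerId,
    required this.command,
  });

  final int v;
  final String matchId;
  final int tick;
  final int? turn;
  final String actorPlayerId;
  final Map<String, dynamic> command;

  Map<String, dynamic> toJson() => {
        'v': v,
        'matchId': matchId,
        'tick': tick,
        if (turn != null) 'turn': turn,
        'actorPlayerId': actorPlayerId,
        // Shallow copy, so the serialized map is not the same object as the field.
        'command': Map<String, dynamic>.from(command),
      };
}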
The server responds with an ack that includes the authoritative snapshot:
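class WireCommandAck {
  const WireCommandAck({
    required this.accepted,
    required this.offset,
    required this.snapshot,
    required this.events,
    this.reason,
  });
  final bool accepted;
  final int offset;
  final WireSnapshot snapshot;
  final List<Map<String, dynamic>> events;
  final String? reason;
}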
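Because the payload repeats the match and the actor, the command route has to compare both against what the request already proved. Below is a minimal sketch of that comparison. The error codes `match_id_mismatch` and `wrong_actor` come from the authorization flow later in this post; `RouteCheck` and `checkWireCommand` are hypothetical names, not the actual route code.

```dart
/// Outcome of the route-level checks that run before dispatch.
/// (Hypothetical helper type for this sketch.)
class RouteCheck {
  const RouteCheck.ok()
      : status = 200,
        code = null;
  const RouteCheck.rejected(this.status, this.code);

  final int status;
  final String? code;

  bool get passed => code == null;
}

/// Compares the wire payload with the match id from the URL path and the
/// player id resolved from the authenticated user.
RouteCheck checkWireCommand({
  required Map<String, dynamic> wire,
  required String pathMatchId,
  required String authenticatedPlayerId,
}) {
  if (wire['matchId'] != pathMatchId) {
    return const RouteCheck.rejected(400, 'match_id_mismatch');
  }
  if (wire['actorPlayerId'] != authenticatedPlayerId) {
    return const RouteCheck.rejected(403, 'wrong_actor');
  }
  return const RouteCheck.ok();
}
```

The point of keeping this separate from the reducer is that a mismatched payload never reaches the per-match lock at all.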
Single Writer per Match
flowchart TD
A["Command arrives"] --> B["Per-match async lock"]
B --> C["Read latest snapshot"]
C --> D["Validate tick"]
D --> E["Authorize replay"]
E --> F{"duplicate tick?"}
F -- yes --> G["Return cached ack"]
F -- no --> H{"stale tick?"}
H -- yes --> I["409 stale_tick"]
H -- no --> J["Check turn"]
J --> K["Reduce command"]
K --> L["Postgres transaction"]
L --> M["append match_events"]
M --> N["upsert match_snapshots"]
N --> O["broadcast filtered event"]
O --> P["return ack"]
Future<T> _synchronized<T>(
  String matchId,
  Future<T> Function() action,
) async {
  // Queue behind whatever is currently running for this match.
  final previous = _locks[matchId];
  final next = Completer<void>();
  _locks[matchId] = next.future;
  if (previous != null) {
    await previous.catchError((_) {});
  }
  try {
    return await action();
  } finally {
    // Release the next waiter; drop the entry if nobody queued behind us.
    next.complete();
    if (identical(_locks[matchId], next.future)) {
      _locks.remove(matchId);
    }
  }
}
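Inside the lock, the flowchart separates a duplicate tick (answer from the cached ack) from a stale tick (409 `stale_tick`). The post does not spell out the exact rule, so the sketch below is only one plausible reading: ticks are tracked per actor, a tick with a stored ack is a duplicate, and any other tick at or below the highest accepted one is stale. `TickLedger` and `TickDecision` are hypothetical names.

```dart
enum TickDecision { fresh, duplicate, stale }

/// Tracks, per actor, the highest recorded tick and the acks already sent.
/// Assumption: ticks are scoped per actor, not per match.
class TickLedger<A> {
  final Map<String, int> _lastTick = {};
  final Map<String, A> _acks = {};

  String _key(String actor, int tick) => '$actor#$tick';

  TickDecision classify(String actor, int tick) {
    if (_acks.containsKey(_key(actor, tick))) return TickDecision.duplicate;
    final last = _lastTick[actor];
    if (last != null && tick <= last) return TickDecision.stale;
    return TickDecision.fresh;
  }

  /// Stores the ack so a retry of the same tick can be answered from cache.
  void record(String actor, int tick, A ack) {
    _acks[_key(actor, tick)] = ack;
    final last = _lastTick[actor];
    if (last == null || tick > last) _lastTick[actor] = tick;
  }

  A? cachedAck(String actor, int tick) => _acks[_key(actor, tick)];
}
```

Under this reading a client can safely retry a POST after a network error: the retry carries the same tick and gets the same ack back instead of applying the command twice.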
Event Log + Latest Snapshot
CREATE TABLE IF NOT EXISTS match_events (
match_id TEXT NOT NULL REFERENCES matches(id) ON DELETE CASCADE,
event_offset INTEGER NOT NULL CHECK (event_offset > 0),
timestamp TIMESTAMPTZ NOT NULL,
actor_player_id TEXT,
tick INTEGER,
command_json JSONB,
events_json JSONB NOT NULL,
PRIMARY KEY (match_id, event_offset)
);
CREATE TABLE IF NOT EXISTS match_snapshots (
match_id TEXT PRIMARY KEY REFERENCES matches(id) ON DELETE CASCADE,
snapshot_offset INTEGER NOT NULL CHECK (snapshot_offset >= 0),
save_json JSONB NOT NULL,
state_json JSONB NOT NULL,
created_at TIMESTAMPTZ NOT NULL
);
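To make the relationship between the two tables concrete, here is an in-memory stand-in for a single match. It is only a sketch: `MatchStore` and `StoredEvent` are hypothetical names, and the gap-free numbering is an assumption (the schema only requires offsets to be positive and unique per match).

```dart
class StoredEvent {
  const StoredEvent(this.offset, this.payload);
  final int offset;
  final Map<String, dynamic> payload;
}

/// In-memory stand-in for match_events plus match_snapshots for one match.
class MatchStore {
  final List<StoredEvent> _events = [];

  /// 0 means "no event applied yet", matching snapshot_offset >= 0.
  int snapshotOffset = 0;
  Map<String, dynamic> snapshotState = const {};

  /// Appends one event and replaces the latest snapshot in a single step,
  /// mirroring the "append event + save snapshot" transaction.
  int commit(Map<String, dynamic> payload, Map<String, dynamic> newState) {
    final offset = _events.length + 1; // first event gets offset 1
    _events.add(StoredEvent(offset, payload));
    snapshotOffset = offset;
    snapshotState = newState;
    return offset;
  }

  /// Events with offset >= since, in offset order (the replay read).
  List<StoredEvent> readFrom(int since) =>
      _events.where((e) => e.offset >= since).toList();
}
```

A new client only needs `snapshotState`; a reconnecting client asks `readFrom` for everything after the last offset it applied.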
erDiagram
users ||--o{ matches : owns
users ||--o{ match_players : joins
matches ||--o{ match_players : has
matches ||--o{ match_events : records
matches ||--|| match_snapshots : latest
users {
text id PK
text email
text display_name
text password_hash
timestamptz created_at
}
matches {
text id PK
text owner_user_id FK
text name
text map_name
int max_players
int min_players
bool quickplay
text status
int turn
jsonb ruleset_json
timestamptz auto_start_at
text invite_code
}
match_events {
text match_id FK
int event_offset PK
timestamptz timestamp
text actor_player_id
int tick
jsonb command_json
jsonb events_json
}
match_snapshots {
text match_id PK
int snapshot_offset
jsonb save_json
jsonb state_json
timestamptz created_at
}
WebSocket Replay
sequenceDiagram
participant Client as Client
participant WS as WebSocket
participant Server as Server
participant DB as PostgreSQL
Client->>WS: connect /matches/42/stream?since=17
WS->>Server: JWT in Sec-WebSocket-Protocol
Server->>DB: read events offset >= 17
DB-->>Server: backlog
Server-->>Client: event 17
Server-->>Client: event 18
Server-->>Client: ready
Server-->>Client: live event 19
Client--xWS: network drop
Client->>Client: backoff 1s, 2s, 5s...
Client->>WS: reconnect since=20
await subscription.subscribe(
  matchId: saveId,
  token: session.token,
  fromOffset: _eventLogOffset + 1,
  nextOffset: () => _eventLogOffset + 1,
  onEvent: (event) {
    unawaited(_reloadNetworkSnapshot(saveId, liveEvent: event));
  },
  onSnapshotResync: (snapshot) {
    _queueNetworkSnapshotApply(saveId: saveId, snapshot: snapshot);
  },
);
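The reconnect loop boils down to two small pieces of state: how long to wait, and which offset to ask for. The sketch below assumes the delay simply stays at the last listed value once the 1s, 2s, 5s steps are used up (the diagram does not say what follows), and that the stream is reached over `wss`. `reconnectDelay`, `OffsetCursor`, and `streamUri` are hypothetical names.

```dart
/// Delays from the sequence diagram. Holding at the last value for later
/// attempts is an assumption of this sketch.
const List<Duration> reconnectSchedule = [
  Duration(seconds: 1),
  Duration(seconds: 2),
  Duration(seconds: 5),
];

Duration reconnectDelay(int attempt) {
  if (attempt < 0) {
    throw ArgumentError.value(attempt, 'attempt', 'must be >= 0');
  }
  final last = reconnectSchedule.length - 1;
  return reconnectSchedule[attempt < last ? attempt : last];
}

/// Remembers the last applied offset so each reconnect asks for the next one.
class OffsetCursor {
  OffsetCursor(this.lastApplied);
  int lastApplied;

  /// Value for the `since` query parameter.
  int get since => lastApplied + 1;

  /// Returns true when the event advanced the cursor. Offsets that were
  /// already applied (for example from a replayed backlog) are ignored.
  bool accept(int eventOffset) {
    if (eventOffset <= lastApplied) return false;
    lastApplied = eventOffset;
    return true;
  }
}

/// Builds the stream URL; the `wss` scheme and host are assumptions.
Uri streamUri(String host, String matchId, int since) => Uri(
      scheme: 'wss',
      host: host,
      path: '/matches/$matchId/stream',
      queryParameters: {'since': '$since'},
    );
```

With the cursor at 19, as in the diagram, the next connection asks for `since=20`, and a backlog that happens to include offset 19 again changes nothing.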
Why the Client Reloads Snapshots after Live Events
flowchart LR
A["WebSocket event offset N"] --> B["Client checks current offset"]
B --> C["GET latest snapshot"]
C --> D{"snapshot.offset >= N?"}
D -- no --> E["retry 150ms / 350ms / 750ms"]
D -- yes --> F["apply authoritative snapshot"]
F --> G["cache snapshot locally"]
G --> H["derive renderer effects"]
For this game, I do not rely on live events alone to update external client state. The event tells the client that offset N exists. The snapshot tells the client what the authoritative world looks like after that offset.
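The retry branch in that flow can be sketched as a small loop. This is not the client's actual `_reloadNetworkSnapshot`; `reloadUntilCaughtUp` and its parameters are hypothetical, the snapshot is reduced to its offset, and returning `null` when the retries run out is an assumption, since the post does not say what happens then.

```dart
/// Retry delays from the flowchart.
const List<Duration> snapshotRetryDelays = [
  Duration(milliseconds: 150),
  Duration(milliseconds: 350),
  Duration(milliseconds: 750),
];

Future<void> _defaultSleep(Duration d) => Future<void>.delayed(d);

/// Fetches the latest snapshot offset until it has caught up with
/// [eventOffset]. Returns the offset that satisfied the check, or null
/// once every retry delay has been used.
Future<int?> reloadUntilCaughtUp({
  required int eventOffset,
  required Future<int> Function() fetchSnapshotOffset,
  Future<void> Function(Duration) sleep = _defaultSleep,
}) async {
  for (var attempt = 0;; attempt++) {
    final offset = await fetchSnapshotOffset();
    if (offset >= eventOffset) return offset;
    if (attempt >= snapshotRetryDelays.length) return null;
    await sleep(snapshotRetryDelays[attempt]);
  }
}
```

Injecting `sleep` keeps the loop testable without real timers. The comparison is `>=` rather than `==` because the snapshot may already include later offsets by the time it is fetched.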
Fog-of-War Enforcement
flowchart TD
Event["StoredMatchEvent"] --> Filter["FogOfWarVisibilityFilter"]
Snapshot["StoredMatchSnapshot"] --> Projector["SnapshotVisibilityProjector"]
State["PersistentGameState"] --> Filter
State --> Projector
Filter --> VisibleEvent["Event envelope per player"]
Projector --> VisibleSnapshot["Projected snapshot per player"]
VisibleEvent --> ClientA["Player A client"]
VisibleSnapshot --> ClientA
bool canSeeEvent(Map<String, dynamic> event) {
  return switch (event['type']) {
    'TurnEnded' => _string(event['playerId']) == playerId,
    'ResearchPointsGained' => _string(event['playerId']) == playerId,
    'CityFounded' => _canSeeCityEvent(event),
    'UnitMoved' => _canSeeUnitMove(event),
    'UnitAttacked' =>
      _string(event['attackerOwnerPlayerId']) == playerId ||
          _string(event['defenderOwnerPlayerId']) == playerId,
    _ => false,
  };
}
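Snapshots go through the same kind of filter, only applied to state instead of events. The following is a simplified sketch for units, assuming visibility is expressed as a set of tile ids the player currently sees; `Unit` and `projectUnits` are hypothetical and much smaller than the real projector.

```dart
class Unit {
  const Unit(this.id, this.ownerPlayerId, this.tile);
  final String id;
  final String ownerPlayerId;
  final int tile;
}

/// Keeps the player's own units plus any unit standing on a tile the
/// player currently sees. Everything else is left out of the projection.
List<Unit> projectUnits({
  required List<Unit> allUnits,
  required String playerId,
  required Set<int> visibleTiles,
}) {
  return [
    for (final unit in allUnits)
      if (unit.ownerPlayerId == playerId || visibleTiles.contains(unit.tile))
        unit,
  ];
}
```

Like `canSeeEvent`, the projection starts from nothing and adds what the player may see, so data nobody thought about stays hidden by default.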
flowchart LR
Full["Full server state"] --> A["Own units"]
Full --> B["Visible enemy units"]
Full --> C["Own cities"]
Full --> D["Visible enemy cities"]
Full --> E["Own research and gold"]
Full --> F["Own runtime drafts"]
A --> Projected["Projected state"]
B --> Projected
C --> Projected
D --> Projected
E --> Projected
F --> Projected
Matchmaking and Lobby State
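The match lifecycle in this section's state diagram can also be written as a transition table, which is handy for rejecting illegal status changes in one place. This is a sketch under that framing: the states and arrows are copied from the diagram, while `MatchStatus`, `allowedTransitions`, and `canTransition` are hypothetical names.

```dart
enum MatchStatus { open, loading, running, paused, finished, abandoned }

/// Transitions copied from the state diagram in this section.
const Map<MatchStatus, Set<MatchStatus>> allowedTransitions = {
  MatchStatus.open: {MatchStatus.loading, MatchStatus.abandoned},
  MatchStatus.loading: {MatchStatus.running},
  MatchStatus.running: {
    MatchStatus.finished,
    MatchStatus.abandoned,
    MatchStatus.paused, // marked "reserved" in the diagram
  },
  // The diagram shows no outgoing arrows for these three states.
  MatchStatus.paused: <MatchStatus>{},
  MatchStatus.finished: <MatchStatus>{},
  MatchStatus.abandoned: <MatchStatus>{},
};

bool canTransition(MatchStatus from, MatchStatus to) =>
    allowedTransitions[from]?.contains(to) ?? false;
```
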
stateDiagram-v2
[*] --> open
open --> loading: ready/start/quickplay countdown
loading --> running: all clients loaded map
running --> finished: victory/resign/outcome
running --> abandoned: missing humans
running --> paused: reserved
open --> abandoned: owner leaves / invalid active match
finished --> [*]
abandoned --> [*]
Command Authorization
flowchart TD
A["POST /matches/{id}/commands"] --> B["JWT access token"]
B --> C["Match exists and running"]
C --> D["User is match player"]
D --> E["Command rate limit"]
E --> F["WireCommand.fromJson"]
F --> G{"wire.matchId == path matchId?"}
G -- no --> Bad400["400 match_id_mismatch"]
G -- yes --> H{"actorPlayerId == player.id?"}
H -- no --> Bad403["403 wrong_actor"]
H -- yes --> I["ServerCommandTransport.dispatch"]
Turns, Timeouts, and AI
flowchart TB
subgraph Loops["Background Loops"]
AiLoop["AI turn poll\nAONW_AI_TURN_POLL_SECONDS"]
TimeoutLoop["Timeout poll\nAONW_TURN_TIMEOUT_POLL_SECONDS"]
end
AiLoop --> Snapshot["latest snapshot"]
TimeoutLoop --> Snapshot
Snapshot --> Runtime["ServerRuntimeState"]
Runtime --> NeedAI{"AI player pending?"}
Runtime --> TimedOut{"human timed out?"}
NeedAI -- yes --> Plan["plan in isolate\nmaxConcurrentPlans=2"]
Plan --> Dispatch["dispatch AI commands"]
Dispatch --> SubmitAI["SubmitTurn"]
TimedOut -- yes --> SubmitTimeout["SubmitTurn timedOut=true"]
SubmitTimeout --> Kick{"threshold reached?"}
Kick -- yes --> KickPlayer["kick player"]
AI uses the same command transport as human players:
final ack = await dispatchCommand(
  matchId: matchId,
  userId: serverAiUserId,
  actorPlayerId: player.id,
  tick: tick,
  command: GameCommandSerializer.toJson(command),
);
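The timeout branch of the loop diagram ends in a kick once a threshold is reached. The post does not say how that threshold is counted, so the sketch below assumes it counts consecutive timed-out turns and resets when the player submits a turn themselves. `TimeoutTracker` is a hypothetical name.

```dart
/// Counts consecutive timed-out turns per player.
/// Assumption: a manual submit resets the count.
class TimeoutTracker {
  TimeoutTracker({required this.kickThreshold});

  final int kickThreshold;
  final Map<String, int> _consecutive = {};

  /// Call after each SubmitTurn. Returns true when the player has reached
  /// the threshold and should be kicked.
  bool recordSubmit(String playerId, {required bool timedOut}) {
    if (!timedOut) {
      _consecutive.remove(playerId);
      return false;
    }
    final count = (_consecutive[playerId] ?? 0) + 1;
    _consecutive[playerId] = count;
    return count >= kickThreshold;
  }
}
```
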
Deployment and Graceful Drain
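The drain sequence in this section depends on one flag that flips `/ready` from 200 to 503 before any socket is closed. Here is a minimal sketch of that lifecycle object. The close code 4012 and the `beginDrain` name come from the sequence; `ServerLifecycle`, `readiness`, `drain`, and the closer callbacks are assumptions for illustration.

```dart
/// WebSocket close code used in the drain sequence.
const int drainCloseCode = 4012;

class ServerLifecycle {
  bool _draining = false;
  bool get isDraining => _draining;

  void beginDrain() => _draining = true;

  /// Status code and body for GET /ready.
  (int, String) readiness() => _draining ? (503, 'draining') : (200, 'ready');

  /// Flips readiness first, then asks every live socket to close so that
  /// clients reconnect with their last event_offset.
  void drain(Iterable<void Function(int closeCode)> closers) {
    beginDrain();
    for (final close in closers) {
      close(drainCloseCode);
    }
  }
}
```

Flipping readiness before closing sockets matters: otherwise a client could reconnect to the very instance that is about to shut down.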
sequenceDiagram
participant LB as Load Balancer
participant Server as Dart Server
participant Client as Clients
LB->>Server: GET /ready
Server-->>LB: 200 ready
Note over Server: SIGTERM during deploy
Server->>Server: lifecycle.beginDrain()
LB->>Server: GET /ready
Server-->>LB: 503 draining
Server->>Client: close WebSockets code 4012
Client->>Client: reconnect with last event_offset
Server->>Server: wait drain seconds
Server->>Server: close HTTP server
Scale-Out Contract
flowchart LR
Client1["Client A"] --> LB["Load Balancer"]
Client2["Client B"] --> LB
LB -- sticky WebSocket --> S1["Server Instance 1"]
LB -- HTTP --> S2["Server Instance 2"]
S1 --> PG[("PostgreSQL")]
S2 --> PG
subgraph Future["Future Non-Sticky Mode"]
Bus["Redis or NATS Event Bus"]
end
PG -. committed offset .-> Bus
Bus -. notify .-> S1
Bus -. notify .-> S2
The future event-bus contract is:
- Persist command event and snapshot in PostgreSQL.
- Publish the committed offset after commit.
- Each instance reads from PostgreSQL by offset before broadcasting.
- Keep /ready and graceful drain behavior unchanged.
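The third rule is the one that keeps the bus from becoming a second source of truth: a notification carries only an offset, and each instance reads the events from storage itself. Since this is future work, the following is purely a sketch of how one instance might handle it for a single match, assuming the reader returns offsets in ascending order. `InstanceFanout` and its callbacks are hypothetical.

```dart
class InstanceFanout {
  InstanceFanout({required this.readOffsetsAfter, required this.broadcast});

  /// Returns committed event offsets greater than the argument, ascending.
  /// In a real server this would be a PostgreSQL query.
  final List<int> Function(int afterOffset) readOffsetsAfter;

  /// Sends one stored event to this instance's connected sockets.
  final void Function(int offset) broadcast;

  int _lastBroadcast = 0;

  /// Handles a bus notification. The notification only says that everything
  /// up to [committedOffset] is committed; the events come from storage.
  void onCommitted(int committedOffset) {
    if (committedOffset <= _lastBroadcast) return; // duplicate or late notice
    for (final offset in readOffsetsAfter(_lastBroadcast)) {
      if (offset > committedOffset) break;
      broadcast(offset);
      _lastBroadcast = offset;
    }
  }
}
```

Because the instance always reads forward from its own last broadcast offset, a lost or reordered notification only delays delivery; the next notification fills the gap.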
Observability
flowchart LR
Dispatch["ServerCommandTransport"] --> Metrics["ServerMetrics"]
Registry["MatchRegistry"] --> Metrics
Broadcaster["EventBroadcaster"] --> Metrics
Metrics --> Endpoint["GET /metrics"]
Endpoint --> Prometheus["Prometheus"]
Prometheus --> Alerts["Alert Rules"]
- alert: AonwCommandFailureRateHigh
  expr: |
    rate(aonw_command_dispatch_failed_total[5m])
      / clamp_min(rate(aonw_command_dispatch_total[5m]), 0.001) > 0.05
  for: 10m
  labels:
    severity: warning
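For that alert to work, the server has to expose both counters on `/metrics`. Below is a minimal sketch of the two counters rendered in the Prometheus text exposition format. The metric names are taken from the alert expression; `CommandMetrics` is a hypothetical stand-in for `ServerMetrics`, and what exactly counts as a failed dispatch is left open here.

```dart
class CommandMetrics {
  int dispatchTotal = 0;
  int dispatchFailedTotal = 0;

  /// Counts one dispatch; [failed] also bumps the failure counter.
  void record({required bool failed}) {
    dispatchTotal++;
    if (failed) dispatchFailedTotal++;
  }

  /// Renders both counters as Prometheus text exposition lines.
  String render() => [
        '# TYPE aonw_command_dispatch_total counter',
        'aonw_command_dispatch_total $dispatchTotal',
        '# TYPE aonw_command_dispatch_failed_total counter',
        'aonw_command_dispatch_failed_total $dispatchFailedTotal',
        '',
      ].join('\n');
}
```

Both values only ever increase, which is what `rate()` in the alert expression expects from a counter.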
Test Strategy
flowchart TB
A["Protocol tests. wire codecs"] --> B["Transport tests\nHTTP and WS clients"]
B --> C["Server domain tests. validator, authorizer, reducer"]
C --> D["Persistence tests. Postgres repositories"]
D --> E["Route tests. match/auth routes"]
E --> F["Integration tests. full match + Postgres smoke"]
F --> G["Manual chaos drill. SIGTERM, reconnect, replay"]
flutter test
(cd server && dart analyze --fatal-infos && dart test)
(cd packages/aonw_core && dart analyze --fatal-infos && dart test)
tool/run_postgres_smoke.sh
Trade-Offs
| Decision | Benefit | Cost |
|---|---|---|
| HTTP for commands | Clean ack, auth, retry, rate limit | More overhead than raw WebSocket commands |
| WebSocket as notification bus | Simple reconnect and replay | Client often reloads snapshot after events |
| Snapshot + event log | Fast bootstrap and robust recovery | More writes per command |
| Sticky WebSocket sessions | Simple production path | Future event bus needed for non-sticky scale-out |
| JSONB state | Fast protocol evolution | Less relational structure |
| Server-authoritative AI | Same rules as humans | More backend CPU work |
Roadmap
mindmap
  root((Multiplayer Roadmap))
    Event bus
      Redis
      NATS
      Fanout by committed offset
    Snapshots
      Historical snapshots
      Replay debugger
      Diff tooling
    Protocol
      Version negotiation
      Feature flags
      Migration tests
    Security
      Refresh during WS reconnect
      Stronger anti-spam heuristics
      Abuse dashboards
    UX
      Reconnect overlay
      Conflict messages
      Spectator mode
Final Architecture in One Diagram
flowchart LR
Intent["Player Intent"] --> Command["WireCommand"]
Command --> Auth["Auth + Authorization"]
Auth --> Reduce["Server Reduce"]
Reduce --> Persist["Event + Snapshot Transaction"]
Persist --> Ack["Ack with Snapshot"]
Persist --> Notify["Filtered Live Event"]
Notify --> Reconnect["Replay by Offset"]
Ack --> ClientState["Client Authoritative State"]
Reconnect --> ClientState
The key lesson for me was that multiplayer for a turn-based 4X game should be designed as a consistency system, not as screen synchronization. A command needs an actor, tick, turn, and authoritative result. An event needs an offset. A snapshot needs to be safely projectable for one player. Reconnect needs to be normal. Once those contracts exist, the rest of the system becomes much calmer to evolve.