[AoNW] How I Built Multiplayer for a Turn-Based 4X Game with Flutter, Dart, WebSockets, and PostgreSQL

When I started building multiplayer, I was not trying to bolt networking onto an existing single-player prototype at the last minute. A 4X game has a large world state, hidden information, turn submission, AI players, long reconnect windows, and plenty of tiny UI-only interactions that should never touch the server. I needed a multiplayer architecture that treated the server as the source of truth while still letting the Flutter client feel responsive.

This post walks through the architecture I ended up with: repository layout, protocol contracts, command dispatch, event log, snapshots, WebSocket replay, fog-of-war projection, deployment, observability, and tests.

Design Goals

Before writing the server path, I wrote down the constraints I wanted the architecture to satisfy:

  • The server is authoritative for gameplay mutations.
  • The client may still execute UI-only commands locally.
  • Every accepted or rejected server command gets a monotonically increasing event_offset.
  • The latest snapshot is the fast bootstrap path.
  • The event log is the catch-up and replay contract.
  • WebSocket is a notification channel, not the write path.
  • Fog of war is enforced on the server for both events and snapshots.
  • Deploys should drain gracefully instead of silently dropping live matches.
  • Reconnect should be a normal flow, not an exceptional rescue path.

That led me to a hybrid model: HTTP commands + WebSocket notifications + PostgreSQL snapshots/events.

High-Level Architecture

flowchart TB
    subgraph Client["Flutter / Flame Client"]
        UI["HUD + Lobby Screens"]
        Providers["Riverpod Providers"]
        Transport["Network Transport Adapters"]
        Renderer["Flame Renderer"]
        Cache["Local Snapshot Cache"]
    end

    subgraph Shared["Shared Core Package"]
        Commands["GameCommand"]
        Events["GameEvent"]
        Protocol["Wire DTOs v1"]
        Rules["Rules, AI, Fog, State"]
    end

    subgraph Server["Dart Server"]
        HTTP["shelf HTTP Routes"]
        WS["WebSocket Stream"]
        Dispatch["ServerCommandTransport"]
        Reducer["Server Reducers"]
        Visibility["Fog Filters + Snapshot Projector"]
        Jobs["AI + Timeout Loops"]
    end

    subgraph Data["PostgreSQL"]
        Users["users"]
        Matches["matches"]
        Players["match_players"]
        EventLog["match_events"]
        Snapshots["match_snapshots"]
    end

    UI --> Providers
    Providers --> Transport
    Transport --> HTTP
    Transport --> WS
    HTTP --> Dispatch
    Dispatch --> Reducer
    Dispatch --> EventLog
    Dispatch --> Snapshots
    Dispatch --> WS
    Jobs --> Dispatch
    WS --> Visibility
    Server --> Data

Repository Map

aonw/
├── lib/
│   ├── api/
│   │   ├── client/
│   │   ├── protocol/
│   │   ├── session/
│   │   └── transport/
│   └── game/
│       ├── application/
│       ├── domain/
│       ├── infrastructure/
│       └── presentation/
├── packages/
│   └── aonw_core/
│       └── lib/
│           ├── ai/
│           ├── game/domain/
│           └── protocol/
├── server/
│   ├── bin/server.dart
│   ├── lib/
│   │   ├── auth/
│   │   ├── domain/
│   │   ├── http/
│   │   ├── matchmaking/
│   │   ├── observability/
│   │   ├── persistence/
│   │   └── websocket/
│   └── test/
├── deploy/
└── compose.yml

Command Lifecycle

sequenceDiagram
    participant Player as Player
    participant Client as Flutter Client
    participant API as MatchRoutes
    participant Transport as ServerCommandTransport
    participant Reducer as Server Reducer
    participant DB as PostgreSQL
    participant WS as EventBroadcaster
    participant Other as Other Client

    Player->>Client: Click tile / choose action
    Client->>Client: Convert UI intent to GameCommand
    Client->>API: POST /matches/{id}/commands
    API->>API: JWT, match status, actor check, rate limit
    API->>Transport: dispatch(matchId, actor, tick, turn, command)
    Transport->>Transport: Acquire per-match lock
    Transport->>DB: Load latest snapshot
    Transport->>DB: Check duplicate/stale tick
    Transport->>Reducer: Reduce snapshot + command
    Reducer-->>Transport: accepted/rejected + new state + events
    Transport->>DB: Transaction append event + save snapshot
    Transport->>WS: Broadcast filtered event envelope
    Transport-->>API: WireCommandAck
    API-->>Client: Ack + snapshot + events
    WS-->>Other: Event notification
    Other->>API: GET /matches/{id}/snapshot
    API-->>Other: Projected snapshot

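I deliberately use HTTP for commands: every request gets a clean ack, and auth, retry, and rate limiting all hang off the normal request path. WebSocket remains the push channel. It tells other clients that something changed and gives them enough offset information to catch up safely.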

Wire Protocol

class WireCommand {
  final int v;
  final String matchId;
  final int tick;
  final int? turn;
  final String actorPlayerId;
  final Map<String, dynamic> command;

  Map<String, dynamic> toJson() => {
    'v': v,
    'matchId': matchId,
    'tick': tick,
    if (turn != null) 'turn': turn,
    'actorPlayerId': actorPlayerId,
    'command': Map<String, dynamic>.from(command),
  };
}

The server responds with an ack that includes the authoritative snapshot:

class WireCommandAck {
  final bool accepted;
  final int offset;
  final WireSnapshot snapshot;
  final List<Map<String, dynamic>> events;
  final String? reason;
}
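
The authorization flow later in this post calls `WireCommand.fromJson`, which is not shown. Here is a minimal sketch of what that parse step could look like, using the field names from `toJson` above. The `WireFormatException` type, the `field` helper, and the `supportedWireVersion` constant are hypothetical names for illustration, not part of the AoNW codebase.

```dart
/// Version accepted by this sketch (the architecture diagram says "Wire DTOs v1").
const int supportedWireVersion = 1;

/// Hypothetical error type for malformed wire payloads.
class WireFormatException implements Exception {
  final String message;
  const WireFormatException(this.message);

  @override
  String toString() => 'WireFormatException: $message';
}

/// Trimmed copy of the DTO above, with only the parse path filled in.
class WireCommand {
  final int v;
  final String matchId;
  final int tick;
  final int? turn;
  final String actorPlayerId;
  final Map<String, dynamic> command;

  const WireCommand({
    required this.v,
    required this.matchId,
    required this.tick,
    required this.actorPlayerId,
    required this.command,
    this.turn,
  });

  /// Parses a decoded JSON object and rejects missing or mistyped fields
  /// up front, instead of letting a cast error surface inside the reducer.
  factory WireCommand.fromJson(Map<String, dynamic> json) {
    // Reads json[key] and checks its runtime type. A nullable T (int?)
    // also accepts an absent key, because a missing key reads as null.
    T field<T>(String key) {
      final value = json[key];
      if (value is T) return value;
      throw WireFormatException('field "$key" must be $T');
    }

    final v = field<int>('v');
    if (v != supportedWireVersion) {
      throw WireFormatException('unsupported protocol version $v');
    }

    return WireCommand(
      v: v,
      matchId: field<String>('matchId'),
      tick: field<int>('tick'),
      turn: field<int?>('turn'),
      actorPlayerId: field<String>('actorPlayerId'),
      command: Map<String, dynamic>.from(field<Map>('command')),
    );
  }
}
```

Failing at the parse boundary keeps malformed input out of the per-match lock entirely, so a bad payload never costs a snapshot read.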

Single Writer per Match

flowchart TD
    A["Command arrives"] --> B["Per-match async lock"]
    B --> C["Read latest snapshot"]
    C --> D["Validate tick"]
    D --> E["Authorize replay"]
    E --> F{"duplicate tick?"}
    F -- yes --> G["Return cached ack"]
    F -- no --> H{"stale tick?"}
    H -- yes --> I["409 stale_tick"]
    H -- no --> J["Check turn"]
    J --> K["Reduce command"]
    K --> L["Postgres transaction"]
    L --> M["append match_events"]
    M --> N["upsert match_snapshots"]
    N --> O["broadcast filtered event"]
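    O --> P["return ack"]

Future<T> _synchronized<T>(
  String matchId,
  Future<T> Function() action,
) async {
  // Queue behind whatever is currently running for this match.
  final previous = _locks[matchId];
  final next = Completer<void>();
  _locks[matchId] = next.future;
  if (previous != null) {
    // Wait for the previous holder; its failure must not block this caller.
    await previous.catchError((_) {});
  }
  try {
    return await action();
  } finally {
    // Release the lock, then drop the map entry if nobody queued behind us.
    next.complete();
    if (identical(_locks[matchId], next.future)) {
      _locks.remove(matchId);
    }
  }
}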
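
Inside the lock, the flowchart separates duplicate ticks (answer with the cached ack) from stale ticks (answer `409 stale_tick`). The post does not spell out the comparison rule, so the sketch below assumes that ticks are tracked per actor and must strictly increase. `TickCheck`, `classifyTick`, and `IdempotentDispatcher` are hypothetical names.

```dart
/// Outcome of comparing an incoming tick with the actor's last processed tick.
enum TickCheck { fresh, duplicate, stale }

/// Assumed rule: a higher tick is new work, an equal tick is a retry of the
/// last command, and a lower tick is stale.
TickCheck classifyTick({required int incoming, required int? lastProcessed}) {
  if (lastProcessed == null || incoming > lastProcessed) return TickCheck.fresh;
  if (incoming == lastProcessed) return TickCheck.duplicate;
  return TickCheck.stale;
}

/// Remembers the most recent ack per actor so a retried command can be
/// answered without running the reducer a second time.
class IdempotentDispatcher<Ack> {
  final Map<String, int> _lastTick = {};
  final Map<String, Ack> _lastAck = {};

  /// Returns the ack for [tick], or null when the tick is stale and the
  /// caller should respond with 409 stale_tick.
  Ack? dispatch(String actorPlayerId, int tick, Ack Function() reduce) {
    final check = classifyTick(
      incoming: tick,
      lastProcessed: _lastTick[actorPlayerId],
    );
    switch (check) {
      case TickCheck.duplicate:
        return _lastAck[actorPlayerId];
      case TickCheck.stale:
        return null;
      case TickCheck.fresh:
        final ack = reduce();
        _lastTick[actorPlayerId] = tick;
        _lastAck[actorPlayerId] = ack;
        return ack;
    }
  }
}
```

Because this check runs under the per-match lock, two retries of the same command cannot both see the tick as fresh.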

Event Log + Latest Snapshot

CREATE TABLE IF NOT EXISTS match_events (
  match_id TEXT NOT NULL REFERENCES matches(id) ON DELETE CASCADE,
  event_offset INTEGER NOT NULL CHECK (event_offset > 0),
  timestamp TIMESTAMPTZ NOT NULL,
  actor_player_id TEXT,
  tick INTEGER,
  command_json JSONB,
  events_json JSONB NOT NULL,
  PRIMARY KEY (match_id, event_offset)
);

CREATE TABLE IF NOT EXISTS match_snapshots (
  match_id TEXT PRIMARY KEY REFERENCES matches(id) ON DELETE CASCADE,
  snapshot_offset INTEGER NOT NULL CHECK (snapshot_offset >= 0),
  save_json JSONB NOT NULL,
  state_json JSONB NOT NULL,
  created_at TIMESTAMPTZ NOT NULL
);

erDiagram
    users ||--o{ matches : owns
    users ||--o{ match_players : joins
    matches ||--o{ match_players : has
    matches ||--o{ match_events : records
    matches ||--|| match_snapshots : latest

    users {
        text id PK
        text email
        text display_name
        text password_hash
        timestamptz created_at
    }

    matches {
        text id PK
        text owner_user_id FK
        text name
        text map_name
        int max_players
        int min_players
        bool quickplay
        text status
        int turn
        jsonb ruleset_json
        timestamptz auto_start_at
        text invite_code
    }

    match_events {
        text match_id FK
        int event_offset PK
        timestamptz timestamp
        text actor_player_id
        int tick
        jsonb command_json
        jsonb events_json
    }

    match_snapshots {
        text match_id PK
        int snapshot_offset
        jsonb save_json
        jsonb state_json
        timestamptz created_at
    }
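
To make the offset contract concrete, here is a small in-memory stand-in for `match_events` and `match_snapshots` for a single match. It is only a sketch of the invariants implied by the schema (event offsets start at 1, snapshot offsets at 0, and both advance together). The class names are hypothetical, and the real implementation uses a PostgreSQL transaction rather than in-memory lists.

```dart
class StoredEvent {
  final int offset;
  final List<Map<String, dynamic>> events;
  const StoredEvent(this.offset, this.events);
}

class StoredSnapshot {
  final int offset;
  final Map<String, dynamic> state;
  const StoredSnapshot(this.offset, this.state);
}

/// In-memory model of one match's event log plus its latest snapshot.
class InMemoryMatchStore {
  final List<StoredEvent> _events = [];

  // Offset 0 means "no command applied yet" (snapshot_offset >= 0).
  StoredSnapshot _snapshot = const StoredSnapshot(0, {});

  StoredSnapshot get latestSnapshot => _snapshot;

  /// Appends one log row and replaces the snapshot in a single step,
  /// mirroring the "append event + upsert snapshot" transaction.
  /// Returns the offset assigned to the new row.
  int commit(
    List<Map<String, dynamic>> events,
    Map<String, dynamic> newState,
  ) {
    final offset = _snapshot.offset + 1; // event_offset > 0
    _events.add(StoredEvent(offset, List.unmodifiable(events)));
    _snapshot = StoredSnapshot(offset, Map.unmodifiable(newState));
    return offset;
  }

  /// Catch-up read: every row with event_offset >= [since], in order.
  List<StoredEvent> eventsFrom(int since) =>
      _events.where((e) => e.offset >= since).toList();
}
```

Keeping the snapshot offset equal to the last appended event offset is what lets a client compare a snapshot against a live event and decide whether it has caught up.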

WebSocket Replay

sequenceDiagram
    participant Client as Client
    participant WS as WebSocket
    participant Server as Server
    participant DB as PostgreSQL

    Client->>WS: connect /matches/42/stream?since=17
    WS->>Server: JWT in Sec-WebSocket-Protocol
    Server->>DB: read events offset >= 17
    DB-->>Server: backlog
    Server-->>Client: event 17
    Server-->>Client: event 18
    Server-->>Client: ready
    Server-->>Client: live event 19
    Client--xWS: network drop
    Client->>Client: backoff 1s, 2s, 5s...
    Client->>WS: reconnect since=20

await subscription.subscribe(
  matchId: saveId,
  token: session.token,
  fromOffset: _eventLogOffset + 1,
  nextOffset: () => _eventLogOffset + 1,
  onEvent: (event) {
    unawaited(_reloadNetworkSnapshot(saveId, liveEvent: event));
  },
  onSnapshotResync: (snapshot) {
    _queueNetworkSnapshotApply(saveId: saveId, snapshot: snapshot);
  },
);
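
The `fromOffset` and `nextOffset` arguments both derive from a single "last applied offset" counter. A sketch of that bookkeeping follows, assuming offsets are contiguous and that `since` is inclusive, as in the sequence diagram. `OffsetCursor` and `streamUri` are hypothetical helpers, not names from the client code.

```dart
/// Tracks the highest contiguous event offset the client has applied.
class OffsetCursor {
  int _applied;
  OffsetCursor([this._applied = 0]);

  int get applied => _applied;

  /// Value for the `since` query parameter on connect and reconnect.
  int get nextOffset => _applied + 1;

  /// Advances the cursor when [offset] is exactly the next expected event.
  /// Returns false for replayed duplicates and for gaps, so the caller can
  /// ignore the former and resync from a snapshot for the latter.
  bool accept(int offset) {
    if (offset != nextOffset) return false;
    _applied = offset;
    return true;
  }

  /// Moves the cursor forward after a snapshot resync; never moves it back.
  void resyncTo(int snapshotOffset) {
    if (snapshotOffset > _applied) _applied = snapshotOffset;
  }
}

/// Builds the stream URL shape shown in the diagram for the next offset.
Uri streamUri(Uri base, String matchId, OffsetCursor cursor) => base.replace(
      path: '/matches/$matchId/stream',
      queryParameters: {'since': '${cursor.nextOffset}'},
    );
```

With the numbers from the diagram, a client that has applied events 17 through 19 reconnects with `since=20`.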

Why the Client Reloads Snapshots after Live Events

flowchart LR
    A["WebSocket event offset N"] --> B["Client checks current offset"]
    B --> C["GET latest snapshot"]
    C --> D{"snapshot.offset >= N?"}
    D -- no --> E["retry 150ms / 350ms / 750ms"]
    D -- yes --> F["apply authoritative snapshot"]
    F --> G["cache snapshot locally"]
    G --> H["derive renderer effects"]

For this game, I do not rely on live events alone to update external client state. The event tells the client that offset N exists. The snapshot tells the client what the authoritative world looks like after that offset.
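
The retry branch in the flowchart can be expressed as a small decision function plus a loop. This is a sketch under assumptions: the delays come from the diagram, but what happens after the third retry is not stated in the post, so `giveUp` is a placeholder. All names here are hypothetical.

```dart
/// Retry delays from the flowchart above.
const List<Duration> snapshotRetryDelays = [
  Duration(milliseconds: 150),
  Duration(milliseconds: 350),
  Duration(milliseconds: 750),
];

enum ReloadAction { apply, retry, giveUp }

class ReloadDecision {
  final ReloadAction action;
  final Duration? delay; // set only when action is retry
  const ReloadDecision(this.action, [this.delay]);
}

/// Decides what to do with a fetched snapshot after a live event at
/// [eventOffset]. [attempt] is zero-based and counts fetches so far.
ReloadDecision decideReload({
  required int snapshotOffset,
  required int eventOffset,
  required int attempt,
}) {
  if (snapshotOffset >= eventOffset) {
    return const ReloadDecision(ReloadAction.apply);
  }
  if (attempt < snapshotRetryDelays.length) {
    return ReloadDecision(ReloadAction.retry, snapshotRetryDelays[attempt]);
  }
  // Assumption: stop after the listed delays and let the caller decide.
  return const ReloadDecision(ReloadAction.giveUp);
}

/// Fetches snapshots until one covers [eventOffset], or returns null once
/// the retry budget is spent. [fetchSnapshot] stands in for the HTTP call.
Future<S?> reloadUntilCaughtUp<S>({
  required int eventOffset,
  required Future<S> Function() fetchSnapshot,
  required int Function(S snapshot) offsetOf,
}) async {
  for (var attempt = 0;; attempt++) {
    final snapshot = await fetchSnapshot();
    final decision = decideReload(
      snapshotOffset: offsetOf(snapshot),
      eventOffset: eventOffset,
      attempt: attempt,
    );
    switch (decision.action) {
      case ReloadAction.apply:
        return snapshot;
      case ReloadAction.giveUp:
        return null;
      case ReloadAction.retry:
        await Future<void>.delayed(decision.delay!);
    }
  }
}
```

Keeping the decision separate from the loop makes the offset comparison testable without timers or a network.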

Fog-of-War Enforcement

flowchart TD
    Event["StoredMatchEvent"] --> Filter["FogOfWarVisibilityFilter"]
    Snapshot["StoredMatchSnapshot"] --> Projector["SnapshotVisibilityProjector"]
    State["PersistentGameState"] --> Filter
    State --> Projector
    Filter --> VisibleEvent["Event envelope per player"]
    Projector --> VisibleSnapshot["Projected snapshot per player"]
    VisibleEvent --> ClientA["Player A client"]
    VisibleSnapshot --> ClientA

bool canSeeEvent(Map<String, dynamic> event) {
  return switch (event['type']) {
    'TurnEnded' => _string(event['playerId']) == playerId,
    'ResearchPointsGained' => _string(event['playerId']) == playerId,
    'CityFounded' => _canSeeCityEvent(event),
    'UnitMoved' => _canSeeUnitMove(event),
    'UnitAttacked' =>
      _string(event['attackerOwnerPlayerId']) == playerId ||
          _string(event['defenderOwnerPlayerId']) == playerId,
    _ => false,
  };
}

flowchart LR
    Full["Full server state"] --> A["Own units"]
    Full --> B["Visible enemy units"]
    Full --> C["Own cities"]
    Full --> D["Visible enemy cities"]
    Full --> E["Own research and gold"]
    Full --> F["Own runtime drafts"]

    A --> Projected["Projected state"]
    B --> Projected
    C --> Projected
    D --> Projected
    E --> Projected
    F --> Projected
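
The snapshot side of the projection can be sketched the same way. The example below is deliberately reduced to units and gold, and represents visibility as a set of tile indices. The `Unit`, `FullState`, and `ProjectedState` types and the `projectFor` function are hypothetical and much simpler than the real `SnapshotVisibilityProjector`.

```dart
class Unit {
  final String id;
  final String ownerPlayerId;
  final int tile;
  const Unit(this.id, this.ownerPlayerId, this.tile);
}

/// Full server-side state (illustrative subset).
class FullState {
  final List<Unit> units;
  final Map<String, int> goldByPlayer;
  const FullState({required this.units, required this.goldByPlayer});
}

/// What a single player is allowed to receive.
class ProjectedState {
  final List<Unit> units;
  final int gold;
  const ProjectedState({required this.units, required this.gold});
}

/// Keeps the player's own units, enemy units standing on tiles the player
/// can currently see, and only the player's own gold.
ProjectedState projectFor(
  String playerId,
  FullState full,
  Set<int> visibleTiles,
) {
  final units = [
    for (final u in full.units)
      if (u.ownerPlayerId == playerId || visibleTiles.contains(u.tile)) u,
  ];
  return ProjectedState(
    units: units,
    gold: full.goldByPlayer[playerId] ?? 0,
  );
}
```

The projected type has no field for other players' resources, so hidden data cannot be serialized by accident.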

Matchmaking and Lobby State

stateDiagram-v2
    [*] --> open
    open --> loading: ready/start/quickplay countdown
    loading --> running: all clients loaded map
    running --> finished: victory/resign/outcome
    running --> abandoned: missing humans
    running --> paused: reserved
    open --> abandoned: owner leaves / invalid active match
    finished --> [*]
    abandoned --> [*]
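
One way to keep these transitions honest is to encode the diagram as a lookup table and reject everything else. This sketch mirrors the arrows above exactly, including `paused` as a reserved state with no exits. The enum and function names are hypothetical.

```dart
enum MatchStatus { open, loading, running, paused, finished, abandoned }

/// Allowed transitions, copied from the state diagram above.
const Map<MatchStatus, Set<MatchStatus>> allowedTransitions = {
  MatchStatus.open: {MatchStatus.loading, MatchStatus.abandoned},
  MatchStatus.loading: {MatchStatus.running},
  MatchStatus.running: {
    MatchStatus.finished,
    MatchStatus.abandoned,
    MatchStatus.paused,
  },
  MatchStatus.paused: {}, // reserved: no outgoing transitions yet
  MatchStatus.finished: {},
  MatchStatus.abandoned: {},
};

bool canTransition(MatchStatus from, MatchStatus to) =>
    allowedTransitions[from]?.contains(to) ?? false;

/// Returns [to] when the move is legal, otherwise throws.
MatchStatus transition(MatchStatus from, MatchStatus to) {
  if (!canTransition(from, to)) {
    throw StateError('illegal transition ${from.name} -> ${to.name}');
  }
  return to;
}
```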

Command Authorization

flowchart TD
    A["POST /matches/{id}/commands"] --> B["JWT access token"]
    B --> C["Match exists and running"]
    C --> D["User is match player"]
    D --> E["Command rate limit"]
    E --> F["WireCommand.fromJson"]
    F --> G{"wire.matchId == path matchId?"}
    G -- no --> Bad400["400 match_id_mismatch"]
    G -- yes --> H{"actorPlayerId == player.id?"}
    H -- no --> Bad403["403 wrong_actor"]
    H -- yes --> I["ServerCommandTransport.dispatch"]
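
The last two checks in the flowchart compare the request body against what the route already knows from the URL and the JWT. A minimal sketch, using the status and error codes from the diagram; the `Rejection` type and the `checkWireIdentity` function are hypothetical.

```dart
class Rejection {
  final int status;
  final String code;
  const Rejection(this.status, this.code);
}

/// Returns null when the command may go on to dispatch, otherwise the
/// rejection to send back. [pathMatchId] comes from the URL and [playerId]
/// from the authenticated match player; both wire values come from the body.
Rejection? checkWireIdentity({
  required String pathMatchId,
  required String playerId,
  required String wireMatchId,
  required String wireActorPlayerId,
}) {
  if (wireMatchId != pathMatchId) {
    return const Rejection(400, 'match_id_mismatch');
  }
  if (wireActorPlayerId != playerId) {
    return const Rejection(403, 'wrong_actor');
  }
  return null;
}
```

The body is never trusted to say who is acting. It only has to agree with the identity the server has already established.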

Turns, Timeouts, and AI

flowchart TB
    subgraph Loops["Background Loops"]
        AiLoop["AI turn poll\nAONW_AI_TURN_POLL_SECONDS"]
        TimeoutLoop["Timeout poll\nAONW_TURN_TIMEOUT_POLL_SECONDS"]
    end

    AiLoop --> Snapshot["latest snapshot"]
    TimeoutLoop --> Snapshot
    Snapshot --> Runtime["ServerRuntimeState"]
    Runtime --> NeedAI{"AI player pending?"}
    Runtime --> TimedOut{"human timed out?"}
    NeedAI -- yes --> Plan["plan in isolate\nmaxConcurrentPlans=2"]
    Plan --> Dispatch["dispatch AI commands"]
    Dispatch --> SubmitAI["SubmitTurn"]
    TimedOut -- yes --> SubmitTimeout["SubmitTurn timedOut=true"]
    SubmitTimeout --> Kick{"threshold reached?"}
    Kick -- yes --> KickPlayer["kick player"]

AI uses the same command transport as human players:

final ack = await dispatchCommand(
  matchId: matchId,
  userId: serverAiUserId,
  actorPlayerId: player.id,
  tick: tick,
  command: GameCommandSerializer.toJson(command),
);

Deployment and Graceful Drain

sequenceDiagram
    participant LB as Load Balancer
    participant Server as Dart Server
    participant Client as Clients

    LB->>Server: GET /ready
    Server-->>LB: 200 ready
    Note over Server: SIGTERM during deploy
    Server->>Server: lifecycle.beginDrain()
    LB->>Server: GET /ready
    Server-->>LB: 503 draining
    Server->>Client: close WebSockets code 4012
    Client->>Client: reconnect with last event_offset
    Server->>Server: wait drain seconds
    Server->>Server: close HTTP server
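
The drain sequence boils down to one flag that the readiness route reads, plus an ordered shutdown. A sketch follows, assuming the shape implied by `lifecycle.beginDrain()` in the diagram. The `ServerLifecycle` members other than `beginDrain`, the `runDrain` function, and its callback parameters are hypothetical.

```dart
/// WebSocket close code used during drain, as shown in the diagram.
const int drainCloseCode = 4012;

class ServerLifecycle {
  bool _draining = false;

  bool get isDraining => _draining;

  void beginDrain() => _draining = true;

  /// Status code and body for GET /ready.
  int get readyStatus => _draining ? 503 : 200;
  String get readyBody => _draining ? 'draining' : 'ready';
}

/// Runs the drain steps in the order shown in the sequence diagram.
/// The callbacks stand in for the real socket registry and HTTP server.
Future<void> runDrain(
  ServerLifecycle lifecycle, {
  required void Function(int closeCode) closeWebSockets,
  required Duration wait,
  required Future<void> Function() closeHttpServer,
}) async {
  lifecycle.beginDrain(); // /ready now answers 503 draining
  closeWebSockets(drainCloseCode); // clients reconnect with their last offset
  await Future<void>.delayed(wait); // let in-flight requests finish
  await closeHttpServer();
}
```

Flipping readiness first means the load balancer stops routing new traffic to the instance before any client is told to reconnect.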

Scale-Out Contract

flowchart LR
    Client1["Client A"] --> LB["Load Balancer"]
    Client2["Client B"] --> LB
    LB -- sticky WebSocket --> S1["Server Instance 1"]
    LB -- HTTP --> S2["Server Instance 2"]
    S1 --> PG[("PostgreSQL")]
    S2 --> PG

    subgraph Future["Future Non-Sticky Mode"]
        Bus["Redis or NATS Event Bus"]
    end

    PG -. committed offset .-> Bus
    Bus -. notify .-> S1
    Bus -. notify .-> S2

The future event-bus contract is:

  1. Persist command event and snapshot in PostgreSQL.
  2. Publish the committed offset after commit.
  3. Each instance reads from PostgreSQL by offset before broadcasting.
  4. Keep /ready and graceful drain behavior unchanged.
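
Step 3 is the part that makes the bus safe to lose messages: the notification carries only an offset, and each instance re-reads the rows from the database. Below is a sketch of that behavior for a single match on one instance, with the PostgreSQL read replaced by a callback. `LogEvent`, `InstanceFanout`, and `readRange` are hypothetical names for a design the post describes as future work.

```dart
class LogEvent {
  final int offset;
  final String payload;
  const LogEvent(this.offset, this.payload);
}

/// Per-match fanout state on one server instance.
class InstanceFanout {
  /// Reads committed rows with fromOffset <= offset <= toOffset, in order.
  final List<LogEvent> Function(int fromOffset, int toOffset) readRange;

  /// Sends one event to this instance's connected sockets.
  final void Function(LogEvent event) broadcast;

  int _lastBroadcast;

  InstanceFanout({
    required this.readRange,
    required this.broadcast,
    int startAfter = 0,
  }) : _lastBroadcast = startAfter;

  /// Called when the bus reports that [committedOffset] is durable.
  /// Duplicate or late notifications are ignored. A skipped notification
  /// is healed by the next one, because the read always starts at the
  /// first offset this instance has not yet broadcast.
  void onCommitted(int committedOffset) {
    if (committedOffset <= _lastBroadcast) return;
    for (final event in readRange(_lastBroadcast + 1, committedOffset)) {
      broadcast(event);
      _lastBroadcast = event.offset;
    }
  }
}
```

Under this contract the bus never carries game data, so fog-of-war filtering still happens in one place, at broadcast time.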

Observability

flowchart LR
    Dispatch["ServerCommandTransport"] --> Metrics["ServerMetrics"]
    Registry["MatchRegistry"] --> Metrics
    Broadcaster["EventBroadcaster"] --> Metrics
    Metrics --> Endpoint["GET /metrics"]
    Endpoint --> Prometheus["Prometheus"]
    Prometheus --> Alerts["Alert Rules"]

- alert: AonwCommandFailureRateHigh
  expr: |
    rate(aonw_command_dispatch_failed_total[5m])
      / clamp_min(rate(aonw_command_dispatch_total[5m]), 0.001) > 0.05
  for: 10m
  labels:
    severity: warning

Test Strategy

flowchart TB
    A["Protocol tests. wire codecs"] --> B["Transport tests\nHTTP and WS clients"]
    B --> C["Server domain tests. validator, authorizer, reducer"]
    C --> D["Persistence tests. Postgres repositories"]
    D --> E["Route tests. match/auth routes"]
    E --> F["Integration tests. full match + Postgres smoke"]
    F --> G["Manual chaos drill. SIGTERM, reconnect, replay"]

flutter test
(cd server && dart analyze --fatal-infos && dart test)
(cd packages/aonw_core && dart analyze --fatal-infos && dart test)
tool/run_postgres_smoke.sh

Trade-Offs

| Decision | Benefit | Cost |
| --- | --- | --- |
| HTTP for commands | Clean ack, auth, retry, rate limit | More overhead than raw WebSocket commands |
| WebSocket as notification bus | Simple reconnect and replay | Client often reloads snapshot after events |
| Snapshot + event log | Fast bootstrap and robust recovery | More writes per command |
| Sticky WebSocket sessions | Simple production path | Future event bus needed for non-sticky scale-out |
| JSONB state | Fast protocol evolution | Less relational structure |
| Server-authoritative AI | Same rules as humans | More backend CPU work |

Roadmap

mindmap
  root((Multiplayer Roadmap))
    Event bus
      Redis
      NATS
      Fanout by committed offset
    Snapshots
      Historical snapshots
      Replay debugger
      Diff tooling
    Protocol
      Version negotiation
      Feature flags
      Migration tests
    Security
      Refresh during WS reconnect
      Stronger anti-spam heuristics
      Abuse dashboards
    UX
      Reconnect overlay
      Conflict messages
      Spectator mode

Final Architecture in One Diagram

flowchart LR
    Intent["Player Intent"] --> Command["WireCommand"]
    Command --> Auth["Auth + Authorization"]
    Auth --> Reduce["Server Reduce"]
    Reduce --> Persist["Event + Snapshot Transaction"]
    Persist --> Ack["Ack with Snapshot"]
    Persist --> Notify["Filtered Live Event"]
    Notify --> Reconnect["Replay by Offset"]
    Ack --> ClientState["Client Authoritative State"]
    Reconnect --> ClientState

The key lesson for me was that multiplayer for a turn-based 4X game should be designed as a consistency system, not as screen synchronization. A command needs an actor, tick, turn, and authoritative result. An event needs an offset. A snapshot needs to be safely projectable for one player. Reconnect needs to be normal. Once those contracts exist, the rest of the system becomes much calmer to evolve.
