Real-time messaging, and handling a production emergency mid-project
⏱ ~75 min · 2 Sprints + Hotfix · Advanced
Projects 01-05 shipped without incidents. Real projects have production bugs. This project deploys Sprint 1 to production, then a critical bug appears. You'll use /agile-ship-hotfix for the structured emergency response and learn when to use /agile-ship-rollback.
Every team ships bugs eventually. The difference between chaos and control is having a structured incident response. This project teaches you to replace panic with process: branch from main, reproduce with a test, fix minimally, review fast, deploy, and learn from it.
| Role | What they do in this project | When |
|---|---|---|
| @po | Structures your requirements into user stories, accepts completed work | Story creation, sprint review, acceptance |
| @sm | Facilitates ceremonies, escalates severity during incidents, leads post-mortem | Sprint ceremonies, incident response |
| 🏗️ @arch | Evaluates SSE vs WebSockets, recommends real-time approach | Project setup, brainstorm |
| 👨‍💻 @lead | Reviews code quality, conducts expedited hotfix reviews, co-leads post-mortem | Code review, hotfix review, post-mortem |
| ⌨️ @dev | Implements features with TDD, writes hotfix | Implementation, hotfix |
| 🧪 @qa | Verifies Definition of Done, checks test coverage | DoD verification |
| @devops | Runs CI pipeline, handles deployment, executes rollback | CI/CD, deploy, rollback |
| 🛡️ @sec | Reviews WebSocket authentication and authorization | Security review in Sprint 2 |
"I want to build a real-time chat app with rooms and messaging. Users should be able to create rooms, join them, and send messages that appear instantly for everyone in the room."
/agile-explore-brainstorm
Before choosing a technology, @arch evaluates the options. This is a deliberate contrast with Project 03 (Realtime Dashboard), which used SSE.
/agile-explore-brainstorm I need real-time messaging for a chat app. Messages need to flow in both directions: users send and receive. Rooms with multiple users. Presence indicators (who's online).
🏗️ @arch responds with a scoring matrix:
## /agile-explore-brainstorm: Results
### Problem Statement
Real-time bidirectional messaging for multi-user chat rooms.
Constraints: instant delivery, presence awareness, multiple rooms.
### Scoring Matrix
| Approach | Feasibility | Maintainability | Performance | Risk | Total |
|---------------------|------------|-----------------|-------------|------|-------|
| WebSockets (ws/Socket.io) | 9/10 | 8/10 | 9/10 | 8/10 | 34 |
| SSE + POST fallback | 7/10 | 6/10 | 6/10 | 7/10 | 26 |
| Long polling | 6/10 | 5/10 | 4/10 | 8/10 | 23 |
| WebTransport | 5/10 | 5/10 | 10/10 | 4/10 | 24 |
### Recommendation
WebSockets: clear winner for bidirectional chat.
### Why NOT SSE This Time?
In Project 03 (Realtime Dashboard), SSE was the right choice: the server
pushed updates and clients only read. Chat is different: users SEND
messages too. SSE is unidirectional (server → client only). You'd need
a separate POST endpoint for sending, creating unnecessary complexity.
WebSockets give you a single persistent connection for both directions.
### What Would Change This
- If browser support is a hard constraint → SSE + POST fallback
- If you need UDP-like performance → WebTransport (experimental)
"WebSockets it is. The bidirectional requirement makes it the obvious choice. SSE worked for the dashboard but won't work here."
💬 "Set up the project with WebSocket support and a test framework.
Then run /agile-setup-init to populate the memory."
→ @dev scaffolds: package.json, src/, WebSocket server setup, test config
→ /agile-setup-init scans and populates memory files
→ /agile-setup-onboard generates coding rules
→ /agile-setup-health → ✅ All checks pass
💬 /agile-story-create I need users to register with a username
and password so they can be identified in chat rooms
→ @po creates US-001: "User registration" (3 pts)
💬 /agile-story-create I need users to create new chat rooms
with a name and join existing rooms
→ @po creates US-002: "Create and join rooms" (3 pts)
💬 /agile-story-create I need users to send messages in a room
and have them appear instantly for all room members via WebSocket
→ @po creates US-003: "Send and receive messages" (5 pts)
💬 /agile-story-create I need to see the message history when
joining a room so I can catch up on the conversation
→ @po creates US-004: "Message history" (2 pts)
/agile-sprint-planning
@sm facilitates:
## Sprint Planning Output
### Sprint Goal
"Deliver a working chat app with rooms, messaging, and history"
### Sprint Backlog
| Story | Title | Points | Priority |
|--------|--------------------------|--------|----------|
| US-001 | User registration | 3 | P1 |
| US-002 | Create and join rooms | 3 | P1 |
| US-003 | Send and receive messages | 5 | P1 |
| US-004 | Message history | 2 | P1 |
### Capacity
- Committed: 13 points
- Sprint duration: 1 week
/agile-code-branch feature US-001 user-registration
/agile-story-plan US-001
/agile-code-tdd US-001
  🔴 Test: POST /auth/register creates user → ✓
  🟢 Implement registration with hashed password → ✓
  🔴 Test: duplicate username returns 409 → ✓
  🟢 Add unique constraint check → ✓
  🔴 Test: missing fields return 400 → ✓
  🟢 Add validation → ✓
/agile-code-ci → ✅ All green
/agile-code-commit → feat(auth): add user registration endpoint
/agile-code-pr → PR created
/agile-code-pr-review → ✅ Approved (🟡 S2: extract validation to middleware)
/agile-code-merge → Squash merged to develop
/agile-story-dod → ✅ DONE
/agile-story-accept → ✅ ACCEPTED (3 points)
/agile-code-branch feature US-002 rooms
/agile-story-plan US-002
/agile-code-tdd US-002
  🔴 Test: POST /rooms creates a room → ✓
  🟢 Implement room creation → ✓
  🔴 Test: POST /rooms/:id/join adds user to room → ✓
  🟢 Implement join with WebSocket room subscription → ✓
  🔴 Test: joining nonexistent room returns 404 → ✓
  🟢 Add room existence check → ✓
/agile-code-ci → ✅ All green
/agile-code-commit → feat(rooms): add create and join room endpoints
/agile-code-pr → PR created
/agile-code-pr-review → ✅ Approved
/agile-code-merge → Squash merged to develop
/agile-story-dod → ✅ DONE
/agile-story-accept → ✅ ACCEPTED (3 points)
/agile-code-branch feature US-003 messaging
/agile-story-plan US-003
/agile-code-tdd US-003
  🔴 Test: WebSocket message in room broadcasts to all members → ✓
  🟢 Implement message broadcast via WebSocket → ✓
  🔴 Test: message persists to database → ✓
  🟢 Add message persistence layer → ✓
  🔴 Test: message to room user hasn't joined is rejected → ✓
  🟢 Add room membership check → ✓
  🔴 Test: message includes sender, timestamp, content → ✓
  🟢 Add message envelope format → ✓
/agile-code-ci → ✅ All green
/agile-code-commit → feat(messages): add real-time messaging via WebSocket
/agile-code-pr → PR created
/agile-code-pr-review → ✅ Approved (🟡 S2: consider message size limit)
/agile-code-merge → Squash merged to develop
/agile-story-dod → ✅ DONE
/agile-story-accept → ✅ ACCEPTED (5 points)
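The US-003 cycle above adds three behaviours: broadcast to room members, a membership check, and a message envelope with sender, timestamp, and content. The following is a minimal in-memory sketch of how those pieces could fit together, not the project's actual code. The names `rooms`, `joinRoom`, `buildEnvelope`, and `broadcast` are hypothetical, persistence is left out, and a "connection" is assumed to be any object with a `userId` and a `send(string)` method, which is the only part of a WebSocket the sketch relies on.

```javascript
// Hypothetical in-memory room registry: roomId -> Set of member connections.
const rooms = new Map();

// Registers a connection as a member of a room, creating the room entry if needed.
function joinRoom(roomId, connection) {
  if (!rooms.has(roomId)) rooms.set(roomId, new Set());
  rooms.get(roomId).add(connection);
}

// Wraps raw content in the envelope the tests describe:
// sender, timestamp, content (plus the room it belongs to).
function buildEnvelope(roomId, senderId, content, now = new Date()) {
  return { roomId, sender: senderId, timestamp: now.toISOString(), content };
}

// Rejects senders who have not joined the room; otherwise sends the
// serialized envelope to every member, including the sender.
function broadcast(roomId, sender, content) {
  const members = rooms.get(roomId);
  if (!members || !members.has(sender)) {
    return { status: 'rejected', reason: 'not a member of this room' };
  }
  const payload = JSON.stringify(buildEnvelope(roomId, sender.userId, content));
  for (const member of members) member.send(payload);
  return { status: 'delivered', recipients: members.size };
}
```

Because the sketch only depends on `send`, the same functions can be exercised in unit tests with plain objects and later wired to real sockets.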
/agile-code-branch feature US-004 message-history
/agile-story-plan US-004
/agile-code-tdd US-004
  🔴 Test: GET /rooms/:id/messages returns last 50 messages → ✓
  🟢 Implement paginated message history → ✓
  🔴 Test: empty room returns empty array → ✓
  🟢 Handle edge case → ✓
/agile-code-ci → ✅ All green
/agile-code-commit → feat(messages): add message history endpoint
/agile-code-pr → PR created
/agile-code-pr-review → ✅ Approved
/agile-code-merge → Squash merged to develop
/agile-story-dod → ✅ DONE
/agile-story-accept → ✅ ACCEPTED (2 points)
/agile-sprint-review
Sprint Goal: "Deliver a working chat app with rooms, messaging, and history"
Sprint Goal Met: ✅ Yes
Velocity: 13/13 points (100%)
/agile-sprint-retro
What went well: WebSocket architecture clean, TDD caught edge cases
To improve: Need encoding tests for user input fields
Action items: add input sanitization tests in Sprint 2
/agile-memory-learn → Saved Sprint 1 learnings
This is the critical step. The app goes live and real users start using it.
/agile-ship-changelog
## Changelog: v1.0.0
### Added
- User registration with secure password hashing (US-001)
- Create and join chat rooms (US-002)
- Real-time messaging via WebSockets (US-003)
- Message history with pagination (US-004)
/agile-ship-release v1.0.0
@devops: Release v1.0.0
Branch: release/v1.0.0 from develop
CI: ✅ Green
Merged to main ✓
Tagged: v1.0.0 ✓
Back-merged to develop ✓
/agile-ship-deploy
@devops: Deployed v1.0.0
Smoke tests: ✅ All endpoints responding
WebSocket connections: ✅ Healthy
Health check: ✅
All smoke tests pass. Users start creating rooms and chatting. Everything looks good... for about two days.
Two days after deploy, users report: "Messages in rooms with special characters (emojis, accented characters) in the room name cause the app to crash." A user created a room called "cafe-☕" and every message sent to it crashes the server.
"Users are reporting crashes when room names have special characters like emojis or accents. Production is affected: the app crashes for any room with these characters."
@sm assesses:
## Incident Assessment
Severity: S0 (CRITICAL)
Impact: App crashes for affected users. Any room with non-ASCII
characters in the name is unusable. Server restarts but crashes
again on next message to affected rooms.
Decision: Trigger the hotfix workflow immediately.
Do NOT wait for Sprint 2 planning. This is a production emergency.
/agile-ship-hotfix: The Full Emergency Flow
This is the most important section of Project 06. Every step of the hotfix workflow is shown in detail because this is what you'll do under pressure in real projects.
/agile-code-branch hotfix crash-special-chars
⌨️ @dev: Hotfix branch created
Branch: hotfix/crash-special-chars
From: main (v1.0.0)
Linked to: INCIDENT-001
main reflects what's in production. develop may already have Sprint 2 work that's not ready to ship. If you branch from develop, your hotfix deploy could include untested Sprint 2 code, turning one bug into many. Always branch hotfixes from main.
/agile-code-tdd Reproduce the crash: room name with emoji causes message send to fail
// test/hotfix-special-chars.test.js
test('sending a message to a room with emoji name should succeed', async () => {
  const room = await createRoom({ name: 'cafe-☕' });
  const ws = await connectToRoom(room.id, testUser);
  const response = await sendMessage(ws, {
    roomId: room.id,
    content: 'Hello from the cafe!'
  });
  expect(response.status).toBe('delivered');
  expect(response.content).toBe('Hello from the cafe!');
});
// Result: ❌ FAIL: TypeError: Cannot read properties of undefined
// (reading 'normalize'); the room name encoding crashes
// the message routing lookup

// src/rooms/normalize.js: BEFORE (broken)
function normalizeRoomName(name) {
  return name.toLowerCase().replace(/[^a-z0-9-]/g, '');
}

// src/rooms/normalize.js: AFTER (fixed)
function normalizeRoomName(name) {
  return name.normalize('NFC').toLowerCase();
}
// The bug: the old regex stripped ALL non-ASCII characters,
// turning "cafe-☕" into "cafe-" which then failed to match
// the stored room ID in the WebSocket routing table.
// Result: ✅ PASS
The fix is two lines. Hotfixes must be minimal: fix the bug, nothing else. This is not the time to refactor, add features, or clean up tech debt. The smaller the change, the lower the risk.
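To see why the change matters, the two versions can be run side by side. This is only an illustration: the functions are renamed `normalizeRoomNameOld` and `normalizeRoomNameFixed` (hypothetical names) so both can live in one file, and the accented sample inputs are assumptions added for the demo.

```javascript
// BEFORE: lowercases, then deletes everything outside a-z, 0-9 and "-".
function normalizeRoomNameOld(name) {
  return name.toLowerCase().replace(/[^a-z0-9-]/g, '');
}

// AFTER: canonical Unicode composition (NFC), then lowercase. Nothing is dropped.
function normalizeRoomNameFixed(name) {
  return name.normalize('NFC').toLowerCase();
}

const emojiRoom = 'cafe-\u2615';   // "cafe-☕"
const decomposed = 'Cafe\u0301';   // "Café" typed as e + combining acute accent
const precomposed = 'Caf\u00e9';   // "Café" typed as a single é code point

console.log(normalizeRoomNameOld(emojiRoom));   // "cafe-"  (emoji stripped)
console.log(normalizeRoomNameFixed(emojiRoom)); // "cafe-☕" (emoji kept)

// NFC makes both spellings of "Café" map to the same key.
console.log(
  normalizeRoomNameFixed(decomposed) === normalizeRoomNameFixed(precomposed)
); // true
```

The NFC step is what keeps visually identical names from producing different lookup keys, which is the same class of mismatch that broke the routing table.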
/agile-code-ci
@devops: CI Pipeline Results
Build: ✅
Tests: ✅ 24/24 passing (including new hotfix test)
Lint: ✅ 0 errors
Security: ✅ No vulnerabilities
All existing tests still pass, so the fix doesn't introduce regressions.
/agile-code-pr-review
👨‍💻 @lead: EXPEDITED Hotfix Review (30 min max, not 24 hours)
| # | Check | Status |
|---|--------------------------------|--------|
| 1 | Does it fix the reported issue? | ✅ |
| 2 | Is the fix minimal? | ✅ (2 lines changed) |
| 3 | Does it introduce regressions? | ✅ No (all 24 tests pass) |
| 4 | Is the new test adequate? | ✅ Covers the exact scenario |
Score: 9/10
Verdict: ✅ APPROVED, ship it
Hotfix reviews are fast (30 minutes max) but never skipped. @lead checks four things: Does it fix the issue? Is it minimal? Does it introduce regressions? Is the new test adequate? Skipping review under pressure is how you turn one production bug into two.
/agile-code-merge
⌨️ @dev: Hotfix merge complete
1. Merged hotfix/crash-special-chars → main ✓
2. Tagged: v1.0.1 ✓
3. Merged hotfix/crash-special-chars → develop ✓
4. Branch hotfix/crash-special-chars deleted
The hotfix must merge into both main and develop. If you only merge to main, the fix is lost the next time you release from develop. Your v1.1.0 release would reintroduce the bug.
/agile-ship-deploy
@devops: Deployed v1.0.1
Smoke tests: ✅ All endpoints responding
Hotfix verification: ✅ Room "cafe-☕" → messages delivered
Health check: ✅
WebSocket connections: ✅ Stable
Sometimes the hotfix deploy itself causes problems. This is when you use /agile-ship-rollback.
/agile-ship-rollback --reason "hotfix deploy caused 500 errors on the chat endpoint"
## /agile-ship-rollback: Emergency Rollback
@devops executes:
1. Rolling back to: v1.0.0
2. Deploy v1.0.0: ✅ Complete
3. Health checks: ✅ All passing
4. WebSocket connections: ✅ Re-established
5. 500 errors: ✅ Resolved
### Status
- Production is back on v1.0.0
- The original bug (special chars) still exists
- Team has time to investigate why the hotfix deploy failed
### Next Steps
- Debug the deployment issue (not the code fix)
- Re-attempt deploy after fixing deployment config
Roll back when the deploy itself is broken: you need production stable NOW and can debug later. Fix forward when the code fix is almost right but needs a small tweak, since rolling back would reintroduce the original bug. Rule of thumb: rollback is faster than debugging under pressure.
After the incident is resolved, @sm and 👨‍💻 @lead conduct a blameless post-mortem. The goal is learning, not blame.
## Blameless Post-Mortem: INCIDENT-001
### Timeline
- Day 0: v1.0.0 deployed. All smoke tests pass.
- Day 2: First user report: room "cafe-☕" crashes on message send.
- Day 2 + 1h: Incident triaged as S0.
- Day 2 + 2h: Hotfix branch created, test written, fix applied.
- Day 2 + 3h: Expedited review approved. v1.0.1 deployed.
- Day 2 + 3.5h: Verified fix in production. Incident closed.
### Root Cause
Non-ASCII (Unicode) characters not handled in room name normalization.
The normalizeRoomName() function used a regex that stripped
all non-ASCII characters, breaking the room ID lookup for
any room with emojis, accents, or other Unicode characters.
### Why It Wasn't Caught
- No test cases used non-ASCII characters in room names
- Smoke tests only used ASCII room names
- No fuzzing or Unicode-aware input testing
### Action Items
| # | Action | Owner | Due |
|---|--------------------------------------------------|-------|-----------|
| 1 | Add encoding tests for ALL user-input fields | @qa | Sprint 2 |
| 2 | Add monitoring/alerting for crash rate spikes | @devops | Sprint 2 |
| 3 | Add Unicode room names to smoke test suite | @dev | Sprint 2 |
| 4 | Review all string normalization functions | @lead | Sprint 2 |
Total time from report to fix in production: 3.5 hours. That's the power of a structured hotfix workflow. No panic, no guessing, just process: branch → test → fix → review → deploy → learn.
After the incident, the team continues. The post-mortem action items are incorporated alongside new features.
💬 /agile-story-create I need to show which users are currently
online in a room so people know who's available to chat
→ @po creates US-005: "Online presence indicators" (3 pts)
💬 /agile-story-create I need users to react to messages
with emoji reactions (such as ❤️) without sending
a separate message
→ @po creates US-006: "Message reactions" (2 pts)
💬 /agile-story-create I need users to receive notifications
when new messages arrive in rooms they've joined but aren't
currently viewing
→ @po creates US-007: "Room notifications" (5 pts)
@sm facilitates:
## Sprint 2 Planning
### Sprint Goal
"Add presence, reactions, and notifications to enhance the chat experience"
### Sprint Backlog
| Story | Title | Points | Priority |
|--------|-------------------------|--------|----------|
| US-005 | Online presence | 3 | P1 |
| US-006 | Message reactions | 2 | P1 |
| US-007 | Room notifications | 5 | P1 |
### Capacity
- Velocity (Sprint 1): 13 points
- Committed: 10 points (conservative, accounting for incident overhead)
### Notes
- Post-mortem action items (encoding tests, monitoring) are
  technical tasks, not stories; they are woven into Sprint 2 work
/agile-code-branch feature US-005 online-presence
/agile-story-plan US-005
/agile-code-tdd US-005
  🔴 Test: joining room broadcasts "user joined" to members → ✓
  🟢 Implement presence events on WebSocket connect → ✓
  🔴 Test: leaving room broadcasts "user left" → ✓
  🟢 Implement disconnect detection with heartbeat → ✓
  🔴 Test: GET /rooms/:id/members returns online users → ✓
  🟢 Implement members endpoint → ✓
/agile-code-ci → ✅ All green
/agile-code-commit → feat(presence): add online presence indicators
/agile-code-pr → PR created
/agile-code-pr-review → ✅ Approved
/agile-code-merge → Squash merged to develop
/agile-story-dod → ✅ DONE
/agile-story-accept → ✅ ACCEPTED (3 points)
/agile-code-branch feature US-006 reactions
/agile-story-plan US-006
/agile-code-tdd US-006
  🔴 Test: adding reaction to message broadcasts to room → ✓
  🟢 Implement reaction WebSocket events → ✓
  🔴 Test: message includes reaction counts in history → ✓
  🟢 Add reaction aggregation to message query → ✓
/agile-code-ci → ✅ All green
/agile-code-commit → feat(reactions): add emoji reactions to messages
/agile-code-pr → PR created
/agile-code-pr-review → ✅ Approved
/agile-code-merge → Squash merged to develop
/agile-story-dod → ✅ DONE
/agile-story-accept → ✅ ACCEPTED (2 points)
/agile-security-review
Before building notifications (which require persistent WebSocket connections), 🛡️ @sec reviews the WebSocket authentication model.
/agile-security-review Review WebSocket authentication. We're about to add persistent notification connections and need to make sure auth is solid.
🛡️ @sec responds:
## /agile-security-review: WebSocket Auth Review
### Findings
| # | Severity | Finding | Risk |
|---|----------|---------|------|
| 1 | 🔴 S0 | Missing token validation on WS upgrade handshake; auth only checked on HTTP endpoints, not on WebSocket connection | Any client can connect without authentication |
| 2 | 🟡 S2 | No token expiry check on long-lived WS connections | Revoked tokens remain active until disconnect |
| 3 | 🟡 S2 | Room membership not re-validated on reconnect | Users could receive messages from left rooms |
### Required Before Sprint 2 Release
Fix #1 is mandatory: it is an S0 security finding.
### Recommendation
Add JWT validation to the WebSocket upgrade handshake.
Reject connections with invalid/missing/expired tokens
BEFORE the upgrade completes.
The WebSocket upgrade handshake accepts connections without checking the JWT token. This means unauthenticated users can connect and receive messages. This must be fixed before the next release.
/agile-code-tdd Fix S0: validate JWT on WebSocket upgrade handshake
🔴 RED:
test('WS connection without token is rejected', async () => {
  const ws = new WebSocket('ws://localhost:3000/chat');
  await expect(waitForOpen(ws)).rejects.toThrow('401');
});
// ❌ FAIL: connection succeeds without token
🟢 GREEN:
// src/websocket/auth.js
server.on('upgrade', (request, socket, head) => {
  const token = parseToken(request);
  if (!token || !verifyJWT(token)) {
    socket.write('HTTP/1.1 401 Unauthorized\r\n\r\n');
    socket.destroy();
    return;
  }
  // ... proceed with upgrade
});
// ✅ PASS
/agile-code-ci → ✅ All green (including existing WS tests updated with auth)
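The upgrade handler above calls `parseToken` and `verifyJWT` without showing them. The sketch below is one possible shape for those helpers, not the project's actual implementation. It assumes HS256-signed tokens and a shared secret passed as a second argument (the handler above calls `verifyJWT(token)` with one argument, so a real version would presumably read the secret from configuration). It also assumes the token arrives either in an `Authorization: Bearer` header or in a `?token=` query parameter. In production, a maintained JWT library is usually the safer choice.

```javascript
const crypto = require('node:crypto');

// Hypothetical parseToken: reads "Authorization: Bearer <token>" first, then
// falls back to a ?token= query parameter, since the browser WebSocket API
// cannot set custom headers on the handshake request.
function parseToken(request) {
  const header = request.headers['authorization'] || '';
  if (header.startsWith('Bearer ')) return header.slice('Bearer '.length);
  const url = new URL(request.url, 'http://localhost');
  return url.searchParams.get('token');
}

// Hypothetical verifyJWT for HS256 only: checks the signature and the exp
// claim. Returns the decoded payload, or null if anything is wrong.
function verifyJWT(token, secret, nowSeconds = Math.floor(Date.now() / 1000)) {
  const parts = token.split('.');
  if (parts.length !== 3) return null;
  const [headerB64, payloadB64, signatureB64] = parts;
  try {
    const header = JSON.parse(Buffer.from(headerB64, 'base64url').toString('utf8'));
    if (header.alg !== 'HS256') return null; // refuse "none" and other algorithms
    const expected = crypto
      .createHmac('sha256', secret)
      .update(`${headerB64}.${payloadB64}`)
      .digest();
    const actual = Buffer.from(signatureB64, 'base64url');
    if (actual.length !== expected.length) return null;
    if (!crypto.timingSafeEqual(actual, expected)) return null;
    const payload = JSON.parse(Buffer.from(payloadB64, 'base64url').toString('utf8'));
    if (typeof payload.exp === 'number' && payload.exp <= nowSeconds) return null;
    return payload;
  } catch {
    return null; // malformed base64 or JSON
  }
}
```

Tokens passed in a query string tend to end up in access logs, so if that fallback is used, short-lived tokens are a sensible precaution.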
/agile-code-branch feature US-007 notifications
/agile-story-plan US-007
/agile-code-tdd US-007
  🔴 Test: user receives notification for message in joined room → ✓
  🟢 Implement notification dispatch via WebSocket → ✓
  🔴 Test: no notification for room user is currently viewing → ✓
  🟢 Track active room per connection → ✓
  🔴 Test: notification includes room name, sender, preview → ✓
  🟢 Add notification payload format → ✓
  🔴 Test: unread count increments per room → ✓
  🟢 Add unread counter per user per room → ✓
/agile-code-ci → ✅ All green
/agile-code-commit → feat(notifications): add room notifications with unread counts
/agile-code-pr → PR created
/agile-code-pr-review → ✅ Approved
/agile-code-merge → Squash merged to develop
/agile-story-dod → ✅ DONE
/agile-story-accept → ✅ ACCEPTED (5 points)
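The US-007 tests above describe three rules: notify members of a joined room, skip whoever is currently viewing that room, and keep an unread count per user per room. A minimal sketch of that bookkeeping could look like the following. The class name `UnreadTracker`, its methods, the 80-character preview, and the choice not to notify the sender are all assumptions made for illustration.

```javascript
// Hypothetical tracker: which room each user is viewing, plus unread counts
// per user per room.
class UnreadTracker {
  constructor() {
    this.activeRoom = new Map(); // userId -> roomId currently on screen
    this.unread = new Map();     // userId -> Map(roomId -> count)
  }

  // Called when a user opens a room; opening it clears its unread count.
  setActiveRoom(userId, roomId) {
    this.activeRoom.set(userId, roomId);
    this.unread.get(userId)?.delete(roomId);
  }

  // Called once per new message. Returns the notifications to dispatch.
  onMessage({ roomId, roomName, sender, content }, memberIds) {
    const notifications = [];
    for (const userId of memberIds) {
      if (userId === sender) continue;                      // assumption: no self-notify
      if (this.activeRoom.get(userId) === roomId) continue; // already viewing the room
      if (!this.unread.has(userId)) this.unread.set(userId, new Map());
      const perRoom = this.unread.get(userId);
      const count = (perRoom.get(roomId) || 0) + 1;
      perRoom.set(roomId, count);
      notifications.push({
        userId,
        roomName,
        sender,
        preview: content.slice(0, 80),
        unread: count,
      });
    }
    return notifications;
  }

  // Current unread count for one user in one room (0 if none recorded).
  unreadCount(userId, roomId) {
    return this.unread.get(userId)?.get(roomId) || 0;
  }
}
```

Keeping this state separate from the socket layer lets each of the four tests run without a live WebSocket connection.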
/agile-sprint-review
Sprint Goal Met: ✅ Yes
Velocity: 10/10 points (100%)
/agile-sprint-retro
What went well: Security review caught critical auth gap before release
What went well: Post-mortem action items all addressed
To improve: Should run /agile-security-review earlier in the sprint
Action items: security review in Sprint 0 for future projects
/agile-memory-learn → Saved Sprint 2 learnings
/agile-ship-changelog
## Changelog: v1.1.0
### Added
- Online presence indicators (US-005)
- Emoji reactions on messages (US-006)
- Room notifications with unread counts (US-007)
### Fixed
- JWT validation on WebSocket upgrade handshake (Security)
/agile-ship-release v1.1.0
@devops: Release v1.1.0
Tagged: v1.1.0 ✓
Merged to main ✓
Back-merged to develop ✓
/agile-ship-deploy
@devops: Deployed v1.1.0
Smoke tests: ✅ (now includes Unicode room names!)
Health check: ✅
| Metric | Value |
|---|---|
| Stories completed | 7/7 |
| Story points delivered | 23 (Sprint 1: 13, Sprint 2: 10) |
| Sprint goals met | ✓ Both |
| Production incidents | 1 (resolved in 3.5 hours) |
| Release versions | v1.0.0, v1.0.1 (hotfix), v1.1.0 |
| Security findings fixed | 1 S0 (WebSocket auth) |
| Roles involved | @po, @sm, @arch, @lead, @dev, @qa, @devops, @sec |
Production bugs are not failures; they're part of shipping software. The hotfix workflow gives you a structured response: branch from main → reproduce with test → fix minimally → expedited review → deploy → post-mortem. Panic is replaced by process.
You also learned that security reviews (/agile-security-review) catch vulnerabilities before they reach users, and rollback (/agile-ship-rollback) is your safety net when a deploy goes wrong.
| Command | What It Does | When You Used It |
|---|---|---|
| /agile-ship-hotfix | Structured emergency fix workflow: branch from main, minimal fix, expedited review, deploy | Phase 3: production crash with special characters |
| /agile-ship-rollback | Emergency rollback to previous stable version when a deploy fails | Phase 3: as fallback if hotfix deploy failed |
| /agile-security-review | Security audit of specific components (auth, input handling, etc.) | Phase 4: WebSocket auth review before notifications |
Why does the hotfix branch from main, not develop?
Is code review skipped for hotfixes?
When should you rollback instead of fix-forward?