# Packet Deduplication — Normalized Schema Migration Plan

## Overview

Split the monolithic `packets` table into two tables:

- `packets` — one row per unique physical transmission (keyed by content hash)
- `observations` — one row per observer sighting (SNR, RSSI, path, observer, timestamp)

This fixes inflated packet counts across the entire app and enables proper "1 transmission seen N times" semantics.
## Current State

- `packets` table: 1 row per observation. ~61 MB, ~30K+ rows. The same hash appears N times (once per observer). Fields mix transmission data (raw_hex, payload_type, decoded_json, hash) with observation data (observer_id, snr, rssi, path_json).
- `packet-store.js`: in-memory mirror of the `packets` table. Indexes: `byId`, `byHash` (hash → [packets]), `byObserver`, `byNode`. All reads are served from RAM; SQLite is write-only for packets.
- Touch surface: ~66 SQL queries across db.js/server.js/packet-store.js. ~12 frontend files consume packet data.
## Milestone 1: Schema Migration (Backend Only)

**Goal:** New tables exist, data migrated, old table preserved as backup. No behavioral changes yet.

### Tasks

1. Create the new schema in `db.js` init:

   ```sql
   CREATE TABLE IF NOT EXISTS transmissions (
     id              INTEGER PRIMARY KEY AUTOINCREMENT,
     raw_hex         TEXT NOT NULL,
     hash            TEXT NOT NULL UNIQUE,
     first_seen      TEXT NOT NULL,
     route_type      INTEGER,
     payload_type    INTEGER,
     payload_version INTEGER,
     decoded_json    TEXT,
     created_at      TEXT DEFAULT (datetime('now'))
   );

   CREATE TABLE IF NOT EXISTS observations (
     id              INTEGER PRIMARY KEY AUTOINCREMENT,
     transmission_id INTEGER NOT NULL REFERENCES transmissions(id),
     hash            TEXT NOT NULL,
     observer_id     TEXT,
     observer_name   TEXT,
     direction       TEXT,
     snr             REAL,
     rssi            REAL,
     score           INTEGER,
     path_json       TEXT,
     timestamp       TEXT NOT NULL,
     created_at      TEXT DEFAULT (datetime('now'))
   );

   CREATE INDEX idx_transmissions_hash ON transmissions(hash);
   CREATE INDEX idx_transmissions_first_seen ON transmissions(first_seen);
   CREATE INDEX idx_transmissions_payload_type ON transmissions(payload_type);
   CREATE INDEX idx_observations_hash ON observations(hash);
   CREATE INDEX idx_observations_transmission_id ON observations(transmission_id);
   CREATE INDEX idx_observations_observer_id ON observations(observer_id);
   CREATE INDEX idx_observations_timestamp ON observations(timestamp);
   ```

2. Write a migration script (`scripts/migrate-dedup.js`):
   - Read all rows from `packets` ordered by timestamp
   - Group by hash
   - For each unique hash: INSERT into `transmissions` (use the first observation's raw_hex, decoded_json, etc.)
   - For each row: INSERT into `observations` with a foreign key to the transmission
   - Verify counts: `SELECT COUNT(*) FROM observations` = old packets count
   - Verify: `SELECT COUNT(*) FROM transmissions` < observations count
   - Do NOT drop the old `packets` table — rename it to `packets_backup`

3. Print migration stats: total packets, unique transmissions, dedup ratio, time taken
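The grouping step can be sketched in plain JavaScript. This is a simplified in-memory model (the real script would stream rows from SQLite); `migrateRows` and the row field selection are hypothetical names for illustration:

```javascript
// Sketch: one transmissions record per unique hash, one observations record
// per original packet row. Rows are sorted by timestamp so the earliest
// sighting supplies the transmission fields and first_seen.
function migrateRows(packetRows) {
  const transmissions = new Map(); // hash -> transmission record
  const observations = [];
  let nextTxId = 1;

  const sorted = [...packetRows].sort((a, b) =>
    a.timestamp < b.timestamp ? -1 : 1
  );

  for (const row of sorted) {
    let tx = transmissions.get(row.hash);
    if (!tx) {
      // First observation of this hash: it defines the transmission.
      tx = {
        id: nextTxId++,
        hash: row.hash,
        raw_hex: row.raw_hex,
        decoded_json: row.decoded_json,
        first_seen: row.timestamp,
      };
      transmissions.set(row.hash, tx);
    }
    observations.push({
      transmission_id: tx.id,
      hash: row.hash,
      observer_id: row.observer_id,
      snr: row.snr,
      timestamp: row.timestamp,
    });
  }

  return { transmissions: [...transmissions.values()], observations };
}
```

The invariants to check afterwards match the validation below: `observations.length` equals the original row count, and `transmissions.length` equals the number of distinct hashes.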
### Validation

- `COUNT(*) FROM observations` = `COUNT(*) FROM packets_backup`
- `COUNT(*) FROM transmissions` = `COUNT(DISTINCT hash) FROM packets_backup`
- Spot-check: pick 5 known multi-observer packets, verify the transmission + observations match

**Risk: LOW** — additive only, old data preserved
## Milestone 2: Dual-Write Ingest

**Goal:** New packets are written to both old and new tables. Read path unchanged. Zero downtime.

### Tasks
1. Update `db.js` `insertPacket()`:
   - On new packet: check if a `transmissions` row exists for the hash
   - If not: INSERT into `transmissions`, get its id
   - If yes: UPDATE `first_seen` if this timestamp is earlier
   - INSERT into `observations` with the transmission_id
   - Still also write to the old `packets` table (dual-write for safety)

2. Update `packet-store.js` `insert()`: mirror the dual-write in the memory model
   - Maintain both the old flat array AND a new `byTransmission` Map
### Validation

- Send test packets, verify they appear in both old and new tables
- Verify a multi-observer packet creates 1 transmission + N observations

**Risk: LOW** — old read path still works as a fallback
## Milestone 3: In-Memory Store Restructure

**Goal:** `packet-store.js` switches from a flat packet array to a transmission-centric model.

### Tasks
1. New in-memory data model:
   - `transmissions: Map<hash, {id, raw_hex, hash, first_seen, payload_type, decoded_json, observations: []}>`
   - Each observation: `{id, observer_id, observer_name, snr, rssi, path_json, timestamp}`

2. Update indexes:
   - `byHash`: hash → transmission object (1:1 instead of 1:N)
   - `byObserver`: observer_id → [observation references]
   - `byNode`: pubkey → [transmission references] (deduped!)
   - `byId`: observation.id → observation (for backward compat with packet detail links)

3. Update `load()`: read from `transmissions` JOIN `observations` instead of `packets`

4. Update query methods:
   - `findPackets()` — returns transmissions by default, with `.observations` attached
   - `findPacketsForNode()` — returns transmissions where the node appears in ANY observation's path/decoded_json
   - `getSiblings()` — becomes `getObservations(hash)` — trivial: just return `transmission.observations`
   - `countForNode()` — returns `{transmissions: N, observations: M}`
### Validation

- All existing API endpoints return valid data
- Packet counts decrease (correctly!) for multi-observer nodes
- `/api/perf` shows no regression

**Risk: MEDIUM** — the core read path changes. Test thoroughly.
## Milestone 4: API Response Changes

**Goal:** APIs return deduped data with observation counts.

### Tasks
1. `GET /api/packets`:
   - Default: return transmissions (1 row per unique packet)
   - Each transmission includes `observation_count` and optionally `observations[]`
   - `?expand=observations` to include the full observation list
   - `?groupByHash` becomes the default behavior (deprecate the param)
   - Preserve the `observer` filter: return transmissions where at least one observation matches

2. `GET /api/nodes/:pubkey/health`:
   - `stats.totalPackets` → `stats.totalTransmissions` (distinct hashes)
   - Add `stats.totalObservations` (the old count, for reference)
   - `recentPackets` → returns transmissions with observation_count

3. `GET /api/nodes/bulk-health`: same changes as health

4. `GET /api/nodes/network-status`: use transmission counts

5. `GET /api/nodes/:pubkey/analytics`: all throughput charts use transmission counts

6. WebSocket broadcast: include `observation_count` when sibling observations exist for the same hash
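The `GET /api/packets` response shape described above could be produced by a serializer along these lines (`serializeTransmission` is a hypothetical helper; the real handler lives in `server.js`):

```javascript
// Shape one transmission for GET /api/packets.
// expand=true corresponds to the ?expand=observations query param.
function serializeTransmission(tx, { expand = false } = {}) {
  const out = {
    hash: tx.hash,
    first_seen: tx.first_seen,
    payload_type: tx.payload_type,
    observation_count: tx.observations.length,
  };
  if (expand) {
    // Full observation list only when the client asks for it.
    out.observations = tx.observations.map(o => ({
      observer_id: o.observer_id,
      snr: o.snr,
      rssi: o.rssi,
      timestamp: o.timestamp,
    }));
  }
  return out;
}
```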
### Backward Compatibility

- Add a `?legacy=1` param that returns old-style flat observations (for any external consumers)
- Include both `totalTransmissions` and `totalObservations` in health responses during the transition
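The `?legacy=1` fallback amounts to flattening transmissions back into one row per observation. A sketch (`toLegacyRows` and the merged field set are illustrative, not the exact legacy schema):

```javascript
// Flatten deduped transmissions back into old-style flat rows for ?legacy=1.
// Each output row merges transmission fields with one observation's fields.
function toLegacyRows(transmissions) {
  const rows = [];
  for (const tx of transmissions) {
    for (const obs of tx.observations) {
      rows.push({
        hash: tx.hash,
        raw_hex: tx.raw_hex,
        payload_type: tx.payload_type,
        observer_id: obs.observer_id,
        snr: obs.snr,
        timestamp: obs.timestamp,
      });
    }
  }
  return rows;
}
```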
**Risk: MEDIUM** — the frontend expects certain shapes. May need a coordinated deploy with Milestone 5.
## Milestone 5: Frontend Updates

**Goal:** The UI shows correct counts and leverages observation data.

### Tasks
1. Packets page:
   - Default view shows transmissions (it already has a groupByHash mode — make it the default)
   - Expand a row to see individual observations with their paths/SNR/RSSI
   - Badge: "×3 observers" on grouped rows

2. Node detail panel (nodes.js + live.js):
   - Show "X transmissions", not "X packets"
   - Or "X packets (seen Y times)" to show both

3. Home page: network stats use transmission counts

4. Node analytics: throughput charts use transmissions

5. Observer detail: keep observation counts (the correct metric for observers)

6. Analytics page: topology/RF analysis uses all observations (SNR per observation is valid data)
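The observer badge on grouped rows reduces to counting distinct observers on a transmission. A minimal sketch (`observerBadge` is a hypothetical helper, not an existing frontend function):

```javascript
// Label for a grouped packets-page row: count distinct observers,
// not raw observation rows (one observer may report twice).
function observerBadge(tx) {
  const distinct = new Set(tx.observations.map(o => o.observer_id)).size;
  return distinct > 1 ? `×${distinct} observers` : '';
}
```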
**Risk: LOW-MEDIUM** — mostly display changes
## Milestone 6: Cleanup

**Goal:** Remove the dual-write, drop the old table, clean up.

### Tasks

- Remove the dual-write from `insertPacket()`
- Drop the `packets_backup` table (after confirming everything works for 1+ week)
- Remove `?legacy=1` support if unused
- Update DEDUP-DESIGN.md → mark as complete
- VACUUM the database
- Tag a release (v2.3.0?)

**Risk: LOW** — cleanup only; all functional changes already proven
## Estimated Scope
| Milestone | Files Modified | Complexity | Can Deploy Independently? |
|---|---|---|---|
| 1. Schema Migration | db.js, new script | Low | Yes — additive only |
| 2. Dual-Write | db.js, packet-store.js | Low | Yes — old reads unchanged |
| 3. Memory Store | packet-store.js | Medium | No — must deploy with M4 |
| 4. API Changes | server.js, db.js | Medium | No — must deploy with M5 |
| 5. Frontend | 8+ public/*.js files | Medium | No — must deploy with M4 |
| 6. Cleanup | db.js, server.js | Low | Yes — after bake period |
- Milestones 1–2: safe to deploy independently, no user-visible changes.
- Milestones 3–5: must ship together (API shape changes + the frontend expects the new shape).
- Milestone 6: ship after a 1-week bake.
## Open Questions

- Table naming: `transmissions` + `observations`? Or keep `packets` and add `observations`? The word "transmission" is more accurate, but "packet" is what the whole UI calls them.
- Packet detail URLs: `#/packet/123` currently uses the observation ID. Keep observation IDs as the URL key, or switch to the hash?
- Path dedup in the paths table: the `paths` table also has per-observation entries. Normalize that too, or leave it as-is?
- Migration on prod: run the migration script before deploying the new code, or make the new code handle both old and new schemas?