Dossier

Medical Transcription Orchestration

Real-time captions for clinical meetings — headless bots + WebSocket fan-out.

Role

Senior Software Engineer

Company

Ceiba

Dates

2025–2026

Stack

Node.js / TypeScript / Express / Docker / WebSockets / Redis / FFmpeg / JWT / Zoom Meeting SDK / Zoom RTMS / C++

Highlights

Designed and built the Node.js orchestration backend coordinating headless meeting bots, audio streams, and a transcription service
Shipped a C++ headless meeting bot (containerized) that captures per-participant audio and opens one TCP stream per participant to the orchestrator
Authored a TypeScript alternative bot on Zoom's Realtime Media Streams (RTMS) SDK; it keeps the same per-meeting container model as the full Meeting SDK path but removes the Linux Meeting SDK's silent failures and a layer of glue-code complexity
Published two TypeScript SDKs: one for bot authors (3-stage handshake, stream API, backpressure, auto-reconnect) and one for caption consumers (typed events, multi-session subscribe, auto-reconnect)
Wired per-participant audio capture, FFmpeg normalization, real-time transcription, and WebSocket caption broadcast as one end-to-end pipeline
Redis-backed session persistence with recovery on restart; dockerode-driven isolated bot container spawning per meeting

Problem

Clinical meetings needed live captions piped directly into the existing clinical UI, with two constraints Zoom’s own captioning did not satisfy: medical-term accuracy and AV-vendor independence. Zoom’s built-in captions misrendered medical terminology often enough to be unusable for clinical documentation, and locking the caption pipeline to one vendor would have blocked a planned expansion to other AV platforms.

Approach

The system is split into three independently replaceable services. A headless meeting bot joins the call and exposes per-participant audio; it was built twice, once in C++ on the full Meeting SDK and once in TypeScript on Zoom’s newer Realtime Media Streams. An orchestrator owns session lifecycle, dockerode-driven bot spawning, FFmpeg audio normalization, and the transcription handoff. A WebSocket fan-out broadcasts captions to subscribed clinical UIs. Two TypeScript SDKs sit on the contract surfaces: bot-side (3-stage handshake, stream API, backpressure, auto-reconnect) and consumer-side (typed events, multi-session subscribe, auto-reconnect). Sessions persist in Redis, so an orchestrator restart does not drop active meetings. The bot/orchestrator boundary is explicitly AV-vendor-agnostic: a non-Zoom platform plugs in by writing a new bot, without rewriting the orchestrator or the consumer SDK.
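To make the consumer-side contract concrete, here is a minimal in-memory sketch of a caption fan-out with typed events and multi-session subscribe. It is an illustration under stated assumptions, not the shipped SDK: the names CaptionEvent, CaptionListener, and CaptionHub and the event fields are hypothetical, and the real system delivers events over WebSockets rather than direct function calls.

```typescript
// Illustrative sketch only: all names and fields below are hypothetical,
// not the actual SDK surface. Transport (WebSockets) is deliberately omitted.

interface CaptionEvent {
  sessionId: string;      // meeting/session the caption belongs to
  participantId: string;  // speaker whose audio produced the caption
  text: string;
  isFinal: boolean;       // false for interim results, true once settled
}

type CaptionListener = (event: CaptionEvent) => void;

class CaptionHub {
  // Prototype-less object so arbitrary session ids are safe as keys.
  private listenersBySession: Record<string, CaptionListener[]> =
    Object.create(null);

  // Subscribe one listener to several sessions at once.
  // Returns a function that removes the listener from all of them.
  subscribe(sessionIds: string[], listener: CaptionListener): () => void {
    for (const id of sessionIds) {
      const current = this.listenersBySession[id] || [];
      if (current.indexOf(listener) === -1) {
        // Copy-on-write keeps any in-flight publish loop stable.
        this.listenersBySession[id] = current.concat([listener]);
      }
    }
    return () => {
      for (const id of sessionIds) {
        const remaining = (this.listenersBySession[id] || []).filter(
          (l) => l !== listener
        );
        if (remaining.length > 0) {
          this.listenersBySession[id] = remaining;
        } else {
          delete this.listenersBySession[id];
        }
      }
    };
  }

  // Deliver an event to every listener of its session.
  // Returns how many listeners accepted it without throwing.
  publish(event: CaptionEvent): number {
    const targets = this.listenersBySession[event.sessionId] || [];
    let delivered = 0;
    for (const listener of targets) {
      try {
        listener(event);
        delivered++;
      } catch (err) {
        // One failing subscriber must not block the others.
      }
    }
    return delivered;
  }
}
```

In this sketch a clinical UI would call subscribe with the session ids it cares about and keep the returned function to unsubscribe on teardown. Keying delivery by session id rather than by vendor concept is what lets a non-Zoom bot feed the same hub unchanged.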

Result

End-to-end caption latency landed around 200 ms, based on informal observation in production; the schedule left no room for a formal benchmark. The orchestrator handles 10–50 concurrent meetings in production, serving five clinical providers across 50+ US hospitals in real time.

What broke

The first cut leaned on Zoom’s Linux Meeting SDK for the headless bot, and it kept biting us in production: intermittent meeting-join failures, a recording-permission prompt the SDK surfaced inside the meeting that patients and clinicians visibly found unsettling, a small slice of added latency, and a maintenance load that did not shrink with time. Replacing the bot with the RTMS path removed those failure modes. RTMS is a purpose-built real-time media server for raw AV streaming, so there is no permission prompt to surface and no meeting-join handshake to fail on. The result was simpler glue code on our side and lower ongoing maintenance cost.