Video conferencing is no longer a convenience: it is the connective tissue of most distributed organisations, and the traffic it carries is exactly the traffic an attacker wants. Board discussions, customer support sessions, medical consultations, engineering reviews: all of it moves over the same SRTP streams, the same TURN relays, the same signed JWTs that nobody on the call ever sees. Securing this stack is not about picking a vendor with a padlock icon. It is about understanding where the protocol, the implementation, and the operational discipline each break, and engineering around the weakest of the three.

In our experience advising teams that build or operate real-time communication platforms, the recurring failure mode is not exotic: it is the gap between what the encryption diagram promises and what the deployed system actually enforces.

What does “end-to-end encrypted” actually mean in a video call?

The phrase is overloaded, and the distinction matters operationally. WebRTC’s default media path uses SRTP with DTLS-negotiated keys, which encrypts the media in transit between each participant and the selective forwarding unit (SFU). That is hop-by-hop encryption: the SFU sees plaintext frames in order to route them. True end-to-end encryption, where the SFU forwards opaque ciphertext and only participating clients hold the key, requires an additional layer on top of SRTP, typically frame-level encryption applied through Insertable Streams, often using the IETF SFrame format (RFC 9605). Most enterprise platforms run hop-by-hop by default and offer E2EE as an opt-in mode that disables features like server-side recording, live transcription, and cloud-based noise suppression. That is not a marketing oversight; it is a structural consequence of who holds the key. Teams choosing a platform need to know which mode they are in before they decide what they are allowed to discuss on it.
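The consequence of "who holds the key" can be written down as a small policy check. The following is a minimal Python sketch, not any platform's real API: the mode names, the feature identifiers, and both function names are hypothetical, and the list of affected features is simply the one named in the paragraph above.

```python
from enum import Enum


class EncryptionMode(Enum):
    HOP_BY_HOP = "hop-by-hop"   # DTLS-SRTP only: the SFU can read media
    END_TO_END = "end-to-end"   # extra frame-level layer: the SFU sees ciphertext


# Hypothetical feature identifiers. These are the features the text says
# need the server to see plaintext media.
SERVER_PLAINTEXT_FEATURES = frozenset({
    "server_side_recording",
    "live_transcription",
    "cloud_noise_suppression",
})


def operator_can_read_media(mode):
    """True when the platform operator holds keys that decrypt the media."""
    return mode is EncryptionMode.HOP_BY_HOP


def available_features(mode, requested):
    """Return the subset of requested features that can work in this mode."""
    requested = set(requested)
    if operator_can_read_media(mode):
        return requested
    # Without plaintext on the server, server-side media features cannot run.
    return requested - SERVER_PLAINTEXT_FEATURES


wanted = ["live_transcription", "screen_share", "server_side_recording"]
print(sorted(available_features(EncryptionMode.END_TO_END, wanted)))
```

A check like this is most useful when it runs before the meeting starts, so that a host who asks for both E2EE and transcription is told which one they are giving up.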
The threat model most teams skip

A defensible threat model for a conferencing platform separates four distinct adversaries: an outsider trying to join a meeting they were not invited to, a malicious participant already on the call, a compromised endpoint (laptop, mobile, conference-room appliance), and the platform operator itself. Different controls answer different adversaries.

| Adversary | Primary control | Where it commonly fails |
| --- | --- | --- |
| Uninvited outsider | Waiting rooms, per-meeting passcodes, signed join tokens with short TTL | Predictable meeting IDs; tokens that never expire |
| Malicious participant | Host moderation, screen-share permissions, lobby controls | Default settings that let any participant share screen or admit others |
| Compromised endpoint | OS-level disk encryption, MDM, conditional access | No telemetry on whether the joining device is managed |
| Platform operator | True E2EE (SFrame / Insertable Streams), customer-managed keys | Hop-by-hop encryption marketed as “end-to-end” |

If a procurement decision is being made on the basis of a single security datasheet, this is the matrix that needs to be filled in honestly, not the one with the green checkmarks.

Why the source code matters, and why “open source” is not a guarantee

A lot of conferencing security debate collapses to “open source vs proprietary,” which is the wrong axis. The relevant property is auditability of the code path that handles keys and media, not the licensing model. Jitsi, mediasoup, and Janus expose their SFU implementations openly, which lets a security team trace how SRTP keys are derived, how DTLS fingerprints are pinned, and how the signalling layer authenticates participants. That visibility is real and useful.

But open source on its own does not produce a secure system. The Heartbleed lesson (a critical OpenSSL flaw that sat in widely deployed code for years) generalises: visibility is necessary, not sufficient.
What matters is whether someone you trust has actually read the code that matters, kept it patched, and validated the build provenance. A proprietary platform with a credible SOC 2 Type II audit and a published penetration-test summary can be more defensible than an open source platform deployed by a team that has never run cargo audit or its equivalent.

The right question for a buyer is not “is it open source?” It is: who has reviewed the cryptographic core, when, and what did they find?

AI-assisted monitoring is a double-edged surface

Newer platforms ship features that use models to flag suspicious join patterns, detect deepfaked participants, or transcribe and summarise meetings. These features are genuinely useful, and they are also one of the largest expansions of the platform’s attack surface in years.

The reason is that AI features almost always require the platform to see plaintext media. Server-side transcription, automated captioning, live summarisation, and behavioural anomaly detection are incompatible with true E2EE by construction. A platform that markets both “AI meeting assistant” and “end-to-end encrypted” is either making you choose between the two for each meeting, or quietly redefining what end-to-end means.

The decision is not whether AI assistance is good or bad. It is whether the cost (handing the platform operator a plaintext copy of the conversation, often retained for model improvement) is acceptable for the meeting in question. That decision should be made per meeting type, not per organisation, and it should be configurable in the client.

Performance, encryption, and the trade-off that is mostly imaginary

There is a persistent belief that strong encryption degrades audio and video quality. For the symmetric ciphers used in SRTP (AES-based, either AES-128 in counter mode with HMAC-SHA1 authentication or AES-GCM) this is observably false on any device with AES-NI or its ARM equivalent, which covers essentially every laptop and phone shipped in the last decade.
The measurable cost is a single-digit percentage of CPU per stream. Where performance actually degrades is in two adjacent places: DTLS handshakes on connection setup, which add latency to the first frame, and the additional cryptographic operations required for true E2EE via Insertable Streams, which can add a few milliseconds per frame and meaningfully increase CPU on low-power endpoints. Neither is a reason to disable encryption. Both are reasons to test on the actual hardware your users have, not on the developer’s M-series MacBook.

A practical checklist before adopting or building a platform

For teams making a buy-or-build decision, the questions worth asking, in order:

1. Does the platform support true E2EE (SFrame, Insertable Streams, MLS-based group keying), and which features does enabling it disable?
2. How are join tokens signed, how long are they valid, and can they be revoked mid-meeting?
3. What is the default screen-share permission for a new meeting? Default-permissive settings are where most real incidents start.
4. Is there a published, recent third-party penetration test? Not a certification: a report.
5. For AI features: is media processed in-region, how long is it retained, and is it used for model training by default?
6. For self-hosted deployments: is the build reproducible, and is there a documented key-rotation procedure?

The first three screen out most platforms. The last three are where the differences between mature vendors and immature ones become visible.

Where this leaves the conversation

Securing video conferencing is not a single decision; it is a stack of decisions where each layer can quietly invalidate the protection of the layer above. Encryption protects the wire; identity controls protect the room; endpoint posture protects the participant; operational discipline protects against the platform itself. None of these substitutes for the others, and the marketing language used by vendors blurs them on purpose.
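The identity layer is the easiest one to make concrete. The checklist question about join tokens (how they are signed, how long they are valid, whether they can be revoked mid-meeting) can be sketched in Python using only the standard library. Everything specific here is an assumption for illustration: the HMAC-SHA256 scheme, the five-minute TTL, the claim names, the in-memory revocation set, and the function names. The token is JWT-like but not a standards-compliant JWT; a production system would normally use a vetted JWT library and a managed secret.

```python
import base64
import hashlib
import hmac
import json
import time

SECRET = b"demo-only-secret"   # hypothetical; load from a secret store in practice
TOKEN_TTL_SECONDS = 300        # assumed short TTL (five minutes)
REVOKED_TOKEN_IDS = set()      # hypothetical in-memory revocation list


def _b64(data):
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _unb64(text):
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(payload):
    return _b64(hmac.new(SECRET, payload.encode("ascii"), hashlib.sha256).digest())


def issue_join_token(meeting_id, user_id, token_id, now=None):
    """Issue a signed token bound to one meeting, with an expiry and an ID."""
    now = time.time() if now is None else now
    claims = {"mid": meeting_id, "sub": user_id, "jti": token_id,
              "exp": int(now) + TOKEN_TTL_SECONDS}
    payload = _b64(json.dumps(claims, sort_keys=True).encode("utf-8"))
    return payload + "." + _sign(payload)


def verify_join_token(token, meeting_id, now=None):
    """Return the claims if the token is valid for this meeting, else None."""
    now = time.time() if now is None else now
    try:
        payload, signature = token.split(".")
        expected = _sign(payload)
        # Constant-time comparison, done before the payload is parsed.
        if not hmac.compare_digest(signature.encode("utf-8"),
                                   expected.encode("ascii")):
            return None
        claims = json.loads(_unb64(payload))
    except ValueError:
        return None  # malformed token, bad base64, or bad JSON
    if claims.get("mid") != meeting_id:
        return None  # token was issued for a different meeting
    if claims.get("exp", 0) <= now:
        return None  # expired
    if claims.get("jti") in REVOKED_TOKEN_IDS:
        return None  # revoked, possibly mid-meeting
    return claims


token = issue_join_token("m-1", "alice", "t-1")
print(verify_join_token(token, "m-1") is not None)
```

Revocation only works mid-meeting if the media server re-checks the token ID periodically or on a push signal; verifying once at join time leaves a removed participant connected until they drop.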
We pay close attention to this stack when we help clients design real-time communication systems, because the failure modes are rarely cryptographic: they are integration, configuration, and threat-model failures wearing cryptographic clothing. The platforms that age well are the ones whose defaults match the threat model of their actual users, not the ones with the longest feature list.

Frequently Asked Questions

Is WebRTC encrypted by default?

Yes. WebRTC mandates SRTP for media and DTLS for the key exchange, so media is encrypted in transit between each client and the SFU. This is hop-by-hop encryption, not end-to-end. The SFU sees decrypted media in order to route it, which is why true E2EE requires an additional layer such as SFrame or Insertable Streams.

Is open-source video conferencing more secure than proprietary?

Open source enables auditability, but does not guarantee security. The relevant question is whether someone qualified has actually reviewed the code path that handles keys and media, and whether the deployment is patched. A proprietary platform with credible third-party audits can be more defensible than an unmaintained open-source deployment.

Do AI features like transcription work with end-to-end encryption?

Generally no. Server-side transcription, summarisation, and behavioural analytics require the platform to see plaintext media, which is incompatible with true E2EE by construction. Vendors that offer both usually require choosing one per meeting, or are using “end-to-end” loosely.

Does encryption slow down video calls?

For the symmetric ciphers used in SRTP (AES in counter mode or AES-GCM), the CPU cost is negligible on any modern device with hardware AES support. Measurable latency comes from DTLS handshakes during connection setup and, for true E2EE, the per-frame cryptographic operations on Insertable Streams; both are worth testing on real user hardware rather than developer machines.