[This post is part of the set we’re doing on the tech behind the No Boundaries conference, which happened in Feb 2014. See all the NB2014 posts.]
Videoconferencing is hard. There are many off-the-shelf hardware solutions sold by over-enthusiastic salespeople, but they tend to be hideously expensive. There are also many software-based videoconferencing setups, but none of them seem very geared up for linking two auditoria as we had to do, and most of them send the video through a third party rather than being a direct link. Even if you find a good solution, you have to set things up just right to get a decent result: usually the best systems are installed in specially configured boardrooms.
The problems
In theory it shouldn’t be that hard: there are only three real problems that need to be solved:
- Getting HD video compressed to a low enough bitrate that it can be sent over the internet
- Making sure the sending and receiving ends are set to encode/decode with very low latency
- Echo cancellation on the audio
Problem (1) is relatively easy: there’s lots of software which will compress HD video down to a couple of Mbps (“megabits per second”).
Problem (2) is a lot harder, and is where most solutions fall down. First off, normal video encoding standards such as H.264 are most efficient when the encoder can look ahead to see what frames are coming up: if it has to compress frames as they come in, you lose either efficiency or quality.
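The Cubes handle all of this in hardware, but to make the trade-off concrete, here's roughly how you'd ask a software encoder (ffmpeg with x264, which isn't what we used) to cap the bitrate and give up lookahead in exchange for latency. The input file, address and port are invented for the sake of the example.

```python
import subprocess

# Illustrative only: the conference used hardware Teradek Cubes, not ffmpeg.
# The same two knobs exist in software: cap the bitrate (problem 1) and
# disable lookahead/B-frames so each frame is sent as soon as it's encoded
# (the first half of problem 2).
subprocess.run([
    "ffmpeg",
    "-re", "-i", "camera.mp4",       # invented input, read at native frame rate
    "-c:v", "libx264",
    "-b:v", "2M", "-maxrate", "2M",  # hold HD video to roughly 2 Mbps
    "-bufsize", "1M",
    "-tune", "zerolatency",          # no lookahead, no B-frames: lower latency,
                                     # at some cost in efficiency/quality
    "-f", "mpegts",                  # a streamable container (more on MPEG-TS below)
    "udp://192.0.2.10:5000",         # invented receiver address
], check=True)
```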
The second difficulty with (2) is that the internet is pretty much made of jittery lag. Let me explain: when you send data, it’s split into small chunks (“packets”) which are sent individually. If it helps, think of a packet as being a frame of video. Because the internet is fundamentally unreliable, each packet will take a variable amount of time to arrive at its destination: the average time taken is called the lag, the variation in time taken is called jitter. Jitter is actually the killer of streaming media: it means you have to buffer packets (i.e. store frames in a holding pen before showing them on screen), because if you don’t buffer then you end up with very jerky video as frames come in at random times.
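If it helps to see the idea in code, here's a toy sketch of a jitter buffer: frames are parked when they arrive and only released a fixed time after they were captured, so late arrivals (within the window) still come out in order. This is nothing like the Cubes' real internals, and it glosses over clock synchronisation between sender and receiver.

```python
import heapq

class JitterBuffer:
    """Toy jitter buffer: hold frames until `delay_s` after their capture
    time, so frames that arrive late (but within the window) still play
    out smoothly and in order. Assumes both ends share a clock."""

    def __init__(self, delay_s=0.3):
        self.delay_s = delay_s
        self.frames = []                      # min-heap keyed by capture time

    def receive(self, capture_time, frame):
        # Frames can arrive late, early, or out of order; just park them.
        heapq.heappush(self.frames, (capture_time, frame))

    def frame_due(self, now):
        # Release the earliest frame once its playout time has arrived.
        if self.frames and now >= self.frames[0][0] + self.delay_s:
            return heapq.heappop(self.frames)[1]
        return None                           # nothing due yet: repeat the last frame
```

The bigger `delay_s` is, the more jitter it can absorb, but the further behind live the picture falls, which is exactly the tension described next.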
The problem with buffering is that it increases the delay between sending a frame over a link and that frame being shown at the other end. When videoconferencing, the perceived delay – i.e. the time it takes someone at the other end to respond – is doubled. Once the delay exceeds about 200ms, natural conversation becomes awkward, as anyone who’s used Skype or Hangouts will know: it’s very easy to start speaking over each other. Once the delay reaches around 0.5 – 1s it becomes very noticeable to audiences.
Then you have problem (3): echo cancellation. Consider a link between two venues, A and B. When someone at A talks, that gets relayed over the PA system at B. It’s then likely that the microphones at B will pick up that sound and send it back to A. The result is that someone at A hears what they’ve just said a second or two after they’ve said it, which is almost impossible to cope with. Plus it sounds dreadful. Software such as Skype gets around this by muting your speakers whenever it detects you’re talking, so you can’t hear what’s being sent back to you. Unfortunately that also means that if you accidentally talk over the person you’re calling, they cut out and you miss what they’re saying.
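The Skype-style trick is simple enough to sketch in a few lines. This is just an illustration of the idea, not anyone's actual implementation, and the level numbers are invented:

```python
def duck_far_end(near_mic_level, far_end_audio, threshold=0.05):
    """Crude echo avoidance of the kind described above: if the local mic
    is hot (someone here is talking), silence the audio coming back from
    the far end so the room never hears its own speech played back.
    The threshold is an invented, illustrative number."""
    if near_mic_level > threshold:
        return [0.0] * len(far_end_audio)   # mute the far end entirely
    return far_end_audio                    # otherwise pass it through

# The downside, as described above: if both ends talk at once,
# each mutes the other and words get lost.
```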
Our solutions
So at the No Boundaries conference, how did we solve these three problems to get a live link between Bristol and York? The simple answers are: with very careful setup, some special equipment, and a secret weapon.
Problem (1) was solved by using Teradek Cubes: each end had both a transmitter (model 205) and a receiver (model 405). These are expensive beasts, made even more expensive by the MPEG-TS licences we bought for them. There may be cheaper options for encoding and decoding, but the Cubes are reliable, do the job, and do it to broadcast standards (plus we already had one lying around). We went for the MPEG-TS licence because of the “TS” bit: it stands for Transport Stream, which means it’s MPEG compression but with extra goodies such as error correction to help it deliver reliable video over an unreliable medium such as the internet. MPEG-TS is used for most broadcasting (satellite and terrestrial) where packet loss is to be expected.
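To give a feel for why a transport stream copes well with an unreliable network: the video is chopped into fixed 188-byte packets, each starting with a sync byte and carrying a small continuity counter, so a receiver can spot exactly where packets went missing and pick up cleanly after a gap instead of garbling everything that follows. Here's a toy sketch of that loss detection (nothing like the Cubes' actual firmware, and it ignores some of the format's finer points):

```python
def find_gaps(ts_bytes):
    """Walk a chunk of MPEG-TS data and report continuity-counter jumps,
    i.e. places where packets were lost in transit. A toy sketch that
    skips adaptation-field subtleties and null packets."""
    last_cc = {}                              # last counter seen, per stream (PID)
    for offset in range(0, len(ts_bytes) - 187, 188):
        packet = ts_bytes[offset:offset + 188]
        if packet[0] != 0x47:                 # every TS packet starts with 0x47
            continue                          # out of sync: skip this slot
        pid = ((packet[1] & 0x1F) << 8) | packet[2]
        cc = packet[3] & 0x0F                 # 4-bit continuity counter
        if pid in last_cc and cc != (last_cc[pid] + 1) % 16:
            print(f"Lost packet(s) on PID {pid} at byte offset {offset}")
        last_cc[pid] = cc
```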
Problem (3) was solved by three means: firstly, having good audio technicians at each end who could mute the other end when their own end was speaking (a bit like Skype does, but with human intelligence thrown in). Secondly, we had every speaker wear a headset mic right next to their mouth, rather than using a normal mic or a tie-clip. This meant microphones didn’t pick up much of the surrounding noise, and so feedback between the venues was greatly reduced. And lastly, of course, we didn’t do true videoconferencing: what we did was constantly switch between the two venues. For instance, when someone at Watershed was speaking, York could see them, but Watershed couldn’t see York (instead Watershed’s screen would show either the speaker’s slides or a holding slide). It was more like serial unicasting than videoconferencing, which actually gave a very pleasant effect.
All of this leaves us with the big problem (2). Even though there were no conversations happening over the link, the handovers had to be quick and smooth, so minimising video delay was important. We also wanted a constant, totally uninterrupted supply of large bandwidth so we could get the best video quality possible. This is where our secret weapon was deployed: Janet.
Janet is a very fast part of the internet that joins pretty much all UK universities, colleges, and often schools. Although not as reliable as a dedicated point-to-point link between two venues, it generally has enough capacity to carry data between two points much more quickly and reliably than the open internet, and to keep jitter to a minimum.
Luckily, Watershed’s main internet connection, BMEX, peers with Janet. That means that although we’re not part of Janet itself, data between Watershed and Janet goes straight onto the Janet network without ever hitting the open internet. Equally luckily, York Council could push its part of the Janet network down its fibre into the Guildhall in York. That meant that the video streams being sent between Watershed in Bristol and the Guildhall in York went purely over Janet, with all the speed and reliability that brings.
Knowing we were using Janet, we could tune the Cube encoders to use a large amount of bandwidth for top quality video (the streams were 8Mbps each way, and looked really good even when put onto 30-foot screens), and also tune the decoders to have a very small buffer. We ended up using a buffer of just 300ms, and even that was very conservative: we were getting good results with 50ms during testing. There’s no way we could have done that over the open internet: a huge thank-you is due to Janet for letting us use their network for this event.
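For a sense of scale, here's the back-of-envelope arithmetic on what a 300ms buffer holds at 8Mbps (the 25fps frame rate is an assumption, purely for illustration):

```python
bitrate_bps = 8_000_000   # 8 Mbps each way, as used on the night
buffer_s = 0.3            # the 300 ms decoder buffer we settled on
fps = 25                  # assumed frame rate, purely for illustration

buffer_bytes = bitrate_bps * buffer_s / 8
frames_held = buffer_s * fps
print(f"~{buffer_bytes / 1000:.0f} kB buffered, roughly {frames_held:.0f} frames")
# -> ~300 kB buffered, roughly 8 frames
```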
It wasn’t all plain sailing: a couple of times we rebooted the Cubes, not because they’d crashed but because the streams were getting a bit jittery and we weren’t sure why; rebooting was just the first and quickest thing to try. The two theories were either that the Cubes had overheated, or that there was an issue with the networks. I strongly suspect the latter: network routers will drop traffic if they’ve got too much to deal with, and the first kind of traffic they’ll drop is things like streaming video. I suspect that by rebooting the Cubes we simply let some random Janet router between Bristol and York catch up with its data, so it could start passing our stream again. I suspect we’d have had many more issues of that ilk out on the open internet.
Overall though it worked well, and the handovers between the two venues generally went smoothly. If you watch the recordings of the conference over at nb2014.org you may see some interesting video feedback during the first set of handovers (where one end was echoing video back to the other, a bit like pointing a camera at its own monitor), but the technicians quickly got the hang of doing it smoothly.
Obviously the Cubes weren’t the only kit we used: each end had an HD vision mixer to mix the cameras and various media/slide sources, both to send over the link and to live stream. A huge amount of tech went into making sure the production teams and the many technicians at each end could communicate effectively to produce a professional-looking event. But the logistical end of things is probably best left for another blog post.