Network Automation ramblings by Kristian Larsson
07 Jul 2020

Retro SIP video conferencing to Webex et al using a Cisco C40 TelePresence codec

I use a ten year old Cisco C40 Telepresence codec for my daily video conferencing and I love it. It lets me join Webex conferences and share the screen of my Linux computer with no hassle!

An irk for crappy UX

I grew up with IRC. Sleek, fast and efficient. Today we have things like Slack. Pretty much the polar opposite. It is slow, sluggish and I don't particularly enjoy the user experience at all.

Similarly, I feel that video conferencing of today is a rather horrible experience. Or let me be more specific; I love video conferencing, but video conferencing software on computers cause me all sorts of headache.

I've been working remote for over 5 years in a distributed team. Asynchronous tools and workflows like Gitlab issues and email are the backbone of internal team communication. Video conferencing is the perfect compliment for quickly hashing things out or just seeing the face of your colleagues. Voice only (telephony) is nowhere near it. I love video conferencing!

The bulk of my work, like emailing and writing code, is carried out on Linux. I often want to share my screen to coworkers in order to discuss some aspect of code or similar.

Video conferencing software for Linux or any non-Windows/OS X platform is mostly an abysmal experience. Most of it is confined to a browser, as there are no native clients (though many "native" programs today are really just a web browser anyway). A lot of the times it won't work, even when you can join the conference itself, something else, like screen sharing will fail.

Even on OS X or Windows, video conference software has a tendency to deal you the "I'm just going to install this upgrade" just as you are about to join a conference. This might not happen for the software you use in your team, as you use it every day, but then you have a slew of other video conferencing apps installed for the more infrequent meetings. How much in advance of your actual meeting do you need to fire up a video conferencing app to ensure you are ready on time? This is a broader issue with modern software - a form of arrogance on the developers part for their users.

Needing multiple installed video conferencing applications highlights another problem. Just like modern chat applications each live in their own universe and won't interoperate with anything else, many video conferencing or collaboration solutions are walled gardens.

I want the lions share of communications to be based on federated technology, like email or SIP, both relying on DNS as the distributed directory. It's beautiful! I'm afraid email was the last widely spread truly federated application.

My solution; a decade old video conference hardware

Anyway, so back to the Cisco C40. It is my solution to everyday video conferencing and I couldn't be happier!

In my position at Deutsche Telekom, Cisco Jabber Video is used, which happens to present a SIP interface for video conferencing. My other work is with Cisco, which naturally is based on Webex (Teams) these days. It too exposes a SIP interface.

The Cisco Telepresence line of products are originally an acquisition of Tandberg. The Cisco C40 I have was originally called the Tandberg C40 and it is of the generation where Tandberg just got acquired by Cisco, released in late 2010. I actually have three C40 of which two have a Cisco logo on the front and the third reads TANDBERG. Regardless of logo on the front, the C40 can speak SIP and so I can use it to dial into Webex meetings and Cisco Jabber Video meetings. Never any dialogues prompting me to upgrade software just before a meeting. It's rock solid and I bought it off of eBay for about £200!

Linux screen sharing

The C40 sports a HDMI input so I can connect my Linux computer and share my screen. No more hassle with things not working. Did I say it is rock solid?

I use the xmonad windows manager which is a tiling window manager and one that works exceptionally well with multiple screens through its clever virtual desktop concept. Virtual desktops aren't tied to a screen, rather you have a global list of desktops and you pick which screen should map to which virtual desktop through keyboard shortcut. Since it is a tiling window manager, window placement is an automated task. Switching a virtual desktop between two screens of different resolution, such as my normal 4K screen and the 1080p of the C40, doesn't require any manual resizing or moving of windows. I do typically need to increase font size a bit though, which I have conveniently mapped for my normal applications;

  • Spacemacs: SPC z f to enter zoom mode, followed by + / - to increase or decrease respectively
  • terminal (urxvt): ^+ / ^-
  • Firefox: ^+ / ^-

Full HD 1080p resolution

It might be old, but the C40 supports 1080p both from the camera and for screen sharing which is pretty much on par with what more modern solutions support.

The C40 is actually just a codec unit. A 1U piece of specialized hardware (I looked at some debug logs and it appears to contain FPGAs, so it's not just a PC (also see https://www.youtube.com/watch?v=34FLhwkrwoQ if you are interested in the success of Tandberg overall and some details on HW)). The camera is an external unit connected via a HDMI cable and a serial cable (which I originally assumed was for controlling the motor, but that doesn't appear to be the case). From the C40 perspective, the camera HDMI port is really just another HDMI port and you could potentially connect something else, like another computer to it. Overall the C40 has three inputs;

  • HDMI for camera (but again, you can connect something else)
  • HDMI for computer or second camera
  • DVI for computer

I use the second HDMI and the DVI port with a DVI-to-HDMI adapter to connect to my Linux workstation and have one loose HDMI cable on my desktop which I can connect to another laptop when needed.

Camera picture quality

I use a Precision 60 camera. It is built for the SX80 codec (the generation after the C40) and isn't officially supported together with the C40 but I found out through a forum post that it actually works so I bought a few and it is working great. Picture quality is superb with a 1080p picture at 60fps and coupled with 20x zoom (10x optical zoom and 2x digital) - great if you have a big offiec. Zoomed out it provides a good wide view of my room. The motors are completely quiet - I simply cannot hear when it is moving, although the zoom and focus camera makes a slight noise. Nothing that annoys you - I have to sort of actively listen for it. It is a very good camera.

Previously I used a PrecisionHD 4x (TTC8-04) camera. It's actually just 720p but still delivers better quality video than most of my colleagues who are on the latest MacBook Pro or similar new laptop. 10 year old digital camera sensors aren't great but it has pretty decent optics - big pieces of glass just don't get out of date. Unlike my main driver, the Precision 60, the motors do make some noise but still qualify as fairly quiet. I have the 12x zoom camera too, which has 1080p output and it is noticeably (or not, heh) quieter than the PrecisionHD 4x but still makes some noise, so somewhere in between the completely quiet Precision 60 and the quite quiet PrecisionHD 4x. Unfortunately, there is some form of analog noise from my 12x camera which is why I've never really used it.

I find that the camera sensor, optics and bit rate used, usually has a larger effect on qualitative experience than the sheer resolution. This is in comparison to most modern laptops, which for some weird reason come with really poor cameras. Buy a $4000 laptop and get a $4 camera. Why does the iPhone or an iPad have so much better cameras than a MacBook Pro? I haven't used either my iPhone or an iPad extensively for video conferencing professionally so I can't speak to how it works in reality. I do however use my iPhone for Facetime and Whatsapp with family, which usually results in rather crappy video, I suspect it is because both end points are behind NAT and whatever relay is used is severely rate limiting the video stream.

Modern laptops usually have crappy cameras resulting in an overall crappy picture.

Mobile phones and tablets have decent cameras but in my experience with video calls, often end up with crappy picture due to low bitrate (large blocks visible from low bit rate encoding).

In practice, my C40 with its Precision 60 achieves delivers a very high quality video experience. Even with my older PrecisionHD 4x camera, the good lenses more than compensates for its aged camera sensor providing an overall better experience than modern web cameras.

Also, I just love that all these cameras are motorized, I just never get tired of moving about my room and pan my camera to where I'm at.

Audio

Audio is great. You have to connect external speakers, so that is largely up to your own choice of speakers. My kit came equipped with a desktop microphone with a micro-XLR output that is jacked up to the XLR input of the codec. Since there's an XLR input, the choices are endless.

Echo cancellation works well and I've understood (from my colleagues) that the mic doesn't pick up much noise. They hear me well and audio quality is overall good.

Noise and echoes are probably the two most common challenges for audio conferencing. I used to work on a fairly large IP telephony system back in the day, where we used hardware DSPs for echo cancellation. I haven't kept up to date on the advances in this area but would have assumed that like for everything else, software have caught up and perhaps surpassed hardware. Every day use of video conferencing applications point in the opposite direction though. I find that there are often echoes caused by participants using laptops. I'm not quite sure why.

One effect that I've sometimes noticed (not super common but not super rare either) is that when a participant start speaking I will miss the first part of what they are saying (not related to the C40, it happens on software clients too). It is as if their local microphone was muted or the signal level was very low and it is ramped up when they start speaking. The ramp-up takes a moment during which a word or two is lost. I'm not sure what component introduces this, like if it is the video conferencing software (I've noticed it across multiple different solutions) or in the client endpoint hardware, like many microphone arrays have local echo cancellation - but I've noticed it on MacBook while not on all MacBooks. Is it a setting? Automatic microphone level adjustment in combination with something else? Nonetheless, my C40 suffers from no such problems when I'm speaking.

Another problem I've noticed with software clients are that participants tend to speak over one another and I don't mean by a small amount, like with a high latency link two participants start speaking, notice they are colliding and one or both will stop (similar to Ethernet). No, I mean, two participants will just continue speaking over each other for complete sentences. I think the problem is that the software client, when it detects that you are speaking, it will lower the audio level from remote participants causing you simply to not notice anyone else speaking. The result for anyone but the two speaking parties is a cacophony. Horrible. This doesn't happen with the C40. It has good echo cancellation circuits so I suspect it doesn't need to employ tricks like lowering the audio volume of remote participants and thus this scenario doesn't really happen.

I don't really know what would cause this ramp-up problem and speaking-over-each-other problem - it is just based on observation and what I can only assume are solutions to mitigate noise and echo problems. Feel free to reach out to me if you have any insights!

IPv6

It supports IPv6. What else needs to be said?

Standalone vs SIP infrastructure

It is possible to operate the C40 standalone and directly call to IP addresses. Incoming calls can be placed directly to the public IP address of your C40 (or through forwarded ports if you are behind NAT).

The more elegant approach is of course to use a SIP registrar that answers for like your domain, so you can get username@example.com as your SIP address, just like your email address! I have not yet gotten this far though - I just dial out to various meetings, even for one-on-one calls (I use my personal webex room).

Interoperating with other systems, i.e. what speaks SIP?

Unfortunately, not many other video conferencing solutions appear to speak SIP. Many are walled gardens.

Cisco Webex (Teams)

AFAIK, all Webex meetings support dialing in with SIP. You can reach personal rooms using sip:NAME@ORG.webex.com, for example, my personal Cisco room is sip:krlarsso@cisco.webex.com. Meetings have a unique identifier consisting of 9 digits and you can dial in to them by dialing sip:MEETING_ID@webex.com.

There is a native Webex client for OS X and Windows. Without having measured, it feels like it provides for a lower latency experience than using the Webex Teams client to connect to the same meeting. Perhaps there is an additional proxy involved, like all media goes to some central webex teams service. Perhaps that service is in the US so my video streams are bounding across the Atlantic (I'm in Sweden). Connecting with my C40 I get considerably lower lag than when using the Webex teams client. I have not done a direct comparison with the native Webex clients but have the feeling that the C40 is on par or slightly better.

Cisco Jabber Video and other Cisco collaboration solutions

Cisco offers on-premise solutions for video conferencing that are popular with many enterprises. AFAIK they all support SIP. I have friends working for companies that have such solutions and when connected using the Cisco C40 to a client on his computer, the latency is exceptionally low - a very nice experience indeed. The point of video conferencing is to take away the parts that make it feel unnatural, it should be a close to a physical meeting as possible. Latency is one of the worst enemies here. With enough lag, we get people speaking over each other.

Zoom

SIP dial-in is an extra feature for Zoom, where the organizing party need to ensure it is enabled. Zoom comes as cloud hosted or can be installed on-premise. I am familiar with the details of each option and how it influences the ability of SIP dial-in. It sure doesn't provide the always-on SIP functionality that Cisco's solutions have, which is a pity. I don't understand why SIP, not being tied to any particular hardware (people talk about SIP trunks like it was a ISDN-PRI, but common), wouldn't just be enabled per default.

Microsoft Teams

There is some form of Direct Routing option for Microsoft Teams that allow a SIP trunk to be setup so that it's possible to dial-in with SIP. I have never been invited to a meeting that have this supported.

Jitsi

Jitsi is a web video conference solution based on WebRTC. It has a component called jigasi which acts as a SIP gateway. Unfortunately, it is audio only. This was a big disappointment to me as I spent a few hours setting up Jitsi thinking it would be my one stop solution for SIP video conferencing while also being able to invite random people on the Internet to use WebRTC (after all, I am aware not everyone has a SIP video conferencing system like me).

There is Jibri, which is another component of Jitsi that can perform various video functions, like streaming to YouTube and also SIP video conferencing supposedly. Jibri is implemented by running a windowless Chrome instance. Eek. It only supports a single stream, so bridging into a Jitsi meeting would be limited to the number of Jibri instances. That would probably work for me but I stopped at virtual FB chrome - yuck.

I should probably get over my feelings on the implementation and just set it up as it would probably be rather useful since I can invite other to my Jitsi instance.

Lifesize

I have never tried but Lifesize devices should be SIP standards compliant and should work well both in conferencing and for direct calls.

Polycom

I have never tried but Polycom devices should be SIP standards compliant and should work well both in conferencing and for direct calls.

Support

The Cisco C40 and all gear of its generation is pretty much out of support. However, it does what I need so I am not very troubled of this.

It is obviously a risk running unpatched software. You can mitigate this by placing it behind a firewall and using a SIP registrar etc in between.

Software

The C40 runs the TC series software. It appears to have been superceded by EC software. Newer codecs like the SX80 can run both TC and CE.

For my use cases, TC software seems just fine. I don't know what I would gain by using CE software.

API

The C40, or rather the TC software it runs, supports multiple APIs. There is a SSH CLI to configure things and a HTTP API that you can feed XML payloads to get it to do things.

I wrote a small shell script so I can dial directly from the command line of my computers;

#!/bin/sh
# Use the Cisco C40 in my office to dial a SIP address!
#

if [ -z "$1" ]; then
  echo "ERROR: You must provide a SIP number to dial!"
  exit 1
fi

curl --request POST --data "<Command><Dial command=\"True\"><Number>$1</Number><Protocol>SIP</Protocol></Dial></Command>"  -H 'Content-Type: text/xml' -u "admin:$(pass show web/${C40_ADDRESS} | head -n1)"  http://${C40_ADDRESS}/putxml

I have the local IP address of my C40 hard-coded in the C40_ADDRESS environment variable. I can then do dial 123456789@webex.com.

What I bought

I mentioned I have three C40 units. I bought the first as a complete kit including:

  • C40 codec
  • 12x camera
  • microphone
  • remote control
  • 2x HDMI cables
  • serial cable between codec and camera
  • XLR to mini-XLR for microphone
  • power cable

Unfortunately there was some problem with the codec. After a software upgrade from 4.x to the latest TC software (7.3.21) it failed to start up properly. Fortunately I was able to get a new one from the seller. Meanwhile I was so eager to get going that I ordered another unit, just the codec, so I could start playing around. Thus, I have three codecs, of which two are working and one is broken. I suppose I'll throw away the broken one or reuse the case for some other project.

Out of curiosity, I also bought a C20 and another camera. I wanted to be able to setup a call locally and for that I obviously need two complete end points.

As it turned out, the 12x camera has some analog noise, so I'm using the second 4x zoom camera that I bought as my primary camera right now. The lense on that 12x is to die for though :/

System components

For a system like mine, you need:

  • codec unit (C20, C40, C60, C90)
  • camera
    • TTC8-02 - 1080p60fps, the 12x camera I have
    • TTC8-04 - 720p, 4x zoom the 4x camera I have
    • TTC8-05 - newer 4x camera doing 1080p
    • TTC8-06 - 2.5x zoom - seems crappy
    • TTC8-07 - 1080p60fps, Precision60, 20x zoom, this is my main driver
  • microphone
    • AudioTecnica, JAVS etc sell desk microphones with XLR connections
    • you can be creative and get anything you want
    • how about a Condor MT600 beam forming microphone array?
      • or a Sennheiser TeamConnect Ceiling 2 beam forming array so you can be anywhere in your 100 sqm living room? ;)
  • remote control (there's also a touch screen display but it's more costly)
  • a TV / monitor

You need to be aware of compatibility. The C40/C60/C90 codecs work with all the cameras listed above as they all feature that serial port to connect to the camera. Newer cameras like the Precision60 (TTC8-07) doesn't have a serial port AFAIK, instead it uses Ethernet, as the C40/C60/C90 have a secondary Ethernet, this still works (although not officially supported). The C20 however only has a single Ethernet so I'm not sure about its compatibility.

The camera cable for the SX20 looks like a HDMI fused together with something else - I'm not sure what is is compatible with as I don't have that generation of gear.

Automatic camera follows speaker

There is a component called SpeakerTrack that is able to follow the currently speaking presenter. It's a marvelous piece of technology. It uses two cameras to allow focusing on one speaker while moving the other camera to be ready to cut to the next "scene", for example an overview of all participants in a meeting. I'm not entirely sure how it works but it has a large microphone array which I assume is for localizing the speaker in a room and then probably do fine adjustments based on video analysis. I have a hard time imagining that finding the speaker could be based purely on a microphone array since it also properly finds your head regardless if you are sitting down or standing up. That has to come out of image analysis. I've seen there are debug settings related to face detection, so this seems likely. The same technology is used in the MX700/MX800 (two large monitors with camera built-in) Telepresence systems.

SpeakerTrack 60, as its called, is a large beast. Probably not something you want in a home office, but nonetheless a very cool piece of technology. I believe all the clever analysis is done in the SpeakerTrack unit itself and so it is compatible with the C40, C60 and C90. It's connected mostly like any other camera via HDMI inputs though also uses the Ethernet jack but I suspect that is mostly for management.

While SpeakerTrack is a product, there is a feature called PresenterTrack that allows the same functionality of following the currently speaking presenter but with a single camera. It does however require the Precision 60 camera and a SX80 codec at a minimum. I have no idea of the underlying implementation. The SX80 is considerably more expensive than the C40 I have, so I probably won't be upgrading for a while. They are cheaper in the US so perhaps during my next trip there I'll get the chance but in the current situation, no one knows when that will be.

Similar systems

C20

The C20 codec is a smaller codec, both physically and in terms of features and capabilities. It supports up to 720p content and has fewer inputs and outputs.

It is intended for use in smaller locations, like huddle rooms.

C60 / C90

Same capabilities as C40 in terms of resolutions (1080p) and features, just with more inputs and outputs.

All three (C40, C60 and C90) are intended for installation in (small to large) conference rooms by systems integrators and have a wide range of features for this audience. You can customize what output is displayed on different screens, camera locations, integration with automatic blinds and similar.

SX20 / SX80

This is the generation after the Cxx devices with SX20 supposedly roughly mapping to the C20 (think huddle room) whereas the SX80 is somewhere in between the C60 and C90 in number of inputs/outputs. Being of a later generation, they can run more modern software and are still supported.

The only feature I've seen that seems way cooler is the PresenterTrack feature available in the SX80.

4K

I love high resolution screens. I use a 32" 4K monitor at home so I can fit a lot of text on it. My laptops have high DPI screens. While camera streams of people are usually quite fine at 1080p or even 720p, sharing your desktop at 1080p is painful. Unfortunately the C40 is a 1080p system since all signal processing is implemented in hardware and that hardware is designed for 1080p. The same is true for the later generation SX80. The newest video conference equipment from Cisco (generation after SX80), called Webex Rooms, offer the same 1080p streams for camera feeds but support 2160p, i.e. 4K, on ports from computers. This makes a lot of sense, since you typically want the higher resolution for screen sharing. The frame rate is much lower, like 5fps for the 2160p, presumably to keep down the bit rate. I would love to have this on my C40 but again, the FPGA is simply designed for 1080p.

The lack of 4K is the biggest disadvantage of my setup but as I mentioned above, with simple ways of enlarging content, like font size in terminals, it is manageable. For normal presentations, which use large fonts and visuals, it is not an issue.

Buying one

The prices in the US is considerably lower than in Europe. You can get a C40 for $50 or less. Some of the cameras are available at $20-30. Good mics are also fairly cheap. Assembling a complete system from individual parts is likely the cheapest way as people will sell it as untested gear. People tend to overcharge for the cables though and it might be a little tricky getting hold of the special serial cable (I haven't actually checked the pin-out - perhaps it isn't that special?).

There is somewhat of a risk in buying these units though. They are old and many sellers on eBay will sell them without any warranty. You might see failures. Getting a complete system from one seller is a safer way to a working system but could cost more. If you are willing to experiment a bit, buying a few cheap pieces and trying to assemble a system out of it might be a relatively safe and cheap way.

You can also try looking at the Lifesize equipment which should also work fine as standalone SIP endpoints and comes in quite cheap.

Feel free to reach out to me with questions or your experience in this area!

Tags: SIP

Other posts
Twitter