Opus is a lossy audio compression format developed by the Internet Engineering Task Force (IETF). It is designed for interactive real-time applications over the Internet, including music as well as speech, yet it is also very competitive as a storage and playback format, being a class leader at around 64 kbps and also at 96 kbps. Opus is an open format standardised through Request for Comments (RFC) 6716, and a high quality reference implementation is provided under the 3-clause BSD license, which compiles and runs on the vast majority of general purpose and embedded (fixed point) processors. Many software patents which cover Opus are licensed under royalty-free terms. Opus is also a Mandatory To Implement (MTI) codec for the upcoming WebRTC (Web Real Time Communication) specification of the World Wide Web Consortium (W3C).
Opus is able to seamlessly adapt its mode of operation without glitches or sound interruption (an illustrative demonstration of bitrate scalability is on the Opus Examples page), which can be particularly useful for mixed-content audio or varying network conditions, making the unified Opus codec superior to a suite of different codecs that might otherwise cover the same range of bitrate and quality settings and would require out-of-band signalling to instigate codec switching. The switching includes the choice of mono, stereo and other channel mappings, the use of the speech-oriented SILK layer, the general-purpose CELT layer or the hybrid of both, and the use of different audio bandwidths (4, 6, 8, 12, or 20 kHz) as well as the quality adjustments within the same operating mode that are available in most VBR-capable codecs.
Of importance mainly to interactive uses, but potentially useful in time-delayed audio streaming also, Opus includes packet loss concealment (PLC) in all modes. In the speech-oriented modes where the SILK layer is active, it also supports Forward Error Correction (FEC): the expected rate of packet loss can be indicated to the encoder by the user or by application software, and critical frames (e.g. consonant sounds) can be retransmitted at low bitrate to preserve intelligibility.
For music and general audio, the CELT layer of Opus builds on knowledge gained during Xiph.Org's Vorbis development and ensures as a primary goal that the total energy in each spectral band is preserved while requiring only a modest bitrate overhead to achieve this, thereby eliminating a lot of bitrate-starvation artifacts such as 'birdies' that are common in low-bitrate MP3, especially during transients, applause and cymbal sounds. This technique likewise increases coding efficiency at bitrates targeting transparent music reproduction. Short blocks (2.5 ms) are also possible for efficient transient handling. Short blocks can also be used exclusively if very low algorithmic delay (5.0 ms) is required to enable very low-latency interactive audio (e.g. live networked music performances such as remote jam sessions), though greater bitrate is then required to maintain the same quality (illustrated in Monty's CELT demo page under Constant PEAQ value, varying latency). CELT uses a number of additional techniques and provides additional advanced tools to enable encoder tuning.
Opus natively supports gapless playback (though poor player design might itself induce interruptions during playback). Support for playback gain is also required, making some form of ReplayGain or similar volume control possible in any compliant player.
For mono speech, Opus ranges from intelligible narrowband speech reproduction starting at 6 kbps to medium-band, wideband and superwideband speech, reaching full-band speech by around 14 kbps in encoder version 1.2 (was 21 kbps in v1.1, 29 kbps in v1.0). Above about 32 kbps, the SILK layer is no longer used at all, as CELT alone gives superior quality.
Multi-format stereo music listening tests have demonstrated the superiority of Opus at 64 kbps and 96 kbps compared to the best AAC-LC, HE-AAC and Ogg Vorbis encoders, and at 96 kbps also to 128 kbps MP3 encoded using LAME -V 5.
In encoder version 1.1 automatic detection of speech/music and bandwidth detection were introduced to improve mode decisions and VBR is less constrained, all with the aim of maximizing the quality/bitrate tradeoff, and these improvements are further enhanced in version 1.2 and 1.3. These tables are likely to require updates as the encoder is improved, especially in low-bitrate regions.
This table assumes a monophonic source sampled at CD quality or above (typically 48 kHz sampling rate) but mentions stereo compatibility for 40 kbps and above. The default 20 ms frame size (22.5 ms latency) is assumed. Note that selecting VOIP mode will deliberately modify the sound with a high-pass filter and emphasis of formants and harmonics to improve the intelligibility of speech, especially in noisy environments, much as telephones do. Auto mode will not modify the sound prior to encoding, so it is usually better for high quality speech recordings or mixed speech and music.
One major limitation of Opus at low bitrate is that SILK is inherently VBR: it accepts no constraints in CVBR, and if forced to do CBR the quality degrades from bit-shaving. As a result, even though constrained VBR is designed such that a fixed-rate data link requires at most one frame of buffer to handle the variation in bit rate -- great news for communication links -- any use of SILK, even in hybrid mode, has the potential of breaking this intention. This makes Opus suboptimal for low-rate radio links: radio links require a predictable buffer amount, which is only possible with CBR when SILK is used, but use of CBR in turn hurts SILK. There is a noticeable quality difference at the NB/WB switch at 9 kbps VBR / 10 kbps CBR.
This table assumes a stereophonic source sampled at CD quality or above (typ 48 kHz sampling rate). Opus will automatically use mono at very low bitrates, though a certain amount of stereo encoding can still be used (content dependent) even when mono is specified as the typical stereo mode in the table below.
For interactive use on the Internet or other packet-based networks, total bandwidth used will be subject to packet overhead. The more packet headers that are transmitted every second, the greater will be the overhead that is required. For this reason, Opus, while defaulting to 20 ms frames, supports 60 ms frames to reduce overhead when transporting low-bitrate SILK frames at the expense of greater latency, which may still be acceptable for speech, and also supports 10 ms SILK frames to reduce latency somewhat at the expense of packet overhead.
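The effect of frame size on packet overhead can be estimated with a little arithmetic. The sketch below is illustrative only: it assumes each packet carries exactly one Opus frame and a 40-byte IPv4 + UDP + RTP header (20 + 8 + 12 bytes), which is an assumption not stated in this article; other transports will give different figures. The function names are hypothetical.

```python
# Rough estimate of header overhead for different Opus frame sizes.
# Assumption: one frame per packet, with IPv4 (20) + UDP (8) + RTP (12)
# bytes of headers. Other transports or header extensions change this.
HEADER_BYTES = 20 + 8 + 12


def packets_per_second(frame_ms):
    """Number of packets sent each second for a given frame duration."""
    return 1000.0 / frame_ms


def overhead_bps(frame_ms, header_bytes=HEADER_BYTES):
    """Bits per second spent on packet headers alone."""
    return packets_per_second(frame_ms) * header_bytes * 8


def total_bps(payload_bps, frame_ms, header_bytes=HEADER_BYTES):
    """Codec bitrate plus header overhead."""
    return payload_bps + overhead_bps(frame_ms, header_bytes)


if __name__ == "__main__":
    payload = 12000  # example codec bitrate in bits per second
    for frame_ms in (10, 20, 60):
        print(frame_ms, "ms frames:",
              round(overhead_bps(frame_ms)), "bps overhead,",
              round(total_bps(payload, frame_ms)), "bps total")
```

Under these assumptions the headers alone cost about 16 kbps with 20 ms frames, roughly 5.3 kbps with 60 ms frames and 32 kbps with 10 ms frames, which is why longer frames matter most for low-bitrate speech.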
In the CELT layer, which tends to operate at higher bitrates than SILK, 20 ms frames are the default, but frames of 10 ms, 5 ms and 2.5 ms are also possible, which directly increases the frame overhead by transmitting more packets per second to achieve lower latency. In addition, as we'll see below it also reduces the quality/bitrate tradeoff of the CELT layer itself.
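The latency figures quoted in this article (22.5 ms for 20 ms frames, 5.0 ms for 2.5 ms frames) are consistent with the frame duration plus a fixed 2.5 ms look-ahead. The sketch below simply applies that relation to each CELT frame size; the constant and function name are illustrative, and modes that involve SILK may add further delay, so treat the results as a lower bound rather than exact values.

```python
# Algorithmic delay estimated as frame duration plus a fixed look-ahead.
# Assumption: 2.5 ms look-ahead, inferred from the figures in the text
# (20 ms -> 22.5 ms, 2.5 ms -> 5.0 ms). SILK/hybrid modes may add more.
CELT_LOOKAHEAD_MS = 2.5


def algorithmic_delay_ms(frame_ms, lookahead_ms=CELT_LOOKAHEAD_MS):
    """Estimated encoder-side algorithmic delay in milliseconds."""
    return frame_ms + lookahead_ms


if __name__ == "__main__":
    for frame_ms in (20, 10, 5, 2.5):
        print(frame_ms, "ms frame ->", algorithmic_delay_ms(frame_ms), "ms delay")
```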
You probably do not want to use a frame size lower than 10 ms in applications containing speech, as doing so turns off SILK. The "lowdelay" application switch (available in FFmpeg and the raw library) turns off SILK to cut out 4 ms of synchronization delay, but a frame size of 10 ms achieves more delay reduction compared to default without sacrificing SILK.
Unlike the SILK layer, which works on fixed 10 ms blocks, 1, 2 or 6 of which can be combined into an Opus frame, the CELT layer is able to modify the encoding block lengths available to enable its use with shorter frames.
When the CELT layer uses 10 ms, 5 ms and 2.5 ms frames instead of the default 20 ms, it must use smaller transform block sizes to achieve this, thereby reducing frequency resolution in the MDCT compared to the default transform window, thus reducing encoding efficiency for tonal signals. To obtain the same frequency precision for a sound divided into shorter transform windows, improved amplitude precision is necessary, resulting in increased bitrate to obtain the same perceptual quality (or conversely lower quality at the same bitrate).
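The loss of frequency resolution can be made concrete with a simplified model. The sketch below assumes a 48 kHz sampling rate and a single MDCT spanning the whole frame, so that a frame of N new samples yields N coefficients covering 0 Hz up to half the sampling rate. It ignores CELT's short-block switching and band layout, and the function names are hypothetical.

```python
# Simplified model of MDCT frequency resolution versus frame size.
# Assumptions: 48 kHz sampling, one transform per frame, N coefficients
# for N new samples. Real CELT block switching is not modelled here.
SAMPLE_RATE_HZ = 48000


def mdct_bins(frame_ms, sample_rate_hz=SAMPLE_RATE_HZ):
    """Number of MDCT coefficients produced per frame in this model."""
    return int(round(sample_rate_hz * frame_ms / 1000.0))


def bin_spacing_hz(frame_ms, sample_rate_hz=SAMPLE_RATE_HZ):
    """Approximate spacing between adjacent MDCT bins in hertz."""
    return (sample_rate_hz / 2.0) / mdct_bins(frame_ms, sample_rate_hz)


if __name__ == "__main__":
    for frame_ms in (20, 10, 5, 2.5):
        print(frame_ms, "ms:", mdct_bins(frame_ms), "bins,",
              bin_spacing_hz(frame_ms), "Hz per bin")
```

In this model each halving of the frame size doubles the bin spacing (25 Hz at 20 ms, 200 Hz at 2.5 ms), so closely spaced tonal components that a long transform would separate fall into the same bin with short frames.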
Xiph.org used matched PEAQ scores (approximate perceptual quality assessment made in software) for the CELT0.10 codec that was used as the basis of the CELT layer in the Opus reference release, which indicate the following approximate equivalent settings for stereo music.
N.B. This table is useful for interactive streaming only. For music storage & delayed playback or non-interactive streaming, latency reduction is not important and the default 20 ms frame size is preferable.
Opus code includes a compute_equiv_rate() function. Given the bitrate, frame size, CBR decision, and complexity setting, it converts the bitrate to a standard config (VBR, 20 ms frame, complexity 10) equivalent to be used for bandwidth, layer, and stereo decisions. The interesting bits are:
The format and algorithms are openly documented and the reference implementation is published as free software. The reference implementation (Opus Audio Tools, opus-tools), consisting of separate encoders and decoders, is published under the terms of a BSD-like license. It is written in the C programming language and can be compiled for hardware architectures with or without a floating point unit. The accompanying diagnostic tool opusinfo reports detailed technical information about Opus files, including information on the standard compliance of the bitstream format. It is based on ogginfo from the vorbis-tools and is therefore, unlike the encoder and decoder, available under the terms of version 2 of the GPL.
The commandline tools of the reference version are available pre-compiled for the most popular operating systems at opus-codec.org and Mozilla's ftp server, plus in the foobar2000 free encoders pack and some alternative compiles through the hydrogenaud.io opus forum. The libopus commandline tools include encoder opusenc, decoder opusdec, and with a different license, the opusinfo opus stream & metadata analyzer.