CUTTING IT FINE
The Science Of Digital Audio Data Reduction
Published in SOS August 1998
Though you might not realise it, the audio industry has employed data reduction strategies since the earliest days of digital systems. HUGH ROBJOHNS explains the concepts and explodes some myths.
You might think of data compression and reduction as a relatively recent development in digital audio circles -- certainly it's only in the past few years that project studio musicians have started to become aware of data reduction techniques, various forms of which are employed in Minidisc recorders and some of the currently popular personal digital multitrackers to fit large amounts of audio data on to compact storage media without compromising perceived audio quality. But even in the Jurassic period of digital audio evolution, simple data reduction techniques were employed to avoid stretching the available recording and transmission media -- techniques such as reduced sample rates and low-bit quantising. These days, data reduction is far more sophisticated, and allows us to achieve more with less -- more channels, longer recording time, or better perceived resolution, with smaller discs, narrower tape, and restricted data-rate channels.
Although the process of reducing the amount of data needed to represent a digital audio signal is commonly referred to as data compression, the more accurate description is data reduction. Compression implies a reversible process (you can expand the compressed material to restore the original) but the majority of data reduction strategies are lossy, meaning that data is thrown away irretrievably. A loss-less system (and there are a few) is one where some data from the original signal is not recorded, but can nevertheless be recreated perfectly on replay -- truly data compression.
Data reduction is not restricted only to digital audio; noise reduction processes such as the Dolby and dbx systems involve data reduction in the analogue domain. They allow audio signals with a wide dynamic range to be stored on a tape medium with a restricted range -- just as we try to squeeze a large amount of digital audio data onto a medium with a restricted data transfer rate. Many of the techniques of analogue noise reduction apply equally to digital audio data reduction, albeit with the far greater sophistication allowed by complex DSP algorithms. Perhaps this is one reason why the masters of analogue noise reduction, Dolby Laboratories, also make some of the most highly regarded digital data reduction systems.
Another technique is to remove 'redundant data', ie. data that does not carry useful audio information. This may be done in a 'loss-less' fashion -- for example, by removing the higher-order bits when a PCM signal is very small. In this illustration, the 'unused' bits mirror the most significant or 'sign' bit (indicating the positive or negative audio cycle). A quiet signal does not traverse many quantising levels, and the higher-order bits remain static, not conveying anything useful at all -- so they can be removed without affecting the quality of the original audio. However, some means of indicating the number of missing bits is necessary so that these can be reinstated before replay. In this way the data rate can be reduced significantly without altering the recreated audio signal at all. An everyday example of this approach is the NICAM television system, which employs this idea as part of its data reduction strategy.
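The bit-removal idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the NICAM scheme itself: the function names, the 16-bit word width and the per-block 'drop' count are all hypothetical choices made for the example. Each block of samples is stored with only as many bits as its loudest sample needs, plus one number recording how many sign-copy bits were removed.

```python
def redundant_sign_bits(block, width=16):
    """Count the high-order bits that merely repeat the sign bit in every sample."""
    needed = 1  # a two's-complement word always needs at least the sign bit
    for s in block:
        magnitude = s if s >= 0 else ~s
        needed = max(needed, magnitude.bit_length() + 1)
    return width - needed


def pack_block(block, width=16):
    """Return (bits dropped, shortened words) for one block of PCM samples."""
    drop = redundant_sign_bits(block, width)
    mask = (1 << (width - drop)) - 1
    return drop, [s & mask for s in block]


def unpack_block(drop, words, width=16):
    """Reinstate the missing bits by sign-extending each shortened word."""
    sign = 1 << (width - drop - 1)
    return [(w ^ sign) - sign for w in words]


quiet = [3, -2, 0, 1]                        # a very quiet passage
drop, words = pack_block(quiet)              # drop == 13, so 3 bits per sample remain
restored = unpack_block(drop, words)         # identical to the original block
```

Because the 'drop' count travels with the block, the decoder rebuilds every sample exactly, which is what makes this step loss-less.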
Another loss-less technique is to identify sample values which occur very often in the data stream. These can then be removed and replaced with something to indicate where the true value should be re-inserted. The zero-crossing point, for example, is a very frequent event which can be removed for recording but re-inserted before playback, with significant potential savings.
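As a rough sketch of this idea, the Python below removes the single most common sample value from a stream and leaves a one-bit flag in its place. The function names and the flag-per-sample marker are assumptions made for illustration; a practical coder would be free to mark the gaps in some more compact way.

```python
from collections import Counter


def strip_common_value(samples):
    """Remove the most frequent sample value, leaving a flag where each one was."""
    common, _ = Counter(samples).most_common(1)[0]
    flags = [s == common for s in samples]       # one bit per sample
    rest = [s for s in samples if s != common]   # only these need full-width storage
    return common, flags, rest


def restore_common_value(common, flags, rest):
    """Re-insert the removed value wherever a flag says it belongs."""
    remaining = iter(rest)
    return [common if flagged else next(remaining) for flagged in flags]


stream = [0, 5, 0, 0, -3, 0, 7, 0]
common, flags, rest = strip_common_value(stream)   # common == 0, rest == [5, -3, 7]
assert restore_common_value(common, flags, rest) == stream
```

In this example five of the eight samples shrink from a full word to a single flag bit, and the original stream is recovered exactly.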
A more advanced technique, strongly allied to the idea of redundancy, is predictive coding. This is, in principle, a simple idea which has the advantage of suffering very little processing delay (unlike the more sophisticated perceptual coding techniques discussed below), making it a popular technique for real-time applications. Among the real-world systems that employ predictive coding are G.722, apt-X100, and some versions of the DTS system used in the cinema, laser discs and DVDs.
Audio signals are largely repetitive, which is why predictive coding works. The technique involves a 'predictor' which has a knowledge of typical audio signal behaviour. By looking at the preceding audio signal, it tries to anticipate what will happen next and, because of audio's repetitive nature, the prediction is generally quite accurate. If this prediction is subtracted from the original signal, only a small difference signal remains, and this is recorded or transmitted as the data-reduced result (see Figure 3).
The decoder uses the same predictor 'knowledge' to regenerate the predicted signal, and adds the received difference signal to it to rebuild the audio. The accuracy of this system is entirely dependent on the predictor algorithm -- typically around 98% of the original signal is retrieved.
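The encoder and decoder relationship can be shown with a deliberately simple Python sketch. The predictor here is an assumption chosen for brevity (it just extends a straight line through the last two samples), and the difference signal is left unquantised, so this version happens to be exact. Commercial coders use far cleverer adaptive predictors and requantise the difference to fewer bits, which is where the loss comes in.

```python
def predict(history):
    """Guess the next sample by extending a straight line through the last two."""
    if len(history) >= 2:
        return 2 * history[-1] - history[-2]
    return history[-1] if history else 0


def encode(samples):
    """Return the difference between each sample and its prediction."""
    history, residuals = [], []
    for s in samples:
        residuals.append(s - predict(history))
        history.append(s)
    return residuals


def decode(residuals):
    """Run the same predictor and add each received difference back on."""
    out = []
    for r in residuals:
        out.append(predict(out) + r)
    return out


wave = [0, 10, 19, 27, 33, 37]
diff = encode(wave)            # [0, 10, -1, -1, -2, -2]
assert decode(diff) == wave
```

Once the predictor has two samples to work from, the differences are far smaller than the waveform itself, so they can be carried in fewer bits. The decoder only stays in step because it runs exactly the same `predict` function as the encoder.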
The technique works less well in anticipating essentially random signals in noise-like sounds, or in predicting highly unpredictable (but crucial) transients. To improve the precision of the system, therefore, many coders use band-splitting techniques (splitting the whole audio spectrum into four separate frequency bands). This allows multiple predictors to work on simpler band-limited signals with far greater accuracy than they would if handling the complete signal.
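To give a feel for band-splitting, here is a minimal Python sketch that divides a signal into four half-resolution bands and puts it back together again. The sum-and-difference split used here is an assumption made to keep the example short; it is a crude stand-in for the much sharper filter banks a real coder would use, and the function names are hypothetical.

```python
def split(signal):
    """Split an even-length signal into half-rate low (average) and high (difference) bands."""
    pairs = list(zip(signal[0::2], signal[1::2]))
    low = [(a + b) / 2 for a, b in pairs]
    high = [(a - b) / 2 for a, b in pairs]
    return low, high


def merge(low, high):
    """Exact inverse of split()."""
    out = []
    for l, h in zip(low, high):
        out += [l + h, l - h]
    return out


def split_four(signal):
    """Split twice to give four bands (signal length must be a multiple of four)."""
    low, high = split(signal)
    return split(low) + split(high)


def merge_four(bands):
    """Rebuild the full-band signal from the four bands of split_four()."""
    b0, b1, b2, b3 = bands
    return merge(merge(b0, b1), merge(b2, b3))


ramp = [1, 2, 3, 4, 5, 6, 7, 8]
bands = split_four(ramp)
assert merge_four(bands) == ramp
```

Each band could then be handed to its own predictor, which only has to cope with a narrower and therefore more predictable signal.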
One drawback of predictive coding is that since the decoder must use exactly the same predictor as the encoder, improvements to the 'intelligence' of the encoder can only be useful if your decoder is updated too -- otherwise the accuracy of the decoded signal will actually suffer.
In general, this kind of system works very well, providing a typical reduction ratio of around 4:1. However, it can prove fatiguing to the listener over long periods, because damaged transient signals require more 'brain power' from the listener to interpret the sound. Multiple passes through the encoding/decoding process also lead to rapid loss of signal quality, very similar to that experienced when copying analogue cassettes.
The final and most controversial strategy is to remove data which is irrelevant -- ie. data representing sounds considered to be inaudible in the presence of the other elements of a complex audio signal. This relies on frequency and temporal masking, and is entirely dependent on the accuracy of a 'perceptual model' of the human hearing system (see the 'Perceptual Modelling' box).
Perceptual coding involves very precise audio filtering and analysis; processing that is only practical using digital techniques. A bank of filters divides the audio signal into many narrow bands (typically 32 or more) prior to processing, and a perceptual model then analyses the spectral content of the audio to determine which elements are likely to be completely masked (ie. irrelevant) and can therefore be discarded. The remaining audible signals are re-quantised with a low resolution just sufficient to put the quantising noise below the masking threshold in each frequency band.
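The requantising step lends itself to a small worked example in Python. It relies on the familiar rule of thumb that each bit of resolution is worth roughly 6dB of signal-to-noise ratio. The band levels and masking thresholds below are invented figures, and the perceptual model that would actually produce the thresholds is not shown; this is only a sketch of the bit-allocation arithmetic, not any particular codec's method.

```python
import math

DB_PER_BIT = 6.02  # rule of thumb: each extra bit lowers quantising noise by about 6dB


def bits_for_band(signal_db, mask_db):
    """Fewest bits that keep quantising noise at or below the masking threshold."""
    if signal_db <= mask_db:
        return 0  # wholly masked: the band is irrelevant and can be discarded
    return math.ceil((signal_db - mask_db) / DB_PER_BIT)


def allocate(signal_levels_db, mask_levels_db):
    """Work out a bit allocation for every band in turn."""
    return [bits_for_band(s, m) for s, m in zip(signal_levels_db, mask_levels_db)]


levels = [60, 35, 20]   # hypothetical signal level in three bands (dB)
masks = [30, 40, 14]    # hypothetical masking threshold in the same bands (dB)
print(allocate(levels, masks))   # prints [5, 0, 1]
```

The second band sits below its masking threshold and so receives no bits at all, while the others receive only as many as are needed to hide the quantising noise.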
Temporal masking is calculated by dividing the signal up into blocks of samples (typically around 10 milliseconds in length) and analysing each block for transients which will act as temporal maskers. Most systems vary the length of the block to take advantage of both backwards and forwards masking (explained in the 'Perceptual Modelling' box elsewhere in this article).
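As a rough illustration of block handling, the Python sketch below cuts a signal into blocks of about 10 milliseconds and flags any block whose second half is much more energetic than its first, which is one simple way of spotting a transient. The energy-ratio test, the threshold of 8 and the 'long'/'short' labels are all assumptions made for the example rather than the behaviour of any named system.

```python
def block_length(sample_rate, block_ms=10):
    """Number of samples in a block of the given duration."""
    return round(sample_rate * block_ms / 1000)


def choose_blocks(samples, sample_rate, ratio=8.0):
    """Label each block 'short' if it appears to contain a transient, else 'long'."""
    n = block_length(sample_rate)
    plan = []
    for start in range(0, len(samples), n):
        block = samples[start:start + n]
        half = len(block) // 2
        before = sum(x * x for x in block[:half]) + 1e-12  # tiny offset avoids dividing by zero
        after = sum(x * x for x in block[half:])
        plan.append("short" if after / before > ratio else "long")
    return plan


print(block_length(44100))   # prints 441
```

A coder could then use shorter analysis blocks around the flagged transients and longer ones for steady passages.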
Perceptual coding's complex digital filtering and spectral analysis, combined with the process of treating the audio in blocks, can create significant time delays in the audio signal. A complete encode/decode path can exhibit anything between 20 and 200 milliseconds of delay, which often causes serious problems in real-time applications.
The decoder is much simpler than the encoder, as it does not require any perceptual modelling knowledge at all. It simply has to decode incoming data back into the original filter bands and re-quantise the data to conform with a standard PCM format (based on codes embedded in the data), before re-assembling the data into a composite output signal. Upgrades to the perceptual model in the encoder will automatically result in better quality from every decoder, purely through refinements to the process of deciding which bits of the signal are irrelevant.
MPEG-1 and MPEG-2, PASC (used on Philips' Digital Compact Cassette), ATRAC (used on Sony's MiniDisc), and Dolby's AC-3 are all systems which use perceptual coding as their primary means of determining irrelevancy in complex audio signals. However, they can all work at a variety of sample rates, they all employ requantisation to reduce the number of bits per sample, and they all remove some degree of redundancy in the signal too (see the 'Perceptual Coding Processes' box for more on these).
A great deal has been written and said about data reduction systems, much of which has suggested that they are an unnecessary evil. This is not the case: provided the data reduction system is appropriate to the application and is used sensibly, it is a useful tool that allows better results than would otherwise be possible -- for example, longer recording times, smaller recording media, more tracks, or whatever.
This is not to say data reduction is suited to every situation, but digital recording with most modern data reduction systems will provide better quality than most semi-pro analogue recorders, especially cassette multitrackers.
Multitracking is a specific case where digital data reduction offers profound advantages. Since each track is carrying relatively simple signals which include a great deal of irrelevancy and redundancy, data reduction systems can work very effectively in allowing more tracks to be squeezed onto a particular medium with negligible side effects. Data reduction systems usually only reveal their weaknesses with very complex signals, so you might prefer to record your final mix on a linear PCM format, but data reduction can offer more benefits than drawbacks for the multitrack recorder.
One of the biggest problems with current data reduction systems is in coding stereo (or surround) material. This is primarily because many perceptual models are not sophisticated enough to cope with 'stereo unmasking'. This is a phenomenon whereby, although a quiet signal in one channel might be masked by louder elements in the same channel, in the context of stereo monitoring its spatial position (determined by the relative levels between the two channels) effectively unmasks it, revealing it as a separate and identifiable signal.
Imagine, for example, a rhythm guitar panned three-quarters towards the left, behind the rest of the instrumentation in a complex mix. The perceptual coder might decide that the (quiet) guitar in the right channel is masked by everything else going on, although it is sufficiently loud in the left to remain audible. Consequently, the guitar might be removed from the right channel as part of the data reduction with the result that, when listening in stereo, it will appear to be positioned hard left instead of three-quarters left. In practice, it is more likely that some frequency components of the guitar (for example harmonics) will be removed while others are retained, and so the image of the guitar will become blurred. Although this is perhaps an exaggerated example, I hope the point is clear: data reduction systems can impose instability in stereo images if they are not very carefully designed.
One solution is to treat the signal as M-S (middle and side) components rather than L-R (left-right) components. This takes advantage of the fact that most loud signals sit in the centre of the stereo image (middle in M-S) and are therefore common to both channels, leading to a lot of redundancy. By dealing with stereo in this way, a greater proportion of the data rate can be allocated to the side signals that convey imaging information. This is often called Joint Stereo Mode.
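The M-S conversion itself is simple sum-and-difference arithmetic, as this Python sketch shows. The halving convention and the function names are assumptions made for the example; how a particular coder then shares its data rate between the two signals is a separate matter and is not shown.

```python
def to_mid_side(left, right):
    """Convert L-R sample lists to M-S: mid is the average, side is half the difference."""
    mid = [(l + r) / 2 for l, r in zip(left, right)]
    side = [(l - r) / 2 for l, r in zip(left, right)]
    return mid, side


def to_left_right(mid, side):
    """Convert M-S back to L-R."""
    left = [m + s for m, s in zip(mid, side)]
    right = [m - s for m, s in zip(mid, side)]
    return left, right


left = [100, -50, 8]     # the first two samples are identical in both channels
right = [100, -50, 2]
mid, side = to_mid_side(left, right)   # side == [0.0, 0.0, 3.0]
assert to_left_right(mid, side) == (left, right)
```

Wherever the two channels carry the same material the side signal falls to zero, so centrally placed sounds are described once in the mid signal instead of twice.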
Data reduction systems necessarily reduce the amount of data, so it stands to reason that multi-generation processing will lead to quality loss. The MPEG algorithms seem to be slightly less prone to this, but it remains a general problem with all systems. Things are worst when different data reduction systems are cascaded, in which case the quality deteriorates very quickly indeed. The symptoms usually become apparent after two or three generations in the form of vague stereo imaging, noise modulation, a harshness to the sound, aliasing, and increased background (quantising) noise.
In general, it is wise to avoid recording a complex mix on a data-reduced format if any subsequent processing or copying is envisaged. It would also be sensible to avoid recording stereo source material on data-reduced systems wherever possible, but single mono sources (eg. individual tracks on a multitrack) will not be noticeably affected by data reduction.