mp3 decoder tests

  Objective sound quality test  

In this test, the output of various mp3 decoders is compared.
This reveals many problems, including missing samples and missing high-frequency components.

Objective sound quality of mp3 decoders compared to l3dec output (gif image 39kB)

What are the important results of this test?

Do we care?!

mp3 decoder gave terrible results, often losing all frequencies below 700Hz. The result sounds like your hi-fi has broken!
Other decoders with an audible problem below 15kHz (shown in red) usually exhibit the 100Hz bug. Problems above 15kHz (shown in yellow) are probably not audible to most people, even under optimum listening conditions. Differences in the last bit (shown in light blue) are not audible to most people, and it is difficult to say that one decoder is better than another in this case.

How was this test carried out?

A wave file was created containing music, noise, test tones, silence, dither etc. It was encoded by each encoder, giving 8 mp3 files. Each of these files was decoded by each of the decoders, yielding a total of 216 wave files, 27 from each mp3 file. Each decode from a particular mp3 file was compared with every other decode from that file, by taking the difference between the two files. This gives 729 comparisons per mp3 file, or 5832 in all! Thankfully, trends soon became apparent, and most of these could be skipped.
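The differencing step and the comparison counts above can be sketched as follows. This is a minimal illustration, not the tool actually used for the test: it assumes 16-bit PCM wave files, and the function names (`read_pcm16`, `difference`, `comparison_counts`) are made up for this example.

```python
import array
import sys
import wave


def read_pcm16(path):
    """Read a 16-bit PCM wave file into a flat list of signed samples.

    Stereo files come back interleaved (L, R, L, R, ...).
    """
    with wave.open(path, "rb") as wf:
        if wf.getsampwidth() != 2:
            raise ValueError("expected 16-bit PCM")
        samples = array.array("h")
        samples.frombytes(wf.readframes(wf.getnframes()))
    if sys.byteorder == "big":
        samples.byteswap()  # wave data is little-endian
    return list(samples)


def difference(a, b):
    """Sample-by-sample difference of two decodes, over their common length."""
    return [x - y for x, y in zip(a, b)]


def comparison_counts(decoders=27, encoders=8):
    """Reproduce the arithmetic quoted in the text."""
    return {
        "decodes": decoders * encoders,                # 216 wave files
        "per_file": decoders * decoders,               # 729, every decode against every decode
        "total": decoders * decoders * encoders,       # 5832
        # Assumption: ignoring self-comparisons and counting A-B and B-A once
        "unique_per_file": decoders * (decoders - 1) // 2,
    }
```

Note that the figure of 729 counts each pair in both orders and includes each decode compared with itself; counting each distinct pair once gives 351 per mp3 file.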

Examining how the various decodes compared, certain things became obvious. Some decoders gave results that didn't match any of the others. mp3 to wave v1.04 wouldn't synchronise with any of the other decoders, and it was found to be skipping samples. Only two decoders were found to be identical (CEP FhG and Winamp 2.22). Sonique 1.51 always skipped near the start of the file. The difference between l3dec and Winamp 2.22 was only 1-bit, a few times per second - both were clearly based on the same decoding algorithm, but rounding at different points. The difference between Ultra Player, lame, and the Winamp mpg123 plug-in was similar, indicating that these three also had a common origin (mpg123). However, the difference between the l3dec group and the mpg123 group was consistently a 1 sample signal which sounded like the original signal (but obviously very much quieter!). It is impossible to say which of these two groups is more "correct", but for simplicity's sake, l3dec was chosen as a reference, and all the other decoders were judged against it. This comparison yields the results shown in the above table. Had lame been chosen as a reference, the straight 6 and 7 results would be reversed, but all others would remain the same.
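Before two decodes can be subtracted they have to be lined up, since decoders may add different amounts of delay. The text does not say how this was done; the sketch below assumes a simple brute-force search for the offset with the smallest average difference. The name `best_lag` and the `max_lag` parameter are hypothetical.

```python
def best_lag(ref, other, max_lag):
    """Find the offset (in samples) at which `other` best matches `ref`.

    A positive lag means `other` is delayed: other[i + lag] lines up
    with ref[i]. Returns (lag, mean absolute difference at that lag),
    or None if no lag leaves any overlapping samples.
    """
    best = None
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a, b = ref, other[lag:]
        else:
            a, b = ref[-lag:], other
        n = min(len(a), len(b))
        if n == 0:
            continue
        # zip stops at the shorter list, so exactly n pairs are compared
        err = sum(abs(x - y) for x, y in zip(a, b)) / n
        if best is None or err < best[1]:
            best = (lag, err)
    return best
```

A decoder that skips samples part-way through a file has no single offset that works for the whole file, so a search like this would report a large error at every lag. That is one way the "won't synchronise" behaviour could show up.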

On the chosen scale from 0-7, each decoder scores the point below the one it failed on. For instance, if a decoder scores 2, that means it failed at point 3 - i.e. it decodes WITH "audible" problems above 15kHz.
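At the top of the scale, the distinction is between decodes that differ by 1 bit only occasionally and decodes that differ by 1 bit almost continuously. A rough sketch of how a difference signal might be sorted into those groups is shown below. The threshold `occasional_per_second` and the category labels are assumptions made for this example, not values taken from the test.

```python
def classify(diff, sample_rate=44100, occasional_per_second=100):
    """Sort a difference signal into one of four rough categories.

    `occasional_per_second` is a hypothetical cut-off between
    "a few 1-bit differences per second" and "almost continuous".
    """
    if not diff:
        raise ValueError("empty difference signal")
    peak = max(abs(d) for d in diff)
    if peak == 0:
        return "identical"
    if peak > 1:
        return "more than 1 bit"
    seconds = len(diff) / sample_rate
    rate = sum(1 for d in diff if d != 0) / seconds
    if rate <= occasional_per_second:
        return "occasional 1-bit"
    return "near-continuous 1-bit"
```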

Is this test reliable?

This test is very complex, and time consuming! However, the results pin down the performance of each decoder quite well.

The choice of Winamp 2.22 as the reference is not arbitrary. Originally, l3dec was used as the reference; other decoder tests have chosen it as the reference because it comes from the company that developed MPEG-1 and -2 layer 3 coding. It is no coincidence that the other decoders that match it also come from this same company! However, FhG make no claim that l3dec is a reference decoder. They simply state that it is programmed by the same team who created the decoder used in the original official MPEG-1 layer-3 quality assessment listening tests. So we must look to other verification. This is provided by the fact that it agrees with other (bug-free) codecs to 1 bit, and even agrees with many buggy codecs during segments of audio which do not cause their bugs to become apparent. The decision as to which codecs contained bugs has already been largely determined in other, repeatable, tests (e.g. the 100Hz test) which do not depend on a comparison with the l3dec output. Thus the argument is not circular.

However, l3dec gives a slight error decoding Blade encoded files. We can be fairly sure that l3dec is in error, since all the more recent decoders from FhG agree, as do all the other bug-free codecs - l3dec is the odd one out. This isn't concrete proof, but it seems a fair indication. Since there is some doubt, no decoder is disqualified if this is the only error it makes. However, since the balance of probability is that l3dec is in error, Winamp 2.22 (which agrees with l3dec in all other circumstances) is chosen as the reference instead.

The error is a phase difference at 21kHz - if anyone can hear this, I shall be very impressed!

FAQs

  1. I don't understand your results!!!
    The ones with all 6 or 7 scores pass this test. They sound the best.
    I took each .mp3 file, and decoded it using each of the decoders. For each .mp3 file, this gave 27 .wav files! I then compared each one of these .wav files with each other (729 comparisons, though I cheated and skipped lots when trends became obvious!). This showed which ones matched, which ones didn't, and in what way! It became clear (from this, and the previous tests) that some decodes were correct, and the ones which didn't match them were wrong. I chose Winamp 2.22 as a reference, and looked at how the results from each decoder compared with it. None were identical! 7 is very nearly identical, 6 is almost identical, and scores below that indicate that a decoder is doing something different, though that difference may not be audible.
  2. What does nearly identical mean?
    A score of 7 means that there were occasional 1-bit differences (maybe a few per second). A score of 6 means that there were almost continuous 1-bit differences, but never more than 1-bit. The first comes from using the same algorithm, but rounding differently (or compiling the code with slightly different optimisation/numerical accuracy options chosen); the second probably comes from having a slightly different, but correct algorithm, or scaling things very slightly differently. Effectively, it doesn't matter!
  3. Where are your other results?
    I compared all the decoded .wavs, but only show the comparisons with l3dec here. The complete comparison results are too extensive to put on line, but see the next question.
  4. What did the other results tell you?
    I've included conclusions from the full comparison in the results by decoder. One interesting result was that ACM 1999, Cool Edit Pro mp3-me plugin, and Winamp 2.22 give bit-identical decodes when working.
  5. If the last bit is wrong, surely the decode is wrong?
    The last bit of a binary word (the least significant bit, or LSB) can be 0 or 1. So an error in the last bit is an error of + or - 1 on a scale of -32768 to +32767. The problem is, the final binary value of each sample (a sample is one point in time on a digital representation of your music) is calculated from lots of other numbers. These numbers are usually rounded at the point where they'll make no difference to the LSB of the final result. But any representation of any number can only be so accurate (how many decimal places do you need to quote pi "accurately"?), and sometimes it rounds one way, sometimes the other. Rounding is probably a bad idea anyway (dithering is better - see the 24-bit test). But without building an mp3 decoder with virtually infinite accuracy, there's no way of telling which one of these decoders (which all agree to the last bit) is getting it right. However, we can check that they are getting it right some of the time - see the next test.
  6. Where to now?
    Go back to the list of tests or go forward to the next test.
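The rounding argument in FAQ 5 can be illustrated with a toy calculation. This is not how any particular decoder works; it only shows that rounding at different points in the same sum can move the final 16-bit value by one LSB. The function names and the input values are invented for the example, and 16-bit two's-complement samples are assumed.

```python
import math

INT16_MIN = -(1 << 15)     # -32768
INT16_MAX = (1 << 15) - 1  # +32767


def to_int16(x):
    """Round to the nearest integer and clamp to the 16-bit signed range."""
    return max(INT16_MIN, min(INT16_MAX, math.floor(x + 0.5)))


def round_late(terms):
    """Add everything at full precision, then round once at the end."""
    return to_int16(sum(terms))


def round_early(terms):
    """Round each term first, then add the rounded values."""
    return to_int16(sum(math.floor(t + 0.5) for t in terms))


terms = [1000.4, 0.4, 0.4]
late = round_late(terms)    # 1001.2 rounds to 1001
early = round_early(terms)  # 1000 + 0 + 0 = 1000
```

Both results are within one LSB of the exact value 1001.2, so neither is badly wrong; they simply disagree in the last bit.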
