Casual Fridays: Are all MP3s equal?

There's a lot of debate online about whether people can really tell the difference between the various audio formats -- AAC, MP3, you name it. Does it really make a difference?

Recently I saw a blog post suggesting that the methodology for many so-called studies on the phenomenon was flawed. If you're going to test this sort of thing, listeners shouldn't be aware of the format they're listening to. And they shouldn't be asked to compare two versions of a song, they should simply rate how good each particular recording sounds. According to this post, few studies take the time to be rigorous about testing. (Unfortunately, I can't find the post now -- if you wrote it, or if you know who did, please let me know in the comments and I'll link to it from here.)

Anyway, even on a Casual Friday, I think we can do a little better. I created three different versions of two song clips -- 64, 128, and 256 kbps MP3 format. Then I re-encoded all of them at 256 kbps so the files are all the same size. Can you identify which recording sounds better? Is there a difference between the listening skills of "audiophiles" and ordinary listeners? Now we'll find out.
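For a rough sense of why re-encoding at a single bit rate evens out the file sizes: with constant-bit-rate encoding, size is approximately bit rate times duration. The sketch below assumes 30-second clips, constant bit rate, and no header or tag overhead; the function name is invented for illustration and is not part of any encoder.

```python
def cbr_size_bytes(bitrate_kbps: int, duration_s: float) -> int:
    """Approximate size of a constant-bit-rate audio file, ignoring headers and tags."""
    # "kbps" means 1000 bits per second; there are 8 bits in a byte.
    return int(bitrate_kbps * 1000 * duration_s / 8)


# Assumed clip length of 30 seconds.
for rate in (64, 128, 256):
    print(rate, "kbps ->", cbr_size_bytes(rate, 30), "bytes")
```

Under these assumptions a 64 kbps clip is about a quarter the size of a 256 kbps clip, which would give the format away before anyone pressed play.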

Click here to participate

The study asks you to listen to six 30-second music clips, using either your headphones or computer speakers. Then there are a few questions about your musical / audiophile experience, and that's it. Should take about 10 minutes, tops.

Since next week is Thanksgiving, I won't be able to report on the results until Friday, November 30. But that also means we'll have an extra week to collect data -- you can participate until the morning of Thursday the 29th.


This is a question I've had and not seen addressed. Not that I've looked THAT hard, to be honest. I'll go participate, now.

By marciepooh (not verified) on 16 Nov 2007 #permalink

AAC+ at 128 kbps and 88,000 Hz playback. It is the most efficient compression so far (besides OGG).

If I had to guess the outcome of this experiment in advance, I'd guess that nearly everyone will be able to distinguish between 64 and 256 kbps, but far fewer will be able to distinguish between 64 and 128 kbps and fewer still between 128 and 256 kbps.

One big problem here with this experiment is volume. Even very minor changes in volume between listenings can radically alter how "good" a person perceives a recording to sound. (Louder seems to sound better, even if listening to an identical recording.) The content of the sound clips also has a big effect on perception of the recording's "goodness" or accuracy. A final problem is that most computer speakers suck. They really do. So do most earbuds that people use to listen to MP3s (including the ones that come with the iPod). Such crappy speakers can obscure even fairly obvious differences in recording quality.
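One common way to reduce the volume problem is to match the average level of the clips before comparing them. The sketch below is only an illustration under stated assumptions: samples are floats between -1.0 and 1.0, the clips are not silent, and RMS level stands in as a crude proxy for perceived loudness. The function names are hypothetical.

```python
import math


def rms_dbfs(samples):
    """RMS level in dB relative to full scale, for samples in the range -1.0..1.0.

    Assumes the clip is not pure silence (an RMS of zero has no finite dB value).
    """
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20 * math.log10(rms)


def matching_gain_db(reference, other):
    """Gain in dB to apply to `other` so its RMS level equals that of `reference`."""
    return rms_dbfs(reference) - rms_dbfs(other)


# Synthetic example: the same 440 Hz tone at full and at half amplitude.
tone = [0.5 * math.sin(2 * math.pi * 440 * n / 44100) for n in range(44100)]
quiet = [0.5 * s for s in tone]
gain = matching_gain_db(tone, quiet)
print(round(gain, 2), "dB of gain needed to match levels")
```

Halving the amplitude costs about 6 dB, so the quieter copy would need roughly that much gain before a fair comparison.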

Thanks for taking me up on this! (link to my post is in my nick)

I can't wait to see the results of this one.

Hey, Johan,

That wasn't the post I was thinking of (though that's a great post too). The one I'm remembering had a bunch of links to MSM "comparisons" of different digital audio formats. I'm sure we'll track it down eventually....

I think there is a point at which the encoding has higher fidelity than the hardware it is being tested on. I could tell the low encoded file very easily - it sounded like it was being played on the RCA bookshelf stereo system sitting behind me on the shelf. I couldn't tell the difference between the other two songs.

I am not an audiophile, but my custom-made computer speakers were designed and built by an audiophile a few years ago who found them 'acceptable' when he was finished with them.

Back in the college days when I first ripped my country music CDs, I could just barely hear the difference between the original CD and MP3 at 128 kbps, but when I set it to 168 kbps, or even 128 kbps VBR, I couldn't hear the difference anymore. I am guessing that the encoders have improved enough in the 5-6 years since that 128 kbps is matched in fidelity to my speaker systems. Or I have gone deaf enough in the meantime that my ears have been reduced to 128 kbps.

Orac, I think that is the point of this test - under real listening conditions, with real, common (low-quality) sound hardware, do the differences in recording quality actually matter?

Janne --

That's true, but we're also asking people about the quality of their equipment, so it may be that with fancier equipment they will be better able to distinguish between the different bit rates.

"Recently I saw a blog post suggesting that the methodology for many so-called studies on the phenomenon was flawed."

Do you have a link?

"If you're going to test this sort of thing, listeners shouldn't be aware of the format they're listening to. And they shouldn't be asked to compare two versions of a song, they should simply rate how good each particular recording sounds"

I think the testers at HydrogenAudio take care of both points, although not too sure about the 2nd.

"And they shouldn't be asked to compare two versions of a song, they should simply rate how good each particular recording sounds."

I'm not sure what the problem with this method is. The triangle test, for instance, seems like a sensible way to see if a subject can find the difference between two alternatives.

I have semi-fancy headphones, but I'm quite certain my built-in laptop sound card is lousy. Every time my cellphone rings, I hear this little Morse-code-like buzzing in my headphones. I think that masks any difference I might hear between the two higher bitrates.

I recovered from being an audiophile by convincing myself it just doesn't matter. Any concert you go to has terrible "fidelity." What I mean by fidelity is that almost everyone at the concert will hear something that is objectively different from what everyone else is hearing.

On top of that, the performance is probably subjectively poor, compared to a studio recording.

And yet almost everyone prefers a concert to any format you can name; be it SACD, vinyl, or mp3.

I work as a live audio engineer, and in my experience problems with recordings frequently revolve around the high frequencies, since this is where aliasing begins to become a problem. A dead giveaway for me has always been cymbals. Neither of these songs was very cymbal-heavy (only one crash in the Santana song); more cymbals, I feel, would have made the differences between the encodings incredibly obvious. Think back to a low-quality downloaded song and remember how cymbals can sound "splashy." From a mathematical perspective, however, I think it's quite difficult to tell the difference between 256 and 128 kbps even with sophisticated frequency analysis tools.

By Chris Grimes (not verified) on 17 Nov 2007 #permalink

If this were a test, then I would fail badly. I can barely distinguish between them, let alone which one is better. But I still tried my best (more or less like guessing answers to an oral French exam) on your survey.

Well, on my system with my ears- sorry, it all sounded the same to me... go figure. And I thought I was fussy.

"Recently I saw a blog post suggesting that the methodology for many so-called studies on the phenomenon was flawed. If you're going to test this sort of thing, listeners shouldn't be aware of the format they're listening to. And they shouldn't be asked to compare two versions of a song, they should simply rate how good each particular recording sounds. According to this post, few studies take the time to be rigorous about testing. (Unfortunately, I can't find the post now -- if you wrote it, or if you know who did, please let me know in the comments and I'll link to it from here.)"

The two test protocols which are most widely used on the net are ABX and ABC/HR. Properly applied, these two testing protocols prove (or demonstrate) different things, and should be used in different circumstances.

One thing which has an effect on your test is training. MP3 (and other perceptual codecs) introduce a variety of very particular artifacts to recordings. Some of these are obvious (like stereo collapse) and some are more subtle (like pre-echo on attacks) - all are easier to hear if you know what you are listening for. I recommend visits to hydrogenaudio.org and ff123.net to anybody who is interested in the field.


Dave,

Did you re-encode all files at 256 kbps? If so, this wouldn't be an accurate representation of the indicated bitrates, as each audio sample would have been compressed twice. Similarly, if it were only the 64 kbps and 128 kbps versions that were re-encoded, the 256 kbps version would sound better than it should (comparatively), as it had only gone through one step of compression.

Generally speaking, yes, audio encoders have certainly gotten better over the years, especially the open-source Ogg Vorbis (which I now use to digitally archive my physical CD collection). And as Marc B mentioned, knowing what compression sounds like ("training") is crucial; maybe it's a case where ignorance is bliss, but once you know what warbling, metallic highs, and pre-echo sound like, you hear them everywhere. I think ABX tests with codecs like Vorbis have helped developers pinpoint such issues and continually improve audio compression in general.

I'm going to echo Chris Grimes' statement. Cymbals are consistently the most noticeable sign of poor quality in audio samples, and neither sample we listened to had many of them.

Interesting experiment.

The other day I tested myself and a friend (he was blind to which choice was playing) on 128 vs 192 kbps on Windows Media and Windows Media Pro tracks, vs WAV, all ripped by a $1500 Sharp AL27 laptop from the same CD track with plenty of cymbal and very low bass.

We thought we could perceive much better tonal quality from 192 vs 128, with broader bass (not so tight) and clearer cymbals, with piano not so tinny, but not much discernible difference between 192 kbps WMA Pro and WAV, which is 1411 kbps.
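The 1411 figure for WAV follows directly from the CD audio format: sample rate times bits per sample times channels. A small sketch, assuming standard CD parameters (44,100 Hz, 16-bit, stereo); the function name is made up for illustration.

```python
def pcm_bitrate_kbps(sample_rate_hz: int = 44100,
                     bits_per_sample: int = 16,
                     channels: int = 2) -> float:
    """Bit rate of uncompressed PCM audio in kbps (1 kbps = 1000 bits per second)."""
    return sample_rate_hz * bits_per_sample * channels / 1000


cd_rate = pcm_bitrate_kbps()
print(cd_rate, "kbps for CD-quality PCM")
print(round(cd_rate / 192, 2), "times the data of a 192 kbps file")
```

So a 192 kbps file carries a bit more than one seventh of the data in the uncompressed track.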

We were playing from the laptop that recorded the tracks through separate small speakers and a big woofer, none of it truly audiophile standard though.

Soon we lost any discernment, however.

I did this earlier with vinyl vs early CDs and found that obsessed audiophiles couldn't tell the difference in blind tests, but were very sure of a difference if aware of the source.

This seems to be a principle - that the mind needs a frame of reference to discern subtle differences in tone and taste, so if you carry out a blind test people immediately or rapidly lose all discernment.

Thus you can make complete fools of experts eg in wine by giving them blind taste tests and in fact many times journalists have delighted in such tests where the experts have chosen inferior wines (acknowledged inferior by general agreement) over famous ones.

But the effect is artificial, because the frame of reference is removed. A true test probably has to provide a frame of reference of some kind, and that will always unblind the test by definition, I imagine.

I remember when they brought New Coke out and I hated it, and tested a girlfriend and myself blind with Classic Coke and New Coke. Suddenly I couldn't really tell, though my girlfriend did a little better.

Any experiment has to take this factor into account, seems to me.

"But the effect is artificial, because the frame of reference is removed. A true test probably has to provide a frame of reference of some kind, and that will always unblind the test by definition, I imagine."

The ABX protocol does give a reference. Consider a wine analogy - an ABX test would give you two bottles (and tell you what they are) and a glass of one of the two wines. The test requires you to decide which of the two bottles the unknown glass came from. You are then provided with another unknown glass. After a pre-decided number of trials the test ends.

If the probability that you were only right by guessing is sufficiently low, then the result of the test is that you can tell the difference between the wines. If the probability is above the threshold, then the test does not show anything (because it is possible for the subject to bias the test in the negative direction).
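The "probability that you were only right by guessing" can be worked out with a one-sided binomial calculation, since each ABX trial is a coin flip for someone who hears no difference. This is a minimal sketch rather than any particular tool's method; the function name is hypothetical, and the 0.05 threshold mentioned afterwards is a conventional assumption, not something specified in this thread.

```python
from math import comb


def abx_p_value(correct: int, trials: int) -> float:
    """Chance of getting at least `correct` of `trials` ABX trials right by pure guessing."""
    # Each guess is right with probability 1/2, so every outcome sequence
    # is equally likely and we only need to count the favourable ones.
    favourable = sum(comb(trials, k) for k in range(correct, trials + 1))
    return favourable / 2 ** trials


for correct in (11, 12):
    print(correct, "of 16 correct: p =", round(abx_p_value(correct, 16), 4))
```

With an assumed threshold of 0.05, 12 correct out of 16 would count as hearing a difference, while 11 out of 16 would leave the question open.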

Other protocols (like ABC/HR) decide similar things in different ways.

That is my point. The reference once provided introduces bias in the mind. In effect, one (thinks one) knows what to look for ie Lafite Rothschild or plonk.

The mind/brain is a self referential instrument in matters of taste, so it really can't measure or discriminate objectively.

Once the reference framework is removed - eg if the bottles are removed - then any difference perceived will rapidly dissolve.

Just repetition has the same effect, I find.

Surely this is exactly what causes the problem with people born blind whose sight is restored. They cannot sort things out visually for a long time, because the brain/mind has no set of habitual references in a framework.

Surely you don't think that experts suddenly lose their innate experience and expertise? It is the frame of reference that is removed, so they cannot taste properly.

1) I had to listen to the "use speakers" portion through my headphones too, as I read/participate from work in a cube farm. Sorry; I'd hoped I'd be able to note that in a comment field at some point.

2) I've heard CDs that were burned by friends that had what I can only call "tape hiss," although I never got to ask them whether they were using a speedy setting or cheap equipment...

A couple of responses, both to the survey and to the comments:

1) As far as the survey, first, I agree that re-encoding at 256 kbps might be a problem. Second, most of the bitching about sound quality I've seen on the internet has been claims--which frankly sound insane to me--that identically encoded mp3 files sound "different" on, say, a Zune and an iPod, even when equipped with similar or identical headphones/earbuds. I think it's clear that compression ratios make a difference. It's just a question of where your cut-off point is (and this is true for every measure of quality, whether it's wine or audio or video).

2) I'm curious about the frame of reference problem. The difference in wine quality is huge (to me, at least), for example, but varies depending on your mood, what you're eating, what you've just eaten or drunk, how long the wine's been opened, and many other factors. As to audio, everything also makes a difference (room size, reverb, ambient noise--my son was shouting while I was trying to listen). The problem with any controlled study is that, in the name of consistency, it tends to reduce those highs and lows to a narrow middle ground. But the fact remains that most people don't care: I have friends who can't tell the difference between VHS and DVD, who can't hear the difference between AM radio with the treble all the way down and an audiophile system with "proper" EQ.

3) Cymbals. It's the one thing I always notice if I'm listening through headphones, and if the bit rate is 128 or lower. They "whoosh" like they're going through a flanger. One reason I like the AAC format that Apple uses is that even at 128, I don't hear that (AAC evidently compresses the midrange more than the high end).

4) On the other end, some audiophiles (e.g., guitarist Eric Johnson) claim to be able to hear the difference between different brands of batteries in effects, or the difference between gold coated and brass coated plugs. Sound engineers universally claim that they deal with people like this by pretending to adjust knobs or change batteries, and that the "audiophile" then says "ah, much better. My ears are no longer bleeding." But these stories are anecdotal, and it would be great to come up with a methodologically solid study to put the twin evil ideas of "everything sounds about the same" and "I can tell the difference between Ray-O-Vac and Duracell in my audio equipment" to rest. Or prove them right.

5) As a matter of record, verified through numerous blind taste tests, I can immediately tell the difference between Coke, Pepsi, Diet Coke, Caffeine Free Coke and all the rest. I once won a bet by telling the difference between two varieties just by smelling the beverages. I don't think of this as "discriminating taste," so much as I think of it as a curse.

By Robert Rushing (not verified) on 19 Nov 2007 #permalink

"I think the testers at HydrogenAudio take care of both points, although not too sure about the 2nd."

They do. Really, anyone who complains about bad codec testing should check out the hydrogenaudio tests and discussions. MP3 codecs are among the most thoroughly tested audio components around. And Hydrogenaudio is notable as one of the few audio forums that requires adherence to DBT methodology as parts of its Terms of Service.

Here's HA's Listening Test forum:

http://www.hydrogenaudio.org/forums/index.php?showforum=40

By Steven Sullivan (not verified) on 20 Nov 2007 #permalink

There is a problem with the testing method, I think. Re-encoding with MP3 compression, even at 256 k, will corrupt the samples by discarding data. That is intrinsic to the MP3 compression method - it discards selected parts of the audio that we find difficult to notice.

The way to make a fair comparison set of samples is to copy the 64, 128 and 256 k MP3 samples to .wav format, which is lossless. By doing this the originals will be completely preserved without discarding any components, and will all be the same size so the test subjects will not be able to distinguish them by filesize.

It is well known that compressing with MP3 works remarkably well the first time around, but an MP3 that is compressed again is very likely to sound noticeably worse, and even strange.

Pretty much what Robert Rushing said.

For myself, I would add that there is a confounding factor of preference. I have slightly ...odd... hearing at the mid-bass to bass ranges; I can perceive tones readily enough, but I lose "definition" in them, and they all sound very muddy to me. Consequently I tend to de-emphasize mid-bass even more than usual, because the perception of them wrecks the crispness of the mid-range. Also, my perception of harmonics at the high end seems to vary somewhat away from the norm, although it would be difficult for me to explain. All I know is that no one else likes my preferred equalizer settings. ;-) In MP3s, though, this seems to translate to my liking some of the lower quality compressions more than the higher fidelity compressions, depending on the piece.

By Luna_the_cat (not verified) on 22 Nov 2007 #permalink

As you have used lossy compression on the second pass this experiment says nothing about the relative perceived quality of the original bitrates.

It may be slightly more valid to re-encode using some form of lossless compression, however I don't see the point. ABX comparisons of many audio formats are strewn across the web, even from 'non-audiophiles'.

Dave:
First, I appreciate the effort you made in putting this together.

I respectfully must disagree with your initial assertion that the songs shouldn't be compared. My reasoning is that unless the original non-compressed music is heard, the listener/test subject has no reference as to how close the compressed version has come to equalling the original. After all, isn't the goal to show the effects and quality of each version? That will allow each person to decide for themselves what they prefer to hear.

It's important to note that between intense codec compression and heavy dynamic compression we are taking away from the original work. It's my belief that a vast majority of listeners will want to hear music in a file format that gives them the best sound at the most reasonable size, i.e. lossless.

Personally, I took the test on both my iPod (Sennheiser $40 headphones) and on my family room system (about $1200) with a music server using iTunes and shuffle play. The result: I was able to differentiate the three types. But I may not be a fair subject, as I have "trained" my ears to listen for critical instruments/passages. As others have mentioned, cymbals are very revealing.

Once again, thanks for the effort.