Analyze: Using timbre[0] to denormalize chroma values

Posts: 6
Registered: Apr 12, 2011

Dear developers,

I need to denormalize the pitch values provided in the "segments" section of the song analysis. To do that, I need to know the power of the corresponding audio segment (sum_{k in segment} x_k^2).
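For reference, the quantity described above can be written as a minimal sketch, assuming you have the raw audio samples and the segment boundaries as sample indices. The function name `segment_power` and the toy signal are purely illustrative and are not part of the analysis API.

```python
def segment_power(samples, start, end):
    """Sum of squared samples x_k^2 for k in [start, end).

    `samples` is any sequence of numbers (hypothetical raw audio);
    `start` and `end` are sample indices bounding the segment.
    """
    return sum(float(x) * float(x) for x in samples[start:end])


# Toy signal, not real audio data.
signal = [0.0, 0.5, -0.5, 1.0, -1.0, 0.25]
power = segment_power(signal, 1, 5)  # covers 0.5, -0.5, 1.0, -1.0
```

The analysis output does not include the raw samples, which is why the question below is about estimating this value from the loudness fields instead.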

On the one hand, we have loudness at two points of the segment, but this is clearly not enough to estimate the power. On the other hand, looking at the "timbre" section, I see that the first bin is described as "average loudness of the segment".

Will I get a meaningful estimation of the segment power by taking the square of timbre[0]?

Posts: 71
Registered: Sep 17, 2008

timbre[0] is a decent estimator of the average loudness of its corresponding segment. However, there's a difference between spectral power (a mathematical function) and loudness estimation (a perceptual measure). Although the two are related, the segment loudness depends on how energy is distributed across the spectrum and over time. You'll likely have a hard time denormalizing pitch values reliably.

Posts: 6
Registered: Apr 12, 2011

Tristan, could you elaborate a bit on the following issues:

  1. In what units is timbre[0] given? dB?
  2. Can power (spectral or temporal) be recovered from timbre[0]?

Posts: 6
Registered: Apr 12, 2011

Given that this is a "general discussion" topic, I would suggest that EchoNest not normalize chroma values in the analysis: some information is lost and nothing is gained, because anybody who needs normalization can easily obtain it at very little computational cost.
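To illustrate the point about cost and lost information, here is a small sketch. It assumes the normalization divides each chroma vector by its largest bin; the thread does not state the exact scheme, so treat that as an assumption. `normalize_chroma` is a hypothetical helper, not an API call.

```python
def normalize_chroma(raw):
    """Scale a chroma vector so its largest bin equals 1.0.

    Assumption: peak normalization. The scheme actually used by the
    analyzer is not specified in this thread.
    """
    peak = max(raw)
    if peak <= 0:
        # Silent or empty-energy segment: nothing to scale.
        return [0.0 for _ in raw]
    return [v / peak for v in raw]


quiet = [2.0, 4.0, 1.0]    # made-up unnormalized values
loud = [20.0, 40.0, 10.0]  # same shape, ten times the magnitude

# Both map to the same normalized vector, so the scale is gone.
same = normalize_chroma(quiet) == normalize_chroma(loud)
```

Going from raw to normalized takes one pass over the vector, while going back requires the scale factor, which is exactly what is missing from the analysis.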

Posts: 71
Registered: Sep 17, 2008

Yes, timbre[0] is in "dB", yet on a positive scale, and after going through a bunch of auditory filters. It's not really possible to accurately convert that back into conventional power from the information provided, although it may be possible to approximate it, given a few arbitrary assumptions in trying to invert the auditory filters. But why exactly would you want to do that anyway? There's no real loss in the pitch description as it is.
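As a rough sketch of what such an approximation might look like: if timbre[0] is treated as a plain dB level, the usual inversion is 10^(dB/10) rather than squaring the value. This ignores the auditory filtering entirely and assumes an arbitrary reference level (`ref_db`), so the result is at best a relative scale factor, not true power. Both function names are hypothetical.

```python
def db_to_relative_power(level_db, ref_db=0.0):
    """Convert a dB level to a linear power ratio.

    Assumption: level_db is an ordinary power-dB value relative to
    ref_db. The auditory filters mentioned above are NOT inverted.
    """
    return 10.0 ** ((level_db - ref_db) / 10.0)


def rescale_pitches(pitches, level_db, ref_db=0.0):
    """Scale a normalized pitch vector by the relative power estimate."""
    gain = db_to_relative_power(level_db, ref_db)
    return [p * gain for p in pitches]


# Made-up values: a two-bin pitch vector and a 20 dB loudness figure.
scaled = rescale_pitches([1.0, 0.5], 20.0)
```

Values rescaled this way are only comparable with each other under the same assumptions; they should not be read as measured segment power.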

The reason pitch values are normalized is to have them represent pitch content only, so they are comparable with each other. If we returned values without any normalization, they would represent both pitch and a form of magnitude at once, even though loudness is already described in its own way, as meaningfully as possible. You could keep going down this path and ask why there are only 12 pitches. You could just as well describe 88 pitches, and incorporate a bit of timbral information into the mix. But then why quantize them at all, when you could return accurate partials in both time and frequency, directly out of the auditory spectrum?

Sure, eventually I'd like to make these quantities available to developers via the API. But for the time being, and for many reasons, it is a lot more convenient and digestible to deal with canonical vectors that each represent one simple concept. Even using timbre[0] for average loudness is quite a stretch, and it will likely get revised in the future. Thanks for your input: the more demand there is for a certain feature, the more quickly it'll get updated.
