|
Posts: 90
Registered: Aug 29, 2008
|
Posted: 2009-03-04 16:35:51
|
Compare music://id.echonest.com/~/AR/ARA1UU51187FB5A70B (Misfits) with music://id.echonest.com/~/AR/ARCG0MS1187FB4A666 (The Misfits). MusicBrainz has a fairly nice artist alias system. Is there any chance of us merging erroneously distinct artists? "Bonnie 'Prince' Billy" vs. "Bonnie Prince Billy"? (Ah, looks like you've fixed that since Sunday.) Hotel Alexis vs. The Hotel Alexis?
What to do with obvious family connections? I can see why "Emily Haines & The Soft Skeleton" and "Emily Haines" should be in the database as separate entities, but should one be a top "similar" artist to another?
Collaborations are interesting to know about, but if I want to know about artists similar to Bonnie 'Prince' Billy, do I want to know about him recording with Tortoise?
Not trivial questions, but I've been wondering for a long time what the stance was on these things. I do see some egregious examples cleaned up, but I drew my examples from fresh queries from @recomme.
|
|
|
Posts: 666
Registered: Sep 08, 2008
|
Posted: 2009-03-04 19:33:22
|
atl:
Since we get music data from a large number of sources, we are often dealing with multiple spellings of an artist name due to artist aliases, misspellings, Americanizations of international spellings and so on. We work hard to resolve these to a single ID, but it is tricky business and some aliases slip through. We are working to improve this resolution process, so you should see things improve over time. One change that is coming soon is improved search_artist functionality that should make it easier to resolve an artist name to an ENID.
With regard to get_similar and family connections, influencers, collaborators and such - I don't think there's a good general rule as to whether or not they should be included in similar artists. For some, the answer is maybe yes (Sonny, Cher), (Paul, Linda), and for some, probably no (Michael Jackson and Lisa Presley, Norah Jones and Ravi Shankar). We have all the data, and we do incorporate some of this info into our similarity calculations. Again, this will improve over time as we refine some of these algorithms.
We are certainly interested in what you think, so please continue to offer your ideas on how you think these things would work best.
Paul
|
|
|
Posts: 6
Registered: Feb 23, 2009
|
Posted: 2009-03-04 22:10:14
|
I second that. Also on the subject of data sanitizing, many of the URLs for "get_audio" are dead links. I know you guys are subject to the internet at large for this, but there it is.
Also, for example, with Joe Satriani, the first four audio tracks returned have different URLs, but all are the same song ("If I Could Fly" -- which perhaps is someone's idea of trying to make a point about the Coldplay lawsuit).
If the intention of the API is to just sort of give a window into the Unwashed Internet and let the client sort it out, then I guess you can ignore this.
|
|
|
Posts: 666
Registered: Sep 08, 2008
|
Posted: 2009-03-05 00:02:15
|
stenkarl:
We are going through a data migration process this week from an old Model-T architecture to a new Ferrari architecture. (All of our Python code is now in Italian!) During the migration, the web services are serving up data that may be stale. You can read about it in this post:
http://developer.echonest.com/forums/thread/42/
Things should be getting back to super fresh next week. Still, that doesn't mean that every MP3 link will be live. I don't think we can offer any guarantees about liveness of audio links - but we certainly try to be better than the unwashed Internet.
Hope this helps
Paul
|
|
|
Posts: 90
Registered: Aug 29, 2008
|
Posted: 2009-03-05 06:37:42
|
Paul,
Thanks for the comments. By no means did I mean to overlook the tremendous job implicit in EN's efforts in artist normalization. I've been working with it so long that I take it for granted.
I guess my question was three-fold: 1) hey, this is a hard problem, what is your thinking on this? 2) MB's open source-like approach to this seems to work well for them; is there any chance of us manually submitting suggestions for artist merges ourselves? and 3) would you consider adding some data to make users make up their own minds?
(Part of this is, yes, missing the MBID availability in APIv3. I had been considering a few features that would have used MusicBrainz's artist aliases database, and losing those MBIDs meant having to look after the data myself.)
To expand upon #3, I think if you capture collaborators, influencers, family members (I had originally meant family members in the sense of rock family trees, but you bring up good points there), and other recording projects, then there's scope to expose that relationship in an API call somewhere. I know I'd especially love alternate recording projects marked in get_similar calls so that those can be filtered out when presenting them to the user.
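One way the client-side filtering described above might look, as a rough Python sketch. The `relations` field and its labels (`alternate_project`, `collaborator`) are hypothetical; nothing in this thread says get_similar returns anything like them, and the sample entries are made up purely for illustration.

```python
# Hypothetical relationship labels; not part of any real get_similar response.
DEFAULT_EXCLUDE = frozenset({"alternate_project", "collaborator"})


def filter_similar(similar, exclude=DEFAULT_EXCLUDE):
    """Drop artists whose relationship labels overlap the `exclude` set."""
    return [
        artist
        for artist in similar
        if not exclude.intersection(artist.get("relations", ()))
    ]


# Made-up sample data in an assumed response shape.
similar_to_bpb = [
    {"name": "Tortoise", "relations": ["collaborator"]},
    {"name": "Some Other Artist", "relations": []},
    {"name": "Unlabelled Artist"},  # no relationship info at all
]

for artist in filter_similar(similar_to_bpb):
    print(artist["name"])
```

Passing a different `exclude` set would let each application decide for itself which relationships count as noise, which is the "let users make up their own minds" part.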
|
|
|
Posts: 666
Registered: Sep 08, 2008
|
Posted: 2009-03-05 12:12:30
|
atl:
Thanks as always for the insightful questions and comments. Some answers below:
1) hey, this is a hard problem, what is your thinking on this?
As you know, we have lots of data, including lots of information about artist names, aliases and common misspellings, but we are not yet using all of this data to help us normalize and resolve artists. Over time we will be improving this - so from your point of view, things should just get better. When you search for "beatles, the" or "beetle" you'll get "the beatles" and not "the black beatles". But as you say, it is a hard problem.
2) is there any chance of us manually submitting suggestions for artist merges ourselves?
I like the idea of allowing our community of developers to help us fix up problems with the data. Artist merges are a great example. Let me see what the thinking is from the knowledge team about that sort of thing (I'm still the new guy, so they may already have a plan for this that I don't know about).
3) would you consider adding some data to make users make up their own minds?
I'm a big fan of adding more transparency to something like get_similar - but the big trick is figuring out how best to represent this. There are so many possible reasons why an artist may be considered similar. If you have ideas about how you see this being represented in an API, I'd be really interested in hearing them.
Paul
|
|
|
Posts: 90
Registered: Aug 29, 2008
|
Posted: 2009-03-07 18:04:45
|
Lots of big discussions embedded in here, probably best taken at least partially off-line.
For now, I'll point you to: music://id.echonest.com/~/AR/ARHBXLJ11F43A69FD0, which is the artist "Ac dc". That's been really troubling with one of my apps, because even though it's just a "stub" artist (no profile, no connections to other artists), it still appears on search_artists calls. And it ends up being a blind alley.
What to do with such identifiers? I'm totally behind the idea of not deleting IDs from the system, but there's a point when the detritus becomes a bit of a liability. No doubt the Knowledge guys know this and deal with it on a scale orders of magnitude larger than I do, but I just wanted to raise a voice from the other side: the issue is now at the point where it's a nuisance to API users, and it has trickled out to my application users.
|
|
|
Posts: 666
Registered: Sep 08, 2008
|
Posted: 2009-03-07 19:26:48
|
atl:
Thanks, so noted and added as a bug. We should know when AC DC and AC/DC are really the same band. As a temporary workaround, you may be able to sort this out a bit by:
1) Normalize artist names (using Dan Ellis normalization: http://labrosa.ee.columbia.edu/projects/musicsim/normalization.html)
2) Identify duplicates in the normalized names
3) Select the name with the highest familiarity (via the get_familiarity method)
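The three steps above could be sketched roughly like this in Python. Note the assumptions: `normalize_name` is a simplified stand-in, not the actual normalization rules from the LabROSA page; the artist IDs are placeholders; and the familiarity scores are passed in as a plain dict with made-up values, since how you fetch them (one get_familiarity call per ID, presumably) is left out.

```python
import re
import unicodedata
from collections import defaultdict


def normalize_name(name):
    """Simplified normalization (an assumption, not the LabROSA rules):
    strip accents, lowercase, drop punctuation, drop a leading or
    trailing "the", and remove spaces."""
    ascii_name = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    words = re.sub(r"[^a-z0-9]+", " ", ascii_name.lower()).split()
    if len(words) > 1:
        if words[0] == "the":
            words = words[1:]
        elif words[-1] == "the":
            words = words[:-1]
    return "".join(words)


def pick_canonical(artists, familiarity):
    """Group (artist_id, name) pairs by normalized name and keep the
    entry with the highest familiarity score in each group."""
    groups = defaultdict(list)
    for artist_id, name in artists:
        groups[normalize_name(name)].append((artist_id, name))
    return {
        key: max(members, key=lambda m: familiarity.get(m[0], 0.0))
        for key, members in groups.items()
    }


# Placeholder IDs and made-up familiarity scores, for illustration only.
artists = [
    ("id-1", "Ac dc"),
    ("id-2", "AC/DC"),
    ("id-3", "Beatles, The"),
    ("id-4", "The Beatles"),
]
familiarity = {"id-1": 0.1, "id-2": 0.9, "id-3": 0.2, "id-4": 0.95}

canonical = pick_canonical(artists, familiarity)
for key, (artist_id, name) in canonical.items():
    print(key, "->", artist_id, name)
```

A stub artist with no profile would be expected to score low on familiarity, so it should lose out to the real entry in its group; IDs missing from the dict are treated as 0.0.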
|
|