I assume that you are using the song/identify part of the API, yes? It's not really designed to be used in the situation you have described, but read on anyway.
If you are using ENMFP, then you will need to supply it with a clean excerpt, i.e., it is highly unlikely that it will work if there is speech on top of the music.
If you are using Echoprint, then there is a better chance of the music identification working properly in the presence of a small amount of extra noise, but if there is a significant amount of talking over the music then it is also unlikely to work. The best approach would be to run the query using a segment of audio that has minimal talking over it (i.e., one that is mostly clean), and to run experiments to evaluate whether it works in your situation.
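As a rough sketch of what that experiment might look like: generate an Echoprint code from your chosen (mostly clean) segment with the echoprint-codegen tool, e.g. `echoprint-codegen recording.mp3 10 30` to fingerprint 30 seconds starting at the 10-second mark, then POST the resulting code to the song/identify endpoint. The Python below is illustrative, not a definitive client; the `build_identify_request` helper is my own name, and you would substitute your actual API key and code.

```python
# Hypothetical sketch: identify a song from a mostly-clean excerpt via
# the Echo Nest song/identify endpoint. Assumes you already have an API
# key and a base64 Echoprint code produced by echoprint-codegen.
import json
import urllib.parse
import urllib.request

IDENTIFY_URL = "http://developer.echonest.com/api/v4/song/identify"

def build_identify_request(api_key, code):
    """Build the url-encoded POST body for song/identify."""
    return urllib.parse.urlencode({
        "api_key": api_key,
        "code": code,       # base64 code string from echoprint-codegen
    }).encode("ascii")

def identify(api_key, code):
    """POST the code and return the list of matched songs (may be empty)."""
    body = build_identify_request(api_key, code)
    with urllib.request.urlopen(IDENTIFY_URL, data=body) as resp:
        return json.load(resp)["response"].get("songs", [])
```

Running this over several segments of the same recording (some with talking, some without) would give you a quick read on how much speech the matcher tolerates for your material.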