If you try the following query:
Note that this does not search for ‘Bing’; it searches for pages indexed from Bing.
You get about 20,500 search results from Bing. Normally this would not be such a big problem, but it seems that Google is crawling Bing search pages and incorporating Bing search results. That is rather odd for a company that has just accused its competitor of doing exactly that. (Never mind that Bing only used its toolbar as a URL discovery device, including all the context available at the moment of discovery, and not to ‘copy search results’, but that sounds a lot better in the press.)
On top of that the http://m.bing.com/robots.txt file (which seems to be the source of the results) has this entry in it:
This explicitly disallows crawler access to that directory (for all user agents, so that includes Googlebot).
I wonder how Google is going to explain the presence of these results away. Not only do they do exactly what they accuse the Bing team of (copying search results, even if the technical aspects are much more involved than that and the sticking point seems to be the ranking of some search results), they also blatantly ignore the robots.txt file, as evidenced by the fact that in plenty of these cases there is text accompanying the URL. If only the URL had been ‘indexed’ from link text, that would not have been the case.
The longer I spend researching this case, the more I think that Google will eventually regret going public with this rather than talking it over quietly with the Bing team. Even if we forget about the robots.txt file (see the case sensitivity issues outlined below), Google states unequivocally that looking at a competitor’s search results is somehow wrong, immoral or illegal, so they should not be doing it themselves even if they were permitted to do so. As for those ‘gotcha’ URLs and keywords not appearing elsewhere, we only have Google’s word for that, and in matters like this I’d like the observations to be made by impartial people, not by those in the employ of one party or the other. How Bing came by the ranking remains to be explained in a satisfactory way.
If you live in a glass house you should not throw stones. That probably goes for Google more than for any other company on the web, with regard to end user data and privacy issues as well as copying information from other websites.
edit: I’ve done some experimenting with Bing in the meantime and it seems that even though robots.txt is case sensitive, Bing is not!
Either the robots.txt is incomplete and should contain duplicates of every URL in every possible case variation you could type it in (Search, sEarch, seArch and so on, a task that I think is not feasible for a large enough website), or the robots.txt standard should be augmented to indicate whether the server is case sensitive when it parses path names. Otherwise you could always get around the robots.txt file on a case-insensitive server by changing the case of a few characters and claiming that that particular variation was not in the robots.txt file.
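To make the size of that problem concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. The `Disallow: /search` rule is a hypothetical one made up for illustration, not a copy of any real robots.txt, and `case_variants` is a helper invented for this example. It shows that a case-sensitive matcher blocks only the exact spelling, and that a six-letter path segment already has 64 case variations.

```python
from itertools import product
from urllib.robotparser import RobotFileParser

# Hypothetical rule, for illustration only (not taken from a real robots.txt).
ROBOTS_LINES = [
    "User-agent: *",
    "Disallow: /search",
]


def case_variants(word):
    """Yield every upper/lower-case spelling of `word`."""
    choices = [(c.lower(), c.upper()) if c.isalpha() else (c,) for c in word]
    for combo in product(*choices):
        yield "".join(combo)


parser = RobotFileParser()
parser.parse(ROBOTS_LINES)

variants = list(case_variants("search"))
# Paths that the case-sensitive parser still considers fetchable.
allowed = [v for v in variants if parser.can_fetch("*", "/" + v)]

print(len(variants), "case variants of 'search'")
print(len(allowed), "of them are not covered by the single Disallow rule")
```

The count doubles with every extra letter, so listing every variation of every path by hand quickly becomes impractical.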
It would be fairly trivial for bots to test whether the server is IIS (if the server identifies itself as such, of course) or to try to retrieve both Robots.txt and robots.txt; if those come back identical, the server can be assumed to be case insensitive.
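A rough sketch of that check might look like the following. Everything here is an assumption for illustration: the `looks_case_insensitive` function, the shape of the `fetch` callable (path in, status code, `Server` header and body out) and the two fake servers are invented for this example, and a real crawler would plug in its own HTTP client. It is a heuristic, not a guarantee.

```python
from typing import Callable, Optional, Tuple

# A fetcher takes a path and returns (status_code, server_header, body).
Fetcher = Callable[[str], Tuple[int, Optional[str], bytes]]


def looks_case_insensitive(fetch: Fetcher) -> bool:
    """Heuristically guess whether a server ignores case in path names."""
    status, server, body = fetch("/robots.txt")
    if status != 200:
        return False  # no robots.txt, so nothing to compare against
    # Hint 1: the server says it is IIS (only if it identifies itself).
    if server and "iis" in server.lower():
        return True
    # Hint 2: a differently cased path returns the very same file.
    alt_status, _, alt_body = fetch("/Robots.txt")
    return alt_status == 200 and alt_body == body


# Two fake servers standing in for real HTTP requests.
ROBOTS = b"User-agent: *\nDisallow: /search\n"


def case_sensitive_server(path):
    if path == "/robots.txt":
        return 200, "ExampleServer", ROBOTS
    return 404, "ExampleServer", b""


def case_insensitive_server(path):
    if path.lower() == "/robots.txt":
        return 200, None, ROBOTS
    return 404, None, b""


print(looks_case_insensitive(case_sensitive_server))
print(looks_case_insensitive(case_insensitive_server))
```

A bot that gets a positive answer could then lowercase both the rule and the requested path before matching, so that Search and sEarch are treated the same as search.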
For the record, I do not use Bing. I write this on a Linux computer, and the only thing running Microsoft software in this house is a virtual machine for testing purposes. Just in case I get accused of being in a camp ;)
You should follow me on twitter: http://twitter.com/jmattheij