Thursday, March 26, 2009

Semantic interpretation and the effectiveness of big data

Googlers Alon Halevy, Peter Norvig, and Fernando Pereira have an article, "The Unreasonable Effectiveness of Data" (PDF), in the April 2009 IEEE Intelligent Systems on semantic interpretation using big data.

Some excerpts:
The number of grammatical English sentences is theoretically infinite ... However, in practice we humans care to make only a finite number of distinctions. For many tasks, once we have a billion or so examples, we essentially have a closed set that represents (or at least approximates) what we need.

We're left with ... interpreting the content, which is mainly that of learning as much as possible about the context of the content to correctly disambiguate it .... What we need are methods to infer relationships between ... entities in the world. These inferences may be incorrect at times, but if they're done well enough we can connect disparate data collections and thereby substantially enhance our interaction with Web data.

Unlabeled data ... is so much more plentiful than labeled data ... With very large data sources, the data holds a lot of detail. For natural language applications, trust that human language has already evolved words for the important concepts. See how far you can go by tying together the words that are already there.
The article talks in more detail about work at Google and elsewhere on extracting relationships from massive crawls of text, tables, and the deep web.

On a related note, a couple of days ago Google announced some new features, improved query suggestions and snippets, which Googler Ori Allon apparently described as scanning pages "in real-time ... after a query is entered" and identifying "conceptually and contextually related sites/pages" using "an 'understanding' of content and context." Many news articles are referring to this as a step toward semantic search.

Please see also my April 2008 post, "GoogleBot starts on the deep web", which discusses related work by Alon Halevy on mining data in tables and the deep web.

Please see also my post on the WSDM 2008 keynote by Oren Etzioni on semantic interpretation. His work is mentioned a few times by Halevy et al.

[IEEE article found via the Google Research Blog]

Wednesday, March 25, 2009

Startups and being relentlessly resourceful

Paul Graham wrote an insightful article, "Be Relentlessly Resourceful", that captures what I think is the most important characteristic for the founders of a startup. Some excerpts:
A couple days ago I finally got being a good startup founder down to two words: relentlessly resourceful.

Not merely relentless. That's not enough to make things go your way except in a few mostly uninteresting domains.

In any interesting domain, the difficulties will be novel. Which means you can't simply plow through them, because you don't know initially how hard they are; you don't know whether you're about to plow through a block of foam or granite. So you have to be resourceful. You have to keep trying new things.

Be relentlessly resourceful.
Flickr is a great example of this. The company started as a massively multiplayer online game, failed at that, and morphed, in what appears to be pure grit and determination to succeed, into a photo sharing service.

More than any other characteristic, I think this is what to look for in the founding team of a startup: determination, flexibility, and a willingness to do what it takes to succeed. As Paul says, be relentlessly resourceful.

Please see also Paul Graham's recent post, "Startups in 13 Sentences", which offers some other excellent advice for startups.

For a darker view on being relentlessly resourceful, please see my Nov 2006 post, "Ruthless enough for a startup?"

Tuesday, March 24, 2009

What is a good recommendation algorithm?

I have an article on the new Communications of the ACM blog, blog@CACM, titled "What is a good recommendation algorithm?"

The post picks on the root mean squared error (RMSE) measure used for evaluating recommender systems in the Netflix Prize, talks about precision and why making recommendations is like search, and discusses some factors, not captured by RMSE, that affect people's perception of the usefulness and quality of recommendations.
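For readers who have not worked with these measures, here is a minimal sketch of the two being contrasted. It is only an illustration: the function names `rmse` and `precision_at_k` and the toy data are my own assumptions, not code from the CACM post or from the Netflix Prize. RMSE scores how close predicted ratings are to actual ratings across all items, while precision at k looks only at how many of the top k recommended items the person actually liked, which is closer to how search results are judged.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between predicted and actual ratings."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("need two non-empty rating lists of equal length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def precision_at_k(ranked_items, relevant_items, k):
    """Fraction of the top-k recommended items that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for item in ranked_items[:k] if item in relevant_items)
    return hits / k


# Toy data (hypothetical): predicted vs. actual star ratings for four movies.
predicted_ratings = [4.0, 3.0, 5.0, 2.0]
actual_ratings = [5.0, 3.0, 4.0, 2.0]
print(rmse(predicted_ratings, actual_ratings))

# Toy data (hypothetical): a ranked recommendation list and the items liked.
recommended = ["movie_a", "movie_b", "movie_c", "movie_d"]
liked = {"movie_a", "movie_c"}
print(precision_at_k(recommended, liked, k=2))
```

Note that RMSE weights an error on a movie nobody would ever be shown the same as an error on the top recommendation, while precision at k ignores everything below the cutoff.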

If you have thoughts on evaluating recommender systems, please go to the article and comment there. I left many questions unanswered in that post and was hoping to get a bit of a discussion going over on that new CACM blog.

The blog@CACM is just getting started, but the list of contributors is quite impressive and includes Peter Norvig, Daniel Reed, Michael Stonebraker, and many others. If you like, you can get the feed here.

By the way, if you are an ACM member or just remember it fondly from school, you might also go check out the new CACM website. It's been recently redesigned to emphasize news articles on the front page and in its many new feeds.

[Full disclosure: I'm on the CACM web board.]

Wednesday, March 11, 2009

Google launches personalized advertising

Miguel Helft at the New York Times writes in "Google to Offer Ads Based on Interests" that "Google will begin showing ads on Wednesday to people based on their previous online activities in a form of advertising known as behavioral targeting."
Google will assign those users to categories based on the content of the pages they visit. For example, a user may be pegged as a potential car buyer, sports enthusiast or expectant mother.

Google will then use that information to show people ads that are relevant to their interests, regardless of what sites they are visiting. An expectant mother may see an ad about baby products not only on a parenting site but also, for example, on a sports or fashion site that uses AdSense or on YouTube, which is owned by Google.

Google said the approach could help advertisers reach their audiences more easily and publishers to earn more from their sites. Users will also see ads that are more relevant to their interests, the company said.

Google said that it planned to segment users along 20 categories and nearly 600 subcategories.
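As a rough illustration of the idea described in the excerpt, and not a description of how Google actually does it, here is a minimal sketch of assigning a user to interest categories based on the pages they visit. Everything in it is a hypothetical assumption: the page URLs, the category labels, the `interest_categories` function, and the `min_visits` threshold. A real system would presumably derive categories from page content rather than from a hand-written table.

```python
from collections import Counter

# Hypothetical lookup from page to content category. A real system would
# classify the page content itself; this table stands in for that step.
PAGE_CATEGORIES = {
    "example.com/car-reviews": "Autos",
    "example.com/suv-comparison": "Autos",
    "example.com/baseball-scores": "Sports",
    "example.com/stroller-guide": "Parenting",
}


def interest_categories(visited_pages, page_categories, min_visits=2):
    """Return the categories a user visited at least min_visits times."""
    counts = Counter(
        page_categories[page]
        for page in visited_pages
        if page in page_categories  # pages with no known category are skipped
    )
    return sorted(cat for cat, n in counts.items() if n >= min_visits)


history = [
    "example.com/car-reviews",
    "example.com/suv-comparison",
    "example.com/baseball-scores",
    "example.com/unknown-page",
]
print(interest_categories(history, PAGE_CATEGORIES))
```

In this sketch the user would be pegged as interested in autos, and that label could then be used to pick ads regardless of which site the user is on.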
Barry Schwartz over at Search Engine Land, in an excellent and detailed review of the new product, writes:
I asked Google how detailed can these ads get? I asked, can an advertiser pass along a specific ad to a specific user? For example, can I show an ad for the Sony HDR-XR200V if this user added the Sony HDR-XR200V to their shopping cart on my site but did not check out?

[Brad] Bender [Google’s Product Manager] said yes, but ultimately it is up to the advertiser how specific they want to get with those ads.
The post on the Official Google Blog about this new product, "Making ads more interesting", is rather bland, but it does give a good description of the goal of personalized advertising:
By making ads more relevant ... we can create more value for everyone. Users get more useful ads, and these more relevant ads generate higher returns for advertisers and publishers.
Please see also my Oct 2008 post, "Google describes perfect advertising", which quotes an article by Googler Susan Wojcicki and gives more context to what Google is trying to do with personalized advertising.

Please see also my May 2004 post, "Bringing sense to web advertising", and my July 2007 post, "What to advertise when there is no commercial intent?"

Update: Philipp Lenssen posts the full list of category interests used by Google.