Monday, April 28, 2008

Questioning Yahoo Answers

Zoltan Gyongyi, Georgia Koutrika, Jan Pedersen, and Hector Garcia-Molina published a paper at WWW 2008, "Questioning Yahoo Answers", that has an interesting analysis of how people are using Yahoo Answers.

Some excerpts:
The majority of users are askers ... [Only] a small portion of the user population ... provides answers or votes .... In practice, a large fraction of best answers are of questionable value .... Most answerers are not domain experts focusing on a particular topic, but rather individuals with diverse interests quite eager to explore questions of different nature.

Many questions are meant to trigger discussions, encourage the users to express their opinions, etc. The expected answers are subjective. Thus, in a traditional sense, there is no bad or good (or best) answer to such "questions." Consider, for example, "What do you think of Red Hot Chili Peppers?"

Note, however, that Yahoo! Answers was not designed with discussions in mind, and so it is an imperfect medium for this purpose. Users cannot answer their own questions, thus cannot participate in a discussion. Similarly, a user may only post at most one answer to a question. Yahoo! Answers has no support for threads that would be essential for discussions to diverge.

Much of the interaction on Yahoo! Answers is just noise. People post random thoughts as questions, perhaps requests for instant messaging. With respect to the latter, the medium is once again inappropriate because it does not support real-time interactions.

This evidence and our results point to the following conclusion: A question answering system, such as Yahoo! Answers, needs appropriate mechanisms and strategies that support and improve the question-answering process per se. Ideally, it should facilitate users finding answers to their information needs and experts providing answers.
Please see also my October 2007 post, "Revisiting Yahoo Answers", where I look at Yahoo Answers not as a question answering site, but as what it primarily seems to be used for, a discussion forum.

Search trails and relevance

Misha Bilenko and Ryen White from Microsoft Research had a paper at WWW 2008, "Mining the Search Trails of Surfing Crowds: Identifying Relevant Websites From User Activity" (PDF), a fascinating look at moving beyond the first click on search results and toward using all the pages people visit to improve search results.

An excerpt from the paper:
While query and clickthrough logs from search engines have been shown to be a valuable source of implicit supervision for training retrieval methods, the vast majority of users' browsing behavior takes place beyond search engine interactions.

This paper proposes exploiting a combination of searching and browsing activity of many users to identify relevant resources for future queries. To the best of our knowledge, previous approaches have not considered mining the history of user activity beyond search results, and our experimental results show that comprehensive logs of post-search behavior are an informative source of implicit feedback for inferring resource relevance.

Web browser toolbars have become increasingly popular ... Examples of popular toolbars include those affiliated with search engines (e.g., Google Toolbar, Yahoo! Toolbar, and Windows Live Toolbar) ... Most popular toolbars log the history of users' browsing behavior on a central server for users who consented to such logging. Each log entry includes an anonymous session identifier, a timestamp, and the URL of the visited Web page. From these and similar interaction logs, user trails can be reconstructed.

Training retrieval algorithms on interaction behavior from navigation trails following search engine result click-through leads to improved retrieval accuracy over training on only result click-through or search destinations ... Our research has profound implications for the design of Web search ranking algorithms and the improvement of the search experience for all search engine users.
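
As a rough illustration of the data involved, here is a minimal sketch of reconstructing trails from such toolbar logs. The log field layout and the test for a search results page are my assumptions for illustration, not details from the paper.

    from itertools import groupby
    from urllib.parse import urlparse

    # Hypothetical toolbar log entries: (session_id, timestamp, url).
    # A trail is the sequence of pages a user visits after leaving a
    # search results page, until the next search begins.

    def is_search_results_page(url):
        # Assumed check; a real system would match known engines' URL patterns.
        host = urlparse(url).netloc
        return any(engine in host for engine in ("google.", "yahoo.", "live."))

    def reconstruct_trails(log_entries):
        trails = []
        entries = sorted(log_entries)  # order by (session_id, timestamp)
        for session_id, visits in groupby(entries, key=lambda e: e[0]):
            trail = []
            for _, timestamp, url in visits:
                if is_search_results_page(url):
                    if trail:
                        trails.append(trail)
                    trail = []  # each new search starts a new trail
                else:
                    trail.append(url)
            if trail:
                trails.append(trail)
        return trails
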
Please see also Googler Daniel Russell's JCDL 2007 talk, "What are they thinking? Searching for the mind of the searcher" (PDF), which shows, starting on slide 33, that Google is using their toolbar data to analyze user behavior.

Thanks, Ionut Alex Chitu, for the pointer to Daniel's JCDL talk.

Keynotes at WWW 2008

The keynote talks at WWW 2008 promised to be exciting. With names like Google VP Kai-Fu Lee (whose departure from Microsoft famously revealed Ballmer's hidden passion for hurling chairs), Microsoft VP Harry Shum, inventor of the Web Tim Berners-Lee, and AT&T Chief Scientist David Belanger, how could you go wrong?

Surprisingly, it did go wrong. Everyone I talked to gave reviews of the keynotes that ranged from lukewarm to blisteringly negative.

The general conclusion, which also agrees with my perception, seemed to be that Kai-Fu Lee and Harry Shum gave pitches for Google and Microsoft rather than vision. And, in the Great Hall of the People, Tim Berners-Lee seemed woefully unprepared, rambling with such a lack of coherent theme that many buried themselves in alcohol to escape the pain.

Nevertheless, there were tidbits that could be extracted. Comparing Kai-Fu Lee's and Harry Shum's talks, I was struck by the way Harry Shum focused entirely on search while Kai-Fu seemed to be looking beyond search.

Kai-Fu talked mostly about cloud computing and showed Google products as examples of the shift. For example, he pointed to Gmail and Google Docs as applications moving to the browser, and to the integration of those applications as an example of how the browser can make applications more "task-focused" (an argument I found quite weak, by the way, even though I tend to support the ideas behind cloud computing).

Harry Shum focused on improving search, noting that "40% of queries go unanswered" and "50% require refinement", and then saying we need to look at concepts, topics, entities, and tasks to try to figure out "what the user really wants to do".

On the one hand, both talks seemed odd to me: the search giant Google offered a keynote at a Web conference that was not about search, whereas Microsoft used its keynote to talk about little else.

On the other hand, both talks may have been looking at what the companies need to do to expand beyond where they currently are on the Web. Google needs to move beyond search and advertising. Microsoft needs to get into search and advertising. In that framework, the two speeches make a lot of sense.

David Belanger's sparsely attended talk was about how we can integrate data across three distinct interfaces: mobile, PC, and television. Unlike some who claim these three are converging, David argued that TV, PC, and mobile will stay separate and that the key question will be interfacing them with each other. I was pleased that David pushed recommendations for information as necessary "in the not too distant future" when we have "access to any video ever made" and only a weak ability to specify our interests beyond a desire to watch something fun and interesting. Sadly, he did not expand much on that point, instead devoting much of the talk to promoting AT&T and discussing low-level networking details of data sharing and synchronization.

It is hard to say much about Tim's talk other than that he was enthusiastic about globalization, internationalization, and the semantic web. Tim pointed out that Chinese was poised to become the dominant language on the Web and urged everyone to think about the changes that would cause. He promoted the semantic web, but offered nothing new. The remainder of the talk was so badly unstructured -- nearly a stream of consciousness -- that it was hard to follow despite the passion he tried to put behind it.

Friday, April 18, 2008

GoogleBot starts on the deep web

Jayant Madhavan and Alon Halevy have a post on the Official Google Blog, "Crawling through HTML forms", announcing that the Google crawler will start submitting queries to forms in order to pull data out of databases that otherwise would not be visible.

Please see also my March 2007 posts, "The end of federated search?" and "Google and the deep web", which discuss several papers out of Google (that Jayant and Alon authored) that appear to describe Google's thoughts on what to do about the deep web.

Mousetracking in web search

Googlers Kerry Rodden, Xin Fu, Anne Aula, and Ian Spiro had a short paper last week at CHI 2008, "Eye-mouse coordination patterns on web search results pages" (ACM site), with some very nice graphs showing common mouse behaviors on web search results and how they correspond to eye movements over the page (see Figures 1, 2, and 3).

Unfortunately, their results seem to indicate that mouse tracking is only weakly correlated with eye tracking, so mouse tracking might be less useful than one would hope for determining which items on a web search result page were never even seen by a searcher.

For more on using mouse tracking as a replacement for eye tracking to understand what people look at on web search result pages, please see my earlier post, "Cheap eyetracking using mouse tracking", and its discussion of a SIGIR 2007 paper by some of the same Google authors.

Udi Manber interview in Popular Mechanics

Google VP Udi Manber has an interesting interview in Popular Mechanics.

An excerpt on learning to rank from searcher behavior:
Our goal is very simple: We want to return to the user the answer that they need.

The results we show you are based not only on what we know of the Web, but also what other people have searched for .... Signals from people are the best signals.
An excerpt on personalization:
If you allow us to keep your Web history, we will improve your search ... in two ways. One is, we will tune the result for you slightly. We're not going to change the whole page -- we might change position 5 to position 3 here and there, but we'll use whatever we can from your previous searches to adapt the current search to you.

The second is, we allow you to search within your Web history ... You may remember something you did three months ago and you don't remember exactly how you did it.
And, an excerpt on how improving overall relevance requires accumulating many small twiddles that apply to specific types of searches and needs:
Last year we made over 450 improvements to the algorithm .... We have to find what weakness in the algorithm caused that result and find a general solution to that, evaluate whether a general solution really works and if it's better, and then launch a general solution.

I'll give you an example of something that came last week. We were evaluating a certain algorithm that adds diversity to the result. We did live experiments, which means we launched the algorithm to a very small percentage of users and then see how that compares to the result without the algorithm.

One of the queries that made a difference: The query was, New York Times address ... The first result right there on the snippet gives you The New York Times. It turns out that's not what the user was looking for. They were looking for an address given out by a New York Times reporter the day before. And because of this diversity and because of our emphasis on freshness and highlighting fresh results, that particular address appeared somewhere in the results, and that's what the user wanted -- that's what they went to and got the result.

That was something that surprised even us. You don't think that when someone searches for New York Times address that they're not looking for the address. Language is like that. Intention can be ambiguous.
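
As an aside, the live experiments Udi describes are typically implemented by deterministically assigning a small slice of users to the new algorithm. Here is a minimal sketch of that bucketing; the hashing scheme, the 1% slice, and the diversity_rerank function are all my assumptions, not details from the interview.

    import hashlib

    def in_experiment(user_id, experiment_name, fraction=0.01):
        # Hash (user, experiment) so each user consistently sees the same
        # variant and each experiment draws an independent slice of users.
        digest = hashlib.md5(f"{user_id}:{experiment_name}".encode()).hexdigest()
        return int(digest, 16) % 10000 < fraction * 10000

    # Hypothetical usage: serve the new diversity algorithm to 1% of
    # users, then compare their clicks against the control group.
    # if in_experiment(user_id, "diversity-rerank"):
    #     results = diversity_rerank(results)
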
There is a recurring theme in Udi's comments: learning from searchers' behavior. Google learns from what people searched for. Google learns from a searcher's history. Google learns what works and what does not from what people click on. Google learns from what people do when they use Google.

For more on that, please see also my previous post, "Actively learning to rank".

Please see also my earlier post, "The perils of tweaking Google by hand", which argues that these thousands of twiddles to the search engine, and variations of them, probably should be constantly tested rather than just evaluated at the time they are created.

Wednesday, April 16, 2008

Detecting near duplicates in big data

I finally got to a WWW 2007 paper out of Google I have been meaning to read, "Detecting Near-Duplicates for Web Crawling" (PDF) by Gurmeet Manku, Arvind Jain, and Anish Das Sarma.

The paper takes more theoretical work by Moses Charikar from back in 2002, "Similarity Estimation Techniques from Rounding Algorithms" (ACM site), which describes a form of locality-sensitive hashing, and applies it at very large scale (8B documents), dealing with all the practical issues along the way.

In that sense, this Google WWW 2007 paper has a lot of similarities with Monika Henzinger's excellent SIGIR 2006 paper, "Finding Near-Duplicate Web Pages: A Large-Scale Evaluation of Algorithms" (ACM site). That paper evaluated the shingles method of creating fingerprints of substrings from a document against Charikar's technique on a 1.6B document data set and found the latter to be superior.

Some excerpts from the Google WWW 2007 paper:
Documents that are exact duplicates of each other ... are easy to identify ... A more difficult problem is the identification of near-duplicate documents ... [that] differ in a small portion of the document such as advertisements, counters and timestamps.

Charikar's simhash ... is a fingerprinting technique that enjoys the property that fingerprints of near-duplicates differ in a small number of bit positions.

We maintain an f-dimensional vector V, each of whose dimensions is initialized to zero. A feature is hashed into an f-bit hash value. These f bits (unique to the feature) increment/decrement the f components of the vector by the weight of that feature as follows: if the i-th bit of the hash value is 1, the i-th component of V is incremented by the weight of that feature; if the i-th bit of the hash value is 0, the i-th component of V is decremented by the weight of that feature. When all features have been processed, some components of V are positive while others are negative. The signs of components determine the corresponding bits of the final fingerprint.

Simhash possesses two conflicting properties: (A) The fingerprint of a document is a "hash" of its features, and (B) Similar documents have similar hash values.
The paper goes on to describe how they efficiently find nearby but different hash values that represent similar documents, how they rapidly build the simhash using MapReduce, and metrics on the effectiveness of the technique.
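
To make the excerpt above concrete, here is a minimal sketch of the simhash construction in Python. The (token, weight) feature representation and the 64-bit MD5-based hash are my choices for illustration, not the paper's.

    import hashlib

    def simhash(features, f=64):
        # features: iterable of (token, weight) pairs for one document.
        v = [0] * f
        for token, weight in features:
            h = int(hashlib.md5(token.encode()).hexdigest(), 16)
            for i in range(f):
                # Bit i of the token's hash decides whether the token's
                # weight is added to or subtracted from component i.
                v[i] += weight if (h >> i) & 1 else -weight
        # The sign of each component gives one bit of the fingerprint.
        fingerprint = 0
        for i in range(f):
            if v[i] > 0:
                fingerprint |= 1 << i
        return fingerprint

    def hamming_distance(a, b):
        # Near-duplicate documents yield fingerprints that differ
        # in only a small number of bit positions.
        return bin(a ^ b).count("1")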

Going to WWW

I will be at the WWW 2008 Conference in Beijing next week.

Very much looking forward to it! If you see me there, please say hello!

Google engEdu talk on Bloom filters

Ely Porat gave a Google engEdu talk, "The Bloom filter", with a good survey of Bloom filters and variants on Bloom filters.

Bloom filters let you quickly determine if an item is a member of a set with no false negatives but some false positives. They work well for very large data sets since they require about 10 bits per item for a 1% false positive rate and offer constant time inserts and lookups.
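
For reference, here is a minimal Bloom filter sketch. Sizing the bit array as m = -n ln(p) / (ln 2)^2 works out to roughly 9.6 bits per item at a 1% false positive rate, which is where the "about 10 bits per item" figure comes from. The double-hashing trick for deriving the k hash functions is a common choice of mine, not necessarily what Ely describes in the talk.

    import hashlib
    import math

    class BloomFilter:
        def __init__(self, n, p=0.01):
            # m = -n ln(p) / (ln 2)^2 bits, k = (m / n) ln 2 hash functions.
            self.m = int(-n * math.log(p) / math.log(2) ** 2)
            self.k = max(1, round(self.m / n * math.log(2)))
            self.bits = bytearray((self.m + 7) // 8)

        def _positions(self, item):
            # Derive k indexes from two halves of one digest
            # (the Kirsch-Mitzenmacher double-hashing trick).
            digest = hashlib.sha256(item.encode()).digest()
            h1 = int.from_bytes(digest[:8], "big")
            h2 = int.from_bytes(digest[8:16], "big")
            return [(h1 + i * h2) % self.m for i in range(self.k)]

        def add(self, item):
            for pos in self._positions(item):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def __contains__(self, item):
            # No false negatives: every added item tests True; a small
            # fraction of items never added also test True.
            return all(self.bits[pos // 8] & (1 << (pos % 8))
                       for pos in self._positions(item))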

Ely's talk provides some useful motivation at the beginning, then has a nice, light introduction to Bloom filters starting at 8:40. The discussion of variants that support deletion or compression is interesting but, if you find yourself dragging, make sure to at least jump to 30:40 for the cute discussion of Cuckoo Hashing and the Cuckoo Bloom Filter.

Please see also the "Bloom filter" Wikipedia page which contains a nice summary of Bloom filters and many of the variants described in Ely's talk.

Tuesday, April 15, 2008

Contextual advertising and social networks

Anand Rajaraman makes a counter-intuitive but compelling argument that having a close relationship between people who view a page reduces the value of the page for contextual advertising.

An excerpt:
Consider the difference between a Facebook profile and a TripAdvisor travel review. A typical [Facebook] pageview ... is by someone known very well to the creator of the profile – a close friend or acquaintance ... A TripAdvisor travel review is seen by people completely unrelated in any way to the person or persons who wrote the reviews on the page.

The "affinity" of a social media service is the average closeness of relationship between a content creator and someone who views that content. The affinity of Facebook is very high, while the affinity of TripAdvisor is very low.

There is an inverse relationship between the affinity of a social media service and its targetability. Why is this true? The act of viewing a Facebook profile gives us very little information about the viewer, other than the fact that she is friends with the profile creator; when someone views a TripAdvisor travel review, she is definitely interested in traveling to that location.
When a page draws from a wide audience, the people who get to that page often come from search and are interested in the content of that page. So, they can be targeted effectively by targeting the content of the page.

When a page draws from a narrow audience, like a group of friends, those people are interested less in the content than in the person who created it. So, targeting the content, as contextual advertising does, will be less effective than targeting the people.

I have said before that lack of commercial intent is the problem on social networking sites, but Anand makes the good point that diversity of interest may be a better way of describing the problem. By summarizing why people come to a page, affinity may explain the difficulty of targeting advertising to that page.
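
To pin down Anand's metric, here is a toy sketch of computing affinity for a service; the pairwise closeness score is hypothetical.

    def affinity(pageviews, closeness):
        # pageviews: list of (creator, viewer) pairs for a service.
        # closeness: hypothetical function scoring a relationship in [0, 1]
        # (near 1.0 for close friends, near 0 for strangers).
        scores = [closeness(creator, viewer) for creator, viewer in pageviews]
        return sum(scores) / len(scores)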

Friday, April 11, 2008

Cheap personalization using the referrer

Danny Sullivan at Search Engine Land writes:
Previous query refinement is now coming to unpaid or "organic" search results, [Google VP Marissa Mayer] said.

For example, if someone were to search for [spain] and then [travel] after that, BOTH the ads and the organic results will be altered to take the previous query into account. To some degree, it will be as if the second query was for [spain travel].

This is a big deal. Big deal. It means that the results for many "single word" queries, which can be hard for sites to rank for when billions of listings come back, will become queries involving two or more words -- and much more specific ones.
This is a cheap and easy form of personalized search and personalized advertising that uses the history stored in the referrer instead of maintaining a separate, longer history.
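
To illustrate the mechanism, here is a minimal sketch of pulling the previous query out of the HTTP referrer. The q parameter and the crude host check are my assumptions about how such a feature might work, not a description of Google's implementation.

    from urllib.parse import urlparse, parse_qs

    def previous_query(referrer):
        # If the click came from a search results page, the referrer URL
        # carries the prior query in its q= parameter.
        parsed = urlparse(referrer)
        if "google." in parsed.netloc:
            values = parse_qs(parsed.query).get("q")
            return values[0] if values else None
        return None

    # A search for [travel] arriving from a results page for [spain]
    # can then be treated roughly like a search for [spain travel].
    prev = previous_query("http://www.google.com/search?q=spain")
    refined = f"{prev} travel" if prev else "travel"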

As Danny says, the goal of personalized search is to help us find what we need faster and more easily by creating a dialogue with our search engine. No longer is each search treated as independent, but what we find or do not find in earlier searches impacts our future searches.

Thursday, April 10, 2008

Replication, caching, and partitioning

Brian Aker (who is a Director at MySQL) posts about "The Death of Read Replication".

It is a bit of a ramble, but Brian's thoughts are always worth a read. Brian's main point is that people have moved away from read-only replication and toward caching layers in front of the database like memcached.

My opinion on this differs somewhat. I agree that read-only replication is at best a temporary scaling solution, but I disagree that object caches are the solution.

I think caching is way overdone, to the point that, in some designs, the caching layer sometimes contains more machines than the database layer. Caching layers add complexity to the design, latency on a cache miss, and inefficient use of cluster resources.

I tend to be more of a fan of partitioning. I often suspect the design of the data store could be both simpler and faster if the database layer were massively partitioned, the data sharded, the caching layer removed, and the machines allocated to the caching layer moved to the database layer instead.
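
As a toy illustration of what I mean by partitioning, here is a sketch that routes each key directly to the database shard that owns it. The simple modulo scheme and the connection API are placeholders of mine; real systems often prefer consistent hashing to ease resharding.

    import hashlib

    class ShardedStore:
        def __init__(self, connections):
            # One connection per database shard; each shard holds its
            # partition of the data and keeps its working set in memory.
            self.connections = connections

        def shard_for(self, key):
            h = int(hashlib.md5(key.encode()).hexdigest(), 16)
            return self.connections[h % len(self.connections)]

        def get(self, key):
            # No separate caching tier: the read goes straight to the
            # shard that owns the key (hypothetical connection API).
            return self.shard_for(key).get(key)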

If the database layer can keep its working set in memory and its own caches are well designed, it should be able to exceed the performance and scalability of an architecture with a separate caching layer. It is only when a database is horribly slow (because, for example, it is always hitting disk) that caching layers look so attractive. Rather than putting layers in front to hide your shame, why not fix that slow database?

Please see also my May 2006 post, "Wikipedia and databases", which discusses Wikipedia's architecture which, at the time, had one huge, overwhelmed, I/O-bound database using read-only replication to other databases and a large caching layer.

Update: In the comments, Brian swings by and argues for a balance between partitioning and caching (and kindly forgives me for my use of "ramble").

Update: François Schiettecatte writes ([1] [2]):
At Feedster ... once we got powerful enough servers for the DBMS, we found ... memcached ... was a hindrance more than anything.

Ideally you want to ... [not] need to use caching because you get ... to your data fast enough ... That ... means having an optimized schema and a sharded database ... Take the memory you would use for caching and give it to [your] database servers.