Tuesday, February 17, 2009

Jeff Dean keynote at WSDM 2009

Google Fellow Jeff Dean gave an excellent keynote talk at the recent WSDM 2009 conference that had tidbits on Google I had not heard before. Particularly impressive is Google's attention to detail on performance and their agility in deployment over the last decade.

Jeff gave several examples of how Google has grown from 1999 to 2009. They have x1000 the number of queries now. They have x1000 the processing power (# machines * speed of the machines). They went from query latency normally under 1000ms to normally under 200ms. And, they dropped the update latency by a factor of x10000, going from months to detect a changed web page and update their search results to just minutes.

The last of those is very impressive. Google now detects many web page changes nearly immediately, computes an approximation of the static rank of that page, and rolls out an index update. For many pages, search results now change within minutes of the page changing. There are several hard problems there -- frequency and importance of recrawling, fast approximations to PageRank, and an architecture that allows rapid updates to the index -- that they appear to have solved.

Their performance gains are also impressive, now serving pages in under 200ms. Jeff credited the vast majority of that to their switch to holding indexes completely in memory a few years back. While that now means that a thousand machines need to handle each query rather than just a couple dozen, Jeff said it is worth it to make searchers see search results nearly instantaneously.
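
For a rough sense of what serving from a partitioned in-memory index looks like, here is a toy scatter-gather sketch (my own illustration, not Google's architecture): the query is fanned out to every shard server, each returns its top results from memory, and the results are merged.

```python
# A toy scatter-gather sketch of serving a query from an index partitioned
# across many machines, each holding its shard in memory. Shard servers are
# simulated in-process here; in a real system each would be a separate machine.

import heapq

class ShardServer:
    def __init__(self, postings):
        self.postings = postings            # {term: [(score, doc_id), ...]} held in memory

    def search(self, term, k):
        """Return this shard's top-k (score, doc_id) pairs for the term."""
        return heapq.nlargest(k, self.postings.get(term, []))

def scatter_gather(shards, term, k):
    """Fan the query out to every shard and merge the per-shard top-k lists."""
    partial_results = [shard.search(term, k) for shard in shards]
    return heapq.nlargest(k, (hit for hits in partial_results for hit in hits))

if __name__ == "__main__":
    shards = [
        ShardServer({"wsdm": [(0.9, "doc1"), (0.2, "doc7")]}),
        ShardServer({"wsdm": [(0.8, "doc42")], "google": [(0.5, "doc3")]}),
    ]
    print(scatter_gather(shards, "wsdm", k=2))   # [(0.9, 'doc1'), (0.8, 'doc42')]
```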

The attention to detail at Google is remarkable. Jeff gleefully described the various index compression techniques they created and used over the years. He talked about how they finally settled on a format that grouped four position deltas together in order to minimize the number of shift operations needed during decompression. Jeff said they paid attention to where their data was laid out on disk, keeping the data they needed to stream over quickly always on the faster outer edge of the disk, leaving the inside for cold data or short reads. They wrote their own recovery for errors with non-parity memory. They wrote their own disk scheduler. They repeatedly modified the Linux kernel to meet their needs. They designed their own servers with no cases, then switched to more standard off-the-rack servers, and now are back to custom servers with no cases again.
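
The grouped-delta idea is easy to sketch. Below is a minimal, hypothetical example (not Google's actual format) of packing four position deltas per group, with a single tag byte carrying the four byte lengths, so the decoder avoids the per-value bit shifting that a bit-aligned code would require:

```python
# A minimal sketch of group-based delta encoding for term positions.
# This is NOT Google's actual format, just an illustration of the idea:
# store four deltas per group with one tag byte giving each delta's byte
# length (deltas are assumed to fit in 4 bytes).

def encode_positions(positions):
    """Delta-encode sorted positions into groups of four variable-length ints."""
    deltas = [positions[0]] + [b - a for a, b in zip(positions, positions[1:])]
    while len(deltas) % 4 != 0:      # pad so every group is full
        deltas.append(0)
    out = bytearray()
    for i in range(0, len(deltas), 4):
        group = deltas[i:i + 4]
        sizes = [1 if d < 1 << 8 else 2 if d < 1 << 16 else 4 for d in group]
        tag = 0
        for j, size in enumerate(sizes):
            tag |= {1: 0, 2: 1, 4: 2}[size] << (2 * j)   # 2 bits per delta
        out.append(tag)
        for d, size in zip(group, sizes):
            out += d.to_bytes(size, "little")
    return bytes(out), len(positions)

def decode_positions(data, count):
    """Invert encode_positions, returning the original position list."""
    sizes_by_code = {0: 1, 1: 2, 2: 4}
    deltas, i = [], 0
    while i < len(data):
        tag = data[i]
        i += 1
        for j in range(4):
            size = sizes_by_code[(tag >> (2 * j)) & 0x3]
            deltas.append(int.from_bytes(data[i:i + size], "little"))
            i += size
    positions, total = [], 0
    for d in deltas[:count]:
        total += d
        positions.append(total)
    return positions

if __name__ == "__main__":
    positions = [3, 17, 18, 105, 2000, 2001, 70000]
    encoded, n = encode_positions(positions)
    assert decode_positions(encoded, n) == positions
    print(len(encoded), "bytes for", n, "positions")
```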

Google's agility is impressive. Jeff said they rolled out seven major rearchitecture efforts in ten years. These changes often involved completely different index formats or totally new storage systems such as GFS and BigTable. In all of these rollouts, Google always could and sometimes did immediately roll back if something went wrong. In some of these rollouts, they went as far as to have a new datacenter running the new code, an old datacenter running the old, and switch traffic between datacenters. Day to day, searchers were constantly experiencing much smaller changes from experiments and tests of new code. Google does all of this quickly and quietly, without searchers noticing anything has changed.

The raw computational power is staggering already -- thousands of machines for a single request -- but what is to come seems nearly unbelievable. Jeff said that Google's machine translation models use a million lookups in a multi-terabyte model just to translate one sentence. Jeff followed by saying that Google's goal is to make all information in all languages accessible regardless of which language you choose to speak. The amount of processing required is difficult to fathom, yet the kind of computational mountain that might cause others to falter seems only to call out to Googlers.

In all, a great talk and a great start to the WSDM 2009 conference. If you want to catch Jeff's talk yourself, and I highly recommend you do, the good people at videolectures.net were filming it. With any luck, the video should be available on their site in a few weeks.

In addition to this post, you might enjoy Michael Bendersky's notes on Jeff Dean's talk. It appears Michael took about as detailed notes as I did.

Update: Jeff now has made his slides (PDF) from his talk available.

Update: Jeff Dean's talk is now available on videolectures.net along with many of the other talks from WSDM 2009.

Saturday, February 07, 2009

Book review: Introduction to Information Retrieval

My old favorite book on search, "Managing Gigabytes", is getting quite dated at this point. I've been looking for one I liked as much for the last few years, but just haven't come across anything as practical or useful as that old text.

Now, I think I may have found a new favorite. Three search gurus, Chris Manning, Prabhakar Raghavan (head of Yahoo Research), and Hinrich Schutze, just published a wonderful new book, "Introduction to Information Retrieval".

If you work in search or if you are just the kind of person who reads textbooks for fun, this is a great one. It not only describes how to build a search engine (including crawling, indexing, ranking, classification, and clustering), but also offers the kind of opinionated wisdom you can only get from people who have had substantial experience using these techniques at large scale.

Let me try to pick out some excerpts to show at least a few examples of why I found this book so valuable. First, on the difference between perceived relevance and measured relevance:
The key utility measure is user happiness. Speed of response and the size of the index are factors in user happiness. It seems reasonable to assume that relevance of the results is the most important factor: blindingly fast, useless answers do not make a user happy. However, user perceptions do not always coincide with system designers' notions of quality. For example, user happiness commonly depends very strongly on user interface design issues, including the layout, clarity, and responsiveness of the user interface, which are independent of the quality of the results returned.
This is a point that is often missed in information retrieval. It is extremely common to measure the quality of search results by having judges sit down and mark how relevant they think the results are. That ignores speed, layout, the quality of nearby results, and the many other factors that go into perceived quality and impact user satisfaction.

On compression:
One benefit of compression is immediately clear. We need less disk space.

There are two more subtle benefits of compression. The first is increased use of caching ... With compression, we can fit a lot more information into main memory. [For example,] instead of having to expend a disk seek when processing a query with t, we instead access its postings list in memory and decompress it ... Increased speed owing to caching -- rather than decreased space requirements -- is often the prime motivator for compression.

The second more subtle advantage of compression is faster transfer of data from disk to memory ... We can reduce input/output (IO) time by loading a much smaller compressed posting list, even when you add on the cost of decompression. So, in most cases, the retrieval system runs faster on compressed postings lists than on uncompressed postings lists.
The authors go on to say that simple compression techniques might be better for some applications than more complicated algorithms because small additional reductions in data size may not be worth any substantial increase in decompression time.
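
To make that concrete, here is a small sketch in the spirit of the simple variable byte (VB) scheme the book's compression chapter describes, applied to docID gaps (my own toy code, not the book's):

```python
# A small sketch of variable byte (VB) encoding of docID gaps, one of the
# simple postings-list compression schemes the book discusses. Each gap is
# written 7 bits at a time; the high bit marks the last byte of a gap.

def vb_encode(doc_ids):
    """Encode a sorted list of docIDs as VB-coded gaps."""
    out = bytearray()
    prev = 0
    for doc_id in doc_ids:
        gap = doc_id - prev
        prev = doc_id
        chunk = []
        while True:
            chunk.insert(0, gap % 128)
            if gap < 128:
                break
            gap //= 128
        chunk[-1] |= 0x80          # high bit terminates the gap
        out.extend(chunk)
    return bytes(out)

def vb_decode(data):
    """Decode VB-coded gaps back into docIDs."""
    doc_ids, n, current = [], 0, 0
    for byte in data:
        if byte < 128:
            n = n * 128 + byte
        else:
            n = n * 128 + (byte & 0x7F)
            current += n
            doc_ids.append(current)
            n = 0
    return doc_ids

if __name__ == "__main__":
    postings = [824, 829, 215406]          # example docIDs
    encoded = vb_encode(postings)
    assert vb_decode(encoded) == postings
    print(f"{len(encoded)} bytes vs {4 * len(postings)} bytes uncompressed")
```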

On supporting phrase searches:
If users commonly query on particular phrases, such as Michael Jackson, it is quite inefficient to keep merging positional posting lists. A combination strategy uses a phrase index, or just a biword index, for certain queries, and uses a positional index for other phrase queries. Good queries to include in the phrase index are ones known to be common based on recent querying behavior. But this is not the only criterion: the most expensive phrase queries to evaluate are ones where the individual words are common but the desired phrase is comparatively rare.

Adding Britney Spears as a phrase index may only give a speedup factor to that query of about three, because most documents that mention either word are valid results, whereas adding The Who as a phrase index entry may speed up that query by a factor of 1,000. Hence, having the latter is more desirable, even if it is a relatively less common query.
Another factor that might be important to consider when picking phrases to put in the phrase index is the caching layer. If [Britney Spears] is almost always cached when queried, it will be satisfied without hitting the posting lists. You probably want to use the frequency of the query hitting the posting lists, not the frequency of the query in the logs, when looking at what to put in a phrase index.
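
As a rough illustration, here is a hypothetical heuristic (my sketch, not from the book) for scoring phrase-index candidates: weigh the posting-list work a phrase entry would save by how often the query actually reaches the posting lists instead of being answered from cache.

```python
# A hypothetical heuristic for choosing entries for a phrase index.
# Score = (posting-list work saved per evaluation) * (queries that miss the cache).
# All document frequencies, query counts, and cache hit rates below are made up.

def phrase_index_score(df_word1, df_word2, df_phrase, query_count, cache_hit_rate):
    """Estimate the posting-list work saved by adding a phrase to the phrase index."""
    work_without = df_word1 + df_word2      # merge both positional postings lists
    work_with = df_phrase                   # read the phrase's postings list directly
    savings_per_query = work_without - work_with
    queries_hitting_postings = query_count * (1.0 - cache_hit_rate)
    return savings_per_query * queries_hitting_postings

candidates = {
    # phrase: (df word1, df word2, df phrase, daily queries, cache hit rate)
    "britney spears": (1_000_000, 900_000, 600_000, 50_000, 0.95),
    "the who":        (50_000_000, 5_000_000, 50_000, 5_000, 0.50),
}

for phrase, stats in sorted(candidates.items(),
                            key=lambda kv: phrase_index_score(*kv[1]),
                            reverse=True):
    print(f"{phrase}: score = {phrase_index_score(*stats):,.0f}")
```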

On compression, skip lists, and other attempts to speed up access to the index:
Choosing the optimal encoding for an inverted index is an ever-changing game for the system builder, because it is strongly dependent on underlying computer technologies and their relative speeds and sizes. Traditionally, CPUs were slow, and so highly compressed techniques were not optimal. Now CPUs are fast and disk is slow, so reducing disk postings list size dominates. However, if you're running a search engine with everything in memory, the equation changes again.
A nice, long excerpt on Naive Bayes as a classifier and why it works so well:
Naive Bayes is so called because the independence assumptions ... are indeed very naive for a model of natural language.

The conditional independence assumption states that features are independent of each other given the class. This is hardly ever true for terms in documents. In many cases, the opposite is true. The pairs hong and kong or london and english ... are examples of highly dependent terms.

In addition, the multinomial model makes an assumption of positional independence [and] the Bernoulli model ignores positions in documents altogether because it only cares about absence or presence ... How can NB be a good text classifier when its model of natural language is so oversimplified?

The answer is that even though the probability estimates of NB are of low quality, its classification decisions are surprisingly good ... It does not matter how accurate the estimates are ... NB classifiers estimate badly, but often classify well.

Even if it is not the method with the highest accuracy for text, NB has many virtues that make it a strong contender for text classification. It excels if there are many equally important features that jointly contribute to the classification decision. It is also somewhat robust to noise features ... and concept drift.

[But,] NB's main strength is its efficiency ... It is often the method of choice if (i) squeezing out a few extra percentage points of accuracy is not worth the trouble in a text classification application, (ii) a very large amount of training data is available and there is more to be gained from training on a lot of data than using a better classifier on a smaller training set, or (iii) if its robustness to concept drift can be exploited.

[Many] have shown that the average effectiveness of NB is uncompetitive with classifiers like SVMs when trained and tested on independent and identically distributed (i.i.d.) data ... However, these differences may be invisible or even reverse themselves when working in the real world where, usually, the training sample is drawn from a subset of the data to which the classifier will be applied, the nature of the data drifts over time ... and there may well be errors in the data (among other problems). Many practitioners have had the experience of being unable to build a fancy classifier for a certain problem that consistently performs better than NB.
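
To make the multinomial model concrete, here is a tiny, self-contained multinomial Naive Bayes sketch with add-one smoothing, trained on toy data modeled on the book's China/Japan worked example (my own code, not the book's):

```python
# A tiny multinomial Naive Bayes text classifier with add-one (Laplace)
# smoothing, trained on toy data. Log probabilities keep the badly estimated
# but consistently ordered scores numerically stable.

import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (label, text). Returns log priors and smoothed term log likelihoods."""
    class_docs = defaultdict(list)
    for label, text in docs:
        class_docs[label].append(text.lower().split())
    vocab = {t for texts in class_docs.values() for doc in texts for t in doc}
    priors, cond = {}, {}
    total_docs = len(docs)
    for label, texts in class_docs.items():
        priors[label] = math.log(len(texts) / total_docs)
        counts = Counter(t for doc in texts for t in doc)
        denom = sum(counts.values()) + len(vocab)
        cond[label] = {t: math.log((counts[t] + 1) / denom) for t in vocab}
        cond[label]["__unseen__"] = math.log(1 / denom)
    return priors, cond

def classify_nb(priors, cond, text):
    """Return the class with the highest posterior log probability."""
    scores = {}
    for label in priors:
        score = priors[label]
        for term in text.lower().split():
            score += cond[label].get(term, cond[label]["__unseen__"])
        scores[label] = score
    return max(scores, key=scores.get)

if __name__ == "__main__":
    training = [
        ("china", "chinese beijing chinese"),
        ("china", "chinese chinese shanghai"),
        ("china", "chinese macao"),
        ("japan", "tokyo japan chinese"),
    ]
    priors, cond = train_nb(training)
    print(classify_nb(priors, cond, "chinese chinese chinese tokyo japan"))
```
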
On feature selection:
Feature selection [picks] a subset of the terms occurring in the training set and [uses] only this subset as features in text classification. Feature selection serves two main purposes. First, it makes training and applying a classifier more efficient ... this is of particular importance for classifiers that, unlike NB, are expensive to train. Second, feature selection often increases classification accuracy by eliminating noise features ... [that] might produce ... overfitting. It may appear counterintuitive at first that a seemingly weaker classifier is advantageous ... [but] we will see that weaker models are often preferable when limited training data are available.

Of the two NB models, the Bernoulli model is particularly sensitive to noise features. A Bernoulli NB classifier requires some form of feature selection or else its accuracy will be low.

Mutual information and chi-squared represent rather different feature selection methods ... Because its criterion is significance, chi-squared selects more rare terms (which are often less reliable indicators) than mutual information. But the selection criterion of mutual information also does not necessarily select the terms that maximize classification accuracy. Despite the differences, the classification accuracy of feature sets selected with chi-squared and MI does not seem to differ systematically ... As long as all strong indicators and a large number of weak indicators are selected, accuracy is expected to be good. Both methods do this.
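
Here is a small sketch of mutual information feature selection from 2x2 document-count contingency tables (a generic illustration with made-up counts, not the book's code or data):

```python
# A small sketch of mutual information (MI) feature selection for one class.
# For each term we take the 2x2 contingency table of (term present/absent)
# vs (document in class / not in class) and compute MI from document counts.

import math

def mutual_information(n11, n10, n01, n00):
    """MI of a term and a class from a 2x2 document-count contingency table.

    n11: docs in class, with term      n10: docs not in class, with term
    n01: docs in class, without term   n00: docs not in class, without term
    """
    n = n11 + n10 + n01 + n00
    mi = 0.0
    for n_tc, n_t, n_c in [
        (n11, n11 + n10, n11 + n01),
        (n10, n11 + n10, n10 + n00),
        (n01, n01 + n00, n11 + n01),
        (n00, n01 + n00, n10 + n00),
    ]:
        if n_tc > 0:
            mi += (n_tc / n) * math.log2(n * n_tc / (n_t * n_c))
    return mi

def select_features(term_stats, k):
    """Pick the k terms with the highest MI. term_stats: {term: (n11, n10, n01, n00)}."""
    return sorted(term_stats, key=lambda t: mutual_information(*term_stats[t]),
                  reverse=True)[:k]

if __name__ == "__main__":
    # Made-up counts for a hypothetical class of 190 docs in a 100,000-doc collection.
    stats = {
        "export":  (49, 2_700, 141, 97_110),
        "chicken": (30, 200, 160, 99_610),
        "the":     (180, 94_000, 10, 5_810),
    }
    for term, counts in stats.items():
        print(term, round(mutual_information(*counts), 6))
    print("selected:", select_features(stats, 2))
```
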
On the effectiveness of linear classifiers and why using non-linear classifiers is not an obvious win:
For some problems, there exists a nonlinear classifier with zero classification error, but no such linear classifier. Does that mean we should always use a nonlinear classifier?

It is perhaps surprising that so many of the best-known text classification algorithms are linear. Some of these methods, in particular linear SVMs, regularized logistic regression and regularized linear regression, are among the most effective known methods ... Typical classes in text classification are complex and seem unlikely to be modeled well linearly. However, this intuition is misleading for the high dimensional spaces we typically encounter in text applications. With increased dimensionality, the likelihood of linear separability increases rapidly.
Excellent advice on picking which classifier to use:
If you have fairly little [training] data and you are going to train a supervised classifier, then machine-learning theory says you should stick to a classifier with high bias ... There are theoretical and empirical results that Naive Bayes does well in such circumstances ... A very low bias model like nearest neighbor is probably contraindicated. Regardless, the quality of the model will be adversely affected by the limited training data ... Often, the practical answer is to work out how to get more labeled data as quickly as you can.

If there is a reasonable amount of labeled data, then you are in a perfect position to use ... an SVM. However, if you are deploying a linear classifier such as an SVM, you should probably ... [overlay] a Boolean rule-based classifier over the machine-learning classifier ... If management gets on the phone and wants the classification of a particular document fixed right now, then this is much easier to do by hand-writing a rule than by working out how to adjust the weights of an SVM without destroying overall classification accuracy. This is one reason why ... decision trees, which produce user interpretable Boolean-like models, retain considerable popularity.

If a huge amount of data are available, then the choice of classifier probably has little effect on your results and the best choice may be unclear (cf. Banko and Brill 2001). It may be best to choose a classifier based on the scalability of training or even runtime efficiency.
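
The rule-overlay advice from the middle excerpt is simple to sketch. Below is a toy illustration (my own, not from the book) where hand-written Boolean rules are checked first and anything they do not cover falls through to the learned classifier:

```python
# A toy sketch of overlaying hand-written Boolean rules on a learned classifier.
# The rules are checked first; anything they do not cover falls through to the
# statistical model. learned_classifier below is a stand-in for an SVM or NB model.

def learned_classifier(text):
    """Stand-in for a trained statistical classifier (e.g. an SVM)."""
    return "sports" if "score" in text.lower() else "other"

# Hand-maintained rules: (predicate, label). Easy to edit when a specific
# document's classification must be fixed right now.
RULES = [
    (lambda text: "credit score" in text.lower(), "finance"),
]

def classify(text):
    for predicate, label in RULES:
        if predicate(text):
            return label
    return learned_classifier(text)

if __name__ == "__main__":
    print(classify("The final score was 3-1"))           # falls through to the model
    print(classify("How to improve your credit score"))  # caught by the rule
```
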
A nice, intuitive explanation of why support vector machines (SVMs) work well:
For two-class, separable training data sets, there are a lot of possible linear separators ... Although some learning methods such as the perceptron algorithm find just any linear separator, others, like Naive Bayes, search for the best linear separator according to some criterion. The SVM in particular defines the criterion to be looking for a decision surface that is maximally far away from any data point ... Maximizing the margin seems good because points near the decision surface represent very uncertain classification decisions ... A slight error in measurement or a slight document variation will not cause a misclassification.

[Also,] by construction, an SVM classifier insists on a large margin around the decision boundary. Compared with a decision hyperplane, if you have to place a fat separator between the classes, you have fewer choices of where it can be put. As a result of this, the memory capacity of the model has been decreased, and hence we expect that its ability to correctly generalize to test data has been increased.

The superlinear training time of traditional SVM algorithms makes them difficult or impossible to use on very large training sets. Alternative traditional SVM solution algorithms which are linear in the number of training examples scale badly with a large number of features, which is another standard attribute of text problems ... [and it] remains much slower than simply counting terms as is done in the Naive Bayes model.

Nonlinear SVMs ... [are usually] impractical ... [and] it can often be cheaper to materialize the higher order features and train a linear SVM ... The general idea is to map the original feature space to some higher dimensional feature space where the training set is [linearly] separable ... The easy and efficient way of doing this ... [is] the kernel trick ... 90% of the work with kernels uses ... polynomial kernels and radial basis functions ... Really, we are talking about some quite simple things. A polynomial kernel allows us to model feature conjunctions (up to the order of the polynomial) ... [For example,] pairs of words ... [require] a quadratic kernel. If triples of words give distinctive information, then we need to use a cubic kernel. A radial basis function allows you to have features that pick out circles (hyperspheres), although the decision boundaries become much more complex as multiple such features interact. A string kernel lets you have features that are character subsequences of terms.
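
To make the feature-conjunctions point concrete, here is a tiny check (generic code, not the book's) that a quadratic polynomial kernel computes exactly the same value as explicitly materializing all pairwise feature products:

```python
# A small check that a quadratic polynomial kernel K(x, y) = (x . y)^2 equals
# the dot product of explicitly materialized pairwise feature products
# x_i * x_j -- i.e. the kernel implicitly models feature conjunctions (pairs).

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def quadratic_kernel(x, y):
    return dot(x, y) ** 2

def pairwise_features(x):
    """Explicitly materialize all ordered pairwise products x_i * x_j."""
    return [xi * xj for xi in x for xj in x]

if __name__ == "__main__":
    x = [1.0, 2.0, 0.0, 3.0]   # e.g. term counts for a toy document
    y = [0.5, 1.0, 4.0, 0.0]
    explicit = dot(pairwise_features(x), pairwise_features(y))
    assert abs(quadratic_kernel(x, y) - explicit) < 1e-9
    print(quadratic_kernel(x, y))
```
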
On clustering:
K-means is the most important flat clustering algorithm ... The ideal cluster in K-means is a sphere with the centroid as its center of gravity ... The first step of K-means is to select as initial cluster centers K randomly selected documents, the seeds. The algorithm then moves the cluster centers around in space to minimize RSS ... the squared distance of each [item] from its [closest] centroid.

[We easily can get] suboptimal clustering ... from a bad choice of initial seeds ... Effective heuristics for seed selection include (i) excluding outliers ... (ii) trying out multiple starting points ... and (iii) obtaining seeds from ... hierarchical clustering [of] ... a small random sample.

Hierarchical clustering outputs ... a [useful tree] structure ... [and] does not require us to prespecify the number of clusters .... [but is at least] O(N^2) and therefore infeasible for large sets of 1,000,000 or more documents. For such large sets, HAC can only be used in combination with a flat clustering algorithm like K-means ... [which] combines the ... higher reliability of HAC with the efficiency of K-means.
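
Here is a compact sketch of the basic K-means loop (a generic implementation, not the book's code), including the simple multiple-restarts heuristic of keeping the run with the lowest RSS:

```python
# A compact K-means sketch: pick k random points as seeds, then alternate
# assignment and centroid recomputation until assignments stop changing.
# Running it several times and keeping the lowest-RSS result is the simple
# "multiple starting points" heuristic mentioned above.

import random

def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, seed=None, max_iters=100):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # seeds: k randomly chosen points
    assignment = None
    for _ in range(max_iters):
        new_assignment = [min(range(k), key=lambda c: dist2(p, centroids[c]))
                          for p in points]
        if new_assignment == assignment:       # converged: assignments unchanged
            break
        assignment = new_assignment
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:                        # keep old centroid if a cluster empties
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    rss = sum(dist2(p, centroids[a]) for p, a in zip(points, assignment))
    return centroids, assignment, rss

if __name__ == "__main__":
    pts = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (8.5, 9), (9, 8)]
    best = min((kmeans(pts, k=2, seed=s) for s in range(10)), key=lambda r: r[2])
    print("RSS:", best[2], "assignment:", best[1])
```
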
And there is much, much more. This is just a small sample of the insights in those pages. Definitely worth a read.

By the way, the references sections at the end of each chapter offer excellent surveys of recent work on each topic. They are worth their weight in gold and should not be missed.

The authors made the full text of their book available online, but I'd recommend picking up a copy for your shelf too. It is too good of a book to just skim through online; it deserves a more thorough read.

Please see also my June 2006 post, "Foundations of Statistical NLP", which reviews a 1999 book by two of the same authors.

Thursday, February 05, 2009

Layoffs and tech layoffs

There has been wide coverage in the press of layoffs across companies, including many technology companies.

I do not want to comment directly, but the writing of Jeffrey Pfeffer, one of my favorite authors on management, may be of interest. From one of his books:
There is no solid evidence that using layoffs rather than less draconian methods to cut costs increases performance, and there is plenty of evidence that involuntary reductions in force damage both displaced workers and "survivors".

A University of Colorado study ... showed no link between downsizing and subsequent return on assets. A Bain study ... found that ... firms that used layoffs primarily for cost cutting suffered a drop in stock price ... A Right Associates study ... reported that layoffs were followed by lower employee morale and trust in management.
This and some of Pfeffer's other work also talk about how to do layoffs effectively if they really must be done. To summarize, the layoffs should hit management harder than workers and should provide "people as much prediction, understanding, control, and compassion" as possible so you "reduce the number of people who feel afraid" and give people, "laid off or not", a "good idea of what steps they would need to take to control their destiny." ([1])

Pfeffer also refers to another book, Responsible Restructuring, for more "evidence on the ineffectiveness of most layoffs."

Tuesday, February 03, 2009

Details on Yahoo's distributed database

The good folks at Yahoo Research published a paper at VLDB 2008, "PNUTS: Yahoo!'s Hosted Data Serving Platform" (PDF), that has juicy details on the massively distributed database behind most of Yahoo's web applications.

An excerpt:
We describe PNUTS, a massively parallel and geographically distributed database system for Yahoo!'s web applications.

The foremost requirements of web applications are scalability, consistently good response time for geographically dispersed users, and high availability. At the same time, web applications can frequently tolerate relaxed consistency guarantees.

For example, if a user changes an avatar ... little harm is done if the new avatar is not initially visible to one friend .... It is often acceptable to read (slightly) stale data, but occasionally stronger guarantees are required by applications.

PNUTS provides a consistency model that is between the two extremes of general serializability and eventual consistency ... We provide per-record timeline consistency: all replicas of a given record apply all updates to the record in the same order .... The application [can] indicate cases where it can do with some relaxed consistency for higher performance .... [such as reading] a possibly stale version of the record.
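
As a rough sketch of the per-record timeline idea (my simplification for illustration, not PNUTS code): each record has a master replica that assigns increasing version numbers to writes, and every replica applies updates strictly in version order, so a reader may see a stale version but never an out-of-order one.

```python
# A rough sketch of per-record timeline consistency: one master replica per
# record assigns increasing version numbers, and every replica applies updates
# in version order. Readers may see a stale version, never an out-of-order one.
# This is a simplification for illustration, not PNUTS code.

class RecordReplica:
    def __init__(self):
        self.version = 0
        self.value = None
        self.pending = {}                  # updates that arrived out of order

    def apply(self, version, value):
        """Buffer the update and apply any that are now next in the timeline."""
        self.pending[version] = value
        while self.version + 1 in self.pending:
            self.version += 1
            self.value = self.pending.pop(self.version)

    def read_any(self):
        """'Read-any': possibly stale, but always some version from the timeline."""
        return self.version, self.value

class RecordMaster(RecordReplica):
    def __init__(self, replicas):
        super().__init__()
        self.replicas = replicas

    def write(self, value):
        """All writes for the record go through the master, which orders them."""
        version = self.version + 1
        self.apply(version, value)
        for replica in self.replicas:      # in reality, asynchronous via the message broker
            replica.apply(version, value)
        return version

if __name__ == "__main__":
    remote = RecordReplica()
    master = RecordMaster([remote])
    master.write("avatar_v1")
    master.write("avatar_v2")
    print(remote.read_any())               # (2, 'avatar_v2') once both updates apply
```
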
Do not miss the related work section of the paper where the authors compare PNUTS to Google Bigtable and Amazon Dynamo, among other things.

When reading the paper, a couple things about PNUTS struck me as surprising:

First, the system is layered on top of the guarantees of a reliable pub-sub message broker, which acts "both as our replacement for a redo log and our replication mechanism." I have to wonder if the choice not to build these pieces of the database themselves could lead to missed opportunities for improving performance and efficiency.

Second, as figures 3 and 4 show, the average latency of requests to their database seems quite high, roughly 100 ms. This is high enough that web applications probably would incur too much total latency if they made a few requests serially (e.g. ask for some data, then, depending on what the data looks like, ask for some other data). That seems like a problem.

Please see also my August 2006 post, "Google Bigtable paper", which discusses the distributed database behind many products at Google.

Please see also my earlier post, "Highly available distributed hash store at Amazon", on the distributed database behind some features at Amazon.com.

Please see also my earlier posts, "Cassandra data store at Facebook" and "HBase: A Google Bigtable clone".

Update: One of the developers of PNUTS commented on this post, pointing out that PNUTS performance is much better in practice (1-10ms/request) when caching layers are in place and making a few comparisons to Bigtable.

Monday, February 02, 2009

Where all roads lead

Nick Carr writes about the dominance a few have in our access to information:
What we seem to have here is evidence of a fundamental failure of the Web as an information-delivery service. Three things have happened, in a blink of history's eye: (1) a single medium, the Web, has come to dominate the storage and supply of information, (2) a single search engine, Google, has come to dominate the navigation of that medium, and (3) a single information source, Wikipedia, has come to dominate the results served up by that search engine.

Even if you adore the Web, Google, and Wikipedia - and I admit there's much to adore - you have to wonder if the transformation of the Net from a radically heterogeneous information source to a radically homogeneous one is a good thing. Is culture best served by an information triumvirate?

It's hard to imagine that Wikipedia articles are actually the very best source of information for all of the many thousands of topics on which they now appear as the top Google search result.

What's much more likely is that the Web, through its links, and Google, through its search algorithms, have inadvertently set into motion a very strong feedback loop that amplifies popularity and, in the end, leads us all, lemminglike, down the same well-trod path - the path of least resistance. You might call this the triumph of the wisdom of the crowd. I would suggest that it would be more accurately described as the triumph of the wisdom of the mob.
Please see also my Dec 2007 post, "One Wikipedia to rule them all", which cites a 2007 study that claims that Wikipedia is the first result on about 30% of web searches.

Update: I should have remembered this earlier, but please see also Rich Skrenta's Jan 2007 post, "Winner-Take-All", where he says, among other things, that "Google is the start page of the internet."