Wednesday, January 31, 2007

The Netflix Prize and big data

David Leonhardt at the NYT writes about the Netflix Prize in his article today "You Want Innovation? Offer a Prize".

Some excerpts:
When Netflix announced its prize in October, [CEO Reed] Hastings said he didn't necessarily expect contestants to make a lot of quick progress.

Computer scientists say that Cinematch, along with Amazon's recommendation system, was already one of the most sophisticated. "We thought we built the best darn thing ever," Mr. Hastings said.

But Mr. Hastings underestimated the power of an open competition. Within days, many of the top people in a field known as machine learning were downloading the 100 million movie ratings Netflix had made public.

The experts have since been locked in a Darwinian competition to build a better Cinematch, with the latest results posted on a leader board at Netflix's Web site.

With four and a half years to go in the contest, [the lead team] is already 6.75 percent better than Cinematch. And Netflix hasn't had to pay for their time.

In effect, the company "has recruited a large fraction of the machine learning community for almost no money," as [Geoffrey] Hinton, [a University of] Toronto [Computer Science] professor, put it.
While the prize money adds excitement, I think most of the enthusiasm from the research community is simply from having access to such a massive data set.

Until Netflix released their movie ratings data for this contest, the largest data sets available for experimenting with and evaluating recommender systems were the MovieLens and EachMovie data sets. Those data sets are two orders of magnitude smaller.

Netflix made 100M ratings by 480k customers over 30k titles available to researchers. A data set of that size simply was not available until now.

This opens up new opportunities for research on recommender algorithms. Not only are there considerable challenges in scaling recommender algorithms to big data, but also, as Googler Peter Norvig points out, we may have more to learn from improving our ability to work with massive training data than we do from twiddling algorithms running over small data.
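To make that concrete, even something as simple as computing per-movie average ratings -- a common baseline predictor -- has to be done in a streaming pass at this scale rather than by loading everything into memory. Here is a rough sketch in Python, assuming the ratings have been flattened into simple "user,movie,rating" CSV lines (the actual Netflix data ships in a different per-movie file layout):

```python
# A minimal sketch of working with Netflix-scale ratings without loading them
# all into memory. Assumes a hypothetical CSV of "user_id,movie_id,rating"
# lines; the real Netflix Prize data uses a different per-movie file format.
import csv
from collections import defaultdict

def movie_averages(path):
    totals = defaultdict(float)   # movie_id -> sum of ratings
    counts = defaultdict(int)     # movie_id -> number of ratings
    with open(path, newline="") as f:
        for user_id, movie_id, rating in csv.reader(f):
            totals[movie_id] += float(rating)
            counts[movie_id] += 1
    return {m: totals[m] / counts[m] for m in totals}

if __name__ == "__main__":
    averages = movie_averages("ratings.csv")
    print(len(averages), "movies averaged")
```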

Yes, the money and visibility of the Netflix Prize is a motivator, I am sure. But, there is also excitement from getting access to big data that previously only was available inside companies like Amazon, Netflix, Yahoo, or Google.

See also my original post on the Netflix contest, "Netflix offers $1M prize for improved recs".

See also a Sept 2006 talk (PDF) by Netflix's VP of Recommendation Systems Jim Bennett that has details on their Cinematch recommender system.

Tuesday, January 30, 2007

Google Video launches recommendations

Ionut Alex Chitu reports that Google Video recently launched a recommendations feature that suggests videos "based on your search history, ratings and viewing patterns."

My first three videos are all Google TechTalks, which are good recommendations since that is pretty much all I watch on Google Video. The later picks are science videos and other bits of geekdom, not bad either.

It might be worth mentioning that Findory has video recommendations as well, launched about eight months ago, that also use your viewing history. Findory's video recommendations mostly are for videos on Google Video and YouTube.

Also worth noting is StumbleVideo, a snazzy and easy to use video recommender from the folks at StumbleUpon.

On a side note, I think it is amusing that I found Ionut Alex's post using the related articles recommendations from Findory for this weblog. Recommendations all around, it seems.

Is Microsoft's Web war lost?

Former internet analyst Henry Blodget slams Microsoft's faltering efforts to compete with Google. An excerpt:
I agree that the "Live" brand is confusing and that Microsoft has not clearly explained what it is trying to accomplish.

What I disagree with is the implicit idea that Microsoft KNOWS what it's trying to accomplish on the web (short of somehow miraculously vanquishing Google)--and that it has any reasonable chance of achieving it.

I argued a year ago that the web war was over and Microsoft lost. This seems even clearer with every passing day.

The only question in my mind is whether Microsoft can build enough of a wall around its crown jewels--Office and Windows--to survive over the long haul.
Microsoft's confusing Live branding clearly is a mistake. As Henry says, it is one of the more glaring examples of Microsoft's lack of focus with their internet strategy.

Henry calls the Web war lost at this point. I would not be so quick to dismiss the Redmond giant.

Some opportunities may have passed. For example, I think Microsoft could have competed with Google on technical innovation if they had started in 2003, but that seems less likely to work at this late date.

In 2007, with Google so far ahead, I think Microsoft will have to use its market power to win. They will have to lock up web advertising deals to starve Google's revenue growth. They will have to exploit their control of the experience on the Windows PC desktop and eliminate opportunities to go elsewhere.

The steep drop in MSN Search market share appears to be due to three things: the weakness of the MSN/Live brand against the strength of the Google brand, the loss of its position as the default search engine for many users (Firefox users, people with Adobe Acrobat or Java installed who opt in, Google Toolbar users, people with Dell computers), and the failure of MSN Search to differentiate itself in any positive way (by quality or by features) from Google.

All three of those have gotten worse over the last four years. The situation is so bad that some are saying the battle is lost.

I think Microsoft still has a fighting chance, but the delays have not made the battles ahead any easier. Microsoft must fix their branding issues, control the search experience in Windows, attack Google's advertising profits, and make the MSN home page and MSN Search at least "good enough".

I think Microsoft can do these things. But, the track record of the last four years has not been good. If Microsoft keeps going as they have been, Henry may be right that the Web war is lost.

See also my previous post, "MSN Search and beating Google".

See also Henry Blodget's other writings on Google vs. Microsoft.

Update: John Battelle posts that Microsoft CEO Steve Ballmer is refocusing Microsoft on countering "advertising funded business models". John says, "The order has come from the bridge of the aircraft carrier: Turn toward search and advertising models, full steam ahead!"

Monday, January 29, 2007

MyBlogLog and spam

I am amused by the ways to spam MyBlogLog described by Emre Sokullu and Michael Jensen. MyBlogLog is a little widget you can put on your weblog that shows the most recent people to visit your site.

It is inevitable that a system like MyBlogLog, if popular, would get spammed. I suspect it will be difficult to prevent. Spammers have quite a financial incentive to manipulate this feature since, if successful, they essentially get an advertisement for free in a placement that would normally cost thousands.

Note that the spam attempts only started once Yahoo acquired MyBlogLog. Before that, MyBlogLog was not big enough to attract spammers' attention. With the Yahoo acquisition, a financial incentive appeared.

All very predictable. In my October 2005 post, "Getting the crap out of user-generated content", I said:
Small sites work dandy when they're only used by early adopters. Early adopters are dedicated, so the quality of the content is high. Traffic is low, so spammers don't care about them.

As they grow, as traffic increases and the products start to attract a mainstream audience, the incentive for spam goes up. Suddenly, there's a profit motive, an ability to reach a wide audience at low cost. The spam floods in.

Websites hosting user-generated content need to be designed with the idea that much of the content will be crap or spam.
See also my earlier posts, "Growth, crap, and spam" and "Digg struggles with spam".

Saturday, January 27, 2007

SEO and personalized search

Fortune Interactive writes about the impact of personalized search on search engine optimization (SEO). Some excerpts:
The personalization of search alters search results from user to user based on the history and profile of the searcher.

[Mike Moran said,] "Widespread personalization will doom traditional rank checking. The question won't be 'Does my site rank No. 1?' but rather 'For what percentage of searchers does my site rank No. 1' or 'What was my average ranking yesterday?' . . . [I]t's the biggest change in search marketing since paid search."

Without a reliable means of rank checking, traditional SEO loses its foundations for optimization decisions .... Without a reliable means of rank checking, traditional SEO ... loses its metrics for determining its success or failure.

With widespread personalization on the rise, traditional SEO can be aptly characterized as a dead man walking.
White hat search engine optimization is not spam, but enough SEO is black hat that I think my previous post, "Web spam, AIRWeb, and SIGIR" is relevant here. In that post, I said:
"Winner takes all" encourages spam. When spam succeeds in getting the top slot, everyone sees the spam. It is like winning the jackpot.

If different people saw different search results -- perhaps using personalization based on history to generate individualized relevance ranks -- this winner takes all effect should fade and the incentive to spam decline.
See also my previous post, "Combating web spam with personalization".
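To see why, here is a toy simulation (my own illustration, not from either post) comparing how often a spam page that has gamed its way to the top of a single global ranking still gets the top slot when each user instead sees a ranking blended with their own interests:

```python
# A toy simulation (my own illustration) of how personalization dilutes the
# "winner takes all" payoff for a spam page.
import random

random.seed(0)
NUM_USERS = 10000
NUM_PAGES = 50
SPAM_PAGE = 0

# One global relevance score per page; the spam page has gamed its way to the top.
global_scores = [random.random() for _ in range(NUM_PAGES)]
global_scores[SPAM_PAGE] = max(global_scores) + 0.1
global_winner = max(range(NUM_PAGES), key=lambda p: global_scores[p])

# With personalization, each user's ranking blends the global score with a
# per-user interest score, so different users see different top results.
spam_wins = 0
for _ in range(NUM_USERS):
    personal = [0.5 * global_scores[p] + 0.5 * random.random() for p in range(NUM_PAGES)]
    if max(range(NUM_PAGES), key=lambda p: personal[p]) == SPAM_PAGE:
        spam_wins += 1

print("Global ranking: spam page shown first to",
      "100%" if global_winner == SPAM_PAGE else "0%", "of users")
print(f"Personalized ranking: spam page shown first to {100.0 * spam_wins / NUM_USERS:.1f}% of users")
```

Under the global ranking, the spammer wins every impression; even this crude blend knocks the spam page out of the top slot for many users, which is exactly the dilution of the jackpot being described.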

[Fortune Interactive article found via Gary Price]

Update: Erik Dafforn writes, "The single algorithm you're chasing now will soon be 500 million little ones."

Update: Nick Wilson at Search Engine Land says:
[The] rollout of ... personalized search results for the masses ... was cataclysmic.

In short, it's a game changer. Those who adapt quickly or are already ahead of the curve will thrive in the new environment. Those too slow or in denial will perish.

"One page fits all" is now a thing of the past. Personalized search is now the default.
Update: Matt Cutts says, "With personalization ... black hat becomes a lot more difficult ... You can't rank that (trophy keyword) for everyone anymore."

Update: Another good quote from Matt Cutts (from the same SES keynote interview):
The nice thing about personalization is you don't see one monolithic set of results ... Now everyone can rank in the top ten for some niche, so there is no weird step function.

It's not winner-take-all anymore.

Remoras, plug-ins, and apis

Paul Lamere writes about "The Recommender Remoras", recommender services that glom on to much bigger applications.

Some excerpts:
One thing that iTunes does not do very well is give new music recommendations.

Poor recommendations has driven a whole industry of music recommenders that work with iTunes.

These recommenders run as iTunes plugins, sidebars and tag-alongs. When you start iTunes, these programs also start up - they keep tabs on what you are listening to, and use this data to generate recommendations for you.

These recommenders are like remoras (aka suckerfish) to the iTunes shark.

By attaching itself to a host such as a shark, a remora benefits by using the host as transport and protection and also feeds on materials dropped by the host. The recommender remoras benefit from iTunes reach onto millions of desktops and feed on the opportunities left by iTunes' poor recommender.
The problem with the iTunes remora model is that you survive at the whim of Apple. At any moment, Apple could eliminate you by simply launching a better built-in recommender service in iTunes.

However, Paul may be overstating the case a bit. Services such as iLike and Last.fm have iTunes plugins, but they also have independent websites and applications.

In fact, given that Last.fm has plug-ins for Launchcast, Winamp, Windows Media Player, Skype, and many others, I suspect these recommender services are using the plug-ins more as a marketing tool to drive traffic to their services than as their primary means of survival. I would be curious to see numbers on how much of iLike and Last.fm usage comes through iTunes.

On a broader issue, it is concerning that using open APIs to integrate with other services and applications may create a dependency. Integration is fine if it merely increases the visibility and convenience of using your service.

But, if it gets to the point that you become a remora, you may find your survival subject to the whims of your host.

Friday, January 26, 2007

Sender reputation in GMail

There are some interesting tidbits on the spam filtering in GMail in the paper "Sender Reputation in a Large Webmail Service" (PDF) by Googler Bradley Taylor.

In particular, it sounds like one of several spam filters in GMail uses aggregate data on which messages are marked as spam (by other automatic classifiers or manually by users) to determine the spam reputation of a sending domain.

If e-mails sent from a particular domain are consistently marked as spam, the reputation of that domain will deteriorate, and then eventually the entire domain will be blacklisted. Simple but good idea.

I'd really like to see an extension of this where the reputations of individual senders are monitored, not to blacklist, but to whitelist.

For example, Findory has a feature where people can get a daily e-mail of their Findory front page, essentially a mailing list. As the Google paper says, "Some users are lazy and find that reporting spam on a mailing list is easier than unsubscribing ... For the sender it means their reputation is hurt worse than it needs to be."

However, mail from glinden@findory.com presumably is never marked as spam. Whitelisting individual senders with good reputation seems like it would be a good refinement to blacklisting domains with bad reputation.
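A rough sketch of what that combination might look like -- and this is my guess at the shape of it, not Google's actual system -- is to track spam-report rates at both the domain and the individual sender level, with thresholds that are purely illustrative:

```python
# A minimal sketch, not Google's actual system, of reputation scoring from
# aggregate spam reports: blacklist domains with a high spam-report rate and
# whitelist individual senders with a consistently low one.
from collections import defaultdict

SPAM_RATE_BLACKLIST = 0.80   # illustrative thresholds, chosen arbitrarily
SPAM_RATE_WHITELIST = 0.01
MIN_MESSAGES = 1000          # require enough volume before trusting the rate

def reputation(events):
    """events: iterable of (sender, domain, was_marked_spam) tuples."""
    domain_counts = defaultdict(lambda: [0, 0])   # domain -> [spam, total]
    sender_counts = defaultdict(lambda: [0, 0])   # sender -> [spam, total]
    for sender, domain, was_spam in events:
        domain_counts[domain][0] += int(was_spam)
        domain_counts[domain][1] += 1
        sender_counts[sender][0] += int(was_spam)
        sender_counts[sender][1] += 1

    blacklist = {d for d, (spam, total) in domain_counts.items()
                 if total >= MIN_MESSAGES and spam / total >= SPAM_RATE_BLACKLIST}
    whitelist = {s for s, (spam, total) in sender_counts.items()
                 if total >= MIN_MESSAGES and spam / total <= SPAM_RATE_WHITELIST}
    return blacklist, whitelist
```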

Scaling MySpace

David Carr at Baseline wrote a fun article, "MySpace: Too Much of a Good Thing?", that talks about MySpace's effort to scale to their rapid growth and the trouble they had with some of their attempted solutions.

As MySpace VP of Technology Jim Benedetto said, "We made a lot of mistakes." The biggest appear to have been putting all data into a relational database (a SQL Server database at that) regardless of whether that was needed or appropriate, partitioning data inadequately, neglecting opportunities to cache, and investing in scaling up rather than scaling out their hardware.
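For anyone who has not had to scale a site like this, here is a tiny sketch (my own illustration, not MySpace's code) of two of those fixes together: hash-partitioning user data across database shards and using a cache-aside layer so the databases are not hit on every read:

```python
# A minimal sketch (my illustration, not MySpace's code) of hash-partitioning
# user data across databases and caching reads so the databases are not hit
# on every profile view.
import hashlib

NUM_SHARDS = 16
cache = {}          # stand-in for something like memcached
shards = [dict() for _ in range(NUM_SHARDS)]   # stand-ins for per-shard databases

def shard_for(user_id):
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def get_profile(user_id):
    # Cache-aside: check the cache first, fall back to the owning shard.
    if user_id in cache:
        return cache[user_id]
    profile = shards[shard_for(user_id)].get(user_id)
    if profile is not None:
        cache[user_id] = profile
    return profile

def put_profile(user_id, profile):
    shards[shard_for(user_id)][user_id] = profile
    cache[user_id] = profile   # keep the cache consistent on writes
```

The hash just guarantees each user consistently maps to one shard; the cache means repeat reads never reach the database at all.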

[Found via DeWitt Clinton]

The endless nagging of RSS readers

Dave Winer writes:
Most RSS readers remind the user, all the time, how wrong he or she is. Or inadequate or lazy or behind in their work ... Who needs that.

Think about it this way. Suppose you read the paper every day. What if at the top of the paper it told you how many articles from previous issues you hadn't read.

News is not email ... Every article is not necessarily something you should read, or even look at ... If I'm not interested, or too busy -- too bad. No need to count the number of articles that didn't get my attention. It's a useless piece of data.
See also my earlier post, "RSS sucks and information overload", where I said, "The problem is scaling attention. Readers have limited time. They don't want information. They want knowledge."

[Winer post found via Matt McAlister]

Sunday, January 21, 2007

Yahoo Research on distributed web search

Gary Price at ResourceShelf points to a good new paper out of Yahoo Research by Baeza-Yates et al., "Challenges in Distributed Information Retrieval" (PDF).

The paper is a summary of the challenges in building a web search engine that can scale to handle the expected growth of the web in the next 3-5 years.

Some excerpts:
In the ocean of Web data, Web search engines are the primary way to access content. As the data is on the order of petabytes, current search engines are very large centralized systems based on replicated clusters.

The number of Web sites continues to grow rapidly and there are currently more than 20 billion indexed pages. In the near future, centralized systems are likely to become ineffective against such a load, thus suggesting the need of fully distributed search engines.

In this paper we discuss the main issues with the design of a distributed Web retrieval system, including discussions on the state of the art.
The paper goes on to discuss the problems that come up for crawling, indexing, and query processing when, instead of all being in the same data center, clusters are distributed across a WAN. It is a great overview and a worthwhile read.

I wanted to highlight a few small but interesting points in the paper that I suspect would be easy to overlook.

First, I really like the idea mentioned on page 5 of the paper of optimizing the distributed index by "using query logs to partition the document collection and to route queries." What would be really fun to see is this optimization also being done continuously, replicating and moving index shards in real-time in response to access patterns.
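Here is a little sketch of the flavor of that idea, under my own simplifying assumptions: log which partitions answered past queries for each term, then route a new query only to the partitions that usually had the answers, broadcasting only when a term has never been seen:

```python
# A minimal sketch, under my own simplifying assumptions, of using query logs
# to route queries: record which partitions answered past queries for each
# term, then send a new query only to the partitions that usually had the
# answers, falling back to broadcasting for unseen terms.
from collections import defaultdict, Counter

ALL_SHARDS = set(range(32))

term_to_shards = defaultdict(Counter)   # term -> Counter of shards that answered it

def record(query, answering_shards):
    """Update the routing table from a query-log entry."""
    for term in query.lower().split():
        for shard in answering_shards:
            term_to_shards[term][shard] += 1

def route(query, shards_per_term=2):
    """Pick the shards most likely to hold results for this query."""
    chosen = set()
    for term in query.lower().split():
        history = term_to_shards.get(term)
        if history is None:
            return ALL_SHARDS        # unseen term: broadcast to everything
        chosen.update(s for s, _ in history.most_common(shards_per_term))
    return chosen

record("machine learning", [3, 7])
record("learning theory", [7, 12])
print(route("machine learning"))   # a small subset instead of all 32 shards
```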

Second, the paper very briefly mentions the idea of a peer-to-peer model for a web search engine. On the one hand, the paper makes this idea sound intriguing by talking about all the problems of doing distributed web search and the number of machines required (possibly 1.5M in 2010). On the other hand, there is a big difference between 30+ clusters on a WAN with 50k+ fast-interconnect, trusted machines in each cluster and using millions of untrusted machines on a P2P network.

The Baeza-Yates paper does not explore P2P web search further, but it does cite "P2P Content Search: Give the Web Back to the People" (PDF) which, in turn, cites "On the Feasibility of Peer-to-Peer Web Indexing and Search" (PS). This second paper argues that P2P web search, at least in a naive implementation, would consume most of the bandwidth on the internet and therefore is not feasible. Worth a peek if you, like me, think this might be an intriguing idea to explore.

Finally, the Baeza-Yates paper mentions personalized search. I think that is interesting both because Yahoo researchers are thinking about personalized search and, like folks at Microsoft Research, they seem to be leaning toward a client-side implementation:
When query processing involves personalization of results, additional information from a user profile is necessary at search time, in order to adapt the search results according to the interests of the user.

An additional challenge related to personalization of Web search engines is that each user profile represents a state, which must be the latest state and be consistent across replicas.

Alternatively, a system can implement personalization as a thin layer on the client-side. This last approach is attractive because it deals with privacy issues related to centrally storing information about users and their behavior.
However, a client-side approach, as the MSR researchers say, has the "disadvantage of ... [no] direct access to details of the Web corpus" (which limits the extent of the personalization) and, as the Yahoo researchers point out, "limits the user to always using the same terminal."
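For what it is worth, the thin client-side layer they describe could be as simple as this sketch: a profile of term weights built from local history that never leaves the machine, used to re-rank whatever results the engine returns. This is my own toy illustration, not anything from the paper.

```python
# A minimal sketch of a thin client-side personalization layer: the profile
# (term weights learned from local browsing history) never leaves the machine,
# and results returned by the engine are re-ranked locally against it.
def build_profile(visited_titles):
    profile = {}
    for title in visited_titles:
        for term in title.lower().split():
            profile[term] = profile.get(term, 0) + 1
    return profile

def rerank(results, profile, weight=0.5):
    """results: list of (title, engine_score); returns results re-sorted."""
    def personal_score(title):
        return sum(profile.get(t, 0) for t in title.lower().split())
    return sorted(results,
                  key=lambda r: r[1] + weight * personal_score(r[0]),
                  reverse=True)

profile = build_profile(["distributed information retrieval",
                         "personalized web search"])
print(rerank([("celebrity news roundup", 1.0),
              ("personalized search explained", 0.9)], profile))
```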

I want differential pricing for Amazon EC2

One thing mystifies me about Amazon EC2. Why doesn't the service offer different pricing for low priority usage or for usage at off-peak times?

The Amazon Elastic Compute Cloud (Amazon EC2) allows people to rent virtual servers by the hour, charging $0.10 per hour and $0.20 per gigabyte transferred.

As Amazon says in their description of EC2:
Amazon EC2 enables you to increase or decrease capacity within minutes, not hours or days. You can commission one, hundreds or even thousands of server instances simultaneously.

Amazon EC2 passes on to you the financial benefits of Amazon's scale. You pay a very low rate for the compute capacity you actually consume.

This frees you from many of the complexities of capacity planning, transforms what are commonly large fixed costs into much smaller variable costs, and removes the need to over-buy "safety net" capacity to handle periodic traffic spikes.
In order for this to work out for Amazon, I would think Amazon also needs to avoid coordinated "periodic traffic spikes" in usage of EC2. Otherwise, they also will need to "over-buy safety net capacity" and will see low utilization rates on their cluster.

With the current pricing structure, there is no incentive to avoid peak load times. In fact, if I were using Amazon EC2 for a batch job, I probably would request my servers during US work hours, the same time EC2 is under heavy load from webservers or other real-time tasks. There is no reason to do otherwise.

I think EC2 should offer a lower rate for low priority requests for servers. Servers at this rate could be pulled from the client at any time for higher priority jobs.

Pricing could be very low because idle servers are worthless to Amazon. If the price point is near the marginal cost of the server time, this service would be attractive to many.

The benefits for Amazon are also apparent. There would be less need to over-buy capacity since capacity could be regained from low priority requests. Utilization would increase, Amazon would get paid for what would otherwise be idle time, and the economics of EC2 would improve.
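To make the proposal concrete, here is a toy sketch of the allocation policy I have in mind: spare machines get filled with discounted low-priority instances, and any of them can be preempted the moment a full-price request arrives and no idle capacity remains. The class and numbers are mine, purely for illustration.

```python
# A toy sketch (my own illustration of the proposal above) of an allocator
# that fills spare capacity with low-priority instances and preempts them
# whenever a full-price request arrives and no idle machines remain.
class Cluster:
    def __init__(self, capacity):
        self.capacity = capacity
        self.high = set()   # instance ids running at the normal rate
        self.low = set()    # instance ids running at the discounted rate
        self.next_id = 0

    def _used(self):
        return len(self.high) + len(self.low)

    def request(self, priority):
        if self._used() >= self.capacity:
            if priority == "high" and self.low:
                self.low.pop()          # preempt a low-priority instance
            else:
                return None             # no capacity for this request
        self.next_id += 1
        (self.high if priority == "high" else self.low).add(self.next_id)
        return self.next_id

cluster = Cluster(capacity=2)
print(cluster.request("low"), cluster.request("low"))   # fills idle machines
print(cluster.request("high"))                          # preempts one of them
print(cluster.request("low"))                           # denied: no idle capacity
```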

I have a lot of big data processing tasks -- both for Findory and for side interests -- that fit a batch profile. I am sure other potential EC2 users do as well.

Amazon itself has many batch jobs that fit this profile, including web server log processing, personalization builds, search indexing, and data mining. All of these could be done on borrowed EC2 servers rather than using more expensive dedicated hardware.

Going a step further, I suspect many of these low priority batch jobs could benefit from a different API to Amazon EC2.

Rather than requesting servers and then manually configuring them myself, what I really want is to be able to submit a MapReduce job that kicks off hundreds or thousands of servers at low priority. Work on servers that go down or are pulled away during the job should be restarted elsewhere. At the end, I should get the completed data file.

It should be something like: "Here's my data file, some MapReduce code, and $10. Let me know when you're done."
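To be clear about what "some MapReduce code" means, here is a sketch: the mapper and reducer are the only pieces the user should have to write, the tiny local runner is just so the example executes, and the submission call at the end is entirely hypothetical -- no such Amazon API exists -- shown only to illustrate the interface I am wishing for:

```python
# A sketch of what "some MapReduce code" might look like, with a tiny local
# runner so it executes here. The submission call in the comments at the end
# is purely hypothetical -- no such Amazon API exists.
from collections import defaultdict

def mapper(line):
    # Example job: count words in a big log file.
    for word in line.split():
        yield word, 1

def reducer(word, counts):
    yield word, sum(counts)

def run_locally(lines):
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    results = {}
    for key, values in groups.items():
        for out_key, out_value in reducer(key, values):
            results[out_key] = out_value
    return results

print(run_locally(["the quick brown fox", "the lazy dog"]))

# Hypothetical submission, in the spirit of "here's my data file, some
# MapReduce code, and $10":
#   job = batch_service.submit(input="s3://my-bucket/logs/",
#                              map=mapper, reduce=reducer,
#                              priority="low", budget_dollars=10)
#   print(job.wait_and_fetch_output())
```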

Powerset already is running a MapReduce clone on Amazon EC2. Powerset has shown both that such a service is possible and that there is demand for it.

It would be MapReduce for the masses. No longer would you have to be at Google to do easy data processing on a massive cluster. You could borrow Amazon's EC2 cluster any time you want.

Update: Nearly three years later, Amazon launches differential pricing for EC2.

Saturday, January 20, 2007

Google's big bandwidth acquisitions

Robert Cringely's latest column, "When Being a Verb is Not Enough", looks at the massive bandwidth rights and data centers Google has acquired.
Google controls more network fiber than any other organization. This is not to say that Google OWNS all that fiber, just that they control it through agreements with network operators.

Google is building a LOT of data centers. The company appears to be as attracted to cheap and reliable electric power as it is to population proximity.

Of course this doesn't answer the question why Google needs so much capacity in the first place, but I have a theory on that.
Robert concludes that Google is trying to corner the market on bandwidth:
It is becoming very obvious what will happen over the next two to three years. More and more of us will be downloading movies and television shows over the net and with that our usage patterns will change.

Instead of using 1-3 gigabytes per month ... we'll go to 1-3 gigabytes per DAY ... a huge backbone burden on ISPs. Those ISPs will be faced with the option of increasing their backbone connections by 30X, which would kill all profits, OR they could accept a peering arrangement with the local Google data center.
I doubt this is the reason. It just does not sound Googly to me.

I think the reason is what Google accidentally disclosed back in March 2006. In my post, "In a world with infinite storage, bandwidth, and CPU power", I quoted from slide 19 of a Google Analyst Day presentation:
[Google] is inspired by the idea of "a world with infinite storage, bandwidth, and CPU power."

They say that "the experience should really be instantaneous". They say that they should be able to "house all user files, including: emails, web history, pictures, bookmarks, etc and make it accessible from anywhere (any device, any platform, etc)" which leads to a world where "the online copy of your data will become your Golden Copy and your local-machine copy serves more like a cache"
Maybe I am too idealistic, maybe a little naive, but I cannot see Google excited by Robert's idea of speculating on the value of bandwidth rights.

Trying to build a world of infinite storage, bandwidth, and CPU power, that is Googly. That infrastructure, once built, would be a tool that makes the impossible possible.

It entirely explains the massive data centers and bandwidth acquisitions. And it is entirely Googly.

Update: Robert Cringely writes a follow-up article, "Oh Brother Where Art Thou?".

Thursday, January 18, 2007

The value of recommendation engines

Alex Iskold at Read/WriteWeb writes about "The Art, Science and Business of Recommendation Engines".

Some excerpts:
There are two fundamental activities online -- Search and Browse. When a consumer knows exactly what she is looking for, she searches for it. But when she is not looking for anything specific, she browses.

It is the browsing that holds the golden opportunity for a recommendation system, because the user is not focused on finding a specific thing -- she is open to suggestions.

During browsing, the user's attention (and their money) is up for grabs. By showing the user something compelling, a web site maximizes the likelihood of a transaction. So if a web site can increase the chances of giving users good recommendations, it makes more money.

Amazon is considered a leader in online shopping and particularly recommendations. Over the last decade the company has invested a lot of money and brain power into building a set of smart recommendations that tap into your browsing history, past purchases and purchases of other shoppers -- all to make sure that you buy things.

The Amazon system is phenomenal. It is a genius of collaborative shopping and automation that might not be possible to replicate. This system took a decade for Amazon to build and perfect. It relies on a massive database of items and collective behavior that also "remembers" what you've done years and minutes ago.
I like Alex's description of searching versus browsing.

Search helps when you know what is out there and can easily say what you want. Personalization and recommendations help when you do not know what is out there. Personalization surfaces interesting items you did not know about and may not have found on your own.

And, Alex is right that recommendations can be lucrative. Personalization and recommendations apparently are responsible for 35% of sales at Amazon.com.

Yahoo blew it

Fred Vogelstein at Wired writes about "How Yahoo Blew It". Some excerpts:
Yahoo's long-delayed push to aggressively compete in the search-driven advertising business is only now getting off the ground, arguably two years behind schedule.

Semel bought a search engine in 2002 and a search-driven ad firm in 2003. All that was left to do was to put the pieces together.

They planned an overhaul of Overture's technology, a project code-named Panama. It was a disaster.

With no clear delineations, Yahoo and Overture executives fought over turf. Yahoo hired and fired a half-dozen engineering chiefs at Overture during the first year. Overture salespeople competed for business with Yahoo salespeople.

And Meisel, Overture's CEO, was ineffective -- either inept or hamstrung by bureaucracy, depending on whom you ask. Decisions big and small, from trying out new features to agreeing on budgets, had to be cleared by committee after committee in Sunnyvale.

"It was a clusterfuck," one of the participants says.
The article goes on to lay much of the blame for the mess on Terry Semel.

A few others have some insightful commentary on this piece. Barry Schwartz at Search Engine Land highlights the last line in the article:
At Yahoo, the marketers rule, and at Google the engineers rule. And for that, Yahoo is finally paying the price.
While this is phrased as a technology vs. media/marketing split, I think there may be just as much of a short-term vs. long-term split here. While Yahoo focused on quick marketing deals, Google built a technology infrastructure that acted as a force multiplier.

Rich Skrenta posted some interesting thoughts on whether a 2002 acquisition of Google by Yahoo would have helped Yahoo, concluding no, probably not. I also enjoyed hearing about the other bidders for Google during their early days, including AOL and Ask:
Heck, I tried to buy Google for AOL in 1999 .... Of course AOL would have killed them.

Jim Lanzone of Ask Jeeves also tried to buy Google, but the $1B ask was too high. Overture tried to buy them and was rebuffed.

There seems to have been a long line of folks in 1999-2000 who recognized Google's value, but couldn't justify the price then.
For more on Yahoo and their problems with their advertising platform, see also my Oct 2006 post, "Yahoo's troubles".

Update: Three months later, Yahoo again releases disappointing numbers, causing BusinessWeek to ask, "Yahoo's Next Search: A New CEO?" [Found via Search Engine Land]

Sunday, January 14, 2007

Findory rides into the sunset

Findory.com launched on January 2, 2004. The website just passed its third anniversary and, including the early work on the ideas behind Findory, Findory has been in my life for nearly four years.

In the last few months, I have been evaluating new directions for Findory. I asked colleagues I trust for their thoughts and pulled in two senior advisors (thanks, Bill and Dan).

Some good options came out of these discussions, but none led down a path I am passionate about. I built Findory to follow a passion.

I built Findory around the idea of applying Amazon.com-style personalization and recommendations to information. Search only helps if you can say what you want. Personalization helps you discover things that you could not have found on your own.

My passion is helping people discover information they would otherwise miss. My passion is working on ways to help people with the overwhelming flood of information in their daily lives.

Here, Findory has been successful. Findory has influenced work at Google, Microsoft, AOL, and elsewhere. I am pleased with what Findory has accomplished. At some point, I have to declare victory and move on. I am moving on now, not to a new venture, but to spend more time on health and with family.

Development on Findory now will slow to a crawl. There may be new features, but they will be rare. I no longer will spend time exploring funding, biz dev deals, or recruiting.

Findory appears to have sufficient resources to run on autopilot through most of 2007. Findory will eventually fade away, but I believe it has touched immortality through the impact it had.

It was exciting, challenging, and fun to try to build a startup. I consider myself very lucky to have had that opportunity.

Update: I am overcome by the outpouring of comments on this post. Thank you, everyone, for your kind words.

Update: Thanks also to those who posted about Findory's end, including Mike, John, Gary, Nathan, Don, Om, Brian, Danny, Richard, and many others. But, whatever you do, don't miss Valleywag's very funny post about all this. "The Gaia", I love it!

Newspapers should own local

Rich Skrenta has an interesting post asking why newspapers fail to get their excellent content up high in search results. Some excerpts:
If you go to Google and type in any restaurant name, you're not likely to ever come across a newspaper restaurant review in the results.

These would be very valuable pageviews to be getting. Adsense could do $10-30 CPM on these landings. Not to mention the value to the newspaper to hold on to a claim of authority for restaurant reviews in their area.

Newspapers have a lot of great content, really high quality stuff that cost them a lot of money to develop. Users would love to come across this content, when appropriate.
Newspapers have remarkable content on businesses and events in their communities. They should be the authoritative source for local. They should be the experts on their communities and reap the traffic from searchers seeking that expertise.

For more on this, see also my May 2006 post, "Newspapers and local content", where I said:
Newspapers should own local.

When I want information about Microsoft, Amazon, or other Seattle area companies, the best source should always be the Seattle PI. When I want information about local restaurants, I should think the obvious place to go is the Seattle PI. When I want information about concerts, events, parks, politics, traffic, entertainment, news, anything local, the best place to go should be the Seattle PI.

Sites like Citysearch should look hollow and pathetic next to the content provided by your newspaper. The Wall Street Journal should seek out the Seattle PI for access to their in-depth reporting on Microsoft. Google Local and Yahoo Local should be begging the Seattle PI for access to their pool of local advertisers.

Newspapers should be the broker for local content. Newspapers should be the master of news and advertising content for their communities. Newspapers should be the experts of local.

eBay got taken on Skype

Carlo at TechDirt follows up on the Sept 2005 eBay acquisition of Skype, saying:
eBay's strategy for Skype has never been clear, and many of the things it touted in the deal -- like the ability for eBay sellers to put a Skype link in their listings, so potential buyers could easily call them -- haven't paid off.

While sales are expected to have tripled in 2006 to $195 million, that's still not enough of a money-spinner to validate the billions eBay dropped for the company. Skype faces a big challenge in converting users of its free services into paying customers.

Overall, eBay's strategy to build Skype's business remains as unclear today as it was when they said they were buying it.
See also my post, "eBay buys Skype", that I wrote at the time of the acquisition where I said, "That's one costly deal for eBay, an inflated purchase price to acquire a company that has little to do with e-commerce."

Update: For an opposing point of view, see Henry Blodget's post, "Analyzing Skype", where he claims that 2007 Skype revenues "should" approach $500M. If the 2005 purchase price is less than 10x 2007 revenues, Henry argues, it is "a far cry from the outrageous binge-buy that many commentators described."

Update: A few months later, Henry Blodget changes his mind, saying, "Skype has lost focus ... eBay ... has no strategic reason for owning the company. eBay should sell Skype ... and focus on its core commerce business."

Update: Nine months later, Rachel Konrad at the AP reports that "eBay takes $1.43B charge for Skype" and "$900 million will be ... impairment, essentially acknowledg[ing] that San Jose-based eBay, one of the world's largest e-commerce companies, drastically overvalued the $2.6 billion Skype acquisition."

Tuesday, January 09, 2007

Amazon better than husband

I love this article in The Onion, "Amazon Recommendations Understand Area Woman Better Than Husband".

Some extended excerpts:
Pamela Meyers was delighted to receive yet another thoughtful CD recommendation from Amazon.com Friday, confirming that the online retail giant has a more thorough, individualized, and nuanced understanding of Meyers' taste than the man who occasionally claims to love her, husband Dean Meyers.

While the powerful algorithms that power Amazon's recommendations generator do not have the advantage of being able to observe Meyers' body language, verbal intonation, or current personal possessions, they have nonetheless proven more effective than Dean, who bases his gift-giving choices primarily on what is needed around the house, what he would like to own, and, most notably, what objects are nearby.

"I don't know how Amazon picked up on my growing interest in world music so quickly, but I absolutely love this traditional Celtic CD," Meyers said. "I like it so much more than that Keith Urban thing Dean got me. I'm really not sure what made him think I like country music."

"I was having some tea from that Nebraska Cornhuskers mug Dean got me for Valentine's Day, when a little e-mail from Amazon popped up out of the blue," Meyers said. "Just completely out of the blue."

"It was nice to know that on my birthday, someone or something was out there thinking about me, and what boxed sets I wanted," she added.

While Amazon is almost always accurate, the company does occasionally make a gift recommendation that does not suit her tastes ... Still, Meyers lauded Amazon's attempts at spontaneity. "At least it's trying," said Meyers.
Very amusing.

Monday, January 08, 2007

The death of Shmapdy

One of my servers, the one running shmapdy.com, died recently. It was a horrible death of total hard disk corruption from which there was no recovery.

We shall now remember good Shmapdy. It was a server that lived a full life, a pleasantly geeky tale of multiple reincarnation.

In its youth, the box was a well-loved Dell Inspiron 8100 laptop. Oh, the numbers we did crunch, the games we played. Those were the halcyon days of yore.

Sadly, the backlight on the screen failed in 2002. Crippled, the machine lost its ability to be a laptop. But, it is too early for you, young Shmapdy. You still have more life to live.

I installed Linux and brought the laptop back to life as a small, battery-backed, headless server. I tossed it on a shelf and had it serving traffic for shmapdy.com, a personal website used mostly for photo sharing.

But, laptops are not used as servers for a reason. They are not designed to be running all the time. They get hot and wear out their little fans. Their wee disks get unhappy from all the mindless spinning.

Shmapdy.com went down from overheating a few times in 2003, but installing programs to run the fans constantly and to spin down the drives during idle periods resuscitated the ailing Shmapdy.

Then, came the drive failures. Twice, once in 2005, once in 2006, Shmapdy went under from drive corruption. Each time, booting under Knoppix and repairing the drive brought it back.

Over the recent holidays, it just became too much. Shmapdy had enough. The drive failed unrecoverably, corrupting the OS and much of the data on the filesystem along with it.

Shmapdy passed on. Shmapdy is no more. It has ceased to be. It has expired and gone to meet its maker. It is a stiff. Bereft of life, Shmapdy rests in peace. Shmapdy is pushing up the daisies. It kicked the bucket, shuffled off its mortal coil, rung down the curtain, and joined the choir invisible. This is an ex-Shmapdy.

Update: In my odd, geeky world, Monty Python references are implicitly understood, but I should make it explicit for the rest of the planet that does not memorize such useless trivia. That last paragraph in my original post was adapted from the Dead Parrot Sketch (video).

Sunday, January 07, 2007

Discovery in corporate memory

I recently visited FXPAL, a Xerox-Fuji research lab next to Xerox PARC, and saw some fun demos of some of the projects there (thanks, Jeremy).

Afterwards, I could not stop talking about the ProjectorBox demo I saw. The idea behind ProjectorBox is to record presentations automatically and make all of them easily searchable.

So, for example, you can search for [personalization] across all talks given in the last couple years at FXPAL. Not only are matches in the text of the slides shown, but mousing over any of the slides also brings up an audio clip of the room (the speaker and the audience) from when that slide was being shown. This makes it quick to get back to information in a talk you saw before or to find information in talks you may have missed.

The researchers (Laurent Denoue, David Hilbert, Daniel Billsus, and John Adcock) gave a Google Tech Talk called "Seamless Information Capture and Discovery for Corporate Memory" that gives a good feel for how ProjectorBox works.

The talk starts by framing the problem. They want to capture every presentation at the company and make them easily accessible, but "people don't do extra work just for the purpose of sharing information", so they want to capture the presentations "without adding any extra overhead" for speakers. They also want finding information in the presentations to be easy, helping people find useful information even if (perhaps especially if) they don't know the content is available.

Their solution is a small box that sits between the projector and the presenter's laptop. It does video capture, then OCR on the captured video feed, then segments and indexes all the slides and audio.
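Roughly, I imagine the pipeline looks something like the sketch below. This is my guess at the shape of it, not FXPAL's code, using pytesseract and Pillow as stand-in OCR and a crude text-change heuristic for slide segmentation:

```python
# A minimal sketch, under my own assumptions, of a capture -> OCR -> segment
# -> index pipeline. Uses pytesseract and Pillow as stand-in OCR; this is not
# ProjectorBox's actual implementation.
import glob
from collections import defaultdict

from PIL import Image
import pytesseract

def ocr_frames(frame_dir):
    """OCR each captured frame; file names are assumed to sort by time."""
    for path in sorted(glob.glob(f"{frame_dir}/*.png")):
        yield path, pytesseract.image_to_string(Image.open(path))

def segment_slides(frames, min_new_words=5):
    """Start a new slide whenever the on-screen text changes substantially."""
    slides, last_words = [], set()
    for path, text in frames:
        words = set(text.lower().split())
        if len(words - last_words) >= min_new_words:
            slides.append({"start_frame": path, "text": text})
        elif slides:
            slides[-1]["text"] += " " + text
        last_words = words
    return slides

def build_index(slides):
    index = defaultdict(set)   # term -> slide numbers
    for i, slide in enumerate(slides):
        for term in slide["text"].lower().split():
            index[term].add(i)
    return index

slides = segment_slides(ocr_frames("captured_frames"))
index = build_index(slides)
print(sorted(index.get("personalization", [])))
```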

The demo of searching the content starts at 12:43 in the video. They do a search, up pop the relevant slides, and mousing over a slide starts the audio from the presentation at that point.

I drool thinking about a generalized version of this. It would be incredibly cool to have this for all recorded lectures on Google Video. Or for all ACM conference talks. Or for every talk and lecture everywhere.

As the FXPAL talk continues, at 25:02 the researchers describe PAL Bar. PAL Bar is a toolbar-based recommender that suggests documents using the current context (e.g. the content of the current page) and the user's social network. Like Findory, the goal is to help people discover information they did not know about without requiring any effort from the user.

There is also a WWW 2006 paper, "Seamless Capture and Discovery for Corporate Memory" (PDF), that discusses ProjectorBox and PAL Bar. The paper is good, but the video is also useful to get a better feel for how ProjectorBox works. Very cool project.

Big Thinker talk by Susan Dumais

Bill Slawski at Search Engine Land posts a great summary of Microsoft Researcher Susan Dumais' Big Thinker talk at Yahoo Research.

A low res video of Susan's talk is available. All of the talk is worthwhile if you can tolerate the quality of the video, but don't miss the part starting at 41:50 where Susan talks about her work on personalized search.

For more on the work of Susan Dumais and her group at MSR on personalized search, see also my previous posts, "Beyond the commons: Personalized web search" and "Personalized search paper from MSR".

For thoughts on how this and related work at MSR might be used by Microsoft to pound Google over the head, see also my previous post, "Using the desktop to improve search".

Findory traffic Q4 2006

Findory saw growth again in Q4 2006. However, some of the underlying numbers show lingering problems from the loss of part of the search engine traffic in early 2006.

This is the chart of hits to Findory's webservers (excluding robots and redirects):


It shows a big jump in Q4 2006. Good news, it would seem. But, after digging in, much of this growth appeared to be on Findory's feeds, traffic that may not be easily monetizable.

I did a second analysis that attempted to limit the data to more directly monetizable page views. The data on this was only easily available for the most recent seven quarters.


This trend looks similar, but shows a bigger decline after Q1 2006 and smaller growth in Q4 2006. The growth is good, but it is unfortunate that so much of it is on the feeds and other areas that are not as valuable to Findory.

Robot and similar traffic is excluded from the data going into these analyses but, I have to say, it is amazing how much robot traffic Findory gets. Web crawlers hammer the poor webservers, generating another 2-3x the traffic reported here. I would be curious to know if other news sites with lots of rapidly changing content also see this level of crawler interest.

For more on the Q1 2006 drop, see the posts on Findory's traffic after Q2 2006 and Q3 2006.

Owning the starting point on the internet

Rich Skrenta has a thought-provoking post that argues that Google now has a monopoly on online search and advertising.

Some excerpts:
Google has won both the online search and advertising markets.

Google is the start page for the Internet. The net isn't a directed graph. It's not a tree. It's a single point labeled G connected to 10 billion destination pages .... [And] it turns out that owning the starting point on the Internet is really, really valuable.

Search engines have zero user switching costs .... Zero switching costs lead to a winner-take-all market for the leader. Even a modest initial lead will snowball until majority market share is reached and maintained. This is because, faced with a choice between two products, in the absence of switching costs users will choose the better one, even if it is only slightly better.

Even if a competitor such as Yahoo, MSN or Ask were to fully close the [technology] gap at this point, they would still have to overcome the final brand perception gap ... Market research shows that users who see Google's logo on top of Yahoo's results perceive the results to be of higher quality.

Users will stick with the leader ... Forces reinforce a feedback loop of success for the leader ... Advertisers will be drawn to the leader ... strengthening its lead .... All Hail the New King Google.
Rich goes on to argue that Yahoo and others should give up, accept their new Google overlords, and work with Google instead of against them.

As much as I like Rich's piece, I disagree with parts of it. First, there is a hidden assumption that all people have the same criteria for determining what is the best search engine. Despite the strength of the Google brand, I suspect it is more the case that different people have different preferences for user interfaces and different perceptions of the relevance of the search results. Preferences may even vary for the same person when doing different tasks. These are holes in Google's armor which, when pried open, could leave the giant vulnerable.

Second, I think it is actually Microsoft that owns the starting point to the internet, not Google. Microsoft owns the default operating system (WinXP), default browser (IE), default Office suite (MS Office), and often the default start page (MSN). While it is true that Microsoft has not yet been able (or willing) to use this to full effect, I would not be as quick as Rich to say that PC software is irrelevant.

See also my previous posts ([1] [2] [3] [4]) on how Microsoft's control of the desktop could change the game in search.

SIGIR 2006 Forum and workshop reports

Gary Price at ResourceShelf points out an online version of the Dec 2006 SIGIR Forum publication.

I was reading it on an airplane a couple weeks ago and had been meaning to write about it. The paper Gary highlights, "A Reference Collection for Web Spam" (PDF), is a good introduction to the WEBSPAM-UK2006 collection of hand-labeled spammy and non-spammy web pages that Yahoo Research has made available.

I also enjoyed the SIGIR Workshop reports, especially the AIRWeb summary, "Adversarial Information Retrieval on the Web" (PDF), and the notes from a session on exploratory search, "Evaluating Exploratory Search Systems" (PDF).

For more on AIRWeb at SIGIR 2006, see also my previous posts, "Blog spam at AIRWeb" and "Web spam, AIRWeb, and SIGIR".

Thursday, January 04, 2007

Findory API expanded

Findory just launched a substantially expanded version of its API.

The Findory API provides access to Findory data, including news, weblog, video, and podcast articles from its crawl, related articles, personalized recommendations based on your past reading, and your reading history at Findory.

The new version of the API is much more powerful than the old. For example, you can access recent articles for any source, articles that match a keyword, or articles by subject. With this, you could build your own news site or feed reader on top of Findory instead of running your own crawl.

You can get access to your own reading history. You could share it with others, run other searches based on your Findory reading habits, or combine your reading with that of other people to create aggregated lists of favorite articles.

You can even get related articles for any article or source. Combine that with your reading history, maybe throw in some filters, and you can generate your own recommendations.
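As a sketch of that last idea -- and the two fetch functions here are hypothetical stand-ins, not the actual Findory API calls, which are documented on the API page -- you could score unread candidates by how many of your past reads they are related to:

```python
# A minimal sketch of the recommend-from-related-articles idea. The two fetch
# functions are hypothetical stand-ins for Findory API calls; the real
# endpoint names and formats live in the Findory API documentation, not here.
from collections import Counter

def fetch_reading_history(user_token):
    # Hypothetical: would call the Findory API for this user's read articles.
    raise NotImplementedError

def fetch_related(article_id):
    # Hypothetical: would call the Findory API for articles related to this one.
    raise NotImplementedError

def recommend(user_token, top_n=10):
    history = fetch_reading_history(user_token)
    seen = {article["id"] for article in history}
    scores = Counter()
    for article in history:
        for related in fetch_related(article["id"]):
            if related["id"] not in seen:          # simple filter: skip already-read
                scores[related["id"]] += 1         # one vote per history item
    return [article_id for article_id, _ in scores.most_common(top_n)]
```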

In fact, there is enough here in the new Findory API that, with a database for caching data and for remembering readers' histories, you pretty much could build your own version of Findory with it.

Here is your chance. Build your dream news reader, complete with recommendations and customized filters! Try out the Findory API!

Wednesday, January 03, 2007

Google needs code monkeys too?

Saul Hansell from the NYT wrote an article today, "Google collects better data searching for the perfect hire".

In that article, Googler Todd Carlisle said:
Sometimes, too much schooling will be a detriment to you in your job.
Wow, is that a change. Too much education can interfere with your job, Todd says. While this kind of anti-intellectualism has infiltrated the culture at other companies, it is quite a shift to see it from Google.

To be fair, as Saul reports, Todd and other Googlers are trying to optimize the hiring process. They are attempting to find what factors indicate that people will be successful, happy, and productive at Google.

However, at least from what is in the NYT article, the current process sounds less data driven than ideology driven. Not only are the measures flawed -- individual job performance does not capture the substantial impact of experts advising other teams, for example -- but the replacement system they built has yet to be shown to outperform the original. From the article, "It is too early to tell if the system is working. The surveys have been in use in about a dozen areas for several months."

See also Googler Chris Sacca's post, "Did you get the memo about the new cover sheets we are using?", lamenting the influx of MBAs and middle management goo at Google.

[Found on Findory]

Update: On a lighter note, do not miss the "Code Monkey" video by Jonathan Coulton and Mike Booth. Very funny.