The Wrong Stuff

Error Message: Google Research Director Peter Norvig on Being Wrong

Google, the company, entered this world in 1998. I’m not sure how long it took for Google, the verb, to follow, but I do know that millions of people engage in that particular activity many, many times each day. For half of all Internet users worldwide, Google is the portal to the collected and digitized wisdom (and folly) of humanity. Google’s search engine has changed how we conduct research, plan vacations, resolve arguments, find old acquaintances, and check out potential mates. It’s also given us new ways to interact with maps, mail, books, news, and documents, radically reshaping the way we think about almost every imaginable medium.

Peter Norvig, the director of research at Google, has been involved in this project since its toddlerhood. Norvig joined the company in 2001 and, from 2002 to 2005, served as its director of search quality, a position that put him in charge of the company’s core Web search algorithms. Below, he and I talk about (among other things) how engineers think about error, what’s good about failing fast, and why Google buys cheap computers.

***

I’m interested in the way that attitudes about error vary across professional cultures (doctors typically think about error very differently than pilots and politicians, and so forth), as well as across the cultures of different companies, even within the same field. How would you characterize the overall attitude toward error at Google?

There’s a story going back to the founding of Google: One of the venture capitalists came to [company founders] Larry [Page] and Sergey [Brin] and said, “OK, the first thing you have to decide is, is this company going to be run by sales or by marketing?” They said, “We think we’ll take engineering.” He laughed and said, “Oh, you naive college kids, that’s not the way the real world works.” And they said, “Well, we want to try it.” Ten years later, that experiment is still running; engineering is still the center of the company. And it seems like it’s worked.

And, like you say, it does create a very different attitude toward error. If you’re a politician, admitting you’re wrong is a weakness, but if you’re an engineer, you essentially want to be wrong half the time. If you do experiments and you’re always right, then you aren’t getting enough information out of those experiments. You want your experiment to be like the flip of a coin: You have no idea if it is going to come up heads or tails. You want to not know what the results are going to be.
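
Norvig’s coin-flip analogy has a standard information-theoretic reading: an experiment whose outcome you can already predict carries little information, and the expected information peaks when the outcome is 50/50. A quick illustration in Python (this is the textbook entropy formula, not something from the interview):

```python
import math

def entropy_bits(p: float) -> float:
    """Expected information (in bits) from an experiment that
    succeeds with probability p and fails with probability 1 - p."""
    if p in (0.0, 1.0):
        return 0.0  # a foregone conclusion teaches you nothing
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

for p in (0.5, 0.7, 0.9, 0.99):
    print(f"P(success) = {p:.2f} -> {entropy_bits(p):.3f} bits per experiment")
# A 50/50 experiment yields a full bit; a 99%-predictable one almost nothing.
```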

What about errors not in experimentation but in implementation or execution? What do you do about mistakes like that, which can presumably compromise your product?

As an engineer, you’re just used to the idea that there are errors. No matter how good you think you are, the industry standard is that if you write 100 lines of program, there’s probably going to be one error in it. So you have to build all your systems expecting that. We’ve built an entire development process around the idea that errors exist and that their impact needs to be minimized.

What does that process look like?

Well, first, there are two kinds of error we deal with. One is, there’s a clear error in the code: It’s supposed to do one thing and it does something else. In that case, you know when you’ve got it wrong, and you’ll know when you’ve got it right. The second is, how good are the results? Say you do a search and it shows you links; there is no definitive right or wrong to the question of, “Did it work?” But you can say, “Well, this one worked better than that one.”

That’s interesting. Thomas Kuhn, the great historian and philosopher of science, makes a similar point: You can’t say whether an individual theory is right; you can only say which of two theories fits the facts better.

Right. And we test at both levels: the clear-cut case where it’s wrong, as well as the one where you’re trying to figure out what works better. At the first level (and we share this with most software companies), before anyone can check in a piece of code that they’ve written, somebody else has to sign off on it. And then we have all these review processes and test processes at multiple levels to see if [your code] gives the right answer. A large proportion of the code you write is testing what somebody else wrote or what you yourself wrote. The work you’re doing is often about, “Am I getting the right answer?” rather than, “How do I compute the right answer?”
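
To make the two levels concrete, here is a minimal sketch in Python; the names (tokenize, mean_score, judge, ranker) are invented for illustration and are not Google’s:

```python
# Level 1: a clear error in the code -- it's supposed to do one thing
# and does another. You know when you've got it right.
# (tokenize is a stand-in, not real Google code.)
def tokenize(query: str) -> list[str]:
    return query.lower().split()

assert tokenize("New York Times") == ["new", "york", "times"]

# Level 2: "how good are the results?" There is no right/wrong,
# only better/worse, so you score rankings over sample searches.
def mean_score(ranker, queries, judge):
    """judge(query, results) returns a graded score, not pass/fail."""
    return sum(judge(q, ranker(q)) for q in queries) / len(queries)

# Hypothetical comparison: the most you can claim is "B beat A."
# if mean_score(new_ranker, sample, judge) > mean_score(old_ranker, sample, judge):
#     ship_the_new_ranker()
```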

What do you do about technological failures? I assume that sometimes it’s not the software but the hardware that goes wrong and that the price of those problems can be pretty steep.

I think Google was early in accepting hardware errors. Other companies have tried to say, “Well, if you can buy big, expensive computers that are more reliable, then you’ll have fewer breakdowns and you’ll do better.” Google decided to buy lots of cheap computers that break down all the time, but because they’re so much cheaper, you can design the system with multiple backups and ways to route around problems and so forth. We just architect the system to expect failure. Google was very innovative in this area and saved a lot of money as a result.
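
A toy sketch of what “architect the system to expect failure” can look like; the replica names, the 20 percent failure rate, and the retry policy are all assumptions for illustration, not Google’s actual design (which also involves timeouts, health checks, and load balancing):

```python
import random

class ReplicaDown(Exception):
    """Raised when a (hypothetical) cheap machine fails mid-request."""

def query_replica(replica: str, query: str) -> str:
    # Stand-in for a network call; cheap hardware fails often.
    if random.random() < 0.2:  # assume a 20% per-request failure rate
        raise ReplicaDown(replica)
    return f"results for {query!r} from {replica}"

def robust_query(replicas: list[str], query: str) -> str:
    """Route around problems: try replicas in random order until one answers."""
    for replica in random.sample(replicas, len(replicas)):
        try:
            return query_replica(replica, query)
        except ReplicaDown:
            continue  # expected; just move on to the next machine
    raise RuntimeError("all replicas failed")  # rare once you add enough backups

print(robust_query(["node-a", "node-b", "node-c"], "peter norvig"))
```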

How did that innovation come about?

In part, I think it was visionary. But, in part, it was just that the problem we are attacking made it easier. If you’re doing a Web query and some of the computers break in the middle and you don’t get exactly the same result as someone else doing the same query, well, OK. You don’t want to drop the top result; if I do a search of the New York Times, I want nytimes.com to be the top result. But what should the 10th result be? There is no right answer to that. If a hardware error means we dropped one result and somebody had a different result at No. 10, there’s no way of saying that’s right or wrong. Whereas if I’m a bank, I can’t say, “Oh, one out of every million transactions, I’m just going to lose that money.” I can’t have that level of failure. But at a search company, you’re more tolerant of error.

I’ve been at both ends. My previous job was at NASA, where you really don’t want your shuttles to blow up very often. So there they spend hundreds of millions of dollars to protect their astronauts’ lives. Here, we’re kind of at the other end. Failure is always an option at Google.

I want to talk about innovation, because it seems to me that the price of trying new things is that most of them fail. How do you build a tolerance for that kind of failure into a public corporation that’s accountable to its bottom line? Getting things wrong might be necessary to getting things right, but failure can be costly.

We do it by trying to fail faster and smaller. The average cycle for getting something done at Google is more like three months than three years. And the average team size is small, so if we have a new idea, we don’t have to go through the political lobbying of saying, “Can we have 50 people to work on this?” Instead, it’s done more bottom-up: Two or three people get together and say, “Hey, I want to work on this.” They don’t need permission from the top level to get it started because it’s just a couple of people; it’s kind of off the books.

Two or three months isn’t very much time. How do you decide at that point whether an idea is going to succeed or fail?

When you talk about being wrong, I think of that mostly from a statistical inference point of view, and within the company, we’re really good at making decisions based on statistics. So if we have an idea (“You know, here’s a way I can make search better”), we’re really good at saying, “Well, let’s do an experiment. Let’s compare the old way with the new way and try it out on some sample searches.” And we’ll come back with a number and we’ll know if it’s better and how much better and so on. That’s our bread and butter.
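
The experiment he describes is, in essence, an A/B test: run old and new rankers on sample searches, compare a metric, and check whether the difference is bigger than chance. A minimal sketch using a standard two-proportion z-test; the counts are invented:

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z(success_a, n_a, success_b, n_b):
    """z-test for 'did variant B really do better than A, or was it luck?'"""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided
    return p_b - p_a, p_value

# Hypothetical data: users clicked a top result on 5,400 of 10,000
# old-ranker searches vs. 5,650 of 10,000 new-ranker searches.
lift, p = two_proportion_z(5400, 10000, 5650, 10000)
print(f"lift = {lift:+.3f}, p = {p:.4f}")  # small p: likely a real improvement
```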

OK, but what about things that can’t be measured experimentally?

Right, that’s the question. When it comes to something that doesn’t really have statistics, that’s harder for us. Take something like launching Gmail, where it wasn’t a question of, “Can we make it work?” It was a question of, “Well, gee, who are the other players in this game?” It was Microsoft, Yahoo, and AOL, and they were all either partners or rivals or both to us, so then the question is, “How are they going to react if we do this?” And we had this idea to offer it for free but to have ads on the sides, so it was like: “Are people going to think that’s creepy?”

You can’t really do experiments for things like that; you can’t get at it through statistics. I suppose you can have focus groups, but focus groups aren’t really what matters; it’s more about what the press is going to say. So those types of decisions have to be made more on gut instinct than on a statistical basis. And as a company, that’s harder for us.

Can you parse the idea of gut instinct for me? What is it? What are you actually relying on when you make those so-called instinctive decisions?

Good question. I guess it’s experience. It’s projecting into the future based on what you’ve done in the past. Is this going to work? Are we going to be able to build it on time? Is it going to perform as expected? You get a feel for things like that by having built similar projects.

The harder part, I think, is judging the likely reaction. Yes, we can build this, but are people going to like it? Or: Is somebody else going to build a better one first? Or: How are other companies going to react to this? I guess that’s also an experience thing, but that part’s much harder, because it involves not just what we can do, but how other people are going to respond to what we can do.

So how good would you say your gut instinct is? When you fall in love with an idea or a project, when your intuition says, “We should go for this, this is going to work,” are you usually right?

I think we often have good intuition about where things are going in general. As a company, we made a big bet on mobile and the Android platform because we knew that people were going to be using their phones rather than their desktops for computing, and that gamble worked out. But the details were harder to judge: Was creating the Android platform the right way to do it? Should we have partnered with someone else or created a different system? A lot of the time, the strategic ideas are clear, but how to get there is not.

Earlier in this series, I interviewed Ira Glass, the host of This American Life, and he said that for every story we hear on the show, they start developing 10 or so and go into production on three or four. He also talked about sitting in on a meeting at The Onion and learning that they kill 30 or 40 pretty funny headlines in order to generate one really funny one. What about you guys? What would you say is your failure-to-success rate?

It’s hard to say, because it varies a lot. Some teams are taking on projects that they pretty much know can be done. Let’s say you’ve got some storage system that’s not fast enough or big enough, and we need to design one that’s going to be better. Essentially we have to build that; we have no other choice. So we know it’s going to get done. Maybe it won’t quite meet the specifications (maybe it takes a little too long, maybe it’s not quite as fast), but it gets done. So there you have a very high success rate.

Then with things like search quality, we have all these ideas of how to make search better, and I’d say maybe half of those end up working. Sometimes you start down a path and then you find out it doesn’t help, it doesn’t make any difference.

Half strikes me as a pretty good ratio. What about the success rate for the kinds of experiments your users see, like all the stuff in Google Labs? Do those catch on fairly often, or is it mostly like, “Well, that was a nice idea”?

Most of the things you see in Google Labs are there because we didn’t quite know what to do with them, so certainly less than half of them become hits. I don’t know what the exact number is. Some of them are already winnowed out by the time they get there; if we thought they were really big, they’d be on the main site rather than the Labs site. Some of them are there because it was easier; maybe there’s a security issue or brand-image issue with making it part of the main site, and we didn’t want to go through that process if we didn’t have to. So we said, “Well, let’s throw it on Labs, and if it becomes really popular, we’ll think about how to integrate it.”

Despite all the experiments Google has initiated since it began, the vast majority of your profits (I’ve heard between 97 percent and 99 percent) come from just one thing: advertisements related to search. Obviously, then, income generation is not the metric you’re using to decide if a product succeeds or fails. What is?

You’re right that most of the money comes in through ads. But you can think of everything else as bringing in customers so that they’ll click on the ads. We know the value of adding a new customer, and we can see what the usage is of individual sites. So we can say, “This feature is popular, our usage is going up, and because usage is going up, we’re making more money.” We do things to make Google better so people will come to Google and click on the ads.

What’s interesting, though, is that we’re now at the scale where we can also do things that just make the Web better. We do a lot of open-source projects, because if we release code and some other company makes something really cool that makes the Internet better, we benefit, too. About half of Internet users are using Google search, so if another company builds something and two people start using the Internet because of it, we’re going to get one of them.

Google has been remarkably successful at creating popular products. How does the company create a culture that’s conducive to generating new ideas?

Well, we have great people, and that’s a huge part of it. But I think the main thing is just trying a lot of ideas. We’ve built the ultimate system for making demos internally. If a startup company has an idea, it’s like, “Well, I need a copy of the Web to make my idea work, I need a thousand computers, I gotta go raise money to do that.” So they spend months or years raising money and building infrastructure.

Whereas we have all of that. Somebody can learn how to use it on their first day and say, “OK, I have an idea, and these pieces are already here, and I can just connect them together and see if it works.” And if it doesn’t work today, next week I’ll have another idea. And I haven’t wasted months going down one path. It’s like playing with Tinkertoys or something. You plug ’em together, you try something, and if you think it’s good, you keep going. And if it isn’t, you put them down and start on something new.

I’m struck by how long some products stay in beta testing at Google. Gmail, for instance, was launched in 2004 but wasn’t upgraded from beta status until 2009, by which point it had 146 million users. What’s the reasoning behind that?

There’s two parts to that, a technical engineering part and a public relations part. From the technical engineering point of view, you define a project and say, “These are the features this should have, and until it has all those features, it’s still beta.” But then there’s another decision, which is: When is it worth launching? Something can be missing a couple of features and still be worth launching, and we’ve chosen to do it that way, whereas other companies seem less likely to do so. I don’t know why the PR people are open to that here. Maybe it gives the impression that Google is always changing and products aren’t quite finished. And maybe they want to give that impression.

The whole beta model is completely at odds with conventional production and manufacturing; you never see General Motors release a beta version of a car, for instance. What’s the cost-benefit tradeoff involved in releasing versions of products that you know are still flawed and incomplete?

There’s a big difference between the products we have, which mostly live on our servers, where we have the ability to update them every day, and a car, which, once you ship it out, becomes very expensive to recall. Traditional software is somewhere in between; if you’re selling CDs that you put in boxes and ship to stores, there’s a cost to updating that, but it’s less than the cost of a car. But for us, it’s a process of continual change. We expect to change our servers every day; that’s natural for us.

Is part of the benefit to you the open-source advantage the fact that your customers find the flaws for you?

Yeah, sure, both explicitly in terms of them saying, “Hey, here’s a problem,” and also implicitly, in terms of how they interact with it. We see the statistics, we measure how often they click on the first result, how often they have to do a follow-up search, and we get an idea of whether they’re satisfied or not. And then we make a change and see if the statistics look better.
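
A toy version of that implicit-feedback loop; the log fields and session structure here are assumptions for illustration, not Google’s actual telemetry:

```python
# Each session records whether the user clicked the first result and
# whether they had to issue a follow-up search (a dissatisfaction signal).
sessions = [
    {"clicked_first": True,  "followup_search": False},
    {"clicked_first": False, "followup_search": True},
    {"clicked_first": True,  "followup_search": False},
]

def satisfaction_signals(sessions):
    n = len(sessions)
    first_ctr = sum(s["clicked_first"] for s in sessions) / n
    refine_rate = sum(s["followup_search"] for s in sessions) / n
    return first_ctr, refine_rate

ctr, refine = satisfaction_signals(sessions)
print(f"first-result CTR: {ctr:.0%}, follow-up-search rate: {refine:.0%}")
# After a change ships, recompute and see if the statistics look better.
```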

Last week in this column, I spoke with Wikipedia co-founder Larry Sanger about how we use the Web to organize, validate, and disseminate information. Google’s stated mission is to organize the world’s information. Does the company do anything to prioritize more accurate or at least more credible results in its searches?

In a sense, the core innovation behind Google (this notion of PageRank, of how many other people are pointing to your result) is a crude measure of credibility. And that came about because of frustration with the quality of the results yielded by earlier search engines. The first search engines were built on a kind of library-sciences technology: To do a search, you looked for documents that mentioned the keywords the most times. So you would often end up with results that were off-target but happened to have a high density of the keywords. PageRank said: If everyone else is pointing to this page, it must be a good one.
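
The idea admits a compact statement: a page’s rank is the probability that a “random surfer” ends up there, following a random outlink with probability d (the damping factor, 0.85 in the original PageRank paper) or jumping anywhere at random otherwise. A minimal power-iteration sketch over a toy link graph:

```python
def pagerank(links, d=0.85, iterations=50):
    """links: {page: [pages it points to]}. Returns {page: score}."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1 - d) / n for p in pages}  # random-jump share
        for p, outlinks in links.items():
            if outlinks:
                share = rank[p] / len(outlinks)
                for q in outlinks:
                    new[q] += d * share  # pass credibility along links
            else:  # dangling page: spread its rank everywhere
                for q in pages:
                    new[q] += d * rank[p] / n
        rank = new
    return rank

toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
print(pagerank(toy_web))  # "c" scores highest: everyone points to it
```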

That’s still how it works to a degree, but multiple things have happened since then. One is that now there’s a war between the good guys and the spammers, people who are trying to artificially inflate certain pages by reverse-engineering our system and building pages that can falsely claim credibility. So we have to watch out for that. And then part of the problem with credibility by citation is that it takes time. You don’t instantly gather links the first day something is published, which makes it harder to follow new news items and so on. So we need a way to weigh the freshness of a new result versus the accumulation of credibility over time. But yes, we are always looking at quality and credibility as well as salience.
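
One illustrative way to weigh freshness against accumulated credibility is to blend a decaying recency signal with the link score. This is a guess at the general shape of the trade-off, not Google’s formula; the half-life and weight are arbitrary:

```python
import math

def blended_score(link_score: float, age_days: float,
                  half_life_days: float = 7.0, w: float = 0.7) -> float:
    """Illustrative only: mix accumulated link credibility with recency,
    so brand-new pages aren't buried before links have time to arrive."""
    freshness = math.exp(-math.log(2) * age_days / half_life_days)
    return w * link_score + (1 - w) * freshness

# A day-old news story with few links can still outrank a stale page:
print(blended_score(link_score=0.05, age_days=1))    # fresh, few links  ~0.31
print(blended_score(link_score=0.30, age_days=400))  # old, well-linked  ~0.21
```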

Interesting. It sounds like PageRank uses consensus as a stand-in for credibility. That slippage is hardly unique to Google (all of us use consensus as a stand-in for credibility sometimes), but it can be pretty misleading.

Yeah, that’s always a problem. One way we try to counter that is with diversity. We haven’t figured out any way to get around majority rule, so we want to show the most popular result first, but then after that, for the second one, you don’t want something that’s almost the same as the first. You prefer some diversity, so that’s where minority views start coming in.
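
A standard technique for “most popular first, then prefer diversity” is greedy re-ranking in the style of maximal marginal relevance; the similarity function, weights, and documents below are stand-ins, not Google’s ranking code:

```python
def rerank_with_diversity(candidates, similarity, lam=0.6, k=10):
    """candidates: [(doc, popularity_score)]. Greedily pick the best
    trade-off between popularity and novelty vs. what's already shown."""
    remaining = list(candidates)
    picked = []
    while remaining and len(picked) < k:
        def mmr(item):
            doc, score = item
            max_sim = max((similarity(doc, p) for p, _ in picked), default=0.0)
            return lam * score - (1 - lam) * max_sim  # penalize near-duplicates
        best = max(remaining, key=mmr)
        picked.append(best)
        remaining.remove(best)
    return [doc for doc, _ in picked]

def sim(a, b):  # toy similarity: word overlap of titles
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb)

docs = [("majority view", 0.9), ("the majority view", 0.85), ("minority report", 0.5)]
print(rerank_with_diversity(docs, sim, k=3))
# The near-duplicate drops below the minority view despite its higher score.
```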

What do you think have been Google’s biggest mistakes?

I can’t speak for the whole company, but I guess not embracing the social aspects. Facebook came along and has been very successful, and I may have dismissed that early on. There was this initial feeling of, “Well, this is about real, valid information, and Facebook is more about celebrity gossip or something.” I think I missed the fact that there is real importance to having a social network and getting these recommendations from friends. I might have been too focused on getting the facts and figures to answer a query such as “What digital camera should I buy?” with the best reviews and facts, when some people might prefer to know “Oh, my friend Sally got that one; I’ll just get the same thing.” Maybe something isn’t the right answer just because your friends like it, but there is something useful there, and that’s a factor we have to weigh in along with the others.

What about you yourself? What have you personally been most wrong about?

One thing is how fast things change. I was in a meeting a while ago and somebody was discussing a new project (this was in an area I hadn’t touched for a while) and I said, “Oh, isn’t it the case that such and such?” And they kind of snorted derisively and said, “Yeah, well, that’s the way the Web was four years ago, but that approach doesn’t work anymore.” I think that’s happening constantly. You think you have this experience (and we talked about how important experience is for having intuitions), but experience can go out of date very quickly.

If you could hear someone else interviewed about being wrong, who would it be?

Last night I saw Jeff Ma speak; he’s the MIT student who did the card-counting in Vegas, which the movie 21 was about. He talked about one of his first days betting and how there are certain situations where the statistics say, “If you’re in this position, you should double your bet.” So Ma finds himself in that position and doubles his bet, and then the dealer deals himself 21 and Ma loses $50,000. Then a couple of hands later he was in the same situation, and now he’s down $100,000. So he went back to his room and said, “What did I do wrong?” He thought about it and said, “I didn’t do anything wrong; the statistics are what they are and I did the exact right play for the statistics. The dealer just got lucky.” So he went back and kept playing the same strategy and ended up winning $70,000 or something over the weekend. He makes the point that if you’re making the right decision, even if you get a bad result, you’re not really wrong.
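
Ma’s distinction between a good decision and a good outcome is easy to see in simulation: a bet with positive expected value still loses individual hands often. The 53 percent win probability below is invented for illustration and is not blackjack’s real edge:

```python
import random

def simulate(bets: int, p_win: float = 0.53, stake: float = 50_000) -> float:
    """Net result of repeatedly making a +EV bet. The expected value per
    bet is (2 * p_win - 1) * stake = +$3,000, yet any single bet still
    loses 47% of the time."""
    return sum(stake if random.random() < p_win else -stake
               for _ in range(bets))

print([simulate(2) for _ in range(5)])  # a couple of hands: anything can happen
print(simulate(10_000))                 # the edge shows up in the long run
```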

And if I could interview a dead guy (and automatically improve my French, while we’re wishing for the impossible), I’d take [French mathematician and astronomer Pierre-Simon] Laplace. I think that he deserves most of the credit for Bayesian probability theory, and most of Bayes’ fame comes from having his name on the theorem, not for actually doing the work.


Kathryn Schulz is the author of Being Wrong: Adventures in the Margin of Error. She can be reached at kathryn@beingwrongbook.com. You can follow her on Facebook and on Twitter.

This blog features Q&As in which notable people discuss their relationship to being wrong. You can read past interviews with Wikipedia co-founder Larry Sanger, NASA astronaut-turned-medical-error-guru James Bagian, hedge-fund manager Victor Niederhoffer, mountaineer Ed Viesturs, This American Life host Ira Glass, celebrity chef Anthony Bourdain, Sports Illustrated senior writer Joe Posnanski, education scholar and activist Diane Ravitch, and criminal defense lawyer and pundit Alan Dershowitz.