Narayanan Shivakumar on Google's Hardware
Submitted by Dale on July 30, 2006 - 3:15am
The following notes are part 1 of 3 from Narayanan 'Shiva' Shivakumar's presentation at the July 27th VanHPC meeting. These notes cover Shivakumar's discussion of the Google hardware infrastructure.
- Google's goal is to organize the world's information
- How do we keep scaling? We like to pre-compute as much as we can.
- How do we build the right computing platform?
- Focus on price/performance
- Ensure app developers can use the infrastructure for other things
- Early decision was to use lots of commodity PCs
- Decided early on they needed to partition data across a lot of machines because of data scale (a rough sketch of this idea follows this list)
- Lots of PCs requires the ability to swap out bad hardware and have things continue to work
- Lots of PCs means a big heat problem
- In-house racks, PC motherboards, low-end storage, Linux
- Buy things out of the backs of trucks if we can
- Key challenges: affordable high performance networking, power (in)efficiency
- Problem with networking is cost of fast NICs
- Networking doesn't scale in a good way
- We are always looking at how to hook up thousands of machines
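The partitioning and hardware-swap points above are only sketched in the notes. As a rough illustration, and not Google's actual design, here is a minimal Python sketch of hash-based partitioning with one replica per document; `NUM_SHARDS`, the in-memory `machines` list, and the `store`/`fetch` helpers are hypothetical names of my own.

```python
import hashlib

# Illustrative only: hash-partition documents across many cheap machines,
# keeping one replica so a dead machine can be swapped out without losing reads.
NUM_SHARDS = 8
machines = [dict() for _ in range(NUM_SHARDS)]  # each dict stands in for one PC's local store


def shard_for(doc_id):
    """Map a document id to a shard with a stable hash."""
    digest = hashlib.md5(doc_id.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS


def store(doc_id, contents):
    """Write to the primary shard and to the next shard as a replica."""
    primary = shard_for(doc_id)
    replica = (primary + 1) % NUM_SHARDS
    machines[primary][doc_id] = contents
    machines[replica][doc_id] = contents


def fetch(doc_id, dead=frozenset()):
    """Read from the primary unless that machine is marked dead."""
    primary = shard_for(doc_id)
    replica = (primary + 1) % NUM_SHARDS
    source = replica if primary in dead else primary
    return machines[source][doc_id]


store("doc-42", "the quick brown fox")
print(fetch("doc-42", dead={shard_for("doc-42")}))  # still served from the replica
```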
- Question: Do you know the mean time to failure on your PCs?
Answer: It's not just the machines, it's switches and other stuff too. We've built up a model of this. Can't give specific numbers because of variability, e.g., different hardware platforms.
- Question: On networking, have you done any experiments or analysis on multicast?
Answer: There are certain apps that this works for; it's not a large group. The search architecture doesn't need multicasting. For copying lots of data there is a limited application.
- Question: Have you looked at the hardware costs of multicasting?
Answer: Yes, but I can't say anything about that. Data copying is a big deal for us.
- Question: Using SNMP or custom apps for monitoring?
Answer: Early on we used Perl scripts. At our current scale we've gone through three generations of monitoring. It's a large-scale problem because of aggregation.
- Question: Doesn't the manpower of equipment management (hardware swapping) outweigh the benefit of using cheaper machines?
Answer: Look at the cost of Sun equipment; the human cost of swapping is not the significant cost.
- Question: In your statistics, have you found a product that never failed (hardware)?
Answer: I suspect not, but I don't have an answer for that.
- Question: You don't have Windows boxes; if you buy a company with a Windows product, what do you do?
Answer: We have a variety of client software on Windows, like desktop search. We have server products for the Windows market as well.
- We recognized that power was going to be an issue for us; performance per watt is an issue, even though other metrics are getting better
- Working with the chip manufacturers on cooling
- Lots of things going on in the area of general cooling
- Just providing the raw electric power can start to become an issue
- Question: Do you truly have heterogeneous systems, different machines all over the place?
Answer: Yes, we have a large Linux group to make sure things happen. We do try to keep hardware to the same specs (e.g., memory speed).
- Question: Are these desktop motherboards or workstation motherboards?
Answer: These are not desktop motherboards, because we use multithreading.
- Question: My inner hippy is screaming to ask what you do with dead computers. Does Google care about the environment?
Answer: First time I've been asked, and I'm not sure what we do with the machines. I will find out.
- Question: Are your data centers in different regions or in one place?
Answer: We have a number of data centers in different places. This helps keep network transit time down.
- Question: What is the scale of your production infrastructure?
Answer: If you read the newspapers, more than 10,000. Google doesn't comment publicly on this. It's a lot! The problem we have is that this is a competitive advantage. We talk about it as much as we can without giving anything away.
- Question: In terms of hardware/software faults, what is the typical ratio? Where are most of the faults?
Answer: How do you define a fault? Just failures? Slow response time on a query? Can't really answer the question.
- Question: Does an application ever migrate across data centers?
Answer: Yes. MapReduce will address this a little. (A toy sketch of the MapReduce model follows these notes.)
- Question: What happens if you see a lot of uptime? Is that a cause for concern (i.e., a preventative reboot)?
Answer: Pretty sure our application will crash before then.
- Question: Do you do preventive maintenance?
Answer: Yes, via routine replacement. If we know a batch is really bad, it gets replaced earlier. Sometimes we have to code around an issue. One engineer said this job was the first time he'd had to do preventive programming, because hardware memory error checking didn't work.
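The MapReduce answer above is terse. As a hedged illustration of the programming model it refers to, and not Google's implementation, here is a minimal single-process word-count sketch in Python; the `map_fn`/`reduce_fn` names and the `run_mapreduce` runner are my own simplifications.

```python
from collections import defaultdict

# Toy sketch of the MapReduce model mentioned above: the real system runs the
# map, shuffle, and reduce phases across many machines and handles failures,
# which this single-process runner does not attempt.

def map_fn(document):
    """Map phase: emit (word, 1) for every word in the document."""
    for word in document.split():
        yield word.lower(), 1


def reduce_fn(word, counts):
    """Reduce phase: sum all counts emitted for one word."""
    return word, sum(counts)


def run_mapreduce(documents):
    """Group intermediate pairs by key (the shuffle), then reduce each key."""
    grouped = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            grouped[key].append(value)
    return dict(reduce_fn(key, values) for key, values in grouped.items())


print(run_mapreduce(["the quick brown fox", "the lazy dog"]))
# {'the': 2, 'quick': 1, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 1}
```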