Google says a set of crushed wheels used for moving its server racks triggered a chain reaction that may have disrupted Search, Gmail, and other services for some users.
A rack of servers at one of its data centers started overheating to the point where CPUs were automatically throttled, ultimately because a set of rack wheels couldn't bear the weight of Google's cloud kit.
Steve McGhee, a solutions architect at Google Cloud, says Google users "most likely" wouldn't have noticed errors caused by the rack's crushed wheels. But the chain of events resulted in enough CPU throttling to cause "user harm".
Fortunately, the incident wasn't as serious as one from June last year, caused by a failure in Google's automation software, which took down Gmail, YouTube, and customers' applications. That incident prompted a big apology to customers and a commitment to do better in future.
This time the company has decided to tell the story to illustrate how far it goes to find the root cause of disruptions, even when they don't noticeably impact users.
The latest event came to light when Google recently kicked off an investigation after a site reliability engineer noticed a spike in errors from machines on its edge network that cache content users frequently access. The machines were immediately taken offline to stop them impacting customers, allowing other machines to take up the slack.
Google engineers noticed some Border Gateway Protocol (BGP) network errors, but their characteristics suggested issues with the machines rather than the router. Further investigation turned up kernel messages in machines on the edge network that revealed CPU clock throttling.
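As a rough illustration of that step, the sketch below counts throttle messages per CPU in a batch of kernel log lines. The message format is an assumption modeled on typical Linux x86 thermal-throttle log lines; Google has not published the actual messages its engineers saw, and the function name and sample log are hypothetical.

```python
import re
from collections import Counter

# Assumed message shape, modeled on Linux x86 thermal-throttle log lines.
# The article does not quote the kernel messages Google's engineers saw.
THROTTLE_RE = re.compile(
    r"CPU(\d+): .*temperature above threshold, cpu clock throttled"
)


def count_throttle_events(log_lines):
    """Return a Counter mapping CPU id to the number of throttle messages."""
    counts = Counter()
    for line in log_lines:
        match = THROTTLE_RE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


# Hypothetical sample log, invented for illustration only.
sample_log = [
    "[100.1] CPU3: Core temperature above threshold, cpu clock throttled (total events = 1)",
    "[100.2] CPU3: Core temperature/speed normal",
    "[105.7] CPU7: Package temperature above threshold, cpu clock throttled (total events = 4)",
    "[106.0] eth0: link up",
    "[110.3] CPU3: Core temperature above threshold, cpu clock throttled (total events = 2)",
]

throttle_counts = count_throttle_events(sample_log)
```

A per-CPU count like this only shows that throttling is happening; it says nothing about why the machine is hot.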
The engineers found that the failing systems were isolated to machines on a single rack. All of this investigation was happening remotely. Unable to explain why the rack was overheating enough to cause kernel errors, the engineers then asked Google's on-site data-center workers to physically inspect the problem rack.
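The narrowing-down described here, from a set of failing machines to one rack, can be sketched as a simple grouping. This is a minimal sketch, not Google's tooling: the machine names, rack labels, and both function names are hypothetical.

```python
from collections import defaultdict


def group_failures_by_rack(machine_to_rack, failing_machines):
    """Group failing machine names under the rack that houses them."""
    by_rack = defaultdict(list)
    for machine in failing_machines:
        by_rack[machine_to_rack[machine]].append(machine)
    return dict(by_rack)


def single_rack_culprit(machine_to_rack, failing_machines):
    """Return the rack id if every failing machine sits in one rack, else None."""
    grouped = group_failures_by_rack(machine_to_rack, failing_machines)
    if len(grouped) == 1:
        return next(iter(grouped))
    return None


# Hypothetical inventory: machine name -> rack label.
inventory = {
    "edge-01": "rack-A",
    "edge-02": "rack-A",
    "edge-03": "rack-B",
    "edge-04": "rack-B",
}

culprit = single_rack_culprit(inventory, ["edge-03", "edge-04"])
spread = single_rack_culprit(inventory, ["edge-01", "edge-03"])
```

When every failing machine maps to the same rack, as in the first call, the fault is more likely to be something the machines share physically, which is what prompted the on-site inspection in this case.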
Soon after, the data-center team reported back with a brief message and a picture of the rack's crushed wheels.
"Hello, we have inspected the rack. The casters on the rear wheels have failed and the machines are overheating as a consequence of being tilted," the team explained.
"The wheels (casters) supporting the rack had been crushed under the weight of the fully loaded rack," said McGhee.
"The rack then had physically tilted forward, disrupting the flow of liquid coolant and resulting in some CPUs heating up to the point of being throttled."
It's not clear why the wheels were crushed, but Google engineers feared it could be a more widespread problem, so they replaced all the racks that could be vulnerable to the same broken-wheel tilting issue.
The problem has caused Google to reconsider how it moves new racks into its data centers when they're being built.
Google's engineers discovered that casters on the rear wheels had failed, ultimately causing the machines to overheat.
The alarming tilt of a refrigeration unit also pointed to the underlying problem.