
          Examining the Microsoft Sydney Data Center Outage


          2024-06-21 21:14

          In this case, Microsoft's Australia East data center region experienced a 46-hour outage: a utility power problem caused the cooling system to fail, which in turn affected services. Microsoft's retrospective and remediation focus on improving its emergency operating procedures (EOPs), in particular the automatic restart mechanism for the chiller units, to reduce the need for manual intervention.


          The incident highlights that even in a highly automated environment, the ability to respond quickly at critical moments remains essential to service continuity. As the saying goes, "the key to solving a problem is finding the key problem."

          Staffing levels: are data centers at risk of unnecessary outages?

          With increasing data center automation, it’s only natural for clients to want assurance that their data will be available as close to 100 percent of the time as possible, and to ask whether enough data center staff are available to achieve a high level of uptime. They also want to know that when a potential outage occurs, there are enough technicians on duty or available to restore services as soon as possible.

          Microsoft suffered an outage on 30th August 2023 in its Australia East region in Sydney, lasting 46 hours. 

          Customers experienced issues with accessing or using Azure, Microsoft 365, and Power Platform services. It was triggered by a utility power sag at 08.41 UTC and impacted one of the three Availability Zones of the region.

          Microsoft explains: “This power sag tripped a subset of the cooling system chiller units offline and, while working to restore cooling, temperatures in the data center increased to levels above operational thresholds. We powered down a small subset of selected compute and storage scale units, both to lower temperatures and to prevent damage to hardware.”

          Despite this, the vast majority of services were recovered by 22.40 UTC, but they weren’t able to complete a full mitigation until 20.00 UTC on 3rd September 2023. Microsoft says this was because some services experienced a prolonged impact, “predominantly as a result of dependencies on recovering subsets of Storage, SQL Database, and/or Cosmos DB services.”

          Voltage sag cause

          The utility voltage sag was caused, according to the company, by a lightning strike on electrical infrastructure situated 18 miles from the impacted Availability Zone of the Australia East region. They add: “The voltage sag caused cooling system chillers for multiple data centers to shut down. While some chillers automatically restarted, 13 failed to restart and required manual intervention. To do so, the onsite team accessed the data center rooftop facilities, where the chillers are located, and proceeded to sequentially restart chillers moving from one data center to the next.”

          What was the impact?

          “By the time the team reached the final five chillers requiring a manual restart, the water inside the pump system for these chillers (chilled water loop) had reached temperatures that were too high to allow them to be restarted. In this scenario, the restart is inhibited by a self-protection mechanism that acts to prevent damage to the chiller that would occur by processing water at the elevated temperatures. The five chillers that could not be restarted supported cooling for the two adjacent data halls which were impacted in this incident.”
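The self-protection interlock described above can be sketched as a simple guard: a chiller refuses to restart when its chilled-water loop is above a safe temperature. This is an illustrative model of the behavior the PIR describes, not Microsoft's actual control logic; the class name and threshold value are assumptions.

```python
from dataclasses import dataclass

# Illustrative limit; real chillers use manufacturer-specific thresholds.
MAX_RESTART_LOOP_TEMP_C = 30.0

@dataclass
class Chiller:
    name: str
    loop_temp_c: float   # chilled-water loop temperature
    running: bool = False

    def try_restart(self) -> bool:
        """Attempt a restart; refuse if the loop water is too hot.

        Mirrors the self-protection mechanism in the PIR: restarting
        against overheated water could damage the chiller.
        """
        if self.loop_temp_c > MAX_RESTART_LOOP_TEMP_C:
            return False  # restart inhibited by self-protection
        self.running = True
        return True

cool = Chiller("CH-01", loop_temp_c=12.0)
hot = Chiller("CH-02", loop_temp_c=38.0)
assert cool.try_restart() is True    # restarts normally
assert hot.try_restart() is False    # inhibited, stays offline
```

This is why arrival order mattered: by the time technicians reached the last five units, the guard condition was already tripped and no manual action could restart them.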

          Microsoft says the two impacted data halls require at least four chillers to be operational. The cooling capacity before the voltage sag consisted of seven chillers, with five of them in operation and two on standby. The company says that some networking, compute, and storage infrastructure began to shut down automatically as data hall temperatures increased. This temperature increase impacted service availability. However, the onsite data center team had to begin a remote shutdown of any remaining networking, compute, and storage infrastructure at 11.34 UTC to protect data durability, infrastructure health, and to address the thermal runaway.
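The redundancy arithmetic above is small but worth making explicit: seven chillers served the two halls (five duty, two standby), and at least four had to be operational. A minimal sketch of that check, with an illustrative shutdown policy standing in for the protective power-down Microsoft describes:

```python
# Figures from the PIR: two data halls, seven chillers (five duty,
# two standby), at least four required for sufficient cooling.
MIN_OPERATIONAL = 4

def cooling_sufficient(running_chillers: int) -> bool:
    """True if enough chillers run to hold the halls within thresholds."""
    return running_chillers >= MIN_OPERATIONAL

def protective_action(running_chillers: int) -> str:
    """Illustrative policy: power down infrastructure when cooling is
    insufficient, to protect hardware and data durability."""
    return "continue" if cooling_sufficient(running_chillers) else "shutdown"

assert protective_action(5) == "continue"  # pre-sag state: five duty chillers
assert protective_action(0) == "shutdown"  # all five adjacent chillers inhibited
```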

          Staffing review

          Amongst the many mitigations, Microsoft says it increased its technician staffing levels at the data center “to be prepared to execute manual restart procedures of our chillers prior to the change to the Chiller Management System to prevent restart failures.” The night team was temporarily increased from three to seven technicians so that the underlying issues could be properly understood and appropriate mitigations put in place. It nevertheless believes that staffing levels at the time “would have been sufficient to prevent impact if a ‘load based’ chiller restart sequence had been followed, which we have since implemented.”
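The "load based" restart sequence amounts to an ordering decision: send technicians to the halls with the highest thermal load first. A sketch of that prioritization, with hypothetical hall names and load figures (the PIR does not publish these):

```python
from dataclasses import dataclass

@dataclass
class DataHall:
    name: str
    thermal_load_kw: float   # current heat load in the hall
    chillers_down: int       # chillers awaiting a manual restart

def restart_order(halls: list[DataHall]) -> list[str]:
    """Visit the highest-load halls first, so the rooms closest to
    thermal runaway get cooling back before lightly loaded rooms."""
    ranked = sorted(halls, key=lambda h: h.thermal_load_kw, reverse=True)
    return [h.name for h in ranked if h.chillers_down > 0]

halls = [
    DataHall("Hall-A", thermal_load_kw=800, chillers_down=2),    # hypothetical
    DataHall("Hall-B", thermal_load_kw=2400, chillers_down=3),
    DataHall("Hall-C", thermal_load_kw=1500, chillers_down=0),
]
# Highest-load hall with failed chillers is visited first.
assert restart_order(halls) == ["Hall-B", "Hall-A"]
```

Under the sequential one-building-at-a-time approach actually used during the incident, the hottest halls could end up last in the queue, which is how the final five chillers crossed the restart-inhibit threshold before anyone reached them.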

          It adds: “Data center staffing levels published in the Preliminary PIR only accounted for ‘critical environment’ staff onsite. This did not characterize our total data center staffing levels accurately. To alleviate this misconception, we made a change to the preliminary public PIR posted on the Status History page.”

          Yet in a Deep Dive ‘Azure Incident Retrospective: VVTQ-J98’, Michael Hughes, VP of APAC datacenter operations at Microsoft, responded to comments that more staff had been onsite than the company originally said. It was also suggested that the real fix wasn’t necessarily more people onsite, but rather a mode-based sequence in the emergency operating procedures (EOPs), which may not change staffing levels.

          Hughes explains: “The three that came out in the report just relate to people who are available to reset the chillers. There were operations staff onsite, and there were also people in the operations center. So that information was incorrect, but you’re right.” He asks us to picture the moment: after three voltage sags, 20 chillers were all in an error state. Thirteen of them then required a manual restart, which meant deploying manpower across a very large site.

          “You’ve got to run out onto the roof of the building to go and manually reset the chiller, and you’re on the clock”, he adds. With chillers impacted and temperatures rising, staff are having to scramble across the site to try to reset the chillers. They don’t quite get to the pod in time, leading to the thermal runaway. The answer in terms of optimization is to go to the highest load data centers – those that have the highest thermal load and highest number of racks operating to recover cooling there.

          So, the focus was to recover the chillers with the highest thermal load. This amounts to a tweak on how Microsoft’s EOP is deployed, and it’s about what the system is supposed to do, which he says should have been taken care of by the software. The auto-restart should have happened, and Hughes argues that there shouldn’t have had to be any manual intervention. This has now been fixed. He believes that “you never want to deploy humans to fix problems if you get software to do it for you.” This led to an update of the chiller management system to stop the incident from occurring again.

          Industry issue and risk

          Ron Davis, vice president of digital infrastructure operations at the Uptime Institute, adds that it’s important to point out that these issues and the risks associated with them exist beyond the Microsoft event. “I have been involved in this sort of incident, when a power event occurred and redundant equipment failed to rotate in, and the chilled water temperature quickly increased to a level that prohibited any associated chiller(s) from starting,”

          he comments before adding: “This happens. And it can potentially happen to any organization. Data center operations are critical. From a facilities standpoint, uptime and availability is a primary mission for data centers, to keep them up and running.” Then there is the issue of why the industry is experiencing a staffing shortage. He says the industry is maturing from an equipment, systems, and infrastructure perspective. Even remote monitoring and data center automation are getting better. Yet there is still a heavy reliance on the presence and activities of critical operating technicians - especially during an emergency response as outlined in the Microsoft case.



          Closing thoughts



          Against the backdrop of increasing data center automation, customer demand for near-100 percent data availability is prompting the industry to re-examine staffing and operational strategy. Often a single root cause compounds into cascading problems: staffing levels should account for business-continuity requirements, and emergency response procedures should be continuously improved. Only with such a multi-dimensional strategy can data centers be better prepared for future challenges and ensure high service availability and the safety of customer data.


          Looking ahead, the industry will place more emphasis on intelligent management and preventive maintenance: making automated tools more scenario-aware, optimizing how people and tools work together, and using AI and machine learning to predict and resolve potential problems and reduce sensitivity to external disruptions. Ultimately, combining technical innovation with optimized staffing to achieve more stable and reliable data center operations is a goal the whole industry shares.
