

And now we welcome the new year.

January 4th 2021 was the first working day of the year for many around the globe, and for most of us at Slack too (except of course for our on-callers and our customer experience team, who never sleep). The day in APAC and the morning in EMEA went by quietly. During the Americas’ morning we got paged by an external monitoring service: error rates were creeping up. As initial triage showed the errors getting worse, we started our incident process (see Ryan Katkov’s article All Hands on Deck for more about how we manage incidents).

As if this was not already an inauspicious start to the New Year, while we were in the early stages of investigating, our dashboarding and alerting service became unavailable. We immediately paged in our monitoring team to try to get our dashboard and alerting service back up. To narrow down the list of possible causes we quickly rolled back some changes that had been pushed out that day (it turned out they weren’t the issue).

We pulled in several more people from our infrastructure teams because all debugging and investigation was now hampered by the lack of our usual dashboards and alerts. We still had various internal consoles and status pages available, some command line tools, and our logging infrastructure. Our metrics backends were still up, meaning that we were able to query them directly - however this is nowhere near as efficient as using our dashboards with their pre-built queries. While our infrastructure seemed to generally be up and running, we observed signs of widespread network degradation, which we escalated to AWS, our main cloud provider. At this point Slack itself was still up - at 6.57am PST 99% of Slack messages were being sent successfully (but our success rate for message sending is usually over 99.999%, so this was not normal).

Slack has a traffic pattern of mini-peaks at the top of each hour and half hour, as reminders and other kinds of automation trigger and send messages (much of this is external - cronjobs from all over the world). We manage the scaling of our web tier and backends to accommodate these mini-peaks. However, the mini-peak at 7am PST - combined with the underlying network problems - led to saturation of our web tier. As load increased so did the widespread packet loss. The increased packet loss led to much higher latency for calls from the web tier to its backends, which saturated system resources in our web tier. Slack became unavailable.

Around this time two things happened independently. Some of our instances were marked unhealthy by our automation because they couldn’t reach the backends that they depended on. Our systems attempted to replace these unhealthy instances with new instances.
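To put the 6.57am figures quoted above in perspective, the short Python sketch below compares the two failure rates. The helper name `failure_rate` is ours, for illustration only, and 99.999% is treated as an exact baseline even though the usual success rate is described as being above it, so the resulting ratio should be read as a lower bound.

```python
from fractions import Fraction


def failure_rate(success_percent: str) -> Fraction:
    """Return the failure rate as an exact fraction for a success percentage."""
    return 1 - Fraction(success_percent) / 100


# Assumption: the usual success rate is taken as exactly 99.999%.
normal = failure_rate("99.999")
# Success rate observed at 6.57am PST.
degraded = failure_rate("99")

# How many times more often a message send failed compared with normal.
ratio = degraded / normal

print(f"normal failure rate:   {float(normal):.3%}")
print(f"degraded failure rate: {float(degraded):.3%}")
print(f"ratio: {ratio}x")
```

In other words, a 99% success rate still sounds high, but it means sends were failing at least 1,000 times more often than usual.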
Slack, the messaging service used by millions of people for work and school, suffered a global outage on Monday, the first day back for most people returning from the New Year's holiday. It's the latest tech glitch to show how disruptive technical difficulties can be when millions of people are depending on just a few services to work and go to school from home during the pandemic. The company stopped releasing its daily user count after topping 12 million last year.

“Our team is currently investigating and we’re sorry for any troubles this may be causing,” Slack said in a prepared statement. The outage disrupted service in the U.S., Germany, India, the U.K., Japan and elsewhere. At 12:30 p.m. Eastern time, service was still sporadic and Slack said the outage was ongoing, but that some users may begin to see improvement. Slack said that people should check for updates. More complaints rolled in as the sun hit the West Coast, and there were still outages four hours after it began in New York City.

Internet service outages are not uncommon, are usually resolved relatively swiftly and are only rarely the result of hacking or other intentional mischief. Google went down briefly in December, with people in several countries unable to access their Gmail accounts, watch YouTube videos or get to their online documents. In August, Zoom went down briefly just as many students were beginning the school year at home. And in September, Microsoft services had an outage that lasted for five hours.

The outage comes about a month after Salesforce said it would acquire Slack for $27.7 billion. The deal is aimed at giving the two companies a better shot at competing against longtime industry powerhouse Microsoft, a software giant that competes with Salesforce and whose Teams product is a direct competitor to Slack.

Copyright 2021 The Associated Press.
