Yesterday, I experienced my first moment of pure dread since joining the HelpDocs team. The stupid thing is, in hindsight, it wasn’t even that big a deal.
It was 4:02pm—I know, because that’s what it says on Slack. Jake and Jarratt had signed off for the night. They’re in India, so it was around 9pm for them, and I was left to crack on with my work as they relaxed with books and wine halfway around the world.
It was a typical day. I’d posted a blog post the day before and had spent some time on social media content. I was getting into some video editing when the familiar clunk of a new Intercom ticket chimed in, interrupting an otherwise straightforward Wednesday.
I’ve mentioned we share support at HelpDocs, albeit shared disproportionately, with Jake and Jarratt taking the lion’s share. Tickets are few and far between, and tend to peak and trough with some volatility, though it’s rare for a single “issue” to cause problems across the board.
Our standard tickets range from pricing inquiries to customers wanting to make custom changes to their knowledge base and running into trouble. Of course, there’s the occasional complaint—though not many by any stretch of the imagination—but in general things are pretty standard.
Since I’ve been at HelpDocs we’ve never had any “serious” problems or outages. In fact, I’ve never experienced a site-wide issue at all.
That was, until 4:02pm yesterday.
Less than a minute later, I switched over to Intercom and picked up the message. It was from one of our long-standing customers, it was polite, and it alluded to an issue with the search function on their account. The long and short of it was, well, search wasn’t working!
No sooner had I finished reading the message and begun taking a brief look to see if I could do anything than another ticket chimed in.
“Search not working”, they said—though I may be paraphrasing for brevity!
Another one. This probably doesn’t bode well, I thought.
I searched the error message and found nothing concrete. I looked through previous tickets and found no solution. A sense of panic set in immediately.
Another chime, another ticket. 3 now, and then 4 and 5. They were coming in thick and fast.
Another one chimed, though it threw me this time. A photo of two children sat in high chairs eating ice creams—fucking spam! Of all the times to get a spam message, this wasn’t it. Unless this was something cryptic. A protest, perhaps, at the lack of search function.
Here you searchless bastards, a photo of 2 kids with ice-cream cones is the perfect way to illustrate my annoyance with you—perhaps not!
I threw myself into Slack—well as much as anyone can. It was a mouse click really, but that doesn’t add to the drama of the situation—and hammered out an urgent message to get the attention of the rest of the team.
Search down. Send Help.
That’s what I intended to say, though reading them back, the messages seem more like a frantic garble of information.
CLANG, that’s 6, if you don’t count the spam—who’d now sent through a few messages filled with Cyrillic symbols and emoji.
I can’t lie, I was panicked. Ok, so it was only 6 tickets. And in the grand scheme of things, search going down for a little while isn’t the end of the world. I wasn’t even convinced it was our issue anyway, given the rest of the site was up and no new commits had been made on the product side of things since 11am—one of the perks of automatically sharing commits in Slack.
But when you generally get 6 tickets in an entire day, shared amongst the team, having all of them in the space of 5 minutes was kind of worrying. What’s more, the rest of the team had logged off for the day, and this was the first time I’d experienced any kind of shared issue amongst our users.
I was alone, with an issue I knew nothing about, no way of fixing it, unsure who was even at fault, with tickets to deal with.
I took a moment. The only thing I could really do was let people know we were aware of the issue and that we were looking into it. I knew as soon as the team saw the messages they’d be back online and straight into investigating the issue. I wouldn’t be alone for long.
An update tweet, perhaps, would quell some of the flames and stop any further tickets coming in.
As I wrote out the tweet, I was purposeful with my wording. I wasn’t sure what the issue was apart from it breaking search. Given the rest of the platform was working, it felt like it wasn’t our issue, but nobody wants to hear that.
I didn’t want to claim responsibility for it if it wasn’t our fault, but the last thing I wanted to do—to our users—was pass the buck.
🚨UPDATE: Some users have reported an issue with search, with the error: "no available connection: no Elasticsearch node available" 😬— HelpDocs (@HelpDocs) October 31, 2018
We're aware of the issue and are working on fixing it right now. Apologies for the inconvenience. We'll keep you updated as much as we can.
No sooner had I posted the tweet than the cavalry arrived.
A simple “We’re looking into it 🙂” from Jake was all it took to alleviate the panic.
I headed back to Intercom to let people know we were looking into it and would update them once it was fixed, only to see Jarratt was already on that too!
We closed the 6 tickets—and blocked the spam 😒—and within the space of 5 minutes, we had a fix.
Turns out one of our vendors had inadvertently changed (and misconfigured) an important security setting on our account, leaving our application servers briefly unable to connect to our search cluster. It was resolved in a couple of clicks, but it required manual involvement from us.
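Connectivity failures like this one tend to surface to clients as errors like the "no Elasticsearch node available" message we tweeted about. For what it's worth, a quick scripted check against a search cluster's health endpoint can confirm whether the cluster is reachable at all. Here's a minimal sketch (the URL is a placeholder, and this isn't our actual tooling):

```python
# Minimal reachability check against an Elasticsearch cluster's
# _cluster/health endpoint. The URL below is a placeholder.
import json
import urllib.request
import urllib.error


def check_search_cluster(base_url: str, timeout: float = 5.0) -> str:
    """Return the cluster status ('green'/'yellow'/'red'), or 'unreachable'."""
    try:
        with urllib.request.urlopen(f"{base_url}/_cluster/health",
                                    timeout=timeout) as resp:
            health = json.load(resp)
            return health.get("status", "unknown")
    except (urllib.error.URLError, OSError):
        # A failure here is what clients see as errors like
        # "no available connection: no Elasticsearch node available".
        return "unreachable"


if __name__ == "__main__":
    print(check_search_cluster("http://localhost:9200"))
```

Something this small won't tell you *why* the cluster is unreachable—in our case, a vendor-side security setting—but it does separate "search is down" from "our app is broken," which is exactly the distinction I was fumbling for in that first tweet.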
CRISIS AVERTED: there was a temporary issue with one of our vendors, now resolved. We’ll keep monitoring. Thanks everyone for the heads up, and for your patience! 🙏— HelpDocs (@HelpDocs) October 31, 2018
Lesson Learned: It's Important to Choose the Right Words
It’s funny how something as simple as an incident tweet can feel so difficult to write. There’s so much information you want to get across, but you’re bound by character limits. You also don’t want to do something silly like apportion blame, or even take too much of the heat yourself—well, if you don’t have to!
In our case, I think I did the latter. I’m not the sort of person who sees any benefit in finger-pointing, and as a result I tend to default to “how can I/WE fix the problem,” even when it’s not our problem to fix.
So, while trying hard to make sure people knew we were working on fixing the issue—or at least we would be at the next possible opportunity—I inadvertently took a level of responsibility for the issue that wasn’t ours to take.
The difference is one word: Fixing. In fact, a better word in this instance would have been: Investigating.
I hadn’t the foggiest idea what was wrong. As I said, I didn’t know who was to blame. As a team, I guess we hadn’t prepared for this kind of situation, and unfortunately, in my fluster, I’d defaulted to taking the heat.
For me, this highlights a couple of areas I could be better at in the future.
The first is having all the facts before taking responsibility and choosing words a little more carefully. In this case, there wasn’t significant fallout, but it could have been much worse.
The second is mitigating the crisis inside my head. While I might see faults like these as massive issues, the truth is they’re often much less urgent for everyone else. I guess that’s the problem with software with such a good track record—damn you 100% uptime.
Sure, it’s great that the team were able to fix the issue within an hour of it hitting our radar, but in the grand scheme of things, few people even noticed there was an issue and only 6 felt it necessary to get in touch—and even then, it didn’t feel like it was urgent for them.
At the end of the day, the world would have survived until the morning, even without search powering our knowledge bases. 🙃