Global incident

Incident Report for SMSFactor

Postmortem

🇬🇧

On November 13th at 10:03 AM, we received an alert that one of our servers had stopped responding. This server runs the backend of our platform my.smsfactor.com, as well as mail2sms and Appointment Reminder. We could not connect to the machine over SSH, and it did not respond to ping on its public network interface.

We immediately contacted our provider to inform them of the problem. They confirmed that they were working on it.

Twenty minutes later, the machine responsible for the API and all other services also stopped responding. We managed to regain control of this machine 15 minutes later, at 10:35 AM. The API then became available and functional again.

It was not until 12:21 PM that our provider, after switching the first machine to maintenance mode, informed us that it was available again. At that point, all our services were operational again. Our provider gave us the following explanation:

The disks of your server were no longer detected, which required a "cold reset" of the server. We cannot explain why the disks were no longer detected.

We sincerely apologize for this new incident and can assure you that we have already taken the necessary measures to better manage this type of situation in the future.

Posted Nov 14, 2024 - 16:32 CET

Resolved

This incident has been resolved.

Posted Nov 13, 2024 - 14:25 CET

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Nov 13, 2024 - 12:29 CET

Update

We are continuing to investigate this issue.

Posted Nov 13, 2024 - 10:59 CET

Investigating

We are currently experiencing a global incident that is rendering our services unusable.

Posted Nov 13, 2024 - 10:23 CET

This incident affected: API, Customers Portal, Webhooks, Operator Network, Reminder, Mail2SMS, and VLN.