Our services are unavailable
Incident Report for SMSFactor
Postmortem

English version 🇬🇧

Overview

Starting at 12:24 CET on 2 February, we noticed that our frontend application server was slow to respond and handle requests. After investigating, we realized we were receiving an unusually large number of incoming requests, resulting in a DDoS-like scenario.

Timeline (CET)

  • 12:24 - Our front server is receiving an unusually high number of incoming requests.
  • 12:25 - We start working on it and try to identify the source of the incoming requests.
  • 12:45 - The constant stream of incoming requests started triggering 'Too many open files' errors on our server, leaving all applications unable to work properly (see the note after this timeline). We then decided to temporarily cut the traffic so the server could return to a normal state while we identified the source of the large number of requests.
  • 12:52 - The server is running normally again and the source of the large number of requests has been handled.
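
For context, a 'Too many open files' error means a process has exhausted its per-process file descriptor limit, so every new connection or file open fails even though the machine itself still has capacity. The short Python sketch below is purely illustrative (it is not part of our stack, and the limit value is an arbitrary example); it only shows how that limit can be inspected and raised at runtime on a Unix-like system.

    import resource

    # Read the current per-process file descriptor limits.
    # 'soft' is the enforced limit; 'hard' is the ceiling it may be raised to.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print(f"open file limit: soft={soft}, hard={hard}")

    # Every open socket counts against this limit, so a low soft limit
    # (commonly 1024 by default) triggers 'Too many open files' under a
    # surge of connections. Raising the soft limit towards the hard limit
    # buys headroom; 65536 here is only an example value.
    new_soft = hard if hard != resource.RLIM_INFINITY else 65536
    resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))

Raising the limit is only a stopgap, of course; the longer-term fix is the load-balancing change described in the follow-up action items below.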

Duration

  • Start: 12:24 CET
  • Stop: 12:52 CET
  • Downtime: Yes
  • Downtime duration: 7 minutes (the period during which traffic was cut, from 12:45 to 12:52)

Follow-up action items

The source of the requests was legitimate; it was not an attack. We were simply unable to handle such a volume of traffic. We have already taken measures to ensure this does not happen again by moving some components to a different server to better load-balance the traffic.

French version 🇫🇷

Overview

Starting at 12:24 CET on 2 February, we noticed that our frontend application server was slow to respond and handle requests. After investigating, we realized we were receiving an unusually large number of incoming requests, resulting in a DDoS-like scenario.

Timeline (CET)

  • 12:24 - Our front server is receiving an unusually high number of incoming requests.
  • 12:25 - We start working on it and try to identify the source of the incoming requests.
  • 12:45 - The constant stream of incoming requests started triggering 'Too many open files' errors on our server, leaving all applications unable to work properly. We then decided to temporarily cut the traffic so the server could return to a normal state while we identified the source of the large number of requests.
  • 12:52 - The server is running normally again and the source of the large number of requests has been handled.

Duration

  • Start: 12:24 CET
  • Stop: 12:52 CET
  • Downtime: Yes
  • Downtime duration: 7 minutes

Follow-up action items

The source of the requests was legitimate; it was not a malicious attack. We were simply unable to handle such a volume of traffic. We have already taken measures to ensure this does not happen again by moving some components to a different server to better load-balance the traffic.

Posted Feb 05, 2024 - 11:20 CET

Resolved
This incident has been resolved.
Posted Feb 02, 2024 - 15:48 CET
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Feb 02, 2024 - 12:57 CET
Investigating
We are currently investigating this issue.
Posted Feb 02, 2024 - 12:50 CET
This incident affected: API, Customers Portal, Webhooks, Reminder, Mail2SMS, and VLN.