We are in the business of moving money. A payment transaction traverses multiple services within our stack. Each hop is a potential failure point, and the ability to react to and recover from failures is critical to keeping the systems highly available for merchant partners and customers. The ability to trace and monitor a transaction is equally critical, both operationally and for the stakeholders involved. With so many services and stakeholders in the path of a payment transaction, exceptions cannot be avoided entirely. A reliable, reactive observability stack that monitors and alerts on these exceptions in near real time will help the operations team identify problems, such as outages in external systems or regressions from incremental code rollouts, and take corrective action. The stack should also make it possible to trace a transaction as it flows through multiple services.
- Functional observability covers monitoring in the business context of payments: approval rates, decline rates, spikes in declines, missing files, exceptions during file processing, and so on.
- Business errors can surface when new functionality is rolled out to production, so the observability stack should make it possible to compare approval rates before and after a code rollout.
- Telemetry data includes the flow-specific parameters logged by each service, as well as the same data aggregated across different parameters.
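
To make the before-and-after comparison of approval rates concrete, here is a minimal sketch in Python. The `Transaction` record, its fields, and the `compare_rollout` helper are hypothetical names chosen for illustration; they are not taken from the actual stack, where this aggregation would more likely run as a query against the logged data.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class Transaction:
    """Hypothetical, simplified view of a logged payment outcome."""
    timestamp: datetime
    approved: bool


def approval_rate(transactions: Iterable[Transaction]) -> Optional[float]:
    """Return the fraction of approved transactions, or None if there are none."""
    txns = list(transactions)
    if not txns:
        return None
    return sum(1 for t in txns if t.approved) / len(txns)


def compare_rollout(
    transactions: Iterable[Transaction], rollout_time: datetime
) -> Tuple[Optional[float], Optional[float]]:
    """Split transactions at the rollout time and return (rate_before, rate_after)."""
    txns = list(transactions)
    before = [t for t in txns if t.timestamp < rollout_time]
    after = [t for t in txns if t.timestamp >= rollout_time]
    return approval_rate(before), approval_rate(after)
```

A noticeable drop in the second value relative to the first would be a signal to investigate the rollout; what counts as "noticeable" is a business decision that this sketch does not attempt to encode.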
Cornerstones of Pine Labs' Observability Stack
- Functional Monitoring through the ELK Stack
- Telemetry data is collected from the application and sent to the sinks in a vendor-agnostic way using the OpenTelemetry APIs
- Functional alerts are set up for business failures with a threshold, and the Operations team is notified so it can attend to issues proactively
- Non-functional alerts for spikes in memory or CPU usage and for unresponsive applications help the team take corrective action.
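
As an illustration of a threshold-based functional alert, the sketch below tracks the decline rate over a sliding window of recent transactions and reports when it crosses a threshold. The class name, the window size, the threshold, and the minimum sample count are all assumptions made for this example; in practice such a rule would more likely be configured in the alerting tool on top of the ELK data than hand-written in application code.

```python
from collections import deque


class DeclineRateAlert:
    """Hypothetical sliding-window alert on the share of declined transactions."""

    def __init__(self, window_size: int = 100, threshold: float = 0.2,
                 min_samples: int = 20) -> None:
        self.threshold = threshold      # alert when the decline rate exceeds this
        self.min_samples = min_samples  # avoid alerting on too little data
        # Only the most recent `window_size` outcomes are retained.
        self._outcomes = deque(maxlen=window_size)

    def record(self, approved: bool) -> bool:
        """Record one outcome; return True if the alert condition now holds."""
        self._outcomes.append(approved)
        if len(self._outcomes) < self.min_samples:
            return False
        declines = sum(1 for ok in self._outcomes if not ok)
        return declines / len(self._outcomes) > self.threshold
```

The minimum sample count is a design choice to keep a handful of early declines from triggering a notification before the window holds enough data to be meaningful.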