Design an end-to-end monitoring solution (Systems Interview):
To begin, let’s look at the different aspects of a system that need to be monitored and why they are important:
Availability Monitoring ensures a resource is accessible and responsive. A resource can be anything from a data center or an application to individual servers and services (databases). Resource availability is likely the most critical thing to monitor; the failure of a part of the system requires immediate action to resolve. Availability monitoring is critical because it ensures your key systems are up and available to end users.
Capacity Monitoring ensures the system as a whole has ample resources in which to run. Capacity monitoring should look at two different levels of the system: the server level and the resource/application level. At the server level, watch metrics like available disk space, free memory, CPU load, disk IO, and network traffic. Spikes in any of these metrics are likely to lead to decreased performance at both the server and the resource/application level. The second level that's important to monitor is the resource/application level. This includes clustered systems like databases, Elasticsearch, Memcached, Riak, etc., and multi-node applications (load balanced application nodes). Capacity monitoring of multi-server resources/applications should focus on throughput, average response time, and particularly, slow responses. Monitoring slow responses helps to identify uncommon requests that have the potential to tie up the system and lead to failures of dependent systems. Effective capacity monitoring allows for proactive action. Actions might include bringing up additional servers to handle higher load, or optimizing slow actions to reduce load on a system. Capacity monitoring is also important for predicting total system costs.
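As a sketch of the resource/application-level metrics described above, the helper below (names are illustrative, not from any particular library) rolls a window of request latencies up into throughput, average response time, and the individual slow responses worth investigating:

```python
from statistics import mean

def summarize_latencies(latencies_ms, window_seconds, slow_threshold_ms=500):
    """Summarize a window of request latencies (in milliseconds).

    Returns throughput (requests/sec), average response time, and the
    individual slow responses that may tie up the system.
    """
    slow = [l for l in latencies_ms if l > slow_threshold_ms]
    return {
        "throughput_rps": len(latencies_ms) / window_seconds,
        "avg_response_ms": mean(latencies_ms) if latencies_ms else 0.0,
        "slow_responses": slow,
    }

# Example: a 60-second window containing five requests, two of them slow
stats = summarize_latencies([120, 90, 850, 110, 1400], window_seconds=60)
```

A real capacity Consumer would compute this continuously over a sliding window and alert when throughput drops or the slow-response list grows.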
Transaction Monitoring looks at the velocity and response time of key business transactions, generally across multiple systems. An example might be the checkout process on xxxxxxxxx.com. An online checkout process generally touches a number of different systems: the inventory system might be checked to ensure all cart items are currently available, the shipping address might need to be verified, a hold will need to be applied to the credit card for the purchase amount, the warehouse pick system will need the order, and the ERP will need to record the transaction. Monitoring transaction response time and velocity allows for the identification of business-critical issues. An unexpected slowdown in checkout velocity is indicative of a problem in the system, and monitoring response time can help identify bottlenecks. Tracking a transaction through the system can help identify which part of the system might be responsible for a workflow issue. Transaction monitoring is critical to ensuring key business value actions are happening as expected; a failure in checkout could have a huge cost to the business in terms of lost revenue.
Synthetic Transaction Monitoring involves periodically performing key business transactions with an automated tool to measure end user response times. Examples of synthetic transactions include running a Selenium script through the process of selecting and adding an item to the cart, or checking out. Synthetic transaction monitoring helps identify issues that may be seen by an end user. End user response time has a noticeable impact on conversion and user happiness. Identifying slow experiences allows a team to be proactive in addressing end user performance issues.
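The shape of a synthetic check can be sketched generically: wrap the scripted user flow in a timer and emit an event for the Collector. In production the `action` callable would be a Selenium script driving the real pages; the stand-in below just illustrates the measurement and the event format (both hypothetical):

```python
import time

def run_synthetic_check(name, action):
    """Run one synthetic transaction and return an event recording
    whether it succeeded and how long it took end to end."""
    start = time.monotonic()
    try:
        action()  # e.g. a Selenium script: add item to cart, check out
        ok = True
    except Exception:
        ok = False
    elapsed_ms = (time.monotonic() - start) * 1000
    return {"check": name, "ok": ok, "response_ms": round(elapsed_ms, 1)}

# Stand-in action simulating a 50 ms user flow
event = run_synthetic_check("add_to_cart", lambda: time.sleep(0.05))
```

A scheduler would run these checks every few minutes from locations representative of real users.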
Application Performance Monitoring provides insight into the response times of various parts of an application. This can include high level metrics, like the total response time of requests. Ideally, very small incremental steps of processing a request should be recorded, allowing an engineer to identify the particular point of slowness in the software. Operations like database or external service calls are very important to monitor, as they can reveal needed database optimizations or potential points of failure. Being able to examine response time at a method level allows for the understanding of performance bottlenecks. Capturing metrics at such a granular level can provide incredible visibility into the workings of an application, but will quickly lead to huge amounts of data. In high volume systems, the best approach may be to sample a small set of transactions in a production setting. Application Performance Monitoring is important in improving the quality and efficiency of each particular application.
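The sampling idea above can be sketched as a decorator: time only a fraction of calls and report those to a sink that would normally publish to the Collector (names here are illustrative, not from a real APM library):

```python
import functools
import random
import time

def timed_sample(rate, sink):
    """Decorator: time a fraction `rate` (0.0-1.0) of calls to the
    wrapped function and report each sampled timing to `sink`."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if random.random() >= rate:
                # Unsampled call: overhead is a single RNG draw
                return func(*args, **kwargs)
            start = time.monotonic()
            result = func(*args, **kwargs)
            sink({"method": func.__name__,
                  "elapsed_ms": (time.monotonic() - start) * 1000})
            return result
        return wrapper
    return decorator

events = []

@timed_sample(rate=1.0, sink=events.append)  # sample everything for the demo
def lookup_product(sku):
    return {"sku": sku}
```

In a high-volume production system `rate` might be 0.01 or lower, trading completeness for manageable data volume.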
Operational Analytics monitors tasks related to operational efficiency. These might include metrics like the time to pick and pack an order, time between order placed and order shipped, or response time of customer inquiries. These metrics help identify inefficiencies and bottlenecks in the current operational process.
Passive Monitoring examines network traffic flowing between different parts of the system. As passive monitoring tracks real traffic, it can provide the ability to “replay” the moments leading up to an issue, examining the TCP traffic. Passive monitoring can also provide insight into the volume of communication within the larger system, helping to identify communication bottlenecks. As with application performance monitoring, passive monitoring can generate huge volumes of data. Sampling may help reduce the volume of data while still offering insight into the flow of communication within the larger system. Passive monitoring can be important in figuring out why an issue occurred. Being able to replay the events leading up to a problem can help engineers to replicate and resolve issues.
Solution: The elements of an effective monitoring system can be broken down into three main categories: Producers, Collectors, and Consumers. Producers are all the systems that send data as part of monitoring. The Collector is the system that collects and aggregates the data. Consumers are services that subscribe to the Collector and process or act on the data they are interested in. Let's look at these individually:
Producers generate data, and they can take a number of different forms. It might be a daemon process running on a server that sends system metrics every few seconds. It could be an application that sends the execution time of a particular controller action. It might be a service that pings for the uptime of other services, or one that reports the results of automated Selenium tests. Producer data should be in JSON form and follow a standard data schema relevant to its purpose. Producers send data to the Collector via TCP. Each producer can be written in the language that makes the most sense given its purpose and location.
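A minimal Producer can be sketched in a few lines: build a JSON event following a shared schema, then ship it to the Collector over TCP. The schema fields and the `collector.internal:9000` endpoint are placeholders for whatever the real system standardizes on:

```python
import json
import socket
import time

def make_event(event_type, payload, schema_version=1):
    """Build a monitoring event following a (hypothetical) standard
    schema: type, schema version, timestamp, and a free-form payload."""
    return {
        "type": event_type,
        "schema_version": schema_version,
        "timestamp": time.time(),
        "payload": payload,
    }

def send_event(event, host="collector.internal", port=9000):
    """Ship one newline-delimited JSON event to the Collector over TCP.
    Host and port stand in for the real Collector endpoint."""
    data = (json.dumps(event) + "\n").encode("utf-8")
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(data)
```

Because every Producer emits the same envelope, Consumers can filter on `type` without knowing anything else about the sender.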
Collectors collect and aggregate data. Kafka seems like the right tool for the job at this scale. It would be run as a cluster to ensure availability and durability. The Collector provides a unified platform for Producers to publish to and Consumers to consume from.
Consumers act on the data they are interested in from the Collector. One of the challenges around monitoring and analytics is the way in which data needs to be presented. Consumers should be tailored to their intended audience. An executive dashboard might only watch for key Operational Analytics such as warehouse efficiency or daily revenue numbers. This system could periodically import all related events, aggregate them, and store those aggregated results locally for display. Another Consumer might bulk upload all the collected data to S3 for future processing with Hadoop. Yet another Consumer could monitor a particular application in real time, merging capacity, transaction, synthetic transaction, and application monitoring to provide a detailed picture of how an application is running.
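As a sketch of the executive-dashboard Consumer described above, the following (hypothetical field names) rolls checkout events up into per-day revenue totals suitable for local storage and display:

```python
from collections import defaultdict

def aggregate_daily_revenue(order_events):
    """Roll checkout events up into per-day revenue totals.
    Each event is assumed to carry a 'date' (YYYY-MM-DD) and 'total'."""
    totals = defaultdict(float)
    for event in order_events:
        totals[event["date"]] += event["total"]
    return dict(totals)

daily = aggregate_daily_revenue([
    {"date": "2015-03-01", "total": 19.99},
    {"date": "2015-03-01", "total": 5.00},
    {"date": "2015-03-02", "total": 42.50},
])
```

The dashboard would re-run this on a schedule and keep only the aggregates, not the raw event stream.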
This Producer/Collector/Consumers approach provides nearly infinite options on the Producer and Consumer sides. Applications can be tailored to the needs of their end users and utilize the technologies that make the most sense for the given problem. The only real requirement for producers and consumers is an understanding of the common data schema for a particular message type. Changes in data schemas can be mitigated through versioning.
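Schema versioning can be handled with a small normalization step on the Consumer side: upgrade older message versions to the current shape so the rest of the consumer only ever handles one format. The field names and versions below are purely illustrative:

```python
def normalize_event(event):
    """Upgrade older schema versions to the current (v2) shape.

    In this illustration, v1 carried a flat 'value' field, while v2
    wraps it in a 'payload' envelope.
    """
    version = event.get("schema_version", 1)
    if version == 1:
        return {"schema_version": 2, "type": event["type"],
                "payload": {"value": event["value"]}}
    if version == 2:
        return event
    raise ValueError(f"unsupported schema_version: {version}")
```

Producers can then roll out a new schema version gradually, while Consumers keep working against a single normalized format.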
I’ll now give some examples of Producers and Consumers for each of the above monitoring focuses.
For Availability Monitoring, a monitoring agent should be installed on each machine to run periodic health checks. This agent could be written in a number of different languages (Python, Go, etc.). A Consumer would watch for registered servers and generate an alert if a server hadn't checked in recently enough or a health check had failed. Some external monitoring should also be in place to ensure connectivity between networks.
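The check-in watchdog on the Consumer side can be sketched like this; the heartbeat interval and missed-beat threshold are assumed values, not prescribed ones:

```python
import time

HEARTBEAT_INTERVAL = 10          # seconds between agent check-ins (assumed)
MISSED_BEATS_BEFORE_ALERT = 3    # tolerate brief network blips

def find_unresponsive(last_checkin, now=None):
    """Given a map of server -> last check-in timestamp (epoch seconds),
    return the servers that have missed enough heartbeats to alert on."""
    now = time.time() if now is None else now
    deadline = HEARTBEAT_INTERVAL * MISSED_BEATS_BEFORE_ALERT
    return sorted(host for host, seen in last_checkin.items()
                  if now - seen > deadline)
```

The Consumer would run this every few seconds over the check-in timestamps it maintains from the heartbeat stream.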
For Capacity Monitoring, a monitoring agent should be installed on each machine to generate periodic stats and send them to the Collector. These might include load (uptime), memory (free -m), and available disk space (df -h). A Consumer could display these metrics in graph form (d3.js), or alert if a threshold is passed. Given the high volume of near-live data, a key/value store like Riak would be an option for storing and searching data points. This data would not need a long life cycle, and could be aged off after a week or two.
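A minimal version of that agent's collection step, using only the Python standard library (note `os.getloadavg` is Unix-only, and a real agent would also sample memory, disk IO, and network traffic):

```python
import os
import shutil
import time

def collect_server_stats(path="/"):
    """Gather a minimal set of server-level capacity metrics:
    1-minute load average and disk usage for the given mount point."""
    load1, load5, load15 = os.getloadavg()  # Unix-only
    disk = shutil.disk_usage(path)
    return {
        "timestamp": time.time(),
        "load_1m": load1,
        "disk_total_bytes": disk.total,
        "disk_free_bytes": disk.free,
    }
```

The agent would call this every few seconds, wrap the result in the standard event schema, and send it to the Collector.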
For Transaction Monitoring, the Producer would likely be part of the application, working the same way logging does in an application. It is important to include the time of transaction as part of the event message. Transaction data seems very ripe for a variety of consumers. The business might be interested in metrics around checkout, order size, order velocity and product interest. Engineering might need to see a more detailed log of slow transactions, or view them in the context of more detailed application performance metrics.
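Within the application, the Producer side can look like ordinary logging: a context manager that records the start time, duration, and outcome of one step of a business transaction, tagged with a transaction id so the workflow can be traced across systems. The event fields are illustrative:

```python
import time
from contextlib import contextmanager

@contextmanager
def track_transaction(name, txn_id, sink):
    """Record the timing and outcome of one transaction step, sending
    the resulting event to `sink` (normally the Collector client)."""
    start = time.time()
    status = "ok"
    try:
        yield
    except Exception:
        status = "error"
        raise
    finally:
        sink({"transaction": name, "txn_id": txn_id,
              "started_at": start,
              "duration_ms": (time.time() - start) * 1000,
              "status": status})

events = []
with track_transaction("checkout.hold_funds", txn_id="ord-123",
                       sink=events.append):
    pass  # place the card hold here
```

Because the timestamp and transaction id travel with every event, Consumers can reassemble the full checkout timeline even when its steps run on different systems.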
For Operational Analytics, the application responsible for the particular event should send data from within the relevant application (ex. order shipped event sent when the order is marked as shipped). A consumer might consist of a series of graphs showing counts in comparison with historical data for management review.
For Passive Monitoring, agents could be installed to listen on mirrored ports and send all communication to the collector. This data could be archived in S3 and processed through Hadoop.
One of the challenges with aggressive monitoring is the amount of data generated. In most cases, the majority of that data is not particularly useful, but it becomes critical when issues occur in the system. In this implementation, I would be particularly concerned about over-engineering a complete solution. I would focus initially on writing simple, configurable, reliable producers for server level monitoring, and building libraries to ensure engineers could log events and actions from within application code. It's better to have data you don't yet consume than to be missing data when you need it.
On the Consumer side, I’d focus on critical areas with no good visibility. If there is no existing monitoring, availability monitoring and capacity monitoring are key starting points. Consumer applications should be single purpose and strive to be as simple as possible. In memory or key/value stores can and should be leveraged where appropriate. Each consumer should handle aging or warehousing data as is appropriate for their function. Consumers should also be written with APIs, to allow other consumers to access relevant information.
A challenge with monitoring is how much information to display and how to display it. It's important that consumers are designed to answer the questions being asked, not just to show users lots of information. Keeping consumers small, modular, and focused will be key. Monitoring should be considered an ongoing project. With the increased visibility comes the opportunity to ask more complex questions.