IT Best Practices: Troubleshooting issues in the cloud
Although Amazon is known for its reliability and advanced features, problems and outages can still occur. In a nutshell, Amazon EC2 instances act as virtual servers that run your applications (you can select from multiple combinations of CPU, memory, storage, and networking capacity), so they are the fundamental building blocks for your cloud computing needs. Similarly, Amazon EBS volumes provide persistent block-level storage for use with Amazon EC2 instances. You can also rely on Amazon RDS instances to set up, operate, and scale a relational database in the cloud, or opt for unlimited cloud and internet storage with Amazon S3 buckets.
If you are an IT pro running your applications in the cloud, be aware that slow responses, Web problems and performance bottlenecks can also be due to issues with your cloud infrastructure, so you want to ensure your cloud resources are functioning at peak performance for your applications. Plus, since Amazon has a pay-as-you-go pricing model, you want to ensure that your cloud capacity is properly used, so you don’t overpay for cloud resources you don’t really need. In addition, in a multi-tenancy deployment, where you share physical servers with other companies (your "neighbors"), their Web operations can also impact your performance (e.g. “stolen CPU”). Site24x7 gives you complete visibility into your Amazon cloud resources, so you can ensure your cloud resources are properly allocated and your application performance is top-notch.
Here are some quick guidelines to help you get started:
- Configure Site24x7 Amazon Monitors. Read more.
- Site24x7 will automatically discover and track key performance indicators across all your cloud resources including:
- EC2 instances (availability, CPU utilization, network traffic, Disk I/O)
- EBS Volumes (volume traffic, latency, volume I/O, bandwidth, throughput)
- RDS Instances (network read/write latency, read/write throughput, CPU, active database connections, availability and average number of disk read/write operations per second)
- S3 Buckets (name, location, creation time, size, number of objects and virtual folders)
For example, as shown below, once Site24x7 is configured, you can see at a glance the availability and performance of your Amazon cloud (12 EC2 instances, 13 EBS Volumes and 1 RDS instance).
1. Memory Leaks: Just like in a datacenter, processes that are not written to use memory effectively will exhaust the memory allocated to the EC2 instance. When an application runs out of memory, it can crash and cease functioning altogether. If available memory is below 10%, you might have a memory leak, especially if you see a sudden return to normal once a faulty process is terminated and restarted.
2. Latency problems: Unpredictable EBS volumes with high latency will slow down your applications as processing queues up. In addition, a sustained VolumeQueueLength above 1 on a standard EBS volume should be treated as a sign that the volume's throughput is exhausted, an issue that should be addressed.
3. “Stolen CPU”: This is a measure of the cycles a CPU should have been able to run but could not because the hypervisor diverted cycles away from your instance to a neighbor. High CPU steal is usually an indicator of noisy neighbors. Using a command such as ‘iostat 1’ you can measure the amount of CPU steal your EC2 instance is experiencing (reported as %steal). To judge whether you have enough resources for your application, baseline CPU usage during normal operations and at peak times. If you detect stolen CPU in your EC2 instance, redeploy the application elsewhere.
4. Corrupt Disk or Disk Full: The filesystem can become read-only or run out of space, and you will be unable to write to disk. To detect this type of problem, you can analyze the Site24x7 screenshot shown below. To correct this problem, re-launch the problematic instance.
5. Instance at capacity: You can discover that your instance is at capacity by checking how CPU, memory, and disk I/O evolve over time. If this is a problem for you, consider upsizing your EC2 instance to a combination with more CPU, memory, storage, and networking capacity.
6. RDS at Capacity: Like EC2 instances, databases can also hit capacity. This can be discovered by checking the CPU, memory, and disk I/O on the RDS instance.
7. Underutilization of EC2 or RDS: In a cloud deployment, resources can also be underutilized and eat into your bill. You should periodically review and analyze CPU, memory and disk metrics to keep capacity in balance with what your applications need. If your cloud resources are underutilized, consider downsizing your EC2 instances.
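The host-level checks from items 1 to 3 above (available memory below 10%, a sustained VolumeQueueLength above 1, and CPU steal) can be sketched in Python as follows. This is a minimal sketch, not something Site24x7 or Amazon provides: the function names are hypothetical, it assumes a Linux instance that exposes /proc/meminfo (with the MemAvailable field) and /proc/stat, and it assumes you already have VolumeQueueLength datapoints from wherever you collect that metric. How many consecutive datapoints count as "sustained" is left as a parameter, since the text above does not specify it.

```python
# Illustrative helpers only; names and parameters are assumptions, not a vendor API.

def available_memory_pct(meminfo_text):
    """Percent of RAM still available, given the text of /proc/meminfo."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[key.strip()] = int(parts[0])  # first token is the numeric value
    # MemAvailable is assumed to be present (reasonably recent Linux kernels).
    return 100.0 * fields["MemAvailable"] / fields["MemTotal"]


def cpu_times(stat_text):
    """Counters from the aggregate 'cpu' line of /proc/stat, as a list of ints."""
    for line in stat_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "cpu":
            return [int(value) for value in parts[1:]]
    raise ValueError("no aggregate cpu line found")


def steal_pct(before, after):
    """Percent of CPU time stolen between two cpu_times() snapshots."""
    deltas = [b - a for a, b in zip(before, after)]
    # Field order: user nice system idle iowait irq softirq steal guest guest_nice.
    # Guest time is already included in user/nice, so only the first 8 are summed.
    total = sum(deltas[:8])
    return 100.0 * deltas[7] / total if total else 0.0


def sustained_above(values, threshold, min_points):
    """True if at least min_points consecutive values exceed threshold."""
    run = 0
    for value in values:
        run = run + 1 if value > threshold else 0
        if run >= min_points:
            return True
    return False
```

For example, on the instance itself you could read /proc/stat twice, a second or so apart, pass both snapshots through cpu_times, and compare the steal_pct result against the baseline you recorded during normal operations. Similarly, available_memory_pct returning less than 10 would match the memory warning sign described in item 1.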
Are you looking for additional IT troubleshooting tips? Check out our recent blogs: Web performance waterfall charts for IT pros and Troubleshooting slow SQL queries.
Good luck with your troubleshooting efforts!