
TechGirlKB

Performance | Scalability | WordPress | Linux | Insights


Posts

What are IOPS and Why are They Important?

In every engineer’s life, there comes a time when your company decides it’s time to choose a new web host. This process can go smoothly, or you can crash and burn. The difference often comes down to one factor: IOPS.

What are IOPS?

So what are IOPS? IOPS stands for Input/Output operations per second: how many read and write operations your server’s storage handles each second. Most web hosts set a pre-determined limit on this value, controlling how many Input/Output operations are allowed at once on your server. After the threshold is reached, the host may begin throttling these operations, causing requests and processes to queue up. Processes stuck waiting on disk sit in an uninterruptible sleep state, which inflates your server load more and more until the backlog of requests finally completes. The time those processes spend waiting is reported as “IOWait.”
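
If you want to see how much of your CPU time is currently lost to IOWait, here’s a minimal sketch that samples it, assuming the third-party psutil package (pip install psutil) is available on a Linux box:

    import psutil

    while True:
        # cpu_times_percent() blocks for the sample interval and reports the
        # share of time the CPU spent in each state; "iowait" is time spent
        # idle while waiting on disk I/O (Linux only).
        cpu = psutil.cpu_times_percent(interval=5)
        print(f"iowait: {cpu.iowait:.1f}%  user: {cpu.user:.1f}%  system: {cpu.system:.1f}%")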

So why are IOPS important?

With that in mind, it’s first important to understand how many IOPS your site really needs. Start by gathering some stats from your current environment. How long does it take your current server to process requests? How many CPU cores and how much memory are allocated in your current environment? At your daily peak for CPU load, how many active processes were running, and how quickly did they complete? What is the ratio of disk reads to disk writes?
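
A rough sketch for answering the last question yourself, again assuming psutil; run it during your daily traffic peak to estimate your current IOPS and read/write ratio:

    import time
    import psutil

    INTERVAL = 60  # seconds to sample

    # disk_io_counters() returns cumulative counters, so sample twice
    # and take the difference over the interval.
    before = psutil.disk_io_counters()
    time.sleep(INTERVAL)
    after = psutil.disk_io_counters()

    reads = after.read_count - before.read_count
    writes = after.write_count - before.write_count
    print(f"read IOPS:  {reads / INTERVAL:.1f}")
    print(f"write IOPS: {writes / INTERVAL:.1f}")
    print(f"read/write ratio: {reads / max(writes, 1):.2f}")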

What IOPS benchmarks are important?

Second, don’t take the IOPS figures different companies boast about too seriously. IOPS can vary drastically depending on the type of operation, the size of the storage block, the number of requests, and the weight of those requests. It would be easy for a vendor to inflate the IOPS their systems can handle by benchmarking a dedicated block with lightweight reads and writes. So it’s important to ask how the benchmarks were run. What was the read/write ratio? Were there other tenants on the disk? Is there a set point at which requests are throttled?

Getting specific

It might be more appropriate to set up a test instance of one of your heavier-trafficked sites on the new web host and perform load testing, to see how the kind of traffic your site normally experiences performs in the new environment. The more specific the testing is to your own needs, the better. Be sure to check the vendor’s policies on load testing before getting started.
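
If you don’t have a load-testing tool handy, here’s a minimal sketch using only the Python standard library. The URL and concurrency numbers are placeholders; point it at your own test instance, and confirm the host allows load testing first.

    import time
    from concurrent.futures import ThreadPoolExecutor
    from urllib.request import urlopen

    URL = "https://staging.example.com/"  # hypothetical test instance
    REQUESTS = 200
    CONCURRENCY = 20

    def fetch(_):
        # Time a full request/response cycle for one page load.
        start = time.perf_counter()
        with urlopen(URL) as resp:
            resp.read()
        return time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
        timings = sorted(pool.map(fetch, range(REQUESTS)))

    print(f"median: {timings[len(timings) // 2]:.3f}s")
    print(f"p95:    {timings[int(len(timings) * 0.95)]:.3f}s")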

Scaling a Forum Site

Help! My Site is Super Dynamic!

One of the most troublesome issues a developer or engineer can face is keeping hosting costs low for a highly dynamic or inherently uncacheable site. How can a popular site that’s constantly updating scale well with spikes in traffic?

Many times the initial reaction will be: you simply can’t. Most systems that allow WordPress sites to scale well involve full-page caching, which simply isn’t an option for these types of sites. You need users to see changes as they happen on the site, not minutes or hours later. That level of dynamic content is hard to scale.

Fragment Caching

If the budget simply won’t allow for more hardware, then it’s time to start thinking about what can and can’t be cached on the site. Are the header and footer always the same no matter which user is on the site? Is the front page the same? WordPress includes an Object Cache class that stores anything you save with the wp_cache_set() function. By default the Object Cache is non-persistent, but you can couple it with Memcached to keep Object Cache items in memory, persisted across requests and served to all users.

By default WordPress serves repeated query results from the Object Cache. But really, anything you store with wp_cache_set() can be served from cache, including the HTML output of your header, footer, sidebar, and more. This is commonly known as “fragment caching.” WordPress provides some great examples of how to implement it on the Object Cache codex page.
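
The WordPress functions are PHP, but the fragment-caching pattern itself is language-agnostic. Here’s a minimal sketch of the idea in Python, assuming the third-party pymemcache client and a hypothetical render_sidebar() standing in for the expensive queries and templating that build the fragment:

    from pymemcache.client.base import Client

    cache = Client(("127.0.0.1", 11211))  # local memcached instance

    def render_sidebar():
        # Hypothetical stand-in for the database work that normally
        # builds this HTML fragment.
        return "<aside>...recent posts...</aside>"

    def get_sidebar_html():
        # pymemcache returns bytes on a hit, None on a miss.
        cached = cache.get("fragment:sidebar")
        if cached is not None:
            return cached.decode()
        html = render_sidebar()
        # Cache the rendered fragment for 5 minutes so every request in
        # that window skips the database work.
        cache.set("fragment:sidebar", html, expire=300)
        return html

The rest of the page is still generated fresh on every request; only the fragments you know are identical for all users get reused.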

Microcaching

On top of fragment caching, you may also consider “microcaching” at the Nginx level. This is a little-known technique wherein Nginx caches your site’s static files like images, CSS, and JavaScript for long periods of time, while caching generated pages for a single second. This can vastly improve your site’s scalability if your site is constantly updating and changing. Check out the benchmarks in Microcaching WordPress in Nginx for a comprehensive example.
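
A sketch of what a microcaching setup might look like in an Nginx config, assuming a PHP-FPM backend; the cache path, zone name, and socket are placeholders, and your real PHP location will also need the usual fastcgi_param settings:

    # One shared cache zone for one-second page caching.
    fastcgi_cache_path /var/cache/nginx levels=1:2 keys_zone=microcache:10m
                       max_size=100m inactive=60s;

    server {
        # Long-lived caching for static assets.
        location ~* \.(css|js|png|jpe?g|gif|svg)$ {
            expires 30d;
        }

        location ~ \.php$ {
            fastcgi_pass unix:/run/php-fpm.sock;   # adjust to your socket
            include fastcgi_params;

            fastcgi_cache microcache;
            fastcgi_cache_key "$scheme$request_method$host$request_uri";
            fastcgi_cache_valid 200 1s;            # cache pages for one second
            fastcgi_cache_use_stale updating;      # serve stale while refreshing
            fastcgi_cache_lock on;                 # one request regenerates, the rest wait
        }
    }

Even a one-second cache means that under a spike of 500 requests per second, only one request per second actually hits PHP and the database.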


Linux and Page Faults

What are Page Faults?

If you’ve ever monitored a Linux process, you might notice an odd metric: page faults. On the surface, page faults can sound scary. But in reality, a page fault is part of the natural process of allocating physical memory to processes.

Linux divides its physical memory into “pages,” which are then mapped into the virtual memory of individual processes. Each page equates to about 4KB of physical memory. And just because pages are mapped to a certain process doesn’t necessarily mean the process is actively using them: a process can hold both active and inactive pages.

So what is a page fault? Don’t worry, it doesn’t really have anything to do with errors. A page fault simply happens when a program touches code or data that isn’t currently held in physical pages of memory. Linux responds by mapping the needed pages into memory so the program can continue executing.

Minor Page Faults vs Major Page Faults

You may also see a shockingly high number of “minor page faults” when monitoring a process. Minor page faults are even less worrisome than major ones. A minor fault simply means a process requested data that’s already in memory; the pages just weren’t mapped to the process requesting them. Linux maps the existing pages into the process (sharing them between processes where possible) without touching the disk. The easiest way to remember the difference: minor page faults can be served from pages already in physical memory, while major page faults have to read data in from disk.
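
You can watch this happen from a process’s own point of view. A minimal sketch using only the Python standard library (Linux/Unix only, since it relies on the resource module):

    import resource

    print(f"page size: {resource.getpagesize()} bytes")  # typically 4096

    # Allocate and zero-fill a chunk of new memory, then compare the
    # process's cumulative fault counters before and after.
    before = resource.getrusage(resource.RUSAGE_SELF)
    data = bytearray(50 * 1024 * 1024)  # touching 50MB forces new pages in
    after = resource.getrusage(resource.RUSAGE_SELF)

    print(f"minor faults: {after.ru_minflt - before.ru_minflt}")
    print(f"major faults: {after.ru_majflt - before.ru_majflt}")

On a healthy system you should see thousands of minor faults from the allocation and few or no major faults, since nothing had to be read from disk.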

Page Faults and Swap

So what happens if your server has already mapped all of its physical memory pages to processes, but more memory is needed to perform a task? The system begins “writing to swap.” This is not a great scenario in any situation: the system writes some of the pages it’s holding in memory out to disk, freeing those pages to satisfy incoming page fault requests.

Compared to serving data from memory, writing pages to disk is extremely slow. And writing to disk is an I/O operation, which many hosts throttle if it happens too often. Of everything related to page faults, swap usage is the most worrisome, because it can easily become a downward spiral: pages are written to disk to make room, then faulted back in, evicting other pages, and the cycle repeats again and again (a pattern known as thrashing) until your system crashes.

With that in mind, the most important thing to monitor is not necessarily the number of major or minor page faults, but rather the amount of swap being used.
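
A minimal sketch for checking that, reading swap usage straight from /proc/meminfo (Linux only, standard library only):

    def swap_usage_kb():
        meminfo = {}
        with open("/proc/meminfo") as f:
            for line in f:
                key, value = line.split(":")
                meminfo[key] = int(value.split()[0])  # values are in kB
        total = meminfo["SwapTotal"]
        used = total - meminfo["SwapFree"]
        return used, total

    used, total = swap_usage_kb()
    if total:
        print(f"swap used: {used} kB of {total} kB ({100 * used / total:.1f}%)")
    else:
        print("no swap configured")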


Have you experienced issues with swap usage or page faults on your own system? Tell me about it in the comments.

TTFB and PHP

What Causes TTFB?

A high Time to First Byte, or TTFB, is a commonly misunderstood problem on websites. It’s easy to look at this metric strictly as a server issue. In reality, a high TTFB doesn’t necessarily mean your server is slow or overloaded. So why would a page show over 1 second of TTFB when the server’s load looks just fine? The answer typically comes down to the dynamic page-creation process required by PHP sites.

PHP vs HTML

To fully understand this issue, first compare PHP with plain HTML. Static HTML websites offer a nice simplicity: when a page is requested, the web server simply has to locate and serve a file. With WordPress and other PHP-based sites, the web server instead follows a set of directives to generate the page, executing PHP code and communicating with a database.

The database provides information like post content, post IDs, URLs, and autoloaded options. Only once the page has been generated can it be served to your site’s visitor. So while an HTML site has a single file to serve on each visit, a PHP site has to build the page anew each time it’s requested. By nature, this takes longer! That’s expected. But PHP offers far more flexibility in the dynamic content it can serve, which is why so many users still choose it.

Cause and Effect

So what should you try if you have a high TTFB? What if it’s intermittent? How does it affect your site? For one, a high enough TTFB can drive up your bounce rate: users typically wait only 2-3 seconds for content to start appearing before they leave, or at least grow frustrated. Not to mention, TTFB can affect your search rankings too, since Google ranks pages not just on popularity but also on how fast they load, security, and mobile readiness.

Troubleshooting

If your WordPress site is plagued by high TTFB, start troubleshooting by creating an uncached dev environment to factor out any page caching that’s been helping intermittently. Since page cache stores a static copy of your page to be served to repeat visitors, it can often confuse the matter by showing low TTFB intermittently on your tests. Using an uncached environment will give you a clearer view of what’s truly causing the issue.

In your staging environment, start by eliminating factors en masse: try deactivating all plugins, or activating a default theme. After each change, test your TTFB with a resource like www.webpagetest.org. If it ends up being the plugins, re-enable them one at a time (or in small groups if you have a lot of plugins to get through), testing after each change. Once you have narrowed it down, you can also try a diagnostics tool like the Query Monitor plugin to show whether any queries to the database are to blame. This can help you determine if a specific setting in your plugin or theme is affecting your TTFB.
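
For quick before/after checks between plugin toggles, here’s a rough sketch that spot-checks TTFB using only the Python standard library; the URL is a placeholder for your dev environment, and this is no substitute for a full waterfall from a tool like webpagetest.org:

    import time
    from urllib.request import urlopen

    URL = "https://staging.example.com/"  # hypothetical dev environment

    start = time.perf_counter()
    with urlopen(URL) as resp:
        resp.read(1)  # returns as soon as the first byte of the body arrives
        ttfb = time.perf_counter() - start
        resp.read()   # drain the rest of the response
    print(f"TTFB: {ttfb * 1000:.0f} ms")

Note this timing includes DNS lookup, TCP connect, and TLS negotiation, which is consistent with how most testing tools report TTFB.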

What struggles with TTFB have you experienced? Were you able to find the source of your issue? Tell your story in the comments.

