Notes on Systems Performance by Brendan Gregg
Terminology
- IOPS: Input/output operations per second, e.g., reads and writes per second
- Throughput: The rate of work performed. For databases, throughput can refer to the operation rate (number of transactions per second)
- Latency: A measure of time an operation spends waiting to be serviced.
- Response time: The time for an operation to complete (including the time to transfer the result)
- Utilization: Often used for operating systems to describe device usage, such as for the CPU. It can be time-based: the average amount of time the server or resource was busy, U = B/T (e.g., a disk busy for 54 ms during a 100 ms interval has U = 54%). It can also be capacity-based: the proportion of a resource's capacity in use, as in capacity planning.
- Saturation: The degree to which more work is requested of a resource than it can process. Saturation begins to occur at 100% utilization (capacity-based); e.g., no thread is available to handle a request, so the request must be queued
- Knee Point: the point where performance stops scaling linearly due to a resource constraint.
- Pressure Stall Information (PSI): Linux metrics giving averages of time stalled waiting on CPU, memory, and I/O (10-, 60-, and 300-second averages, exposed under /proc/pressure/).
- inode: an index node is a data structure containing metadata for a file system object, including permissions, timestamps and data pointers.
OS Terminology
- Interrupts: An interrupt is a signal to the processor that some event has occurred and needs processing; it interrupts the processor's current execution to handle it.
- Async Interrupts: Hardware devices can send interrupt service requests to the processor, which arrive asynchronously
- Sync Interrupts: Generated by software instructions (traps, exceptions, and faults). For these interrupts, the responsible software and instructions are still on-CPU.
- User land: User-level programs and libraries
- Context Switch: A switch from running one thread or process to another
- Mode Switch: A switch between Kernel and User Mode
Hardware related
- Hardware Counters (PMCs): performance monitoring counters are programmable hardware registers on the processor that provide low-level performance information at the CPU cycle level. Only with PMCs can you measure the efficiency of CPU instructions, the hit ratios of the CPU caches, the utilization of memory and device buses, and so on.
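A minimal sketch of reading one PMC, assuming Linux and glibc: perf_event_open(2) has no glibc wrapper, so it is invoked via syscall(2). The counter choice (retired instructions) and the toy workload are illustrative, not from the notes above.

```c
/* Sketch: count retired instructions for a code section using a PMC. */
#include <linux/perf_event.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.config = PERF_COUNT_HW_INSTRUCTIONS;  /* retired instructions */
    attr.disabled = 1;                         /* start counting manually */
    attr.exclude_kernel = 1;

    /* measure this process (pid 0) on any CPU (-1) */
    int fd = syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
    if (fd == -1) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    volatile long sum = 0;                     /* toy workload to measure */
    for (long i = 0; i < 1000000; i++) sum += i;

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    long long count;
    read(fd, &count, sizeof(count));           /* read the counter value */
    printf("instructions: %lld\n", count);
    close(fd);
    return 0;
}
```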
How the OS works
Creating a process
Processes are normally created using the fork(2) system call on Unix systems (on Linux, fork is normally a wrapper around the clone(2) syscall). The exec(2) system call can then be called to begin execution of another program.
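A minimal sketch of this fork-then-exec pattern; error handling is abbreviated, and the program run (ls) is just an example:

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    pid_t pid = fork();               /* duplicate the current process */
    if (pid == -1) {
        perror("fork");
        exit(1);
    }
    if (pid == 0) {
        /* child: replace this process image with another program */
        execlp("ls", "ls", "-l", (char *)NULL);
        perror("execlp");             /* only reached if exec failed */
        exit(1);
    }
    waitpid(pid, NULL, 0);            /* parent: wait for the child */
    return 0;
}
```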
Stacks
A stack is a memory storage area for temporary data, organized as a LIFO list. It is used to store less important data than that which fits in the CPU register set.
While executing a syscall, a process thread has two stacks: a user-level stack and a kernel-level stack.
Linux
Syscalls
- poll(2) is a system call to check for the status of file descriptors. It serves a similar function to polling in a loop, although it is event-based, so it does not suffer the performance cost of busy polling. The poll(2) interface supports multiple file descriptors as an array, which requires the application to scan the array when events occur to find the related file descriptors, an O(n) operation. An alternative is epoll(7), which can avoid the scan and is therefore O(1). A sketch of poll(2):
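A minimal poll(2) sketch watching a single descriptor (stdin) for readability; the timeout and buffer size are arbitrary:

```c
#include <poll.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    struct pollfd fds[1];
    fds[0].fd = STDIN_FILENO;
    fds[0].events = POLLIN;          /* interested in readability */

    int n = poll(fds, 1, 5000);      /* block up to 5000 ms for an event */
    if (n > 0 && (fds[0].revents & POLLIN)) {
        char buf[256];
        ssize_t len = read(STDIN_FILENO, buf, sizeof(buf));
        printf("read %zd bytes\n", len);
    } else if (n == 0) {
        printf("timeout\n");
    }
    return 0;
}
```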
Tooling / Kernel
- netlink is a special socket address family for fetching kernel information. Use of netlink involves opening a socket with the AF_NETLINK address family and then using a series of send(2) and recv(2) calls to pass requests in binary structs. While this is a more complicated interface than /proc, it is more efficient. (A minimal sketch appears after this list.)
- strace: The most common use is to start a program using strace, which prints a list of the system calls made by the program.
- kprobes (https://docs.kernel.org/trace/kprobes.html) Kprobes enables you to dynamically break into any kernel routine and collect debugging and performance information non-disruptively
- uprobes: similar to kprobes, but allow you to dynamically instrument functions in user-space applications and libraries
- load averages: The load average is an average of the number of runnable processes over a given time period. For example, on a computer with one CPU, a load average of 10 would mean that on average 1 process was running on the CPU while 9 others were ready to run (i.e., runnable, not blocked on I/O). When running uptime, the load averages shown are for the last 1, 5, and 15 minutes. If we had 10 CPUs and a load average of 10, it would mean that on average there were just enough runnable threads to keep every CPU busy. Note: on Linux, the load averages also include tasks blocked on disk and other resources (uninterruptible sleep), not just CPU demand. (See the getloadavg(3) sketch below.)
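As referenced in the netlink item, a minimal sketch assuming Linux: it sends an RTM_GETLINK dump request over an AF_NETLINK socket and walks the binary replies. Error handling and multi-part reply handling are mostly omitted.

```c
/* Sketch: ask the kernel for its network interface list over netlink
 * (NETLINK_ROUTE / RTM_GETLINK), the same mechanism ip(8) uses. */
#include <linux/netlink.h>
#include <linux/rtnetlink.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
    if (fd == -1) { perror("socket"); return 1; }

    /* request: dump all links, as a binary nlmsghdr + ifinfomsg struct */
    struct {
        struct nlmsghdr nlh;
        struct ifinfomsg ifm;
    } req;
    memset(&req, 0, sizeof(req));
    req.nlh.nlmsg_len = NLMSG_LENGTH(sizeof(req.ifm));
    req.nlh.nlmsg_type = RTM_GETLINK;
    req.nlh.nlmsg_flags = NLM_F_REQUEST | NLM_F_DUMP;
    req.ifm.ifi_family = AF_UNSPEC;

    send(fd, &req, req.nlh.nlmsg_len, 0);

    char buf[8192];
    int len = recv(fd, buf, sizeof(buf), 0);   /* first batch of replies */
    if (len < 0) { perror("recv"); return 1; }
    for (struct nlmsghdr *nh = (struct nlmsghdr *)buf;
         NLMSG_OK(nh, len); nh = NLMSG_NEXT(nh, len)) {
        if (nh->nlmsg_type == NLMSG_DONE) break;
        struct ifinfomsg *ifm = NLMSG_DATA(nh);
        printf("interface index %d\n", ifm->ifi_index);
    }
    close(fd);
    return 0;
}
```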
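And a minimal sketch of reading the same 1-, 5-, and 15-minute load averages programmatically, using the glibc getloadavg(3) wrapper (on Linux the values come from /proc/loadavg):

```c
#define _DEFAULT_SOURCE   /* exposes getloadavg(3) in glibc */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    double load[3];
    if (getloadavg(load, 3) == -1) {
        perror("getloadavg");
        return 1;
    }
    /* same three numbers printed by uptime(1) */
    printf("load averages: %.2f %.2f %.2f\n", load[0], load[1], load[2]);
    return 0;
}
```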
Application performance techniques
- Increasing the I/O size is a common strategy used by applications to improve throughput. There is a downside when the application does not need larger I/O sizes: a database performing 8 Kbyte random reads may run more slowly with a 128 Kbyte disk I/O size, since 120 Kbytes of each read are wasted.
- Buffering: collecting data before pushing it to the next level can be used to improve write performance (a sketch follows).
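A minimal sketch of user-level write buffering with stdio; the file name and 64 KB buffer size are arbitrary. setvbuf(3) makes the stream fully buffered, so many small fprintf(3) calls coalesce into few large write(2) syscalls:

```c
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("out.dat", "w");
    if (!f) { perror("fopen"); return 1; }

    /* fully buffered, 64 KB: records accumulate in user space and are
     * pushed to the kernel in large chunks */
    setvbuf(f, NULL, _IOFBF, 64 * 1024);

    for (int i = 0; i < 100000; i++)
        fprintf(f, "record %d\n", i);   /* no syscall on most iterations */

    fclose(f);                          /* flushes the final buffer */
    return 0;
}
```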
I/O
- Write-back caching: treats writes as completed after the transfer to main memory, and writes them to disk some time later. Also called asynchronous writes.
- Synchronous writes: for databases, writes that have only reached main memory are unacceptable (they could be lost on a crash). It is possible to write synchronously by using a flag when opening the file (e.g., O_SYNC); see the first sketch at the end of this section.
- Committing previous writes: instead of synchronously writing each individual I/O, applications may commit (flush) the previous writes using syscalls such as fsync(2), as shown in the same sketch below.
- Pre-fetch: while reading, the kernel can detect a sequential read workload based on the current and previous file I/O offsets, and then predict and issue disk reads before the application requests them (see the posix_fadvise(2) sketch below).
- raw I/O: issued directly to disk offsets, bypassing the file system. It has been used by some applications, such as databases, that can manage and cache their own data better than the file system cache can.
- direct I/O: applications can use the file system but bypass the file system cache (see the O_DIRECT sketch below).
- memory-mapped files: memory mappings are created using the mmap(2) syscall. This avoids the syscall and context-switch overheads incurred when calling read(2) and write(2) to access file data. A disadvantage of using mappings on multiprocessor systems can be the overhead of keeping each CPU's MMU in sync (a sketch follows the others below).
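Sketch for the synchronous-write and commit bullets above, assuming Linux/POSIX; file names and sizes are illustrative, and error handling is abbreviated:

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* O_SYNC: each write(2) completes only after reaching stable storage */
    int sync_fd = open("log.dat", O_WRONLY | O_CREAT | O_SYNC, 0644);
    write(sync_fd, "commit\n", 7);      /* return values unchecked for brevity */
    close(sync_fd);

    /* buffered writes: fast, completed once in main memory... */
    int fd = open("data.dat", O_WRONLY | O_CREAT, 0644);
    for (int i = 0; i < 1000; i++)
        write(fd, "row\n", 4);
    /* ...then commit (flush) all previous writes at once */
    if (fsync(fd) == -1)
        perror("fsync");
    close(fd);
    return 0;
}
```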
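For pre-fetch, the kernel's sequential detection is automatic; a related application-side technique (not the kernel mechanism itself) is hinting the access pattern with posix_fadvise(2), sketched here with an illustrative file name:

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("bigfile.dat", O_RDONLY);
    if (fd == -1) { perror("open"); return 1; }

    /* hint: the whole file (offset 0, length 0 = to EOF) will be read
     * sequentially, which typically enables more aggressive readahead */
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

    char buf[65536];
    while (read(fd, buf, sizeof(buf)) > 0)
        ;                               /* consume the data */
    close(fd);
    return 0;
}
```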
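Sketch for the direct I/O bullet: opening with O_DIRECT (a Linux flag) bypasses the file system cache. The 4096-byte buffer alignment is a common requirement but is device-dependent:

```c
#define _GNU_SOURCE            /* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    int fd = open("data.dat", O_RDONLY | O_DIRECT);
    if (fd == -1) { perror("open"); return 1; }

    void *buf;
    /* O_DIRECT requires aligned buffers; allocate 4096-byte-aligned memory */
    if (posix_memalign(&buf, 4096, 4096) != 0) return 1;

    ssize_t n = read(fd, buf, 4096);   /* read bypasses the page cache */
    printf("read %zd bytes directly\n", n);

    free(buf);
    close(fd);
    return 0;
}
```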
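And a sketch for memory-mapped files: mapping a file with mmap(2) and accessing it as ordinary memory, with no per-access read(2) calls (the file name is illustrative; the kernel pages data in on demand):

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("data.dat", O_RDONLY);
    if (fd == -1) { perror("open"); return 1; }

    struct stat st;
    fstat(fd, &st);                    /* get the file size to map */

    char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    /* access file bytes as ordinary memory: count newlines */
    long count = 0;
    for (off_t i = 0; i < st.st_size; i++)
        if (p[i] == '\n') count++;
    printf("%ld lines\n", count);

    munmap(p, st.st_size);
    close(fd);
    return 0;
}
```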