UNIX Program Execution, CPU, Memory and Virtualization

More notes on UNIX Systems. See other article here:

Unix System

Program Layout in Memory

Program versus process

A program is an executable file containing a sequence of instructions
A process is a program in execution
For reasons of technology today
- A program is usually stored on disk or other non-volatile secondary storage
- A process resides, at least partially, in working memory of the computer

Process memory layout

32-bit systems usually have shared libraries at the lowest address, followed by the text segment
Starting addresses for the text segment vary
- Usually 0x400000 or 0x8048000 = low address

Text and data segments

Text segment is also called the code segment
- Contains actual program instructions and any statically linked libraries
- Typically marked as executable and read only
- Self-modifying code is a special case
Data segment
- Contains initialized global variables and static strings

BSS segment

BSS = Block Started by Symbol
or BSS = Better Save Space
Holds uninitialized global variables
- By C language convention these variables are automatically initialized to 0
- Initialization to 0 may be done when executable image is loaded or done dynamically when referenced
- Dynamic initialization called zero-fill-on-demand

Heap segment

Dynamically allocated memory
- An allocation is obtained by making a call to malloc()
Grows upward (increasing memory addresses) as more space is requested

Stack segment

Holds temporary, or automatic, variables
- Arguments that are passed during a function call
- Information needed to return to a previous point in the program
Grows downward (decreasing addresses) as items are added
Retreats upward as items are removed
Stacks are LIFO (last-in, first-out) queues
Two operations
- PUSH – add an item to the stack
- POP – retrieve the most recently added item

Stack Memory

X86_64 hardware support for stack

RBP, base pointer
RSP, stack pointer, points to next available address
RSP decrements/increments by 8 (byte addresses), which is done by:
PUSH and POP instructions

RBP – base pointer

Represents currently active region of the stack
Used in combination with an offset to reference all local variables
- Accesses are relative to RBP
RBP updated any time a function is called or returns

Stack frame

Frame is pushed to the stack upon a function call
Popped on function return
Contains data for function calls, typically
- Previous value of frame pointer
- Parameters for callee routine
- Return address to caller routine
- Return value
- Automatic (local) variables of callee routine
Frame details are defined by the calling convention

Stack frame layout as a linked list

RBP always points to the start of the previous stack frame
- Which contains the prior RBP
Sequence
- RBP was pushed onto the stack
- RSP was moved to RBP
- Arguments to func1() were pushed onto the stack
- func1() is called
- Space for local variables was allocated
- Local variables set to initial values, if specified
Note this is a linked list (look at Lab)

Two types of calling consents for stacks: Caller Saved, Callee Saved

Register preservation – caller saved

Functions share one set of processor registers
Calling convention dictates how they are shared
- Caller-saved: the calling function is responsible for saving them

void foo() {
// push registers to save them before calling bar()
bar();
// pop regs to restore registers
}

Register preservation – callee saved

Callee-saved: called function must save and restore

void bar() {
// push registers to save them before bar() re-uses them
do_things; // likely placing local values in registers
// pop registers to restore them just before return
return;
}
...
void foo() {
lines_of_code;
bar();
other_lines_of_code;
}

Calling conventions

Dictate how registers are shared
How the stack is managed when a function is called
- Return address location
- For x86, how RBP, etc are employed
Also dictate how a process interacts with the kernel

cdecl – “C declaration”

Function parameters pushed onto stack in right to left sequence (the last argument is pushed on the stack first)
RAX for (primitive) return values
Caller-saved stack
Other conditions pertain

System V AMD64 ABI

ABI – application binary interface, an interface between two binary program modules, often one module is a user program and the other is the operating system kernel
System V AMD64 ABI is used by Solaris, Linux, FreeBSD, and macOS
Defacto standard for Unix systems
RDI, RSI, RDX, RCX, R8, and R9
- integer or pointer arguments
- R10 instead of RCX for kernel
XMM0-7
- Floating point
Additional arguments on the stack
Return value in RAX and RDX
Callee must save and restore RBP, RBX, R12-R15 if used
- Not for system calls
Lots of details
- http://refspecs.linuxbase.org/elf/x86_64-abi-0.99.pdf

Additional – Memory Management in Linux, Use of register pointers.

See how the Stack Pointer is below the EBP. The stack pointer is pointing to local variables. EBP+4 points to the instruction address for caller to call next, above that are the argument variables.

CPU accesses all of the data of current stack frame in execution through the EBP register value.

Everything works around the EBP pointer. The Base Pointer does not change, becomes the constant starting point. You add/minus from it to access other registers. The ESP is always changing.

Values above the EBP are return addresses and input arguments. Everything below the EBP are local variables/functions.

Programs that can manipulate these memory pointers to run malicious code underneath the program and access system resources as root.

CPU level security

Advanced Encryption Standard (AES) Instruction Set

ASENC: One round
AESNCLAST: Last round
AESDEC: One round decryption
AESDECLAST: Last round decryption
AESKEYGENASSIST: Assist in AES round key generation
AESIMC: AES Inverse Mix Columns
PCLMULQDQ: Carry-less multiply

RDRAND (intel)

X86 on-chip random number generator
Used by “Intel Secure Key”
- Pairs of 256-bit raw entropy samples from “hardware entropy source” applied to AES in CBC-MAC mode
- Reduces it to a single 256-bit conditioned sample, used as the seed for a cryptographically-secure deterministic random-bit generator
NIST SP 800-90A
Maximum of 511 128-bit samples before seed is changed
RDSEED instruction
- Newer, intended as entropy source for software pseudorandom number generator (PRNG)
- Thermal noise-based = pulls the electricity from CPU which could vary and thereby adds additional noise to the randomness
How do you know the thermal noise is truly random

Address space layout randomization (ALSR)

Mitigates against return-to-libc style attacks
- Such attacks require reliably locating relevant function(s) so that they can be misused
ASLR changes location of executable, stack, heap, libraries
Robustness requires more than 32-bit address space

NX bit – which addresses hold code, which not

No-eXecute
Segregate code from other storage
Historically implicit with Harvard architectures where separate memories hold instructions and data
Mark certain areas of memory as non-executable
- Limits effectiveness of buffer overflow attacks
- Also hinders, return-to-libc attacks

Exec shield – security enhancement for Linux

Developed at Red Hat, first release by Ingo Molnar in 2003
Approximates NX emulation by tracking upper code segment limit
- Cannot protect pages below the limit
- mprotect() higher memory (like the stack)? Everything below it is now executable

PaX – security enhancement for Linux

ASLR for Linux
From the grsecurity project
NX emulation and a many other related capabilities

RSBAC – security enhancement for Linux

Rule Set Based Access Control
Last major update, 9/13/2016
Similar to SELinux

AppArmor – security enhancement for Linux

Kernel module
Alternative to SELinux
Filesystem agnostic
Included in Ubuntu
Owned by SUSE

Stack canaries

Use to detect a stack buffer overflow before injected code execution can begin
Place a small, random integer just before (at a smaller address) the return address
To overwrite the return address thus requires overwriting the canary
Check the canary before returning, if canary value has changed its detecting a possible hack, therefore perhaps crash the program
gcc option for using stack canaries: -fstack-protector-all

Virtual Memory

Typical memory sizes and access times

For performance reasons, application program use of DRAM and hard disk should be carefully managed

Level	Size in bytes	Type access time (ns)
Registers (64 64bit)	512	0.1
DRAM (8 GB)	4294967296	60.00
Hard Disk (1 TB)	1000000000000	10000000.00

Memory has been complex for a long time

“… a system has been devised to make the core and drum* combination appear to the programmer as a single level store, the requisite transfers taking place automatically.”
- Kilburn et al., “One-level storage systems”, 1962
*Drum: same concept as a hard disk drive, but much simpler mechanically, using stationary heads located at various positions around a rotating cylinder

Virtual memory motivation

Allow efficient and safe [correct] sharing of memory among multiple programs
Allow main memory to serve as a cache for hard disk drives (transparent to application)
Remove the programming burdens arising from a too small amount of main memory
Note: motivations change with changes in memory technology

Memory management, manual versus automatic

Suppose we have:
- Internet Explorer (100MB)
- Microsoft Word (100MB)
- Yahoo Messenger (30MB)
- Operating System (90MB)
Computer has 256MB of RAM
- May have to quit a program before starting another
Virtual memory loads/unload programs or portions thereof automatically as needed

Programmer burden

Without virtual memory, it is the programmer’s job to make programs fit
- Divide into mutually exclusive chunks of code and data
- Dynamically load/unload chunks as needed
- Same for libraries
Difficult task

Safe sharing of main memory

Do not known in advance which programs will share main memory
Virtual memory limits sharing to explicit cases
How? Every program has its own address space, these are virtual addresses
Virtual memory translates virtual addresses to physical addresses
Also enforces protection

Historic virtual memory implementations

Historically
- Process swapping – entire memory footprint of process moved in and out (swapped) between memory and disk
- Segment swapping – entire parts, “segments” (determined by programmer) are swapped
Drawbacks
- Too much information moved at one time
- Slow, inefficient

Virtual memory terminology

Physical memory divided into fixed power-of-2 size frames
Virtual memory into pages of the same size as frames
- Any page can be placed in any frame
Missing page? Called a page fault
CPU emits virtual addresses
- Translated/mapped by a combination of hardware and software
- Memory mapping or address translation

Demand-based paging

Unit of memory swapped is a fixed-size page
- Usually 4KiB now, can be 2MiB on x86_64 “long mode”
Eliminates external fragmentation, and internal fragmentation is less than one page
Time to load one page is huge, circa 107 nanoseconds – Therefore we:
Strive to avoid loading a given page more than once
Main memory operates as a fully associative cache
- Only compulsory misses and capacity misses; no conflict misses

The working set of a program is that collection of pages that hold all program instructions and data that are needed for a significant duration, say one second or longer
- As a program runs its working set will shift, when it does page faults are expected
Pages in main memory are called resident
Resident set refers to all in-memory pages for a given process
Ideally, at almost all times for a program its resident set working set

Virtual Memory = indirection

Virtual Memory Translation requires both Hardware (memory management unit MMU) and OS. Memory address are grouped into Pages

When Program 3 needs memory, the oldest program 0 is pulled out and put on disk. The map to it is updated and Program3 now gets the memory spot

New M1 Mac uses 16KB pages, compared to traditional 4KB pages

When all physical memory is used, it starts doing swaps (copy) to the hard disk. There are different algorithms that determine what parts of physical memory gets swapped(FIFO, NRU, Random). When the memory that was swapped is needed again, the page table with error since no longer in Page Table, this sends CPU interrupt which the OS can then handle the reswapping.

Today on MacOS, the translation page table actually holds address on HD, it only loads it from HD to Memory when that part of the program is being used. Virtual memory usage can be see in MacOS Activity Monitor.

Executable Files

Executable file formats

a.out – Used by BSD (Berkeley Software Distribution) and early AT&T UNIX
- Rarely used today
- BSD UNIX and AT&T UNIX are predecessors to modern UNIX/Linux operating systems
COFF – Common Object File Format
- Windows
ELF – Executable and Linkable Format
- Used by most UNIX systems today

ELF Structure (Executable, Linkable, Format)

ELF file header
- Magic number
- 32- or 64-bit indicator, Endianness, ELF Version
- Target ABI (often set to System V), object file type = Application Binary Interface ABI
- ISA
- Entry point for program
- Pointer to program header
- Pointer to section header

Program header

Tells loader how to construct the process image
- Segments
- Types
- Flags
- File offset
- Virtual address
- Size in file
- Size in memory

Section header

Lists image sections and metadata for each
- Type – data, string, notes, etc
- Flags – writable, executable, etc
- Virtual address
- Offset in file image
- Size
- Alignment
readelf –headers /bin/ls
objdump -p, -h, -t

Roles of binary file, linker, and loader

All modern operating systems work as follows
- The following structure follows from goal of code reuse
- Reuse of code modules, code in libraries
Program binary carries the blueprint of the memory map for the running process
The skeleton of an executable file is created by the linker by combining the binary files from the compiler to fill out the text, data, and other sections

The loader, a system utility, opens the executable, reads information related to program sections, and populates the process memory map structure

Building a program

Start with source code, e.g., hello.c
Preprocessor
Compiler
Assembler
Linker
Loader

Preprocessor

When a .c file is compiled, it is first scanned and modified by a preprocessor before being handed to the compiler
Code Cleanup = Replace any of the 9 trigraph sequences “??x” with corresponding character and splice lines joined by an escaped newline character
Replace comments with whitespace
Perform macro expansion
Process directives – find lines beginning with # and hide them from the compiler or take some action
- #include, #define, #undef
- #ifdef, #ifndef, #else, #endif
- C preprocessor can do math in support of its work
  - #if (FLAG % 4 == 0) || (FLAG == 13)

Steps in compiling

Scanning or lexical analysis
- Groups program into tokens
  - Token = smallest meaningful unit of a program
Parsing or syntax analysis
- Create a parse tree
- Shows how tokens “fit together”
- Context-free grammar
Semantic analysis
- Determines/discovers meaning
- Builds symbol table
- Builds syntax tree
Code generation
- Traverse symbol table and syntax tree
- Generate assembly language instructions: loads, stores, arithmetic operations, tests, branches, …
No syntax errors at this point
nm –v will show the symbol table

Assembler

Performs assembly – maps individual assembly language instructions to individual machine code instructions
No translation, mapping is fixed from the assembly form to the machine form
- The mapping is fixed
The assembler program makes two passes over the assembly program
- First pass builds a table of the addresses of all labels so that forward references to instructions are allowed
- Second pass uses the table and the fixed mapping to emit an object file, which does not say where these program sections will reside in computer memory
gcc –S stops after the compilation process is complete and before calling the assembler

Libraries

Libraries are collections of object files that carry out common computational tasks
- Internal symbols are indexed for fast lookup by the linker
Searched for symbols that are not defined in the program
- Symbol found, pull it into executable (static)
- Otherwise include a pointer to the file, loaded by loader

Statically linked libraries

Faster, to a degree
Larger binaries
Fixed version, no updates
Portable app because no external dependencies

Dynamically linked libraries

More complexity
Easy to upgrade libraries
- Vulnerabilities
Have to manage versions
An app may need a library that does not exist on the runtime machine; best to use OS-standard dynamic libraries
Loader re-links every time program is executed
Explore with
- readelf –dynamic <pathname> , e.g., readelf –dynamic /bin/ls
- ldd /bin/ls

Late linking

Linking a function call to a library can be expensive
- Have to go through code and replace the symbol with its address
Delay until the call actually takes place
- Calls stub the Procedure Lookup Table (PLT) function
- Invokes dynamic linker to load the function into memory and obtain real address
  - Rewrites address that the sub code references
  - Only happens once

Loader

Copies program sections created by the linker into the process memory map in computer memory
Does not need to know much about inner structure of sections
Focus is on section attributes such as read-only or read-write and whether patching is needed before process launch

Goal = try reusing code.

eof

solidfish