took notes on how compilers compile basic c programs, and how they get run via libc
This commit is contained in:
parent
bd650dfc08
commit
58c053782e
260
sandbox/c/NOTES.md
Normal file
260
sandbox/c/NOTES.md
Normal file
@ -0,0 +1,260 @@
|
||||
# Notes
|
||||
|
||||
## http://nickdesaulniers.github.io/blog/2016/08/13/object-files-and-symbols/ (and it's follow up post)
|
||||
|
||||
These files are being used as an example:
|
||||
|
||||
```c
|
||||
// main.c
|
||||
// declare that these exist, but it's defined in hello.c
|
||||
void hello();
|
||||
void hello2();
|
||||
int main() {
|
||||
hello();
|
||||
hello2();
|
||||
return 0;
|
||||
}
|
||||
```
|
||||
|
||||
```c
|
||||
// hello.c
|
||||
#include <stdio.h>
|
||||
void hello() {
|
||||
puts("Hello World!");
|
||||
}
|
||||
```
|
||||
|
||||
```c
|
||||
// hello2.c
|
||||
#include <stdio.h>
|
||||
void hello2() {
|
||||
puts("Hello Again!");
|
||||
}
|
||||
```
|
||||
|
||||
* Object files (.o)
|
||||
- `clang -c main.c hello.c hello2.c`
|
||||
- Contain the actual compiled machine code, but the addresses used need to
|
||||
be relocated when compiling the full binary.
|
||||
- Contain symbol table, which relates addresses to variables and functions
|
||||
defined in the object file.
|
||||
- `nm` can be used to inspect symbol table. Includes "undefined" symbols,
|
||||
which are symbols used by the object file but which aren't defined within
|
||||
it (and which are presumably defined elsewhere).
|
||||
```
|
||||
▻ nm main.o
|
||||
U hello
|
||||
U hello2
|
||||
0000000000000000 T main
|
||||
|
||||
▻ nm hello.o
|
||||
0000000000000000 T hello
|
||||
0000000000000000 r .L.str
|
||||
U puts
|
||||
|
||||
▻ nm hello2.o
|
||||
0000000000000000 T hello2
|
||||
0000000000000000 r .L.str
|
||||
U puts
|
||||
```
|
||||
- `readelf` can be also used to dump the contents of the object file's
|
||||
symbol table on linux (`-s` displays symbol table):
|
||||
```
|
||||
▻ readelf -s main.o
|
||||
Symbol table '.symtab' contains 6 entries:
|
||||
Num: Value Size Type Bind Vis Ndx Name
|
||||
0: 0000000000000000 0 NOTYPE LOCAL DEFAULT UND
|
||||
1: 0000000000000000 0 FILE LOCAL DEFAULT ABS main.c
|
||||
2: 0000000000000000 0 SECTION LOCAL DEFAULT 2
|
||||
3: 0000000000000000 0 NOTYPE GLOBAL DEFAULT UND hello
|
||||
4: 0000000000000000 0 NOTYPE GLOBAL DEFAULT UND hello2
|
||||
5: 0000000000000000 37 FUNC GLOBAL DEFAULT 2 main
|
||||
|
||||
▻ readelf -s hello.o
|
||||
Symbol table '.symtab' contains 6 entries:
|
||||
Num: Value Size Type Bind Vis Ndx Name
|
||||
0: 0000000000000000 0 NOTYPE LOCAL DEFAULT UND
|
||||
1: 0000000000000000 0 FILE LOCAL DEFAULT ABS hello.c
|
||||
2: 0000000000000000 13 OBJECT LOCAL DEFAULT 4 .L.str
|
||||
3: 0000000000000000 0 SECTION LOCAL DEFAULT 2
|
||||
4: 0000000000000000 29 FUNC GLOBAL DEFAULT 2 hello
|
||||
5: 0000000000000000 0 NOTYPE GLOBAL DEFAULT UND puts
|
||||
```
|
||||
|
||||
* Static library files (.a)
|
||||
- `ar` utility creates uncompressed, static (those might be synonomous in
|
||||
this context?) archives, with the `.a` extension.
|
||||
- In the context of compiling code, `.a` files are archives of multiple
|
||||
object files, with the symbol table preserved in a way where nm and ilk
|
||||
can still understand it.
|
||||
```
|
||||
▻ ar r hello.a hello.o hello2.o
|
||||
▻ nm hello.a
|
||||
|
||||
hello.o:
|
||||
0000000000000000 T hello
|
||||
0000000000000000 r .L.str
|
||||
U puts
|
||||
|
||||
hello2.o:
|
||||
0000000000000000 T hello2
|
||||
0000000000000000 r .L.str
|
||||
U puts
|
||||
```
|
||||
- This `.a` file can then be passed into clang as if it was an object file,
|
||||
and the resulting binary would statically contain all symbols from the
|
||||
archive that it needs:
|
||||
```
|
||||
▻ clang main.o hello.a
|
||||
▻ ./a.out
|
||||
Hello World!
|
||||
Hello Again!
|
||||
```
|
||||
|
||||
* Dynamic/shared library files (.so on Linux, .dylib on OSX, .dll on Windows)
|
||||
- If multiple programs share the same library and are being statically
|
||||
compiled then, when run, that library ends up in memory twice. Dynamic
|
||||
linking allows the library to be dynamically linked in at runtime, to save
|
||||
memory use.
|
||||
- Can be compiled from either source or object files:
|
||||
- `clang -shared hello.c hello2.c -o hello.so`
|
||||
- `clang -shared hello.o hello2.o -o hello.so`
|
||||
- Then used in final compilation normally: `clang main.o ./hello.so`
|
||||
```
|
||||
▻ clang main.o ./hello.so
|
||||
▻ ldd a.out
|
||||
linux-vdso.so.1 (0x00007ffd5665c000)
|
||||
./hello.so (0x00007f7f9bbe8000)
|
||||
libc.so.6 => /usr/lib/libc.so.6 (0x00007f7f9b830000)
|
||||
/lib64/ld-linux-x86-64.so.2 => /usr/lib64/ld-linux-x86-64.so.2 (0x00007f7f9bfec000)
|
||||
```
|
||||
- `strace ./a.out` can be used to view all system calls a binary makes while
|
||||
it runs, including opening and reading dynamic libaries, which will look
|
||||
like:
|
||||
```
|
||||
openat(AT_FDCWD, "./hello.so", O_RDONLY|O_CLOEXEC) = 3
|
||||
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\360\4\0\0\0\0\0\0"..., 832) = 832
|
||||
fstat(3, {st_mode=S_IFREG|0755, st_size=7784, ...}) = 0
|
||||
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f58fc93a000
|
||||
```
|
||||
- `LD_DEBUG` env variable can also be used for tracing information, as well
|
||||
as dynamic library search path stuff.
|
||||
- If hello.so were placed into `/usr/lib` or `/usr/local/lib` then
|
||||
compilation could be done with just `clang hello.o -lhello`. `-L` can add
|
||||
library search paths as well.
|
||||
- `pkg-config` and its associated `.pc` files can be used by library authors
|
||||
to specify flags required when compiling using that shared library (e.g.
|
||||
to include necessary header files and whatnot).
|
||||
- `LD_PRELOAD` can be used to pre-eminently link in a shared library which
|
||||
will get searched first before symbols from the "real" shared libraries
|
||||
are searched, allowing for code-replace and such.
|
||||
|
||||
## http://www.linuxjournal.com/article/1059
|
||||
|
||||
* ELF Header
|
||||
- Contains (I think) information on sections, their sizes, and their offsets
|
||||
|
||||
* ELF File Sections
|
||||
- After the header, ELF files composed of multiple sections
|
||||
- Each section is composed of information of a similar type
|
||||
- All sections are loaded into memory, presumably, but certain are loaded
|
||||
into read-only pages (like .text, the executable code) and others are
|
||||
loaded into read-write.
|
||||
- It would seem that the read-only/read-write dichotemy is enforced by the
|
||||
"memory manager", which is part of the cpu?
|
||||
- Different sections:
|
||||
- .text (ro): executable code
|
||||
- .data (rw): variables the user has specified an initial value for
|
||||
- .bss (rw): variables the user has not specified an initial value for,
|
||||
separate from .data because there's no need to waste space in the
|
||||
binary file with zeros.
|
||||
- symbol table for debugging (and possibly dynamic linking?)
|
||||
|
||||
* Shared libraries
|
||||
- so's are designed to be "position independent", meaning when the so is
|
||||
loaded at binary runtime the place in memory that it is loaded into is not
|
||||
actually important. The `-fPIC` compiler option used-to-be/is (?)
|
||||
important in order to enable this. (PIC being "position independent code")
|
||||
- Compiler reserves a register which points to the start of a "global offset
|
||||
table", which is used to support global variables within shared libraries
|
||||
using PIC. (I guess shared library global vars are shared across
|
||||
processes?)
|
||||
- Procedure Linkage Table is like the GOT but for functions, it's basically
|
||||
a jump table within the library file. If the user wants to redefine one of
|
||||
the shared library's functions, and have all other functions within the so
|
||||
use that new one, then the PLT entry for that function is the only need
|
||||
which gets changed.
|
||||
|
||||
* Compiling
|
||||
- During compilation the compiler will keep track of symbols needing
|
||||
"relocating", meaning they are external to the object file and will need
|
||||
to be patched in during linking. Each relocated symbol is marked as such
|
||||
in the symbol table (I think), along with the offset into .text where that
|
||||
symbol was used, and where the linker needs to place the actual address.
|
||||
|
||||
## https://blog.oracle.com/ksplice/hello-from-a-libc-free-world-part-2
|
||||
|
||||
With the following file:
|
||||
```c
|
||||
// main.c
|
||||
void alloc_boi() {
|
||||
char *str = "Hello world";
|
||||
}
|
||||
|
||||
void _start() {
|
||||
alloc_boi();
|
||||
asm("movl $1, %eax;" // what
|
||||
"movl $0, %ebx;"
|
||||
"int $0x80;");
|
||||
}
|
||||
```
|
||||
|
||||
* Compile the above with `clang -nostdlib main.c`, the result will have very
|
||||
little in it, but it seems there's still some extra uneeded sections which
|
||||
could be removed.
|
||||
|
||||
* `main.c` uses `_start` instead of `main` since that's the actual first
|
||||
function called, but normally it gets filled with libc junk (like importing
|
||||
environment variables and such).
|
||||
|
||||
* The exit call is needed to be explicitly defined otherwise the process won't
|
||||
exit, instead execution will run past `.text` and segfault.
|
||||
|
||||
* The disassembly from above looks like this:
|
||||
```
|
||||
Disassembly of section .text:
|
||||
|
||||
0000000000000250 <alloc_boi>:
|
||||
250: 48 8d 05 1d 00 00 00 lea 0x1d(%rip),%rax # 274 <_start+0x14>
|
||||
257: 48 89 44 24 f8 mov %rax,-0x8(%rsp)
|
||||
25c: c3 retq
|
||||
25d: 0f 1f 00 nopl (%rax)
|
||||
|
||||
0000000000000260 <_start>:
|
||||
260: 50 push %rax
|
||||
261: e8 ea ff ff ff callq 250 <alloc_boi>
|
||||
266: b8 01 00 00 00 mov $0x1,%eax
|
||||
26b: bb 00 00 00 00 mov $0x0,%ebx
|
||||
270: cd 80 int $0x80
|
||||
272: 58 pop %rax
|
||||
273: c3 retq
|
||||
```
|
||||
The `alloc_boi` section is the interesting one:
|
||||
|
||||
- `%rsp` is the stack pointer, `%rbp` is apparently general purpose but is
|
||||
used in this context as the "frame pointer", meaning the start of the stack
|
||||
frame. This is a small optimization which allows referencing memory from a
|
||||
point which is constant during the function (the frame's start) rather than
|
||||
a point which changes (the stack pointer's position). This optimization can
|
||||
be negated by compiling with `-fomit-frame-pointer`
|
||||
|
||||
- It also contains `lea 0x1d(%rip),%rax` at the top of `alloc_boi`'s
|
||||
disassembly. `lea` is "load effective address". Basically puts the
|
||||
calculated pointer into `%rax`. The pointer being calculated is
|
||||
`0x1d(%rip)`, which is the instruction pointer + 0x12. The instruction
|
||||
pointer's value is always the next instruction to be run, and in this case
|
||||
is `0x257`. Adding `0x1d` to that gives `0x274`, which is the first byte in
|
||||
the `.rodata` section, the start of the `Hello World` string.
|
||||
|
||||
- The subsequent `mov %rax,-0x8(%rsp)` is moving the pointer (stored in
|
||||
`%rax`) and putting it onto the stack.
|
Loading…
Reference in New Issue
Block a user