%run -i ../python/common.py
#%run -i ../python/ln_preamble.py

24. Introduction#

So now that we have a sense for how programs work and we know how to write assembly code, that is very close to the native language of the computer, let us try and figure out how to ensure that most humans never have to do that!

Understanding how the relationship between the cpu and the values in memory give us the ability to construct computer programs is very powerful. However, there are many disadvantages to trying to write and develop all our software with tools that only provide a raw binary view of programming. In particular, the assembler and linker are very powerful but they require us to write a lot of carefully constructed code and to know a lot about the specific cpu and operating system that we what our program to run on.

24.1. Downside: The human burden of assembly programming and programmer lock in.#

Consider how when writing assembly code we can demark any area of memory with a symbolic label and place any byte values we want in that area memory. While this give us a great deal of freedom and power it also means that we have to make lots of decisions and write lots of carefully constructed assembly just to write a function or layout a data stucture. For example we are responsible for knowing that one label represents the start of a function while another represents an else condition of an if construct. Or that another label demarks a single byte that we want to use as a byte sized unsigned integer followed by an area of memory that we want to use as an eight byte signed integer. Think of all the trouble we might create if we accidentally get confused and use our areas of memory inconsistently in different parts of our program. For example we accidentally treat the label of the else clause as function by calling it or we directly jump to the function label instead of calling it. Or perhaps we move a four byte value to the area that we intended to use as a single byte unsigned integer followed an eight byte signed value. This would likely “corrupt” both the single byte value and the 8 byte value all in one fell swoop. We could spend days and days trying to figure out why our program seems to do strange things once in a while. Or worse we never realize it and our code simply one day causes an airplane to crash.

If focsed all ones effort in learning all the nuances of the assembly language of a particular CPU then you have locked yourself into what CPU’s and OS’s you can be productive in.

24.2. Downside: The code we write is locked to a specific CPU#

Futher when we write assembly code we end up with a program that can only work on a computer that has a cpu that is compatible with the instruction set we wrote our code in. For example we might write a program to sort numbers in 6502 assembly language. Unsurprisingly the executable we create cannot be executed on a computer with an Intel X86 processor or one with an ARM AArch64 processor. But worse the source code, the ascii ‘.S’ files, of our program cannot be assembled using and X86 or ARM assembler to produce a binary for systems using those processors. After all the code we wrote specifically identifies 6502 instructions which translate direclty into binary opcodes for the 6502. The other processor likely don’t have these particular instructions and even if they did they need not translate to the apprropriate opcodes. However, it is obvious that not only should it be possible to write a program to sort numbers on all of these computers the general structure of our program should be the same. After all, all of these computers are von Neumann computers where they implement a basic processing loop and their instructions and data are represented by bytes in memory. It would be really nice if we could write our code in a common assembly language and the assemblers for each processor could convert. Of course this unfortunately might also mean that we could not use instructions that don’t easly have a counter part on each processor. For example it is easy to see that all of the systems will have instructions for moving values to and from memory to either specific registers and have instructions for conducting addition. However it is unlikely that they all would have an equivalent of the X86 popcnt instruction that calculates the number of one bits in a 64 bit vector and stores the result to a memory location. So we would need to restrict our genric assembly language to avoid the use of such instructions and of course loose the power of their use.

24.3. Downside: The code we write is locked to a specific OS#

Lastly consider what we had to do in an assembly program to use facilities of the operating system. We had to explictly know what functions the specific operating system has, what they do, what argument they require, what registers to use to pass the arguments, what they calls return, the cpu specific instructions for call the operating system functions. Knowing all this we then had to write the CPU specific assembly. As such our program became speific to both the CPU and OS. However, again it seems reasonable that most Operating systems will support ways for connecting to a webserver on the internet and sending and receiving bytes from it. It would be really nice to write code that can run or at least be translated by a tool to run on the particular OS we are using.

24.4. High level Langauges#

Now many of us likely have written programs in languages like Python, Java, etc. and have never had any of the issues assocaited with writing assembly code. So how is this possible if in the end the hardware only can execute instructions and interpret data that are represented as bytes in memory?

These languages do not run natively. Rather the languages are implememented by software that “translate” or interpret programs written in the language they support. For example in order to run a Python program you must run an Python executable, A Python Interpreter, that then process your python program. There is nothing particularly special about the Python Interpreter executable. It is a binary has been built to execute on your OS and cpu. Just like the executables we built in the prior portion of this book.

As a matter of fact, using our knowledge of UNIX and assembly we can validate this fact for ourselves.

If we want we can even run it in the debugger

TermShellCmd("echo -e 'set disassembly-flavor intel\nb _start\nrun\ndisass\nquit' | gdb $(type -p python)", noposttext=True, markdown=True)

From here, like the programs we wrote, we could single step it and examine how it uses the CPU, memory and the calls it makes to the OS.

The Python interpreter is a native program that reads in python programs as ASCII lines of text and then interprets them according to the Python language definition.

24.4.1. But …#

Take a look again at how big the python binary is

TermShellCmd("ls -lh $(type -p python3.9)")

That’s a lot bigger than the programs we wrote in assembly code! Could you imagine writing something that big? Futher could you imagine writing a new version for every computer type and OS pair? As a matter of fact, it is likely if we did have to do that then we would not even have languages like Python in the first place!

24.5. The birth of “C”#

Well as a matter of fact this is precisely the kind of problem that the early inventors of the UNIX operating system were facing. The UNIX kernel is also a very complex program that one would like to have operate on many different types of computers. For the most part operating system kernels had been largely written in Assembly language for the particular computer the OS was going to run on. Further to support a new computer the OS had to be carefully rewritten for the new computer’s CPU features, memory organization and devices. Borrowing from the idea of using high-level languages to implement applications the developers of UNIX choose to evolve the UNIX kernel in a way that would allow much of the system to be portable. That is to say the work to be done to get the system running on a new computer would not require writing new assembly code for all but a small part of the code. However, the nature of operating systems code is that it needs to often manipulate the CPU and memory directly and easily inter-operates with assembly code. This is in contrast to the high-level languages that were being developed for applications. Those languages largely went out of their way to hide as much of the machine details as possible from the programmer and restrict what could be done in the language. Doing so avoids the need to understand the details of how data and code is represented as binary values in memory and to keep programmers from making various kinds of errors.

What was needed was a programming language that was in-between assembly language and the high-level languages. Where the new language would not hide the nature of memory from the programmer nor excessively restrict what can be done while still helping to avoid common pitfalls of assembly programming. Rather the language should precisely let the programmer work with memory but in a way that is more generic and portable than writing code for a particular CPU assembly language.

So the approach taken with the UNIX operating system was to develop a program, that we call a compiler, that could translate source code in this generic machine oriented language to assembly code for the particular computer that was being targeted. The assembly code the compiler produced can then be treated as input in the the systems assembler and linker. If much of the OS could be written in this generic language then getting the OS to run on a new computer really meant writing a new compiler for the system and then “re-compile” the OS source with it. In addition to portability the language would help a programmer avoid common errors and the need to know the details of every CPU. I

“C” is the language that was developed by the authors of the UNIX operating system and the program that translates it into assembly is a “C compiler”.

../_images/history.svg

In the above we have liberally shortened and potentially distorted history for the sake of comprehensibility. For a more complete and historically accurate discussion of how the “C” programming language came into being see The Development of the C Language*.

24.6. C as an alternative to Assembly programming#

So how do we maintain the power of assembly programming while also avoiding the downsides?

Almost by definition we must make some trade-offs

24.7. “C” a life beyond UNIX OS development.#

This language proved to be so useful its ability to enable expert programmers to write new tools that many of the programs that traditionaly had to be written in assembly language like the high-level application language translators and interpreters were rewritten to C.

24.8. Our approach to “C”#

We will take an unconventional approach to exploring and learning the “C” programming lanaguge. Rather than just considering it as a programming language and its attendant syntax we will develop our understanding as we explore how it b

Constructing C from the ground up:

  • organizing our opcode bytes into non-overlapping “callable” units of bytes

    • unique label that demarks “entry point”

    • all terminating cases end with a return

    • fixed convention for passing arguments and general register use between caller and callee

    • fixed convention for local variables

      • stack frame organization

      • freedom to avoid stack and only use registers if possible

    • while still supporting the ability to cheat

  • organizing and ensuring consistent use of data bytes into data objects (however with the ability to cheat if we want too)

  • intrinsic byte oriented objects

  • direct support for working with addresses

    • exposing the address of data objects

    • allowing addresses to be assocaited with a type

    • allowing one to create data objects to hold typed addresses

  • arrays of intrinisics

  • global vs local

  • structures and unions

  • arrays of structs and unions

  • struct and union address types

  • Ability to express common flow control idioms

    • if then else

    • do, while; while; for

    • limited ability to demark and jump to labels : goto

    • jump tables : switch

    • function pointers

  • Support for side stepping the rules

    • cheating and using and address of a data object in a arbitrary way

    • working with an arbitrary address as either data or opcodes

    • inserting opcodes directly using assembly syntax

  • Using the OS

  • More generally idomatic support for easily creating and using libraries of code

Remember all of what we learnt about assembling and linking still happens “C” is just a step added before these two stages. Instead of writing .S files directly we are writting .C that each get translated into an associated .S by the compiler which then gets translated into a relocatable object module .o which all then get linked into an executable via the linker. This means our ‘C’ code naturally fits into and extends what we can do the our binary oriented tools versus restricting what we can do.

Compilers also can codify best practices for a particular machine and OS. After all it has to translate the ‘C’ code into machine and OS specific assembly code. For the most part however, OS specific differences are handled by use of libraries written to support common functions that are implemented for each OS that is supported.

24.9. Preliminaries: The C Compiler : Our very onw personal assembly programmer#

24.9.1. The ToolChain#

24.10. cc1, as, ld#

24.10.1. Compiler driver: gcc/cc#

24.11. Controlling / Directing the compiler#

24.11.1. Options#

24.11.1.1. Optimization options#

24.11.1.2. Language options#

24.11.1.3. Debugger options#

24.11.1.4. Machine Specific options#

24.11.2. Our focus#

We will use a set of options that let us focus on understanding the core ability of the compiler to generate assembly code for us. Specifically we will be focusing on how it translates “C” syntax into assembly code. We will use a set of options that make it easier to see this relationship by having the compiler avoid adding extra assembly syntax needed for the debugging, or to improve the security of the code it generates.