Lifecycle of a hello world Program

What happens under the hood.

Mar 05, 2024

In this article we will try to understand how a simple Hello World program works internally. I will cover as much as I can, but honestly it is hard to cover every topic if we really want to understand things from first principles.

So let’s begin.

Source file

hello.c

#include <stdio.h>

int main() {
  printf("hello, world\n");
  return 0;
}

Our hello.c file is nothing but a sequence of bits, organized into 8-bit chunks called bytes, where each byte represents one character of the text.

Let’s see how this works.

As we all know, computers don’t understand human language and can only work with data encoded as 0s and 1s, so it is important to understand how the code above is represented in that form.

Each character we have written has an ASCII code attached to it. Below is the ASCII representation of the code:

1. # (35)

2. i (105)

3. n (110)

4. c (99)

5. l (108)

6. u (117)

7. d (100)

8. e (101)

9. (32)

10. < (60)

11. s (115)

12. t (116)

13. d (100)

14. i (105)

15. o (111)

16. . (46)

17. h (104)

18. > (62)

19. \n (10)

20. \n (10)

21. i (105)

22. n (110)

23. t (116)

24. (32)

25. m (109)

26. a (97)

27. i (105)

28. n (110)

29. ( (40)

30. ) (41)

31. (160)

32. (32)

33. { (123)

34. \n (10)

35. (160)

36. (32)

37. p (112)

38. r (114)

39. i (105)

40. n (110)

41. t (116)

42. f (102)

43. ( (40)

44. " (34)

45. h (104)

46. e (101)

47. l (108)

48. l (108)

49. o (111)

50. , (44)

51. (32)

52. w (119)

53. o (111)

54. r (114)

55. l (108)

56. d (100)

57. \n (10)

58. " (34)

59. ) (41)

60. ; (59)

61. (32)

62. \n (10)

63. (160)

64. (32)

65. r (114)

66. e (101)

67. t (116)

68. u (117)

69. r (114)

70. n (110)

71. (32)

72. 0 (48)

73. ; (59)

74. (32)

75. \n (10)

76. } (125)

Every integer value listed above is stored as one byte. For example, the first character ‘#’ corresponds to the integer value 35, which is stored as the 8 bits 00100011. Each character takes 8 bits, which is the size of a single byte.

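Note:- Each text line in the code ends with an invisible newline character, which is written as ”\n” and has the ASCII value 10.

Note:- The value 160 that appears in the list is not an ASCII code, since ASCII only covers 0 to 127. It is a non-breaking space, which typically sneaks in when code is copied from a web page. A file typed in a plain text editor would contain a regular space (32) there.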

Now that we understand how hello.c is represented as bits, it’s time to see what happens when this code is compiled into an executable file that we can run directly from our terminal.

The Compiler System

Here, the compilation is done using GCC.

GCC (GNU Compiler Collection) is a collection of open source compilers that convert a source file into an executable object file.

There are four steps involved in converting a source file to an executable file: preprocessing, compilation, assembly, and linking.

Let’s understand each process in detail.

Pre-Processor

Before the core compilation begins, the preprocessor modifies your source code, preparing it for the compiler. It is a very important phase in the compilation system.

Every line written in our C program that begins with ‘#’ is an instruction for the pre-processor and is known as a preprocessor directive. The preprocessor interprets these directives to modify your code before the actual compilation takes place.

Here’s a list of the most common preprocessor directives:

Core Directives

#include
#define

Conditional Compilation

#ifdef
#ifndef
#else
#endif

In our hello world program we have only used the #include directive. In this case the preprocessor reads the contents of the stdio.h header file and inserts them into our source file in place of the #include line.

The file that we get after preprocessing is hello.i.

Compiler

The compiler takes the preprocessed code as input and translates it into assembly language specific to the target processor architecture (e.g., x86, ARM).

Assembly language is a low-level, human-readable representation of the machine code instructions that the processor directly understands.

Okay, now let’s try to understand the code that is generated when the compiler converts our hello.i file into hello.s:

.LC0:
        .string "hello, world"
main:
        push    rbp
        mov     rbp, rsp
        mov     edi, OFFSET FLAT:.LC0
        call    puts
        mov     eax, 0
        pop     rbp
        ret

Assembly code generated by the compiler for hello.i

Assembly language is basically made up of four components:

Instructions (Mnemonics): These are short commands that tell the processor to perform specific operations. Examples:

MOV: Move data between registers or memory locations.
ADD: Add two values.
SUB: Subtract two values.

Registers: These are tiny, incredibly fast memory locations built into the processor itself. Assembly code manipulates data in registers.

Labels: These mark locations in your code for branching and function references.

Directives: These are not instructions for the processor but commands for the assembler itself. They do things like:

Define data in memory.
Set symbols or constants.
Specify sections of the program.

Step-by-Step Explanation

.LC0: .string "hello, world"

.LC0: A label assigned to the following data.

.string: A directive telling the assembler to store a null-terminated string (hello, world\0).

main:

This label marks the entry point of your main function.

push rbp

Pushes the current value of the base pointer register (rbp) onto the stack. This is part of saving the function's context.

mov rbp, rsp

Moves the value of the stack pointer register (rsp) into the base pointer register (rbp). This sets up rbp as a reference point for accessing local variables on the stack.

mov edi, OFFSET FLAT:.LC0

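OFFSET FLAT: This is how GCC's Intel-syntax output writes "the address of a symbol". The assembler emits a placeholder here, and the linker later fills in the real address of .LC0.

Loads the address of the "hello, world" string (which is stored at the .LC0 label) into the edi register. edi is the lower 32 bits of rdi, which holds the first integer or pointer argument to a function under the System V x86-64 calling convention used on Linux.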

call puts

call: Instruction to call a subroutine (function).

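puts: A standard C library function that writes a string followed by a newline to standard output. Our source called printf, but because the format string contains no format specifiers and ends in \n, the compiler is free to replace the call with the simpler puts. That is also why the string stored at .LC0 has no trailing \n.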

mov eax, 0

Moves the value 0 into the eax register. In many calling conventions, the eax register is used to hold the return value of a function, with 0 conventionally indicating success.

pop rbp

Pops the saved value of rbp from the stack, restoring the base pointer to its previous value.

ret

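Returns from main to the code that called it. On a typical system this is the C runtime's startup code, which passes main's return value to exit and so terminates the program.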

Assembler

Next, the assembler translates hello.s into machine language instructions, packages them in a form known as a relocatable object program, and stores the result in the object file hello.o.

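hello.o is a binary file. For the listing above, the instructions of main encode to 21 bytes on x86-64, although the exact count varies with the compiler, its version, and the flags used.

If we open hello.o in a text editor it looks like gibberish, because its bytes are machine instructions and metadata rather than character codes.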

Note:- Although there is a lot more that can be discussed regarding the assembler, it is out of scope of this article. In short, the assembler is the step that turns the human-readable mnemonics produced by the compiler into the binary encoding that the processor executes.

Linker

Notice that our hello program calls the printf function, which is part of the standard C library provided by every C compiler. The printf function resides in a separate precompiled object file called printf.o, which must somehow be merged with our hello.o program. The linker handles this merging.

The result is the hello file, which is an executable object file (or simply executable) that is ready to be loaded into memory and executed by the system.

Example:-

Let’s suppose we have a file1.c, which contains our main function, and a file2.c, which defines a function sum() that main calls.

After compiling both files we have file1.o and file2.o. file1.o refers to sum() but does not contain its code, so the two object files have to be linked with each other. This is done by the linker.

Some points to remember after reading this article:

You might think that this is a very detailed analysis of how a simple hello world program works, but in reality it is just the tip of the iceberg. Many other things happen while these steps run. In fact, every step we discussed under the compilation system can itself be dissected into many smaller steps.
Everything in this article is available to read online; I have not described anything new. My recommendation is to read the CSAPP book (Computer Systems: A Programmer's Perspective). It is a must-read for anyone who considers themselves a software engineer.
It is not important to know every single detail of how a language or process works.

Saket’s Substack
