clang.md

  • Forked from mactavish96 / ALP4-Tutorials
    136 commits behind the upstream repository.
    Mactavish authored

    C Programming Language Review

    This document is not intended to be a primer on the C programming language. It only gives you the essential information you need to complete your programming assignments.

    We will focus on common errors, caveats and important concepts that might ease your cognitive overhead when you program in C. We will also recommend some tools that help you speed up your development and debugging.

    Specific Sized Numbers

    C only guarantees the minimum and relative sizes of types such as int and short. The range that each type can represent depends on the implementation.

    The integer data types range in size from at least 8 bits to at least 32 bits. The C99 standard extends this range to include integer sizes of at least 64 bits.

    The types are ordered by width, guaranteeing that wider types are at least as large as narrower types. E.g. long long int can represent all values that a long int can represent.

    If you need to have an exact width of something, you can use the {u|}int{#}_t type to specify:

    • signed or unsigned
    • number of bits

    For example:

    • uint8_t is an unsigned 8-bit integer
    • int64_t is a signed 64-bit integer

    All these types are defined in the header file stdint.h instead of in the language itself.

    Undefined Behaviours

    The C language standard precisely specifies the observable behavior of C language programs, except for:

    • Undefined behaviours
    • Unspecified behaviours
    • Implementation-defined behaviours
    • Locale-specific behaviours

    More information about these can be found here.

    We are going to focus on undefined behaviours in this section.

    What are Undefined Behaviours

    • The language definition says: "We don't know what will happen, nor do we care, for that matter".
    • This often means unpredictable behaviour.
    • Often contributes to bugs that seem random and hard to reproduce.

    What we have to do is to pay attention to these possible behaviours and avoid them in the source code.

    We will use UB and undefined behaviours interchangeably in the later sections.

    Frequent Undefined Behaviours

    Here are some common undefined behaviours you may have encountered before:

    Signed Overflow

    #include <limits.h>
    
    int foo(int a)
    {
        int b = INT_MAX + a; // UB if a > 0 (signed overflow), b can be anything
        return b;
    }

    Division by Zero

    #include <stdio.h>
    
    int func(void) {
        int gv = 0;
        printf("Enter an integer number: ");
        if (scanf("%d", &gv) != 1) {
            return 0; // no valid input was read
        }
        return 23 / gv; // UB if the user enters 0
    }

    NULL Pointer Dereference

    #include <stddef.h> // for NULL
    
    int foo(int* p)
    {
        int x = *p;
        if (!p)
            return x; // Either UB above or this branch is never taken
        else
            return 0;
    }
    
    int bar()
    {
        int* p = NULL;
        return *p;    // Unconditional UB
    }

    Value of a Pointer to Object with Ended Lifetime

    int* fun(int x) {
        int y = 2;
        y = x + y;
        return &y; // y's lifetime ends on return; using the returned pointer is UB
    }

    Use of Indeterminate Value

    #include <stdio.h>
    
    int main() {
        int a;
        int b = a; // UB
        printf("a = %d\n", a);
        printf("b = %d\n", b);
        return 0;
    }

    String Literal Modification

    #include <stdio.h>
    
    int main() {
        char *p = "some text here";
        p[2] = 'O'; // UB
    }

    Access Out Of Bounds

    #include <stdio.h>
    
    int main() {
        int arr[5] = { 1, 2, 3, 4, 5 };
        int b = arr[7]; // UB
        printf("b = %d\n", b);
    }

    Pointer Used After Freed

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    
    int main() {
      char str[9] = "tutorial";
      char ftr[9] = "aftertut";
      int bufsize = strlen(str) + 1;
      char *buf = (char *)malloc(bufsize);
      if (!buf) {
        return EXIT_FAILURE;
      }
      free(buf);
      strcpy(buf, ftr); // UB
      printf("buf = %s\n", buf);
    
      return EXIT_SUCCESS;
    }

    The list goes on. You can find more information about undefined behaviours in this thesis and in the C99 Standard.

    True or False

    There is no explicit Boolean type in old-school C. Since C99, you can use the bool type together with true and false by including the header <stdbool.h>.

    #include <stdio.h>
    #include <stdlib.h>
    #include <stdbool.h>
    
    int main(void) {
        bool keep_going = true;  // Could also be `bool keep_going = 1;`
        while(keep_going) {
            printf("This will run as long as keep_going is true.\n");
            keep_going = false;    // Could also be `keep_going = 0;`
        }
        printf("Stopping!\n");
        return EXIT_SUCCESS;
    }

    What evaluates to FALSE in C?

    • 0 (integer)
    • NULL
    • Any scalar value that compares equal to 0, such as 0.0 or '\0'

    What evaluates to TRUE in C?

    • Anything that isn't false is true

    sizeof() Operator

    sizeof(type) returns the number of bytes an object of that type occupies. By definition, sizeof(char) == 1.

    The operator yields a value of type size_t, which is defined in several headers such as <stddef.h> and <stdio.h>. size_t is an unsigned integer type; which one exactly is implementation-defined.

    However, sizeof is not a function! It is an operator that is evaluated at compile time (except when it is applied to a variable-length array).

    Pointers in C

    Pointers are probably the single largest source of bugs in C, so be careful when you use them.

    Type of Pointers

    • Pointers can point to any kind of data: int, char, struct etc.
    • void * is a type that can point to anything (generic pointer).
    • You can have pointers to pointers: int ****x declares x as a pointer to a pointer to a pointer to a pointer to an int.
    • You can have pointers to functions: int (*fn) (void *, void *) = &foo; declares fn as a pointer to a function that accepts two void * pointers and returns an int. Use (*fn)(x, y) or simply fn(x, y) to invoke the function.

    Casting and Casting Pointers

    You can cast (change the type of) values of basic C types, which converts the value:

    int x;
    float y;
    y = (float) x;

    For pointers, a cast only changes how the pointed-to memory is interpreted:

    typedef struct {
        int x;
        int y;
    } Pointer;
    
    void foo(void *v) {
        ((Pointer *) v)->x = 24;
    }

    Void Pointer

    We should also know that a void pointer can be converted to any other object pointer type.

    In C, library functions like malloc, calloc etc. return void pointers, which we can then convert to whatever pointer type we need. The conversion is implicit in C, so the explicit cast shown here is optional:

    #include <stdio.h>
    #include <stdlib.h>
    
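    int main() {
      int num = 9;
      int *ptr = (int *)malloc(sizeof(int));
      if (!ptr) {
        return EXIT_FAILURE;
      }
      *ptr = num; // store the value; `ptr = &num` would leak the allocated block
      printf("ptr: %d\n", *ptr);
      free(ptr);
      return 0;
    }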

    However, void pointers have some limitations:

    • We can not dereference void pointers directly to access the values stored at those addresses
    • Pointer arithmetic on void pointers is not possible, we have to first cast it into an appropriate type.

    Pointer Arithmetic

    Valid pointer arithmetic:

    • Add an integer to a pointer: ptr += 1;
    • Subtract 2 pointers (in the same array, evaluates to their distance apart in number of elements, with the type ptrdiff_t).
    • Compare pointers (<, <=, ==, !=, >, >=)
    • Compare pointers to NULL

    Everything else is illegal since it makes no sense:

    • Adding two pointers
    • Multiplying pointers
    • Subtracting a pointer from an integer

    *p++ vs (*p)++

    These are common in many codebases, and the first time you see them, you might be confused.

    x = *p++ is actually doing x = *p; p = p + 1;.

    x = (*p)++ is actually doing x = *p; *p = *p + 1;.

    Arrays

    An array variable is not a pointer, but in most expressions it is converted ("decays") to a pointer to its 0th element. So char *string and char string[] look nearly identical as declarations. The subtle difference is that char *string = "..." points to a string literal, which must not be modified, whereas char string[] = "..." is a character array initialized with a copy of the literal, whose elements can be modified via subscript.

    So we have a[i] == *(a + i). But unfortunately, when an array is passed to a function, it is passed as a pointer, and the size information is lost.

    Arrays in C are very primitive:

    • An array in C does not carry information about its own length, unlike arr.length in other languages
    • Array bounds are not checked at all
      • So we can easily access memory past the end of an array
      • We must pass the array and its size together to any function that is going to manipulate it

    Strings

    A string in C is just an array of characters: char string[] = "hello world".

    A string in C is null-terminated, which means the special character \0 marks the end of the string.

    There are lots of auxiliary functions provided by the standard library in <string.h>, but be aware of how they treat the null character.

    For example, the strlen function returns the length of the string excluding the null character.

    There are lots of ways to initialize a string:

    char c[] = "abcd";
    
    char c[50] = "abcd";
    
    char c[] = {'a', 'b', 'c', 'd', '\0'};
    
    char c[5] = {'a', 'b', 'c', 'd', '\0'};
    
    char *str = "string literal"; // cannot be modified via array subscript

    C Memory Management

    Memory can be viewed as an array of consecutively addressed memory cells. The typical size of each cell is 1 byte. A char takes one byte whereas other types use multiple cells depending on their size.

    Program Address Space

    The figure below depicts the classical address space of a program:

    There are typically 4 regions:

    • Stack: local variables inside functions, grows downwards
    • Heap: space requested for dynamic data via malloc(), resizes dynamically, grows upwards
    • Static data: variables declared outside functions, does not grow or shrink. Loaded when program starts, can be modified
    • Code: loaded when program starts, does not change
    • On most systems, address 0x0000 0000 is reserved and unwriteable/unreadable, so the program typically crashes on null pointer access.

    Storage Duration

    Objects have a storage duration that determines their lifetime. There are four storage durations available in C: automatic, static, thread, allocated. We won't cover thread storage duration here.

    Automatic

    Basically anything you declare within a block or a function (without static) has automatic storage duration, which means its lifetime begins when the block in which it is declared begins execution and ends when execution of the block ends.

    If the block is entered recursively, new objects will be created each time and have their own storage.

    Static

    Objects declared at file scope have static storage duration. The lifetime of these objects is the entire duration of the program, and their stored value is initialized only once, prior to the main function.

    Allocated

    Allocated storage is allocated and deallocated on request through the dynamic memory allocation functions of the standard library.

    A concrete example of these storage durations:

    #include <stdio.h>
    #include <stdlib.h>
    
    /* static storage duration */
    int A;
    
    int main(void)
    {
        printf("&A = %p\n", (void*)&A);
    
        /* automatic storage duration */
        int A = 1;   // hides global A
        printf("&A = %p\n", (void*)&A);
    
        /* allocated storage duration */
        int *ptr_1 = malloc(sizeof(int));   /* start allocated storage duration */
        printf("address of int in allocated memory = %p\n", (void*)ptr_1);
        free(ptr_1);                        /* stop allocated storage duration  */
    }

    Dynamic Memory Allocation

    C provides functions for heap management:

    • malloc: allocate a block of uninitialized memory
    • calloc: allocate a block of zeroed memory
    • free: free previously allocated block of memory
    • realloc: change size of previously allocated block (careful - it might move!)

    The following is an example of a binary tree implementation using dynamic memory allocation:

    #include <stdio.h>
    #include <stdlib.h>
    
    struct TreeNode {
      int val;
      struct TreeNode* left;
      struct TreeNode* right;
    };
    
    struct TreeNode* create_node(int val) {
      struct TreeNode* node = (struct TreeNode*) malloc(sizeof(struct TreeNode));
      node->val = val;
      node->left = NULL;
      node->right = NULL;
      return node;
    }
    
    // Traverse the tree in-order and print the values
    void traverse(struct TreeNode* node) {
      if (node == NULL) {
        return;
      }
      traverse(node->left);
      printf("%d ", node->val);
      traverse(node->right);
    }
    
    
    struct TreeNode* insert(struct TreeNode* root, int val) {
      if (root == NULL) {
        return create_node(val);
      }
      if (val < root->val) {
        root->left = insert(root->left, val);
      } else {
        root->right = insert(root->right, val);
      }
      return root;
    }
    
    int main() {
      struct TreeNode* root = NULL;
      root = insert(root, 5);
      root = insert(root, 3);
      root = insert(root, 7);
      root = insert(root, 2);
      root = insert(root, 4);
      root = insert(root, 6);
      root = insert(root, 8);
      traverse(root);
      return 0;
    }

    There is a problem with this program. Can you find it?

    Critical Situations

    • Memory leak: if you forget to deallocate memory - your program will eventually run out of memory
    • Double free: if you call free twice on the same memory - possible crash or exploitable vulnerability
    • Use after free: if you use data after calling free - possible crash or exploitable vulnerability

    In short, too many bad things can happen if you don't manage the memory correctly!

    Any solution? Yes, use valgrind.

    C Compilation Process

    Unlike programs written in interpreted languages, C programs are translated into machine code by a compiler before they are run.

    A full compilation in C is depicted in the following figure:

    A detailed explanation can be found here.

    C Preprocessor

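    You often see C preprocessor macros defined to create "small functions".

    But they aren't actual functions: the preprocessor just changes the text of the program before it is compiled.

    #include copies the contents of the named file into the current file, and a #define macro is replaced by its body with the arguments substituted textually.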

    Example:

    #define twox(x) (x + x)
    
    // twox(3); => (3 + 3);
    
    // this could lead to unexpected behaviours
    // int y = 2;
    // int z = twox(y++); => z = (y++ + y++);  y is modified twice without sequencing, which is undefined behaviour

    You can also use #define to define some constants:

    #define ARR_SIZE 100

    Conditional Inclusion

    Frequently, you’ll need to write different code to support different implementations. You can use the preprocessing directives #if, #elif, #else to conditionally include source code.

    Here is a simple example of using conditional inclusion:

    #include <stdio.h>
    
    int main() {
    #ifdef __linux__
        printf("I am the Happy Penguin!\n");
    #elif defined(_WIN32)
        printf("Welcome to MS Windows ( I rule!).\n");
    #elif defined(__APPLE__) && defined(__MACH__)
        printf("Welcome to I am cool!\n");
    #else
        printf("Uh! who am i?\n");
    #endif
    }

    Note that the macros tested above are predefined by common compilers and are a widely used way of detecting the operating system the code is being compiled for. See https://sourceforge.net/p/predef/wiki/OperatingSystems/ for further details.

    Header Guards

    One problem you will face when writing header files is preventing programmers from including the same file twice in a translation unit.

    Given that you can transitively include header files, you could easily include the same header file multiple times by accident.

    Header guards ensure that a header file is included only once per translation unit.

    Suppose we have a bar.h file:

    #ifndef BAR_H
    #define BAR_H
    
    int func(void) { return 1; }
    
    #endif /* BAR_H */

    And a foo.c file:

    #include "bar.h"
    #include "bar.h" // Repeated inclusion is usually not this obvious.
    
    int main(void) {
        return func();
    }

    Using the header guard prevents the definition of func from being included twice.

    A common practice when picking the identifier to use as a header file guard is to use the salient parts of the file path, filename, and extension, separated by an underscore and written in all capital letters. E.g. FOO_BAR_BAZ_H for a file located in foo/bar/baz.h.

    There are other ways of using preprocessor directives and macros. This article and the GCC documentation provide extensive information about them.

    C Program Structure

    We've talked about storage duration above. Storage duration and linkage are closely related. In C, you can use storage-class specifiers to specify the storage duration and linkage of an object or a function. They are:

    • auto: automatic duration and no linkage
    • register: automatic duration and no linkage; address of this variable cannot be taken (we won't cover this, it's quite rare)
    • static: static duration and internal linkage (at file scope)
    • extern: static duration and external linkage

    Linkage

    Linkage refers to the ability of an identifier (variable or function) to be referred to in other scopes.

    C provides three kinds of linkage:

    • none: The identifier can be referred to only from the scope it is in.
    • external: The identifier can be referred to from everywhere in the program. (E.g. from other source file).
    • internal: The identifier can only be referred to within the translation unit that contains the declaration.

    If no storage-class specifier is provided, the following defaults apply:

    • extern for all functions
    • extern for all objects at file scope
    • auto for objects at block scope

    Let's look at some examples.

    // flib.h
    #ifndef FLIB_H
    #define FLIB_H
        void f(void);              // function declaration with external linkage
        extern int state;          // variable declaration with external linkage
        static const int size = 5; // definition of a read-only variable with internal linkage
        enum { MAX = 10 };         // constant definition
    #endif // FLIB_H
    
    // flib.c
    #include "flib.h"
    static void local_f(int s) {}  // definition with internal linkage (only used in this file)
    static int local_state;        // definition with internal linkage (only used in this file)
     
    int state;                     // definition with external linkage (used by main.c)
    void f(void) { local_f(state); } // definition with external linkage (used by main.c)
    
    // main.c
    #include "flib.h"
    int main(void)
    {
        int x[MAX] = {size}; // uses the constant and the read-only variable
        state = 7;           // modifies state in flib.c
        f();                 // calls f() in flib.c
    }

    Special Use of static

    Declaring a variable at block scope as static creates an identifier with no linkage, but it does give the variable static storage duration:

    #include <stdio.h>
    
    void foo() {
        static int count = 0; // count has no linkage but has static storage duration
        printf("Function has been called %d times\n", ++count);
    }
    
    int main() {
        foo();
        foo();
        foo();
        return 0;
    }

    Compiler Options

    Here are some recommended compiler and linker flags for GCC and Clang:

    • -O2: optimize your code for speed/space efficiency
    • -Wall: turn on recommended compiler warnings (Always use this option, some of the warnings can save you hours of debugging!)
    • -Werror: turn warnings into errors
    • -g: enable debugging. Need this option if you want to use a debugger such as GDB.
    • -o <output filename>: Name the output executable file with a given filename.
    • -I <dir>: add the directory dir to the list of directories to be searched for header files.
    • -std=<standard>: specify the language standard, e.g. -std=c11
    • -pedantic: issue warnings demanded by strict conformance to the standard
    • -D_FORTIFY_SOURCE=2: detect some buffer overflows at runtime (only effective when optimization such as -O2 is enabled)
    • -fpie -Wl,-pie: enable full ASLR (address space layout randomization) for better security