C code generation

From Liberty Eiffel Wiki
Revision as of 22:04, 3 March 2013 by Ramack (talk | contribs) (27 revisions: initial import from SamrtEiffel Wiki - The Grand SmartEiffel Book)

General aspects

Generated files

When compiling to C, SmartEiffel generates the following output:

  • One or more C files
  • A C header file, projectname.h.
  • A type-to-id mapping file (see #Mapping types to IDs section below).
  • A C compilation script (projectname.make or projectname.bat depending on the platform).

If you are using the -no_split mode, all the C code will be put inside a projectname.c file. Otherwise, code will be split in chunks (of more or less the same size) called projectnameN.c, where N is a positive integer number. The number of chunks may vary depending on the size of the project. The policy for splitting files will probably change in SE 2.4.

The compilation script will contain a list of commands to invoke the system C compiler and compile the relevant files with the proper flags (extracted from the command line, ACE file and/or plugin options). It will only include compile commands for those C files which have no associated object file, or that have changed since the previous compilation. It will also contain a final linking command (to generate the executable) when any of the relevant C flags has changed.

The SmartEiffel runtime (i.e., auxiliary routines that handle mechanisms such as garbage collection, exceptions, etc.) and plugins are not linked, but embedded instead. That means that the code for those components is embedded inside the .c/.h files. The runtime files are located in the sys/runtime/c directory of the compiler distribution. For example, when compiling with the GC enabled, the file sys/runtime/c/gc_lib.c is copied verbatim into one of the generated projectN.c files and the file sys/runtime/c/gc_lib.h is copied verbatim into the project .h file.

Mapping types to IDs

When compiling to C, every type existing in the system is assigned an id, which is a unique positive integer number. This number is used in different parts of the C code instead of the actual type name. For example if the type STRING is assigned id 7, the feature is_equal of type STRING will be mapped to a C routine called r7is_equal (the 7 in the name comes from the type id).

Note that ids are assigned to types and not to classes. Most of the time there is one type per class, but there can be possibly more types per class. For example, even if ARRAY is a single class, ARRAY[INTEGER] and ARRAY[STRING] are distinct types, and as such will get different ids.

A few of the types based on library classes (all INTEGER*, REAL*, CHARACTER, BOOLEAN, POINTER, NATIVE_ARRAY[CHARACTER] and STRING) have a predefined, fixed id. For details on the implementation of this fixed mapping, check the ID_PROVIDER class, especially the feature make.

After compilation, the mapping of types to ids is stored in a text file called project.id. This file is a useful help to read the generated C code. It is also used by the compiler itself when recompiling; the compiler uses it to keep the mapping between different runs. Maintaining the mapping helps to get similar generated code between compiler runs, which means that fewer C files need to be recompiled.

Mapping Eiffel types to C types

Not every type needs a direct C implementation. A lot of types are never instantiated, because they are deferred or just because the system never creates an instance of them. Types which are effectively used in the system are called "live" types. For each live type with id N, a C type is created called TN. For some of the standard base types, the definition of the type is hardcoded in the compiler; for example, in your project .h file you will find:

typedef int32_t T2; /* 2 is the id for INTEGER */

typedef double T5; /* 5 is the id for REAL_64 */

For references (which might be polymorphic), the C type T0 is defined, and references are of type T0 *.

For most other live types, a C struct is used, where fields of the struct correspond to the attributes (proper and inherited) of the type. The struct of the type is called struct SN. For example, if the type STD_OUTPUT has id 38, you will get the following code in the .h file:

typedef struct S38 T38;
/* ... */
struct S38{Tid id;T0* _filter;T2 _buffer_position;T9 _buffer;T2 _capacity;};

Note that each field has the name of the corresponding Eiffel attribute, preceded by an underscore (to avoid possible name clashes with C keywords, for example if you have a class with an attribute called "static" or "int"). If the field is of an expanded type, it is declared with the corresponding C type: you can see in the struct above that the attribute buffer_position was declared as INTEGER, whose type has id 2; buffer is of type T9 because id 9 is assigned to the type NATIVE_ARRAY[CHARACTER]. If the field is not of an expanded type, it is declared as a C field with type T0 *.

Also note that there is an additional field Tid id. This is an integer field containing the type id for this structure; in this case the field should always be set to 38. The field is used to identify the type whenever a pointer (usually a T0 *) may point to more than one type of structure. That happens not only when polymorphism is used in the original source, but also in some internal polymorphic functions in the stack-dump printing code and in the garbage collector. If the compiler decides that the id field is not needed (which happens a lot in boost mode with no GC, but also for expanded types), the field is omitted.

Now it is easy to explain the definition of type T0:

typedef struct S0 T0;
struct S0{Tid id;};

Native arrays have a special, different implementation. The C type is defined as a pointer to the element type. When the element type is a reference type, the mapping of that element type is a T0 *, so:

  • NATIVE_ARRAY[CHARACTER] , with CHARACTER having type id 3, will be mapped to a T3 *
  • NATIVE_ARRAY[STRING] , with STRING having type id 7, will be mapped to a T0 **

For each type, there is also an initialization constant defined to set the default values for the object. The constant to initialize values of type with id N is called MN. For default values, the constant gets a hardcoded value and is defined as a macro:

#define M5 (0.0) /* 5 is the id for REAL_64 */

For generic instantiations of NATIVE_ARRAY, the default is also defined as a macro and is always NULL:

#define M9 NULL

For structures, the initial value is an extern variable declared in the .h file, with its value set in one of the .c files. The initial value sets the id field (if present) to the correct type id, while the other fields are set to their respective default values. For the example struct above, the initialization code is:

extern T38 M38; /* in the .h file */
 /* in the .c file */
T38 M38={38,(void*)0,0,(void*)0,0};
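A sketch of how such an initialization constant could be used when a new object is created: the fresh memory is simply overwritten with a copy of M38. The function new38 is hypothetical, as are the Tid and T9 definitions; the real allocation path depends on the compilation mode (for example, whether the GC is enabled) and is not shown in this document.

```c
#include <stdint.h>
#include <stdlib.h>

typedef int Tid;        /* assumption */
typedef int32_t T2;     /* INTEGER */
typedef char *T9;       /* NATIVE_ARRAY[CHARACTER], assuming CHARACTER maps to char */

typedef struct S0 T0;
struct S0 { Tid id; };

typedef struct S38 T38;
struct S38 { Tid id; T0 *_filter; T2 _buffer_position; T9 _buffer; T2 _capacity; };

/* Initialization constant, as in the example above. */
T38 M38 = {38, (void *)0, 0, (void *)0, 0};

/* Hypothetical allocator: plain malloc followed by a struct copy of M38,
   so that id and every attribute start with their default values. */
T38 *new38(void)
{
    T38 *obj = malloc(sizeof *obj);
    if (obj != NULL) {
        *obj = M38;
    }
    return obj;
}
```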

(explain what happens when a type has no attributes)

Mapping Eiffel features to C code

Routines

Eiffel routines are mapped to C functions. For a routine in the type with id N, called eiffel_name, a C function called rNeiffel_name is generated. That routine:

  • Is declared as void for procedures or has a return type of the obvious mapping type of the result for functions (that is T0 * for references, or TN for expanded types).
  • Has an argument se_dump_stack *caller. This argument is used for describing the activation record of the caller routine; more details about this are given in the #Exception handling section. In boost mode, this information is not needed and this argument is removed.
  • Has an argument called C with the type of Current directly mapped. This is one of the few cases where references are not changed to T0 *, but the specific id is used even for reference types. Note that an expanded current will have a declaration of Tnn C, while a reference type will have a declaration of Tnn *C; this is because there is no possible polymorphism here. In a few cases, for routines that do not need information on the current instance, this argument is omitted.
  • Has arguments called a1, a2, ... for each of the arguments of the Eiffel routine. These arguments are mapped to C types in the usual way.

Note that each routine in the Eiffel source may be mapped to multiple C functions, one per live descendant type (and generic variation). This happens even if there is no redefinition, because C types may be different for the same piece of Eiffel code (due to anchors, generics, and the change of Current). Note that when generating code for these routines, anchored types and generics are resolved to specific types and thus require no special handling.

When there is a call and the compiler can statically decide the run-time type of the call target, the call is mapped directly to one of these functions. When the target is possibly polymorphic, a "switching function" is generated. The switching function is called XMname, where M is the type id of the static target. This function has a prototype similar to the functions described above, but with C (the argument to pass Current) declared as T0 * (note that polymorphic calls are always done on reference targets). The implementation of the function is a switch or nested ifs, which call the corresponding rP when the run-time type of the object is the one with id P (P should be the id of a type which conforms to the type with id M). In non-boost modes, the switching function may have an argument called position, with an encoding of the source position of the call for error reporting purposes (see #Exception handling).
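The following is a hand-written sketch of a switching function in boost-mode style (no caller or position arguments). The type ids 40, 41 and 42, the feature name area and the attribute names are invented for the example; only the rNname / XMname naming scheme and the dispatch on the id field come from the description above.

```c
#include <stdint.h>
#include <stdlib.h>

typedef int Tid;        /* assumption */
typedef int32_t T2;     /* INTEGER */
typedef struct S0 T0;
struct S0 { Tid id; };

/* Hypothetical live types 41 and 42, both conforming to a type with id 40. */
typedef struct S41 T41;
struct S41 { Tid id; T2 _side; };
typedef struct S42 T42;
struct S42 { Tid id; T2 _width; T2 _height; };

/* One C function per live type; C is typed with the specific id. */
T2 r41area(T41 *C) { return C->_side * C->_side; }
T2 r42area(T42 *C) { return C->_width * C->_height; }

/* Switching function for calls whose static target type has id 40:
   C is a T0 * and the id field selects the function to call. */
T2 X40area(T0 *C)
{
    switch (C->id) {
    case 41: return r41area((T41 *)C);
    case 42: return r42area((T42 *)C);
    default: abort(); /* no other live type conforms to 40 in this sketch */
    }
}
```

A call x.area with x statically of the type with id 40 would then compile to something like X40area(x), while a call on a target known to be of type 41 would go straight to r41area.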

In boost mode, some simplifications may be made. In particular, some routines are inlined instead of being mapped to a C function. The switching functions may be inlined too.

Attributes

As seen before, Eiffel attributes have corresponding attributes in the C structures which represent instances. Attribute access is translated to structure field access when the type of the object can be decided at compile-time. However, an Eiffel expression x.attr, when the run-time type of x is not completely decided, can not be translated directly as field access, because the compiler doesn't know at compile time how to typecast the T0 * which represents x; and in fact a cast can not be done safely, because due to inheritance (possibly multiple), the attribute can be at different offsets in the structures that represent the possible live run time types of x.

In those cases, a switching function is also generated. The branches on the switch check the live type and do the proper typecast and field access in each case.

Note that this also generalizes to the cases where a query is implemented in some subtypes as a function and in other subtypes as an attribute. In those cases, the switching function has some branches doing function calls and other branches doing field access.
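A sketch of a switching function for a query, under invented names: the query count is an attribute in the types with ids 51 and 52 (at different offsets) and a function in the type with id 53. All ids, attribute names and the X50count name are assumptions chosen for illustration.

```c
#include <stdint.h>
#include <stdlib.h>

typedef int Tid;        /* assumption */
typedef int32_t T2;     /* INTEGER */
typedef struct S0 T0;
struct S0 { Tid id; };

/* count is stored right after id here... */
typedef struct S51 T51;
struct S51 { Tid id; T2 _count; };
/* ...but after another attribute here, so its offset differs. */
typedef struct S52 T52;
struct S52 { Tid id; T0 *_next; T2 _count; };
/* In this type, count is computed by a function instead of stored. */
typedef struct S53 T53;
struct S53 { Tid id; T2 _lower; T2 _upper; };

T2 r53count(T53 *C) { return C->_upper - C->_lower + 1; }

/* Switching function for x.count when x is statically of type 50. */
T2 X50count(T0 *C)
{
    switch (C->id) {
    case 51: return ((T51 *)C)->_count;     /* field access */
    case 52: return ((T52 *)C)->_count;     /* field access, other offset */
    case 53: return r53count((T53 *)C);     /* function call */
    default: abort();
    }
}
```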

Once routines

For once routines, one or more C functions are generated, as for any other routine. In addition, one or two global variables are generated: a flag that remembers whether the routine has already been called and, for once functions, a second variable for the cached result. The names of these variables include the id of the class where the once routine is declared. Here, "id of a class" means "the id of the type which directly represents the class" (there is always exactly one such type, because generic classes can not have once features declared directly on them).

The flag variable is declared as int fBCMname (fBC means "flag at base class"); the cached result is called oBCMname ("once for base class"), with its type mapped in the normal way. Here M is the id of the base class (written mm in the code below). Both variables are declared in the .h file and initialized in one of the .c files. The routines of the live types (there may be more than one, and several of them will have a type id other than M) will have code like:

if (fBCmmname==0) {
  fBCmmname=1; {
   /* translation of routine body here, using oBCmmname as `Result' */
}}
return oBCmmname; /* Only in functions */

Note that using the id of the base class and not of the live type ensures that the once results are effectively shared systemwide (once-per-system instead of once-per-live-type).
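To make the sharing concrete, here is a small sketch with a once function answer declared in a class with (invented) id 60 and compiled for two live types, 60 and 61. Both functions use the same fBC60answer / oBC60answer pair. The counter body_runs exists only to instrument the sketch, and the Current argument is left out because the body does not use it.

```c
#include <stdint.h>

typedef int32_t T2;     /* INTEGER */

int fBC60answer = 0;    /* flag: has the once body already run? */
T2  oBC60answer = 0;    /* cached Result */
int body_runs = 0;      /* instrumentation for this sketch only */

/* The once function as compiled for the live type with id 60. */
T2 r60answer(void)
{
    if (fBC60answer == 0) {
        fBC60answer = 1; {
            body_runs++;
            oBC60answer = 42;   /* translation of: Result := 42 */
        }}
    return oBC60answer;
}

/* The same Eiffel routine compiled for a live descendant with id 61;
   it still uses the variables named after the base class id 60. */
T2 r61answer(void)
{
    if (fBC60answer == 0) {
        fBC60answer = 1; {
            body_runs++;
            oBC60answer = 42;
        }}
    return oBC60answer;
}
```

Whichever of the two functions is called first runs the body; every later call, through either function, only returns the cached value.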

Implementing local variables and Result values

Local variables in Eiffel routines are mapped one by one to local variables in the mapped C function. The C type used is the usual mapping (T0 * for references, TN for expanded types). Locals are declared and initialized at the top of the routine. The name of the local is the Eiffel name preceded by an underscore. So, if you have local i: INTEGER; some_string: STRING, the C code for implementations of that routine will have

T2 _i=0;
T0* _some_string=(void *)0;

Additionally, Eiffel functions get an extra local variable in the C code called R, which maps the special Eiffel variable Result. It is typed and initialized like any other local variable. Eiffel functions always get exactly one return statement in their C mapping, on their last line, saying return R;.

An exception to the above are once functions. As mentioned before, the result of those functions is a global variable. In that case, the R local is not declared, uses of Result in the Eiffel code are mapped to uses of the global variable, and the last line of the routine will be a return oBCmmname;.
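As an illustration of the conventions above, this is roughly what a small non-once function could look like in boost-mode style. The Eiffel routine sum_to, the type id 70 and the exact statement layout are invented; a real compiler run will produce different (and less readable) code.

```c
#include <stdint.h>

typedef int Tid;        /* assumption */
typedef int32_t T2;     /* INTEGER */
typedef struct S70 T70; /* hypothetical type with id 70 */
struct S70 { Tid id; };

/* Hypothetical Eiffel source:
 *   sum_to (n: INTEGER): INTEGER
 *     local i: INTEGER
 *     do
 *       from i := 1 until i > n loop
 *         Result := Result + i
 *         i := i + 1
 *       end
 *     end
 */
T2 r70sum_to(T70 *C, T2 a1)
{
    T2 R=0;     /* Result */
    T2 _i=0;    /* local i, with its default value */
    (void)C;    /* Current is not used by this body */
    _i = 1;
    while (!(_i > a1)) {
        R = R + _i;
        _i = _i + 1;
    }
    return R;   /* the single return statement, on the last line */
}
```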

Exception handling

The managed stack

Even if not required for implementing the rescue construct, SmartEiffel implements a "managed stack". This managed stack consists of extra information embedded in the execution stack, which is useful to provide debugging and backtrace information; it is also used for assertion checking. The managed stack is used in all compilation modes except for the boost mode.

In C, for every function call executed a stack frame is allocated. The stack frame contains the function arguments, the local variables, and probably some machine-dependent and C compiler dependent information (for example, a return address). The order of these elements on the stack frame may also be variable between platforms. SmartEiffel adds some local variables to each routine which hold some metadata about the structure of those stack frames, for example:

  • which routine is the one corresponding to that frame
  • where are the locals and arguments located in the stack
  • where is each frame located in the call chain

This information is stored in a local variable of every routine called ds of type se_dump_stack. A variable size part of the information is stored in another local variable called locals, which is a stack allocated array of pointers. A typical stack frame might look like this:

The image above shows heap-allocated objects in blue, stack-allocated (expanded) objects in red, global runtime structures in green and stack-allocated runtime info/pointers in white. The routine above uses Current in some way (hence the variable C). It has two arguments (a1, a2) and the first one is of an expanded type (shown in red). It is a function returning some expanded type (it has an R variable), and it has two local variables, local_exp of an expanded type, and local_ref of some reference type.

The local ds variable is initialized on routine entry. It has the following fields:

  • ds.fd points to the frame descriptor. The frame descriptor is a structure declared as a local static variable (so, it is globally allocated and shared) and contains information shared by all the stack frames of the same routine (a string with the Eiffel routine name and class, number of arguments, etc).
  • ds.current is initialized to &C. This allows code using the data structure to get the value of Current in this stack frame. Note that the pointer points to the location of the C variable, which may be the current object itself if Current is of an expanded type, but usually will in turn be a pointer to the actual object. If the routine has no need for Current, this value will be NULL.
  • ds.p is an integer value with an encoding of the position (file, line, column) inside the source Eiffel code which was last executed while this frame was active. See the next section for details.
  • ds.locals is initialized to &locals or to NULL in the cases where locals is not needed (see below).
  • ds.caller points to the ds local variable of the calling function. This value is passed in the caller argument added by the compiler to each routine.

There is a global runtime variable called se_dst (SmartEiffel dump-stack top) which always points to the se_dump_stack of the currently running routine. So, starting from se_dst and following the caller field of the dump-stacks, you can follow a linked list of all the active routines in the call stack. For example, if you compile a system with root class ROOT with id 37 and creation routine make, and that routine calls some routine1 of the same class, which calls the item function of STRING, at some point the link chain will look like this (stack grows towards the bottom of the figure):
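The same chain can also be reproduced with a simplified C sketch. The structures below are stand-ins that keep only the fields named in the text, with assumed C types; the frame descriptor is reduced to a name string, the way se_dst is saved and restored is an assumption, and trace_chain, last_trace and last_depth are hypothetical helpers, not runtime functions.

```c
#include <stddef.h>
#include <stdio.h>

/* Simplified stand-ins for the runtime types (field types are assumptions). */
typedef struct { const char *name; } se_frame_descriptor;
typedef struct se_dump_stack se_dump_stack;
struct se_dump_stack {
    const se_frame_descriptor *fd;
    void *current;
    unsigned int p;
    void **locals;
    se_dump_stack *caller;
};

se_dump_stack *se_dst = NULL;   /* top of the managed stack */
char last_trace[128];           /* filled by the innermost routine */
int last_depth = 0;

/* Hypothetical helper: walks from se_dst through the caller links, writes
   "inner <- ... <- outer" into buf and returns the number of frames written. */
int trace_chain(char *buf, size_t size)
{
    int n = 0;
    size_t used = 0;
    if (size > 0) buf[0] = '\0';
    for (se_dump_stack *f = se_dst; f != NULL; f = f->caller) {
        int w = snprintf(buf + used, size - used, "%s%s",
                         n ? " <- " : "", f->fd->name);
        if (w < 0 || (size_t)w >= size - used) break;   /* buffer full */
        used += (size_t)w;
        n++;
    }
    return n;
}

char r7item(se_dump_stack *caller)
{
    static const se_frame_descriptor fd = { "item STRING" };
    se_dump_stack ds = { &fd, NULL, 0, NULL, caller };
    se_dst = &ds;
    last_depth = trace_chain(last_trace, sizeof last_trace);
    se_dst = caller;
    return 'x';
}

void r37routine1(se_dump_stack *caller)
{
    static const se_frame_descriptor fd = { "routine1 ROOT" };
    se_dump_stack ds = { &fd, NULL, 0, NULL, caller };
    se_dst = &ds;
    (void)r7item(&ds);
    se_dst = caller;
}

void r37make(se_dump_stack *caller)
{
    static const se_frame_descriptor fd = { "make ROOT" };
    se_dump_stack ds = { &fd, NULL, 0, NULL, caller };
    se_dst = &ds;
    r37routine1(&ds);
    se_dst = caller;
}
```

Calling r37make(NULL) leaves the three-frame chain, innermost routine first, in last_trace.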

Encoding source positions

For debugging purposes, it is useful to map positions in C code to source files. In the compiled C code, source positions are represented as an unsigned int. The implementation expects that the C int type has at least 32 bits.

There are two ways of encoding positions: one that encodes just a file and a line number, and another one that also encodes a column number. The last (least significant) bit of the value is 0 when there is an encoded column. The column number, if present, is in the following 7 least significant bits. Next comes the line number (16 bits if no column is present, or 13 bits otherwise). Finally, a file identifier is in the remaining bits. So the two possible layouts are (see http://smarteiffel.loria.fr/tools/api/tools.d/kernel.d/POSITION.html#mangling):

  • 15 bits for file id, 16 bits for line number, 1 bit set to 1
  • 11 bits for file id, 13 bits for line number, 7 bits for column number, 1 bit set to 0

There are macros to decode positions in the no_check.h runtime header. They are used when printing tracebacks and error messages. The macros are called se_position2*. The mangling, in the compiler, is defined in the POSITION class.
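A sketch of the two layouts as plain C helpers, following the bit order described above (flag in bit 0, then column, then line, then file id). The pos_* functions are hypothetical names written for this example; they are not the se_position2* macros, and the authoritative layout remains the one in the POSITION class.

```c
#include <stdint.h>

/* Layout without column: 15 bits file id | 16 bits line | 1 bit set to 1 */
uint32_t pos_encode_line(uint32_t file_id, uint32_t line)
{
    return ((file_id & 0x7FFFu) << 17) | ((line & 0xFFFFu) << 1) | 1u;
}

/* Layout with column: 11 bits file id | 13 bits line | 7 bits column | 0 */
uint32_t pos_encode_line_col(uint32_t file_id, uint32_t line, uint32_t col)
{
    return ((file_id & 0x7FFu) << 21) | ((line & 0x1FFFu) << 8)
         | ((col & 0x7Fu) << 1);
}

/* The lowest bit tells the two layouts apart. */
int pos_has_column(uint32_t p)
{
    return (p & 1u) == 0;
}

uint32_t pos_column(uint32_t p)
{
    return pos_has_column(p) ? (p >> 1) & 0x7Fu : 0u;
}

uint32_t pos_line(uint32_t p)
{
    return pos_has_column(p) ? (p >> 8) & 0x1FFFu : (p >> 1) & 0xFFFFu;
}

uint32_t pos_file_id(uint32_t p)
{
    return pos_has_column(p) ? (p >> 21) & 0x7FFu : (p >> 17) & 0x7FFFu;
}
```

The column layout trades range for detail: it can only address lines below 8192 and file ids below 2048, which is presumably why the line-only layout exists as a fallback.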

File ids are actually the same numbers as type ids. A globally allocated array char *p[N] is declared by the compiler, where N is the highest type id. The routine initialize_eiffel_runtime assigns initial values to the elements of this array, in the style of p[7]="/usr/lib/SmartEiffel/lib/string/string.e";. Some assignments are also done by aliasing the strings, for example p[99]=p[88]; this happens when 99 and 88 are ids of two types which are actually based on the same class (due to genericity).

In the generated C code, you will find a lot of encoded positions. Each time the compiler generates code for a piece of Eiffel source at a distinct position, an assignment ds.p=0x<encoded>; is inserted (usually with a human-readable comment like /*l133c5/string.e*/). The assignment modifies the status of the current dump stack frame (see previous section); in that way, if an exception is triggered, there is detailed information about the source code positions for each of the routines in the stack.

Recovering from exceptions

When an exception is triggered internally (a developer exception, assertion check, or internal checks like the one for Void targets), the internal_exception_handler is called. When an external signal is received, signal_exception_handler is called. Both behave similarly (the signal catcher does some platform-specific signal handling). If there is no rescue clause, they print a traceback (except in boost mode, which has no managed stack to print) and exit.

When the program passes through a routine with a rescue handler, the information about the handler is added to a stack. When exiting one of those routines, the top of that stack is removed. The internal handlers know whether there is some rescue clause to catch the exception or not by looking if the stack is empty.

The stack is implemented as a linked list of nodes stored locally, in the stack frames of routines with rescue clauses. The top of the stack is pointed to by a global variable rescue_context_top, set to NULL when the stack is empty. Each node of the rescue stack is of type rescue_context, a structure containing the following fields:

  • jb, a jump buffer returned by the C function setjmp, which is used to jump back to the routine when handling the exception.
  • next, a pointer to the rescue context below, or NULL when we are at the bottom of the stack.
  • top_of_ds, a pointer to the managed stack frame of the routine which has this rescue clause. This field is not present in boost mode, because there is no managed stack.

Every routine containing a rescue clause (or belonging to a class with a redefined default_rescue feature) declares a variable called rc of type struct rescue_context. At the very beginning of the compiled routine, the jb field of this context is initialized, with a block like this:

if(SETJMP(rc.jb)!=0){/*rescue*/
   ... compiled rescue clause ...
   internal_exception_handler(Routine_failure);
}

This will set the jump buffer, so the body of the rescue clause is executed after a longjmp (which will happen on the event of an exception). The last line of the block describes the standard behaviour of propagating the routine failure if a rescue block ends without running a retry statement.

After the rescue section at the beginning of the routine there is a C label called retry_tag:; the retry statement is implemented as a goto retry_tag;. Note that this means that SE is unable to retry from outside the routine (and the compiler actually checks that a retry statement is inside a rescue clause); this is a bit different to other Eiffel variants which allow retry instructions anywhere. Precondition checking is after this tag, so the check is actually done again if the rescue clause retries.

After precondition checking, the rescue context is actually added to the rescue stack (which means that a precondition failure will trigger an outer rescue clause, not the rescue clause of the routine itself). The initialization looks like:

rc.next = rescue_context_top;
rescue_context_top = &rc;
rc.top_of_ds=&ds;

At the end of the routine, the context is removed:

rescue_context_top = rc.next;

Any exception produced internally (by assertion checking, void-checking, loop variants, etc.) is implemented by a call to the internal_exception_handler function, defined in sys/runtime/c/exceptions.c. Signals are handled by signal_exception_handler, which is very similar. These functions do the following:

  1. Setting some meta-information about the exception cause, in particular the global variables internal_exception_number (exception cause, same constants as in class EXCEPTIONS), signal_exception_number (signal id when the exception is caused by a signal), original_exception_number (original cause of the exception; preserved even when the final cause is usually "routine failure") and additional_error_message (sometimes some extra info, such as the assertion tag for assertions).
  2. If rescue_context_top == NULL it means that there is no rescue clause to catch this. An error message and a stack trace (on non-boost mode) are printed, and the program ends with exit (EXIT_FAILURE);.
  3. If rescue_context_top != NULL:
    1. The top rescue context is stored in the variable current_context
    2. The top rescue context is removed from the top of the stack.
    3. Assertion checking is reset by calling reset_assertion_checking. See section on how assertion checking works for details.
    4. It does a longjmp() to current_context->jb. This should send the control flow right into the rescue handler.
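The steps above, together with the rescue/retry layout described earlier, can be tried out with a reduced sketch in boost-mode style (no managed stack, so no top_of_ds). It uses plain setjmp instead of the SETJMP macro, a simplified internal_exception_handler that skips the trace printing and the assertion reset, arbitrary values for the exception constants, and an invented routine r90fetch that fails twice before succeeding.

```c
#include <setjmp.h>
#include <stddef.h>
#include <stdlib.h>

/* Exception codes; the numeric values are arbitrary in this sketch. */
enum { Developer_exception = 1, Routine_failure = 2 };

struct rescue_context {
    jmp_buf jb;
    struct rescue_context *next;
};

struct rescue_context *rescue_context_top = NULL;
int internal_exception_number = 0;
int attempts = 0;   /* instrumentation for this sketch only */

/* Simplified handler: record the cause, then either give up or pop the
   top rescue context and jump into its rescue block. */
void internal_exception_handler(int code)
{
    struct rescue_context *current_context = rescue_context_top;
    internal_exception_number = code;
    if (current_context == NULL) {
        exit(EXIT_FAILURE);     /* nobody can rescue */
    }
    rescue_context_top = current_context->next;
    longjmp(current_context->jb, 1);
}

/* Hypothetical routine with a rescue clause that retries up to twice. */
int r90fetch(void)
{
    struct rescue_context rc;
    int R = 0;
    if (setjmp(rc.jb) != 0) { /*rescue*/
        attempts++;
        if (attempts < 3) goto retry_tag;           /* Eiffel: retry */
        internal_exception_handler(Routine_failure);
    }
retry_tag:
    /* push the rescue context (after precondition checking, if any) */
    rc.next = rescue_context_top;
    rescue_context_top = &rc;
    /* routine body: fails on the first two attempts */
    if (attempts < 2) internal_exception_handler(Developer_exception);
    R = 100 + attempts;
    /* pop the rescue context on normal exit */
    rescue_context_top = rc.next;
    return R;
}
```

The jump buffer is set only once, before retry_tag, and is reused by every later longjmp; that works because r90fetch has not returned in between.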

Printing tracebacks

Garbage Collection