Summary Intercepts the .i files of a project while it is built with gcc.
Categories construction, process, testing
Owner(s) dsw

Build Interceptor

Summary: Build Interceptor captures the .i files of any project while it is built from source using the gcc tool-chain.

Maintainer: Daniel S. Wilkerson
Developers: Karl Chen

Anyone who has tried this on a large scale will find out that it is non-trivial to build a project from source and obtain the .i files generated during the build process. I give step-by-step instructions on how to use the provided scripts to do this without *any* modification to the build process of the project you are trying to capture.

This work was supported by professors Alex Aiken and David Wagner and was done at UC Berkeley.

Here are the current releases. Feel free to just get the current Subversion repository version as a guest user.

A warning on the toxicity of the techniques necessary while operating in a Byzantine Compilation Regime

WARNING: During its install process, build-interceptor searches for programs on your system that resemble a compiler tool-chain and messes with them. It is therefore necessary to do this install (but not the interception) as root. I must recommend that you install a separate operating system image (perhaps within a chroot bubble or a virtual machine) solely for the purpose of building packages with build-interceptor.

That said, my install/de-install makefile is pretty smart at preventing you from shooting yourself in the foot (it saved me several times): you can review the files to be moved before committing to do so and it won't allow nonsensical operations such as double-installation. Therefore I am willing to use it on my own development box and I have never messed it up this way.

We move the original installed tool-chain like this because it is the only way to be absolutely sure that all calls to the compiler, linker etc. are intercepted; all other tricks with environment variables etc. can be subverted by the build process, whereas with our technique in order to avoid interception the build process would have to use a different compiler or actively search for where we hid the real one.

Introduction

Welcome!

Build Interceptor is a collection of scripts for recording the .i files generated during a build of C or C++ programs with the gcc tool-chain. No modification to the original build process is necessary.

Limitations

The method described here requires that you be root on your box so you can replace the system cc1 and cc1plus programs, among others; this is done so that the build process you are intercepting does not have to be changed at all. You can probably also get it to work by setting environment variables such as GCC_EXEC_PREFIX. I could not figure out how to change the compiler proper (cc1) from gcc spec files.

The previous version of this tool would not work with the gcc 3 series as the preprocessor and compiler had been integrated; however since then by looking through the source code I discovered the seemingly undocumented flag "--no-integrated-cpp" which solves this problem.

Compilers other than gcc are not supported. Gcc 3.3 and 3.4 work; gcc 3.2.3 seems to not work.

Background

When gcc/g++ compiles, it pre-processes .c or .cc files to .i or .ii files (respectively), compiles .i or .ii files to .s files, assembles .s files to .o files, and links .o files to executables. It traditionally does all these stages with separate programs (new versions of gcc complicate this by integrating preprocessing and compilation), in particular the compiler-proper program being called cc1 or cc1plus for C or C++ (respectively).
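
To see these intermediate files for yourself, the stages can be run one at a time. This is only a sketch: it assumes gcc is on your PATH, and the helper name run_stages and the file name hello.c are made up for illustration.

```shell
# Run each gcc stage separately on one C file so that every
# intermediate file is left on disk. run_stages is an illustrative name.
run_stages() {
    src=$1                               # e.g. hello.c
    base=${src%.c}
    gcc -E "$src"    -o "$base.i" &&     # preprocess:           .c -> .i
    gcc -S "$base.i" -o "$base.s" &&     # compile proper (cc1): .i -> .s
    gcc -c "$base.s" -o "$base.o" &&     # assemble:             .s -> .o
    gcc "$base.o"    -o "$base"          # link:                 .o -> executable
}
```

Running run_stages hello.c in a scratch directory should leave hello.i, hello.s, hello.o and hello next to the source.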

Basics of how the build interception works

The cc1_interceptor.pl script captures the .i and .ii files generated by the gcc compiler tool chain by replacing and imitating cc1. It

  1. copies the pre-processed input, the .i file, to a new file,
  2. runs the real cc1 passing in the copy,
  3. puts the fully-qualified filename of the copy into a string in the section ".note.cc1_interceptor" in the assembly output.

This name flows to the .o file and then to the executable (the linker concatenates multiple occurrences of this section), where it can later be retrieved using objdump. This is easier to do with Ben Liblit's extract-section script, which he ships as part of "The Cooperative Bug Isolation Project" and which I include in this project; see below for details.
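
The three steps can be sketched in a few lines of shell. This is not the real cc1_interceptor.pl (which is a Perl script and may differ in every detail); the function names, the argument order and the exact assembler directives are assumptions made for illustration. Only the section name and the "tmpfile:" key are taken from this document.

```shell
# Sketch of the three steps for one compilation unit.
# intercept_one and append_note are hypothetical names.

append_note() {
    asmfile=$1; copy=$2
    # Append a new section whose only content is the copy's file name.
    # A name containing quotes or backslashes would need escaping first.
    {
        printf '\t.section\t.note.cc1_interceptor\n'
        printf '\t.ascii\t"tmpfile:%s\\n"\n' "$copy"
    } >> "$asmfile"
}

intercept_one() {
    infile=$1; copy=$2; asmfile=$3
    shift 3                                   # what remains is the real compiler command
    cp "$infile" "$copy" || return 1          # 1. copy the pre-processed input
    "$@" "$copy" -o "$asmfile" || return 1    # 2. run the real cc1 on the copy
    append_note "$asmfile" "$copy"            # 3. record the copy's name in the assembly
}
```

The real compiler is passed in as the trailing arguments so that the sketch does not hard-code where the original cc1 was moved to.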

The build interceptor process works by first moving away the system executables (using the Intercept.mk makefile, as root) and replacing them with softlinks to the interception scripts provided.
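
The move-and-link idea, including the refusal to install twice, looks roughly like this. It is a sketch only: intercept_on and intercept_off are hypothetical helpers, not the targets of Intercept.mk, and should be tried in a scratch directory rather than on a real tool-chain.

```shell
# Hypothetical helpers illustrating "move the original away, leave a
# softlink to the interceptor in its place". Intercept.mk may work differently.

intercept_on() {
    tool=$1; interceptor=$2
    # Refuse double installation: the saved original must not exist yet.
    if [ -e "${tool}_orig" ]; then
        echo "already intercepted: $tool" >&2
        return 1
    fi
    mv "$tool" "${tool}_orig" && ln -s "$interceptor" "$tool"
}

intercept_off() {
    tool=$1
    # Only undo if the tool is currently a softlink and the original was saved.
    if [ ! -L "$tool" ] || [ ! -e "${tool}_orig" ]; then
        echo "not intercepted: $tool" >&2
        return 1
    fi
    rm "$tool" && mv "${tool}_orig" "$tool"
}
```

Checking for the _orig file before moving anything is what keeps a second "on" from overwriting the saved original with the softlink.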

Licensing

All files in this directory tree and its subtrees are distributed under the license in License.txt; please see that file for copyright and terms of use.

Design

Simplicity

There are other ways one might attempt build process interception. This particular design has been chosen to avoid some problems that are not at all obvious if you have not tried this before. The salient lesson of those other projects is that build processes are very complex and interception is hard to do without breaking them; testing is very difficult because if something fails it is hard to know what went wrong, or even whether anything went wrong. The number one concern of the design is therefore to keep things as simple and non-intrusive as possible.

Our design builds on the experience of the MOPS project and Cooperative Bug Isolation Project (CBI), which I talk more about in the Acknowledgments section below.

Staged interception

We do not pipeline the build interception with any further analysis of the generated .i files. That is, we just save the generated .i files; we don't run an analysis right then. The MOPS project (below) did attempt to analyze .i files as they were generated. When a build would fail, they assumed that their analysis had failed. When we later separated the interception from the analysis, we found that in fact the interception was often failing but this was going undetected.

Another reason to separate them is that if your analysis does fail, you often want to re-run it multiple times as you gradually minimize the input, for example with the Delta tool for minimizing "interesting" files. This is only possible if you have already materialized the .i file somewhere separately.

Basically a complex process should be staged if at all possible to reduce complexity.

Metadata lives in data

We do not attempt to keep metadata on build-process-generated files anywhere outside the files themselves. Early versions of the MOPS project attempted to put derived data from a .i file into another file and then somehow maintain an association between the two. This was found to be impossible due to build processes moving files around etc.

All metadata for a file is inserted into the file in one way or another, depending on the current language the file is in: at the compile stage, it is inserted into the generated assembly (a trick novel to build_interceptor) and at the link stage it is inserted into the .o file using objcopy (a trick from MOPS and, I think, CBI as well).
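
As an illustration of the objcopy trick, the sketch below adds a file to an object file as a new section named after the file's md5 sum, the same naming shown in the Extraction section. The helper name embed_file is hypothetical, and md5sum is assumed to be the GNU coreutils program.

```shell
# Embed a payload file into an object file as a section named
# ".file.<md5 of payload>", modifying the object file in place.
# embed_file is an illustrative name, not one of the project's scripts.
embed_file() {
    obj=$1; payload=$2
    md5=$(md5sum < "$payload" | cut -d' ' -f1) || return 1
    objcopy --add-section ".file.$md5=$payload" "$obj" &&
    printf '%s\n' "$md5"                 # print the md5 so the caller can find the section
}
```

Because the data travels inside the .o file, it survives whatever renaming or moving the build process does to that file.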

Avoid long-range communication outside of data

We do not attempt complex out-of-band communication between the various sub-processes of gcc, which differs from both MOPS and CBI. MOPS for example attempts to capture the preprocessing stage, analyze it, and then insert the results in after the linking stage. Getting rid of this long-range dependency between stages greatly simplifies things.

We do by default insert the preprocessing output captured at the start of the compilation stage into the .o file at the end of the assembly stage. This is pretty simple as the out of band data is the preprocessing output which has been stored in a temporary file with a name computed to not collide with others and located in a canonical place; the name of this file is in-band, embedded in the file as it is passed along.
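
Judging from the sample tmpfile path shown in the Extraction section, the name seems to be built from the absolute source path, a timestamp and a process id, all placed under $HOME/preproc. The sketch below reproduces that shape; the reading of the sample and the helper name capture_name are my assumptions, not a description of the actual scripts.

```shell
# Build a capture-file name of the shape seen in the sample output:
#   $HOME/preproc/./<absolute source path>-<epoch seconds>-<pid>
# Time plus pid makes a collision unlikely, not impossible.
capture_name() {
    src=$1                               # absolute path of the source file
    printf '%s/preproc/.%s-%s-%s\n' "$HOME" "$src" "$(date +%s)" "$$"
}
```

For example, capture_name /home/dsw/foo/hello.c yields a name beginning with $HOME/preproc/./home/dsw/foo/hello.c- followed by the two numbers.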

Avoid parsing complex command-lines

Similarly we manage to almost completely avoid parsing the command-line arguments of gcc, though a few situations forced us to do it a little. Again, the simplification of the process is huge; we only parse arguments of simple tools such as cc1 and collect2; their command-lines are much simpler as another tool uses them, not a human.

Something you might be tempted to do along these lines is to remove -O* flags from the compile stage to speed things up, since perhaps you are only interested in the .i files and not in actually using the resulting executables. Removing -O* from the compile stage alone will not work: if it has been passed to the preprocessing stage, the compile stage will then fail on the preprocessed output because various things have been inlined. I suppose it would work to remove it from all stages, probably using the gcc spec file mechanism, but I don't consider it worth the complexity and possibility of failure.

Goals and amount of interception

Only use what you need

What tools must be intercepted during the build process depends on what your goal is. You can turn off the interception of tools by removing them from intercept.progs after it is built.

File-by-file

For a file-by-file analysis of source code, you simply need the source files after pre-processing. It is sufficient to just intercept cc1/cc1plus and (after running reorg_build.pl) look at the resulting .i files.

Note that even if you do not intercept cpp/cpp0/tradcpp0/gcc -E, the gcc spec file will tell gcc to not pass -P which means there should always be line directives in the .i file. So if your analysis finds an error, it can always map it back to the original source line.
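
Mapping a line of a .i file back to the original source only requires following the line directives, which in gcc output look like # 10 "hello.h". A minimal sketch, assuming file names without spaces; map_line is a made-up helper name.

```shell
# Print "file:line" of the original source for a given line of a .i file.
# Usage: map_line file.i LINE_NUMBER_IN_I_FILE
map_line() {
    awk -v target="$2" '
        /^# [0-9]+ "/ {
            # Line directive: the next line is line $2 of file $3.
            src_line = $2
            src_file = $3
            gsub(/"/, "", src_file)
            next
        }
        NR == target { print src_file ":" src_line; exit }
        { src_line++ }
    ' "$1"
}
```

If the requested line is itself a line directive, nothing is printed.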

Whole-program

For a whole-program analysis of all the source in the package, you need to know for each executable which .i files went into it. Each such executable (and any other files produced by the linker) will result in a .ld file which lists all the .i files that went into it that were compiled during the build.

For a really whole-program analysis that also looks at libraries, or if you wanted to modify the .i files, recompile, and re-link, you need to know *all* the .o files that went into an executable. For this you will need to also intercept collect2, which is implemented; however the script reorg.pl would also have to be extended to extract the linker --trace output, but this is straightforward.

You would want to intercept 'as' to make a mapping between .s files output by cc1/cc1plus and .o files linked together by the linker as well as the command-line. It would probably be best to insert the metadata after assembly using objcopy, just as with collect2.

Source-to-source

If you wanted to do a source-to-source transformation on the original source you would need the preprocessing command line as well, and so would have to intercept cpp/cpp0/tradcpp0/gcc -E; probably you would insert the metadata into the file as the initializer of a global string variable with an unusual name.

"Replaying" a build process from the interception record is probably trickier than one might at first imagine: build processes sometimes do strange things such as move files around. You would have to intercept mv and perhaps rm etc. I have not done this but it is not hard given the infrastructure. One thing you will likely want is for the build process to be deterministic, so the make interceptor removes -j from the command line; try out the TestMake.mk makefile with and without it.
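
Removing -j from a make command line can be sketched as a small argument filter. This is an illustration, not the project's make interceptor: strip_jobs_flag is a hypothetical name, and it only handles the spellings -j, -jN, -j N, --jobs and --jobs=N, not -j bundled with other short options.

```shell
# Print the given arguments one per line with any parallel-jobs option removed.
strip_jobs_flag() {
    skip_next=0
    for arg in "$@"; do
        if [ "$skip_next" -eq 1 ]; then
            skip_next=0
            case "$arg" in
                ''|*[!0-9]*) ;;          # not a number: keep it, handled below
                *) continue ;;           # the job count that followed -j: drop it
            esac
        fi
        case "$arg" in
            -j|--jobs)         skip_next=1 ;;    # count, if any, is the next argument
            -j[0-9]*|--jobs=*) ;;                # count attached: drop the whole argument
            *)                 printf '%s\n' "$arg" ;;
        esac
    done
}
```

With this filter, "-j 4 all" and "-j8 all" both reduce to "all", while "-j all" keeps the target because "all" is not a job count.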

Miscellaneous difficulties with gcc layering

You might have to experiment to figure out exactly which layer to intercept. I am using gcc 3.4.0 and it seems that neither cpp nor gcc -E call each other nor a program called cpp0, which seems to not exist anymore; however perhaps gcc 2.95.3 does. Similarly, ld does not call collect2, though the gcc source code suggests in a comment that they are interchangeable; why do they both exist? To assist in this experimentation, each interceptor script prints at the start its 1) name, 2) parent process id, 3) own process id and 4) arguments all to standard error (this may have been commented out, just uncomment).
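
If you write a wrapper of your own while experimenting, the same kind of trace line is easy to produce in shell. A sketch only; trace_invocation and the output layout are my own, not what the provided Perl scripts print.

```shell
# Print name, parent pid, own pid and arguments to standard error.
# Call it as: trace_invocation "$@"
trace_invocation() {
    # $0 is the script name; "$*" joins the arguments with spaces.
    printf '%s: ppid=%s pid=%s args: %s\n' "$0" "$PPID" "$$" "$*" >&2
}
```

Matching one process's pid against another's ppid in the output shows which layer called which.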

Using the scripts

Setup

This is the one-time initial setup of build_interceptor. Note that as is traditional, commands executed as a normal user are preceded by a '$' and those executed as root are preceded by a '#'.

NOTE: Build interceptor is incompatible with ccache. If you have ccache installed, turn it off by moving the ccache scripts away first.

  • Make a place to put the .i files in your $HOME directory.
        $ cd
        $ mkdir preproc-foo1
        $ ln -s preproc-foo1 preproc
    
  • Build the intercept.progs and other support files.
        $ make
    

    Now check that the files you want to intercept are generated in intercept.progs. You can change this file if you need to, but only do it while build interception is off! Otherwise you can get into an inconsistent state.

Interception

  • Move your system gcc to gcc_orig and link gcc to gcc_interceptor.pl.
        $ cd; cd build_interceptor
        $ su
        # make -f Intercept.mk on
    

    You could exit the root shell now, but I find it easier to instead just leave one shell open as root for turning interception on and off and do user things in another shell.

        # exit (leave the root shell)
    

    At any time you can check the interception state. This works as root or non-root; the other targets in Intercept.mk, which mutate the system state, check that you are root before running.

        $ make -f Intercept.mk
    

    If you are intercepting make as well and you want to avoid running the intercepted make, you can do this while interception is on.

        $ make_orig -f Intercept.mk
    
  • Build your project.

    If you mess up and need to start over again, just do this.

        $ rm -rf preproc/*
    

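    If you want to build two different projects and capture both, just move the link. The -f and -n options make ln replace the existing preproc link itself; a plain ln -s would follow the old link and create the new link inside preproc-foo1.

        $ mkdir preproc-foo2
        $ ln -sfn preproc-foo2 preproc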
    

    Before compiling anything else with gcc:

    1) Make the data read-only.

        $ cd
        $ chmod -R a-w preproc-foo1
    

    2) Point the preprocessor capture at another directory, replacing the existing link.

        $ mkdir preproc-junk
        $ ln -sfn preproc-junk preproc
    
  • When you are done, put gcc back where it was.
        $ cd; cd build_interceptor
        $ su
        # make -f Intercept.mk off
        # exit (leave the root shell)
    

Extraction

After intercepting a build, one would like to access the intercepted .i files. Build-interceptor comes with a script for just this purpose: extract_build.pl. This script creates an 'abstraction' of the build process: a directory containing 1) the intercepted .i files and 2) a Makefile such that typing 'make' "replays" the build. That is, suppose we have intercepted the build of an executable 'a.out'.

  • We may then extract the entire build at once.
        $ extract_build.pl -infile a.out -outdir xdir
    
    The result will be a new directory xdir that contains a Makefile and some .i files in a src subdirectory. The generic_Makefile is the same for all projects and contains the build logic; it is included by the Makefile which has variables configured from interception of the build process.
        $ ls xdir
        Makefile
        generic_Makefile
        src
    
  • The xdir/Makefile is very simple: it just compiles each .i file and links them together; therefore the extracted build process is much more likely to be amenable to a static analysis or a source-to-source transformation than the original build process. Changing to that directory we may now rebuild a.out from those .i files.
        $ cd xdir
        $ make
        $ make check  # to run the resulting executable
    

I think it is possible however for extract_build.pl to fail to correctly set up the Makefile, depending on the complexity of the original build process. Therefore we give two more primitive ways of getting at the .i files directly. First, the .i files are embedded into the ELF files; you can get them out of the ELF as follows.

  • Print out the metadata we inserted into the ELF.
        $ extract_section.pl .note.cc1_interceptor a.out
        (
                . . .
                md5:a78dd86286867621359f8629a7bad88e
        )
    
  • Use this output to construct the name of the ELF section containing the .i file and print that out.
        $ extract_section.pl .file.a78dd86286867621359f8629a7bad88e a.out
        [... the .i file contents here...]
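
If extract_section.pl is not at hand, the same sections can be inspected with stock binutils. A sketch, assuming GNU objdump and readelf are installed; the two wrapper names are made up.

```shell
# Hex-and-ASCII dump of the interceptor's note section.
show_note() {
    objdump -s -j .note.cc1_interceptor "$1"
}

# The same section printed as strings, which is easier to read for text metadata.
show_note_strings() {
    readelf -p .note.cc1_interceptor "$1"
}
```

Both take the ELF file (for example a.out) as their only argument.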
    

However, even this method may cause problems: for some huge projects (Mozilla) the embedded .i files cause the ELF file to exceed the file size limit on some systems (such as mine, where it is 2 GB). If this happens, do as follows.

  • Turn off the "feature" that the .i file is embedded into the ELF, either by setting the environment variable BUILD_INTERCEPTOR_DONT_EMBED_PREPROC or by commenting out this line in as_interceptor.pl:
        system('objcopy', $outfile, '--add-section', ".file.$md5=$tmpfile")
    
  • The .i files may be found down in $HOME/preproc. Print out the name of the temporary file where the .i file was saved; it is still there unless you have intercepted another project in the meantime and also gotten very unlucky.
        $ extract_section.pl .note.cc1_interceptor a.out
        (
                . . .
                tmpfile:/home/dsw/preproc/./home/dsw/foo/hello.c-1153018736-18133
        )
    
    

Files

Build-interceptor needs a place to put the pre-processed output, the .i files. The name of the directory where it puts them is hard-coded into the scripts:

  • $HOME/preproc: where the scripts put the .i files.

However it is not recommended to use the tool by simply making a preproc directory since after interception is over, you want to move that directory so that other compilations on your system do not inadvertently put more .i files in there. Thus in the above instructions I use a layer of indirection as follows:

  • $HOME/preproc-foo1: An actual directory for holding the .i files.
  • $HOME/preproc: a softlink to preproc-foo1 that should be moved as soon as interception is done.
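
Repointing the softlink can be wrapped in a small helper so that the link itself is always replaced. A sketch under the layout described above; repoint_preproc is a hypothetical name, and the -n option of ln is assumed to be available (it is in GNU and BSD ln).

```shell
# Make $HOME/preproc point at $HOME/<newdir>, creating the directory if needed.
repoint_preproc() {
    newdir=$1                            # e.g. preproc-foo2 or preproc-junk
    mkdir -p "$HOME/$newdir" &&
    # -f replaces the existing link; -n stops ln from following it
    # and creating the new link inside the old directory.
    ln -sfn "$newdir" "$HOME/preproc"
}
```

Calling repoint_preproc preproc-junk once interception is done keeps later compilations from adding files to the captured directory.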

Weaknesses / Bugs

The primary assumption is that there is a binary file gcc-VERSION and that all other names such as "gcc" or "cc" are symbolic links (not hard-links) to gcc-VERSION. If this is not the case things will not work. In particular this assumption fails for Slackware.

Using this assumption, build-interceptor gets the gcc version at run time from the binary name. If you have multiple gcc versions installed simultaneously, they must be named gcc-x.y (e.g. /usr/bin/gcc-3.4) for this version detection to work.
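
The naming assumption can be checked from the shell before installing. This sketch follows the softlinks and reads the version off the final name; gcc_version_from_name is a hypothetical helper, and readlink -f is assumed to be available (it is a GNU coreutils option).

```shell
# Print VERSION if the given path resolves to a file named gcc-VERSION.
gcc_version_from_name() {
    real=$(readlink -f "$1") || return 1     # follow softlinks to the real binary
    name=$(basename "$real")
    case "$name" in
        gcc-*) printf '%s\n' "${name#gcc-}" ;;
        *)     echo "not of the form gcc-VERSION: $name" >&2
               return 1 ;;
    esac
}
```

A failure here for the gcc on your PATH suggests your system is laid out like the Slackware case mentioned above.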

Build-interceptor is continually changing to deal with various usage scenarios. There are some old scripts lying around that I don't want to get rid of but that are unlikely to work out of the box. If I don't explicitly mention that you should use a script, then it is not guaranteed to work.

Acknowledgments

I used code and ideas for build-process interception from two previous projects that dealt with this same problem: MOPS and the Cooperative Bug Isolation Project (CBI).

The idea of inserting metadata into an unused section in ELF .o files was borrowed from Ben Liblit and Hao Chen. I extended it back to the assembly stage.

Ben Liblit, Hao Chen, John Kodumal, and Simon Goldsmith contributed to the discussions leading to these scripts. Thanks especially to Simon Goldsmith for proof-reading this Readme [I of course take responsibility for any remaining mistakes].

Thanks to Andy Begel for his in-depth explanation of dynamic linking under various circumstances and operating systems.