The Artifact
An artifact is a set of files with the following properties:
- The creation process for those files is documented;
- The revisions of the source files and the precise version of other artifacts used in the build process are documented;
- The files are usable either by themselves or in conjunction with other artifacts. All the tools required to use the artifact must either be included or themselves be artifacts.
The main distinction between some random set of files and an artifact is the documentation requirement. It's a major bonus if that information is machine readable.
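One way to picture the machine-readable form of this documentation is a small manifest stored next to the files. The sketch below is only an illustration: the field names, the `build_manifest` helper and the choice of SHA-256 are assumptions, not a format defined in this article.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def build_manifest(files, source_revision, dependencies, build_command):
    """Collect the documented properties of an artifact into one dict.

    All field names here are hypothetical; any stable schema would do.
    """
    return {
        # How the files were created.
        "build_command": build_command,
        # Revision of the source files that went into the build.
        "source_revision": source_revision,
        # Precise versions of the other artifacts used in the build.
        "dependencies": dict(dependencies),
        # The files that make up the artifact, identified by content hash.
        "files": {Path(f).name: sha256_of(Path(f)) for f in files},
    }


def manifest_as_json(manifest) -> str:
    """Serialize the manifest with sorted keys so it diffs cleanly."""
    return json.dumps(manifest, indent=2, sort_keys=True)
```

Writing the result of `manifest_as_json` to a file alongside the installer would give both humans and tools the same record of what went into the build.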
Note that being able to reproduce the build is not a strict requirement. Reproducibility is very difficult to achieve in practice, so it is not reasonable to build a process that assumes it. Instead, it is better to design a process that allows the exact same artifact to be used in different settings (testing, staging and production), thereby avoiding rebuilds altogether.
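As a minimal sketch of that idea, the same file can be copied from one setting to the next and checked against a recorded digest instead of being rebuilt. The `promote` function, the directory-per-stage layout and the stage names as directory names are assumptions made for illustration only.

```python
import hashlib
import shutil
from pathlib import Path

# Hypothetical stage names, mirroring the settings mentioned in the text.
STAGES = ("testing", "staging", "production")


def promote(artifact: Path, expected_sha256: str, repo_root: Path, stage: str) -> Path:
    """Copy an already-built artifact into a stage directory, byte for byte.

    The artifact is never rebuilt; it is only verified and copied.
    """
    if stage not in STAGES:
        raise ValueError(f"unknown stage: {stage}")

    # Refuse to promote anything whose bytes differ from what was tested.
    actual = hashlib.sha256(artifact.read_bytes()).hexdigest()
    if actual != expected_sha256:
        raise ValueError(f"{artifact.name}: digest mismatch, refusing to promote")

    target_dir = repo_root / stage
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / artifact.name
    shutil.copy2(artifact, target)
    return target
```

Because the digest is checked at every step, whatever reaches production is the same set of bytes that was exercised in testing.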
Examples
Installers on a download page
The download page can fulfill the documentation requirement if it has links or other means to identify the revisions and other artifacts used in the construction of the installer. For example, this download page contains a variety of links detailing exactly what went into building the three installers. Drill down on the "Changes since last good build" and see that it includes not only the changes to the product itself but also the changes to its various dependencies. You can drill down further, all the way to the source code itself. Note that the information is also available in JSON format.
Source Packages
Since these contain just the source files and their build scripts, they trivially fulfill the documentation requirement. They are usually not very useful artifacts for the end user, but can often serve as prerequisite artifacts for building artifacts that are useful to the end user.
Java JAR or WAR files
By themselves, they do not quite carry enough information. Various systems (Ivy, Maven) have been built to attach the required documentation and make them into proper artifacts.
Notes
Why it's hard to reproduce builds
At first, it is not clear why it should be that hard. Part of it is that we generally are willing to accept "obvious" differences, as long as a reproduced build is "mostly" the same. After all, if builds were totally random, no software engineering could get done. The question is more one of risk analysis. What are the risks involved in rebuilding? Would you deploy a rebuilt artifact into production without testing?
- One obvious issue is that the timestamps will be different in a rebuild. This alone causes the built binaries to differ from the original ones. Many compilers and packaging systems embed timestamps into their output, and even the product itself might embed time-sensitive information, for example in an "Application > About" menu dropdown. If badly coded, such embedded information may sometimes even cause buffer overruns.
- Modern compilers and build systems are highly parallel, and therefore not deterministic. Due to random load and environmental issues, several equivalent traversals of a dependency tree or even a parse tree may be used, generating different results. This shouldn't cause bugs, but the binaries just might behave differently, ever so slightly.
- The original hardware or system configuration may no longer be available. This happens more often than one might think. For example, automated system updates followed by hardware upgrades may render even a faithful and complete image of the original system inoperative.
- Most build systems cannot reliably perform incremental builds under all circumstances. An artifact resulting from an incremental build might not always be reproducible from scratch. Obviously, one solution is to not use incremental builds and accept the time hit, but in practice this can be a hard sell: 5 minutes vs 3-4 hours, really?
One of my major misgivings about the otherwise excellent Maven build system is its release plugin, which makes the basic mistake of tagging prior to the build, thereby requiring you to rebuild for release. The obvious workaround is to always build for release, but that would be much slower.
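The timestamp issue from the list above is easy to demonstrate. In the sketch below, a gzip stream stands in for a packaging step, since the gzip header carries a modification time field. The `toy_build` function is a made-up stand-in for a real build, not any particular tool's behavior.

```python
import gzip
import io


def toy_build(source: bytes, build_time: int) -> bytes:
    """Package `source` into a gzip stream stamped with `build_time`.

    This is a hypothetical stand-in for a build step that embeds the
    time of the build into its output.
    """
    buf = io.BytesIO()
    # The mtime argument is written into the gzip header.
    with gzip.GzipFile(filename="", fileobj=buf, mode="wb", mtime=build_time) as gz:
        gz.write(source)
    return buf.getvalue()


def same_bytes(a: bytes, b: bytes) -> bool:
    """A rebuild only counts as identical if every byte matches."""
    return a == b
```

Calling `toy_build` twice with the same source but different `build_time` values yields outputs for which `same_bytes` returns False, even though both decompress to the same content. A checksum recorded for the original would therefore not match the rebuild.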
The Problem with Source Packages
Source packages are a common way to distribute open source software. Download the tarball, unpack, type "./configure; make; sudo make install" and you're done - most of the time. Assuming this works, there still is trouble:
- Build configuration and runtime configuration are invariably conflated. The result is by design not portable, which means you need to build and install on the production system, or design your own binary artifact mechanism.
- Development builds require system access, especially if you have a build dependency between two source package artifacts. You can work around this by defining the installation prefix appropriately, but that will then not catch permission problems in the production environment, missing user and group initialization steps, or similar problems.
Source packages are best used only as build-time dependencies to produce your own binary artifact, which should be made to travel between test, staging and production without modifications.
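For a binary artifact to travel unmodified, its runtime configuration has to come from the environment it lands in rather than from the build. One possible shape for that, assuming settings are passed as environment variables, is sketched here. The variable names `APP_DB_URL` and `APP_LOG_LEVEL` and the `RuntimeConfig` type are hypothetical.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class RuntimeConfig:
    """Settings that differ between test, staging and production."""
    db_url: str
    log_level: str


def load_runtime_config(environ=None) -> RuntimeConfig:
    """Read runtime settings at startup; nothing is baked in at build time.

    `environ` defaults to the process environment and can be replaced
    with a plain dict for testing.
    """
    env = os.environ if environ is None else environ
    return RuntimeConfig(
        db_url=env["APP_DB_URL"],                    # required, no default
        log_level=env.get("APP_LOG_LEVEL", "INFO"),  # optional
    )
```

With this split, the same artifact can be started in each setting with a different environment, and a missing required setting fails at startup with a `KeyError` instead of surfacing later.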
