GLEP 47: Creating 'safe' environment variables

Author Diego Pettenò, Fabian Groffen
Type Standards Track
Status Rejected
Version 1
Created 2005-10-14
Last modified 2014-01-23
Posting history 2006-02-09
GLEP source glep-0047.rst

Credits

The text of this GLEP is a result of a discussion and input of the following persons, in no particular order: Mike Frysinger, Diego Pettenò, Fabian Groffen and Finn Thain.

Abstract

In order for ebuilds and eclasses to be able to make host specific decisions, it is necessary to have a number of environmental variables which allow for such decisions. This GLEP introduces some measures that need to be made to make these decisions 'safe', by making sure the variables the decisions are based on are 'safe'. A small overlap with GLEP 22 [1] is being handled in this GLEP where the use of 2-tuple keywords are being kept instead of 4-tuple keywords. Additionally, the ELIBC, KERNEL and ARCH get auto filled starting from CHOST and the 2-tuple keyword, instead of solely from they 4-tuple keyword as proposed in GLEP 22.

The destiny of the USERLAND variable is out of the scope of this GLEP. Depending on its presence in the tree, it may be decided to set this variable the same way we propose to set ELIBC, KERNEL and ARCH, or alternatively, e.g. via the profiles.

Motivation

The Gentoo/Alt project is in an emerging state to get ready to serve a plethora of 'alternative' configurations such as FreeBSD, NetBSD, DragonflyBSD, GNU/kFreeBSD, Mac OS X, (Open)Darwin, (Open)Solaris and so on. As such, the project is in need for a better grip on the actual host being built on. This information on the host environment is necessary to make proper (automated) decisions on settings that are highly dependant on the build environment, such as platform or C-library implementation.

Rationale

Gentoo's unique Portage system allows easy installation of applications from source packages. Compiling sources is prone to many environmental settings and availability of certain tools. Only recently the Gentoo for FreeBSD project has started, as second Gentoo project that operates on a foreign host operating system using foreign (non-GNU) C-libraries and userland utilities. Such projects suffer from the current implicit assumption made within Gentoo Portage's ebuilds that there is a single type of operating system, C-libraries and system utilities. In order to enable ebuilds -- and also eclasses -- to be aware of these environmental differences, information regarding it should be supplied. Since decisions based on this information can be vital, it is of high importance that this information can be trusted and the values can be considered 'safe' and correct.

Backwards Compatibility

The proposed keywording scheme in this GLEP is fully compatible with the current situation of the portage tree, this in contrast to GLEP 22. The variables provided by GLEP 22 can't be extracted from the new keyword, but since GLEP 22-style keywords aren't in the tree at the moment, that is not a problem. The same information can be extracted from the CHOST variable, if necessary. No modifications to ebuilds will have to be made.

Specification

Unlike GLEP 22 the currently used keyword scheme is not changed. Instead of proposing a 4-tuple [2] keyword, a 2-tuple keyword is chosen for archs that require them. Archs for which a 1-tuple keyword suffices, can keep that keyword. Since this doesn't change anything to the current situation in the tree, it is considered to be a big advantage over the 4-tuple keyword from GLEP 22. This GLEP is an official specification of the syntax of the keyword.

Keywords will consist out of two parts separated by a hyphen ('-'). The part up to the first hyphen from the left of the keyword 2-tuple is the architecture, such as ppc64, sparc and x86. Allowed characters for the architecture name are in a-z0-9. The remaining part on the right of the first hyphen from the left indicates the operating system or distribution, such as linux, macos, darwin, obsd, et-cetera. If the right hand part is omitted, it implies the operating system/distribution type is Gentoo GNU/Linux. In such case the hyphen is also omitted, and the keyword consists of solely the architecture. The operating system or distribution name can consist out of characters in a-zA-Z0-9_+:-. Please note that the hyphen is an allowed character, and therefore the separation of the two fields in the keyword is only determinable by scanning for the first hyphen character from the start of the keyword string. Examples of keywords following this specification are ppc-darwin and x86. This is fully compatible with the current use of keywords in the tree.

The variables ELIBC, KERNEL and ARCH are currently set in the profiles when other than their defaults for a GNU/Linux system. They can as such easily be overridden and defined by the user. To prevent this from happening, the variables should be auto filled by Portage itself, based on the CHOST variable. While the CHOST variable can be as easy as the others set by the user, it still is assumed to be 'safe'. This assumption is grounded in the fact that the variable itself is being used in various other places with the same intention, and that an invalid CHOST will cause major malfunctioning of the system. A user that changes the CHOST into something that is not valid for the system, is already warned that this might render the system unusable. Concluding, the 'safeness' of the CHOST variable is based on externally assumed 'safeness', which discussion falls outside this GLEP.

Current USE-expansion of the variables is being maintained, as this results in full backward compatibility. Since the variables themselves don't change in what they represent, but only how they are being assigned, there should be no problem in maintaining them. Using USE-expansion, conditional code can be written down in ebuilds, which is not different from any existing methods at all:

...
RDEPEND="elibc_FreeBSD? ( sys-libs/com_err )"
...
src_compile() {
    ...
    use elibc_FreeBSD && myconf="${myconf} -Dlibc=/usr/lib/libc.a"
    ...
}

Alternatively, the variables ELIBC, KERNEL and ARCH are available in the ebuild environment and they can be used instead of invoking xxx_Xxxx or in switch statements where they are actually necessary.

A map file can be used to have the various CHOST values being translated to the correct values for the four variables. This change is invisible for ebuilds and eclasses, but allows to rely on these variables as they are based on a 'safe' value -- the CHOST variable. Ebuilds should not be sensitive to the keyword value, but use the aforementioned four variables instead. They allow specific tests for properties. If this is undesirable, the full CHOST variable can be used to match a complete operating system.

Variable Assignment

The ELIBC, KERNEL, ARCH variables are filled from a profile file. The file can be overlaid, such that the following entries in the map file (on the left of the arrow) will result in the assigned variables on the right hand side of the arrow:

*-*-linux-*      -> KERNEL="linux"
*-*-*-gnu        -> ELIBC="glibc"
*-*-kfreebsd-gnu -> KERNEL="FreeBSD" ELIBC="glibc"
*-*-freebsd*     -> KERNEL="FreeBSD" ELIBC="FreeBSD"
*-*-darwin*      -> KERNEL="Darwin" ELIBC="Darwin"
*-*-netbsd*      -> KERNEL="NetBSD" ELIBC="NetBSD"
*-*-solaris*     -> KERNEL="Solaris" ELIBC="Solaris"

A way to achieve this is proposed by Mike Frysinger, which suggests to have an env-map file, for instance filled with:

% cat env-map
*-linux-* KERNEL=linux
*-gnu ELIBC=glibc
x86_64-* ARCH=amd64

then the following bash script can be used to set the four variables to their correct values:

% cat readmap
#!/bin/bash

CBUILD=${CBUILD:-${CHOST=${CHOST:-$1}}}
[[ -z ${CHOST} ]] && echo need chost

unset KERNEL ELIBC ARCH

while read LINE ; do
    set -- ${LINE}
    targ=$1
    shift
    [[ ${CBUILD} == ${targ} ]] && eval $@
done < env-map

echo ARCH=${ARCH} KERNEL=${KERNEL} ELIBC=${ELIBC}

Given the example env-map file, this script would result in:

% ./readmap x86_64-pc-linux-gnu
ARCH=amd64 KERNEL=linux ELIBC=glibc

The entries in the env-map file will be evaluated in a forward linear full scan. A side-effect of this exhaustive search is that the variables can be re-assigned if multiple entries match the given CHOST. Because of this, the order of the entries does matter. Because the env-map file size is assumed not to exceed the block size of the file system, the performance penalty of a full scan versus 'first-hit-stop technique' is assumed to be minimal.

It should be noted, however, that the above bash script is a proof of concept implementation. Since Portage is largerly written in Python, it will be more efficient to write an equivalent of this code in Python also. Coding wise, this is considered to be a non-issue, but the format of the env-map file, and especially its wildcard characters, might not be the best match with Python. For this purpose, the format specification of the env-map file is deferred to the Python implementation, and only the requirements are given here.

The env-map file should be capable of encoding a key, value pair, where key is a (regular) expression that matches a chost-string, and value contains at least one, distinct variable assignment for the variables ARCH, KERNEL and ELIBC. The interpreter of the env-map file must scan the file linearly and continue trying to match the keys and assign variables if appropriate until the end of file.

Since Portage will use the env-map file, the location of the file is beyond the scope of this GLEP and up to the Portage implementors.

References

[1]GLEP 22, New "keyword" system to incorporate various userlands/kernels/archs, Goodyear, (https://www.gentoo.org/glep/glep-0022.html)
[2]For the purpose of readability, we will refer to 1, 2 and 4-tuples, even though tuple in itself suggest a field consisting of two values. For clarity: a 1-tuple describes a single value field, while a 4-tuple describes a field consisting out of four values.