GLEP 75: Split distfile mirror directory structure

Author Michał Górny <mgorny@gentoo.org>, Robin H. Johnson <robbat2@gentoo.org>
Type Standards Track
Status Draft
Version 1
Created 2018-01-26
Last modified 2018-01-27
Posting history 2018-01-27
GLEP source glep-0075.rst

Abstract

This GLEP describes the procedure for splitting the distfiles on mirrors into multiple directories with the goal of reducing the number of files in a single directory.

Motivation

At the moment, both the package manager and Gentoo mirrors use flat directory structure to store files. While this solution usually works, it does not scale well. Directories with large number of files usually have significant performance penalty, unless using filesystems specifically designed for that purpose.

According to the Gentoo repository state at 2018-01-26 16:23, there was a total of 62652 unique distfiles in the repository. While the users realistically hit around 10% of that, distfile mirrors often hold even more files — more so if old distfiles are not wiped immediately.

While all filesystems used on Linux boxes should be able to cope with a number that large, they may suffer a performance penalty with even a few thousand files. Additionally, if mirrors enable directory indexes then generating the index imposes both a significant server overhead and a significant data transfer. At this moment, the index of distfiles.gentoo.org has around 17 MiB.

Splitting the distfiles into multiple directories makes it possible to avoid those problems by reducing the number of files in a single directory. For example, splitting the forementioned set of distfiles into 16 directories that are roughly balanced allows to reduce the number of files in a single directory to around 4000. Splitting them further into 256 directories (16x16) results in 200-300 files per directory which should avoid any performance problems long-term, even assuming 300% growth of number of distfiles.

Specification

Mirror layout file

A mirror adhering to this specification should include a layout.conf file in the top distfile directory. This file uses the format derived from the freedesktop Desktop Entry Specification file format [1].

Before using each Gentoo mirror, the package manager should attempt to fetch (update) its layout.conf file and process it to determine how to use the mirror. If the file is not present, the package manager should behave as if it were empty.

The package manager should recognize the sections and keys listed below. It should ignore any unrecognized sections or keys — the format is intended to account for future extensions.

This specification currently defines one section: [structure]. This section defines one or more repository structure definitions using non-negative sequential integer keys. The definition with the 0 key is the most preferred structure. The package manager should ignore any formats it does not recognize. If this section is not present, the package manager should behave as if only flat structure were specified.

The following structure definitions are supported:

  • flat to indicate the traditional flat structure where all distfiles are located in the top directory,
  • filename-hash <algorithm> <cutoffs> to indicate the filename hash structure explained below.

Filename hash structure

When using the filename hash structure, the distfiles are split into directories whose names are derived from the hash of distfile filename. This structure has two parameters: algorithm name and cutoffs list.

The algorithm name must correspond to a valid Manifest hash name. An informational list of hashes is included in GLEP 74 [2], and the policies for introducing new hashes are covered by GLEP 59 [3].

The cutoffs list specifies one or more integers separated by colons (:), indicating the number of bits (starting with the most significant bit) of the hash used to form subsequent subdirectory names. For example, the list of 2:4 would indicate that top-level directory names are formed using 2 most significant bits of the hash (resulting in 2² = 4 directories), and each of this directories would have subdirectories formed using the next 4 bits of the hash (resulting in 2⁴ = 8 subdirectories each).

The exact algorithm for determining the distfile location follows:

  1. Let the distfile filename be F.
  2. Compute the hash of F and store its binary value as H.
  3. For each integer C in cutoff list:
    1. Take C most significant bits of H and store them as V.
    2. Convert V into hexadecimal integer, left padded with zeros to C/4 digits (rounded up) and append it to the path, followed by the path separator.
    3. Shift H left C bits.
  4. Finally, append F to the obtained path.

In particular, note that when using nested directories the subdirectories do not repeat the hash bits used in parent directory.

Migrating mirrors to the hashed structure

Since all distfile mirrors sync to the master Gentoo mirror, it should be enough to perform all the needed changes on the master mirror and wait for other mirrors to sync. The following procedure is recommended:

  1. Include the initial layout.conf listing only flat layout.
  2. Create the new structure alongside the flat structure. Wait for mirrors to sync.
  3. Once all mirrors receive the new structure, update layout.conf to list the filename-hash structure.
  4. Once a version of Portage supporting the new structure is stable long enough, remove the fallback flat structure from layout.conf and duplicate distfiles.

This implies that during the migration period the distfiles will be stored duplicated on the mirrors and therefore will occupy twice as much space. Technically, this could be avoided either by using hard links or symbolic links.

The hard link solution allows us to save space on the master mirror. Additionally, if -H option is used by the mirrors it avoids transferring existing files again. However, this option is known to be expensive and could cause significant server load. Without it, all mirrors need to transfer a second copy of all the existing files.

The symbolic link solution could be more reliable if we could rely on mirrors using the --links rsync option. Without that, symbolic links are not transferred at all.

Using hashed structure for local distfiles

The hashed structure defined above could also be used for local distfile storage as used by the package manager. For this to work, the package manager authors need to ensure that:

  1. The ${DISTDIR} variable in the ebuild scope points to a temporary directory where distfiles specific to the package are linked in a flat structure.
  2. All tools are updated to support the nested structure.
  3. The package manager provides a tool for users to easily manipulate distfiles, in particular to add distfiles for fetch-restricted packages into an appropriate subdirectory.

For extended compatibility, the package manager may support finding distfiles in flat and nested structure simultaneously.

Rationale

Algorithm for splitting distfiles

The possible algorithms were considered with the following goals in mind:

  • the number of files in a single directory should not exceed 1000,
  • the total size of files in a single directory is not considered relevant,
  • the solution should preferably be future-proof,
  • moving distfiles should be avoided once it is deployed.

It should also be noted that at this moment the package having most distfiles in Gentoo at the time is dev-texlive/texlive-latexextra, with the number of 8556 distfiles. All of them start with a common prefix of texlive-module-. This specific prefix is used by a total of 23435 distfiles.

In the original debate that occurred in bug #534528 [4] and the mailing list review of the initial version of this GLEP [5], four fundamental ideas for splitting distfiles were listed:

  1. using initial portion of filename,
  2. using initial portion of file hash,
  3. using initial portion of filename hash,
  4. using package category (and package name).

The initial filename idea was to use the first character of filename, possibly followed by a longer part which was the idea historically used e.g. by PyPI Python package hosting. Its main advantage is simplicity. The users can easily determine the correct subdirectory by just looking at the distfile name. Sadly, this solution is not only very uneven but does not solve the problem. As mentioned above, the TeΧ Live packages share a long common prefix that make it impossible to split it properly with other packages on fixed-length prefixes.

This idea has been followed by an adaptive proposal by Andrew Barchuk [6]. In this proposal, the filenames are not strictly mapped to groups by a common prefix but instead each group contains all files between two prefixes being used (like in a dictionary). However, it has been pointed out that while this option can provide very even results initially, it is impossible to predict how it would be affected by future distfile changes and there will be a risk of needing to change the groups in the future. Furthermore, it is relatively complex and requires explicitly listing or obtaining used groups.

Another option was to use an initial portion of distfile hashes. Its main advantage is that cryptographic hash algorithms can provide a more balanced split with random data. Furthermore, since hashes are stored in Manifests using them has no cost for users. However, this solution has three disadvantages:

  1. Not all files in the distfile tree are covered by package Manifests. Additional files are injected into the mirrors, and those will not have a clearly-defined location.
  2. User-provided distfiles (e.g. for fetch-restricted packages) with hash mismatches would be placed in the wrong subdirectory, potentially causing confusing errors.
  3. The hash values are unknown for newly-downloaded distfiles, so repoman (or an equivalent tool) would have to use a temporary directory before locating the file in appropriate subdirectory.

Using filename hashes has proven to provide a similar balance to using file hashes. Furthermore, since filenames are known up front this solution does not suffer from the listed problems. While hashes need to be computed manually, hashing short string should not cause any performance problems.

Jason Zaman has suggested to use package categories (and package names) [7]. However, this solution has multiple problems:

  1. it does not solve the problem for large packages such as TeΧ Live,
  2. it introduces many unnecessarily small directories,
  3. it requires an explicit knowledge of which package distfiles belong to,
  4. it does not provide an explicit solution to the problem of distfiles shared by multiple packages,
  5. it does not provide a solution to the problem of injected distfiles.

All the options considered, the filename hash solution was selected as one that solves all the forementioned problems while introducing relatively low complexity and being reasonably future-proof.

glep-0075-extras/by-filename.png

Distribution of distfiles by first character of filenames (note: y axis is on log scale)

glep-0075-extras/by-csum.png

Distribution of distfiles by first hex-digit of checksum (x — content checksum, + — filename checksum)

glep-0075-extras/by-csum2.png

Distribution of distfiles by two first hex-digits of checksum (x — content checksum, + — filename checksum)

Layout file

The presence of control file has been suggested in the original discussion. Its main purpose is to let package managers cleanly handle the migration and detect how to correctly query the mirrors throughout it. Furthermore, it makes future changes easier.

The format lines specifically mean to hardcode as little about the actual algorithm as possible. Therefore, we can easily change the hash used or the exact split structure without having to update the package managers or even provide a compatibility layout.

The file is also open for future extensions to provide additional mirror metadata. However, no clear use for that has been determined so far.

Hash algorithm

The hash algorithm support is fully deferred to the existing code in the package managers that is required to handle Manifests. In particular, it is recommended to reuse one of the hashes that are used in Manifest entries at the time. This avoids code duplication and reuses an existing mechanism to handle hash upgrades.

During the discussion, it has been pointed that this particular use case does not require a cryptographically strong hash and a faster algorithm could be used instead. However, given the short length of hashed strings performance is not a problem, and speed does not justify the resulting code duplication.

It has also been pointed out that e.g. the BLAKE2 hash family provides the ability of creating arbitrary length hashes instead of truncating the standard-length hash. However, not all implementations of BLAKE2 support that and relying on it could reduce portability for no apparent gain.

Backwards Compatibility

Mirror compatibility

The mirrored files are propagated to other mirrors as opaque directory structure. Therefore, there are no backwards compatibility concerns on the mirroring side.

Backwards compatibility with existing clients is detailed in migrating mirrors to the hashed structure section. Backwards compatibility with the old clients will be provided by preserving the flat structure during the transitional period.

The new clients will fetch the layout.conf file to avoid backwards compatibility concerns in the future. In case of hitting an old mirror, the package manager will default to the flat structure.

Package manager storage compatibility

The exact means of preserving backwards compatibility in package manager storage are left to the package manager authors. However, it is recommended that package managers continue to support the flat layout even if it is no longer the default. The package manager may either continue to read files from this location or automatically move them to an appropriate subdirectory.

References

[1]Desktop Entry Specification: Basic format of the file (https://standards.freedesktop.org/desktop-entry-spec/latest/ar01s03.html)
[2]GLEP 74: Full-tree verification using Manifest files: Checksum algorithms (informational) (https://www.gentoo.org/glep/glep-0074.html#checksum-algorithms-informational)
[3]GLEP 59: Manifest2 hash policies and security implications (https://www.gentoo.org/glep/glep-0059.html)
[4]Bug 534528 - distfiles should be sorted into subdirectories of DISTDIR (https://bugs.gentoo.org/534528)
[5][gentoo-dev] [pre-GLEP] Split distfile mirror directory structure (https://archives.gentoo.org/gentoo-dev/message/cfc4f8595df2edf9a25ba9ecae2463ba)
[6]Andrew Barchuk's reply on 'using character ranges for each directory computed in a way to have the files distributed evenly' (https://archives.gentoo.org/gentoo-dev/message/611bdaa76be049c1d650e8995748e7b8)
[7]Jason Zamal's reply including 'using the same dir layout as the packages themselves) (https://archives.gentoo.org/gentoo-dev/message/f26ed870c3a6d4ecf69a821723642975)