Globbing

Herein, globbing refers to using filesystem information (such as which files have a .cxx extension) to configure targets and other project properties, as opposed to explicitly listing each target’s sources in CMakeLists.txt.

The case for globbing

Any discussion of globbing in a CMake project must begin with the admonition against it in CMake’s own documentation (skip to usage). The main reasons cited for avoiding globs are:

  • Not all generators support glob-dependent reconfiguration.

  • There may be files which match a glob unintentionally (for example temporary files generated by a tool) which pessimize or invalidate build configuration.

  • If configuration depends on globs, then each build must check that their results have not changed, which introduces overhead.

If a project must support all generators, or must accommodate tools which introduce spurious glob matches, then globbing is not an option. No single choice of supported workflows is correct for every project, so I think a blanket prohibition against a technique is less beneficial than a description of its relative merits.

In Maud’s case the C++20 modular structure is central and every generator which supports C++20 modules also supports glob-dependent reconfiguration, so avoiding globs would not expand Maud’s generator support.

As for tools which touch the source tree: even in projects where globbing is not used I frequently have multiple worktrees associated with the repository to isolate those tools from (for example) a build which I don’t want to invalidate. Perhaps some would find this unacceptably inelegant.

Globbing performance

One of the project tests is a benchmark of globbing overhead. On my machine, the output looks like:

$ ./test_.project --gtest_filter=*bench* | grep -E "^BENCHMARK" -A 10 -B 0
BENCHMARK
--      Writing:            ( mean=3602.278     min=3381.849    ) ms
--      New checking:       ( mean=848.554      min=826.814     ) ms
--      Globbing:           ( mean=929.903      min=909.522     ) ms
--      Globbing(fd):       ( mean=299.474      min=292.794     ) ms
--      Globbing(git):      ( mean=311.838      min=305.374     ) ms
--      Filtering:          ( mean=87.323       min=85.166      ) ms
--      Loading the cache:  ( mean=24.618       min=22.662      ) ms
--
    8 iterations with 160000 files

(Parameters chosen to approximate the llvm-project repository at the time of writing in number of files and directory depth (median=4).) Writing serves as a baseline of the filesystem’s speed: a simulated project with 160,000 empty files is generated, which takes a few seconds. New checking is another useful baseline: accessing the mtime of every file takes a little less than a second.

The benchmark’s Globbing result shows that using file(GLOB_RECURSE) to list all files and directories in the simulated project also takes a little less than a second. Delegating to a dedicated globbing utility, as in the Globbing(fd) and Globbing(git) results, can reduce that time significantly for large projects (to roughly 300 ms here). Maud’s globbing aggressively caches results, filtering from those cached results on each new glob. This means the overhead of actual filesystem access is only paid once per build; each new glob incurs less than a tenth of that overhead (about 87 ms of Filtering against about 930 ms of Globbing in the run above).

Loading the cache is also a once-per-build overhead (about 25 ms in the run above). Maud stores glob results in ${CMAKE_BINARY_DIR}/CMakeCache.txt, which must be loaded by the CMake scripts that verify globs have not changed.

In testing on multiple machines and simulated project sizes, Globbing overhead remains comparable to New checking. The latter is an unavoidable once-per-build overhead even if globbing is not used, since each source file’s mtime must be checked to determine if it must be recompiled. To me, adding this overhead again seems acceptable. There may be projects where that added overhead is unacceptable; in that case, I’m glad this benchmark was useful to decide that quantitatively… but I’d be more glad of a PR to increase Maud’s globbing performance.

glob

glob(
  name
  [CONFIGURE_DEPENDS]
  [EXCLUDE_RENDERED]
  < inclusion_regex | ! exclusion_regex >...
)

Declare a glob. A list containing the absolute paths of matching files and directories will be stored in a CACHE variable with the provided name. All files in ${CMAKE_SOURCE_DIR} as well as generated files in ${MAUD_DIR}/rendered are examined for inclusion in the glob. Files and directories whose names begin with . are excluded from all globs.

Glob results are updated as part of the main build system check target, so during reconfiguration calls to glob() are a no-op (because the CACHE variable is already up-to-date). Scripts which load the cache can access the variable normally.
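
As a minimal sketch of declaring a project-specific glob and reading its results: the variable name MY_SHADER_SOURCES and the .glsl extension are hypothetical, chosen only for illustration, and the snippet assumes it runs inside a Maud project where glob() is available.

```cmake
# Hypothetical glob: collect GLSL shaders from the source tree.
glob(
  MY_SHADER_SOURCES   # name of the CACHE variable which receives the list
  CONFIGURE_DEPENDS   # regenerate if the set of matches changes
  EXCLUDE_RENDERED    # ignore generated files in ${MAUD_DIR}/rendered
  "[.]glsl$"
)

# The results are an ordinary CMake list of absolute paths.
foreach(shader IN LISTS MY_SHADER_SOURCES)
  message(STATUS "Found shader: ${shader}")
endforeach()
```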

CONFIGURE_DEPENDS

If this flag is specified, then in addition to updating the glob’s results, the check target will trigger regeneration if the results change.

EXCLUDE_RENDERED

Generated files will be ignored if this flag is specified.

< inclusion_regex | ! exclusion_regex >...

Each pattern is a REGEX which is applied to each candidate file’s relative path: relative to ${CMAKE_SOURCE_DIR} for source files, or relative to ${MAUD_DIR}/rendered for generated files.

Patterns are evaluated in series, starting with an empty result set. Inclusion patterns are applied to all files and any matches are added to the result set. Exclusion patterns are applied to the result set and any matches are removed. So for example [.](cxx|hxx)   !(^|/)_   !thirdparty would include hello.cxx and hello.hxx, but would exclude _disabled.cxx and any files in world_thirdparty/.
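
Because patterns are evaluated in series, their order matters. The following sketch uses a hypothetical variable name and hypothetical paths, and assumes (reading the rules above literally) that a later inclusion pattern can re-add a file which an earlier exclusion removed.

```cmake
glob(
  MY_SOURCES                       # hypothetical name
  "[.]cxx$"                        # 1. add every .cxx file
  "!^experimental/"                # 2. remove everything under experimental/
  "^experimental/stable[.]cxx$"    # 3. add this one file back
)
# Given src/a.cxx, experimental/b.cxx and experimental/stable.cxx, the result
# set would hold src/a.cxx and experimental/stable.cxx (as absolute paths).
# Swapping patterns 2 and 3 would leave only src/a.cxx, since the exclusion
# would then run last.
```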

Built-in globs

Maud uses several globs internally:

MAUD_IN2_TEMPLATES

By default, this includes all files with extension .in2.

These files will be rendered and the results included in subsequent globs.

MAUD_INCLUDE_DIRS

By default, this includes all directories named include.

These directories will be added to the project-wide include path.

MAUD_CXX_MODULE_SOURCES

By default, this includes all files with any extension in .cxx .cxxm .ixx .mxx .cpp .cppm .cc .ccm .c++ .c++m.

These files will be scanned for C++ Modules, and the results used to define targets and linkage.

MAUD_CXX_FORMATTED_SOURCES

By default, this includes all files with any extension listed for MAUD_CXX_MODULE_SOURCES or in .hxx .hpp .hh .h++ .h.

These files will be tested for consistent formatting using clang-format.

MAUD_DOCUMENTATION_SOURCES

By default, this includes all files with any extension in .rst .myst .md, excluding those whose STEM is spelled in SHOUTY_SNAKE_CASE (to avoid building documentation from README.md when not explicitly included).

These will be passed to Sphinx and used to build Documentation.

To override any of these, call glob() to set the CACHE variable before it is required by Maud. For example, to scan only src/**.ixx files, write:

glob(
  MAUD_CXX_MODULE_SOURCES
  CONFIGURE_DEPENDS
  "^src/.*[.]ixx$"
)

(The CMake inclusion globs cannot be overridden this way.)

Maud relies on build files being excluded from globs of source files, which is ensured by default: the default build directory name is .build/ and all globs exclude directories and files whose names start with a dot (.). If a non-default build directory name is used or the globs are adjusted from their defaults, then the user must ensure build files are still excluded from globs. I recommend upholding the convention by naming build directories .$name and excluding .$name from globs with the pattern !(/|^)[.]
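
For instance, an override of a built-in glob could restate the dot exclusion explicitly. This is only a sketch: the src/ layout is hypothetical, and the explicit exclusion may be redundant with Maud’s default handling of dot-prefixed names.

```cmake
glob(
  MAUD_CXX_MODULE_SOURCES
  CONFIGURE_DEPENDS
  "^src/.*[.]ixx$"   # hypothetical layout: module sources live under src/
  "!(/|^)[.]"        # drop any path with a dot-prefixed component, such as .build/
)
```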