Add per-site cache mechanism to speed up site resolve
The slowest part of using the hab cli is the globbing of config/distro
paths (especially for network paths on Windows). Individually parsing
hundreds of json files is also slower than parsing a single json file
containing the same data, which caching helps with.
MHendricks committed Jan 19, 2024
1 parent d1e97eb commit 641614a
Showing 8 changed files with 436 additions and 92 deletions.
108 changes: 76 additions & 32 deletions README.md
@@ -417,6 +417,41 @@ Note the order of left/middle/right in the test_paths variable. Also, for
site file with it defined is used. The other path maps are picked up from the
site file they are defined in.

#### Platform Path Maps

The site setting `platform_path_maps` is a dictionary. Each key is a unique name
for a mapping, and each value is a dictionary of leading directory paths for each platform.
[PurePath.relative_to](https://docs.python.org/3/library/pathlib.html#pathlib.PurePath.relative_to)
is used to match, so full directory names need to be used. The unique name allows
multiple site json files to override the setting, and is also used to convert
resolved file paths to str.format style paths (`{server-main}/folder/file.txt`).
If multiple site json files specify the same key, the right-most site json file
specifying that key is used. It is safe to use forward slashes for Windows paths.

```json
{
    "append": {
        "platform_path_maps": {
            "server-main": {
                "linux": "/mnt/main",
                "windows": "//example//main"
            },
            "server-dev": {
                "linux": "/mnt/dev",
                "windows": "//example//dev"
            }
        }
    },
    "set": {
        "platforms": ["linux", "windows"]
    }
}
```

With these settings, if a path on a linux host starts with `/mnt/main`, then
generating the corresponding windows file path translates that prefix to
`\\example\main`. Note the use of `platforms` to disable osx platform support.
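
As a rough illustration of the translation described above, the following sketch maps a path from one platform to another with `PurePath.relative_to`. It is not hab's implementation: the `translate` helper, the `PATH_CLASSES` table and the single-slash `//example/main` spelling of the share are assumptions made for this example.

```python
from pathlib import PurePosixPath, PureWindowsPath

# Hypothetical mappings modelled on the site file above.
PLATFORM_PATH_MAPS = {
    "server-main": {"linux": "/mnt/main", "windows": "//example/main"},
    "server-dev": {"linux": "/mnt/dev", "windows": "//example/dev"},
}
# Assumed lookup from platform name to the matching pure path class.
PATH_CLASSES = {"linux": PurePosixPath, "windows": PureWindowsPath}


def translate(path, src, dst, maps=PLATFORM_PATH_MAPS):
    """Swap the leading directory of `path` from platform `src` to `dst`."""
    src_cls, dst_cls = PATH_CLASSES[src], PATH_CLASSES[dst]
    src_path = src_cls(path)
    for mapping in maps.values():
        try:
            # Raises ValueError unless src_path is inside the mapped directory.
            relative = src_path.relative_to(src_cls(mapping[src]))
        except ValueError:
            continue
        return str(dst_cls(mapping[dst], *relative.parts))
    # No mapping matched, so the path is returned unchanged.
    return str(src_path)


print(translate("/mnt/main/folder/file.txt", "linux", "windows"))
print(translate("/mnt/mainframe/file.txt", "linux", "windows"))
```

Because `relative_to` compares whole path components, `/mnt/mainframe/file.txt` does not match the `/mnt/main` mapping and is left untouched, which is why full directory names are required.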

#### Hab Entry Points

The site file can be used to replace some hab functionality with custom plugins.
@@ -509,6 +544,37 @@ Alternatively, you could create a second host site file named `c:\hab\host_no_gu
put the gui disabling config in that file, and on the hosts where you want to disable
the gui, prepend it: `HAB_PATHS=c:\hab\host_no_gui.json;c:\hab\host.json;\\server\share\studio.json`.

#### Habcache

By default hab has to find and process all available configs and distros every
time it's launched. To do this it globs the file paths in `config_paths` and
`distro_paths`, and parses each file it finds. As you add more distro versions
and configs this can slow down the launching of hab. This is especially true
when storing them on the network and when using Windows.

To address this you can add per-site habcache files. A habcache file is a
cross-platform collection of all of the files found for a specific site file's
`config_paths` and `distro_paths` glob strings.

To enable caching run `hab cache /path/to/site_file.json`. This will create a
habcache file next to the `site_file.json`.

It will be named matching `site["site_cache_file_template"][0]`, which defaults
to `{stem}.habcache` where stem is the site filename without extension. For the
example command it would create the file `/path/to/site_file.habcache`. To ensure
cross platform support, make sure your `HAB_PATHS` configuration contains all of
the required [`platform_path_maps`](#platform-path-maps) site mappings.
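
The naming rule can be sketched in a few lines of Python. This mirrors the `{stem}.habcache` template described above; the `habcache_path` helper name is hypothetical and only illustrates how the template is expanded next to the site file.

```python
from pathlib import PurePosixPath


def habcache_path(site_file, template="{stem}.habcache"):
    """Return the cache file path that sits next to `site_file`."""
    site_file = PurePosixPath(site_file)
    # `stem` is the site filename without its extension.
    return site_file.parent / template.format(stem=site_file.stem)


print(habcache_path("/path/to/site_file.json"))  # /path/to/site_file.habcache
```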

While the `site_file.habcache` exists and `HAB_PATHS` includes `site_file.json`
hab will use the cached value unless the `--no-cache` flag is used. After adding,
updating or removing a config or distro, you will need to run the `hab cache`
command to update the cache with your changes. If using a distribution ci you
should add this command call there.

The habcache is cross-platform as long as the hab site configuration loaded when
calling `hab cache` has all of the required `platform_path_maps` defined. The
cache will replace the start of file paths matching one of the current platform's
mapping values with the mapping's key.
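
A minimal sketch of that substitution, assuming a linux host and the hypothetical mappings below. The `to_template` and `from_template` helpers are illustrative names, not part of hab's API; they show how a path can be stored with the mapping key and expanded again for another platform with `str.format`.

```python
from pathlib import PurePosixPath

# Hypothetical mappings; forward slashes are used for the windows values.
MAPS = {
    "server-main": {"linux": "/mnt/main", "windows": "//example/main"},
    "server-dev": {"linux": "/mnt/dev", "windows": "//example/dev"},
}


def to_template(path, maps, platform="linux"):
    """Replace a leading mapped directory with its `{key}` placeholder."""
    path = PurePosixPath(path)
    for key, mapping in maps.items():
        try:
            relative = path.relative_to(mapping[platform])
        except ValueError:
            continue
        return "/".join(("{" + key + "}", *relative.parts))
    # Paths outside every mapping are stored as-is.
    return path.as_posix()


def from_template(template, maps, platform):
    """Expand `{key}` placeholders using the values for `platform`."""
    return template.format(**{key: m[platform] for key, m in maps.items()})


stored = to_template("/mnt/main/folder/file.txt", MAPS)
print(stored)
print(from_template(stored, MAPS, "windows"))
```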

### Python version

@@ -528,9 +594,14 @@ the scripts:
* `colorize`: If `hab dump` should colorize its output for ease of reading.
* `config_paths`: Configures where URI configs are discovered. See below.
* `distro_paths`: Configures where distros are discovered. See below.
* `platform_path_maps`: Configures mappings used to convert paths from one
operating system to another. This is used by the freeze system to ensure that if
unfrozen on another platform it will still work.
* `ignored_distros`: Don't use distros that have this version number. This makes
it possible for a ci to deploy some non-versioned copies of distros next to the
distros so non-hab workflows can access known file paths. For example this could
be used to put a latest folder next to each of the releases of a distro and not
have to remove the .hab.json file in that folder.
* [`platform_path_maps`](#platform-path-maps): Configures mappings used to convert
paths from one operating system to another. This is used by the freeze system to
ensure that if unfrozen on another platform it will still work.
* `platforms`: A list of platforms that are supported by these hab configurations.
When using freeze, all of these platforms will be stored. Defaults to linux, osx, windows.
* `prereleases`: If pre-release distros should be allowed. Works the same as
@@ -545,6 +616,8 @@ to override the default(as long as its not disabled.) `hab --prefs dump ...`.
than this duration, force the user to re-save the URI returned for `-` when using
the `--save-prefs` flag. To enable a timeout set this to a dictionary of kwargs
to initialize a `datetime.timedelta` object.
* `site_cache_file_template`: The str.format template defining the name of
[habcache](#habcache) files.

`config_paths` and `distro_paths` take a list of glob paths. For a given glob
string in these variables you can not have duplicate values. For configs a
@@ -560,35 +633,6 @@ global shared configs/distros they are not working on.
See [specifying distro version](#specifying-distro-version) for details on specifying a
distro version in a git repo.

`platform_path_maps` is a dictionary, the key is a unique name for each mapping,
and value is a dictionary of leading paths for each platform. The unique name
allows for multiple site json files to override the setting. If multiple site
json files specify the same key, the right-most site json file specifying that
key is used.

```json
{
    "append": {
        "platform_path_maps": {
            "server-main": {
                "linux": "/mnt/main",
                "windows": "\\\\example\\main"
            },
            "server-dev": {
                "linux": "/mnt/dev",
                "windows": "\\\\example\\dev"
            }
        }
    },
    "set": {
        "platforms": ["linux", "windows"]
    }
}
```

With these settings, if a path on a linux host, starts with `/mnt/main` when
generating the corresponding windows file path it will translate it to
`\\example\main`. Note the use of `platforms` to disable osx platform support.

### Distro

186 changes: 186 additions & 0 deletions hab/cache.py
@@ -0,0 +1,186 @@
import glob
import json
import logging
from pathlib import Path

from packaging.version import InvalidVersion

from . import utils
from .errors import InvalidVersionError, _IgnoredVersionError

logger = logging.getLogger(__name__)


class Cache:
    """Used to save/restore cached data to speed up initialization of hab.

    The caches are stored per-site file as a file next to the site file using the
    same stem name. (i.e. by default studio.json would have a cache file called
    studio.cache).

    If this cache file exists it is used unless enabled is set to False. Cache
    files are useful when you have some sort of CI setup to ensure the cache is
    re-generated using `save_cache` any time you make changes to configs or
    distros that site file references.

    Properties:
        cache_template (str): The str.format template used to find the cache files.
            This template requires the kwarg `stem`.
        enabled (bool): Used to disable using of the cached data forcing a full
            glob and parse of files described by all site files.
    """

    def __init__(self, site):
        self.site = site
        self._cache = None
        self.enabled = True

        # Get the template filename used to find the cache files on disk
        self.cache_template = self.site.get("site_cache_file", "{stem}.cache")

    @property
    def cached_keys(self):
        """A dict of cache keys and how they should be processed.
        {Name of key to cache: ("relative file glob", class used to process)}
        """
        try:
            return self._cached_keys
        except AttributeError:
            pass

        from .parsers import Config, DistroVersion

        self._cached_keys = {
            "config_paths": ("*.json", Config),
            "distro_paths": ("*/.hab.json", DistroVersion),
        }
        return self._cached_keys

    def cache(self, force=False):
        if not self.enabled:
            # If caching is disabled, never attempt to load the cache
            return {}

        if self._cache is not None and not force:
            return self._cache

        self._cache = {}

        # Process caches from right to left. This makes it so the left most
        # cache_file is respected if any paths are duplicated.
        for path in reversed(self.site.paths):
            cache_file = self.site_cache_path(path)
            if cache_file.is_file():
                logger.debug(f"Site cache loading: {cache_file!s}")
                self.load_cache(cache_file)

        # Create a flattened cache removing the glob paths.
        flat_cache = {key: {} for key in self.cached_keys}
        for key in self._cache:
            for values in self._cache.get(key, {}).values():
                flat_cache[key].update(values)

        self._cache["flat"] = flat_cache

        return self._cache

    def config_paths(self, flat=False):
        if flat:
            return self.cache().get("flat", {}).get("config_paths", {})
        return self.cache().get("config_paths", {})

    def distro_paths(self, flat=False):
        if flat:
            return self.cache().get("flat", {}).get("distro_paths", {})
        return self.cache().get("distro_paths", {})

    def generate_cache(self, resolver, site_file, version=1):
        """Generate a cache file of the current state defined by this site file.

        This contains the raw values of each URI config and distro file including
        version. If this cache exists it is used instead of searching the file
        system for each path defined in config_paths or distro_paths defined in
        the provided site file. Use this method any time changes are made that
        hab needs to be aware of. Caching is enabled by the existence of this file.
        """
        from .site import Site

        output = {"version": version}

        # read the site file to get paths to process
        temp_site = Site([site_file])

        for key, stats in self.cached_keys.items():
            glob_str, cls = stats
            # Process each glob dir defined for this site
            for dirname in temp_site.get(key, []):
                cfg_paths = output.setdefault(key, {}).setdefault(
                    dirname.as_posix(), {}
                )

                # Add each found hab config to the cache
                for path in sorted(glob.glob(str(dirname / glob_str))):
                    path = Path(path)
                    try:
                        data = cls(forest={}, resolver=resolver)._load(
                            path, cached=False
                        )
                    except (
                        InvalidVersion,
                        InvalidVersionError,
                        _IgnoredVersionError,
                    ) as error:
                        logger.debug(str(error))
                    else:
                        cfg_paths[path.as_posix()] = data

        return output

    @classmethod
    def iter_cache_paths(cls, name, paths, cache, glob_str=None, include_path=True):
        """Yields path information stored in the cache falling back to glob if
        not cached.

        Yields:
            dirname: Each path stored in paths.
            path: Each file found for dirname, or None if include_path is False.
            cached: True if the cache was used for dirname, False if it was globbed.
        """
        for dirname in paths:
            dn_posix = dirname.as_posix()
            cached = dn_posix in cache
            if cached:
                logger.debug(f"Using cache for {name} dir: {dn_posix}")
                paths = cache[dn_posix]
            else:
                logger.debug(f"Using glob for {name} dir: {dirname}")
                # Fallback to globbing the file system
                if glob_str:
                    paths = sorted(glob.glob(str(dirname / glob_str)))
                else:
                    paths = []
            if not include_path:
                yield dirname, None, cached
            else:
                for path in paths:
                    yield dirname, path, cached

    def load_cache(self, filename):
        """For each glob dir add or replace the contents. If a previous cache
        has the same glob dir, its cache is ignored. This expects that
        load_cache is called from right to left for each path in `self.site.paths`.
        """
        contents = utils.load_json_file(filename)
        for key in self.cached_keys:
            if key in contents:
                self._cache.setdefault(key, {}).update(contents[key])

    def save_cache(self, resolver, site_file, version=1):
        cache_file = self.site_cache_path(site_file)
        cache = self.generate_cache(resolver, site_file, version=version)

        with cache_file.open("w") as fle:
            json.dump(cache, fle, indent=4, cls=utils.HabJsonEncoder)
        return cache_file

    def site_cache_path(self, path):
        """Returns the name of the cache file for the given site file."""
        return path.parent / self.cache_template.format(stem=path.stem)
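
The right-to-left merge and flattening done by `cache()` and `load_cache()` can be tried out on plain dictionaries. The sketch below is a standalone approximation, not the class itself: `merge_caches` is a hypothetical helper and the sample data stands in for the contents of two cache files.

```python
def merge_caches(per_site_caches, keys=("config_paths", "distro_paths")):
    """Merge cache dicts ordered left to right, letting the left-most win."""
    merged = {}
    # Load right to left so later (left-most) updates replace earlier ones.
    for contents in reversed(per_site_caches):
        for key in keys:
            if key in contents:
                merged.setdefault(key, {}).update(contents[key])

    # Flatten away the glob directory level: {file path: data}.
    flat = {key: {} for key in keys}
    for key in keys:
        for files in merged.get(key, {}).values():
            flat[key].update(files)
    merged["flat"] = flat
    return merged


left = {"config_paths": {"/cfg": {"/cfg/a.json": {"name": "left"}}}}
right = {
    "config_paths": {
        "/cfg": {"/cfg/a.json": {"name": "right"}, "/cfg/b.json": {"name": "b"}}
    },
    "distro_paths": {"/distros": {"/distros/x/.hab.json": {"name": "x"}}},
}
result = merge_caches([left, right])
print(result["flat"]["config_paths"])
```

Note that the merge replaces a whole glob directory at a time, so in this example the right-hand `/cfg/b.json` entry is dropped because the left-hand cache also defines `/cfg`.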