Continuous Integration

GitHub: https://github.com/JCSDA-internal/ci

Documentation: via multiple READMEs inside the GitHub repository

Table of Contents

About

CI System Information

Quick reference


Presubmit tests can be controlled by single-line annotations in the pull
request description. These annotations will be re-examined for each run.
Here is an example of their use:

# Build tests with other unsubmitted packages.
build-group=https://github.com/JCSDA-internal/oops/pull/2284
build-group=https://github.com/JCSDA-internal/saber/pull/651

# Disable the build-cache for tests.
jedi-ci-build-cache=skip

Each configuration setting must be on a single line, but order and
position do not matter.

# Enable tests for your draft PR (disabled by default).
run-ci-on-draft=true

# Select the compiler used by CI (defaults to random choice).
jedi-ci-test-select=gcc

# Select the jedi-bundle branch used for building. Using this option
# disables the build cache.
jedi-ci-bundle-branch=feature/my-bundle-change
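Because each annotation is a single `key=value` line, standard line-based tools can extract them. The following is an illustrative sketch only, not the CI system's actual parser; the helper name and the temporary file path are made up for the example:

```shell
# Sketch: read a single-valued annotation from a PR description saved to a
# file, falling back to a default when the key is absent. Not the real
# CI implementation.
ci_annotation() {
    key="$1"; file="$2"; default="$3"
    value=$(grep -m1 "^${key}=" "$file" | cut -d= -f2-)
    if [ -n "$value" ]; then
        printf '%s\n' "$value"
    else
        printf '%s\n' "$default"
    fi
}

# Example PR description for illustration.
cat > /tmp/pr_body.txt <<'EOF'
Fix interpolation bug.
jedi-ci-build-cache=skip
run-ci-on-draft=true
EOF

ci_annotation jedi-ci-build-cache /tmp/pr_body.txt use      # prints: skip
ci_annotation jedi-ci-test-select /tmp/pr_body.txt random   # prints: random
```

Note that `cut -d= -f2-` keeps everything after the first `=`, so values containing URLs survive intact.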

Specifying a Build Group

In the default configuration, the CI system builds candidate code against
the latest submitted version of each package in the jedi-bundle. A pull
request can be built against unsubmitted versions of specific packages by
specifying each version with a tag in the pull request description.
Multiple tags may be added as long as each tag is on its own line of the
pull request description.

build-group=https://github.com/JCSDA-internal/oops/pull/2284

Selecting a Compiler

To save cloud compute resources, the CI test environment randomly selects
one of our three environments. If you want tests with a specific compiler,
set the annotation jedi-ci-test-select to gcc, intel, or gcc11. Please do
not use the special value all unless you have an especially risky change
known to affect all compilers or the CI environment.
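The selection behavior described above amounts to "use the annotation if present, otherwise pick at random." A minimal sketch of that logic, assuming the PR description has been saved to a file (this is illustrative only, not the CI system's actual code):

```shell
# Sketch of compiler selection: honor jedi-ci-test-select if set,
# otherwise choose one of the three environments at random.
pr_body=/tmp/pr_body_compiler.txt
cat > "$pr_body" <<'EOF'
jedi-ci-test-select=intel
EOF

compiler=$(grep -m1 '^jedi-ci-test-select=' "$pr_body" | cut -d= -f2)
if [ -z "$compiler" ]; then
    # No annotation present: random choice among the environments.
    compiler=$(shuf -n1 -e gcc intel gcc11)
fi
echo "$compiler"   # prints: intel (for this example input)
```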

Build Cache

The CI system relies on a build cache to speed up the build process. Some
changes can cause build failures arising from use of the cache. The CI
system has two controls for modifying cache behavior.

The build cache can be disabled by adding the annotation
jedi-ci-build-cache=skip to the PR description.

If it is necessary to rebuild the entire cache to remove a bug in the cached
binaries, add the annotation jedi-ci-build-cache=rebuild to the PR
description.
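The two controls give three possible cache behaviors: skip, rebuild, or the default of using the existing cache. A sketch of that three-way decision (illustrative only; the file path and messages are made up, and this is not the CI system's actual implementation):

```shell
# Sketch: map the jedi-ci-build-cache annotation onto the three
# cache behaviors described above.
printf 'jedi-ci-build-cache=rebuild\n' > /tmp/pr_body_cache.txt

mode=$(grep -m1 '^jedi-ci-build-cache=' /tmp/pr_body_cache.txt | cut -d= -f2)
case "$mode" in
    skip)    echo "building without the cache" ;;
    rebuild) echo "rebuilding the cache from scratch" ;;
    *)       echo "using the existing build cache" ;;
esac
```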

CI Development and Debug Options:

USE THESE OPTIONS WITH CAUTION

FAQ

Q: Why is this test running?

A: This test was run by the JEDI CI system, whose code is hosted at
github.com/JCSDA-internal/ci.

Q: My draft pull request's tests are not running.

A: You must enable tests for draft PRs by adding the annotation
  run-ci-on-draft=true in the pull request description.

Q: How can a test "pass with failures"?

A: Because the integration test is much larger than typical unit tests, a
small number of flaky test failures is allowed. Over time we track the
repeatedly flaky tests and fix them. Please examine any failures
carefully to ensure they were not caused by your change.

Q: Why can't I access the build log?

A: The AWS-hosted build logs require a login to the jcsda-usaf AWS
account. We also provide a public build log, available to anyone with the
link, but this log file is not available until all tests are complete for
an environment.

Administrative Tasks (For JEDI Infra team)

Updating CI instance disk space

Use the following procedure to increase disk space for the CI instances if they are running out of space:

  1. Add an additional EBS volume and mount it on the instance.
  2. Move the spack-stack build and source caches there and link them back to their current locations.
  3. Turn the swapfile off on the root filesystem and enable one on the new volume.

As a record of how this was last done:

  1. Created a 500 GB EBS volume "Ubuntu 22.04 CI Intel".
  2. Mounted it on the EC2 instance following https://docs.aws.amazon.com/ebs/latest/userguide/ebs-attaching-volume.html
  3. Partitioned the volume, created an ext4 filesystem, and mounted it as /mnt/addon via an /etc/fstab entry, following standard Linux practice; see https://docs.aws.amazon.com/ebs/latest/userguide/ebs-using-volumes.html for one of many tutorials.
  4. Moved the spack source and build caches to /mnt/addon/spack-stack/{build,source}-cache
  5. Created a 128 GB swapfile at /mnt/addon/swapfile and removed the 64 GB swapfile at /swapfile (including fstab entries); this again is Linux boilerplate, see e.g. https://phoenixnap.com/kb/linux-swap-file
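The volume and swapfile steps above correspond to standard Linux commands along the following lines. The device name /dev/xvdf and the variable $SPACK_CACHE_DIR (the caches' original location) are assumptions for illustration; confirm the actual device name with lsblk first, and note all of these commands require root:

```
# Assumption: the new volume appears as /dev/xvdf (verify with lsblk).
sudo mkfs -t ext4 /dev/xvdf
sudo mkdir -p /mnt/addon
sudo mount /dev/xvdf /mnt/addon    # add a matching /etc/fstab entry to persist

# Relocate the spack-stack caches and link them back to their old paths.
# $SPACK_CACHE_DIR is a hypothetical stand-in for the original location.
sudo mkdir -p /mnt/addon/spack-stack
sudo mv "$SPACK_CACHE_DIR/build-cache" "$SPACK_CACHE_DIR/source-cache" /mnt/addon/spack-stack/
sudo ln -s /mnt/addon/spack-stack/build-cache "$SPACK_CACHE_DIR/build-cache"
sudo ln -s /mnt/addon/spack-stack/source-cache "$SPACK_CACHE_DIR/source-cache"

# Replace the root-filesystem swapfile with one on the new volume.
sudo swapoff /swapfile
sudo fallocate -l 128G /mnt/addon/swapfile
sudo chmod 600 /mnt/addon/swapfile
sudo mkswap /mnt/addon/swapfile
sudo swapon /mnt/addon/swapfile
# Remove the old /swapfile entry from /etc/fstab and add the new one.
```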

Troubleshooting / FAQ

CDash Troubleshooting

CDash is hosted on an AWS EC2 instance in our USAF account in us-east-2 region. Members of the Infrastructure team can access this instance with SSH.

HTTPS / Signing Authority: The CDash server uses an SSL connection with a LetsEncrypt certificate, which is renewed on the 10th of each month by a cron job on the instance. If the renewal job fails, our CDash integration breaks.

Containerized Service Deployment: The CDash server is deployed on the instance via a docker compose deployment with three containers. During certificate updates the "cdash" container is temporarily brought down so that certbot can communicate with the signing authority. It is safe to stop and start the cdash container, although stopping it causes a temporary service outage. Stopping the MySQL container (without preserving the volume) will clear all data from our CDash server, including repository configurations, which will need to be added manually to re-enable test uploading.
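The monthly renewal described above can be sketched roughly as follows. The deployment path is a hypothetical placeholder, and the exact certbot invocation is an assumption; check the instance's actual crontab and compose file before relying on this:

```
# Hypothetical path to the compose deployment on the instance.
cd "$CDASH_DEPLOY_DIR"

docker compose stop cdash    # free the HTTP(S) ports for the ACME challenge
certbot renew                # renew the LetsEncrypt certificate
docker compose start cdash   # restore service
```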

Detailed debugging notes for the containerized deployment can be found in the CDash config code repository README file.

Running GitHub Workflow Locally

To save costs and time, GitHub workflows can be run locally. A heavily used tool that mimics GitHub's runners locally with minimal setup is act (https://github.com/nektos/act). It takes existing workflow YAML files and runs them locally using Docker.


MacOS Setup

MacOS behaves differently from Linux in several ways when using act. Make sure the following are set up on your machine.

Install act with: brew install act

To make sure the right Docker sockets are used, run the following command:

docker context use default

This ensures the Docker socket(s) are linked properly.

act command-line options

The most common command-line options you should remember are:
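For reference, a few commonly used act options are sketched below; the job name and workflow path shown are hypothetical, so verify the flags against `act --help` for your installed version:

```
act -l                             # list the jobs that would run
act pull_request                   # run workflows triggered by the pull_request event
act -j build-test                  # run a single job by name (hypothetical job name)
act -n                             # dry run: show steps without executing them
act -W .github/workflows/ci.yml    # run a specific workflow file (hypothetical path)
act -s GITHUB_TOKEN="$MY_TOKEN"    # pass a secret into the run
```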