diff --git a/chromium/docs/website/site/developers/testing/isolated-testing/infrastructure/index.md b/chromium/docs/website/site/developers/testing/isolated-testing/infrastructure/index.md
deleted file mode 100644
index b47c7d9dd47..00000000000
--- a/chromium/docs/website/site/developers/testing/isolated-testing/infrastructure/index.md
+++ /dev/null
@@ -1,204 +0,0 @@
----
-breadcrumbs:
-- - /developers
- - For Developers
-- - /developers/testing
- - Testing and infrastructure
-- - /developers/testing/isolated-testing
- - Isolated Testing
-page_name: infrastructure
-title: Isolated Testing Infrastructure
----
-
-[TOC]
-
-## This page is obsolete; it needs to be migrated to markdown (https://crbug.com/1246778)
-
-## Objective
-
-"*Build and test at scale*". The goal is to drastically reduce whole test cycle
-time by scaling building and test sharding across multiple bots seamlessly. It
-does so by integrating Swarming within the Try Server and the Continuous
-Integration system.
-
-This page is about the Chromium-specific Swarming infrastructure. For the
-general Swarming Design, see [its
-documentation](https://chromium.googlesource.com/infra/luci/luci-py/+/HEAD/appengine/swarming/doc/Design.md).
-
-## Background
-
-* The Chromium waterfall used to use completely manual test sharding.
-    A "builder" compiles and creates a .zip of the build output. Then
-    "testers" download the zip, check out the sources, unpack the zip
-    inside the source checkout and run a few tests.
-* While we can continue throwing more, faster hardware at the problem,
-    the fundamental issue remains: as tests get larger and slower, the
-    end-to-end test latency will continue to increase, slowing down
-    developer productivity.
-* This is a natural extension of the Chromium Try Server (initiated
-    and written by maruel@ in 2008), which scaled up through the years,
-    and the Commit Queue (initiated and written by maruel@ in 2011).
-* Before the Try Server, team members were not testing on platforms
-    other than the one they were developing on, causing constant
-    breakage. The Try Server helped reach 50 commits/day.
-* Before the [Commit Queue](/developers/testing/commit-queue/design),
-    the overhead of manually triggering the proper tests on all the
-    important configurations was becoming increasingly cumbersome, so it
-    was automated. This helped sustain 200 commits/day.
-
-But these are not sufficient to scale the team's velocity to hundreds of
-commits per day; big design flaws remain in the way the team works. To scale
-the Chromium team's productivity, significant changes in the infrastructure
-need to happen: in particular, the latency of testing across platforms needs
-to be drastically reduced. That requires getting the test result in O(1)
-time, independent of:
-
-1. Number of platforms to test on.
-2. Number of test executables.
-3. Number of test cases.
-4. Duration of each test case, especially in the worst case.
-5. Size of the data required to run the test.
-6. Size of the checkout.
-
-To achieve this, sharding a test must have a constant cost. This is what the
-Swarming integration is about.
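-
-As a minimal sketch of the scaling argument: with enough bots, the wall-clock
-time of a sharded run is bounded by the fixed overheads plus the slowest
-shard, not by the total amount of test work. The numbers below are
-illustrative assumptions, not measurements:
-
-```python
-# Assumes perfectly even sharding and an unlimited pool of idle bots.
-CONSTANT_OVERHEAD_S = 60.0  # archive + trigger + collect (assumed value)
-
-def wall_clock_s(total_test_time_s, num_shards):
-    """Sharded wall-clock time: fixed overhead plus the slowest shard."""
-    return CONSTANT_OVERHEAD_S + total_test_time_s / num_shards
-
-print(wall_clock_s(4800, 1))   # 4860.0 -> ~81 minutes on one bot
-print(wall_clock_s(4800, 40))  # 180.0  -> 3 minutes across 40 shards
-```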
-
-## Overview
-
-Using Swarming works around [Buildbot](http://buildbot.net/)'s limitations and
-permits sharding automatically and in an unlimited way. For example, it
-permits sharding the test cases of a large smoke test across multiple bots to
-reduce the latency of running it. Buildbot, on the other hand, requires manual
-configuration to shard the tests and is not very efficient at large scale.
-
-By reusing the Isolated testing effort, we can shard tests efficiently across
-the Swarming bots. By integrating the Swarming infrastructure inside
-[Buildbot](http://buildbot.net/), we worked around the manual sharding that
-Buildbot requires.
-
-To recapitulate the Isolated design doc, `isolateserver.py` is used to archive
-all the run-time dependencies of a unit test on the "builder" to Isolate
-Server. Since the content store is content-addressed by the SHA-1 of the
-content, only new contents are archived. Then only the SHA-1 of the manifest
-describing the whole dependency set is sent to the Swarming bots, with an
-index of the shards it needs to run. That is, **40 bytes for the hash plus 2
-integers** is all that is required to know what OS is needed and what files
-are needed to run a shard of test cases with `run_isolated.py`.
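-
-A minimal sketch of the content-addressed idea (greatly simplified; the real
-`isolateserver.py` protocol also compresses, batches and checks for presence
-before uploading):
-
-```python
-import hashlib
-
-store = {}  # stand-in for the server's content-addressed datastore
-
-def archive(content):
-    """Store content under the SHA-1 of the content itself."""
-    digest = hashlib.sha1(content).hexdigest()
-    store.setdefault(digest, content)  # already-known content costs nothing
-    return digest
-
-# Re-archiving identical content uploads nothing new:
-assert archive(b"test binary") == archive(b"test binary")
-assert len(store) == 1
-```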
-
-## How the infrastructure works
-
-For each buildbot worker using Swarming:
-
-1. Checks out sources.
-2. Compiles.
-3. Runs 'isolate tests'. This archives the builds on
- <https://isolateserver.appspot.com>.
-4. Triggers Swarming tasks.
-5. Runs anything that needs to run locally.
-6. Collects Swarming task results.
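-
-A sketch of how a worker might drive steps 3, 4 and 6. The command names come
-from the luci-py client, but the exact flags vary by version, so treat them
-as illustrative rather than authoritative; the hash and task id are
-placeholders:
-
-```python
-import subprocess
-
-def step(cmd):
-    """Run one build step, failing loudly like a buildbot step would."""
-    subprocess.run(cmd, check=True)
-
-# 3. Archive the test and its runtime dependencies; prints an isolated hash.
-step(["python", "isolate.py", "archive",
-      "--isolate-server", "https://isolateserver.appspot.com",
-      "--isolated", "out/Release/base_unittests.isolated"])
-
-# 4. Trigger the Swarming task by hash, then (6.) block on its results.
-step(["python", "swarming.py", "trigger",
-      "--swarming", "https://chromium-swarm.appspot.com", "<isolated-hash>"])
-step(["python", "swarming.py", "collect",
-      "--swarming", "https://chromium-swarm.appspot.com", "<task-id>"])
-```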
-
-The Commit Queue uses Swarming indirectly via the Try Server.
-
-So there are really two layers of control involved. The first is **the LUCI
-scheduler**, which controls the overall "build": syncing the sources,
-compiling, requesting the tests to be run on Swarming and asking it to report
-success or failure. The second layer is the **Swarming server** itself, which
-"micro-distributes" test shards. Each test shard is actually a subset of the
-test cases for a single unit test executable. All the unit tests are run
-concurrently. So, for example, for a Try Job that requests `base_unittests`,
-`net_unittests`, `unit_tests` and `browser_tests` to be run, they are all run
-simultaneously on different Swarming bots, and slow tests, like
-`browser_tests`, are further sharded across multiple bots, all simultaneously.
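-
-The "2 integers" mentioned above are the shard index and the shard count. A
-minimal sketch of how one executable's test cases split across bots, using
-the standard Google Test sharding environment variables (the path and the
-shard count here are illustrative):
-
-```python
-import os
-import subprocess
-
-def run_shard(executable, index, total):
-    """Run only this bot's slice of the executable's test cases."""
-    env = dict(os.environ)
-    env["GTEST_TOTAL_SHARDS"] = str(total)  # how many shards exist in all
-    env["GTEST_SHARD_INDEX"] = str(index)   # which slice this bot executes
-    return subprocess.run([executable], env=env).returncode
-
-# Conceptually, Swarming runs the same command on ten bots at once:
-#   run_shard("./browser_tests", index=i, total=10)  for i in 0..9
-```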
-
-## Diagrams
-
-How the Try Server uses Swarming:
-
-#### Chromium Try Server Swarming Infrastructure
-
-What using Swarming directly looks like:
-
-#### Using Swarming directly
-
-## Project information
-
-* This project is an integral part of the Chromium Continuous
- Integration infrastructure and the Chromium Try Server.
-* While this project will greatly improve the Chromium Commit Queue's
-    performance, it has no direct relationship to it; the performance
-    improvement, while we're aiming for it, is purely a side-effect of
-    the reduced Try Server testing latency.
-* Code: <https://chromium.googlesource.com/infra/luci/luci-py.git>.
-
-### AppEngine Servers
-
-* [chromium-swarm.appspot.com](https://chromium-swarm.appspot.com):
- manages task execution
-* [isolateserver.appspot.com](https://isolateserver.appspot.com/):
- build output cache
-
-### Canary Setup
-
-* [chromium-swarm-dev.appspot.com](https://chromium-swarm-dev.appspot.com)
-* [isolateserver-dev.appspot.com](https://isolateserver-dev.appspot.com)
-
-## Latency
-
-This project is primarily aimed at reducing the overall latency from asking
-for a green-light signal for a CL to getting that signal. The CL can be "not
-committed yet" or "just committed", the former being the Try Server, the
-latter the Continuous Integration servers. The latency is reduced by enabling
-a higher degree of parallel shard execution and by removing the constant costs
-of syncing the sources and zipping the test executables, both of which are
-extremely slow, on the order of minutes.
-Other latencies include:
-
-1. Time to archive the dependencies to the Isolate Server.
-2. Time to trigger a Swarming run.
-3. Time for the workers to react to a Swarming run request.
-4. Time for the workers to fetch the dependencies and map them into a
-    temporary directory.
-5. Time for the workers to clean up the temporary directory and report
-    stdout/stderr back to the Swarming master.
-6. Time for the Swarming master to react and return the information to
-    the Swarming client running on the buildbot worker.
-
-## Scalability
-
-All servers run on AppEngine, which scales just fine.
-
-## Redundancy and Reliability
-
-There are multiple single points of failure:
-
-1. The Isolate Server, which is hosted on AppEngine.
-2. The Swarming master, which is also hosted on AppEngine.
-
-The Swarming bots are intrinsically redundant. The Isolate Server data store
-isn't redundant or durable, but it can be rebuilt from sources if needed. If
-it fails, it blocks the infrastructure until it recovers.
-
-## Security Consideration
-
-Since the whole infrastructure is visible from the internet, like this design
-doc, proper ACLs need to be used. Both the Swarming and the Isolate servers
-require valid Google accounts. Credential verification is managed entirely by
-[auth_service](https://github.com/luci/luci-py/tree/master/appengine/auth_service).
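-
-At the HTTP level this just means every API call carries a credential. A
-minimal sketch; the token's origin and the endpoint are illustrative, not
-the real API surface:
-
-```python
-import urllib.request
-
-def authed_get(url, access_token):
-    """Fetch a URL with a Google OAuth bearer token attached."""
-    req = urllib.request.Request(
-        url, headers={"Authorization": "Bearer " + access_token})
-    with urllib.request.urlopen(req) as resp:
-        return resp.read()  # requests without a valid credential are rejected
-```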
-
-## Testing Plan
-
-All the code (the Swarming master, the Isolate Server and the swarming_client
-code) is tested on the canary instances before being rolled out to prod. See
-the Canary Setup above.
-
-## FAQ
-
-### Why not a faulting file system like FUSE?
-
-Faulting file systems are inherently slow: every time a file is missing, the
-whole process hangs while the FUSE adapter downloads the file synchronously,
-then the process resumes. Multiply that by the 8000 files that browser_tests
-lists. With a pre-loaded content-addressed file system, all the files can be
-cached safely locally and downloaded simultaneously. The savings and speed
-improvement are enormous.
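-
-A back-of-the-envelope comparison; the file count is from above, while the
-latency and parallelism are assumed round numbers:
-
-```python
-FILES = 8000        # roughly what browser_tests lists (from above)
-RTT_S = 0.03        # assumed 30 ms per synchronous faulted download
-PARALLELISM = 200   # assumed concurrent fetches during a prefetch
-
-print(FILES * RTT_S)                # 240.0 s of serial stalls when faulting
-print(FILES * RTT_S / PARALLELISM)  # 1.2 s of prefetch, bandwidth permitting
-```
\ No newline at end of file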