author Shawn O. Pearce <> 2009-02-06 12:32:57 -0800
committer Shawn O. Pearce <> 2009-02-08 17:00:20 -0800
commit c4bcc0986df3427e36e8128a040f70e9a99f4d75
parent ce32ec517684e7cbe76a8b9b36495b1a41587442
Add a document describing Gerrit's high level design
Signed-off-by: Shawn O. Pearce <>
2 files changed, 631 insertions, 0 deletions
diff --git a/Documentation/dev-design.txt b/Documentation/dev-design.txt
new file mode 100644
index 0000000000..dc386e9dbc
--- /dev/null
+++ b/Documentation/dev-design.txt
@@ -0,0 +1,630 @@
+Gerrit2 - System Design
+Gerrit is a web based code review system, facilitating online code
+reviews for projects using the Git version control system.
+Gerrit makes reviews easier by showing changes in a side-by-side
+display, and allowing inline comments to be added by any reviewer.
+Gerrit simplifies Git based project maintainership by permitting
+any authorized user to submit changes to the master Git repository,
+rather than requiring all approved changes to be merged in by
+hand by the project maintainer. This functionality enables a more
+centralized usage of Git.
+Google developed Mondrian, a Perforce based code review tool, to
+facilitate peer-review of changes prior to submission to the central
+code repository. Mondrian is not open source, as it is tied to the
+use of Perforce and to many Google-only services, such as Bigtable.
+Google employees have often described how useful Mondrian and its
+peer-review process are to their day-to-day work.
+Guido van Rossum open sourced portions of Mondrian within Rietveld,
+a similar code review tool running on Google App Engine, but for
+use with Subversion rather than Perforce. Rietveld is in common
+use by many open source projects, facilitating their peer reviews
+much as Mondrian does for Google employees. Unlike Mondrian and
+the Google Perforce triggers, Rietveld is strictly advisory and
+does not enforce peer-review prior to submission.
+Git is a distributed version control system, wherein each repository
+is assumed to be owned/maintained by a single user. There are no
+inherent security controls built into Git, so the ability to read
+from or write to a repository is controlled entirely by the host's
+filesystem access controls. When multiple maintainers collaborate
+on a single shared repository a high degree of trust is required,
+as any collaborator with write access can alter the repository.
+Gitosis provides tools to secure centralized Git repositories,
+permitting multiple maintainers to manage the same project at once,
+by restricting access to a secure network protocol,
+much like Perforce secures a repository by only permitting access
+over its network port.
+The Android Open Source Project (AOSP) was founded by Google with the
+open source release of the Android operating system. AOSP has
+selected Git as its primary version control tool. As many of the
+engineers have a background of working with Mondrian at Google,
+there is a strong desire to have the same (or better) feature set
+available for Git and AOSP.
+* link:[Mondrian Code Review On The Web]
+* link:[Rietveld - Code Review for Subversion]
+* link:;a=blob;f=README.rst;hb=HEAD[Gitosis README]
+* link:[Android Open Source Project]
+Developers create one or more changes on their local desktop system,
+then upload them for review to Gerrit using the standard `git push`
+command line program, or any GUI which can invoke `git push` on
+behalf of the user. Authentication and data transfer are handled
+through SSH. Users are authenticated by username and public/private
+key pair, and all data transfer is protected by the SSH connection
+and Git's own data integrity checks.
+Each Git commit created on the client desktop system is converted
+into a unique change record which can be reviewed independently.
+Change records are stored in PostgreSQL, where they can be queried to
+present customized user dashboards, enumerating any pending changes.
+A summary of each newly uploaded change is automatically emailed
+to reviewers, so they receive a direct hyperlink to review the
+change on the web. Reviewer email addresses can be specified on the
+`git push` command line, but typically reviewers are automatically
+selected by Gerrit by identifying users who have change approval
+permissions in the project.
+Reviewers use the web interface to read the side-by-side or unified
+diff of a change, and insert draft inline comments where appropriate.
+A draft comment is visible only to the reviewer, until they publish
+those comments. Published comments are automatically emailed to
+the change author by Gerrit, and are CC'd to all other reviewers
+who have already commented on the change.
+When publishing comments reviewers are also given the opportunity
+to score the change, indicating whether they feel the change is
+ready for inclusion in the project, needs more work, or should be
+rejected outright. These scores provide direct feedback to Gerrit's
+change submit function.
+After a change has been scored positively by reviewers, Gerrit
+enables a submit button on the web interface. Authorized users
+can push the submit button to have the change enter the project
+repository. The equivalent in Subversion or Perforce would be
+that Gerrit is invoking `svn commit` or `p4 submit` on behalf of
+the web user pressing the button. Due to the way Git audit trails
+are maintained, the user pressing the submit button does not need
+to be the author of the change.
+End-user web browsers make HTTP requests directly to Gerrit's
+HTTP server. As nearly all of the user interface is implemented
+through Google Web Toolkit (GWT), the majority of these requests
+are transmitting compressed JSON payloads, with all HTML being
+generated within the browser. Most responses are under 1 KB.
+Gerrit's HTTP server side component is implemented as a standard
+Java servlet, and thus runs within any J2EE servlet container.
+Popular choices for deployments would be Tomcat or Jetty, as these
+are high-quality open-source servlet containers that are readily
+available for download.
+End-user uploads are performed over SSH, so Gerrit's servlets also
+start up a background thread to receive SSH connections through
+an independent SSH port. SSH clients communicate directly with
+this port, bypassing the HTTP server used by browsers.
+Server side data storage for Gerrit is broken down into two different
+categories:
+* Git repository data
+* Gerrit metadata
+The Git repository data is the Git object database used to store
+already submitted revisions, as well as all uploaded (proposed)
+changes. Gerrit uses the standard Git repository format, and
+therefore requires direct filesystem access to the repositories.
+All repository data is stored in the filesystem and accessed through
+the JGit library. Repository data can be stored on remote servers
+accessible through NFS or SMB, but the remote directory must
+be mounted on the Gerrit server as part of the local filesystem
+namespace. Remote filesystems are likely to perform worse than
+local ones, due to Git disk IO behavior not being optimized for
+remote access.
+The Gerrit metadata contains a summary of the available changes,
+all comments (published and drafts), and individual user account
+information. The metadata is housed in a PostgreSQL database,
+which can be located either on the same server as Gerrit, or on
+a different (but nearby) server. Most installations would opt to
+install both Gerrit and PostgreSQL on the same server, to reduce
+administration overheads.
+User authentication is handled by OpenID, and therefore Gerrit
+requires that the OpenID provider selected by a user must be
+online and operating in order to authenticate that user.
+* link:[Google Web Toolkit (GWT)]
+* link:[Git Repository Format]
+* link:[About PostgreSQL]
+* link:[OpenID Specifications]
+Project Information
+Gerrit is developed as a self-hosting open source project:
+* link:[Project Homepage]
+* link:[Release Versions]
+* link:[Source]
+* link:[Issue Tracking]
+* link:[Change Review]
+Internationalization and Localization
+As a source code review system for open source projects, where the
+commonly preferred language for communication is typically English,
+Gerrit does not make internationalization or localization a priority.
+The majority of Gerrit's users will be writing change descriptions
+and comments in English, and therefore an English user interface
+is usable by the target user base.
+Gerrit uses GWT's i18n support to externalize all constant strings
+and messages shown to the user, so that in the future someone who
+really needed a translated version of the UI could contribute new
+string files for their locale(s).
+Right-to-left (RTL) support is only barely considered within the
+Gerrit code base. Some portions of the code have tried to take
+RTL into consideration, while others probably need to be modified
+before translating the UI to an RTL language.
+* link:i18n-readme.html[Gerrit's i18n Support]
+Accessibility Considerations
+Whenever possible Gerrit displays raw text rather than image icons,
+so screen readers should still be able to provide useful information
+to blind persons accessing Gerrit sites.
+Standard HTML hyperlinks are used rather than HTML div or span tags
+with click listeners. This provides two benefits to the end-user.
+The first benefit is that screen readers are optimized for locating
+standard hyperlink anchors and presenting them to the end-user as
+a navigation action. The second benefit is that users can use
+the 'open in new tab/window' feature of their browser whenever
+they choose.
+When possible, Gerrit uses the ARIA properties on DOM widgets to
+provide hints to screen readers.
+Browser Compatibility
+Supporting non-JavaScript enabled browsers is a non-goal for Gerrit.
+As Gerrit is a pure-GWT application with no server side rendering
+fallbacks, the browser must support modern JavaScript semantics in
+order to access the Gerrit web application. Dumb clients such as
+`lynx`, `wget`, `curl`, or even many search engine spiders are not
+able to access Gerrit content.
+As Google Web Toolkit (GWT) is used to generate the browser
+specific versions of the client-side JavaScript code, Gerrit works
+on any JavaScript enabled browser which GWT can produce code for.
+This covers the majority of the popular browsers.
+The Gerrit project wants to offer offline support via the HTML 5
+standard and/or Google Gears plugin, both of which would require
+the UI to be rendered in JavaScript on the client side.
+The Gerrit project does not have the development resources necessary
+to support two parallel UI implementations (GWT based JavaScript
+and server-side rendering). Consequently only one is implemented.
+There are a number of web browsers available with full JavaScript
+support, and nearly every operating system (including any PDA-like
+mobile phone) comes with one standard. Users who are committed
+to developing changes for a Gerrit managed project can be expected
+to be able to run a JavaScript enabled browser, as they also would
+need to be running Git in order to contribute.
+There are a number of open source browsers available, including
+Firefox and Chromium. Users have some degree of choice in their
+browser selection, including being able to build and audit their
+browser from source.
+The majority of the content stored within Gerrit is also available
+through other means, such as gitweb or the `git://` protocol.
+Any existing search engine spider can crawl the server-side HTML
+produced by gitweb, and thus can index the majority of the changes
+which might appear in Gerrit. Some engines may even choose to
+crawl the native version control database, such as does.
+Therefore the lack of support for most search engine spiders is a
+non-issue for most Gerrit deployments.
+Product Integration
+Gerrit integrates with an existing gitweb installation by optionally
+creating hyperlinks to reference changes on the gitweb server.
+Gerrit integrates with an existing git-daemon installation by
+optionally displaying `git://` URLs for users to download a
+change through the native Git protocol.
+Gerrit integrates with any OpenID provider for user authentication,
+making it easier for users to join a Gerrit site and manage their
+authentication credentials to it. To make use of Google Accounts
+as an OpenID provider easier, Gerrit has a shorthand "Sign in with
+a Google Account" link on its sign-in screen. Gerrit also supports
+a shorthand sign in link for Yahoo!. Other providers may also be
+supported more directly in the future.
+Gerrit integrates with some types of corporate single-sign-on (SSO)
+solutions, typically by having the SSO authentication be performed
+in a reverse proxy web server and then blindly trusting that all
+incoming connections have been authenticated by that reverse proxy.
+When configured to use this form of authentication, Gerrit does
+not integrate with OpenID providers.
+When installing Gerrit, administrators may optionally include an
+HTML header or footer snippet which may include user tracking code,
+such as that used by Google Analytics. This is a per-instance
+configuration that must be done by hand, and is not supported
+out of the box. Other site trackers instead of Google Analytics
+can be used, as the administrator can supply any HTML/JavaScript
+they choose.
+Gerrit does not integrate with any Google service, or any other
+services other than those listed above.
+Standards / Developer APIs
+Gerrit uses an XSRF protected variant of JSON-RPC 1.1 to communicate
+between the browser client and the server.
+As the protocol is not the GWT-RPC protocol, but is instead a
+self-describing standard JSON format, it is easily implemented by
+any 3rd party client application, provided the client has a JSON
+parser and HTTP client library available.
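+As a rough illustration of how little a 3rd party client needs, the
+sketch below builds a generic JSON-RPC 1.1 request envelope and unpacks
+a response using only a JSON library. The method name and parameters
+are hypothetical placeholders, not actual Gerrit service calls, and the
+XSRF token handling is deliberately omitted because its format is not
+described in this document.

```python
import json


def build_request(method, params, request_id):
    """Serialize a JSON-RPC 1.1 style request envelope.

    `method` and `params` are caller-supplied; the names used in the
    demo below are hypothetical and do not refer to real Gerrit calls.
    """
    return json.dumps({
        "version": "1.1",
        "method": method,
        "params": params,
        "id": request_id,
    })


def extract_result(response_text):
    """Return the `result` member, or raise if the server sent `error`."""
    response = json.loads(response_text)
    if response.get("error") is not None:
        error = response["error"]
        raise RuntimeError(f"JSON-RPC error: {error.get('message', error)}")
    return response.get("result")


if __name__ == "__main__":
    # Hypothetical method name, for illustration only.
    print(build_request("exampleService.exampleMethod", ["arg", 1], 1))
    print(extract_result('{"version": "1.1", "result": 42}'))
```

+A real client would additionally POST the serialized body over HTTP and
+supply whatever XSRF token the server expects.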
+As the entire command set necessary for the standard web browser
+based UI is exposed through JSON-RPC over HTTP, there are no other
+data feeds or command interfaces to the server.
+Commands requiring user authentication may require the user agent to
+complete a sign-in cycle through the user's OpenID provider in order
+to establish the HTTP cookie Gerrit uses to track user identity.
+Automating this sign-in process for non-web browser agents is
+outside of the scope of Gerrit, as each OpenID provider uses its own
+sign-in sequence. Use of OpenID providers which have difficult to
+automate interfaces may make it impossible for non-browser agents
+to be used with the JSON-RPC interface.
+* link:[JSON-RPC 1.1]
+* link:;a=blob;f=README;hb=HEAD[XSRF JSON-RPC]
+Privacy Considerations
+Gerrit stores the following information per user account:
+* Full Name
+* Preferred Email Address
+* Mailing Address '(Optional)'
+* Country '(Optional)'
+* Phone Number '(Optional)'
+* Fax Number '(Optional)'
+The full name and preferred email address fields are shown to any
+site visitor viewing a page containing a change uploaded by the
+account owner, or containing a published comment written by the
+account owner.
+Showing the full name and preferred email is approximately the same
+risk as the `From` header of an email posted to a public mailing
+list that maintains archives, and Gerrit treats these fields in
+much the same way that a mailing list archive might handle them.
+Users who don't want to expose this information should either not
+participate in a Gerrit based online community, or open a new email
+address dedicated for this use.
+As the Gerrit UI data is only available through XSRF protected
+JSON-RPC calls, "screen-scraping" for email addresses is difficult,
+but not impossible. It is unlikely a spammer will go through the
+effort required to code a custom scraping application necessary
+to cull email addresses from published Gerrit comments. In most
+cases these same addresses would be more easily obtained from the
+project's mailing list archives.
+The snail-mail mailing address, country, and phone and fax numbers
+are gathered to help project leads contact the user should there
+be a legal question regarding any change they have uploaded.
+This data is only visible to the account owner and to the Gerrit
+site administrator. It is expected that the information would only
+be revealed with a valid court subpoena, but this is really left
+to the discretion of the Gerrit site administrator as to when it
+is reasonable to reveal this information to a 3rd party.
+All user account information is stored unencrypted in the Gerrit
+metadata store, typically a PostgreSQL database.
+Spam and Abuse Considerations
+Gerrit makes no attempt to detect spam changes or comments. The
+somewhat high barrier to entry makes it unlikely that a spammer
+will target Gerrit.
+To upload a change, the client must speak the native Git protocol
+embedded in SSH, with some custom Gerrit semantics added on top.
+The client must have their public key already stored in the Gerrit
+database, which can only be done through the XSRF protected
+JSON-RPC interface. The level of effort required to construct
+the necessary tools to upload a well-formatted change that isn't
+rejected outright by the Git and Gerrit checksum validations is
+too high for a spammer to get any meaningful return.
+To post and publish a comment a client must sign in with an OpenID
+provider and then use the XSRF protected JSON-RPC interface to
+publish the draft on an existing change record. Again, the level of
+effort required to implement the Gerrit specific XSRF protections
+and the JSON-RPC payload format necessary to post a draft and then
+publish that draft is simply too high for a spammer to bother with.
+Both of these assumptions are also based upon the idea that Gerrit
+will be a lot less popular than blog software, and thus will be
+running on far fewer websites. Spammers therefore have very little
+returned benefit for getting over the protocol hurdles.
+These assumptions may need to be revisited in the future if any
+public Gerrit site actually notices spam.
+Gerrit targets sub-250 ms per page request, mostly by using
+very compact JSON payloads between client and server. However, as
+most of the serving stack (network, hardware, PostgreSQL metadata
+database) is out of control of the Gerrit developers, no real
+guarantees can be made about latency.
+Gerrit is designed for an open source project. Roughly this
+amounts to parameters such as the following:
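+.Design Parameters
+[options="header"]
+|=====================================
+|Parameter        | Estimated Maximum
+|Projects         | 500
+|Contributors     | 2,000
+|Changes/Day      | 400
+|Revisions/Change | 2.0
+|Files/Change     | 4
+|Comments/File    | 2
+|Reviewers/Change | 1.0
+|=====================================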
+CPU Usage
+Very few, if any, open source projects have more than a handful of
+Git repositories associated with them. Since Gerrit treats one
+Git repository as a project, an assumed limit of 500 projects
+is reasonable. Only an operating system distribution project
+would really need to be tracking more than a handful of discrete
+Git repositories.
+Almost no open source project has 2,000 contributors over all time,
+let alone on a daily basis. This figure of 2,000 was WAG'd by
+looking at PR statements published by cell phone companies picking
+up the Android operating system. If all of the stated employees in
+those PR statements were working on *only* the open source Android
+repositories, we might reach the 2,000 estimate listed here. Knowing
+these companies as being very closed-source minded in the past, it
+is very unlikely all of their Android engineers will be working on
+the open source repository, and thus 2,000 is a very high estimate.
+The estimate of 400 changes per day was WAG'd off some estimates
+originally obtained from Android's development history. Writing a
+good change that will be accepted through a peer-review process
+takes time. The average engineer may need 4-6 hours per change just
+to write the code and unit tests. Proper design consideration and
+additional but equally important tasks such as meetings, interviews,
+training, and eating lunch will often pad the engineer's day out
+such that suitable changes are only posted once a day, or once
+every other day. For reference, the entire Linux kernel has an
+average of only 79 changes/day.
+The estimate of 2 revisions/change means that on average any
+given change will need to be modified once to address peer review
+comments before the final revision can be accepted by the project.
+Executing these revisions also eats into the contributor's time,
+and is another factor limiting the number of changes/day accepted
+by the Gerrit instance.
+The estimate of 1 reviewer/change means that on average only one
+person will comment on a change. Usually this would be the project
+lead, or someone who is familiar with the code being modified.
+The time required to comment further reduces the time available
+for writing one's own changes.
+Gerrit's web UI would require on average `4+F+F*C` HTTP requests to
+review a change and post comments. Here `F` is the number of files
+modified by the change, and `C` is the number of inline comments left
+by the reviewer per file. The constant 4 accounts for the request
+to load the reviewer's dashboard, to load the change detail page,
+to publish the review comments, and to reload the change detail
+page after comments are published.
+This WAG'd estimate boils down to <12,800 HTTP requests per day
+(QPD). Assuming these are evenly distributed over an 8 hour work day
+in a single time zone, we are looking at approximately 0.44 queries
+per second (QPS), or about 26.7 queries per minute.
+  QPD = Changes_Day * Revisions_Change * Reviewers_Change * (4 + F + F * C)
+      = 400 * 2.0 * 1.0 * (4 + 4 + 4 * 2)
+      = 12,800
+  QPS = QPD / 8_Hours / 60_Minutes / 60_Seconds
+      = 0.44
+Gerrit serves most requests in under 60 ms when using the loopback
+interface and a single processor. On a single CPU system there is
+sufficient capacity for 16 QPS. A dual processor system should be
+more than sufficient for a site with the estimated load described above.
+A more realistic estimate of 79 changes per day (from the
+Linux kernel) suggests only 2,528 queries per day, and a much lower
+0.09 QPS when spread out over an 8 hour work day.
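+The load estimate above can be reproduced with a short Python sketch.
+The constants are the estimated maximums from the design parameters;
+the function names are illustrative only and are not part of Gerrit.
+The sketch reports the rate both per minute and per second, since
+dividing the daily total by 8 hours and then by 60 yields a per-minute
+figure, and a further division by 60 is needed for a per-second one.

```python
# Estimated maximums taken from the design parameters in this document.
REVISIONS_PER_CHANGE = 2.0
REVIEWERS_PER_CHANGE = 1.0
FILES_PER_CHANGE = 4
COMMENTS_PER_FILE = 2
WORK_DAY_HOURS = 8
SERVICE_TIME_SECONDS = 0.060  # "under 60 ms" per request on one CPU


def requests_per_review(files, comments_per_file):
    """HTTP requests needed to review one revision: 4 + F + F*C."""
    return 4 + files + files * comments_per_file


def requests_per_day(changes_per_day):
    """QPD = changes/day * revisions/change * reviewers/change * requests/review."""
    return (changes_per_day * REVISIONS_PER_CHANGE * REVIEWERS_PER_CHANGE
            * requests_per_review(FILES_PER_CHANGE, COMMENTS_PER_FILE))


def per_minute(qpd):
    """Average request rate per minute over the work day."""
    return qpd / (WORK_DAY_HOURS * 60)


def per_second(qpd):
    """Average request rate per second over the work day."""
    return qpd / (WORK_DAY_HOURS * 60 * 60)


if __name__ == "__main__":
    for label, changes in (("design maximum", 400), ("Linux kernel", 79)):
        qpd = requests_per_day(changes)
        print(f"{label}: {qpd:.0f} requests/day, "
              f"{per_minute(qpd):.1f}/minute, {per_second(qpd):.2f}/second")
    # Upper bound on throughput if every request takes the full 60 ms.
    print(f"single CPU capacity: {1 / SERVICE_TIME_SECONDS:.1f} requests/second")
```

+Changing `FILES_PER_CHANGE` or `COMMENTS_PER_FILE` shows how quickly the
+`F * C` term dominates the per-review request count.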
+Disk Usage
+The average size of a revision in the Linux kernel once compressed
+by Git is 2,327 bytes, or roughly 2 KB. Over the course of a year
+a Gerrit server running with the parameters above might see an
+introduction of 570 MB over the total set of 500 projects hosted in
+that server. This figure assumes the majority of the content is human
+written source code, and not large binary blobs such as disk images.
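+The document does not spell out how the 570 MB figure was derived. One
+set of assumptions that reproduces it is sketched below: 400 changes
+per day at 2 revisions each, a revision size rounded to 2 KB (2,048
+bytes), a 365 day year, and MB taken as 2^20 bytes. These assumptions
+and the function names are illustrative, not taken from Gerrit.

```python
CHANGES_PER_DAY = 400
REVISIONS_PER_CHANGE = 2
DAYS_PER_YEAR = 365
ROUNDED_REVISION_BYTES = 2 * 1024   # "roughly 2 KB" (assumed rounding)
MEASURED_REVISION_BYTES = 2327      # Linux kernel average from the text


def yearly_growth_bytes(bytes_per_revision):
    """Bytes of new Git object data added per year at the estimated load."""
    revisions_per_day = CHANGES_PER_DAY * REVISIONS_PER_CHANGE
    return revisions_per_day * bytes_per_revision * DAYS_PER_YEAR


def to_mib(num_bytes):
    """Convert bytes to units of 2**20 bytes."""
    return num_bytes / (1024 * 1024)


if __name__ == "__main__":
    for label, size in (("rounded 2 KB", ROUNDED_REVISION_BYTES),
                        ("measured 2,327 bytes", MEASURED_REVISION_BYTES)):
        print(f"{label}: {to_mib(yearly_growth_bytes(size)):.1f} MiB/year")
```

+Using the unrounded 2,327 byte average instead gives a somewhat larger
+yearly total, which is still small by the standards of server storage.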
+Redundancy & Reliability
+Gerrit largely assumes that the local filesystem where Git repository
+data is stored is always available. Important data written to disk
+is also forced to the platter with an `fsync()` once it has been
+fully written. If the local filesystem fails to respond to reads
+or becomes corrupt, Gerrit has no provisions to fall back or retry
+and errors will be returned to clients.
+Gerrit largely assumes that the metadata PostgreSQL database is
+online and answering both read and write queries. Query failures
+immediately result in the operation aborting and errors being
+returned to the client, with no retry or fallback provisions.
+Due to the relatively small scale described above, it is very likely
+that the Git filesystem and PostgreSQL based metadata database
+are all housed on the same server that is running Gerrit. If any
+failure arises in one of these components, it is likely to manifest
+in the others too. It is also likely that the administrator cannot
+be bothered to deploy a cluster of load-balanced server hardware,
+as the scale and expected load do not justify the hardware or
+management costs.
+Most deployments caring about reliability will set up a warm-spare
+standby system and use a manual fail-over process to switch from the
+failed system to the warm-spare.
+As Git is a distributed version control system, and open source
+projects tend to have contributors from all over the world, most
+contributors will be able to tolerate a Gerrit down time of several
+hours while the administrator is notified, signs on, and brings the
+warm-spare up. Pending changes are likely to need at least 24 hours
+of time on the Gerrit site anyway in order to ensure any interested
+parties around the world have had a chance to comment. This expected
+lag largely allows for some downtime in a disaster scenario.
+PostgreSQL can be configured to save its write-ahead-log (WAL)
+and ship these logs to other systems, where they are applied to
+a warm-standby backup in real time. Gerrit instances which care
+about redundancy will set up this feature of PostgreSQL to ensure
+the warm-standby is reasonably current should the master go offline.
+Gerrit can be configured to replicate changes made to the local
+Git repositories over any standard Git transports. This can be
+configured in `'site_path'/replication.conf` to send copies of
+all changes over SSH to other servers, or to the Amazon S3 blob
+storage service.
+Logging Plan
+Gerrit does not maintain logs on its own.
+Published comments contain a publication date, so users can judge
+when the comment was posted and decide if it was "recent" or not.
+Only the timestamp is stored in the database; the IP address of
+the comment author is not stored.
+Changes uploaded over the SSH daemon from `git push` have the
+standard Git reflog updated with the date and time that the upload
+occurred, and the Gerrit account identity of who did the upload.
+Changes submitted and merged into a branch also update the
+Git reflog. These logs are available only to the Gerrit site
+administrator, and they are not replicated through the automatic
+replication noted earlier. These logs are primarily recorded for an
+"oh s**t" moment where the administrator has to rewind data. In most
+installations they are a waste of disk space. Future versions of
+JGit may allow disabling these logs, and Gerrit may take advantage
+of that feature to stop writing these logs.
+A web server positioned in front of Gerrit (such as a reverse proxy)
+or the hosting servlet container may record access logs, and these
+logs may be mined for usage information. This is outside of the
+scope of Gerrit.
+Testing Plan
+Gerrit is currently manually tested through its web UI.
+JGit has a fairly extensive automated unit test suite. Most new
+changes to JGit are rejected unless corresponding automated unit
+tests are included.
+Rietveld can't be used as it does not provide the "submit over the
+web" feature that Gerrit provides for Git.
+Gitosis can't be used as it does not provide any code review
+features, but it does provide basic access controls.
+Email based code review does not scale to a project as large and
+complex as Android. Most contributors at least need some sort of
+dashboard to keep track of any pending reviews, and some way to
+correlate updated revisions back to the comments written on prior
+revisions of the same logical change.
diff --git a/Documentation/index.txt b/Documentation/index.txt
index 319a2ba53d..e241ee2a17 100644
--- a/Documentation/index.txt
+++ b/Documentation/index.txt
@@ -33,6 +33,7 @@ Developer Documentation
* link:dev-readme.html[Developer Setup]
* link:dev-eclipse.html[Eclipse Setup]
+* link:dev-design.html[System Design]
* link:i18n-readme.html[i18n Support]
Gerrit resources: