diff options
author | Shawn O. Pearce <sop@google.com> | 2009-02-06 12:32:57 -0800 |
---|---|---|
committer | Shawn O. Pearce <sop@google.com> | 2009-02-08 17:00:20 -0800 |
commit | c4bcc0986df3427e36e8128a040f70e9a99f4d75 (patch) | |
tree | 36d34ffc3701068d4437a6962378c05d68e8ccb5 | |
parent | ce32ec517684e7cbe76a8b9b36495b1a41587442 (diff) |
Add a document describing Gerrit's high level design
Signed-off-by: Shawn O. Pearce <sop@google.com>
-rw-r--r-- | Documentation/dev-design.txt | 630 | ||||
-rw-r--r-- | Documentation/index.txt | 1 |
2 files changed, 631 insertions, 0 deletions
diff --git a/Documentation/dev-design.txt b/Documentation/dev-design.txt new file mode 100644 index 0000000000..dc386e9dbc --- /dev/null +++ b/Documentation/dev-design.txt @@ -0,0 +1,630 @@ +Gerrit2 - System Design +======================= + +Objective +--------- + +Gerrit is a web based code review system, facilitating online code +reviews for projects using the Git version control system. + +Gerrit makes reviews easier by showing changes in a side-by-side +display, and allowing inline comments to be added by any reviewer. + +Gerrit simplifies Git based project maintainership by permitting +any authorized user to submit changes to the master Git repository, +rather than requiring all approved changes to be merged in by +hand by the project maintainer. This functionality enables a more +centralized usage of Git. + + +Background +---------- + +Google developed Mondrian, a Perforce based code review tool to +facilitate peer-review of changes prior to submission to the central +code repository. Mondrian is not open source, as it is tied to the +use of Perforce and to many Google-only services, such as Bigtable. +Google employees have often described how useful Mondrian and its +peer-review process is to their day-to-day work. + +Guido van Rossum open sourced portions of Mondrian within Rietveld, +a similar code review tool running on Google App Engine, but for +use with Subversion rather than Perforce. Rietveld is in common +use by many open source projects, facilitating their peer reviews +much as Mondrian does for Google employees. Unlike Mondrian and +the Google Perforce triggers, Rietveld is strictly advisory and +does not enforce peer-review prior to submission. + +Git is a distributed version control system, wherein each repository +is assumed to be owned/maintained by a single user. There are no +inherit security controls built into Git, so the ability to read +from or write to a repository is controlled entirely by the host's +filesystem access controls. When multiple maintainers collaborate +on a single shared repository a high degree of trust is required, +as any collaborator with write access can alter the repository. + +Gitosis provides tools to secure centralized Git repositories, +permitting multiple maintainers to manage the same project at once, +by restricting the access to only over a secure network protocol, +much like Perforce secures a repository by only permitting access +over its network port. + +The Android Open Source Project (AOSP) was founded by Google by the +open source releasing of the Android operating system. AOSP has +selected Git as its primary version control tool. As many of the +engineers have a background of working with Mondrian at Google, +there is a strong desire to have the same (or better) feature set +available for Git and AOSP. + +* link:http://video.google.com/videoplay?docid=-8502904076440714866[Mondrian Code Review On The Web] +* link:http://code.google.com/p/rietveld/[Rietveld - Code Review for Subversion] +* link:http://eagain.net/gitweb/?p=gitosis.git;a=blob;f=README.rst;hb=HEAD[Gitosis README] +* link:http://source.android.com/[Android Open Source Project] + + +Overview +-------- + +Developers create one or more changes on their local desktop system, +then upload them for review to Gerrit using the standard `git push` +command line program, or any GUI which can invoke `git push` on +behalf of the user. Authentication and data transfer are handled +through SSH. Users are authenticated by username and public/private +key pair, and all data transfer is protected by the SSH connection +and Git's own data integrity checks. + +Each Git commit created on the client desktop system is converted +into a unique change record which can be reviewed independently. +Change records are stored in PostgreSQL, where they can be queried to +present customized user dashboards, enumerating any pending changes. + +A summary of each newly uploaded change is automatically emailed +to reviewers, so they receive a direct hyperlink to review the +change on the web. Reviewer email addresses can be specified on the +`git push` command line, but typically reviewers are automatically +selected by Gerrit by identifying users who have change approval +permissions in the project. + +Reviewers use the web interface to read the side-by-side or unified +diff of a change, and insert draft inline comments where appropriate. +A draft comment is visible only to the reviewer, until they publish +those comments. Published comments are automatically emailed to +the change author by Gerrit, and are CC'd to all other reviewers +who have already commented on the change. + +When publishing comments reviewers are also given the opportunity +to score the change, indicating whether they feel the change is +ready for inclusion in the project, needs more work, or should be +rejected outright. These scores provide direct feedback to Gerrit's +change submit function. + +After a change has been scored positively by reviewers, Gerrit +enables a submit button on the web interface. Authorized users +can push the submit button to have the change enter the project +repository. The equivilant in Subversion or Perforce would be +that Gerrit is invoking `svn commit` or `p4 submit` on behalf of +the web user pressing the button. Due to the way Git audit trails +are maintained, the user pressing the submit button does not need +to be the author of the change. + + +Infrastructure +-------------- + +End-user web browsers make HTTP requests directly to Gerrit's +HTTP server. As nearly all of the user interface is implemented +through Google Web Toolkit (GWT), the majority of these requests +are transmitting compressed JSON payloads, with all HTML being +generated within the browser. Most responses are under 1 KB. + +Gerrit's HTTP server side component is implemented as a standard +Java servlet, and thus runs within any J2EE servlet container. +Popular choices for deployments would be Tomcat or Jetty, as these +are high-quality open-source servlet containers that are readily +available for download. + +End-user uploads are performed over SSH, so Gerrit's servlets also +start up a background thread to receive SSH connections through +an independent SSH port. SSH clients communicate directly with +this port, bypassing the HTTP server used by browsers. + +Server side data storage for Gerrit is broken down into two different +categories: + +* Git repository data +* Gerrit metadata + +The Git repository data is the Git object database used to store +already submitted revisions, as well as all uploaded (proposed) +changes. Gerrit uses the standard Git repository format, and +therefore requires direct filesystem access to the repositories. +All repository data is stored in the filesystem and accessed through +the JGit library. Repository data can be stored on remote servers +accessible through NFS or SMB, but the remote directory must +be mounted on the Gerrit server as part of the local filesystem +namespace. Remote filesystems are likely to perform worse than +local ones, due to Git disk IO behavior not being optimized for +remote access. + +The Gerrit metadata contains a summary of the available changes, +all comments (published and drafts), and individual user account +information. The metadata is housed in a PostgreSQL database, +which can be located either on the same server as Gerrit, or on +a different (but nearby) server. Most installations would opt to +install both Gerrit and PostgreSQL on the same server, to reduce +administration overheads. + +User authentication is handled by OpenID, and therefore Gerrit +requires that the OpenID provider selected by a user must be +online and operating in order to authenticate that user. + +* link:http://code.google.com/webtoolkit/[Google Web Toolkit (GWT)] +* link:http://www.kernel.org/pub/software/scm/git/docs/gitrepository-layout.html[Git Repository Format] +* link:http://www.postgresql.org/about/[About PostgreSQL] +* link:http://openid.net/developers/specs/[OpenID Specifications] + + +Project Information +------------------- + +Gerrit is developed as a self-hosting open source project: + +* link:http://code.google.com/p/gerrit/[Project Homepage] +* link:http://code.google.com/p/gerrit/downloads/list[Release Versions] +* link:http://code.google.com/p/gerrit/wiki/Source?tm=4[Source] +* link:http://code.google.com/p/gerrit/wiki/Issues?tm=3[Issue Tracking] +* link:http://review.source.android.com/[Change Review] + + +Internationalization and Localization +------------------------------------- + +As a source code review system for open source projects, where the +commonly preferred language for communication is typically English, +Gerrit does not make internationalization or localization a priority. + +The majority of Gerrit's users will be writing change descriptions +and comments in English, and therefore an English user interface +is usable by the target user base. + +Gerrit uses GWT's i18n support to externalize all constant strings +and messages shown to the user, so that in the future someone who +really needed a translated version of the UI could contribute new +string files for their locale(s). + +Right-to-left (RTL) support is only barely considered within the +Gerrit code base. Some portions of the code have tried to take +RTL into consideration, while others probably need to be modified +before translating the UI to an RTL language. + +* link:i18n-readme.html[Gerrit's i18n Support] + + +Accessibility Considerations +---------------------------- + +Whenever possible Gerrit displays raw text rather than image icons, +so screen readers should still be able to provide useful information +to blind persons accessing Gerrit sites. + +Standard HTML hyperlinks are used rather than HTML div or span tags +with click listeners. This provides two benefits to the end-user. +The first benefit is that screen readers are optimized to locating +standard hyperlink anchors and presenting them to the end-user as +a navigation action. The second benefit is that users can use +the 'open in new tab/window' feature of their browser whenever +they choose. + +When possible, Gerrit uses the ARIA properties on DOM widgets to +provide hints to screen readers. + + +Browser Compatibility +--------------------- + +Supporting non-JavaScript enabled browsers is a non-goal for Gerrit. + +As Gerrit is a pure-GWT application with no server side rendering +fallbacks, the browser must support modern JavaScript semantics in +order to access the Gerrit web application. Dumb clients such as +`lynx`, `wget`, `curl`, or even many search engine spiders are not +able to access Gerrit content. + +As Google Web Toolkit (GWT) is used to generate the browser +specific versions of the client-side JavaScript code, Gerrit works +on any JavaScript enabled browser which GWT can produce code for. +This covers the majority of the popular browsers. + +The Gerrit project wants to offer offline support via the HTML 5 +standard and/or Google Gears plugin, both of which would require +the UI to be rendered in JavaScript on the client side. + +The Gerrit project does not have the development resources necessary +to support two parallel UI implementations (GWT based JavaScript +and server-side rendering). Consequently only one is implemented. + +There are number of web browsers available with full JavaScript +support, and nearly every operating system (including any PDA-like +mobile phone) comes with one standard. Users who are committed +to developing changes for a Gerrit managed project can be expected +to be able to run a JavaScript enabled browser, as they also would +need to be running Git in order to contribute. + +There are a number of open source browsers available, including +Firefox and Chromium. Users have some degree of choice in their +browser selection, including being able to build and audit their +browser from source. + +The majority of the content stored within Gerrit is also available +through other means, such as gitweb or the `git://` protocol. +Any existing search engine spider can crawl the server-side HTML +produced by gitweb, and thus can index the majority of the changes +which might appear in Gerrit. Some engines may even choose to +crawl the native version control database, such as ohloh.net does. +Therefore the lack of support for most search engine spiders is a +non-issue for most Gerrit deployments. + + +Product Integration +------------------- + +Gerrit integrates with an existing gitweb installation by optionally +creating hyperlinks to reference changes on the gitweb server. + +Gerrit integrates with an existing git-daemon installation by +optionally displaying `git://` URLs for users to download a +change through the native Git protocol. + +Gerrit integrates with any OpenID provider for user authentication, +making it easier for users to join a Gerrit site and manage their +authentication credentials to it. To make use of Google Accounts +as an OpenID provider easier, Gerrit has a shorthand "Sign in with +a Google Account" link on its sign-in screen. Gerrit also supports +a shorthand sign in link for Yahoo!. Other providers may also be +supported more directly in the future. + +Gerrit integrates with some types of corporate single-sign-on (SSO) +solutions, typically by having the SSO authentication be performed +in a reverse proxy web server and then blindly trusting that all +incoming connections have been authenticated by that reverse proxy. +When configured to use this form of authentication, Gerrit does +not integrate with OpenID providers. + +When installing Gerrit, administrators may optionally include an +HTML header or footer snippet which may include user tracking code, +such as that used by Google Analytics. This is a per-instance +configuration that must be done by hand, and is not supported +out of the box. Other site trackers instead of Google Analytics +can be used, as the administrator can supply any HTML/JavaScript +they choose. + +Gerrit does not integrate with any Google service, or any other +services other than those listed above. + + +Standards / Developer APIs +-------------------------- + +Gerrit uses an XSRF protected variant of JSON-RPC 1.1 to communicate +between the browser client and the server. + +As the protocol is not the GWT-RPC protocol, but is instead a +self-describing standard JSON format it is easily implemented by +any 3rd party client application, provided the client has a JSON +parser and HTTP client library available. + +As the entire command set necessary for the standard web browser +based UI is exposed through JSON-RPC over HTTP, there are no other +data feeds or command interfaces to the server. + +Commands requiring user authentication may require the user agent to +complete a sign-in cycle through the user's OpenID provider in order +to establish the HTTP cookie Gerrit uses to track user identity. +Automating this sign-in process for non-web browser agents is +outside of the scope of Gerrit, as each OpenID provider uses its own +sign-in sequence. Use of OpenID providers which have difficult to +automate interfaces may make it impossible for non-browser agents +to be used with the JSON-RPC interface. + +* link:http://json-rpc.org/wd/JSON-RPC-1-1-WD-20060807.html[JSON-RPC 1.1] +* link:http://android.git.kernel.org/?p=tools/gwtjsonrpc.git;a=blob;f=README;hb=HEAD[XSRF JSON-RPC] + + +Privacy Considerations +---------------------- + +Gerrit stores the following information per user account: + +* Full Name +* Preferred Email Address +* Mailing Address '(Optional)' +* Country '(Optional)' +* Phone Number '(Optional)' +* Fax Number '(Optional)' + +The full name and preferred email address fields are shown to any +site visitor viewing a page containing a change uploaded by the +account owner, or containing a published comment written by the +account owner. + +Showing the full name and preferred email is approximately the same +risk as the `From` header of an email posted to a public mailing +list that maintains archives, and Gerrit treats these fields in +much the same way that a mailing list archive might handle them. +Users who don't want to expose this information should either not +participate in a Gerrit based online community, or open a new email +address dedicated for this use. + +As the Gerrit UI data is only available through XSRF protected +JSON-RPC calls, "screen-scraping" for email addresses is difficult, +but not impossible. It is unlikely a spammer will go through the +effort required to code a custom scraping application necessary +to cull email addresses from published Gerrit comments. In most +cases these same addresses would be more easily obtained from the +project's mailing list archives. + +The snail-mail mailing address, country, and phone and fax numbers +are gathered to help project leads contact the user should there +be a legal question regarding any change they have uploaded. +This data is only visible to the account owner and to the Gerrit +site administrator. It is expected that the information would only +be revealed with a valid court subpoena, but this is really left +to the discretion of the Gerrit site administrator as to when it +is reasonable to reveal this information to a 3rd party. + +All user account information is stored unencrypted in the Gerrit +metadata store, typically a PostgreSQL database. + + +Spam and Abuse Considerations +----------------------------- + +Gerrit makes no attempt to detect spam changes or comments. The +somewhat high barrier to entry makes it unlikely that a spammer +will target Gerrit. + +To upload a change, the client must speak the native Git protocol +embedded in SSH, with some custom Gerrit semantics added on top. +The client must have their public key already stored in the Gerrit +database, which can only be done through the XSRF protected +JSON-RPC interface. The level of effort required to construct +the necessary tools to upload a well-formatted change that isn't +rejected outright by the Git and Gerrit checksum validations is +too high to for a spammer to get any meaningful return. + +To post and publish a comment a client must sign in with an OpenID +provider and then use the XSRF protected JSON-RPC interface to +publish the draft on an existing change record. Again, the level of +effort required to implement the Gerrit specific XSRF protections +and the JSON-RPC payload format necessary to post a draft and then +publish that draft is simply too high for a spammer to bother with. + +Both of these assumptions are also based upon the idea that Gerrit +will be a lot less popular than blog software, and thus will be +running on a lot less websites. Spammers therefore have very little +returned benefit for getting over the protocol hurdles. + +These assumptions may need to be revisited in the future if any +public Gerrit site actually notices spam. + + +Latency +------- + +Gerrit targets for sub-250 ms per page request, mostly by using +very compact JSON payloads bewteen client and server. However, as +most of the serving stack (network, hardware, PostgreSQL metadata +database) is out of control of the Gerrit developers, no real +guarantees can be made about latency. + + +Scalability +----------- + +Gerrit is designed for an open source project. Roughly this +amounts to parameters such as the following: + +.Design Parameters +[grid="all"] +`-----------------'---------------- +Parameter Estimated Maximum +----------------------------------- +Projects 500 +Contributors 2,000 +Changes/Day 400 +Revisions/Change 2.0 +Files/Change 4 +Comments/File 2 +Reviewers/Change 1.0 +----------------------------------- + +CPU Usage +~~~~~~~~~ + +Very few, if any open source projects have more than a handful of +Git repositories associated with them. Since Gerrit treats one +Git repository as a project, an assumed limit of 500 projects +is reasonable. Only an operating system distribution project +would really need to be tracking more than a handful of discrete +Git repositories. + +Almost no open source project has 2,000 contributors over all time, +let alone on a daily basis. This figure of 2,000 was WAG'd by +looking at PR statements published by cell phone companies picking +up the Android operating system. If all of the stated employees in +those PR statements were working on *only* the open source Android +repositories, we might reach the 2,000 estimate listed here. Knowing +these companies as being very closed-source minded in the past, it +is very unlikely all of their Android engineers will be working on +the open source repository, and thus 2,000 is a very high estimate. + +The estimate of 400 changes per day was WAG'd off some estimates +originally obtained from Android's development history. Writing a +good change that will be accepted through a peer-review process +takes time. The average engineer may need 4-6 hours per change just +to write the code and unit tests. Proper design consideration and +additional but equally important tasks such as meetings, interviews, +training, and eating lunch will often pad the engineer's day out +such that suitable changes are only posted once a day, or once +every other day. For reference, the entire Linux kernel has an +average of only 79 changes/day. + +The estimate of 2 revisions/change means that on average any +given change will need to be modified once to address peer review +comments before the final revision can be accepted by the project. +Executing these revisions also eats into the contributor's time, +and is another factor limiting the number of changes/day accepted +by the Gerrit instance. + +The estimate of 1 reviewer/change means that on average only one +person will comment on a change. Usually this would be the project +lead, or someone who is familiar with the code being modified. +The time required to comment further reduces the time available +for writing one's own changes. + +Gerrit's web UI would require on average `4+F+F*C` HTTP requests to +review a change and post comments. Here `F` is the number of files +modified by the change, and `C` is the number of inline comments left +by the reviewer per file. The constant 4 accounts for the request +to load the reviewer's dashboard, to load the change detail page, +to publish the review comments, and to reload the change detail +page after comments are published. + +This WAG'd estimate boils down to <12,800 HTTP requests per day +(QPD). Assuming these are evenly distributed over an 8 hour work day +in a single time zone, we are looking at approximately 26 queries +per second (QPS). + +---- + QPD = Changes_Day * Revisions_Change * Reviewers_Change * (4 + F + F * C) + = 400 * 2.0 * 1.0 * (4 + 4 + 4 * 2) + = 12,800 + QPS = QPD / 8_Hours / 60_Seconds + = 26 +---- + +Gerrit serves most requests in under 60 ms when using the loopback +interface and a single processor. On a single CPU system there is +sufficient capacity for 16 QPS. A dual processor system should be +sufficient for a site with the estimated load described above. + +Given a more realistic estimate of 79 changes per day (from the +Linux kernel) suggests only 2,528 queries per day, and a much lower +5.2 QPS when spread out over an 8 hour work day. + +Disk Usage +~~~~~~~~~~ + +The average size of a revision in the Linux kernel once compressed +by Git is 2,327 bytes, or roughly 2 KB. Over the course of a year +a Gerrit server running with the parameters above might see an +introduction of 570 MB over the total set of 500 projects hosted in +that server. This figure assumes the majorty of the content is human +written source code, and not large binary blobs such as disk images. + + +Redundancy & Reliability +------------------------ + +Gerrit largely assumes that the local filesystem where Git repository +data is stored is always available. Important data written to disk +is also forced to the platter with an `fsync()` once it has been +fully written. If the local filesystem fails to respond to reads +or becomes corrupt, Gerrit has no provisions to fallback or retry +and errors will be returned to clients. + +Gerrit largely assumes that the metadata PostgreSQL database is +online and answering both read and write queries. Query failures +immediately result in the operation aborting and errors being +returned to the client, with no retry or fallback provisions. + +Due to the relatively small scale described above, it is very likely +that the Git filesystem and PostgreSQL based metadata database +are all housed on the same server that is running Gerrit. If any +failure arises in one of these components, it is likely to manifest +in the others too. It is also likely that the administrator cannot +be bothered to deploy a cluster of load-balanced server hardware, +as the scale and expected load does not justify the hardware or +management costs. + +Most deployments caring about reliability will setup a warm-spare +standby system and use a manual fail-over process to switch from the +failed system to the warm-spare. + +As Git is a distributed version control system, and open source +projects tend to have contributors from all over the world, most +contributors will be able to tolerate a Gerrit down time of several +hours while the administrator is notified, signs on, and brings the +warm-spare up. Pending changes are likely to need at least 24 hours +of time on the Gerrit site anyway in order to ensure any interested +parties around the world have had a chance to comment. This expected +lag largely allows for some downtime in a disaster scenario. + +Backups +~~~~~~~ + +PostgreSQL can be configured to save its write-ahead-log (WAL) +and ship these logs to other systems, where they are applied to +a warm-standby backup in real time. Gerrit instances which care +about reduduncy will setup this feature of PostgreSQL to ensure +the warm-standby is reasonably current should the master go offline. + +Gerrit can be configured to replicate changes made to the local +Git repositories over any standard Git transports. This can be +configured in `'site_path'/replication.conf` to send copies of +all changes over SSH to other servers, or to the Amazon S3 blob +storage service. + + +Logging Plan +------------ + +Gerrit does not maintain logs on its own. + +Published comments contain a publication date, so users can judge +when the comment was posted and decide if it was "recent" or not. +Only the timestamp is stored in the database, the IP address of +the comment author is not stored. + +Changes uploaded over the SSH daemon from `git push` have the +standard Git reflog updated with the date and time that the upload +occurred, and the Gerrit account identity of who did the upload. +Changes submitted and merged into a branch also update the +Git reflog. These logs are available only to the Gerrit site +administrator, and they are not replicated through the automatic +replication noted earlier. These logs are primarly recorded for an +"oh s**t" moment where the administrator has to rewind data. In most +installations they are a waste of disk space. Future versions of +JGit may allow disabling these logs, and Gerrit may take advantage +of that feature to stop writing these logs. + +A web server positioned in front of Gerrit (such as a reverse proxy) +or the hosting servlet container may record access logs, and these +logs may be mined for usage information. This is outside of the +scope of Gerrit. + + +Testing Plan +------------ + +Gerrit is currently manually tested through its web UI. + +JGit has a fairly extensive automated unit test suite. Most new +changes to JGit are rejected unless corresponding automated unit +tests are included. + + +Caveats +------- + +Reitveld can't be used as it does not provide the "submit over the +web" feature that Gerrit provides for Git. + +Gitosis can't be used as it does not provide any code review +features, but it does provide basic access controls. + +Email based code review does not scale to a project as large and +complex as Android. Most contributors at least need some sort of +dashboard to keep track of any pending reviews, and some way to +correlate updated revisions back to the comments written on prior +revisions of the same logical change. diff --git a/Documentation/index.txt b/Documentation/index.txt index 319a2ba53d..e241ee2a17 100644 --- a/Documentation/index.txt +++ b/Documentation/index.txt @@ -33,6 +33,7 @@ Developer Documentation * link:dev-readme.html[Developer Setup] * link:dev-eclipse.html[Eclipse Setup] +* link:dev-design.html[System Design] * link:i18n-readme.html[i18n Support] Gerrit resources: |