BM2 Documentation

Benchmark Time Series Page

Overview

Main context

Note: Some of the terms and concepts in the table below are described in greater detail elsewhere. The additional documentation can often be accessed through a link.

Database	The database from which results for the time series were extracted.
Report date	The date at which the web page was generated.
Host	The physical computer on which the benchmark producing this time series was executed. Note: It is assumed that the HW/SW specifications of the host do not change significantly during the time span of the time series. (The principle is that significant changes in the time series should be caused only by changes in the product, i.e. Qt.)
Platform	The general environment used for building and executing the product being measured (typically an OS/compiler combination).
Branch	Essentially the version of the product being measured. The branch is normally made up of two components: The git repository and the git branch.
Target snapshots	The requested subsequence of snapshots for which results for this host/platform/branch/benchmark/metric combination potentially exist in the database.
Difference tolerance	A real value ≥ 1 that decides whether a change between two median observations in the time series is considered significant or not.
Minimum durability tolerance	The minimum length a contiguous sequence of significantly equal median observations must have for it to achieve a durability score greater than zero. Once the sequence is at least this long, the durability score grows linearly to 1 at a rate that depends on the maximum durability tolerance.
Maximum durability tolerance	The length of a contiguous sequence of significantly equal median observations that is sufficient to achieve the maximum durability score of 1. The durability score for shorter sequences falls linearly to 0 at a rate that depends on the minimum durability tolerance.
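
The two durability tolerances can be read as the end points of a linear ramp. The Python sketch below is one possible reading of the descriptions above, not BM2's actual implementation. The function name durability_score and the exact interpolation are assumptions made for illustration, in particular the choice that a sequence of exactly the minimum length already scores slightly above zero.

```python
def durability_score(length: int, min_tol: int, max_tol: int) -> float:
    """Hypothetical durability score in [0, 1] for a sequence of `length` snapshots.

    min_tol / max_tol stand for the minimum / maximum durability tolerance.
    """
    if min_tol < 1 or max_tol < min_tol:
        raise ValueError("expected 1 <= min_tol <= max_tol")
    if length < min_tol:
        return 0.0  # too short to count as durable at all
    if length >= max_tol:
        return 1.0  # long enough for the maximum score
    # Assumed linear ramp: the first qualifying length scores just above 0,
    # and the score reaches 1 exactly at max_tol.
    return (length - min_tol + 1) / (max_tol - min_tol + 1)
```

With a minimum tolerance of 3 and a maximum tolerance of 10, this sketch gives 0 for a sequence of length 2, 0.5 for length 6, and 1 for length 10 or more.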

Time series statistics

Note: Some of the terms and concepts in the table below are described in greater detail elsewhere. The additional documentation can often be accessed through a link.

MS	Missing snapshots, i.e. the number of target snapshots for which no results exist. A high value might indicate unstable execution.
LSD	Last Snapshot Distance, i.e. the distance between the last target snapshot and the last snapshot in the time series. If the last target snapshot is the last one available in the database, a high value might indicate that the benchmark currently fails to produce results.
NI	Total number of observations explicitly flagged as invalid. An invalid observation is typically caused by a failed check such as QVERIFY().
NZ	Total number of non-positive observations. Normally an observation must be positive to be valid. Note: A non-positive observation is not necessarily flagged as invalid (see NI).
NC	Number of significant changes. A high value might indicate unstable or fluctuating results.
MDRSE	Median of the valid relative standard errors of all snapshots. A high value might indicate unstable or fluctuating results.
RSEMD	Relative standard error (see above) of the valid median observations of all snapshots. A high value might indicate either 1) unstable or fluctuating results or 2) stable changes of a high magnitude.
LC	Last significant change. The higher the value is above 1, the more strongly it represents an improvement. The lower the value is below 1, the more strongly it represents a regression.
LCDA	Number of days (relative to the report date) since the first observation for the last significant change snapshot was uploaded to the database. The distance (in terms of number of target snapshots) between the last significant change snapshot and the last target snapshot is shown in parentheses.
LCMS	Magnitude score of the last significant change. This score indicates the strength of the last significant change as a value ranging from 0 (weak) to 1 (strong):
LCGSS	Global separation score for the last significant change. This score indicates how well the median observation at the last significant change snapshot is separated from the median observations at all preceding snapshots in the time series. The median observation at the last base snapshot is used as the maximum separation reference. The score ranges from 0 (weak separation) to 1 (strong separation). This score roughly measures how close the median observation at the last significant change is to representing an "all-time high" (or "all-time low") up to this point in the history.
LCLSS	Local separation score for the last significant change. This score indicates how well the median observations on each side of the last significant change snapshot are separated from each other. Snapshots before the last base snapshot are not considered. The median observation at the base snapshot is used as the maximum separation reference. The score ranges from 0 (weak separation) to 1 (strong separation).
LCDS1	Durability score 1 for the last significant change. This score indicates the distance (in terms of number of snapshots) from the last significant change to its base snapshot. The score ranges from 0 (weak durability) to 1 (strong durability) and is scaled against the min/max durability tolerances. This score measures how long the median observation stayed near the base value before the last significant change occurred.
LCDS2	Durability score 2 for the last significant change. This score indicates the distance (in terms of number of snapshots) from the last significant change to the end of the time series. The score ranges from 0 (weak durability) to 1 (strong durability) and is scaled against the min/max durability tolerances. This score measures how long the median observation at the last significant change has stayed essentially the same.
LCSS	Stability score for the last significant change: LCMS * LCGSS * LCLSS * LCDS1 * LCDS2. The higher this score, the higher the likelihood that the last significant change is or will become permanent.
LCSS1	Stability score for the last significant change that ignores the history after that change: LCMS * LCGSS * LCLSS * LCDS1. The higher this score, the higher the likelihood that the last significant change is or will become permanent. However, since LCDS2 is omitted from the product, a high LCSS1 is more likely than a high LCSS to be caused by an outlier.
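
The two stability scores are plain products of the component scores listed in the table. The following Python sketch shows how the products combine. The function name stability_scores and the range check are assumptions added for illustration; only the two formulas come from the table.

```python
from math import prod


def stability_scores(lcms, lcgss, lclss, lcds1, lcds2):
    """Return (LCSS, LCSS1) computed from the five component scores.

    Each component is expected to lie in [0, 1], as described in the table.
    """
    factors = (lcms, lcgss, lclss, lcds1, lcds2)
    for value in factors:
        if not 0.0 <= value <= 1.0:
            raise ValueError("component scores must lie in [0, 1]")
    lcss1 = prod(factors[:4])  # LCMS * LCGSS * LCLSS * LCDS1
    lcss = lcss1 * lcds2       # additionally weighted by LCDS2
    return lcss, lcss1
```

Because every factor is at most 1, LCSS can never exceed LCSS1. A change that has just occurred has an LCDS2 near 0 and therefore a low LCSS, even when its LCSS1 is high.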

Benchmark and metric

The benchmark name consists of three subnames and is formatted like this:

<subname 1>:<subname 2>(<subname 3>)

The subnames can essentially be anything not containing the characters ':', '(', and ')'. Only subname 3 may contain whitespace. For benchmark results generated by QTestLib, the subnames always correspond to test case, test function, and data tag respectively.
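
As an illustration of the naming format, the Python sketch below splits a benchmark name into its three subnames. It is not part of BM2. The function name parse_benchmark_name is hypothetical, the sample name in the usage note is made up, and allowing an empty subname 3 is an assumption.

```python
import re

# Subnames 1 and 2: no ':', '(', ')' and no whitespace.
# Subname 3: no ':', '(', ')'; whitespace allowed (an empty value is assumed to be legal).
_NAME_RE = re.compile(r"([^:()\s]+):([^:()\s]+)\(([^:()]*)\)")


def parse_benchmark_name(name: str) -> tuple[str, str, str]:
    """Split '<subname 1>:<subname 2>(<subname 3>)' into its three parts."""
    match = _NAME_RE.fullmatch(name)
    if match is None:
        raise ValueError(f"not a valid benchmark name: {name!r}")
    return match.group(1), match.group(2), match.group(3)
```

For a QTestLib-style name such as "tst_QString:toUpper(long string)", the sketch returns the test case "tst_QString", the test function "toUpper", and the data tag "long string".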

The metric name is one of a set of predefined metric names, each of which is classified as either "lower is better" (like walltime) or "higher is better" (like fps).
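
The metric classification determines how a change between two median observations is oriented. The Python sketch below shows one way this could work so that a ratio above 1 always means an improvement, and how such a ratio might be compared against the difference tolerance. Both formulas, the function names, and the two-entry metric table are assumptions for illustration; this documentation does not specify the exact computation BM2 uses.

```python
# Only the two metrics named in the text; the real set of predefined metrics is larger.
LOWER_IS_BETTER = {"walltime": True, "fps": False}


def change_ratio(metric: str, base: float, current: float) -> float:
    """Assumed orientation: > 1 is an improvement, < 1 is a regression."""
    if base <= 0 or current <= 0:
        raise ValueError("median observations must be positive")
    if LOWER_IS_BETTER[metric]:
        return base / current  # a smaller current value is better
    return current / base      # a larger current value is better


def is_significant(ratio: float, difftol: float) -> bool:
    """Assumed test: the ratio leaves the band [1/difftol, difftol]."""
    if difftol < 1:
        raise ValueError("difference tolerance must be >= 1")
    return ratio > difftol or ratio < 1.0 / difftol
```

Under these assumptions, a walltime that drops from 100 to 50 and an fps value that rises from 30 to 60 both yield a ratio of 2.0, which is significant for a difference tolerance of 1.1.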

Time Series Plot

Snapshots and main graph

Significant changes

Sample size

Non-positive observations

Invalid observations

Statistical dispersion

Statistical dispersion in a time series is measured in terms of relative standard error (RSE).
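
For reference, the Python sketch below computes a relative standard error using the textbook definition (sample standard deviation divided by the square root of the sample size, relative to the mean), and the MDRSE statistic as the median of the per-snapshot values. Whether BM2 uses exactly this definition is an assumption, as are the function names and the rule that a snapshot with fewer than two observations has no valid RSE.

```python
from math import sqrt
from statistics import mean, median, stdev


def relative_standard_error(values):
    """Textbook RSE of a sample, or None if it cannot be computed."""
    if len(values) < 2:
        return None  # the sample standard deviation needs two observations
    m = mean(values)
    if m == 0:
        return None  # relative error is undefined for a zero mean
    standard_error = stdev(values) / sqrt(len(values))
    return standard_error / abs(m)


def mdrse(snapshots):
    """Median of the valid per-snapshot RSEs (one list of observations per snapshot)."""
    rses = [relative_standard_error(obs) for obs in snapshots]
    valid = [r for r in rses if r is not None]
    return median(valid) if valid else None
```

A snapshot whose observations are all identical has an RSE of 0. The more the observations scatter relative to their mean, the larger the value.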

Missing data

Snapshot details

When clicking on a snapshot in the plot, the two tables below the plot are filled with various details about the selected snapshot.