summaryrefslogtreecommitdiffstats
path: root/chromium/docs/website/site/Home/chromium-privacy/privacy-sandbox/floc/index.md
blob: a847360c550b88467ebe4fec239badfc0baae0fe (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
---
breadcrumbs:
- - /Home
  - Chromium
- - /Home/chromium-privacy
  - Chromium Privacy
- - /Home/chromium-privacy/privacy-sandbox
  - The Privacy Sandbox
page_name: floc
title: FLoC Origin Trial & Clustering
---

**This page refers to the origin trial for the initial version of FLoC, which
ran from Chrome 89 to 91.**

---

See [web.dev/floc](https://web.dev/floc) for an explanation of the idea behind
this experimental new advertising-related browser API, a component of Chrome's
Privacy Sandbox effort to support web advertising without user tracking. To
participate in the development process, see the [FLoC GitHub
repository](https://github.com/WICG/floc).

Even for developers experienced with [origin
trials](https://web.dev/origin-trials/) and [third-party origin
trials](https://web.dev/third-party-origin-trials/), the FLoC origin trial is a
bit different. That's because FLoC is two different things: a JavaScript API
that offers a signal which we hope will prove useful for interest based ads
targeting, and also an on-device clustering algorithm that generates the signal.

Figuring out the right way to perform that clustering is still very much an open
question. During the course of the Origin Trial we expect to introduce multiple
possible clustering algorithms, and we solicit feedback concerning both the
privacy and the utility of the clusters produced. We hope that during the Origin
Trial, the ad tech community will collectively figure out which tasks are well
served by the FLoC approach. As we inevitably find areas where FLoC could do
better, we look forward to public discussion about what modifications to
clustering might help serve those uses.

You might wonder: once there are multiple clustering algorithms performing FLoC
assignment, how do you know which one you're getting? Per the [draft spec for
the API](https://wicg.github.io/floc/), the object returned by cohort = await
document.interestCohort(); has two keys: an id indicating which cluster the
browser is in, and a version, a label that identifies the algorithm used to
compute that id. (The API is not permitted in an insecure context, or where
blocked by a Permissions-Policy, or on a site where you've used Chrome settings
to block cookies.)

We realize this strange situation, of a single API that might be wrapped around
multiple different possible algorithms, means the Origin Trial of FLoC is not
for the faint of heart. If you're still interested in joining us during this
early experimental stage of our development, check out [this
page](https://developer.chrome.com/blog/floc/) for the details of how to take
part.

# FLoC Algorithm Versions

## Version "chrome.2.1"

This algorithm was introduced in Chrome 89. It is similar to the approach called
SortingLSH that was
[described](https://github.com/google/ads-privacy/blob/master/proposals/FLoC/FLOC-Whitepaper-Google.pdf)
by our colleagues in Google Research and Ads in October 2020, which their
experiments indicated performs rather well for [some types of ad
targeting](https://blog.google/products/ads-commerce/2021-01-privacy-sandbox/#jump-content:~:text=in%2Dmarket%20and%20affinity%20Google%20Audiences):
"Affinity Audiences" (like "Cooking Enthusiasts") and "In-Market Audiences"
(like "people actively researching Consumer Electronics").

In this clustering technique, people are more likely to end up in the same
cohort if they browse the same web sites. Only the domain of the site is used —
not the URL or the contents of the pages, for example.

The browser instance's cohort calculation is based on the following inputs:

    A subset of the registrable domain names (eTLD+1's) in the browser's Chrome
    history for the seven-day period leading up to the cohort calculation.

    A domain name is included if some page on that domain either:

        uses the document.interestCohort() API, or

        is detected as loading ads-related resources (see [Ad Tagging in
        Chromium](https://chromium.googlesource.com/chromium/src/+/HEAD/docs/ad_tagging.md)).

    The API is disabled, and the domain name is ignored, on any page which is
    served with the HTTP response header Permissions-Policy: interest-cohort=().

    Domain names of non-publicly routable IP addresses are never included.

The inputs are turned into a cohort ID using a technique we're calling
PrefixLSH. It is similar to a SimHash variant called SortingLSH that was
[described](https://github.com/google/ads-privacy/blob/master/proposals/FLoC/FLOC-Whitepaper-Google.pdf)
by our colleagues in Google Research and Google Ads last October.

    The browser uses each domain name included in the inputs to
    deterministically produce one 50-dimensional floating-point vector whose
    coordinates are pseudorandom draws from a Gaussian distribution, with the
    pseudorandom number generator seeded from a hash of the domain name. (Note:
    ultimately in all the 50-dimensional vectors described here, only the first
    20 coordinates are ever used; the length of 50 is vestigial.)

    The browser then uses the full set of domain name inputs to
    deterministically produce a 50-bit Locality-Sensitive Hash bitvector, where
    the i'th bit indicates the sign (positive or negative) of the sum of the
    i'th coordinates of all the floating-point vectors derived from the domain
    names.

    A Chrome-operated server-side pipeline counts how many times each 50-bit
    hash occurs among [qualifying
    users](https://github.com/WICG/floc#qualifying-users-for-whom-a-cohort-will-be-logged-with-their-sync-data)
    — those for whom we log cohort calculations along with their sync data.

    The 50-bit hashes start in two big cohorts: all hashes whose first bit is 0,
    versus all hashes whose first bit is 1. Then each cohort is repeatedly
    divided into two smaller cohorts by looking at successive bits of the hash
    value, as long as such a division yields two cohorts each with at least 2000
    qualifying users. (Each cohort will comprise thousands of people total, when
    including those Chrome users for whom we don't sync cohort data.)

    The result is a list of cohorts represented as Locality-Sensitive Hash
    bitvector prefixes, which we number in lexicographic order and distribute to
    all Chrome browsers. Any browser can calculate its own 50-bit hash, find the
    unique prefix of that vector which appears in the list of cohorts, and read
    off the corresponding cohort ID.

    Note that this is an unsupervised clustering technique; no Federated
    Learning is used (despite the "FL" in the name). The only parameters of the
    clustering model are the details of pseudorandom number generation and the
    minimum cluster size threshold.

After creation of the list of cohorts based on Locality-Sensitive Hash bitvector
prefixes, we impose additional filtering criteria. Any time a browser instance's
cohort is filtered, the promise returned by document.interestCohort() rejects,
without further indication of the reason for rejection.

    Some filtering is calculated by the server-side pipeline, and the result is
    included with the list of cohort prefixes distributed to all Chrome
    instances:

        A cohort is filtered if it has too few qualifying users. (This is not
        possible at the outset, since the server-side clustering pipeline would
        not produce an under-sized cohort, but it could happen over time as
        people's browsing behavior changes. We do not handle changing cohort
        sizes by re-calculating the list of LSH prefixes, since that would
        change the meaning of existing cohorts ids.)

        A cohort is filtered if the browsing behavior of its qualifying users
        has a higher-than-typical rate of visits to web pages on sensitive
        topics. See [this
        paper](https://docs.google.com/a/chromium.org/viewer?a=v&pid=sites&srcid=Y2hyb21pdW0ub3JnfGRldnxneDo1Mzg4MjYzOWI2MzU2NDgw)
        for an explanation of the t-closeness calculation.

    Other filtering happens in an individual browser instance:

        An individual browser instance's cohort is filtered if the inputs to the
        cohort id calculation has fewer than seven domain names.

        An individual browser instance's cohort is filtered any time its user
        clears any browsing history data or other site data; a new cohort id is
        eventually re-computed without the cleared history.

        An individual browser instance's cohort is filtered in incognito
        (private browsing) mode

All details are specific to this particular version of FLoC clustering, and
subject to change in future clustering algorithms.

Observed statistics of the cohorts created by this clustering algorithm, based
on data from qualifying Chrome users:

    Number of cohorts, before any filtering: 33,872

    Number of LSH bits used to define a cohort: between 13 and 20

    Minimum number of qualifying Chrome users in a cohort: 2000

    Minimum number of different qualifying Chrome user browsing histories (sets
    of visited domains) in a cohort: 735

    Number of cohorts filtered due to sensitive browsing t-closeness test
    (t=0.1): 792 (approx. 2.3%)