Name: icu
URL: http://site.icu-project.org/
Version: 62.1
License: MIT
Security Critical: yes

Description:
This directory contains the source code of ICU 62.1 for C/C++.

A. How to update ICU

1. Run "scripts/update.sh <version>" (e.g. 60-1).
   This will download ICU from the upstream svn repository.
   It preserves Chrome-specific build files (*local.mk) and
   converter files (see section C).

   BUILD.gn and icu.gyp* files are automatically updated, too.

2. Review and apply patches/changes in "D. Local Modifications" if
   necessary/applicable. Update patch files in patches/.

3. Follow the instructions in section B on building ICU data files.
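
As a rough illustration of step 1, the helper below turns a dotted release
number into the dashed form shown in the example (60-1). The function name
icu_version_arg is made up for this sketch, and the dot-to-dash rule is an
assumption inferred from that single example; check scripts/update.sh for
the form it actually accepts.

```shell
#!/bin/sh
# Hypothetical helper: convert a dotted ICU version (62.1) into the
# dashed form (62-1). Assumes the argument mirrors the dotted version.
icu_version_arg() {
  printf '%s\n' "$1" | tr '.' '-'
}

# Dry run: print the update command instead of running it.
echo "scripts/update.sh $(icu_version_arg 62.1)"
```

Printing the command first makes it easy to confirm the version string
before anything is downloaded.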


B. How to build ICU data files


Pre-built data files are generated and checked in with the following steps:

1. icu data files for Chrome OS, Linux, Mac and Windows

  a. Make an ICU data build directory outside the Chromium source tree
     and cd to that directory (say, $ICUBUILDIR).

  b. Run

    ${CHROME_ICU_TREE_TOP}/source/runConfigureICU Linux --disable-layout --disable-tests


  c. Run make
     'make' will fail when pkgdata looks for root_subset.res. This
     is expected. See https://unicode-org.atlassian.net/browse/ICU-10570

  d. Run
       ${CHROME_ICU_TREE_TOP}/scripts/make_data_all.sh

     This script takes the following steps:

     i) scripts/trim_data.sh
       The full locale data for Chrome's UI languages and their select variants
       and the bare minimum locale data for other locales will be kept.

     ii) scripts/make_data.sh
       This makes icudt${version}l.dat.

     iii)  scripts/copy_data.sh common
       This copies the ICU data files for non-Android platforms
       (both Little and Big Endian) to the following locations:

       common/icudtl.dat
       common/icudtb.dat

     iv) cast/patch_locale.sh
       On top of trim_data.sh (step i), further cuts the data entries for
       Cast.

     v) Repeat ii) and iii) for cast to get cast/icudtl.dat

     vi) android/patch_locale.sh
       On top of trim_data.sh (step i), further cuts the data entries for
       Android.

     vii) Repeat ii) and iii) for Android to get android/icudtl.dat

     viii) ios/patch_locale.sh
       Further cuts the data size for iOS.

     ix) Repeat ii) and iii) for iOS to get ios/icudtl.dat

     x) Repeat ii) and iii) for Flutter to get flutter/icudtl.dat

     xi) scripts/clean_up_data_source.sh

     This reverts the results of trim_data.sh and patch_locale.sh and
     makes the tree ready for committing updated ICU data files for
     non-Android and Android platforms.

  e. Whenever data is updated (e.g. a timezone update), repeat step d as
  long as the ICU build directory used in steps a through c is kept.
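
Steps a through d can be sketched as a small script. This is only an
outline under stated assumptions: the default paths, the DRY_RUN switch
and the run and build_icu_data helpers are hypothetical and not part of
the ICU tree. By default it prints the commands rather than running them.

```shell
#!/bin/sh
# Hypothetical default locations; override them in the environment.
CHROME_ICU_TREE_TOP="${CHROME_ICU_TREE_TOP:-$HOME/chromium/src/third_party/icu}"
ICUBUILDIR="${ICUBUILDIR:-$HOME/icu-data-build}"

# Print each command when DRY_RUN is 1 (the default); run it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = 1 ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

build_icu_data() {
  # Step a: build directory outside the Chromium source tree.
  run mkdir -p "$ICUBUILDIR" || return 1
  run cd "$ICUBUILDIR" || return 1
  # Step b: configure ICU.
  run "$CHROME_ICU_TREE_TOP/source/runConfigureICU" Linux \
    --disable-layout --disable-tests || return 1
  # Step c: make is expected to fail at root_subset.res, so its exit
  # status is ignored here. This also hides any unrelated failure.
  run make || true
  # Step d: regenerate the data files for every platform.
  run "$CHROME_ICU_TREE_TOP/scripts/make_data_all.sh"
}

build_icu_data
```

For a later data-only update (step e), only the last command needs to be
rerun from the same build directory.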

2. Note on the locale data customization

  - scripts/trim_data.sh
      a. Trim the locale data for Chrome's UI languages :
         locales, lang, region, currency, zone
      b. Trim the locale data for non-UI languages to the bare minimum :
        ExemplarCharacters, LocaleScript, layout, and the name of the
        language for a locale in its native language.
      c. Remove the legacy Chinese character set-based collations
         (big5han/gb2312han), which make little sense and which nobody uses.

  - android/patch_locale.sh
      a. Make changes to source/data/{region,lang} to exclude these data
         except the language and script names of zh_Hans and zh_Hant.
      b. Remove exemplar cities in timezone data (data/zone).
      c. Keep only the minimal calendar data in data/locales.
      d. Include currency display names for a smaller subset of currencies.
      e. Minimize the locale data for 9 locales to which Chrome on Android
         is not localized.
      f. Also apply android/brkitr.patch

  - android/brkitr.patch
      Do not use the C+J dictionary for Chinese/Japanese segmentation
      to reduce the data size. Adjust word.txt and a few other files.

C. Chromium-specific data build files and converters

They're preserved in step A.1 above. In general, there's no need to touch
them when updating ICU.

1. source/data/mappings
  - convrtrs.txt : Lists encodings and aliases required by the WHATWG
    Encoding spec plus a few extra (see the file as to why).

  - ucmlocal.txt : Lists only the converters we need.

  - *html.ucm: Mapping files per WHATWG encoding standards for EUC-JP,
    Shift_JIS, Big5 (Big5+Big5HKSCS), EUC-KR and all the single byte encodings.
    They're generated with scripts/{eucjp,sjis,big5,euckr,single_byte}_gen.sh.

  - gb18030.ucm and windows-936.ucm
    gb_table.patch was applied for the following changes. No need
    to apply it again. The patch is kept for the record.
    a. Map \xA3\xA0 to U+3000 instead of U+E5E5 in gb18030 and windows-936 per
    the encoding spec (one-way mapping in toUnicode direction).
    b. Map \xA8\xBF to U+01F9 instead of U+E7C8. Add one-way map
    from U+1E3F to \xA8\xBC (windows-936/GBK).
       See https://www.w3.org/Bugs/Public/show_bug.cgi?id=28740#c3
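
For readers unfamiliar with .ucm files, the one-way mappings above can be
pictured with the small formatter below. It relies on ICU's UCM line
syntax, where a trailing |1 marks a fallback (Unicode to bytes only) and
|3 a reverse fallback (bytes to Unicode only). The ucm_line helper is
hypothetical, and the exact lines in gb_table.patch may differ from what
it prints.

```shell
#!/bin/sh
# Hypothetical helper: format one UCM mapping line.
#   $1 = code point in hex, $2 = space-separated hex bytes,
#   $3 = precision flag (1 = Unicode to bytes only,
#                        3 = bytes to Unicode only)
ucm_line() {
  cp=$1 bytes=$2 precision=$3
  encoded=""
  for b in $bytes; do
    encoded="$encoded\\x$b"
  done
  printf '<U%s> %s |%s\n' "$cp" "$encoded" "$precision"
}

ucm_line 3000 "A3 A0" 3   # \xA3\xA0 decodes to U+3000, not the reverse
ucm_line 1E3F "A8 BC" 1   # U+1E3F encodes to \xA8\xBC, not the reverse
```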

2. source/data/*/*local.mk
  - List locales of interest to Chromium
   a. Chrome's UI languages
   b. Variants of UI languages
   c. Other locales in Accept-Language list : will only have bare minimum
   locale data

  - brklocal.mk drops some line*brk files to save space for now.

3. source/data/brkitr
  - dictionaries/khmerdict.txt: Abridged Khmer dictionary. See
    https://unicode-org.atlassian.net/browse/ICU-9451
  - rules/word_ja.txt (used only on Android)
    Added for Japanese-specific word-breaking without the C+J dictionary.
  - rules/{fi,root,zh,zh_Hant}.txt
    a. Drop *_loose.txt for fi/root and use the corresponding line_normal.txt
    b. Use line_normal by default.
    c. Drop local patches we used to have for the following issues. They'll
       be dealt with in the upstream (Unicode/CLDR).
       http://unicode.org/cldr/trac/ticket/6557
       http://unicode.org/cldr/trac/ticket/4200 (http://crbug.com/39779)

4. source/data/trnslit/root_subset.txt
   Subset of transliteration data to keep for:

5. Add {an,ku,tg,wa}.txt to source/data/{locale,lang}
   with the minimal locale data necessary for spellchecker and
   language menus.

D. Local Modifications

1. Applied locale data patches from Google obtained by diff'ing
   the upstream copy and Google's internal copy for source/data

  - patches/locale_google.patch:
    * Google's internal ICU locale changes
    * Simpler region names for Hong Kong and Macau in all locales
    * Currency signs in ru and uk locales (do not include 'tr' locale changes)
    * AM/PM, midnight, noon formatting for a few Indian locales
    * Timezone name changes in Korean and Chinese locales
    * Default digits for the Arabic locale are European digits.

  - patches/locale1.patch: Minor fixes for Korean


2. Breakiterator patches
  - patches/wordbrk.patch for word.txt
    a. Move full stops (U+002E, U+FF0E) from MidNumLet to MidNum so that
       FQDN labels can be split at '.'
    b. Move fullwidth digits (U+FF10 - U+FF19) from Ideographic to Numeric.
       See http://unicode.org/cldr/trac/ticket/6555

  - patches/khmer-dictbe.patch
    Adjust parameters to use a smaller Khmer dictionary (khmerdict.txt).
    https://unicode-org.atlassian.net/browse/ICU-9451

  - Add several common Chinese words that were dropped previously to
    source/data/cjdict/brkitr/cjdict.txt
    patch: patches/cjdict.patch
    upstream bug: https://unicode-org.atlassian.net/browse/ICU-10888
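
The effect of change a in patches/wordbrk.patch can be approximated with a
toy segmenter. It is not ICU's break iterator: it only mimics the one rule
affected, namely that a full stop classified as MidNum still joins digits
("3.14") but no longer joins letters ("www.example.com"). The function
name is hypothetical, and only U+002E is handled.

```shell
#!/bin/sh
# Toy approximation: split the input at every '.' unless the '.' sits
# between two ASCII digits. Prints one segment per line.
segment_at_full_stops() {
  printf '%s\n' "$1" | awk '{
    n = length($0); out = ""
    for (i = 1; i <= n; i++) {
      c = substr($0, i, 1)
      prev = (i > 1) ? substr($0, i - 1, 1) : ""
      nxt = (i < n) ? substr($0, i + 1, 1) : ""
      if (c == "." && !(prev ~ /[0-9]/ && nxt ~ /[0-9]/)) {
        out = out "\n.\n"   # the full stop becomes its own segment
      } else {
        out = out c
      }
    }
    print out
  }'
}

segment_at_full_stops "www.example.com"   # www / . / example / . / com
segment_at_full_stops "3.14"              # stays in one piece
```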

3. Timezone data update
  Run scripts/update_tz.sh to grab the latest version of the
  following timezone data files and put them in source/data/misc

     metaZones.txt
     timezoneTypes.txt
     windowsZones.txt
     zoneinfo64.txt

  As of Oct 27, 2018, the latest version is 2018g and the above files
  are available at the ICU GitHub repos.
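
After a timezone update, a quick check that all four files are present can
catch a partial download. The helper below is a minimal sketch;
missing_tz_files is a hypothetical name and is not part of
scripts/update_tz.sh.

```shell
#!/bin/sh
# Hypothetical helper: print the name of each expected timezone data
# file that is missing from the given directory.
missing_tz_files() {
  dir=$1
  for f in metaZones.txt timezoneTypes.txt windowsZones.txt zoneinfo64.txt; do
    [ -f "$dir/$f" ] || printf '%s\n' "$f"
  done
}

# Example, run from the top of the ICU tree:
#   missing_tz_files source/data/misc
```

Empty output means all four files are in place.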

4. Build-related changes

  - patches/wpo.patch (only needed when icudata dll is used).
    upstream bugs : https://unicode-org.atlassian.net/browse/ICU-8043
                    https://unicode-org.atlassian.net/browse/ICU-5701
  - patches/vscomp.patch for building with Visual Studio on Windows:
    do not use WINDOWS_LOCALE_API in locmap.c

  - patches/data.build.patch :
      Remove unnecessary resources : unames, collator rule source
  - patches/data.build.win.patch :
      Windows-only data build patch.
  - patches/data_symb.patch :
      Put ICU_DATA_ENTRY_POINT(icudtXX_dat) in common when we use
      the icu data file or icudt.dll

5. Fix -Wsign-compare warning in EnumSet::isValidEnum()

  - patches/isvalidenum.patch
    upstream bug: https://unicode-org.atlassian.net/browse/ICU-13509

7. Update IANA language tag/subtag mapping and add missing canonicalization for
    deprecated regions

  - patches/locid_map.patch
  - upstream bugs:
    https://unicode-org.atlassian.net/browse/ICU-13726
    https://unicode-org.atlassian.net/browse/ICU-13723
    https://unicode-org.atlassian.net/browse/ICU-13721
    https://unicode-org.atlassian.net/browse/ICU-13720
    https://unicode-org.atlassian.net/browse/ICU-13719

8. Double conversion library build failure

  - patches/double_conversion.patch
  - upstream bugs:
    https://unicode-org.atlassian.net/browse/ICU-13750
    https://github.com/google/double-conversion/issues/66

9. Cherry-pick Greek lowercase fix from the upstream

  - patches/greek_lowercase.patch
  - upstream bug (fixed in 62.2-to-be)
    https://unicode-org.atlassian.net/browse/ICU-13851

10. Max significant digit is always 6

  - patches/nf_maxsig.patch
  - upstream bug:
    https://unicode-org.atlassian.net/browse/ICU-13852

11. Align memory buffer used in Decimal Format

  - patches/decimalformat_align.patch
  - upstream bug:
    https://unicode-org.atlassian.net/browse/ICU-20039

12. Cherry-pick an upstream CL for quarter support in RelativeDate format

  - patches/reldate_quarter.patch
  - upstream bug:
    https://unicode-org.atlassian.net/browse/ICU-20022
  - fix: https://github.com/unicode-org/icu/pull/77

13. Cherry-pick an upstream CL for BCP47 language tag validation

  - patches/langtag_bcp47.patch
  - upstream bug:
    https://unicode-org.atlassian.net/browse/ICU-20098
  - fix: https://github.com/unicode-org/icu/pull/102

14. Another cherry-pick of upstream CL for duplicate U-extension
    handling.

  - patches/langtag_uext.patch
  - upstream bug:
    https://unicode-org.atlassian.net/browse/ICU-20140
  - fix: https://github.com/unicode-org/icu/pull/136

15. Cherry-pick an upstream patch for Windows timezone detection

  - patches/wintz_detection.patch
  - upstream bugs:
    https://unicode-org.atlassian.net/browse/ICU-13842
    https://unicode-org.atlassian.net/browse/ICU-13827
  - fixes:
      https://github.com/unicode-org/icu/pull/55
      https://github.com/unicode-org/icu/pull/129

16. Align memory buffer used in UnicodeSet on ARM.

  - patches/arm_align.patch
  - upstream bug:
    https://unicode-org.atlassian.net/browse/ICU-20001

17. Windows 7 timezone detection fix

  - patches/win7_tz.patch
  - upstream bug:
    https://unicode-org.atlassian.net/browse/ICU-20302
  - fixes:
    https://github.com/unicode-org/icu/pull/315
    https://github.com/unicode-org/icu/pull/318