Unicode support in busybox

There are several scenarios where we need to handle unicode
correctly.

Shell input

We want to correctly handle input of unicode characters.
There are several problems with it. Just handling input
as a sequence of bytes would break any editing. This was fixed,
and now lineedit operates on an array of wchar_t's.
But we also need to handle the following issues:

* It is unreasonable to expect that the output device supports
  _any_ unicode chars. Perhaps we need to avoid printing
  those chars which are not supported by the output device.
  Examples: chars which are not present in the font,
  chars which are not assigned in unicode,
  combining chars (especially trying to combine bad pairs:
  a_chinese_symbol + "combining grave accent" = ??!)

* We need to account for the fact that unicode chars have
  different widths: 0 for combining chars, 1 for usual,
  2 for ideograms (are there 3+ wide chars?).

* Bidirectional handling. If a user wants to echo a phrase
  in Hebrew, they type: echo "srettel werbeH"
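
The width issue can be illustrated with a minimal sketch. This is not
busybox code: utf8_decode(), demo_width() and demo_str_width() are
hypothetical names, and the width table is an assumption that covers
only a handful of ranges (a real table would follow the TR11 data).
Invalid bytes and control chars are counted as one column each, on the
assumption that a placeholder is printed for them.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence starting at s (at most n bytes available).
 * Stores the code point in *cp and returns the number of bytes consumed,
 * or 0 if the bytes do not form a valid sequence. */
size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *cp)
{
	size_t len, i;
	uint32_t c, min;

	if (n == 0)
		return 0;
	if (s[0] < 0x80) {
		*cp = s[0];
		return 1;
	}
	if ((s[0] & 0xE0) == 0xC0) {
		len = 2; c = s[0] & 0x1F; min = 0x80;
	} else if ((s[0] & 0xF0) == 0xE0) {
		len = 3; c = s[0] & 0x0F; min = 0x800;
	} else if ((s[0] & 0xF8) == 0xF0) {
		len = 4; c = s[0] & 0x07; min = 0x10000;
	} else {
		return 0; /* stray continuation byte or invalid lead byte */
	}
	if (n < len)
		return 0; /* truncated sequence */
	for (i = 1; i < len; i++) {
		if ((s[i] & 0xC0) != 0x80)
			return 0;
		c = (c << 6) | (s[i] & 0x3F);
	}
	/* Reject overlong forms, surrogates and out-of-range values */
	if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
		return 0;
	*cp = c;
	return len;
}

/* Simplified width lookup (assumption: only a few ranges are listed).
 * Returns -1 for control chars, 0 for combining marks, 2 for wide chars. */
int demo_width(uint32_t cp)
{
	if (cp < 0x20 || (cp >= 0x7F && cp < 0xA0))
		return -1;
	if (cp >= 0x0300 && cp <= 0x036F)
		return 0; /* combining diacritical marks */
	if ((cp >= 0x4E00 && cp <= 0x9FFF)      /* CJK unified ideographs */
	 || (cp >= 0xAC00 && cp <= 0xD7A3)      /* Hangul syllables */
	 || (cp >= 0xFF01 && cp <= 0xFF60))     /* fullwidth forms */
		return 2;
	return 1;
}

/* Screen columns needed for the n-byte UTF-8 string s.
 * Invalid bytes and control chars count as one column each. */
int demo_str_width(const char *s, size_t n)
{
	const unsigned char *p = (const unsigned char *)s;
	int cols = 0;

	while (n > 0) {
		uint32_t cp = 0;
		size_t len = utf8_decode(p, n, &cp);
		int w = len ? demo_width(cp) : -1;

		if (len == 0)
			len = 1; /* step over one bad byte */
		cols += (w < 0) ? 1 : w;
		p += len;
		n -= len;
	}
	return cols;
}
```

With this table, "e" followed by U+0301 (combining acute accent) takes
one column, and a single CJK ideograph takes two.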

Editors (vi, ed)

This case is a bit similar to "shell input", but unlike the shell,
editors may encounter many more unexpected unicode sequences
(try to load a random binary file...), and they need to preserve
them, unlike the shell, which can afford to drop bogus input.

more, less

Need to correctly display any input file. Ideally, with
ASCII/unicode/filtered_unicode option or keyboard switch.
Note: need to handle tabs and backspaces specially
(bksp is for manpage compat).
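
A minimal sketch of the tab and backspace accounting, for ASCII input
only. demo_line_columns() is a hypothetical name, the tab stop of 8 is
an assumption, and unicode widths are deliberately ignored here.

```c
/* Return the number of screen columns an ASCII line occupies.
 * Tabs advance to the next multiple of 8 (assumed tab stop);
 * backspace moves one column left, as in man page overstriking. */
int demo_line_columns(const char *s)
{
	int col = 0, max = 0;

	for (; *s != '\0'; s++) {
		if (*s == '\t') {
			col += 8 - col % 8;
		} else if (*s == '\b') {
			if (col > 0)
				col--;
		} else {
			col++;
		}
		if (col > max)
			max = col; /* remember the rightmost column reached */
	}
	return max;
}
```

An overstruck word such as "N\bNA\bAM\bME\bE" (bold "NAME" in a
formatted man page) occupies 4 columns, not 12.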

cut, fold, watch

May need ability to cut unicode string to specified number of wchars
and/or to specified screen width. Need to handle tabs specially.
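
Cutting to a number of chars can be sketched on the raw bytes, as long
as the input is assumed to be valid UTF-8. demo_utf8_prefix_len() is a
hypothetical name. It counts code points, not screen columns, so
combining chars and wide chars still need separate width handling.

```c
#include <stddef.h>

/* Return how many bytes of the UTF-8 string s make up its first
 * nchars characters, so that a cut never splits a multibyte sequence.
 * Assumes s is valid UTF-8; continuation bytes look like 10xxxxxx. */
size_t demo_utf8_prefix_len(const char *s, size_t nchars)
{
	const unsigned char *p = (const unsigned char *)s;
	size_t i = 0;

	while (p[i] != '\0' && nchars > 0) {
		i++; /* lead byte (or plain ASCII byte) */
		while ((p[i] & 0xC0) == 0x80)
			i++; /* skip continuation bytes */
		nchars--;
	}
	return i;
}
```

The caller can then output the first demo_utf8_prefix_len(s, n) bytes
of s and be sure the cut lands on a character boundary.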

sed, awk, grep

Handle unicode-aware regexp match.

ls (multi-column display)

ls will fail to line up columnar output if it does not account
for character widths (and maybe filter out some of them, see
above). OTOH, non-columnar views (ls -1, ls -l, ls | cat)
should NOT filter out bad unicode (but need to filter out
control chars; coreutils does that). Note that unlike more/less,
tabs and backspaces need no special handling.
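
A minimal sketch of the control char filtering for non-columnar views.
demo_mask_control_chars() is a hypothetical name, and using '?' as the
replacement is an assumption. Only ASCII control chars are handled;
C1 controls (U+0080..U+009F) would need actual decoding.

```c
/* Replace ASCII control characters in s with '?', in place.
 * Bytes >= 0x80 are left untouched, so UTF-8 sequences (valid or not)
 * pass through unchanged. */
void demo_mask_control_chars(char *s)
{
	unsigned char *p = (unsigned char *)s;

	for (; *p != '\0'; p++) {
		if (*p < 0x20 || *p == 0x7F)
			*p = '?';
	}
}
```

Tabs and backspaces are treated like any other control char here,
while the bytes of a non-ASCII filename are preserved as is.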

top, ps

Need to perform filtering similar to ls.

Filename display (in error messages and elsewhere)

Need to perform filtering similar to ls.


TODO: write an email to Asmus Freytag (asmus@unicode.org),
author of http://unicode.org/reports/tr11/