Commit graph

262 commits

Author SHA1 Message Date
Daniel Eklöf
a02c0c8d4d
vt: utf8: insert a REPLACEMENT CHARACTER when an invalid UTF-8 sequence is detected 2025-03-18 18:28:09 +01:00
Daniel Eklöf
878e07da59
vt: utf8: don't discard current byte when an invalid UTF-8 sequence is detected
Example:

  printf "pok\xe9mon\n"

would result in 'pokon' - the 'm' has been discarded along with E9.

While correct, in some sense, it's perhaps not intuitive.

This patch changes the VT parser to instead discard everything up to
the invalid byte, but then try the invalid byte from the ground
state. This way, whatever follows an invalid UTF-8 sequence, be it
plain ASCII or a longer (and valid) UTF-8 sequence, is printed as
expected instead of being discarded.
2025-03-18 14:37:28 +01:00
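The retry-from-ground behaviour described in 878e07da59 can be sketched roughly as follows. This is a minimal stand-alone decoder, not foot's actual VT state machine: `struct utf8_state`, `ground()` and `decode()` are hypothetical names, and the sketch skips overlong and range checks. It also emits U+FFFD for the dropped sequence, which corresponds to the later commit a02c0c8d4d rather than to this one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define REPLACEMENT_CHAR 0xFFFDu

/* Hypothetical decoder state (illustration only). */
struct utf8_state {
    uint32_t cp;    /* codepoint being assembled */
    int remaining;  /* continuation bytes still expected */
};

/* Process one byte from the ground state; returns 1 if *out was set. */
static int ground(struct utf8_state *s, uint8_t b, uint32_t *out)
{
    if (b < 0x80) { *out = b; return 1; }
    if ((b & 0xE0) == 0xC0) { s->cp = b & 0x1F; s->remaining = 1; return 0; }
    if ((b & 0xF0) == 0xE0) { s->cp = b & 0x0F; s->remaining = 2; return 0; }
    if ((b & 0xF8) == 0xF0) { s->cp = b & 0x07; s->remaining = 3; return 0; }
    *out = REPLACEMENT_CHAR; /* stray continuation or invalid lead byte */
    return 1;
}

/* Decodes len bytes; out must have room for 2 * len codepoints. */
size_t decode(const uint8_t *in, size_t len, uint32_t *out)
{
    assert(in != NULL && out != NULL);
    struct utf8_state s = {0, 0};
    size_t n = 0;

    for (size_t i = 0; i < len; i++) {
        uint8_t b = in[i];

        if (s.remaining > 0) {
            if ((b & 0xC0) == 0x80) {
                s.cp = (s.cp << 6) | (b & 0x3F);
                if (--s.remaining == 0)
                    out[n++] = s.cp;
                continue;
            }
            /* Invalid sequence: drop what was collected so far, emit
             * U+FFFD, then fall through and retry b from ground. */
            s.remaining = 0;
            out[n++] = REPLACEMENT_CHAR;
        }

        uint32_t cp;
        if (ground(&s, b, &cp))
            out[n++] = cp;
    }
    return n;
}
```

With the bytes of "pok", E9, "mon" as input, the 'm' is no longer swallowed together with E9.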
Daniel Eklöf
d3f692990e
term+vt: refactor: move "utf8" char processing to term_process_and_print_non_ascii()
This function "prints" any non-ascii character (i.e. any character
that ends up in the action_utf8_print() function in vt.c) to the
grid. This includes grapheme cluster processing etc.

action_utf8_print() now simply calls this function.

This allows us to re-use the same functionality from other
places (like the text-sizing protocol).
2025-02-06 07:45:20 +01:00
Daniel Eklöf
e248e73753
composed: refactor: break out lookup with collision detection 2025-02-06 07:42:37 +01:00
Daniel Eklöf
1181f74d19
composed: re-factor: break out key calculation from vt.c 2025-02-06 07:42:37 +01:00
Daniel Eklöf
88dcde3ed8
term: insert-mode: handle combining characters correctly
When the client application emits combining characters, for example
multi-codepoint emojis, in insert-mode, we ended up pushing partial
graphemes to the right, for each codepoint, resulting in too many
cells (and with the wrong content) being inserted.

The fix is fairly simple; don't "insert" when appending characters to
an existing grapheme cluster.

This isn't something we can detect easily in print_insert() (it would
require us to do grapheme clustering again). Fortunately, we do have
the required information in action_utf8_print(). So, pass this
information as a boolean to term_print().

Closes #1947
2025-02-06 07:37:55 +01:00
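A rough sketch of the idea in 88dcde3ed8: in insert-mode, cells are only pushed to the right when a new cluster starts, not when a codepoint is appended to the cluster already in the cell. `struct line` and `print_cell()` are hypothetical stand-ins for illustration; the real change passes the boolean to term_print().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define COLS 8

/* Hypothetical single-row grid (illustration only). */
struct line {
    uint32_t cells[COLS];
};

/* Write ch at col. In insert-mode, shift the tail of the line one cell
 * to the right first, unless we are merely extending the grapheme
 * cluster that already occupies the cell. */
void print_cell(struct line *l, int col, uint32_t ch,
                bool insert_mode, bool appending_to_cluster)
{
    assert(col >= 0 && col < COLS);

    if (insert_mode && !appending_to_cluster) {
        memmove(&l->cells[col + 1], &l->cells[col],
                (size_t)(COLS - col - 1) * sizeof(l->cells[0]));
    }
    l->cells[col] = ch;
}
```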
Daniel Eklöf
22e1b1610f
vt: combining chars: ensure 'key' is within range
When there's a key collision, we increment the key and check
again. When doing this, we need to ensure the key is within range,
and wrap around to 0 if the key value is too large.
2025-01-18 10:22:24 +01:00
Daniel Eklöf
b43f19cb50
vt: don't call fcft_precompose() if font is NULL
This fixes a crash when doing a partial PGO build (where we don't have
any fonts available).
2024-11-02 20:11:14 +01:00
Daniel Eklöf
a9e462d952
Remove a number of unused includes 2024-08-02 08:28:13 +02:00
Daniel Eklöf
48cf57818d
term: performance: use a bitfield to track which ascii printer to use
The things affecting which ASCII printer we use have grown...

Instead of checking everything inside term_update_ascii_printer(), use
a bitfield.

Anything affecting the printer used, must now set a bit in this
bitfield. This makes term_update_ascii_printer() much faster, since
all it needs to do is check if the bitfield is zero or not.
2024-06-26 18:39:24 +02:00
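The approach in 48cf57818d might look something like the sketch below. Only term_update_ascii_printer() is named in the commit; the flag names, `struct term` layout and `term_set_affecting()` are assumptions, and a plain flag mask is used here in place of C bit-field members.

```c
#include <assert.h>
#include <stdbool.h>

enum ascii_printer { PRINTER_FAST, PRINTER_GENERIC };

/* Hypothetical list of things that force the generic printer. */
enum {
    AFFECT_INSERT_MODE  = 1u << 0,
    AFFECT_CHARSET      = 1u << 1,
    AFFECT_SINGLE_SHIFT = 1u << 2,
    AFFECT_ALL          = (1u << 3) - 1,
};

struct term {
    unsigned affecting;          /* one bit per active condition */
    enum ascii_printer printer;
};

/* All the selection logic collapses into a single zero test. */
void term_update_ascii_printer(struct term *t)
{
    t->printer = t->affecting == 0 ? PRINTER_FAST : PRINTER_GENERIC;
}

/* Anything that affects the printer must go through here. */
void term_set_affecting(struct term *t, unsigned bit, bool on)
{
    assert((bit & ~(unsigned)AFFECT_ALL) == 0);
    if (on)
        t->affecting |= bit;
    else
        t->affecting &= ~bit;
    term_update_ascii_printer(t);
}
```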
Daniel Eklöf
7378ecf9a7
vt: unittest: verify emoji_vs list is sorted 2024-06-25 08:23:40 +02:00
Daniel Eklöf
9665661445
vt: only apply VS-15/16 to valid sequences
At compile time, build a lookup table from the Unicode data file
'emoji-variation-sequences.txt'.

At run-time, when we detect a VS-15/16 sequence, do a lookup in this
table, and enforce the variation selector iff the sequence is valid.

Closes #1742
2024-06-25 08:20:21 +02:00
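A sketch of the run-time lookup from 9665661445, combined with the explicit widths from 94583703e1. The table below is a tiny hand-picked excerpt rather than the generated one, and `vs_is_valid()`, `apply_vs()` and `emoji_vs_is_sorted()` are hypothetical names. Because the lookup is a binary search, the table must be sorted, which is what the unit test in 7378ecf9a7 guards.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Excerpt only: a few base codepoints that have VS-15/16 sequences. */
static const uint32_t emoji_vs[] = {0x0023, 0x00A9, 0x00AE, 0x263A, 0x2764};
#define EMOJI_VS_COUNT (sizeof(emoji_vs) / sizeof(emoji_vs[0]))

static int cmp_u32(const void *a, const void *b)
{
    uint32_t x = *(const uint32_t *)a, y = *(const uint32_t *)b;
    return (x > y) - (x < y);
}

bool emoji_vs_is_sorted(void)
{
    for (size_t i = 1; i < EMOJI_VS_COUNT; i++)
        if (emoji_vs[i - 1] >= emoji_vs[i])
            return false;
    return true;
}

bool vs_is_valid(uint32_t base)
{
    return bsearch(&base, emoji_vs, EMOJI_VS_COUNT,
                   sizeof(emoji_vs[0]), cmp_u32) != NULL;
}

/* Returns the cluster width after seeing a variation selector. */
int apply_vs(uint32_t base, uint32_t selector, int width)
{
    assert(width >= 0);
    if (!vs_is_valid(base))
        return width;           /* not a valid sequence: ignore */
    if (selector == 0xFE0E)
        return 1;               /* VS-15: text presentation */
    if (selector == 0xFE0F)
        return 2;               /* VS-16: emoji presentation */
    return width;
}
```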
Daniel Eklöf
94583703e1
vt: don't ignore VS-15 (text presentation)
When we encounter either VS-15 or VS-16, set the grapheme width to 1
or 2 explicitly.
2024-06-25 08:20:20 +02:00
Daniel Eklöf
60c5d889ec
vt: DECALN: erase sixels, reset margins, home the cursor
https://vt100.net/docs/vt510-rm/DECALN.html:

  Notes on DECALN

  DECALN sets the margins to the extremes of the page, and moves the
  cursor to the home position.
2024-03-07 16:24:34 +01:00
Daniel Eklöf
74a1fa9e00
vt: update DECALN to use term_fill() 2024-03-07 16:24:33 +01:00
Daniel Eklöf
0c94bf43f2
vt: ignore VS16 (U+FE0F) when grapheme clustering is disabled
This fixes:

a) a compilation error with -Dgrapheme-clustering=disabled

b) U+FE0F allocating two cells when grapheme
   clustering has been disabled (either at compile time, in config, or
   at run-time).
2024-02-16 07:13:32 +01:00
Daniel Eklöf
aca9af0202
vt: VS16 - variation selector 16 (emoji representation) should only affect emojis 2024-02-15 16:56:30 +01:00
Daniel Eklöf
7999975016
Don't use fancy Unicode quotes, stick to ASCII 2024-02-06 12:36:45 +01:00
Daniel Eklöf
4eef001d58
csi: implement DECSET/DECRST/DECRQM 2027 - grapheme cluster processing
This implements private mode 2027 - grapheme cluster processing, as
defined in the "Terminal Unicode Core"[1] specification.

Internally, we just flip the already existing option "grapheme
shaping". Since it's now runtime changeable, we need a copy of it in
the terminal struct, rather than referencing the conf object.

[1]: 13fc5a8993/spec/terminal-unicode-core.tex (L50-L53)
2023-09-25 16:50:44 +02:00
Daniel Eklöf
12e0edd6e1
vt: fix ASAN UB warning
../vt.c:648:13: runtime error: signed integer overflow: 3924432811 * 2654435761 cannot be represented in type 'long'
  SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior ../vt.c:648:13 in

Closes #1456
2023-08-05 07:19:51 +02:00
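The report above is what one would expect if a 32-bit value is multiplied by an unsuffixed literal such as 2654435761: the literal does not fit in an int, so on LP64 platforms it has type long and the multiplication is signed. Whether that is exactly what vt.c did is an assumption; the sketch below only shows the unsigned form, with `mix()` as a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/* If uint32_t were narrower than unsigned int it would promote to a
 * signed int before multiplying; rule that out at compile time. */
static_assert(sizeof(unsigned int) <= sizeof(uint32_t),
              "uint32_t must not promote to a signed type");

/* Unsigned multiplication wraps modulo 2^32; no undefined behaviour. */
uint32_t mix(uint32_t key)
{
    return key * UINT32_C(2654435761);
}
```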
Daniel Eklöf
b59fd7c388
vt: detect and ignore invalid UTF-8 sequences
This patch detects invalid codepoints in the UTF-8 EDxxxx range, and
the F4xxxxxx range.

Note that we still allow the E0xxxx and F0xxxxxx ranges. These
contain overlong encodings. We allow them, because they still decode
into correct UTF-32.

Closes #1423
2023-07-22 11:21:41 +02:00
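One way to express the two rejected ranges from b59fd7c388 is a check on the byte that follows the lead byte. This is a sketch under the assumption that the check happens on the second byte; `utf8_second_byte_ok()` is a hypothetical helper, not a function from vt.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Validates the byte following a multi-byte lead byte. */
bool utf8_second_byte_ok(uint8_t lead, uint8_t second)
{
    assert(lead >= 0xC0); /* caller passes a lead byte */

    if ((second & 0xC0) != 0x80)
        return false;            /* not a continuation byte at all */
    if (lead == 0xED)
        return second <= 0x9F;   /* ED A0..BF: surrogates U+D800..U+DFFF */
    if (lead == 0xF4)
        return second <= 0x8F;   /* F4 90..BF: above U+10FFFF */
    return true;                 /* E0/F0 overlong forms are let through */
}
```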
CismonX
c3b119ea81
vt: improve handling of HTS
Do not insert existing positions into the tab stop list.

This prevents a performance issue when iterating through
an extremely long tab stop list.

Also corrects the behaviour of CBT.
2023-07-20 08:47:40 +08:00
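The HTS change in c3b119ea81 amounts to a sorted insert that refuses duplicates. The sketch below uses a fixed-size array and the hypothetical names `struct tab_stops` and `tab_stop_add()`; foot's own tab stop list may be stored differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_STOPS 64

/* Hypothetical sorted tab stop list (illustration only). */
struct tab_stops {
    int pos[MAX_STOPS];
    size_t count;
};

/* Returns true if a new stop was added, false if it already existed
 * (or the list is full). */
bool tab_stop_add(struct tab_stops *t, int col)
{
    assert(col >= 0);

    size_t i = 0;
    while (i < t->count && t->pos[i] < col)
        i++;

    if (i < t->count && t->pos[i] == col)
        return false;   /* already a tab stop: do not insert again */
    if (t->count == MAX_STOPS)
        return false;

    memmove(&t->pos[i + 1], &t->pos[i],
            (t->count - i) * sizeof(t->pos[0]));
    t->pos[i] = col;
    t->count++;
    return true;
}
```

Repeated HTS at the same column then leaves the list length unchanged, so iteration cost no longer grows with the number of HTS sequences received.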
Daniel Eklöf
d88bea5e22
vt: split up action_param() to three separate functions
We’re already switching on the next VT input byte in the state
machine; no need to if...else if in action_param() too.

That is, split up action_param() into three:

* action_param_new()
* action_param_new_subparam()
* action_param()

This makes the code cleaner, and hopefully slightly faster.

Next, to improve performance further, only check for (sub)parameter
overflow in action_param_new() and action_param_new_subparam().

Add pointers to the VT struct that point to the currently active
parameter and sub-parameter.

When the number of parameters (or sub-parameters) overflows, warn, and
then point the parameter pointer to a "dummy" value in the VT struct.

This way, we don’t have to check anything in action_param().
2023-06-16 16:26:13 +02:00
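The "dummy" pointer trick from d88bea5e22 can be sketched like this. The function names action_param_new() and action_param() come from the commit message; the struct layout, MAX_PARAMS and the function bodies are assumptions, and sub-parameters are left out for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PARAMS 4 /* assumed limit, for illustration */

struct vt {
    unsigned params[MAX_PARAMS];
    size_t count;
    unsigned dummy;  /* sink for digits once params[] is full */
    unsigned *cur;   /* currently active parameter */
};

/* Start a new parameter; the only place that checks for overflow. */
void action_param_new(struct vt *vt)
{
    if (vt->count < MAX_PARAMS)
        vt->cur = &vt->params[vt->count++];
    else
        vt->cur = &vt->dummy; /* a real implementation would warn here */
    *vt->cur = 0;
}

/* Accumulate one digit; no bounds checks needed on this hot path. */
void action_param(struct vt *vt, uint8_t c)
{
    assert(c >= '0' && c <= '9');
    *vt->cur = *vt->cur * 10 + (unsigned)(c - '0');
}
```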
Daniel Eklöf
9dc4f48e7a
vt: tag collision-count check with ‘unlikely’ 2022-06-15 19:25:00 +02:00
Daniel Eklöf
fbcb30bf98
vt: improve key calculation for compose sequences
* Don’t assume 32 bits when rotating the old key. Use the number of
  actual bits available, as determined by CELL_COMB_CHARS_{HI,LO}
* Multiply with magic hash constant

This greatly reduces the number of collisions seen. For example, the
Emoji test file (from the Unicode specification), now has zero
collisions.
2022-06-15 19:25:00 +02:00
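The two bullet points in fbcb30bf98 could be sketched as below. KEY_BITS stands in for whatever width CELL_COMB_CHARS_{HI,LO} yields, and the rotate-by-one, the XOR and the choice of 2654435761 as multiplier (the value seen in the UBSan report further up this log) are all assumptions; `rotl_key()` and `key_update()` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#define KEY_BITS 20u /* assumed key width */
#define KEY_MASK ((UINT32_C(1) << KEY_BITS) - 1)

/* Rotate left by one within KEY_BITS bits, not within 32. */
uint32_t rotl_key(uint32_t key)
{
    assert(key <= KEY_MASK);
    return ((key << 1) | (key >> (KEY_BITS - 1))) & KEY_MASK;
}

/* Fold one more codepoint into the key. */
uint32_t key_update(uint32_t old_key, uint32_t cp)
{
    uint32_t mixed = rotl_key(old_key) ^ cp;
    /* Multiplicative hashing step, truncated back to the key width. */
    return (mixed * UINT32_C(2654435761)) & KEY_MASK;
}
```

Rotating within the real key width keeps the top bit of the old key in play instead of shifting it out of the usable range.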
Daniel Eklöf
edd68732ad
vt: prevent potential endless loop when finding a slot for a composed character
Composed characters are stored in a tree structure, using a key as
identifier. The key is calculated from the individual characters that
make up the composed character sequence.

Since the address space for keys is limited, collisions may occur. In
this case, we simply increment the key and try again.

It is theoretically possible to saturate the key space, in which case
we’ll get stuck in an endless loop.

Even if the key space isn’t fully saturated, we fairly easily reach a
point where there are so many collisions for each insertion that
performance drops significantly.

Since key space is limited (it’s not like a hash table that we can
grow), our only option is to limit the number of collisions. If we
can’t find a slot within a hard-coded number of collisions, the
character is simply dropped.
2022-06-15 19:25:00 +02:00
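A sketch of the bounded probing described in edd68732ad. A flat occupancy array replaces the tree to keep the example short, KEY_SPACE and MAX_COLLISIONS are made-up values, and `find_free_slot()` is a hypothetical name. The wrap-around on increment reflects the later fix in 22e1b1610f.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define KEY_SPACE 16u      /* assumed, tiny on purpose */
#define MAX_COLLISIONS 4u  /* assumed hard-coded limit */

struct slots {
    bool used[KEY_SPACE];
};

/* Returns a free key at or after `key`, or -1 if none was found within
 * MAX_COLLISIONS probes (the caller then drops the character). */
int find_free_slot(const struct slots *s, uint32_t key)
{
    assert(key < KEY_SPACE);

    for (unsigned i = 0; i < MAX_COLLISIONS; i++) {
        if (!s->used[key])
            return (int)key;
        key = (key + 1) % KEY_SPACE; /* collision: next key, wrapping */
    }
    return -1;
}
```

Because the loop has a fixed upper bound, it terminates even when every key is taken.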
Daniel Eklöf
0b9b726bdf
vt: free OSC buffer after dispatch, if larger than 4K 2022-03-21 20:40:10 +01:00
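The policy in 0b9b726bdf, sketched with a hypothetical `struct osc` and `osc_after_dispatch()`: small buffers are kept for reuse, while a buffer that grew past 4K for one large OSC is released so it does not stay allocated.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical OSC accumulation buffer (illustration only). */
struct osc {
    uint8_t *data;
    size_t size;  /* allocated bytes */
    size_t idx;   /* bytes used by the current OSC */
};

/* Call once the OSC has been dispatched. */
void osc_after_dispatch(struct osc *osc)
{
    assert(osc != NULL);

    if (osc->size > 4096) {
        free(osc->data);
        osc->data = NULL;
        osc->size = 0;
    }
    osc->idx = 0;
}
```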
Daniel Eklöf
e0227266ca
fcft: adapt to API changes in fcft-3.x
Fcft no longer uses wchar_t, but plain uint32_t to represent
codepoints.

Since we do a fair amount of string operations in foot, it still makes
sense to use something that actually _is_ a string (or character),
rather than an array of uint32_t.

For this reason, we switch out all wchar_t usage in foot to
char32_t. We also verify, at compile-time, that char32_t uses
UTF-32 (which is what fcft expects).

Unfortunately, there are no string functions for char32_t. To avoid
having to re-implement all wcs*() functions, we add a small wrapper
layer of c32*() functions.

These wrapper functions take char32_t arguments, but then simply call
the corresponding wcs*() function.

For this to work, wcs*() must _also_ be UTF-32 compatible. We can
check for the presence of the  __STDC_ISO_10646__ macro. If set,
wchar_t is at least 4 bytes and its internal representation is UTF-32.

FreeBSD does *not* define this macro, because its internal wchar_t
representation depends on the current locale. It _does_ use UTF-32
_if_ the current locale is UTF-8.

Since foot enforces UTF-8, we simply need to check if __FreeBSD__ is
defined.

Other fcft API changes:

* fcft_glyph_rasterize() -> fcft_codepoint_rasterize()
* font.space_advance has been removed
* ‘tags’ have been removed from fcft_grapheme_rasterize()
* ‘fcft_log_init()’ removed
* ‘fcft_init()’ and ‘fcft_fini()’ must be explicitly called
2022-02-05 17:00:54 +01:00
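The c32*() wrapper layer described in e0227266ca might look roughly like this. Only two wrappers are shown, their exact signatures are assumptions, and the compile-time check here is reduced to a size comparison; the commit additionally relies on __STDC_ISO_10646__ (or __FreeBSD__) to know that wchar_t holds UTF-32.

```c
#include <assert.h>
#include <stddef.h>
#include <uchar.h>
#include <wchar.h>

/* The casts below are only sound if both types have the same width. */
static_assert(sizeof(wchar_t) == sizeof(char32_t),
              "wchar_t must be 32 bits wide");

/* Length of a NUL-terminated char32_t string, via wcslen(). */
size_t c32len(const char32_t *s)
{
    assert(s != NULL);
    return wcslen((const wchar_t *)s);
}

/* Compare two char32_t strings, via wcscmp(). */
int c32cmp(const char32_t *a, const char32_t *b)
{
    assert(a != NULL && b != NULL);
    return wcscmp((const wchar_t *)a, (const wchar_t *)b);
}
```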
Daniel Eklöf
c1c0f11821
config: add tweak.grapheme-width-method=max
‘max’ is a new value for ‘tweak.grapheme-width-method’. When enabled,
the width of a grapheme cluster is that of the cluster’s widest
codepoint.
2021-11-23 19:50:05 +01:00
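The three values of tweak.grapheme-width-method that appear in this log (wcswidth, double-width, max) can be compared side by side. The sketch takes per-codepoint widths as input instead of calling wcwidth(), so it does not depend on the locale; the enum and `grapheme_width()` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

enum width_method {
    WIDTH_WCSWIDTH,      /* sum of all codepoint widths */
    WIDTH_DOUBLE_WIDTH,  /* sum, capped at 2 cells */
    WIDTH_MAX,           /* width of the widest codepoint */
};

int grapheme_width(enum width_method m, const int *widths, size_t n)
{
    assert(widths != NULL);

    int sum = 0, max = 0;
    for (size_t i = 0; i < n; i++) {
        sum += widths[i];
        if (widths[i] > max)
            max = widths[i];
    }

    switch (m) {
    case WIDTH_WCSWIDTH:     return sum;
    case WIDTH_DOUBLE_WIDTH: return sum < 2 ? sum : 2;
    case WIDTH_MAX:          return max;
    }
    return sum;
}
```

For an emoji, a zero-width joiner and a second emoji (widths 2, 0, 2), the methods give 4, 2 and 2 respectively.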
Craig Barnes
52dcf72d0b
osc: use BEL terminator in OSC replies to BEL-terminated OSC queries
This matches the documented (and observed) behavior in xterm:

> XTerm accepts either BEL or ST for terminating OSC sequences, and
> when returning information, uses the same terminator used in a query

-- https://invisible-island.net/xterm/ctlseqs/ctlseqs.html#h3-Operating-System-Commands
2021-10-20 12:48:37 +01:00
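Mirroring the query's terminator, as 52dcf72d0b describes, comes down to choosing between BEL and ST (ESC \) when formatting the reply. `osc_reply()` is a hypothetical helper and the payload in the usage note is only an example string.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Formats "ESC ] payload terminator" into buf; returns snprintf()'s
 * result. The terminator matches the one used by the query. */
int osc_reply(char *buf, size_t size, const char *payload,
              bool bel_terminated)
{
    assert(buf != NULL && payload != NULL);

    const char *terminator = bel_terminated ? "\a" : "\033\\";
    return snprintf(buf, size, "\033]%s%s", payload, terminator);
}
```

For example, `osc_reply(buf, sizeof(buf), "11;rgb:0000/0000/0000", true)` ends the reply with BEL, while passing `false` ends it with ESC \.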
Craig Barnes
b18d3aef17
vt: add some unit tests for action_collect() 2021-07-02 08:46:28 +01:00
Daniel Eklöf
5138f02214
config: rename at-most-2 (value for grapheme-width-method) to double-width 2021-07-01 08:00:23 +02:00
Daniel Eklöf
9817e44c32
config: add tweak.grapheme-width-method=wcswidth|at-most-2 2021-07-01 07:58:06 +02:00
Daniel Eklöf
031e8f5987
vt: limit grapheme width to 2 cells
All emoji graphemes are double-width. Foot doesn’t support non-latin
scripts. Ergo, this should result in the Right Thing, even though
we’re not doing it the Right Way.

Note that we’re now breaking cursor synchronization with nearly all
applications.

But the way I see it, the applications need to be
updated.
2021-07-01 07:57:56 +02:00
Daniel Eklöf
0ff8f72a9d
vt: don’t reset utf8proc grapheme state when we’re not at a grapheme break 2021-06-25 20:42:23 +02:00
Daniel Eklöf
3bad062f8a
vt: utf8: rotate instead of just shifting when updating compose key
This reduces the number of collisions in even more workloads.
2021-06-24 19:36:39 +02:00
Daniel Eklöf
88ce0e4375
vt: improved key hash algorithm -> reduces number of key collisions 2021-06-24 19:18:06 +02:00
Daniel Eklöf
f20956ff1b
composed: insert: require key to be unique 2021-06-24 19:12:25 +02:00
Daniel Eklöf
415ecfc6fa
vt: codespell: bumb -> bump 2021-06-24 17:30:50 +02:00
Daniel Eklöf
fe8ca23cfe
composed: store compose chains in a binary search tree
The previous implementation stored compose chains in a dynamically
allocated array. Adding a chain was easy: resize the array and append
the new chain at the end. Looking up a compose chain given a compose
chain key/index was also easy: just index into the array.

However, searching for a pre-existing chain given a codepoint sequence
was very slow. Since the array wasn’t sorted, we typically had to scan
through the entire array, just to realize that there is no
pre-existing chain, and that we need to add a new one.

Since this happens for *each* codepoint in a grapheme cluster, things
quickly became really slow.

Things were ok:ish as long as the compose chain struct was small, as
that made it possible to hold all the chains in the cache. Once the
number of chains reached a certain point, or when we were forced to
bump the maximum number of allowed codepoints in a chain, we started
thrashing the cache and things got much much worse.

So what can we do?

We can’t sort the array, because

a) that would invalidate all existing chain keys in the grid (and
iterating the entire scrollback and updating compose keys is *not* an
option).

b) inserting a chain becomes slow as we need to first find _where_ to
insert it, and then memmove() the rest of the array.

This patch uses a binary search tree to store the chains instead of a
simple array.

The tree is sorted on a “key”, which is the XOR of all codepoints,
truncated to the CELL_COMB_CHARS_HI-CELL_COMB_CHARS_LO range.

The grid now stores CELL_COMB_CHARS_LO+key, instead of
CELL_COMB_CHARS_LO+index.

Since the key is truncated, collisions may occur. This is handled by
incrementing the key by 1.

Lookup is of course slower than before, O(log n) instead of
O(1).

Insertion is slightly slower as well: technically it’s O(log n)
instead of O(1). However, we also need to take into account that
re-allocating the array will occasionally force a full copy of the
array when it cannot simply be grown.

But finding a pre-existing chain is now *much* faster: O(log n)
instead of O(n). In most cases, the first lookup will either
succeed (return a true match), or fail (return NULL). However, since
key collisions are possible, it may also return false matches. This
means we need to verify the contents of the chain before deciding to
use it instead of inserting a new chain. But remember that this
comparison was being done for each and every chain in the previous
implementation.

With lookups being much faster, and in particular, no longer requiring
us to check the chain contents for every single chain, we can now use
a dynamically allocated ‘chars’ array in the chain. This was
previously a hardcoded array of 10 chars.

Using a dynamically allocated array means looking in the array is slower,
since we now need two loads: one to load the pointer, and a second to
load _from_ the pointer.

As a result, the base size of a compose chain (i.e. an “empty” chain)
has now been reduced from 48 bytes to 32. A chain with two codepoints
is 40 bytes. This means we can have up to 4 codepoints while still using
less, or the same amount, of memory as before.

Furthermore, the Unicode random test (i.e. write random “unicode”
chars) is now **faster** than current master (i.e. before text-shaping
support was added), **with** text-shaping enabled. With text-shaping
disabled, we’re _even_ faster.
2021-06-24 17:30:49 +02:00
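The scheme in fe8ca23cfe (XOR key, truncated; binary search tree sorted on that key; increment on collision; verify contents on a hit) can be sketched as below. KEY_MASK, the struct layout and all function names are assumptions. The tree is unbalanced, and the probe loop is unbounded, which is the weakness the later commit edd68732ad addresses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define KEY_MASK 0xFFFFu /* assumed key range */

struct composed {
    uint32_t key;
    uint32_t *chars; /* dynamically allocated codepoint sequence */
    size_t count;
    struct composed *left, *right;
};

static uint32_t calc_key(const uint32_t *chars, size_t count)
{
    uint32_t key = 0;
    for (size_t i = 0; i < count; i++)
        key ^= chars[i];
    return key & KEY_MASK;
}

static struct composed *lookup(struct composed *node, uint32_t key)
{
    while (node != NULL && node->key != key)
        node = key < node->key ? node->left : node->right;
    return node;
}

static void insert(struct composed **link, struct composed *node)
{
    while (*link != NULL) {
        assert((*link)->key != node->key); /* keys must be unique */
        link = node->key < (*link)->key ? &(*link)->left : &(*link)->right;
    }
    *link = node;
}

/* Returns the existing chain for chars[], or adds a new one. */
struct composed *composed_find_or_add(struct composed **root,
                                      const uint32_t *chars, size_t count)
{
    assert(count > 0);
    uint32_t key = calc_key(chars, count);

    for (struct composed *c; (c = lookup(*root, key)) != NULL;) {
        if (c->count == count &&
            memcmp(c->chars, chars, count * sizeof(chars[0])) == 0)
            return c;                 /* true match */
        key = (key + 1) & KEY_MASK;   /* false match: try the next key */
    }

    struct composed *node = malloc(sizeof(*node));
    if (node == NULL)
        return NULL;
    node->chars = malloc(count * sizeof(chars[0]));
    if (node->chars == NULL) {
        free(node);
        return NULL;
    }
    memcpy(node->chars, chars, count * sizeof(chars[0]));
    node->key = key;
    node->count = count;
    node->left = node->right = NULL;
    insert(root, node);
    return node;
}
```

Two sequences with the same codepoints in a different order XOR to the same key, so the second one ends up stored under key + 1.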
Daniel Eklöf
81131e3a87
vt: utf8: don’t scan *all* previous chains
When checking if we already have a compose chain for the current
sequence of characters, don’t search the list from the beginning,
unless we have to.

Taking the following things into consideration:

* New compose chains are always appended at the end of the list
* If the current sequence is 3 or more characters, it *must* consist
  of an existing compose chain, plus the new character.

Thus, when searching, start at index 0 if we only have two characters,
since then the base cell originally contained a regular base
character, and not a compose chain. I.e. the new chain may be
_anywhere_ in the chain list.

If however we have a sequence of three or more characters, start at
the index the *base* chain was at. If the chain we’re searching for
exists, it *must* have been added *after* the base chain, and thus
it *must* be located *after* the base chain in the chain list.
2021-06-24 17:30:48 +02:00
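For the older array-based storage, the search-start rule in 81131e3a87 can be illustrated as follows. The struct, MAX_CHARS and both function names are hypothetical. The comparison checks the base character before the count, in line with the reasoning given in f865612667.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_CHARS 4 /* assumed, for illustration */

struct chain {
    uint32_t chars[MAX_CHARS]; /* chars[0] is the base character */
    size_t count;
};

/* Two characters: the chain may be anywhere, start at 0. Three or
 * more: it must have been appended after the chain it extends. */
size_t search_start(size_t seq_len, size_t base_chain_idx)
{
    return seq_len >= 3 ? base_chain_idx : 0;
}

/* Linear search from `start`; returns the index or -1. */
long find_chain(const struct chain *chains, size_t n, size_t start,
                const uint32_t *seq, size_t count)
{
    assert(count >= 2 && count <= MAX_CHARS);

    for (size_t i = start; i < n; i++) {
        const struct chain *c = &chains[i];
        if (c->chars[0] != seq[0])
            continue;   /* base differs: most likely early exit */
        if (c->count != count)
            continue;
        if (memcmp(&c->chars[1], &seq[1],
                   (count - 1) * sizeof(seq[0])) == 0)
            return (long)i;
    }
    return -1;
}
```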
Daniel Eklöf
e81d1845bf
vt: utf8: de-duplicate; jump to end of function to print to grid 2021-06-24 17:30:48 +02:00
Daniel Eklöf
dc5019a535
vt: utf8-print: don’t build a compose chain on a zero-width base character 2021-06-24 17:30:47 +02:00
Daniel Eklöf
f865612667
vt: utf8-print: check base character before count when looking for existing compose chain
Count is more likely to be the same for many chains. Thus we’re likely
to fail sooner by checking the base character first.
2021-06-24 17:30:47 +02:00
Daniel Eklöf
57e636dd8e
vt: don’t call wcwidth() on all combining characters every time we add
We already have all the widths needed to calculate the new one; it’s
the base character’s width (base_width), or the previous combining
chain’s width (composed->width) plus the new character’s
width (width).
2021-06-24 17:30:46 +02:00
Daniel Eklöf
09431dd15c
vt: presentation selectors may be anywhere in the cluster 2021-06-24 17:30:46 +02:00
Daniel Eklöf
6c70cd9366
vt: don’t force cols=2 when we see an emoji variant selector
Fish appears to be the only shell expecting this. The rest probably
just does wcswidth(), like usual.
2021-06-24 17:30:45 +02:00
Daniel Eklöf
0a9531ac6c
vt: cache grapheme cluster width in composed struct
* Use regular wcswidth() to calculate the width
* Explicitly set to ‘2’ if we see an emoji variant selector
* Cache the result in the composed struct
2021-06-24 17:30:45 +02:00
Daniel Eklöf
b9ef703eb1
wip: grapheme shaping 2021-06-24 17:30:45 +02:00
Craig Barnes
2a75da4143
Merge branch 'charset-shift-fixes' 2021-06-09 10:18:52 +01:00