mirrors/foot

mirror of https://codeberg.org/dnkl/foot.git synced 2026-02-05 04:06:08 -05:00

Author	SHA1	Message	Date
Daniel Eklöf	a02c0c8d4d	vt: utf8: insert a REPLACEMENT CHARACTER when an invalid UTF-8 sequence is detected	2025-03-18 18:28:09 +01:00
Daniel Eklöf	878e07da59	vt: utf8: don't discard current byte when an invalid UTF-8 sequence is detected Example: printf "pok\xe9mon\n" would result in 'pokon' - the 'm' has been discarded along with E9. While correct, in some sense, it's perhaps not intuitive. This patch changes the VT parser to instead discard everything up to the invalid byte, but then try the invalid byte from the ground state. This way, invalid UTF-8 sequences followed by both plain ASCII, or longer (and valid) UTF-8 sequences are printed as expected instead of being discarded.	2025-03-18 14:37:28 +01:00
Daniel Eklöf	d3f692990e	term+vt: refactor: move "utf8" char processing to term_process_and_print_non_ascii() This function "prints" any non-ascii character (i.e. any character that ends up in the action_utf8_print() function in vt.c) to the grid. This includes grapheme cluster processing etc. action_utf8_print() now simply calls this function. This allows us to re-use the same functionality from other places (like the text-sizing protocol).	2025-02-06 07:45:20 +01:00
Daniel Eklöf	e248e73753	composed: refactor: break out lookup with collision detection	2025-02-06 07:42:37 +01:00
Daniel Eklöf	1181f74d19	composed: re-factor: break out key calculation from vt.c	2025-02-06 07:42:37 +01:00
Daniel Eklöf	88dcde3ed8	term: insert-mode: handle combining characters correctly When the client application emits combining characters, for example multi-codepoint emojis, in insert-mode, we ended up pushing partial graphemes to the right, for each codepoint, resulting in too many cells (and with the wrong content) being inserted. The fix is fairly simple; don't "insert" when appending characters to an existing grapheme cluster. This isn't something we can detect easily in print_insert() (it would require us to do grapheme clustering again). Fortunately, we do have the required information in action_utf8_print(). So, pass this information as a boolean to term_print(). Closes #1947	2025-02-06 07:37:55 +01:00
Daniel Eklöf	22e1b1610f	vt: combining chars: ensure 'key' is within range When there's a key collision, we increment the key and check again. When doing this, we need to ensure the key is within range, and wrap around to 0 if the key value is too large.	2025-01-18 10:22:24 +01:00
Daniel Eklöf	b43f19cb50	vt: don't call fcft_precompose() if font is NULL This fixes a crash when doing a partial PGO build (where we don't have any fonts available).	2024-11-02 20:11:14 +01:00
Daniel Eklöf	a9e462d952	Remove a number of unused includes	2024-08-02 08:28:13 +02:00
Daniel Eklöf	48cf57818d	term: performance: use a bitfield to track which ascii printer to use The things affecting which ASCII printer we use have grown... Instead of checking everything inside term_update_ascii_printer(), use a bitfield. Anything affecting the printer used, must now set a bit in this bitfield. This makes term_update_ascii_printer() much faster, since all it needs to do is check if the bitfield is zero or not.	2024-06-26 18:39:24 +02:00
Daniel Eklöf	7378ecf9a7	vt: unittest: verify emoji_vs list is sorted	2024-06-25 08:23:40 +02:00
Daniel Eklöf	9665661445	vt: only apply VS-15/16 to valid sequences At compile time, build a lookup table from the Unicode data file 'emoji-variation-sequences.txt'. At run-time, when we detect a VS-15/16 sequence, do a lookup in this table, and enforce the variation selector iff the sequence is valid. Closes #1742	2024-06-25 08:20:21 +02:00
Daniel Eklöf	94583703e1	vt: don't ignore VS-15 (text presentation) When we encounter either VS-15 or VS-16, set the grapheme width to 1 or 2 explicitly.	2024-06-25 08:20:20 +02:00
Daniel Eklöf	60c5d889ec	vt: DECALN: erase sixels, reset margins, home the cursor https://vt100.net/docs/vt510-rm/DECALN.html: Notes on DECALN DECALN sets the margins to the extremes of the page, and moves the cursor to the home position.	2024-03-07 16:24:34 +01:00
Daniel Eklöf	74a1fa9e00	vt: update DECALN to use term_fill()	2024-03-07 16:24:33 +01:00
Daniel Eklöf	0c94bf43f2	vt: ignore VS16 (U+FE0F) when grapheme clustering is disabled This fixes: a) a compilation error with -Dgrapheme-clustering=disabled b) ensures U+FE0F does not allocate two cells when grapheme clustering has been disabled (either compile time, in config, or run-time).	2024-02-16 07:13:32 +01:00
Daniel Eklöf	aca9af0202	vt: VS16 - variation selector 16 (emoji representation) should only affect emojis	2024-02-15 16:56:30 +01:00
Daniel Eklöf	7999975016	Don't use fancy Unicode quotes, stick to ASCII	2024-02-06 12:36:45 +01:00
Daniel Eklöf	4eef001d58	csi: implement DECSET/DECRST/DECRQM 2027 - grapheme cluster processing This implements private mode 2027 - grapheme cluster processing, as defined in the "Terminal Unicode Core"[1] specification. Internally, we just flip the already existing option "grapheme shaping". Since it's now runtime changeable, we need a copy of it in the terminal struct, rather than referencing the conf object. [1]: `13fc5a8993/spec/terminal-unicode-core.tex (L50-L53)`	2023-09-25 16:50:44 +02:00
Daniel Eklöf	12e0edd6e1	vt: fix ASAN UB warning ../vt.c:648:13: runtime error: signed integer overflow: 3924432811 * 2654435761 cannot be represented in type 'long' SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior ../vt.c:648:13 in Closes #1456	2023-08-05 07:19:51 +02:00
Daniel Eklöf	b59fd7c388	vt: detect and ignore invalid UTF-8 sequences This patch detects invalid codepoints in the UTF-8 EDxxxx range, and the F4xxxxxx range. Note that we still allow the E0xxxx and F0xxxxxx ranges. These contain overlong encodings. We allow them, because they still decode into correct UTF-32. Closes #1423	2023-07-22 11:21:41 +02:00
CismonX	c3b119ea81	vt: improve handling of HTS Do not insert existing positions into the tab stop list. This prevents a performance issue when iterating through an extremely long tab stop list. Also corrects the behaviour of CBT.	2023-07-20 08:47:40 +08:00
Daniel Eklöf	d88bea5e22	vt: split up action_param() to three separate functions We’re already switching on the next VT input byte in the state machine; no need to if...else if in action_param() too. That is, split up action_param() into three: * action_param_new() * action_param_new_subparam() * action_param() This makes the code cleaner, and hopefully slightly faster. Next, to improve performance further, only check for (sub)parameter overflow in action_param_new() and action_param_new_subparam(). Add pointers to the VT struct that point to the currently active parameter and sub-parameter. When the number of parameters (or sub-parameters) overflows, warn, and then point the parameter pointer to a "dummy" value in the VT struct. This way, we don’t have to check anything in action_param().	2023-06-16 16:26:13 +02:00
Daniel Eklöf	9dc4f48e7a	vt: tag collision-count check with ‘unlikely’	2022-06-15 19:25:00 +02:00
Daniel Eklöf	fbcb30bf98	vt: improve key calculation for compose sequences * Don’t assume 32 bits when rotating the old key. Use the number of actual bits available, as determined by CELL_COMB_CHARS_{HI,LO} * Multiply with magic hash constant This greatly reduces the number of collisions seen. For example, the Emoji test file (from the Unicode specification), now has zero collisions.	2022-06-15 19:25:00 +02:00
Daniel Eklöf	edd68732ad	vt: prevent potential endless loop when finding a slot for a composed character Composed characters are stored in a tree structure, using a key as identifier. The key is calculated from the individual characters that make up the composed character sequence. Since the address space for keys is limited, collisions may occur. In this case, we simply increment the key and try again. It is theoretically possible to saturate the key space, in which case we’ll get stuck in an endless loop. Even if the key space isn’t fully saturated, we fairly easily reach a point where there are so many collisions for each insertion, that performance drops significantly. Since key space is limited (it’s not like a hash table that we can grow), our only option is to limit the number of collisions. If we can’t find a slot within a hard-coded number of collisions, the character is simply dropped.	2022-06-15 19:25:00 +02:00
Daniel Eklöf	0b9b726bdf	vt: free OSC buffer after dispatch, if larger than 4K	2022-03-21 20:40:10 +01:00
Daniel Eklöf	e0227266ca	fcft: adapt to API changes in fcft-3.x Fcft no longer uses wchar_t, but plain uint32_t to represent codepoints. Since we do a fair amount of string operations in foot, it still makes sense to use something that actually _is_ a string (or character), rather than an array of uint32_t. For this reason, we switch out all wchar_t usage in foot to char32_t. We also verify, at compile-time, that char32_t uses UTF-32 (which is what fcft expects). Unfortunately, there are no string functions for char32_t. To avoid having to re-implement all wcs() functions, we add a small wrapper layer of c32() functions. These wrapper functions take char32_t arguments, but then simply call the corresponding wcs() function. For this to work, wcs() must _also_ be UTF-32 compatible. We can check for the presence of the __STDC_ISO_10646__ macro. If set, wchar_t is at least 4 bytes and its internal representation is UTF-32. FreeBSD does not define this macro, because its internal wchar_t representation depends on the current locale. It _does_ use UTF-32 _if_ the current locale is UTF-8. Since foot enforces UTF-8, we simply need to check if __FreeBSD__ is defined. Other fcft API changes: * fcft_glyph_rasterize() -> fcft_codepoint_rasterize() * font.space_advance has been removed * ‘tags’ have been removed from fcft_grapheme_rasterize() * ‘fcft_log_init()’ removed * ‘fcft_init()’ and ‘fcft_fini()’ must be explicitly called	2022-02-05 17:00:54 +01:00
Daniel Eklöf	c1c0f11821	config: add tweak.grapheme-width-method=max ‘max’ is a new value for ‘tweak.grapheme-width-method’. When enabled, the width of a grapheme cluster is that of the cluster’s widest codepoint.	2021-11-23 19:50:05 +01:00
Craig Barnes	52dcf72d0b	osc: use BEL terminator in OSC replies to BEL-terminated OSC queries This matches the documented (and observed) behavior in xterm: > XTerm accepts either BEL or ST for terminating OSC sequences, and > when returning information, uses the same terminator used in a query -- https://invisible-island.net/xterm/ctlseqs/ctlseqs.html#h3-Operating-System-Commands	2021-10-20 12:48:37 +01:00
Craig Barnes	b18d3aef17	vt: add some unit tests for action_collect()	2021-07-02 08:46:28 +01:00
Daniel Eklöf	5138f02214	config: rename at-most-2 (value for grapheme-width-method) to double-width	2021-07-01 08:00:23 +02:00
Daniel Eklöf	9817e44c32	config: add tweak.grapheme-width-method=wcswidth\|at-most-2	2021-07-01 07:58:06 +02:00
Daniel Eklöf	031e8f5987	vt: limit grapheme width to 2 cells All emoji graphemes are double-width. Foot doesn’t support non-latin scripts. Ergo, this should result in the Right Thing, even though we’re not doing it the Right Way. Note that we’re now breaking cursor synchronization with nearly all applications. But the way I see it, the applications need to be updated.	2021-07-01 07:57:56 +02:00
Daniel Eklöf	0ff8f72a9d	vt: don’t reset utf8proc grapheme state when we’re not at a grapheme break	2021-06-25 20:42:23 +02:00
Daniel Eklöf	3bad062f8a	vt: utf8: rotate instead of just shifting when updating compose key This reduces the number of collisions in even more workloads.	2021-06-24 19:36:39 +02:00
Daniel Eklöf	88ce0e4375	vt: improved key hash algorithm -> reduces number of key collisions	2021-06-24 19:18:06 +02:00
Daniel Eklöf	f20956ff1b	composed: insert: require key to be unique	2021-06-24 19:12:25 +02:00
Daniel Eklöf	415ecfc6fa	vt: codespell: bumb -> bump	2021-06-24 17:30:50 +02:00
Daniel Eklöf	fe8ca23cfe	composed: store compose chains in a binary search tree The previous implementation stored compose chains in a dynamically allocated array. Adding a chain was easy: resize the array and append the new chain at the end. Looking up a compose chain given a compose chain key/index was also easy: just index into the array. However, searching for a pre-existing chain given a codepoint sequence was very slow. Since the array wasn’t sorted, we typically had to scan through the entire array, just to realize that there is no pre-existing chain, and that we need to add a new one. Since this happens for each codepoint in a grapheme cluster, things quickly became really slow. Things were ok:ish as long as the compose chain struct was small, as that made it possible to hold all the chains in the cache. Once the number of chains reached a certain point, or when we were forced to bump maximum number of allowed codepoints in a chain, we started thrashing the cache and things got much much worse. So what can we do? We can’t sort the array, because a) that would invalidate all existing chain keys in the grid (and iterating the entire scrollback and updating compose keys is not an option). b) inserting a chain becomes slow as we need to first find _where_ to insert it, and then memmove() the rest of the array. This patch uses a binary search tree to store the chains instead of a simple array. The tree is sorted on a “key”, which is the XOR of all codepoints, truncated to the CELL_COMB_CHARS_HI-CELL_COMB_CHARS_LO range. The grid now stores CELL_COMB_CHARS_LO+key, instead of CELL_COMB_CHARS_LO+index. Since the key is truncated, collisions may occur. This is handled by incrementing the key by 1. Lookup is of course slower than before, O(log n) instead of O(1). Insertion is slightly slower as well: technically it’s O(log n) instead of O(1). 
However, we also need to take into account that re-allocating the array will occasionally force a full copy of the array when it cannot simply be grown. But finding a pre-existing chain is now much faster: O(log n) instead of O(n). In most cases, the first lookup will either succeed (return a true match), or fail (return NULL). However, since key collisions are possible, it may also return false matches. This means we need to verify the contents of the chain before deciding to use it instead of inserting a new chain. But remember that this comparison was being done for each and every chain in the previous implementation. With lookups being much faster, and in particular, no longer requiring us to check the chain contents for every single chain, we can now use a dynamically allocated ‘chars’ array in the chain. This was previously a hardcoded array of 10 chars. Using a dynamically allocated array means looking in the array is slower, since we now need two loads: one to load the pointer, and a second to load _from_ the pointer. As a result, the base size of a compose chain (i.e. an “empty” chain) has now been reduced from 48 bytes to 32. A chain with two codepoints is 40 bytes. This means we have up to 4 codepoints while still using less, or the same amount, of memory as before. Furthermore, the Unicode random test (i.e. write random “unicode” chars) is now faster than current master (i.e. before text-shaping support was added), with text-shaping enabled. With text-shaping disabled, we’re _even_ faster.	2021-06-24 17:30:49 +02:00
Daniel Eklöf	81131e3a87	vt: utf8: don’t scan all previous chains When checking if we already have a compose chain for the current sequence of characters, don’t search the list from the beginning, unless we have to. Taking the following things into consideration: * New compose chains are always appended at the end of the list * If the current sequence is 3 or more characters, it must consist of an existing compose chain, plus the new character. Thus, when searching, start at index 0 if we only have two characters, since then the base cell originally contained a regular base character, and not a compose chain. I.e. the new chain may be _anywhere_ in the chain list. If however we have a sequence of three or more characters, start at the index the base chain was at. If the chain we’re searching for exists, it must have been added after the base chain, and thus it must be located after the base chain in the chain list.	2021-06-24 17:30:48 +02:00
Daniel Eklöf	e81d1845bf	vt: utf8: de-duplicate; jump to end of function to print to grid	2021-06-24 17:30:48 +02:00
Daniel Eklöf	dc5019a535	vt: utf8-print: don’t build a compose chain on a zero-width base character	2021-06-24 17:30:47 +02:00
Daniel Eklöf	f865612667	vt: utf8-print: check base character before count when looking for existing compose chain Count is more likely to be the same for many chains. Thus we’re likely to fail sooner by checking the base character first.	2021-06-24 17:30:47 +02:00
Daniel Eklöf	57e636dd8e	vt: don’t call wcwidth() on all combining characters every time we add We already have all the widths needed to calculate the new one; it’s the base character’s width (base_width), or the previous combining chain’s width (composed->width) plus the new character’s width (width).	2021-06-24 17:30:46 +02:00
Daniel Eklöf	09431dd15c	vt: presentation selectors may be anywhere in the cluster	2021-06-24 17:30:46 +02:00
Daniel Eklöf	6c70cd9366	vt: don’t force cols=2 when we see an emoji variant selector Fish appears to be the only shell expecting this. The rest probably just does wcswidth(), like usual.	2021-06-24 17:30:45 +02:00
Daniel Eklöf	0a9531ac6c	vt: cache grapheme cluster width in composed struct * Use regular wcswidth() to calculate the width * Explicitly set to ‘2’ if we see an emoji variant selector * Cache the result in the composed struct	2021-06-24 17:30:45 +02:00
Daniel Eklöf	b9ef703eb1	wip: grapheme shaping	2021-06-24 17:30:45 +02:00
Craig Barnes	2a75da4143	Merge branch 'charset-shift-fixes'	2021-06-09 10:18:52 +01:00
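
The two most recent commits above (878e07da59 and a02c0c8d4d) describe how an invalid UTF-8 sequence is handled: the partial sequence is discarded, a REPLACEMENT CHARACTER is emitted, and the offending byte is tried again from the ground state. The following is a minimal sketch of that idea, not foot's actual parser. The function name `utf8_decode_lossy` is hypothetical, and the sketch makes two simplifying assumptions of its own: it does not reject overlong encodings or surrogates, and it also emits U+FFFD for a stray byte seen in the ground state.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define REPLACEMENT_CHARACTER UINT32_C(0xFFFD)

/*
 * Hypothetical helper (not part of foot): decode 'len' bytes of UTF-8
 * into 'out'. A single input byte can produce two codepoints (U+FFFD
 * plus the retried byte), so 'out' must have room for 2 * len entries.
 * Returns the number of codepoints written. An incomplete sequence at
 * the very end of the input produces no output.
 */
size_t utf8_decode_lossy(const uint8_t *in, size_t len, uint32_t *out)
{
    assert(in != NULL && out != NULL);

    size_t n = 0;
    uint32_t cp = 0;    /* codepoint being assembled */
    int remaining = 0;  /* continuation bytes still expected */

    for (size_t i = 0; i < len; i++) {
        uint8_t b = in[i];

        if (remaining > 0) {
            if ((b & 0xC0) == 0x80) {
                /* Valid continuation byte: append its 6 payload bits */
                cp = (cp << 6) | (b & 0x3F);
                if (--remaining == 0)
                    out[n++] = cp;
                continue;
            }

            /* Invalid sequence: drop what we collected so far, emit
             * U+FFFD, then fall through and retry 'b' from ground. */
            out[n++] = REPLACEMENT_CHARACTER;
            remaining = 0;
        }

        /* Ground state */
        if (b < 0x80) {
            out[n++] = b;                     /* plain ASCII */
        } else if ((b & 0xE0) == 0xC0) {
            cp = b & 0x1F; remaining = 1;     /* 2-byte lead */
        } else if ((b & 0xF0) == 0xE0) {
            cp = b & 0x0F; remaining = 2;     /* 3-byte lead */
        } else if ((b & 0xF8) == 0xF0) {
            cp = b & 0x07; remaining = 3;     /* 4-byte lead */
        } else {
            /* Stray continuation or invalid lead byte (sketch's choice) */
            out[n++] = REPLACEMENT_CHARACTER;
        }
    }

    return n;
}
```

With the input from the commit message, `printf "pok\xe9mon\n"`, the byte E9 opens a three-byte sequence, the following 'm' is not a continuation byte, and the sketch yields 'p', 'o', 'k', U+FFFD, 'm', 'o', 'n' instead of dropping the 'm'.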

262 commits
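
Several of the commits listed above (fbcb30bf98, edd68732ad, 22e1b1610f) describe the compose-key scheme: the key is derived from the codepoints by rotating within the available key bits and multiplying with a hash constant, a collision is resolved by incrementing the key and wrapping to 0, and the number of collisions is capped so the search cannot loop forever. The sketch below illustrates that scheme under loud assumptions. `KEY_BITS`, `MAX_COLLISIONS`, `MAX_CHARS` and all function names are hypothetical, the exact mixing step is a guess, and a flat array stands in for the binary search tree foot uses. Only the constant 2654435761 is taken from the log (it appears in the UBSan message of 12e0edd6e1).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* All sizes below are hypothetical; foot derives its key range from
 * CELL_COMB_CHARS_{HI,LO}. */
#define KEY_BITS 8u
#define KEY_COUNT (1u << KEY_BITS)
#define KEY_MASK (KEY_COUNT - 1u)
#define MAX_COLLISIONS 16u
#define MAX_CHARS 4

struct chain {
    bool used;
    size_t count;
    uint32_t chars[MAX_CHARS];
};

/* Flat array standing in for foot's binary search tree */
struct chain slots[KEY_COUNT];

/* Fold one codepoint into the key: rotate the old key by one bit
 * within KEY_BITS, mix in the codepoint, multiply by the hash
 * constant. Unsigned arithmetic wraps, so there is no signed
 * overflow. */
uint32_t compose_key(uint32_t old_key, uint32_t cp)
{
    uint32_t rot = ((old_key << 1) | (old_key >> (KEY_BITS - 1))) & KEY_MASK;
    return ((rot ^ cp) * UINT32_C(2654435761)) & KEY_MASK;
}

/* Return the key of the chain holding 'chars', inserting it if
 * needed. Returns -1 when no slot is found within MAX_COLLISIONS
 * probes, in which case the caller drops the character. */
int32_t compose_find_or_insert(const uint32_t *chars, size_t count)
{
    assert(chars != NULL && count >= 1 && count <= MAX_CHARS);

    uint32_t key = 0;
    for (size_t i = 0; i < count; i++)
        key = compose_key(key, chars[i]);

    for (unsigned collisions = 0; collisions < MAX_COLLISIONS; collisions++) {
        struct chain *c = &slots[key];

        if (!c->used) {
            /* Free slot: insert a new chain here */
            c->used = true;
            c->count = count;
            memcpy(c->chars, chars, count * sizeof(chars[0]));
            return (int32_t)key;
        }

        /* Occupied: the key matched, so verify the contents too */
        if (c->count == count &&
            memcmp(c->chars, chars, count * sizeof(chars[0])) == 0)
        {
            return (int32_t)key;
        }

        /* Collision: try the next key, wrapping around to 0 */
        key = (key + 1) & KEY_MASK;
    }

    return -1;
}
```

Looking up the same codepoint sequence twice returns the same key, while a different sequence that hashes to an occupied slot lands on a neighbouring key. Once the table fills up, lookups for new sequences return -1 after at most `MAX_COLLISIONS` probes rather than spinning.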