We only used utf8proc to try to pre-compose a glyph from a base and
combining character.
We can do this ourselves by using a pre-compiled table of valid
pre-compositions. This table isn't _that_ big, and binary searching it
is fast.
That is, for a very small amount of code, and not too much extra RO
data, we can get rid of the utf8proc dependency.
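
As a rough illustration of the table lookup described above, here is a minimal sketch. The struct layout, the names `precompose_table` and `precompose_lookup`, and the four-entry excerpt are assumptions made for this example; a real table would be generated from the Unicode data files and be much larger.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical entry: (base, combining) -> composed code point */
struct precompose {
    uint32_t base;
    uint32_t comb;
    uint32_t composed;
};

/* Illustrative excerpt only; must be sorted by (base, comb) */
static const struct precompose precompose_table[] = {
    {0x0041, 0x0300, 0x00c0},  /* A + grave -> U+00C0 */
    {0x0041, 0x0301, 0x00c1},  /* A + acute -> U+00C1 */
    {0x0061, 0x0301, 0x00e1},  /* a + acute -> U+00E1 */
    {0x0065, 0x0301, 0x00e9},  /* e + acute -> U+00E9 */
};

static int
precompose_cmp(const void *_key, const void *_entry)
{
    const struct precompose *key = _key;
    const struct precompose *entry = _entry;

    if (key->base != entry->base)
        return key->base < entry->base ? -1 : 1;
    if (key->comb != entry->comb)
        return key->comb < entry->comb ? -1 : 1;
    return 0;
}

/* Returns the composed code point, or 0 if the pair has no
 * pre-composed form in the table */
static uint32_t
precompose_lookup(uint32_t base, uint32_t comb)
{
    const struct precompose key = {.base = base, .comb = comb};
    const struct precompose *hit = bsearch(
        &key, precompose_table,
        sizeof(precompose_table) / sizeof(precompose_table[0]),
        sizeof(precompose_table[0]), &precompose_cmp);

    return hit != NULL ? hit->composed : 0;
}
```

Since the table is constant and sorted at build time, it can live in read-only data and needs no initialization at startup.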
If the client sent the sequence SAB, where SA does NOT have a composed
representation, but SB does, the old code would compose SB and throw
away A.
This patch fixes this by only allowing a compose if there aren't
any pre-existing combining characters.
When we detect a combining character, we first try to compose it with
the base character (like before).
When this fails, we instead add the combining character to the base
cell's combining characters array.
The reason for using a composed character when possible is twofold:
one, the rendered glyph will look better since it will be a single
glyph instead of two separate glyphs (possibly from different
fonts(!)). And two, for performance. A composed glyph is a single
glyph to render, while a decomposed glyph sequence means the renderer
has to render multiple glyphs for a single cell.
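
The compose-or-append logic above could look roughly like the following sketch. The `struct cell` layout, `MAX_COMBINING`, and the stand-in `try_compose()` (which only knows a single pair) are assumptions for illustration, not the actual data structures.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_COMBINING 4  /* assumed limit, for illustration only */

/* Hypothetical cell: a base character plus its combining characters */
struct cell {
    uint32_t base;
    uint32_t combining[MAX_COMBINING];
    size_t count;
};

/* Stand-in for the real pre-composition lookup; knows one pair only */
static uint32_t
try_compose(uint32_t base, uint32_t comb)
{
    if (base == 0x61 && comb == 0x301)
        return 0xe1;
    return 0;
}

static void
add_combining(struct cell *cell, uint32_t comb)
{
    /* Only compose when the base has no combining characters yet.
     * Otherwise "SAB" could compose S+B and silently drop A. */
    if (cell->count == 0) {
        uint32_t composed = try_compose(cell->base, comb);
        if (composed != 0) {
            cell->base = composed;
            return;
        }
    }

    /* Composition was not possible: keep the combining character */
    if (cell->count < MAX_COMBINING)
        cell->combining[cell->count++] = comb;
}
```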
action_clear() is in the super hot code path. Avoid resetting utf8
state there, as utf8 input is relatively uncommon.
Instead, reset it when we explicitly enter any of the utf8 collecting
states, as this is exactly the point where we need it.
This feature lets foot combine e.g. "a\u0301" to "á".
We first check if the current character (that we're about to print) is
a combining character, by checking if it's in one of the following
ranges:
* Combining Diacritical Marks (0300–036F), since version 1.0, with
modifications in subsequent versions down to 4.1
* Combining Diacritical Marks Extended (1AB0–1AFF), version 7.0
* Combining Diacritical Marks Supplement (1DC0–1DFF), versions 4.1 to 5.2
* Combining Diacritical Marks for Symbols (20D0–20FF), since version
1.0, with modifications in subsequent versions down to 5.1
* Combining Half Marks (FE20–FE2F), since version 1.0, with modifications
in subsequent versions down to 8.0
If it is, we check if the last cell appears to contain a valid symbol,
and if so, we attempt to compose (combine) the last cell with the
current character, using utf8proc.
If the result is a combined character, we replace the content in the
previous cell with the new, combined character.
Thus, if you select and copy the printed character, you would get
e.g. "\u00e1" instead of "a\u0301".
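
For reference, the range check over the five blocks listed above can be sketched as follows. The function name `is_combining` is an assumption for this example.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if wc falls in one of the five combining blocks listed above */
static bool
is_combining(uint32_t wc)
{
    return (wc >= 0x0300 && wc <= 0x036f) ||  /* Diacritical Marks */
           (wc >= 0x1ab0 && wc <= 0x1aff) ||  /* ... Extended */
           (wc >= 0x1dc0 && wc <= 0x1dff) ||  /* ... Supplement */
           (wc >= 0x20d0 && wc <= 0x20ff) ||  /* ... for Symbols */
           (wc >= 0xfe20 && wc <= 0xfe2f);    /* Half Marks */
}
```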
This feature can be disabled. By default, it is enabled if the
utf8proc library is found, but can be explicitly disabled, or enabled,
with 'meson -Dunicode-combining=disabled|enabled'.
This fixes an issue where we failed to restore the cursor correctly
when exiting from the alternate screen, if the client had sent escapes
to save the cursor position while inside the alternate screen.
This was because we used the *same* storage for saving the cursor
position through escapes, as for saving it when entering the alternate
screen.
Fix by using a custom variable dedicated to normal <--> alt screen
switching.
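
A minimal sketch of the idea, assuming hypothetical field and function names (`saved_cursor`, `alt_saved_cursor`, `enter_alt()`, and so on) rather than the real ones:

```c
struct coord {
    int row;
    int col;
};

struct term {
    struct coord cursor;
    struct coord saved_cursor;      /* used by the save/restore escapes */
    struct coord alt_saved_cursor;  /* used by normal <--> alt switching */
};

/* Client-initiated save/restore (escape sequences) */
static void save_cursor(struct term *t)    { t->saved_cursor = t->cursor; }
static void restore_cursor(struct term *t) { t->cursor = t->saved_cursor; }

/* Screen switching uses its own, dedicated storage */
static void enter_alt(struct term *t) { t->alt_saved_cursor = t->cursor; }
static void leave_alt(struct term *t) { t->cursor = t->alt_saved_cursor; }
```

With separate storage, a save issued while inside the alternate screen can no longer overwrite the position recorded when the alternate screen was entered.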
To handle text reflow correctly when a line has a printable character
in the last column, but was still followed by an explicit line break,
we need to track the fact that the slave inserted a line break here.
Otherwise, when the window width is increased, we'll end up pulling up
the next line, when we really should have inserted a line break.
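
To show why the flag matters, here is a simplified sketch that joins rows into logical lines before re-wrapping. The `struct row` layout and the `join_rows()` helper are assumptions for this example; the actual reflow code works on cells, not strings.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical row: its text, and whether the client explicitly
 * broke the line after it */
struct row {
    const char *text;
    bool linebreak;
};

/* Joins rows into logical lines. A row without the linebreak flag is
 * continued by the next row; a row with it ends a logical line, even
 * if it is completely full. out_size must be at least 1. */
static void
join_rows(const struct row *rows, size_t count, char *out, size_t out_size)
{
    size_t len = 0;
    out[0] = '\0';

    for (size_t i = 0; i < count; i++) {
        size_t n = strlen(rows[i].text);

        /* Room for the text, a possible '\n' and the terminator */
        if (len + n + 2 > out_size)
            break;

        memcpy(out + len, rows[i].text, n);
        len += n;

        if (rows[i].linebreak && i + 1 < count)
            out[len++] = '\n';
        out[len] = '\0';
    }
}
```

Without the flag, a full row followed by an explicit line break is indistinguishable from a row that merely wrapped.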
We only support 16 parameters, and for each parameter, 16
sub-parameters. If we ever hit that limit (or rather, if the client
writes 17 (sub) parameters), log this and stop incrementing the
parameter index variable.
For performance reasons, we implement the following behavior:
* We never increment the parameter index past the supported
number. This ensures all code *accessing* the parameter list can do
so without verifying the validity of the index.
* The *first* time we see too many parameters, and the first time we
see too many sub parameters, log this. Then *never* log again. Even
if we see too many parameters in a completely different escape. This
is so that we don't have to keep a "have warned" boolean in the
terminal struct, but can use a simple function local static
variable.
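
The behavior above can be sketched like this. The struct, `MAX_PARAMS` and `param_new()` are assumed names for illustration, and sub-parameters are left out for brevity.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define MAX_PARAMS 16

/* Hypothetical parameter list */
struct params {
    unsigned v[MAX_PARAMS];
    size_t idx;  /* number of parameters in use, never > MAX_PARAMS */
};

/* Called when the parser starts a new parameter */
static void
param_new(struct params *p)
{
    if (p->idx >= MAX_PARAMS) {
        /* Function-local static: warn once, for the lifetime of the
         * process, without any state in the terminal struct */
        static bool have_warned = false;
        if (!have_warned) {
            have_warned = true;
            fprintf(stderr, "unsupported: more than %d parameters\n",
                    MAX_PARAMS);
        }
        return;
    }

    p->v[p->idx++] = 0;
}
```

Because `idx` never exceeds `MAX_PARAMS`, code that reads `v[0] .. v[idx - 1]` needs no bounds check of its own.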
0x3a/0x3b are ':' and ';'. These should not only switch to the 'csi
param' state, but also be parsed as a parameter.
This fixes an issue where a multi-parameter escape with the first
parameter omitted was parsed incorrectly - as if the first parameter
wasn't there.
I.e. "\e[;123r" was parsed as "\e[123r"
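
A small, self-contained sketch of the intended result, handling only ';' (sub-parameters separated by ':' are omitted). `parse_params()` is a hypothetical helper, not the actual state machine; it only shows that a leading separator must itself count as a parameter boundary.

```c
#include <stddef.h>

/* Parses a CSI parameter string such as ";123" into params[].
 * Omitted parameters are stored as 0. Returns the number of
 * parameters (at least 1 when max > 0). */
static size_t
parse_params(const char *s, unsigned *params, size_t max)
{
    if (max == 0)
        return 0;

    size_t idx = 0;
    params[0] = 0;

    for (; *s != '\0'; s++) {
        if (*s >= '0' && *s <= '9') {
            params[idx] = params[idx] * 10 + (unsigned)(*s - '0');
        } else if (*s == ';') {
            /* The separator is parsed too, even as the first byte */
            if (idx + 1 >= max)
                break;
            params[++idx] = 0;
        } else
            break;
    }

    return idx + 1;
}
```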
mbrtowc() returns an unsigned value (size_t). Need to cast to signed
before checking if less than zero.
This fixes an issue where invalid utf-8 sequences were treated as valid.
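
The pitfall is that `ret < 0` is always false for a `size_t`. The sketch below compares against the two error values instead of casting to a signed type, which is an equivalent alternative to the cast; the helper names are assumptions for this example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* mbrtowc() signals errors with (size_t)-1 (invalid sequence) and
 * (size_t)-2 (incomplete sequence). Both are huge positive values. */
static bool
is_decode_error(size_t ret)
{
    return ret == (size_t)-1 || ret == (size_t)-2;
}

/* Decodes one character from buf, in the current locale's encoding.
 * Returns false for invalid or incomplete input. */
static bool
decode_one(const char *buf, size_t len, wchar_t *wc)
{
    mbstate_t ps;
    memset(&ps, 0, sizeof(ps));

    size_t ret = mbrtowc(wc, buf, len, &ps);
    return !is_decode_error(ret);
}
```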
Add data structure to term->vt. This structure tracks the free-form
data that is passed-through, and the handler to call at the end.
Intermediates and parameters are collected by the normal VT
parser. Then, when we enter the passthrough state, we call dcs_hook().
This function checks the intermediate(s) and parameters, and selects
the appropriate unhook handler (and optionally does some execution
already).
In passthrough mode, we simply append strings to an internal
buffer. This might have to be changed in the future, if we need to
support a DCS that needs to execute as we go.
In unhook (i.e. when the DCS is terminated), we execute the unhook
handler.
As a proof-of-concept, handlers for BSU/ESU (Begin/End Synchronized
Update) have been added (but are left unimplemented).
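
The hook / put / unhook flow described above might be sketched as follows. The struct layout, `dcs_put()`, `dcs_unhook()`, the call counter, and the assumption that BSU is selected by intermediate '=', parameter 1 and final 's' are all illustrative, not taken from the actual code.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical passthrough state */
struct dcs {
    char *data;
    size_t len;
    size_t size;
    void (*unhook_handler)(struct dcs *dcs);
};

static int bsu_calls;  /* only here to make the sketch observable */

static void
bsu(struct dcs *dcs)
{
    (void)dcs;
    bsu_calls++;  /* left unimplemented, as in the proof-of-concept */
}

/* Entering passthrough: select the unhook handler */
static void
dcs_hook(struct dcs *dcs, char intermediate, int param, char final)
{
    dcs->len = 0;
    dcs->unhook_handler = NULL;

    /* Assumed encoding of BSU for this example */
    if (intermediate == '=' && param == 1 && final == 's')
        dcs->unhook_handler = &bsu;
}

/* Passthrough: append to the internal buffer. Returns 0 on success. */
static int
dcs_put(struct dcs *dcs, char c)
{
    if (dcs->len == dcs->size) {
        size_t new_size = dcs->size == 0 ? 64 : dcs->size * 2;
        char *new_data = realloc(dcs->data, new_size);
        if (new_data == NULL)
            return -1;
        dcs->data = new_data;
        dcs->size = new_size;
    }

    dcs->data[dcs->len++] = c;
    return 0;
}

/* DCS terminated: run the handler selected in dcs_hook() */
static void
dcs_unhook(struct dcs *dcs)
{
    if (dcs->unhook_handler != NULL)
        dcs->unhook_handler(dcs);
    dcs->unhook_handler = NULL;
    dcs->len = 0;
}
```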
XTerm seems to ignore these when in UTF-8 mode. Since we _only_
support UTF-8, we don't need to recognize these control characters at
all.
However, it may be good to have them here for reference. So add them,
but commented out, along with their corresponding 7-bit
versions (which we _do_ recognize and implement).
When we insert an auto-newline, we must make sure we don't try to move
outside the terminal window.
This can for example happen when a scrolling region has been
configured, and the cursor is **outside** the scrolling
region (i.e. it's in the bottom margin).
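
A simplified model of the check, with assumed names (`struct term`, `linefeed()`, an exclusive `scroll_end`, and a counter standing in for the actual scrolling):

```c
/* Hypothetical terminal state */
struct term {
    int rows;          /* total number of rows in the window */
    int cursor_row;
    int scroll_start;  /* first row of the scrolling region */
    int scroll_end;    /* one past the last row of the region */
    int scrolled;      /* lines scrolled, stands in for real scrolling */
};

static void
linefeed(struct term *t)
{
    if (t->cursor_row == t->scroll_end - 1) {
        /* On the last row of the scrolling region: scroll it */
        t->scrolled++;
    } else if (t->cursor_row < t->rows - 1) {
        t->cursor_row++;
    }
    /* Otherwise we are on the very last row, below the scrolling
     * region: stay put instead of moving outside the window */
}
```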
Having them as error messages was nice when we were still missing
lots of sequences.
That is no longer the case, and now they just spam stdout as well as
syslog when e.g. cat:ing binary data.
In most states, most 8-bit values are no-ops. This is already handled;
action() recognizes ACTION_NONE as a no-op. Thus, all we need to do is
remove the assertion.