ColumnLimit check for trailing comments alignment acts wrong for multi-byte UTF-8 #47624

satinowl · 2020-11-24T08:58:40Z


Bugzilla Link	48280
Version	11.0
OS	Linux

Extended Description

.clang-format:

AlignTrailingComments: true
ColumnLimit: 80

What we have upon clang-format:

int a_short;         // some 1-byte UTF8 comment with some good and neat info
int a_veryverylong;  // some one-byte UTF8 comment breaks correctly at 80 char
                     // boundary

int a_short_rus;  // А теперь комментарии, например, на русском, 2-байта
int a_veryverylong_rus;  // Верхний коммент еще не превысил границу в 80, однако
                         // уже отодвинут. Перенос, при этом, отрабатывает верно

What we should have:

int a_short;         // some 1-byte UTF8 comment with some good info
int a_veryverylong;  // some one-byte UTF8 comment breaks correctly at 80 char
                     // boundary

int a_short_rus;         // А теперь комментарии, например, на русском, 2-байта
int a_veryverylong_rus;  // Верхний коммент еще не превысил границу в 80, однако
                         // уже отодвинут. Перенос, при этом, отрабатывает верно

Russian trailing comments go as UTF-8 2-byte characters, and, obviously, clang-format counts their length as raw byte count when checking if line is exceeded. As a result, comments fall back closer to code, while still having enough space for being aligned.
This is relevant only for trailing comment alignment tirgger check. Line break upon exceeding 80 character limit works correctly for multi-byte characters.

The text was updated successfully, but these errors were encountered:

StailGot · 2022-04-22T14:27:11Z

https://reviews.llvm.org/D124260

Fixes llvm#47333. Fixes llvm#47624. Fixes llvm#75929. Fixes llvm#87885. Fixes llvm#89916.

mgood7123 · 2024-04-29T01:43:32Z

Bugzilla Link 48280
Version 11.0
OS Linux

Extended Description

.clang-format:
AlignTrailingComments: true
ColumnLimit: 80
What we have upon clang-format:
int a_short;         // some 1-byte UTF8 comment with some good and neat info
int a_veryverylong;  // some one-byte UTF8 comment breaks correctly at 80 char
                     // boundary

int a_short_rus;  // А теперь комментарии, например, на русском, 2-байта
int a_veryverylong_rus;  // Верхний коммент еще не превысил границу в 80, однако
                         // уже отодвинут. Перенос, при этом, отрабатывает верно
What we should have:
int a_short;         // some 1-byte UTF8 comment with some good info
int a_veryverylong;  // some one-byte UTF8 comment breaks correctly at 80 char
                     // boundary

int a_short_rus;         // А теперь комментарии, например, на русском, 2-байта
int a_veryverylong_rus;  // Верхний коммент еще не превысил границу в 80, однако
                         // уже отодвинут. Перенос, при этом, отрабатывает верно
Russian trailing comments go as UTF-8 2-byte characters, and, obviously, clang-format counts their length as raw byte count when checking if line is exceeded. As a result, comments fall back closer to code, while still having enough space for being aligned. This is relevant only for trailing comment alignment tirgger check. Line break upon exceeding 80 character limit works correctly for multi-byte characters.

i wonder how this happens, as if we count 2 bytes instead of 1 then shouldnt clang-format be breaking at around 40'th column since it was counting 2 bytes as per raw byte count ?

more specifically i would probably expect it to maybe be like the following comment

mgood7123 · 2024-04-29T02:26:28Z

i have done some character math kung-fu and come up with my own line breaking which i think should have been done had clang actually formatted on a per-byte-count basis

// we assume 'i' consumes col 1 and 'r' consumes column 79 and 'к' consumes column 80
// this assumes each russain character takes up 1 column for every 2 raw bytes

int a_short;         // some 1-byte UTF8 comment with some good info
int a_veryverylong;  // some one-byte UTF8 comment breaks correctly at 80 char
                     // boundary

int a_short_rus;         // А теперь комментарии, например, на русском, 2-байта
int a_veryverylong_rus;  // Верхний коммент еще не превысил границу в 80, однак
                         // оуже отодвинут. Перенос, при этом, отрабатывает вер
                         // но

// we assume 'i' consumes col 1 and 'r' consumes column 79 and 'к' consumes column 80
// this assumes each russain character takes up 1 column for every 1 raw bytes

int a_short;         // some 1-byte UTF8 comment with some good info
int a_veryverylong;  // some one-byte UTF8 comment breaks correctly at 80 char
                     // boundary

// we calculate where the line break should occur based on raw 2 byte output
// transform 1 byte column to 2 byte by duping each russian char not including ascii ',' and ascii ' ' since those are 1 byte
// we assume '2-' are ascii '2' '-' note 'б' and '6' are different yet similar
//
// we can confirm this by highlighting the letter and seeing what else gets highlighted in the editor
// if we highlight ascii 'a' then no russain 'а' should be highlighted, and vice versa
// unfortunately ' ' itself does not get highlighted unless a non white space is also highlighted
//
// doing so we get the following transformation (shift+right, ctrl+c, ctrl+v, ctrl+v, repeat)
//
// int a_short_rus;         // А теперь комментарии, например, на русском, 2-байта
// ->
// int a_short_rus;         // АА ттееппееррьь ккооммммееннттааррииии, ннааппррииммеерр, ннаа ррууссссккоомм, 2-бааййттаа
//

// fully padded we get the following, with original and padded below

int a_short_rus;         // А теперь комментарии, например, на русском, 2-байта
int a_short_rus;         // АА ттееппееррьь ккооммммееннттааррииии, ннааппррииммеерр, ннаа ррууссссккоомм, 2-ббааййттаа
int a_veryverylong_rus;  // Верхний коммент еще не превысил границу в 80, однак
int a_veryverylong_rus;  // ВВееррххнниийй ккооммммееннтт еещщее ннее ппррееввыыссиилл ггррааннииццуу вв 80, ооддннаакк
                         // оуже отодвинут. Перенос, при этом, отрабатывает вер
                         // ооуужжее ооттооддввииннуутт. ППеерреенноосс, ппррии ээттоомм, ооттррааббааттыыввааеетт ввеерр
                         // но
                         // нноо

// if we remove the original we get the following

int a_short_rus;         // АА ттееппееррьь ккооммммееннттааррииии, ннааппррииммеерр, ннаа ррууссссккоомм, 2-ббааййттаа
int a_veryverylong_rus;  // ВВееррххнниийй ккооммммееннтт еещщее ннее ппррееввыыссиилл ггррааннииццуу вв 80, ооддннаакк
                         // ооуужжее ооттооддввииннуутт. ППеерреенноосс, ппррии ээттоомм, ооттррааббааттыыввааеетт ввеерр
                         // нноо

// if we break by character we get the following

int a_short_rus;         // А теперь комментарии, например, на русском, 2-байта//ORIGINAL
int a_short_rus;         // АА ттееппееррьь ккооммммееннттааррииии, ннааппррии
                         // ммеерр, ннаа ррууссссккоомм, 2-ббааййттаа
int a_veryverylong_rus;  // Верхний коммент еще не превысил границу в 80, однак//ORIGINAL
                         // оуже отодвинут. Перенос, при этом, отрабатывает вер//ORIGINAL
                         // но//ORIGINAL
int a_veryverylong_rus;  // ВВееррххнниийй ккооммммееннтт еещщее ннее ппрреевв
                         // ыыссиилл ггррааннииццуу вв 80, ооддннаакк ооуужжее
                         // ооттооддввииннуутт. ППеерреенноосс, ппррии ээттоо
                         // мм, ооттррааббааттыыввааеетт ввеерр нноо

// if we break by whitespace we get the following

int a_short_rus;         // А теперь комментарии, например, на русском, 2-байта//ORIGINAL
int a_short_rus;         // АА ттееппееррьь ккооммммееннттааррииии, 
                         // ннааппррииммеерр, ннаа ррууссссккоомм, 2-ббааййттаа
int a_veryverylong_rus;  // Верхний коммент еще не превысил границу в 80, однак//ORIGINAL
                         // оуже отодвинут. Перенос, при этом, отрабатывает вер//ORIGINAL
                         // но//ORIGINAL
int a_veryverylong_rus;  // ВВееррххнниийй ккооммммееннтт еещщее ннее 
                         // ппррееввыыссиилл ггррааннииццуу вв 80, ооддннаакк
                         // ооуужжее ооттооддввииннуутт. ППеерреенноосс, ппррии
                         // ээттоомм, ооттррааббааттыыввааеетт ввеерр нноо

// if we strip the duplicates (broken by char) we get the following

int a_short_rus;         // А теперь комментарии, например, на русском, 2-байта//ORIGINAL
int a_short_rus;         // А теперь комментарии, напри
                         // мер, на русском, 2-байта
int a_veryverylong_rus;  // Верхний коммент еще не превысил границу в 80, однак//ORIGINAL
                         // оуже отодвинут. Перенос, при этом, отрабатывает вер//ORIGINAL
                         // но//ORIGINAL
int a_veryverylong_rus;  // Верхний коммент еще не прев
                         // ысил границу в 80, однак оуже
                         // отодвинут. Перенос, при это
                         // м, отрабатывает вер но

// if we strip the duplicates (break by whitespace) we get the following

int a_short_rus;         // А теперь комментарии, например, на русском, 2-байта//ORIGINAL
int a_short_rus;         // А теперь комментарии, 
                         // например, на русском, 2-байта
int a_veryverylong_rus;  // Верхний коммент еще не превысил границу в 80, однак//ORIGINAL
                         // оуже отодвинут. Перенос, при этом, отрабатывает вер//ORIGINAL
                         // но//ORIGINAL
int a_veryverylong_rus;  // Верхний коммент еще не 
                         // превысил границу в 80, однак
                         // оуже отодвинут. Перенос, при
                         // этом, отрабатывает вер но

// if we strip the duplicates (broken by char) we get the following (without the original)

int a_short_rus;         // А теперь комментарии, напри
                         // мер, на русском, 2-байта
int a_veryverylong_rus;  // Верхний коммент еще не прев
                         // ысил границу в 80, однак оуже
                         // отодвинут. Перенос, при это
                         // м, отрабатывает вер но

// if we strip the duplicates (break by whitespace) we get the following (without the original)

int a_short_rus;         // А теперь комментарии, 
                         // например, на русском, 2-байта
int a_veryverylong_rus;  // Верхний коммент еще не 
                         // превысил границу в 80, однак
                         // оуже отодвинут. Перенос, при
                         // этом, отрабатывает вер но

// and compared with the original (assuming 2 bytes equals 1 col as per correct)
//
// we see that the (incorrect) above yields expectedly shorter line break columns for 2 byte per 2 columns
// instead of the (correct) below intended 2 byte per 1 col for correct russain 2-byte chars

// we assume 'i' consumes col 1 and 'r' consumes column 79 and 'к' consumes column 80
// this assumes each russain character takes up 1 column for every 2 raw bytes

int a_short;         // some 1-byte UTF8 comment with some good info
int a_veryverylong;  // some one-byte UTF8 comment breaks correctly at 80 char
                     // boundary

int a_short_rus;         // А теперь комментарии, например, на русском, 2-байта
int a_veryverylong_rus;  // Верхний коммент еще не превысил границу в 80, однак
                         // оуже отодвинут. Перенос, при этом, отрабатывает вер
                         // но

mgood7123 · 2024-04-29T02:35:00Z

do note however, that as per

// fully padded we get the following, with original and padded below

int a_short_rus;         // А теперь комментарии, например, на русском, 2-байта
int a_short_rus;         // АА ттееппееррьь ккооммммееннттааррииии, ннааппррииммеерр, ннаа ррууссссккоомм, 2-ббааййттаа
int a_veryverylong_rus;  // Верхний коммент еще не превысил границу в 80, однак
int a_veryverylong_rus;  // ВВееррххнниийй ккооммммееннтт еещщее ннее ппррееввыыссиилл ггррааннииццуу вв 80, ооддннаакк
                         // оуже отодвинут. Перенос, при этом, отрабатывает вер
                         // ооуужжее ооттооддввииннуутт. ППеерреенноосс, ппррии ээттоомм, ооттррааббааттыыввааеетт ввеерр
                         // но
                         // нноо

the following letters consume part of the boundary

мм -> ннааппррииммеерр -> например
ыы -> ппррееввыыссиилл -> превысил

thus following broken by char semantics the 2-byte utf8 character would have been split with ? denoting the split utf8 assumably now malformed single-byte UTF8

мм -> ннааппрриим меерр -> напри? ?ер
ыы -> ппррееввы ыссиилл -> прев? ?сил

mgood7123 · 2024-04-29T02:41:42Z

also note that since clang-format DID NOT output such format, it either did not truly work at 1-byte granularity as you suggested, or it did something else

Fixes llvm#37705. Fixes llvm#47333. Fixes llvm#47624. Fixes llvm#58850. Fixes llvm#75929. Fixes llvm#87885. Fixes llvm#89916.

llvmbot transferred this issue from llvm/llvm-bugzilla-archive Dec 10, 2021

HazardyKnusperkeks added the awaiting-review Has pending Phabricator review label Apr 22, 2022

owenca removed the awaiting-review Has pending Phabricator review label Sep 14, 2023

owenca self-assigned this Apr 27, 2024

owenca added a commit to owenca/llvm-project that referenced this issue Apr 28, 2024

[clang-format] Set Change.TokenLength to ColumnWidth

d671f32

Fixes llvm#47333. Fixes llvm#47624. Fixes llvm#75929. Fixes llvm#87885. Fixes llvm#89916.

owenca mentioned this issue Apr 28, 2024

[clang-format] Set Change.TokenLength to ColumnWidth #90378

Merged

owenca closed this as completed in #90378 Apr 28, 2024

owenca closed this as completed in aa596fa Apr 28, 2024

sabauma pushed a commit to sabauma/llvm-project that referenced this issue May 1, 2024

[clang-format] Set Change.TokenLength to ColumnWidth (llvm#90378)

e04f1eb

Fixes llvm#37705. Fixes llvm#47333. Fixes llvm#47624. Fixes llvm#58850. Fixes llvm#75929. Fixes llvm#87885. Fixes llvm#89916.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ColumnLimit check for trailing comments alignment acts wrong for multi-byte UTF-8 #47624

ColumnLimit check for trailing comments alignment acts wrong for multi-byte UTF-8 #47624

satinowl commented Nov 24, 2020 •

edited by mkurdej

StailGot commented Apr 22, 2022

mgood7123 commented Apr 29, 2024 •

edited

Extended Description

mgood7123 commented Apr 29, 2024 •

edited

mgood7123 commented Apr 29, 2024 •

edited

mgood7123 commented Apr 29, 2024

ColumnLimit check for trailing comments alignment acts wrong for multi-byte UTF-8 #47624

ColumnLimit check for trailing comments alignment acts wrong for multi-byte UTF-8 #47624

Comments

satinowl commented Nov 24, 2020 • edited by mkurdej

Extended Description

StailGot commented Apr 22, 2022

mgood7123 commented Apr 29, 2024 • edited

Extended Description

mgood7123 commented Apr 29, 2024 • edited

mgood7123 commented Apr 29, 2024 • edited

mgood7123 commented Apr 29, 2024

satinowl commented Nov 24, 2020 •

edited by mkurdej

mgood7123 commented Apr 29, 2024 •

edited

mgood7123 commented Apr 29, 2024 •

edited

mgood7123 commented Apr 29, 2024 •

edited