GitHub Issue #1520

CCExtractor.org CI Platform

Mar 27, 2023, 5:52:20 PM
to ccextra...@googlegroups.com
[BUG] WebVTT style/characters get out of sync when non-ASCII characters are used - dhouck
Link to Issue: https://www.github.com/CCExtractor/ccextractor/issues/1520
dhouck

CCExtractor version: Compiled myself from fa85a527 (I double-checked several times); I donʼt know why it thinks itʼs on d379d726

```
CCExtractor 0.94, Carlos Fernandez Sanz, Volker Quetschke.
Teletext portions taken from Petr Kutalek's telxcc

CCExtractor detailed version info
    Version: 0.94
    Git commit: d379d72685959859db797621f270aeeb01a50021
    Compilation date: 2023-03-26
    CEA-708 decoder: Rust
    File SHA256: f7edb9796bf45c48bf3fe80db340293854e394f4ed0960f0f730d2ab5eec9028
Libraries used by CCExtractor
    libGPAC Version: 1.0.1
    zlib: 1.2.11
    utf8proc Version: 2.4.0
    protobuf-c Version: 1.3.1
    libpng Version: 1.6.37
    FreeType
    libhash
    nuklear
    libzvbi
```

Necessary information

  • Is this a regression (i.e. did it work before)? New behavior, but it was even worse before
  • What platform did you use? {Window/Linux/Mac} Linux
  • What were the used arguments? {replace with the arguments}

Video links

[Same test input #1516; no need to re-upload] Current output after #1518: test.vtt.gz

Expected output: there should be a space between the ♪ and the </i>; see this line of the g608: ♪ ^@99999999999999000999999999999999RRRRRRRRRRRRRRIIIRRRRRRRRRRRRRRR

The SRT line has the space before the </i>; the WebVTT-full one has it after. This isnʼt a big deal in this sample, but the same thing would happen in a visible way in most circumstances. For example, if the line were supposed to be <i> ♪♪ [epic music] ♪♪ </i>, then it would instead be <i> ♪♪ [epic music</i>] ♪♪.

Additional information

This is a follow-up for #1516. Its fix, #1518, does prevent splitting characters, but the styling will still always get out of sync with the text if there are any multibyte characters. This is because it still uses j as both a bytes index and a screen index; I think a more comprehensive fix would be to use j as only a screen index, like the SRT decoder does, and decode each symbol separately.
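
To illustrate the index mix-up described above, here is a minimal sketch in Python. It is not CCExtractor's actual code (which is C and Rust); the function names `render_row` and `render_row_byte_indexed`, and the simplified font row of `I`/`R` flags, are assumptions made up for this example. The first function uses `j` only as a screen index and emits each symbol separately; the second reuses a byte offset as the column, which is the pattern that lets styling drift once a multibyte character appears.

```python
NOTE = "\u266a"  # the eighth-note symbol; 3 bytes in UTF-8


def render_row(chars, fonts):
    """Render one screen row, using j strictly as a screen column.

    chars: one character per screen cell.
    fonts: one flag per screen cell, 'I' for italic or 'R' for regular.
    Both names are hypothetical; this is only a sketch.
    """
    if len(chars) != len(fonts):
        raise ValueError("chars and fonts must cover the same cells")
    out = []
    italic = False
    for j, ch in enumerate(chars):  # j counts screen cells, never bytes
        want = fonts[j] == "I"
        if want and not italic:
            out.append("<i>")
        elif italic and not want:
            out.append("</i>")
        italic = want
        out.append(ch)  # each symbol is emitted on its own
    if italic:
        out.append("</i>")
    return "".join(out)


def render_row_byte_indexed(chars, fonts):
    """Deliberately flawed variant: j is a byte offset reused as a column."""
    data = chars.encode("utf-8")
    out = bytearray()
    italic = False
    for j, byte in enumerate(data):
        # Style lookup by byte offset; cells past the end count as regular.
        want = j < len(fonts) and fonts[j] == "I"
        # Only toggle at the start of a character, so no symbol is split.
        starts_char = (byte & 0xC0) != 0x80
        if starts_char and want != italic:
            out += b"<i>" if want else b"</i>"
            italic = want
        out.append(byte)
    if italic:
        out += b"</i>"
    return out.decode("utf-8")


if __name__ == "__main__":
    row = f" {NOTE}{NOTE} [epic music] {NOTE}{NOTE} "
    fonts = "I" * len(row)
    print(render_row(row, fonts))
    print(render_row_byte_indexed(row, fonts))
```

In this toy model the byte-indexed variant closes the italic span early, because every ♪ advances the byte offset by three while the screen column advances by one. The exact position of the misplaced tag in the real decoder may differ from this sketch.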

I have a fix Iʼm planning to upload shortly, although probably a better fix is possible and I wonʼt be disappointed if someone comes along and refactors the entire loop or function away.
