C++ formatting library proposal


Victor Zverovich

Jun 6, 2017, 9:41:26 AM
to std-pr...@isocpp.org
I'm working on a proposal for a new formatting functionality for the C++ standard library based on the fmt library (https://github.com/fmtlib/fmt).

The first draft of the proposal is available at http://fmtlib.net/Text%20Formatting.html and I would appreciate any comments.

- Victor

Nicol Bolas

Jun 6, 2017, 10:04:37 AM
to ISO C++ Standard - Future Proposals

I'm concerned about this:

> The formatting library uses a null-terminated string view basic_cstring_view instead of basic_string_view. This results in somewhat smaller and faster code because the string size, which is not used, doesn't have to be computed and passed. Also having a termination character makes parsing easier.

This really hurts users of `basic_string_view`, since their views are not NUL-terminated. Such users don't necessarily have to calculate the string size either; the `sv` user-defined literal will generate the size based on the input literal.

Your insistence on `basic_cstring_view` only aids scenarios where the user has an unsized, NUL-terminated string. So they would have to not be using `basic_string` or similar types (which are sized).
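
To illustrate the point about sized views: in C++17 both the `sv` literal and the `basic_string_view` constructor taking a `const char*` are usable in constant expressions, so for a string literal the size does not have to be computed at run time. A minimal sketch (the variable names are illustrative only):

```cpp
#include <string_view>

using namespace std::literals::string_view_literals;

// The sv literal receives the length from the compiler; no strlen call is involved.
constexpr std::string_view from_literal = "Hello, {}!"sv;

// The const char* constructor is constexpr in C++17 as well, so for a
// literal the length can also be computed during constant evaluation.
constexpr std::string_view from_pointer("Hello, {}!");

static_assert(from_literal.size() == 10, "length known at compile time");
static_assert(from_pointer.size() == from_literal.size(), "same length either way");
```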

Victor Zverovich

Jun 6, 2017, 10:41:04 AM
to std-pr...@isocpp.org
> So they would have to not be using `basic_string` or similar types (which are sized).

Not really, because `basic_cstring_view` can be easily constructed from `basic_string`, but I agree there is a tradeoff between code size and usability here. I was considering replacing `basic_cstring_view` with `basic_string_view` to avoid an extra type but decided to hear the feedback first. Note that `basic_cstring_view` is only used for the format string itself which is very often a string literal.

Cheers,
Victor

--
You received this message because you are subscribed to the Google Groups "ISO C++ Standard - Future Proposals" group.
To unsubscribe from this group and stop receiving emails from it, send an email to std-proposal...@isocpp.org.
To post to this group, send email to std-pr...@isocpp.org.
To view this discussion on the web visit https://groups.google.com/a/isocpp.org/d/msgid/std-proposals/e6e42f40-be11-46a4-824f-37199e9fa8da%40isocpp.org.

Bengt Gustafsson

Jun 6, 2017, 11:37:14 AM
to ISO C++ Standard - Future Proposals
You may want to overload format() for both const Char* and basic_string_view to avoid having to count the characters when not necessary, while avoiding a deep copy just to add a \0 if you have a std::string to start from. Internally you can stop parsing when you reach a \0 or the end iterator. Allowing a \0 in a std::string to propagate to the output will most likely do more harm than good. Introducing a basic_cstring_view seems like overdoing it.
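
A minimal sketch of the two-overload idea, assuming C++17. The names `count_fields` and `count_fields_impl` are hypothetical and not part of the proposal or of fmt; counting `{` characters merely stands in for real format-string parsing. Both overloads share one loop that stops at a `\0` or at the end of the view, whichever comes first:

```cpp
#include <cstddef>
#include <string_view>

// Shared loop: stops at '\0' or, for sized input, at `end`, whichever comes first.
// It only counts '{' characters as a stand-in for real format parsing.
inline std::size_t count_fields_impl(const char* it, const char* end, bool sized) {
  std::size_t fields = 0;
  while (!(sized && it == end) && *it != '\0') {
    if (*it == '{') ++fields;
    ++it;
  }
  return fields;
}

// Overload for unsized, NUL-terminated strings: the length is never computed.
inline std::size_t count_fields(const char* format_str) {
  return count_fields_impl(format_str, nullptr, false);
}

// Overload for sized views: neither a terminator nor a copy is required.
inline std::size_t count_fields(std::string_view format_str) {
  return count_fields_impl(format_str.data(),
                           format_str.data() + format_str.size(), true);
}
```

A string literal argument selects the `const char*` overload, while a `std::string` or a `string_view` goes through the sized overload without being copied.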

Nicol Bolas

Jun 6, 2017, 12:06:02 PM
to ISO C++ Standard - Future Proposals
On Tuesday, June 6, 2017 at 10:41:04 AM UTC-4, Victor Zverovich wrote:
> > So they would have to not be using `basic_string` or similar types (which are sized).
>
> Not really, because `basic_cstring_view` can be easily constructed from `basic_string`, but I agree there is a tradeoff between code size and usability here. I was considering replacing `basic_cstring_view` with `basic_string_view` to avoid an extra type but decided to hear the feedback first. Note that `basic_cstring_view` is only used for the format string itself which is very often a string literal.

"very often" is true for some cases, but not for many others that your system would be suitable for.

The primary reason for positional arguments is to make translations easier/possible. Generally speaking, translations aren't stored in executables as string literals; they're stored in other files. If those are plain text files, perhaps with some markup to designate the end of one format, I see no reason that I should have to put `\0` characters in all of them.

Bengt's suggestion seems the most reasonable here: just provide one overload for NUL-terminated strings and another for `string_view`s. You say that this is for "code size" reasons, but I fail to see how the code size will be impacted significantly enough to be worth the hassle.

Thiago Macieira

Jun 6, 2017, 1:38:36 PM
to std-pr...@isocpp.org
On Tuesday, 6 June 2017 09:06:02 PDT Nicol Bolas wrote:
> "very often" is true for some cases, but not for many others that your
> system would be suitable for.
>
> The primary reason for positional arguments is to make translations
> easier/possible. Generally speaking, translations aren't stored in
> executables as string literals; they're stored in other files. If those are
> plain text files, perhaps with some markup to designate the end of one
> format, I see no reason that I should have to put `\0` characters in all of
> them.

If it's a binary file (like .qm), then most likely the size is stored somewhere
due to the construction of the file (indexing, etc.). The presence of a null
terminator is likely in file formats meant to be used with C API (like
libintl's .mo), but native C++ APIs may want to skip that.

> Bengt's suggestion seems the most reasonable here: just provide a separate
> overload for NUL-terminated strings than for `string_view`s. You say that
> this is for "code size" reasons, but I fail to see how the code size will
> be impacted significantly enough to be worth the hassle.

Good compilers optimise strlen of a string literal to a constexpr constant
anyway, when compiled in release mode.

--
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
Software Architect - Intel Open Source Technology Center

Bengt Gustafsson

Jun 6, 2017, 4:44:20 PM
to ISO C++ Standard - Future Proposals

> Good compilers optimise strlen of a string literal to a constexpr constant
> anyway, when compiled in release mode.

Of course, you're right, basic_string_view seems the right choice then, after all.

Bengt Gustafsson

Jun 6, 2017, 6:12:27 PM
to ISO C++ Standard - Future Proposals
I like this functionality a lot. It has been sorely lacking in C++ and is becoming more and more of a standard in many other languages.

Some comments:

- I would like the part before the : inside {} to allow any text. This allows translators to better understand the meaning of the inserted value and thus produce better translations. Not writing anything before the : would still be equivalent to writing the next consecutive number. I think you removed this capability from the Python implementation because C++ does not have named arguments, but it still has value in conveying information between the programmer and the person doing the translation.

- Some kind of standard interface to internationalization string replacement could be incorporated in format(), so that we don't have to write _() or something around the literal.

- How does one specify . as the decimal char in a country where , is standard, and vice versa? n only seems to relate to thousand separator handling, but the usual problem is to get the right decimal character in float number output. Many file formats (JSON for instance) require a dot even in countries where comma is standard. This is a recurring problem as "someone else" may set your locale on a global level. I have a similar library where I use comma or dot (in the dot position in the format string) to denote themselves as decimal char while semicolon denotes "the locale decimal char". While creating your own buffer type which overrides the locale() method is possible, it is much more work than defining the decimal char in the format string.

- I don't particularly like the partial reuse of the C formatting conventions with a specified order of the parts of the format string, as it is hard to learn. Sure, it is rather easy when you have learned printf, but do we want future generations of programmers to have to go through that detour? Better to start from scratch with something logical, where the order of characters is not crucial.

- It may be better to send the formatting string snippet to format_value() as a basic_string_view rather than relying on all the function overloads to remember to bump the ptr correctly. At least provide a method in ctx to forward the ptr and return the basic_string_view containing the format part in case you want to preserve the possibility to nest {} inside the formatting string. This simplifies for the fairly large share of format_value functions which are implemented but don't care about any formatting details.

- A maximum inserted length is often useful to limit the output. One case is when formatting large doubles in the f format. This can get ridiculous in printf. Another is potentially long file names, where you may want to limit the string length.

- It is hard to understand the "arg store" idea and how this goes together with calling format_value on each parameter. It may be that the fmt::args and basic_args are actually the same. There is also arg_store and basic_arg and the visit() function which seem more like internal details of the implementation. Even as an implementation it seems overly complex, and I for one can't understand how visit can call separate user defined format_value functions if arg_store is not templated on the argument types.

- It would be nice if it was easy to create a formatting object from a format specification string. This object should have methods to format the standard types that format() works for "out of the box". In this way you can easily override format_value for instance for std::vector<T>, passing each element to a formatting object you have created. Of course you can also call format_value() for each T provided it exists, but this incurs quite an overhead as the same format specification string is parsed over and over again for each array element. Taking this thought to the logical conclusion means that the customization point for a user defined type should be a function create_formatter<T>(string_view format_specification_string) rather than format_value. The formatter returned from create_formatter then has a method format(const T& value) which does the formatting job. This allows the vector optimization to be taken to vectors of vectors etc. Continuing on this tangent I think it may make sense to allow pre-creating a formatting object for an entire format string, so that the parsing of that string occurs only once. Let's call this class formatter<Ts...>. Example:

template<typename... Ts> class formatter {
public:
    formatter(string_view format_string);  // This parses the format_string and stores the objects returned from create_formatter<T> for each of the Ts.

    void run(buffer& buf, const Ts&... args);   // Perform the formatting to a buffer
    string run(const Ts&... args);              // Perform the formatting and return a string
};

// The format function is then defined as:
template<typename... Ts> string format(string_view format_string, const Ts&... values)
{
     formatter<Ts...> fmt(format_string);
     return fmt.run(values...);
}
// A good compiler hopefully generates the same code as in the current implementation.
// A loop for printing many lines would be more effective if written:

formatter<int, string, double> fmt("#{IX}: {NAME}, {PRICE:.2}");   // Preparse format
for (const auto& item : inventory)
    cout << fmt.run(item.ix, item.name, item.price);
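
The pre-parsing idea can be tried out with a much smaller toy than the class outlined above. The following is only a sketch under simplifying assumptions: `preparsed_format` is a hypothetical name, it understands nothing but sequential `{}` fields, and the arguments are passed as already-converted strings. What it shows is that the format string is scanned once in the constructor and that `run` only concatenates stored pieces:

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

class preparsed_format {
 public:
  // Scan the format string once and keep the literal text between "{}" fields.
  explicit preparsed_format(std::string_view fmt) {
    std::size_t start = 0;
    for (;;) {
      std::size_t pos = fmt.find("{}", start);
      if (pos == std::string_view::npos) break;
      literals_.emplace_back(fmt.substr(start, pos - start));
      start = pos + 2;
    }
    literals_.emplace_back(fmt.substr(start));  // trailing text, possibly empty
  }

  // Interleave the stored literals with the arguments; no parsing happens here.
  std::string run(const std::vector<std::string>& args) const {
    std::string out = literals_.front();
    for (std::size_t i = 1; i < literals_.size(); ++i) {
      if (i - 1 < args.size()) out += args[i - 1];  // missing args are skipped
      out += literals_[i];
    }
    return out;
  }

 private:
  std::vector<std::string> literals_;  // always holds (number of fields + 1) entries
};
```

Constructing one object outside a loop and calling `run` per item then pays the parsing cost once, which is the saving the suggestion is after.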

Victor Zverovich

Jun 7, 2017, 10:42:17 AM
to std-pr...@isocpp.org
I was hoping to avoid having extra overloads, so maybe I'll just go with basic_string_view. I don't remember the numbers, but there was a noticeable regression on the code bloat benchmark (https://github.com/fmtlib/fmt#compile-time-and-code-bloat) when doing this, due to passing an extra size argument and more formatting arguments spilling onto the stack. The benchmark is somewhat artificial, though, and it may not be important in practice.

- Victor


Victor Zverovich

Jun 8, 2017, 9:45:25 AM
to std-pr...@isocpp.org
Thanks for such detailed feedback.

> I would like the part before the : inside {} to allow any text.

The fmt library, on which this proposal is based, supports named arguments:

  format("The answer is {answer:}", arg("answer", 42));

or, with user defined literals,

  format("The answer is {answer:}", "answer"_a=42);

but I decided to omit this feature from the first draft of the proposal to keep it manageable. It should be possible to add named arguments as an extension.

> Some kind of standard interface to internationalization string replacement could be incorporated in format(), so that we don't have to write _() or something around the literal.

I think some i18n features can be built on top of this formatting library. In particular, a user can easily create a formatting function with the same API as `format` that looks up the format string in the database of translations. This doesn't address marking strings for translation though. The latter is more of a tooling issue.
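
A rough sketch of such a wrapper, assuming C++17. Everything here is hypothetical: `tr_format`, `catalog` and the toy `format_positional` (which only understands `{0}`, `{1}`, ... and takes pre-converted string arguments) stand in for the real `format` and for a real translation database. It also shows why positional arguments matter for translations, since the translated string may reorder the fields:

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <unordered_map>
#include <vector>

// Toy stand-in for format(): replaces {N} with args[N]; anything else is copied as is.
inline std::string format_positional(std::string_view fmt,
                                     const std::vector<std::string>& args) {
  std::string out;
  std::size_t i = 0;
  while (i < fmt.size()) {
    if (fmt[i] != '{') { out += fmt[i++]; continue; }
    std::size_t close = fmt.find('}', i);
    if (close == std::string_view::npos) { out.append(fmt.substr(i)); break; }
    std::size_t index = 0;
    bool valid = close > i + 1;  // at least one digit between the braces
    for (std::size_t j = i + 1; j < close; ++j) {
      if (fmt[j] < '0' || fmt[j] > '9') { valid = false; break; }
      index = index * 10 + static_cast<std::size_t>(fmt[j] - '0');
    }
    if (valid && index < args.size()) out += args[index];
    else out.append(fmt.substr(i, close - i + 1));  // leave unknown fields untouched
    i = close + 1;
  }
  return out;
}

// Maps the source-language format string to its translation.
using catalog = std::unordered_map<std::string, std::string>;

// Same shape as a format call, but the format string is looked up first.
inline std::string tr_format(const catalog& translations, const std::string& fmt,
                             const std::vector<std::string>& args) {
  auto it = translations.find(fmt);
  return format_positional(it != translations.end() ? it->second : fmt, args);
}
```

If the catalog has no entry, the original format string is used unchanged, so untranslated messages still format.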

> How to specify using . for decimal char in a country where , is standard and vice versa.

As in Python, the default format is locale-independent for exactly the same reason that you mention - many output formats require this. So the default output uses '.' as a decimal separator. Locale-specific number formatting is done with the 'n' format specifier in which case either the current locale is used or the user can provide a locale via a buffer. It might be worth adding a simpler API for providing a custom locale, e.g. via a separate format overload.
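
For comparison, this is roughly what the same distinction looks like with today's iostreams, where the locale has to be imbued explicitly. The sketch uses only standard facilities; `to_fixed` and `comma_decimal` are illustrative names, and the facet models a locale whose decimal separator is a comma:

```cpp
#include <ios>
#include <locale>
#include <sstream>
#include <string>

// Facet that changes only the decimal point, modelling a locale that uses ','.
struct comma_decimal : std::numpunct<char> {
 protected:
  char do_decimal_point() const override { return ','; }
};

// Fixed-point formatting; uses the classic "C" locale unless one is passed in,
// so the result does not depend on whatever global locale has been set elsewhere.
inline std::string to_fixed(double value, int precision,
                            const std::locale& loc = std::locale::classic()) {
  std::ostringstream out;
  out.imbue(loc);
  out << std::fixed;
  out.precision(precision);
  out << value;
  return out.str();
}
```

Calling `to_fixed(3.14159, 2)` always uses '.', while passing `std::locale(std::locale::classic(), new comma_decimal)` as the third argument opts in to the comma; the locale object takes ownership of the facet.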

> I don't particularly like the partial reuse of the C formatting conventions with a specified order of the parts of the format string, as it is hard to learn.

I guess the grammar can be extended to allow reordering some of the specifiers, but that will raise some questions, such as how to handle duplicate specifiers and why some specifiers can be reordered and some cannot (for example, you can't move fill without introducing ambiguity). The current syntax has the advantage that it has already been proven to work in Python and Rust as well as a few C++ libraries.

> It may be better to send the formatting string snippet to format_value() as a basic_string_view rather than relying on all the function overloads to remember to bump the ptr correctly. At least provide a method in ctx to forward the ptr and return the basic_string_view containing the format part ...

Providing such a method sounds like a good idea. I'll add it to my TODO list.

> A maximum inserted length is often useful to limit the output.

Do you mean something like snprintf?

> It is hard to understand the "arg store" idea and how this goes together with calling format_value on each parameter.

arg_store is basically a std::array of variants representing arguments. For a user-defined type the variant stores a pointer to the object (as void*) and a pointer to a little wrapper function that casts the pointer to the correct type and invokes the format_value function.
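
The type-erasure mechanism described here can be sketched in a few lines. This is not the fmt implementation: `erased_arg`, `make_erased` and `make_erased_args` are hypothetical names, the "variant" is reduced to the pointer-plus-function case, and `format_value` is given a simplified signature that appends to a `std::string`:

```cpp
#include <array>
#include <string>

// Simplified customization points: append a textual form of the value to `out`.
inline void format_value(std::string& out, int value) { out += std::to_string(value); }

struct Point { int x, y; };
inline void format_value(std::string& out, const Point& p) {
  out += "(" + std::to_string(p.x) + ", " + std::to_string(p.y) + ")";
}

// One erased argument: the object as void* plus a function that knows its real type.
struct erased_arg {
  const void* object;
  void (*format)(std::string& out, const void* object);
};

template <typename T>
erased_arg make_erased(const T& value) {
  // The captureless lambda converts to a plain function pointer. It casts the
  // void* back to T and calls the matching format_value overload.
  return {&value, [](std::string& out, const void* object) {
            format_value(out, *static_cast<const T*>(object));
          }};
}

// The store itself does not mention the argument types, only how many there are.
template <typename... Args>
std::array<erased_arg, sizeof...(Args)> make_erased_args(const Args&... args) {
  return {{make_erased(args)...}};
}
```

The store only holds pointers, so the arguments have to outlive it; in a `format` call that is the case because the store is consumed before the call returns.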

> It may be that the fmt::args and basic_args are actually the same.

They are the same for the standard context type, but users can create custom formatters and contexts. For example, one can implement a printf formatter and a printf_context re-using much of the argument passing machinery. In fact this is what the fmt library does - it implements both Python-like and printf formatters.

> There is also arg_store and basic_arg and the visit() function which seem more like internal details of the implementation.

arg_store needs to be exposed via the public API because this is the type that stores the arguments, but it's opaque and won't be used directly most of the time. Users who implement formatting functions can just use make_args to construct it, e.g.:

template <class ...Args>
string format_error(cstring_view format_str, const Args&... args) {
  return "error: " + vformat(format_str, make_args(args...));
}

> Even as an implementation it seems overly complex, and I for one can't understand how visit can call separate user defined format_value functions if arg_store is not templated on the argument types.

As I mentioned above, the argument types are erased (cast to void*) and pointers to little helper functions that know how to recover the types are passed around (https://github.com/fmtlib/fmt/blob/std/fmt/format.h#L1267). The implementation is somewhat elaborate to achieve good performance and extensibility, but I tried to simplify it as much as I could.

> It would be nice if it was easy to create a formatting object from a format specification string.

That's an interesting idea, let me think about it and get back to you.

Cheers,
Victor


Klaim - Joël Lamotte

Jun 8, 2017, 12:04:51 PM
to std-pr...@isocpp.org

On 8 June 2017 at 15:45, Victor Zverovich <victor.z...@gmail.com> wrote:
> Thanks for such detailed feedback.
>
> > I would like the part before the : inside {} to allow any text.
>
> The fmt library, on which this proposal is based, supports named arguments:
>
>   format("The answer is {answer:}", arg("answer", 42));
>
> or, with user defined literals,
>
>   format("The answer is {answer:}", "answer"_a=42);
>
> but I decided to omit this feature from the first draft of the proposal to keep it manageable. It should be possible to add named arguments as an extension.

Consider mentioning this in the paper in a section about potential extensions.

Joël Lamotte

Victor Zverovich

Jun 24, 2017, 2:52:50 PM
to std-pr...@isocpp.org
> It would be nice if it was easy to create a formatting object from a format specification string.

I think this is a great idea and I finally got a chance to experiment with it. 

The main drawback of having a separate formatting object is that it makes writing a formatter for a user-defined type more complicated, because you need to define 3 things: a formatter class, a parsing function and a formatting function. In the current proposal everything is combined in format_value which, on one hand, leads to less boilerplate but, on the other hand, is more restrictive.

template <>
class formatter<MyClass> {
 public:
  explicit formatter(context &ctx) {
    // Parse the format string.
  }

  void format(buffer &buf, MyClass &value) {
    // Format value.
  }

 private:
  // Formatting state.
};

One way to reduce boilerplate is by returning the formatter object as a lambda from the parsing function, e.g.

template <>
auto parse_format<MyClass>(context &ctx) {
  // Parse the format string.
  return [/* Formatting state */](buffer &buf, MyClass &value) {
    // Format value.
  };
}

Unfortunately this doesn't work if parse_format has to take additional template arguments such as a char type:

// Doesn't compile
template <typename Char>
auto parse_format<MyClass>(basic_context<Char> &ctx) {
  // Parse the format string.
  return [/* Formatting state */](basic_buffer<Char> &buf, MyClass &value) {
    // Format value.
  };
}

A possible workaround is to use enable_if:

template <typename T, typename U>
using enable_format = typename std::enable_if<std::is_same<T, U>::value>::type;

template <typename T, typename Char, typename = enable_format<T, MyClass>>
auto parse_format(basic_context<Char> &ctx) {
  // Parse the format string.
  return [/* Formatting state */](basic_buffer<Char> &buf, MyClass &value) {
    // Format value.
  };
}
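
To check that the enable_if form really does deduce Char while T is given explicitly, here is a self-contained version that compiles as C++14. The `basic_context` and `basic_buffer` definitions are hypothetical stand-ins (a context that just holds the format spec, a buffer that is a string), `MyClass` is given an unsigned member, and the "x" spec for hexadecimal output is an assumption made for the example:

```cpp
#include <string>
#include <type_traits>

// Hypothetical stand-ins for the library types used in the snippets above.
template <typename Char> struct basic_context { std::basic_string<Char> spec; };
template <typename Char> using basic_buffer = std::basic_string<Char>;

struct MyClass { unsigned value; };

template <typename T, typename U>
using enable_format = typename std::enable_if<std::is_same<T, U>::value>::type;

// T is named explicitly at the call site, Char is deduced from the context.
template <typename T, typename Char, typename = enable_format<T, MyClass>>
auto parse_format(basic_context<Char>& ctx) {
  // "Parse" the spec once: a leading 'x' selects hexadecimal output.
  const bool hex = !ctx.spec.empty() && ctx.spec[0] == Char('x');
  // The returned lambda is the formatter; the parsed state lives in its capture.
  return [hex](basic_buffer<Char>& buf, const MyClass& value) {
    const unsigned base = hex ? 16u : 10u;
    unsigned n = value.value;
    std::basic_string<Char> digits;
    do {
      digits.insert(digits.begin(), Char("0123456789abcdef"[n % base]));
      n /= base;
    } while (n != 0);
    buf += digits;
  };
}
```

With `basic_context<char> ctx{"x"}`, the call `parse_format<MyClass>(ctx)` yields a callable that can be reused for any number of `MyClass` values without looking at the spec again.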

This workaround does the job (implemented in experimental branch: https://github.com/fmtlib/fmt/tree/ext) but I am not super happy about it. Any ideas on how to improve this are appreciated.

Best regards,
Victor
