[llvm-dev] RFC: General purpose type-safe formatting library

Zachary Turner via llvm-dev

Oct 11, 2016, 9:22:22 PM
to llvm-dev
A while back llvm::format() was introduced that made it possible to combine printf-style formatting with llvm streams.  However, this still comes with all the risks and pitfalls of printf.  Everyone is no-doubt familiar with these problems, but here are just a few anyway:

1. Not type-safe.  Not all compilers warn when you mess up the format specifier.  And when you're writing your own printf-like functions, you need to tag them with __attribute__((format(printf, m, n))), which again not all compilers have.  If you change a const char * to a StringRef, it can silently succeed while passing your StringRef object to printf.  It should fail to compile!

2. Not security safe.  Functions like sprintf() will happily smash your stack for you if you're not careful.  

3. Not portable (well kinda).  Quick, how do you print a size_t?  You probably said %z.  Well MSVC didn't even support %z until 2015, which we aren't even officially requiring yet.  So you've gotta write (uint64_t)x and then use PRIx64.  Ugh.

4. Redundant.  If you're giving it an integer, why do you need to specify %d?  It's an integer!  We should be able to use the type system to our advantage.

5. Not flexible.  How do you print a std::chrono::time_point with llvm::format()?  You can't.  You have to resort to providing an overloaded streaming operator or formatting it some other way.
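
To make the portability complaint in item 3 concrete, here is a minimal sketch of the cast-plus-macro workaround. The helper name `hex_size` is invented for illustration; only `snprintf` and `PRIx64` are standard.

```cpp
#include <cinttypes>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical helper: print a size_t as hex without relying on %zx.
std::string hex_size(std::size_t value) {
  char buffer[32];
  // The cast and the PRIx64 macro have to agree with each other; the
  // compiler does not derive the specifier from the argument type.
  std::snprintf(buffer, sizeof(buffer), "0x%" PRIx64,
                static_cast<std::uint64_t>(value));
  return std::string(buffer);
}
```

If the cast is dropped or the macro is swapped for a narrower one, the call still compiles on many toolchains, which is the failure mode items 1 and 3 describe.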

So I've been working on a library that will solve all of these problems and more.


The high level design of my library is borrowed heavily from C#.  But if you're not familiar with C#, I believe boost has something similar in spirit.  The best way to show it off is with some examples:

1. os << format_string("Test");   // writes "Test"
2. os << format_string("{0}", 7);  // writes "7"

Immediately we can see one big difference between this and llvm::format() / printf.  You don't have to specify the type.  If you pass in an int, it formats it as an int.

3. os << format_string("{0} {0}", 7); // writes "7 7"

#3 is an example of something that cannot be done elegantly with printf.  Sure, you can pass it in twice, but if it's expensive to compute, this means you have to save it into a temporary.
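
A rough sketch of how index-based substitution can reuse an argument without the caller saving a temporary. This is not the proposed implementation: `format_sketch` and `render` are made-up names, it returns a `std::string` rather than a streamable object, and it handles only plain `{N}` fields.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Render one argument through operator<<. A type with no operator<< is a
// compile error here, which is the type-safety property being discussed.
template <typename T>
std::string render(const T &value) {
  std::ostringstream out;
  out << value;
  return out.str();
}

// Hypothetical stand-in for format_string(); it understands only "{N}".
template <typename... Args>
std::string format_sketch(const std::string &fmt, const Args &...args) {
  // Each argument is rendered exactly once, however often it is referenced.
  const std::vector<std::string> rendered{render(args)...};
  std::string result;
  std::size_t pos = 0;
  while (pos < fmt.size()) {
    const std::size_t open = fmt.find('{', pos);
    const std::size_t close =
        open == std::string::npos ? open : fmt.find('}', open);
    if (close == std::string::npos) {
      result.append(fmt, pos, std::string::npos);  // no further fields
      break;
    }
    result.append(fmt, pos, open - pos);  // literal text before the field
    const std::string field = fmt.substr(open + 1, close - open - 1);
    const bool is_index =
        !field.empty() && field.size() <= 9 &&
        field.find_first_not_of("0123456789") == std::string::npos;
    if (is_index && std::stoul(field) < rendered.size())
      result += rendered[std::stoul(field)];
    else
      result.append(fmt, open, close - open + 1);  // leave unknown fields as-is
    pos = close + 1;
  }
  return result;
}
```

With this sketch, `format_sketch("{0} {0}", 7)` evaluates and renders the argument once and substitutes it twice.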

4. os << format_string("{0:X}", 255);  // writes "0xFF"
5. os << format_string("{0:X7}", 255);  // writes "0x000FF"

6. os << format_string("{0}", foo_object); // fails to compile!

Here is another example of an improvement over traditional formatting mechanisms.  If you pass an object for which it cannot find a formatter, it fails to compile.

However, you can always define custom formatters for your own types.  If you write:

namespace llvm {
  template<>
  struct format_provider<Foo> {
    static void format(raw_ostream &S, const Foo &F, int Align, StringRef Options) {
    }
  };
}

Then #6 will magically compile, and invoke the function above to do the formatting.  There are other ways to customize the formatting behavior, but I'll keep going with some more examples:
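
For readers who want to see the shape of such a specialization with a body filled in, here is a standalone sketch. It swaps `raw_ostream` and `StringRef` for `std::ostream` and `std::string` so it compiles outside LLVM; `Foo`, its `id` field, the `"v"` option, and `format_one` are all invented for illustration.

```cpp
#include <ostream>
#include <sstream>
#include <string>

// The primary template is declared but never defined, so asking for a
// provider of an unsupported type fails at compile time.
template <typename T> struct format_provider;

struct Foo {
  int id;  // hypothetical user type
};

template <> struct format_provider<Foo> {
  static void format(std::ostream &S, const Foo &F, int Align,
                     const std::string &Options) {
    (void)Align;  // alignment is ignored in this sketch
    if (Options == "v")
      S << "Foo{id=" << F.id << "}";  // made-up "verbose" option
    else
      S << "Foo#" << F.id;
  }
};

// Hypothetical driver: looks up the provider for T and calls it.
template <typename T>
std::string format_one(const T &value, const std::string &options = "") {
  std::ostringstream out;
  format_provider<T>::format(out, value, 0, options);
  return out.str();
}
```

Calling `format_one` on a type without a specialization is rejected by the compiler because `format_provider<T>` is incomplete.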

7. os << format_string("{0:N}", -1234567);  // Writes "-1,234,567".  Note the commas.
8. os << format_string("{0:P}", 0.76);  // Writes "76.00%"

You can also left justify and right justify.  For example:

9. os << format_string("{0,8:P}", 0.76);  // Writes "  76.00%"
10. os << format_string("{0,-8:P}", 0.76);  // Writes "76.00%  "
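
The alignment rule in #9 and #10 (positive width pads on the left, negative width pads on the right) can be sketched as a small helper. `align_field` is a hypothetical name, and padding with spaces is taken from the example outputs above.

```cpp
#include <cstddef>
#include <string>

// Pad `text` to |align| columns: right-justify for positive values,
// left-justify for negative ones. Text wider than the field is not cut.
std::string align_field(const std::string &text, int align) {
  const long long wide = align;  // widen first so negation cannot overflow
  const std::size_t width = static_cast<std::size_t>(wide < 0 ? -wide : wide);
  if (text.size() >= width)
    return text;
  const std::string pad(width - text.size(), ' ');
  return align < 0 ? text + pad : pad + text;
}
```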

And you can also format complicated types.  For example:

11. os << format_string("{0:DD/MM/YYYY hh:mm:ss}", std::chrono::system_clock::now());  // writes "10/11/2016 18:19:11"


I already have a working proof of concept that supports most of the fundamental data types and formatting options such as percents, exponents, comma grouping, fixed point, hex, etc.

To summarize, the advantages of this approach are:

1) Safe.  If it can't format your type, it won't even compile.  
2) Concise.  You can re-use parameters multiple times without re-specifying them.
3) Simple.  You don't have to remember whether to use %llu or PRIx64 or %z, because format specifiers don't exist!
4) Flexible.  You can format types in a multitude of different ways while still having the nice format-string style syntax.
5) Extensible.  If you don't like the behavior of a built-in formatter, you can override it with your own.  If you have your own type which you'd like to be able to format, you can add formatting support for it in multiple different ways.

I am hoping to have something ready for submitting later this week.  If this interests you, please help me out by reviewing my patch!  And if you think this would not be helpful for LLVM and I should not worry about this, let me know as well!

Thanks,
Zach

Mehdi Amini via llvm-dev

Oct 11, 2016, 11:59:17 PM
to Zachary Turner, llvm-dev
Hi,

On Oct 11, 2016, at 6:22 PM, Zachary Turner via llvm-dev <llvm...@lists.llvm.org> wrote:

A while back llvm::format() was introduced that made it possible to combine printf-style formatting with llvm streams.  However, this still comes with all the risks and pitfalls of printf.  Everyone is no-doubt familiar with these problems, but here are just a few anyway:

1. Not type-safe.  Not all compilers warn when you mess up the format specifier.  And when you're writing your own Printf-like functions, you need to tag them with __attribute__(format, printf) which again not all compilers have.

I’m not very sensitive to the “not all compilers have” argument; however, it is worth mentioning that the format may not be a string literal, which defeats the “sanitizer”.

  If you change a const char * to a StringRef, it can silently succeed while passing your StringRef object to printf.  It should fail to compile!

llvm::format now fails to compile as well :)

However this does not address other issues, like: `format(“%d”, float_var)` 


2. Not security safe.  Functions like sprintf() will happily smash your stack for you if you're not careful.  

3. Not portable (well kinda).  Quick, how do you print a size_t?  You probably said %z.  Well MSVC didn't even support %z until 2015, which we aren't even officially requiring yet.  So you've gotta write (uint64_t)x and then use PRIx64.  Ugh.

4. Redundant.  If you're giving it an integer, why do you need to specify %d?  It's an integer!  We should be able to use the type system to our advantage.

5. Not flexible.  How do you print a std::chrono::time_point with llvm::format()?  You can't.  You have to resort to providing an overloaded streaming operator or formatting it some other way.

It seems to me that there is no silver bullet for that: whether for llvm::format() or your new proposal, some sort of glue/helpers needs to be provided for each and every non-standard type.


So I've been working on a library that will solve all of these problems and more.

Great! I appreciate the effort, and talking about that with Duncan last week he was mentioning that we should do it :)



The high level design of my library is borrowed heavily from C#.  But if you're not familiar with C#, I believe boost has something similar in spirit.  The best way to show it off is with some examples:

1. os << format_string("Test");   // writes "Test"
2. os << format_string("{0}", 7);  // writes "7"

Immediately we can see one big difference between this and llvm::format() / printf.  You don't have to specify the type.  If you pass in an int, it formats it as an int.

3. os << format_string("{0} {0}", 7); // writes "7 7"

#3 is an example of something that cannot be done elegantly with printf.  Sure, you can pass it in twice, but if it's expensive to compute, this means you have to save it into a temporary.

What about: printf(“%0$ %0$”, 7); 


4. os << format_string("{0:X}", 255);  // writes "0xFF"
5. os << format_string("{0:X7}", 255);  // writes "0x000FF"
6. os << format_string("{0}", foo_object); // fails to compile!

Here is another example of an improvement over traditional formatting mechanisms.  If you pass an object for which it cannot find a formatter, it fails to compile.

However, you can always define custom formatters for your own types.  If you write:

namespace llvm {
  template<>
  struct format_provider<Foo> {
    static void format(raw_ostream &S, const Foo &F, int Align, StringRef Options) {
    }
  };
}

Then #6 will magically compile, and invoke the function above to do the formatting.  There are other ways to customize the formatting behavior, but I'll keep going with some more examples:

7. os << format_string("{0:N}", -1234567);  // Writes "-1,234,567".  Note the commas.

Why add commas? Because of the “:N”?
This seems like localization-dependent: how do you handle that?

What happens with the following?

os << format_string("{0:N}", -123.455);


8. os << format_string("{0:P}", 0.76);  // Writes "76.00%"

You can also left justify and right justify.  For example:

9. os << format_string("{0,8:P}", 0.76);  // Writes "  76.00%"
10. os << format_string("{0,-8:P}", 0.76);  // Writes "76.00%  "

And you can also format complicated types.  For example:

11. os << format_string("{0:DD/MM/YYYY hh:mm:ss}", std::chrono::system_clock::now());  // writes "10/11/2016 18:19:11"

11 looks pretty cool in terms of flexibility :)


I already have a working proof of concept that supports most of the fundamental data types and formatting options such as percents, exponents, comma grouping, fixed point, hex, etc.

To summarize, the advantages of this approach are:

1) Safe.  If it can't format your type, it won't even compile.  
2) Concise.  You can re-use parameters multiple times without re-specifying them.
3) Simple.  You don't have to remember whether to use %llu or PRIx64 or %z, because format specifiers don't exist!
4) Flexible.  You can format types in a multitude of different ways while still having the nice format-string style syntax.
5) Extensible.  If you don't like the behavior of a built-in formatter, you can override it with your own.  If you have your own type which you'd like to be able to format, you can add formatting support for it in multiple different ways.

I am hoping to have something ready for submitting later this week.  If this interests you, please help me out by reviewing my patch!  And if you think this would not be helpful for LLVM and I should not worry about this, let me know as well!

Feel free to add me as a reviewer!

— 
Mehdi

Zachary Turner via llvm-dev

Oct 12, 2016, 12:19:03 AM
to Mehdi Amini, llvm-dev
On Tue, Oct 11, 2016 at 8:59 PM Mehdi Amini <mehdi...@apple.com> wrote:
What about: printf(“%0$ %0$”, 7); 
Well, umm..  I didn't even know about that.  And I wonder how many others also don't.  How does it choose the type?  It seems there is no d in there. 


4. os << format_string("{0:X}", 255);  // writes "0xFF"
5. os << format_string("{0:X7}", 255);  // writes "0x000FF"
6. os << format_string("{0}", foo_object); // fails to compile!

Here is another example of an improvement over traditional formatting mechanisms.  If you pass an object for which it cannot find a formatter, it fails to compile.

However, you can always define custom formatters for your own types.  If you write:

namespace llvm {
  template<>
  struct format_provider<Foo> {
    static void format(raw_ostream &S, const Foo &F, int Align, StringRef Options) {
    }
  };
}

Then #6 will magically compile, and invoke the function above to do the formatting.  There are other ways to customize the formatting behavior, but I'll keep going with some more examples:

7. os << format_string("{0:N}", -1234567);  // Writes "-1,234,567".  Note the commas.

Why add commas? Because of the “:N”?
This seems like localization-dependent: how do you handle that?
Yes, it is localization dependent.  That being said, llvm has 0 existing support for localization.  We already print floating point numbers with decimals, messages in English, etc.  

The purpose of this example was to illustrate that each formatter can have its own custom set of options.  For the case of integral arithmetic types, those would be:

X : Uppercase hex
X- : Uppercase hex without the 0x prefix.
x : Lowercase hex
x- : Lowercase hex without the 0x prefix
N : comma grouped digits
E : scientific notation with uppercase E
e : scientific notation with lowercase e
P : percent
F : fixed point

But for floating point types, a different set of format specifiers would be valid (for example, it doesn't make sense to print a floating point number as hex).

If you wrote your own formatter (as described earlier in #6), the field following the : would be passed in as the `Options` parameter, and the implementation is free to use it however it wants.  The std::chrono formatter takes strings similar to those described in #11, for example.
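
As an illustration of a provider interpreting its own option string, here is a sketch covering only the four hex options from the list above (X, X-, x, x-). `format_hex` is a hypothetical name, the input is restricted to unsigned 64-bit values, and the precision suffix is not handled.

```cpp
#include <cstdint>
#include <string>

// Interpret "X", "X-", "x" or "x-": the letter picks the digit case and a
// trailing '-' drops the "0x" prefix.
std::string format_hex(std::uint64_t value, const std::string &options) {
  const bool upper = !options.empty() && options[0] == 'X';
  const bool prefix = options.size() < 2 || options[1] != '-';
  const char *digits = upper ? "0123456789ABCDEF" : "0123456789abcdef";
  std::string body;
  do {
    body.insert(body.begin(), digits[value & 0xF]);  // lowest nibble first
    value >>= 4;
  } while (value != 0);
  return (prefix ? "0x" : "") + body;
}
```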
 

What happens with the following?

os << format_string("{0:N}", -123.455);

You would get "-123.46" (default precision of floating point types is 2 decimal places).  If you had -1234.566 it would print "-1,234.57" (you could change the precision by specifying an integer after the N.  So {0:N3} would print "-1,234.566").  For integral types the "precision" is the number of digits, so if it's greater than the length of the number it would pad left with 0s.  For floating point types it's the number of decimal places, so it would pad right with 0s.
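
A sketch of the comma grouping with a decimal-place precision, assuming the default of 2 mentioned above. `format_grouped` is an invented helper built on `snprintf`, so rounding of borderline values follows the platform's `%f` behaviour, and very large values would be truncated by the fixed buffer.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Format with a fixed number of decimals, then group the integer part in
// threes. Assumes the default "C" locale, where the separator is '.'.
std::string format_grouped(double value, int precision = 2) {
  char buffer[64];
  std::snprintf(buffer, sizeof(buffer), "%.*f", precision, value);
  std::string text = buffer;
  std::size_t end = text.find('.');
  if (end == std::string::npos)
    end = text.size();
  const std::size_t start = (!text.empty() && text[0] == '-') ? 1 : 0;
  // Walk left from the decimal point, inserting a comma every three digits.
  for (std::size_t i = end; i > start + 3; i -= 3)
    text.insert(i - 3, ",");
  return text;
}
```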

Of course, all these details are open for debate, that's just my initial plan.

Zachary Turner via llvm-dev

Oct 12, 2016, 12:26:55 AM
to Mehdi Amini, llvm-dev
On Tue, Oct 11, 2016 at 8:59 PM Mehdi Amini <mehdi...@apple.com> wrote:

5. Not flexible.  How do you print a std::chrono::time_point with llvm::format()?  You can't.  You have to resort to providing an overloaded streaming operator or formatting it some other way.

It seems to me that there is no silver bullet for that: being for llvm::format() or your new proposal, there is some sort of glue/helpers that need to be provided for each and every non-standard type.

I only half agree with this.  For llvm::format() there is no glue or helpers that can fit into the existing model.  It's a wrapper around snprintf, so you get what snprintf gives you.   You can go *around* llvm::format() and overload an operator to print your std::chrono::time_point, but there's no way to integrate it into llvm::format.  So with my proposed library you could write:

os << format_string("Start: {0}, End: {1}, Elapsed: {2:ms}", start, end, end-start);

Or you could write:

os << "Start: " << format_time_point(start) << ", End: " << format_time_point(end) << ", Elapsed: " << std::chrono::duration_cast<std::chrono::milliseconds>(end-start).count();
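
To show what a duration provider behind `{2:ms}` might do with its option string, here is a standalone sketch. The name `format_duration`, the set of units, the unit suffix in the output, and the nanosecond fallback for unknown options are all assumptions; the thread does not specify them.

```cpp
#include <chrono>
#include <string>

// Pick the output unit from the option string at run time.
std::string format_duration(std::chrono::nanoseconds elapsed,
                            const std::string &options) {
  using namespace std::chrono;
  if (options == "s")
    return std::to_string(duration_cast<seconds>(elapsed).count()) + "s";
  if (options == "ms")
    return std::to_string(duration_cast<milliseconds>(elapsed).count()) + "ms";
  if (options == "us")
    return std::to_string(duration_cast<microseconds>(elapsed).count()) + "us";
  // Unknown option: this sketch falls back to the raw nanosecond count.
  return std::to_string(elapsed.count()) + "ns";
}
```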

Mehdi Amini via llvm-dev

Oct 12, 2016, 12:30:26 AM
to Zachary Turner, llvm-dev
Sorry, I meant printf(“%0$d %0$d”, 7); 
Not sure if it is the best example: hexadecimal is the default format for printing float literal in the IR I believe. But OK I see how it works!
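
For reference, a small sketch of the numbered-argument printf form under discussion. It is a POSIX extension rather than ISO C, and POSIX numbers the arguments from 1, so the spelling that works on POSIX C libraries is `%1$d`. `repeat_twice` is an invented wrapper.

```cpp
#include <cstdio>
#include <string>

// Print the same int twice using a POSIX numbered conversion.
// Not portable to C libraries that lack this extension.
std::string repeat_twice(int value) {
  char buffer[64];
  std::snprintf(buffer, sizeof(buffer), "%1$d %1$d", value);
  return std::string(buffer);
}
```

The type still has to be spelled in every conversion, so this addresses argument reuse but not the type-safety point.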

Zachary Turner via llvm-dev

Oct 12, 2016, 12:48:12 AM
to Mehdi Amini, llvm-dev
Ok, well another example would be if you pass a pointer.  The only valid options are various flavors of hex.  You wouldn't want to print a pointer in scientific notation, for example.  

Mehdi Amini via llvm-dev

Oct 12, 2016, 12:49:14 AM
to Zachary Turner, llvm-dev
On Oct 11, 2016, at 9:47 PM, Zachary Turner <ztu...@google.com> wrote:

Ok, well another example would be if you pass a pointer.  The only valid options are various flavors of hex.  You wouldn't want to print a pointer in scientific notation, for example.  

Sure, I got the point, this is great! (I should have made it more clear earlier).

— 
Mehdi

Sean Silva via llvm-dev

Oct 12, 2016, 2:15:57 AM
to Zachary Turner, llvm-dev
This is awesome. +1

Copying a time-tested design like C#'s (which Python also uses) seems like a really sound approach.

Do you have any particular plans w.r.t. converting existing uses of the other formatting constructs? At the very least we can hopefully get rid of format_hex/format_hex_no_prefix since I don't think there are too many uses of those functions.

Also, since the format string can already embed the surrounding literal strings, do you anticipate a use case where you would want to write `OS << format_string(...) << ...something else...`?
Would `print(OS, "....", ....)` make more sense?

-- Sean Silva

_______________________________________________
LLVM Developers mailing list
llvm...@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev


Chandler Carruth via llvm-dev

Oct 12, 2016, 2:29:48 AM
to Zachary Turner, llvm-dev
I'm generally favorable on the core idea of having a type-safe and friendly format-string-like formatting utility. Somewhat minor comments below:

On Tue, Oct 11, 2016 at 6:22 PM Zachary Turner via llvm-dev <llvm...@lists.llvm.org> wrote:
The high level design of my library is borrowed heavily from C#.

My only big hesitation here is that the substitution specifier seems heavily influenced by C#. I'd prefer to model this after a format string syntax folks are fairly familiar with. IMO, Python's is probably the best bet here and has had a lot of hammering on it over the years. So I'd suggest that the pattern syntax be mapped to be as similar to Python's as possible or at least built on top of it.

1. os << format_string("Test");   // writes "Test"
2. os << format_string("{0}", 7);  // writes "7"

The "<< format_string(..." is ... really verbose for me. It also makes me strongly feel like this produces a string rather than a streamable entity.

I'm not a huge fan of streaming, but if we want to go this route, I'd very much like to keep the syntax short and sweet. "format" is pretty great for that. If this is going to fully subsume its use cases, can we eventually get that to be the name?

(While I don't like streaming, I'm not trying to fight that battle here...)

Also, you should probably look at what is quickly becoming a popular C++ library in this space: https://github.com/fmtlib/fmt

David Chisnall via llvm-dev

Oct 12, 2016, 4:24:02 AM
to Chandler Carruth, llvm-dev
On 12 Oct 2016, at 07:29, Chandler Carruth via llvm-dev <llvm...@lists.llvm.org> wrote:
>
> I'm generally favorable on the core idea of having a type-safe and friendly format-string-like formatting utility

I’m also generally in favour, but I wonder what the key motivations are for designing our own rather than importing something like FastFormat, fmtlib, or one of the other tried-and-tested C++ typesafe I/O libraries. Has someone done an analysis of why these designs are a bad fit for LLVM, or are we just reinventing the wheel because we feel like it?

David

Chandler Carruth via llvm-dev

Oct 12, 2016, 4:34:30 AM
to David Chisnall, llvm-dev
On Wed, Oct 12, 2016 at 1:23 AM David Chisnall <David.C...@cl.cam.ac.uk> wrote:
On 12 Oct 2016, at 07:29, Chandler Carruth via llvm-dev <llvm...@lists.llvm.org> wrote:

>

> I'm generally favorable on the core idea of having a type-safe and friendly format-string-like formatting utility



I’m also generally in favour, but I wonder what the key motivations for designing our own, rather than importing something like FastFormat, fmtlib, or one of the other tried-and-tested C++ typesafe I/O libraries is.  Has someone done an analysis of why these designs are a bad fit for LLVM, or are we just reinventing the wheel because we feel like it?

(This keeps coming up in various contexts, so this is a somewhat longer and more in-depth post than I originally intended. If folks want to discuss this further, we should probably fork a new thread.)

Given the tendency of utilities like this to become used pervasively in the project, it would seem a fairly heavy weight dependency to grow.

I understand that LLVM's refusal to depend on and re-use existing open source code is frustrating, I'm actually rather frustrated as well at times by the NIH-like pattern. But I think there are good reasons for LLVM to eschew third party libraries in its core utilities, not the least of which are the inherent licensing complications.

LLVM faces a somewhat unique challenge when it comes to licensing compared to most other open source software: parts of LLVM are embedded into the binaries we build. This makes finding a "compatibly licensed" existing project .... unlikely. ;]

I don't want to spin off on a debate here about which license LLVM should use or not use or what all it needs to say. We have a separate thread about that. But one hope I have of *any* resolution to that thread is that perhaps more open source projects will use exactly the same license. If they do, we might finally be able to have more reuse of existing open source code.


Either way, rolling our own has some advantages: LLVM may be able to make simplifying tradeoffs other libraries cannot realistically make due to narrower use cases and needs.

Provided we're only talking about very low level utilities like this, the cost doesn't seem terribly high to rolling our own, so I'm generally comfortable doing it.

Doesn't mean we shouldn't look at all the existing ones and learn everything we can from them.

-Chandler

David Chisnall via llvm-dev

Oct 12, 2016, 4:53:16 AM
to Chandler Carruth, llvm-dev

> On 12 Oct 2016, at 09:34, Chandler Carruth <chan...@gmail.com> wrote:
>
> On Wed, Oct 12, 2016 at 1:23 AM David Chisnall <David.C...@cl.cam.ac.uk> wrote:
> On 12 Oct 2016, at 07:29, Chandler Carruth via llvm-dev <llvm...@lists.llvm.org> wrote:
>
> >
>
> > I'm generally favorable on the core idea of having a type-safe and friendly format-string-like formatting utility
>
>
>
> I’m also generally in favour, but I wonder what the key motivations for designing our own, rather than importing something like FastFormat, fmtlib, or one of the other tried-and-tested C++ typesafe I/O libraries is. Has someone done an analysis of why these designs are a bad fit for LLVM, or are we just reinventing the wheel because we feel like it?
>
> (this keeps coming up in various contexts, so a somewhat longer/in-depth post than I originally intended. If folks want to discuss this further should probably fork a new thread)
>
> Given the tendency of utilities like this to become used pervasively in the project, it would seem a fairly heavy weight dependency to grow.

A reimplementation is likely to be no less complex than any of the originals. Both fmtlib and FastFormat are under BSD / MIT-style licenses and are both small enough that it would be possible to embed copies of either in the LLVM tree if eliminating a dependency were desired.

Even if the implementation is not useable, adopting similar interfaces to an existing C++ solution is likely to be more friendly to C++ developers than designing something based on C# or Python.

> Either way, rolling our own has some advantages: LLVM may be able to make simplifying tradeoffs other libraries cannot realistically make due to narrower use cases and needs.

If that is the case, I would be totally in favour of rolling our own, but it seems that rolling our own was a decision made before investigating the alternatives.

> Provided we're only talking about very low level utilities like this, the cost doesn't seem terribly high to rolling our own, so I'm generally comfortable doing it.
>
> Doesn't mean we shouldn't look at all the existing ones and learn everything we can from them.

Completely agreed.

Chandler Carruth via llvm-dev

Oct 12, 2016, 5:15:11 AM
to David Chisnall, llvm-dev
On Wed, Oct 12, 2016 at 1:53 AM David Chisnall <David.C...@cl.cam.ac.uk> wrote:
> On 12 Oct 2016, at 09:34, Chandler Carruth <chan...@gmail.com> wrote:
> Given the tendency of utilities like this to become used pervasively in the project, it would seem a fairly heavy weight dependency to grow.

A reimplementation is likely to be no less complex than any of the originals.  Both fmtlib and FastFormat are under BSD / MIT-style licenses and are both small enough that it would be possible to embed copies of either in the LLVM tree if eliminating a dependency were desired.

Sorry, by heavyweight I meant more that everything in LLVM would end up using it, and so any potential license incompatibility would be a serious issue.

And "BSD / MIT-style licenses" specifically don't address a number of the issues raised in the licensing thread. I don't want to try to rehash it here, but if we as a community think those issues are worth addressing, that precludes depending on existing code carrying these licenses.

As a specific issue: if this code ends up transitively used in runtime libraries, we would have binary attribution problems. So adding a dependency on code under some other license is, IMO, problematic from a very basic pragmatic perspective. It would move us back into having a weird partition through the LLVM project of some code that could go into runtimes but other code that could not go into runtimes. I don't want to go back to that point.

David Chisnall via llvm-dev

Oct 12, 2016, 5:30:39 AM
to Chandler Carruth, llvm-dev
On 12 Oct 2016, at 10:14, Chandler Carruth <chan...@gmail.com> wrote:
>
> And "BSD / MIT-style licenses" specifically don't address a number of the issues raised in the licensing thread. I don't want to try to rehash it here, but if we as a community think those issues are worth addressing, that precludes depending on existing code carrying these licenses.
>
> As a specific issue: if this code ends up transitively used in runtime libraries, we would have binary attribution problems. So adding a dependency on code under some other license is, IMO, problematic from a very basic pragmatic perspective. It would move us back into having a weird partition through the LLVM project of some code that could go into runtimes but other code that could not go into runtimes. I don't want to go back to that point.

2-clause BSD and MIT licenses (the relevant ones here) do address this. They are as permissive as the most permissive license used in LLVM (and far more permissive than the proposed new license) and carry no binary attribution clauses.

Pavel Labath via llvm-dev

Oct 12, 2016, 5:34:21 AM
to Zachary Turner, llvm-dev
On 12 October 2016 at 05:26, Zachary Turner via llvm-dev <llvm...@lists.llvm.org> wrote:
> os << format_string("Start: {0}, End: {1}, Elapsed: {2:ms}", start, end,
> start-end);

What would happen if I accidentally type "ps" instead of "ms" (I am
assuming we will not support picoseconds here)?

Will this abort at runtime?

I would prefer if *all* arguments to the format were checkable at compile time:
I.e. something like:
os << "blah blah" << format<std::milli>(end-start) << "blah blah";

I understand this may clash a bit with the desire for a compact
representation, but maybe with some clever design we could achieve
both?
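
One way the compile-time-checked variant could look, as a sketch only: the unit is a template parameter (a `std::ratio` such as `std::milli`), so a name that does not exist cannot compile. `format_as` is a hypothetical helper returning just the count as text.

```cpp
#include <chrono>
#include <ratio>
#include <string>

// Convert any duration to the unit named by Period and print its count.
template <typename Period, typename Rep, typename FromPeriod>
std::string format_as(std::chrono::duration<Rep, FromPeriod> elapsed) {
  using Target = std::chrono::duration<long long, Period>;
  return std::to_string(std::chrono::duration_cast<Target>(elapsed).count());
}
```

Usage would be along the lines of `os << "blah blah" << format_as<std::milli>(end - start) << "blah blah";`, trading the single format string for a unit the compiler can check.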

pl

Chandler Carruth via llvm-dev

Oct 12, 2016, 5:40:59 AM
to David Chisnall, llvm-dev
On Wed, Oct 12, 2016 at 2:30 AM David Chisnall <David.C...@cl.cam.ac.uk> wrote:
2-clause BSD and MIT licenses (the relevant ones here) do address this.

The second clause here:

States:
"""
2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
"""

IANAL and all that, but I do not think this addresses the binary distribution issues as effectively as what is being proposed for the LLVM license.

Even if it does, it would become that much harder to understand and convince everyone that it sufficiently addresses it.

However, if everything goes under the single LLVM license being proposed, we get to deal with that exactly once rather than having to evaluate N different licenses.

Anyways, we're pretty far afield here. My main point was that reusing existing libraries in LLVM at this low level has a surprising additional cost beyond any technical cost of tracking dependencies due to the surprising nature of runtime libraries reusing parts of the LLVM project. It is still a cost that should be traded off carefully against the cost of re-implementing something. And none of it should cause us to not examine the alternatives and learn from and match their API ideas where reasonable.

They are as permissive as the most permissive license used in LLVM (and far more permissive than the proposed new license) and carry no binary attribution clauses.

While I don't quite agree with every aspect of your claim here, I don't want to debate on a mailing list which license is more or less permissive (ones currently in use, ones proposed, etc.). Not sure anything good comes of that.

Zachary Turner via llvm-dev

Oct 12, 2016, 9:11:55 AM
to Pavel Labath, llvm-dev
In my current implementation, it's up to the format provider. If you have an illegal format spec (e.g. {0;0}), it ignores it and prints the format spec as a literal. We could also add an assert here in theory.

If/when we move to c++14, a constexpr StringRef implementation would allow us to parse and validate the entire format string at compile time.
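
A small sketch of what compile-time validation could look like with C++14 constexpr. It operates on a plain `const char *` rather than a constexpr `StringRef`, and it checks only that braces pair up without nesting; `braces_balanced` is an invented name and this is far from a full format-string parser.

```cpp
// Requires C++14: loops and local variables inside a constexpr function.
constexpr bool braces_balanced(const char *fmt) {
  bool open = false;
  for (const char *p = fmt; *p != '\0'; ++p) {
    if (*p == '{') {
      if (open)
        return false;  // nested '{' such as "{0{"
      open = true;
    } else if (*p == '}') {
      if (!open)
        return false;  // '}' without a matching '{'
      open = false;
    }
  }
  return !open;  // an unterminated field is also rejected
}

// The checks run during compilation; a bad literal stops the build.
static_assert(braces_balanced("{0} {1:X}"), "well-formed format string");
static_assert(!braces_balanced("{0{"), "malformed format string is detected");
```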

Since conciseness is one of the main goals of a library such as this, I would hate to actively hamper this for more compile time checking. If you write an invalid format spec presumably your test will fail since it will ignore it

Zachary Turner via llvm-dev

Oct 12, 2016, 9:38:02 AM
to Pavel Labath, llvm-dev
Actually I should elaborate because this is a tad misleading.

If the *syntax* is illegal, it is ignored, for example if you write {0;} or {0{. In this case the entire thing is pasted into the output and no replacement happens.

If it can successfully parse into X, Y, and Z where X is the index, Y is the alignment, and Z is the option string, then what happens depends on which of X, Y, and Z are illegal.

If X is empty, the sequence is replaced with an empty string. Otherwise, if X is not a non-negative integer, the sequence is pasted into the output and everything else is ignored.

If Y is illegal, Y is ignored as if it wasn't specified.

Whether Z is illegal is up to the format provider, so each one decides how to react to an invalid string.

Note that at any point here we can assert. This would allow us to catch these in debug builds while silently doing the best we can in non-debug builds.
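For readers who want to see the X/Y/Z rules spelled out, here is a minimal sketch of how the text between the braces might be classified. It is not the implementation discussed in this thread: the names ParsedField, FieldKind and parseField are hypothetical, it uses std::string rather than StringRef, and it assumes any run of digits (including 0, as in the {0} examples) is a legal index.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical outcome of parsing the text between '{' and '}'.
enum class FieldKind { Literal, Empty, Replace };

struct ParsedField {
  FieldKind Kind = FieldKind::Literal;
  unsigned Index = 0;    // X
  bool HasAlign = false; // false when Y is missing or illegal
  int Align = 0;         // Y
  std::string Options;   // Z, interpreted later by the format provider
};

// True if S is a non-empty run of decimal digits.
static bool isDigits(const std::string &S) {
  if (S.empty())
    return false;
  for (char C : S)
    if (!std::isdigit(static_cast<unsigned char>(C)))
      return false;
  return true;
}

ParsedField parseField(const std::string &Spec) {
  ParsedField F;
  // Everything after the first ':' is the option string Z.
  std::string Head = Spec;
  std::size_t Colon = Spec.find(':');
  if (Colon != std::string::npos) {
    Head = Spec.substr(0, Colon);
    F.Options = Spec.substr(Colon + 1);
  }
  // Before the ':' comes X, optionally followed by ",Y".
  std::string IndexStr = Head, AlignStr;
  std::size_t Comma = Head.find(',');
  if (Comma != std::string::npos) {
    IndexStr = Head.substr(0, Comma);
    AlignStr = Head.substr(Comma + 1);
  }
  if (IndexStr.empty()) {
    F.Kind = FieldKind::Empty; // replaced with an empty string
    return F;
  }
  if (!isDigits(IndexStr))
    return F; // Kind stays Literal: paste the field into the output
  F.Kind = FieldKind::Replace;
  for (char C : IndexStr)
    F.Index = F.Index * 10 + static_cast<unsigned>(C - '0');
  // Y is an optional '-' followed by digits; anything else is ignored.
  bool Negative = !AlignStr.empty() && AlignStr[0] == '-';
  std::string Digits = Negative ? AlignStr.substr(1) : AlignStr;
  if (isDigits(Digits) && Digits.size() <= 9) {
    F.HasAlign = true;
    for (char C : Digits)
      F.Align = F.Align * 10 + (C - '0');
    if (Negative)
      F.Align = -F.Align;
  }
  return F;
}
```

With this sketch, parseField("1,-10:x") yields index 1, alignment -10 and options "x", while parseField("0;") is classified as Literal. A debug build could add an assert at each fallback path.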

Aaron Ballman via llvm-dev

Oct 12, 2016, 9:43:50 AM
to Chandler Carruth, llvm-dev
>> 1. os << format_string("Test"); // writes "test"
>> 2. os << format_string("{0}", 7); // writes "7"
>
>
> The "<< format_string(..." is ... really verbose for me. It also makes me
> strongly feel like this produces a string rather than a streamable entity.

I wonder if we could use UDLs instead?

os << "Test" << "{0}"_fs << 7;

~Aaron

>
> I'm not a huge fan of streaming, but if we want to go this route, I'd very
> much like to keep the syntax short and sweet. "format" is pretty great for
> that. If this is going to fully subsume its use cases, can we eventually get
> that to be the name?
>
> (While I don't like streaming, I'm not trying to fight that battle here...)
>
> Also, you should probably look at what is quickly becoming a popular C++
> library in this space: https://github.com/fmtlib/fmt
>

Zachary Turner via llvm-dev

Oct 12, 2016, 10:03:38 AM
to Aaron Ballman, Chandler Carruth, llvm-dev
I'm not sure that would work well. The implementation relies on being able to index into the parameter pack. How would you do that if each parameter is streamed in?

"{0} {1}"_fs(1, 2)

Could perhaps work, but it looks a little strange to me.

FWIW, I agree format_string is long. Ideally it would be called format, but that's taken.

Another option is os.format("{0}", 7), and have format_string("{0}", 7) return a std::string.

Zachary Turner via llvm-dev

Oct 12, 2016, 10:12:27 AM
to Aaron Ballman, Chandler Carruth, llvm-dev
Ahh, UDLs also wouldn't permit non-literal format strings, which is a deal breaker IMO.

Zachary Turner via llvm-dev

Oct 12, 2016, 11:54:03 AM
to Sean Silva, llvm-dev
On Tue, Oct 11, 2016 at 11:15 PM Sean Silva <chiso...@gmail.com> wrote:
This is awesome. +1

Copying a time-tested design like C#'s (and which also Python uses) seems like a really sound approach.

Do you have any particular plans w.r.t. converting existing uses of the other formatting constructs? At the very least we can hopefully get rid of format_hex/format_hex_no_prefix since I don't think there are too many uses of those functions.
I can certainly try, although when I did a quick grep I found 1,637 uses of llvm::format().  It's something we can work towards slowly, but I don't imagine I have the capacity to convert all of these by myself.  Getting rid of format_hex() could be a worthy first step though.
 

Also, since the format string already can embed the surrounding literal strings, do you anticipate the use case where you would want to use `OS << format_string(...) << ...something else...`?
Would `print(OS, "....", ....)` make more sense?
Perhaps.  I would argue that the whole reason we use << in the first place is *because* we don't have a real formatting function.  And when we do have one -- assuming it's designed correctly -- streaming operators become unnecessary / a thing of the past.  I can imagine a couple of different syntaxes.  

os.format(format_str, args...);   // format() is an instance method of raw_ostream.
T format_string<T>(format_str, args...);  // returns a T  (e.g. a std::string, or SmallString<N>)

T &formatf(T &t, format_str, args...);  // formats to the location specified by T, which could be a stream, std::string, SmallString, etc.  In practice this could be implemented by having the raw_ostream overload call os.format(format_str, args); and having the other versions create a raw_string_ostream or raw_svec_ostream and delegating to the stream version.
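To make the delegation idea concrete, here is a minimal sketch in which a string-returning entry point forwards to a stream-based one. Everything here is an assumption for illustration: formatTo and formatString are hypothetical names, std::ostream stands in for raw_ostream, only plain {N} fields are handled, and pasting out-of-range indices literally is a choice made for this sketch. Arguments are rendered to strings up front, which a real implementation would want to avoid.

```cpp
#include <cstddef>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Base case: no arguments left to render.
inline void collectArgs(std::vector<std::string> &) {}

// Render each argument with operator<< and remember the text.
template <typename T, typename... Ts>
void collectArgs(std::vector<std::string> &Out, const T &Arg,
                 const Ts &... Rest) {
  std::ostringstream OS;
  OS << Arg;
  Out.push_back(OS.str());
  collectArgs(Out, Rest...);
}

// Stream-based entry point: substitutes "{N}" with the N-th argument.
template <typename... Ts>
void formatTo(std::ostream &OS, const std::string &Fmt, const Ts &... Args) {
  std::vector<std::string> Rendered;
  collectArgs(Rendered, Args...);
  for (std::size_t I = 0; I < Fmt.size(); ++I) {
    std::size_t Close =
        Fmt[I] == '{' ? Fmt.find('}', I) : std::string::npos;
    if (Close == std::string::npos) {
      OS << Fmt[I]; // ordinary character, or '{' with no closing '}'
      continue;
    }
    std::string Spec = Fmt.substr(I + 1, Close - I - 1);
    bool IsIndex = !Spec.empty() && Spec.size() <= 9 &&
                   Spec.find_first_not_of("0123456789") == std::string::npos;
    if (IsIndex && std::stoul(Spec) < Rendered.size())
      OS << Rendered[std::stoul(Spec)];
    else
      OS << Fmt.substr(I, Close - I + 1); // paste unrecognized field as-is
    I = Close;
  }
}

// String-returning entry point: creates a stream and delegates.
template <typename... Ts>
std::string formatString(const std::string &Fmt, const Ts &... Args) {
  std::ostringstream OS;
  formatTo(OS, Fmt, Args...);
  return OS.str();
}
```

For example, formatString("{1} before {0}", 7, "x") produces "x before 7", and the same formatTo call can write directly into any std::ostream.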

Zachary Turner via llvm-dev

Oct 12, 2016, 12:04:24 PM
to Chandler Carruth, llvm-dev
On Tue, Oct 11, 2016 at 11:29 PM Chandler Carruth <chan...@google.com> wrote:
I'm generally favorable on the core idea of having a type-safe and friendly format-string-like formatting utility. Somewhat minor comments below:

On Tue, Oct 11, 2016 at 6:22 PM Zachary Turner via llvm-dev <llvm...@lists.llvm.org> wrote:
The high level design of my library is borrowed heavily from C#.

My only big hesitation here is that the substitution specifier seems heavily influenced by C#. I'd prefer to model this after a format string syntax folks are fairly familiar with. IMO, Python's is probably the best bet here and has had a lot of hammering on it over the years. So I'd suggest that the pattern syntax be mapped to be as similar to Python's as possible or at least built on top of it.
A lot of Python's substitution rules only make sense in the context of a language with reflection.  For example, you can write "{0.x}".format(obj) in Python, which means to print obj.x.  If you take all of that out of the equation, Python's and C#'s formatting syntaxes are honestly very similar.  They both use curly brace delimiters, they both index by number, and they both use a : separator.

The biggest difference is that Python smashes ALL of the formatting info into a single field (i.e. everything after the colon), whereas C# separates this into two fields as follows:

{index[,align][:options]}

I prefer this approach because it draws a firm line between the type-specific formatting (e.g. the options field) and universal formatting (e.g. the alignment field).  I did find some potentially useful tidbits in Python's specification that C# does not support, though.  For example, the ability to center a field, and the ability to specify the padding character rather than always using spaces.  We could possibly integrate some of those ideas.
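As an illustration of why alignment can be treated as universal, here is a minimal sketch of a padding step that works on already-formatted text and knows nothing about the argument's type. The names AlignKind and applyAlign are hypothetical, and the centering and custom pad character are the Python-inspired extras mentioned above, not features of C#.

```cpp
#include <cstddef>
#include <string>

// Hypothetical alignment styles; Center is the Python-inspired addition.
enum class AlignKind { Left, Right, Center };

// Pads Text to Width using Pad. Text longer than Width is returned as-is.
std::string applyAlign(const std::string &Text, std::size_t Width,
                       AlignKind Where, char Pad = ' ') {
  if (Text.size() >= Width)
    return Text;
  std::size_t Total = Width - Text.size();
  switch (Where) {
  case AlignKind::Left:
    return Text + std::string(Total, Pad);
  case AlignKind::Right:
    return std::string(Total, Pad) + Text;
  case AlignKind::Center: {
    // Any odd leftover padding goes on the right.
    std::size_t LeftPad = Total / 2;
    return std::string(LeftPad, Pad) + Text + std::string(Total - LeftPad, Pad);
  }
  }
  return Text;
}
```

A formatter built this way would first ask the type-specific provider for the text (using the options field) and only then call something like applyAlign, so providers never need to think about padding.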
 

1. os << format_string("Test");   // writes "test"
2. os << format_string("{0}", 7);  // writes "7"

The "<< format_string(..." is ... really verbose for me. It also makes me strongly feel like this produces a string rather than a streamable entity.

I'm not a huge fan of streaming, but if we want to go this route, I'd very much like to keep the syntax short and sweet. "format" is pretty great for that. If this is going to fully subsume its use cases, can we eventually get that to be the name?

(While I don't like streaming, I'm not trying to fight that battle here...)
Just for the record, I'm not a fan either.  See my response to Sean Silva for some alternatives.
 

James Y Knight via llvm-dev

Oct 12, 2016, 1:13:52 PM
to Zachary Turner, llvm-dev
On Tue, Oct 11, 2016 at 9:22 PM, Zachary Turner via llvm-dev <llvm...@lists.llvm.org> wrote:
A while back llvm::format() was introduced that made it possible to combine printf-style formatting with llvm streams.  However, this still comes with all the risks and pitfalls of printf.  Everyone is no-doubt familiar with these problems, but here are just a few anyway:

1. Not type-safe.  Not all compilers warn when you mess up the format specifier.  And when you're writing your own Printf-like functions, you need to tag them with __attribute__(format, printf) which again not all compilers have.  If you change a const char * to a StringRef, it can silently succeed while passing your StringRef object to printf.  It should fail to compile!

2. Not security safe.  Functions like sprintf() will happily smash your stack for you if you're not careful.  

3. Not portable (well kinda).  Quick, how do you print a size_t?  You probably said %z.  Well MSVC didn't even support %z until 2015, which we aren't even officially requiring yet.  So you've gotta write (uint64_t)x and then use PRIx64.  Ugh.

4. Redundant.  If you're giving it an integer, why do you need to specify %d?  It's an integer!  We should be able to use the type system to our advantage.

5. Not flexible.  How do you print a std::chrono::time_point with llvm::format()?  You can't.  You have to resort to providing an overloaded streaming operator or formatting it some other way.

So I've been working on a library that will solve all of these problems and more.


I wonder what use cases you envision for this? Why does LLVM need a super extensible flexible formatting library? I mean -- if you were developing this as a standalone project, that seems like maybe a nice feature. But I see no rationale as to why LLVM should include it.

That is to say: wouldn't a much-simpler printf replacement, implemented with variadic templates instead of C varargs (and which therefore doesn't require size/signedness prefixes on %d) be sufficient for LLVM?

You can do that as a drop-in improvement for llvm::format, replacing the call to snprintf inside the implementation with a new implementation that actually uses the type information.
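A minimal sketch of what such a variadic-template printf replacement might look like, under loud assumptions: safePrintf is a hypothetical name, std::ostream stands in for raw_ostream, and the conversion character after % is consumed but otherwise ignored (no width, precision or flags), with the argument's static type deciding how it is printed.

```cpp
#include <ostream>
#include <sstream>
#include <string>

// Base case: no arguments left. Copies the rest, collapsing "%%" to "%".
inline void safePrintf(std::ostream &OS, const char *Fmt) {
  for (; *Fmt; ++Fmt) {
    if (Fmt[0] == '%' && Fmt[1] == '%')
      ++Fmt;
    OS << *Fmt;
  }
}

// Recursive case: the first '%' conversion consumes Arg, whatever its type.
template <typename T, typename... Ts>
void safePrintf(std::ostream &OS, const char *Fmt, const T &Arg,
                const Ts &... Rest) {
  for (; *Fmt; ++Fmt) {
    if (*Fmt != '%') {
      OS << *Fmt;
      continue;
    }
    if (Fmt[1] == '%') { // literal percent sign
      OS << '%';
      ++Fmt;
      continue;
    }
    // The conversion character (d, s, ...) is skipped; the type system
    // picks the right operator<< for Arg.
    OS << Arg;
    safePrintf(OS, Fmt + (Fmt[1] != '\0' ? 2 : 1), Rest...);
    return;
  }
  // Format string exhausted: surplus arguments are silently dropped here.
}
```

With this sketch, a size_t can be printed with plain %d and a std::string with %s, without casts or PRIx64-style macros.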

Zachary Turner via llvm-dev

Oct 12, 2016, 1:28:57 PM
to James Y Knight, llvm-dev
On Wed, Oct 12, 2016 at 10:13 AM James Y Knight <jykn...@google.com> wrote:


I wonder what use cases you envision for this? Why does LLVM need a super extensible flexible formatting library? I mean -- if you were developing this as a standalone project, that seems like maybe a nice feature. But I see no rationale as to why LLVM should include it.
We were discussing this on IRC chat the other night, but I believe many people underestimate the need for string formatting.  Here are some data points:

1. There are currently 1,637 calls to llvm::format() across the codebase, and this doesn't include calls to format_hex(), format_decimal(), and the other variants.
2. LLVM consists of a large number (20+ at a minimum) of focused tools (llc, lli, llvm-dwarfdump, llvm-objdump, etc) whose sole purpose is to output formatted text.  Consider the use case of printing a verbose disassembly listing which is fed into FileCheck.
3. Even the "flagship" tools such as clang have need for string formatting when writing diagnostic messages.
4. LLDB in particular has this kind of thing *everywhere*.  I'm talking about anywhere from 3-50+ times per function (and that's not an exaggeration) for logging purposes.

That said, LLVM already includes a formatting library.  llvm::format().  So what would be the rationale *against* a better, safer, and easier version of the same thing?
 

That is to say: wouldn't a much-simpler printf replacement, implemented with variadic templates instead of C varargs (and which therefore doesn't require size/signedness prefixes on %d) be sufficient for LLVM?

You can do that as a drop-in improvement for llvm::format, replacing the call to snprintf inside the implementation with a new implementation that actually uses the type information.
How would you format user-defined types using this?  I gave an example earlier:  Consider you have a start time and an end time in std::chrono types, and you want to print the start, end, and duration.  The code to do this using llvm::format() or stream operators is horrible.

Mehdi Amini via llvm-dev

Oct 12, 2016, 2:25:02 PM
to Zachary Turner, llvm-dev
On Oct 12, 2016, at 7:12 AM, Zachary Turner via llvm-dev <llvm...@lists.llvm.org> wrote:

Ahh, UDLs also wouldn't permit non literal format strings, which is a deal breaker imo

Why?
Somehow the goal pursued by Pavel (which you didn’t object to per se) is to provide *compile* time checking.
This implies that you cannot decouple the construction of the format and the argument list.

— 
Mehdi

James Y Knight via llvm-dev

Oct 12, 2016, 2:38:57 PM
to Zachary Turner, llvm-dev
On Wed, Oct 12, 2016 at 1:28 PM, Zachary Turner <ztu...@google.com> wrote:

On Wed, Oct 12, 2016 at 10:13 AM James Y Knight <jykn...@google.com> wrote:


I wonder what use cases you envision for this? Why does LLVM need a super extensible flexible formatting library? I mean -- if you were developing this as a standalone project, that seems like maybe a nice feature. But I see no rationale as to why LLVM should include it.
We were discussing this on IRC chat the other night, but I believe many people underestimate the need for string formatting.  Here are some data points:

1. There are currently 1,637 calls to llvm::format() across the codebase, and this doesn't include calls to format_hex(), format_decimal(), and the other variants.
2. LLVM consists of a large number (20+ at a minimum) of focused tools (llc, lli, llvm-dwarfdump, llvm-objdump, etc) whose sole purpose is to output formatted text.  Consider the use case of printing a verbose disassembly listing which is fed into FileCheck.
3. Even the "flagship" tools such as clang have need for string formatting when writing diagnostic messages.
4. LLDB in particular has this kind of thing *everywhere*.  I'm talking about anywhere from 3-50+ times per function (and that's not an exaggeration) for logging purposes.

That said, LLVM already includes a formatting library.  llvm::format().  So what would be the rationale *against* a better, safer, and easier version of the same thing?

The arguments against for me are roughly:
1. It introduces a new formatting language that people need to learn.
2. People will still continue using printf-style formatting strings, too, because everyone **always** does, whenever anyone's ever introduced another formatting language anywhere.
3. The extensible formatting support is a) not obviously necessary, and b) will be more difficult to understand for readers, versus calling a function with normal function arguments.
 

That is to say: wouldn't a much-simpler printf replacement, implemented with variadic templates instead of C varargs (and which therefore doesn't require size/signedness prefixes on %d) be sufficient for LLVM?

You can do that as a drop-in improvement for llvm::format, replacing the call to snprintf inside the implementation with a new implementation that actually uses the type information.
How would you format user-defined types using this?  I gave an example earlier:  Consider you have a start time and an end time in std::chrono types, and you want to print the start, end, and duration.  The code to do this using llvm::format() or stream operators is horrible.

I'd call a function that returns a string, and print the string.
E.g.:
format("Started at %s, ended at %s", 
  format_date("%d/%m/%Y %T", start_time),
  format_date("%d/%m/%Y %T", end_time));

Zachary Turner via llvm-dev

Oct 12, 2016, 2:39:40 PM
to Mehdi Amini, llvm-dev
I don't object to compile time checking *as long as it doesn't severely detract from brevity*.  At the same time, I do object to *preventing* runtime format strings.

When we have C++14, we can make every member of StringRef constexpr, and at that point we will get compile time checking mostly "for free" without preventing runtime format strings.  For example, given a constexpr-aware implementation of StringRef, if you were to write: os.format("literal format", a, b, c) you would get all the compile time checking, such as ensuring that the number of arguments matches the highest index in the format string, and ensuring there are enough arguments for every placeholder.  But if you wrote os.format(s, a, b, c) you would still get runtime checking of the format strings.

As long as the runtime implementation doesn't exhibit UB when things don't match up, and it kindly asserts to warn you of the problem in the test suite, supporting runtime format strings can be very helpful.  For example, it could allow you to wrap a call to format in some other function, like:

template<typename... Ts>
void wrap_format(const char *Format, Ts &&... Args) {
   dbgs().format(Format, ConvertArg(Args)...);
}

Zachary Turner via llvm-dev

Oct 12, 2016, 2:48:30 PM
to James Y Knight, llvm-dev
On Wed, Oct 12, 2016 at 11:38 AM James Y Knight <jykn...@google.com> wrote:
On Wed, Oct 12, 2016 at 1:28 PM, Zachary Turner <ztu...@google.com> wrote:

On Wed, Oct 12, 2016 at 10:13 AM James Y Knight <jykn...@google.com> wrote:


I wonder what use cases you envision for this? Why does LLVM need a super extensible flexible formatting library? I mean -- if you were developing this as a standalone project, that seems like maybe a nice feature. But I see no rationale as to why LLVM should include it.
We were discussing this on IRC chat the other night, but I believe many people underestimate the need for string formatting.  Here are some data points:

1. There are currently 1,637 calls to llvm::format() across the codebase, and this doesn't include calls to format_hex(), format_decimal(), and the other variants.
2. LLVM consists of a large number (20+ at a minimum) of focused tools (llc, lli, llvm-dwarfdump, llvm-objdump, etc) whose sole purpose is to output formatted text.  Consider the use case of printing a verbose disassembly listing which is fed into FileCheck.
3. Even the "flagship" tools such as clang have need for string formatting when writing diagnostic messages.
4. LLDB in particular has this kind of thing *everywhere*.  I'm talking about anywhere from 3-50+ times per function (and that's not an exaggeration) for logging purposes.

That said, LLVM already includes a formatting library.  llvm::format().  So what would be the rationale *against* a better, safer, and easier version of the same thing?

The arguments against for me are roughly:
1. It introduces a new formatting language that people need to learn.
We learn new things every day.  Among the new things that people would need to learn, I would rank this among the least difficult we can think of.  The syntax is familiar to anyone who has ever used Python or C# (which is probably most people here).
 
2. People will still continue using printf-style formatting strings, too, because everyone **always** does, whenever anyone's ever introduced another formatting language anywhere.
Not if the end-state is that we remove llvm::format()
 
3. The extensible formatting support is a) not obviously necessary, and b) will be more difficult to understand for readers, versus calling a function with normal function arguments.
I disagree.  I would be surprised if anyone thinks

os.format("Start: {0}, End: {1}, Duration: {2:ms} milliseconds", start, end, end-start); 

is harder to understand than pretty much anything else you could possibly write.
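To show how an option string such as "ms" could reach a type-specific formatter, here is a minimal sketch of a provider trait specialized for std::chrono durations. The names FormatProvider and formatOne are hypothetical, the set of recognized options ("ms", "s") is an assumption made for this example, and std::ostream stands in for raw_ostream.

```cpp
#include <chrono>
#include <ostream>
#include <sstream>
#include <string>

// Default provider: ignores the options and streams the value.
template <typename T> struct FormatProvider {
  static void format(std::ostream &OS, const T &V, const std::string &Options) {
    (void)Options;
    OS << V;
  }
};

// Provider for durations: the option string selects the unit to print.
template <typename Rep, typename Period>
struct FormatProvider<std::chrono::duration<Rep, Period>> {
  static void format(std::ostream &OS,
                     const std::chrono::duration<Rep, Period> &D,
                     const std::string &Options) {
    using namespace std::chrono;
    if (Options == "ms")
      OS << duration_cast<milliseconds>(D).count();
    else if (Options == "s")
      OS << duration_cast<seconds>(D).count();
    else
      OS << D.count(); // unknown option: fall back to the raw tick count
  }
};

// Formats a single value the way a "{N:Options}" field might.
template <typename T>
std::string formatOne(const T &V, const std::string &Options) {
  std::ostringstream OS;
  FormatProvider<T>::format(OS, V, Options);
  return OS.str();
}
```

Here formatOne(std::chrono::seconds(2), "ms") yields "2000". What to do with an unrecognized option is the provider's decision; this sketch just prints the raw count.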
 
 

That is to say: wouldn't a much-simpler printf replacement, implemented with variadic templates instead of C varargs (and which therefore doesn't require size/signedness prefixes on %d) be sufficient for LLVM?

You can do that as a drop-in improvement for llvm::format, replacing the call to snprintf inside the implementation with a new implementation that actually uses the type information.
How would you format user-defined types using this?  I gave an example earlier:  Consider you have a start time and an end time in std::chrono types, and you want to print the start, end, and duration.  The code to do this using llvm::format() or stream operators is horrible.

I'd call a function that returns a string, and print the string.
E.g.:
format("Started at %s, ended at %s", 
  format_date("%d/%m/%Y %T", start_time),
  format_date("%d/%m/%Y %T", end_time));
We take care to make our stream based formatting as efficient as possible since it is used so pervasively throughout LLVM.  There are quite a few unnecessary copies in here, and more room for programmer error in doing the formatting.

Duncan P. N. Exon Smith via llvm-dev

Oct 12, 2016, 2:50:31 PM
to Zachary Turner, llvm-dev

> On 2016-Oct-11, at 18:22, Zachary Turner via llvm-dev <llvm...@lists.llvm.org> wrote:
>
> A while back llvm::format() was introduced that made it possible to combine printf-style formatting with llvm streams. However, this still comes with all the risks and pitfalls of printf. Everyone is no-doubt familiar with these problems, but here are just a few anyway:
>
> 1. Not type-safe. Not all compilers warn when you mess up the format specifier. And when you're writing your own Printf-like functions, you need to tag them with __attribute__(format, printf) which again not all compilers have. If you change a const char * to a StringRef, it can silently succeed while passing your StringRef object to printf. It should fail to compile!
>
> 2. Not security safe. Functions like sprintf() will happily smash your stack for you if you're not careful.
>
> 3. Not portable (well kinda). Quick, how do you print a size_t? You probably said %z. Well MSVC didn't even support %z until 2015, which we aren't even officially requiring yet. So you've gotta write (uint64_t)x and then use PRIx64. Ugh.
>
> 4. Redundant. If you're giving it an integer, why do you need to specify %d? It's an integer! We should be able to use the type system to our advantage.
>
> 5. Not flexible. How do you print a std::chrono::time_point with llvm::format()? You can't. You have to resort to providing an overloaded streaming operator or formatting it some other way.
>
> So I've been working on a library that will solve all of these problems and more.
>
>
> The high level design of my library is borrowed heavily from C#. But if you're not familiar with C#, I believe boost has something similar in spirit.

Boost.Format:
http://www.boost.org/doc/libs/1_62_0/libs/format/doc/format.html

I used it extensively in a past gig. IIRC, it's type safe, more convenient than usual operator<<, and faster than printf. I would love for something like this to be in tree... I don't really care which one as long as it's convenient enough that it's "obviously better".

(IOW, +1.)

> The best way to show it off is with some examples:
>

> 1. os << format_string("Test"); // writes "test"
> 2. os << format_string("{0}", 7); // writes "7"
>

Mehdi Amini via llvm-dev

Oct 12, 2016, 2:59:06 PM
to Zachary Turner, llvm-dev
On Oct 12, 2016, at 11:38 AM, Zachary Turner <ztu...@google.com> wrote:

I don't object to compile time checking *as long as it doesn't severely detract from brevity*. 


At the same time, I do object to *preventing* runtime format strings.

You haven’t answered: why?

— 
Mehdi

Zachary Turner via llvm-dev

Oct 12, 2016, 3:08:37 PM
to Mehdi Amini, llvm-dev
I thought I did.  :)  Passing format strings between functions is very useful.  For example, imagine wanting to write a function like printRange(const char *Fmt, std::vector<int> Items);

This isn't possible if your format string MUST be a string literal, yet it is very useful.

Equally importantly, I don't see a good reason to disallow runtime format strings.

Mehdi Amini via llvm-dev

Oct 12, 2016, 3:23:56 PM
to Zachary Turner, llvm-dev
On Oct 12, 2016, at 12:08 PM, Zachary Turner <ztu...@google.com> wrote:

I thought I did.  :)  Passing format strings between functions is very useful.  For example, imagine wanting to write a function like printRange(const char *Fmt, std::vector<int> Items);

I’m not sure I understand your example? 
Do you mean you want the range to be in the format? If so Why? I would rather write something like:

printRange("{per_elts_fmt}", /* separator */ ", ", begin, end);

This isn't possible if your format string MUST be a string literal

I haven’t seen a convincing example yet to support this. I may miss the obvious, but you haven’t shown it either.
One could find a way to *compose* formats in a compile-time-safe manner more efficiently.

Equally importantly, I don't see a good reason to disallow runtime format strings.

No compile time checking, bug hiding, not robust.
(i.e. you may not “crash”, but you may still not print what you want / expect in every case).

Zachary Turner via llvm-dev

Oct 12, 2016, 3:35:23 PM
to Mehdi Amini, llvm-dev
You get compile time checking automatically when we can use C++14 though. If you use it with a string literal, you'll get compile time checking, otherwise you won't.

Here's a different example though. Suppose you're writing a tool which prints formatted output, and the field width is specified by the user. Now you NEED to build the format string at runtime, there's no other way. Off the top of my head, LLDB does this already when printing disassembly and stack frames. The column widths are user settings.
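A minimal sketch of building such a format string at runtime, assuming the {index,align} field syntax discussed in this thread. The helper name makeColumnFormat is hypothetical, and the width would come from a user setting.

```cpp
#include <string>

// Builds a field such as "{0,12}" from a width only known at runtime.
std::string makeColumnFormat(unsigned Index, int Width) {
  return "{" + std::to_string(Index) + "," + std::to_string(Width) + "}";
}
```

Two columns could then be combined as makeColumnFormat(0, 12) + " " + makeColumnFormat(1, -8), giving "{0,12} {1,-8}", which is passed to the formatter as a non-literal format string.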

Mehdi Amini via llvm-dev

Oct 12, 2016, 3:40:51 PM
to Zachary Turner, llvm-dev
On Oct 12, 2016, at 12:35 PM, Zachary Turner <ztu...@google.com> wrote:

You get compile time checking automatically when we can use c++14 though. If you use it with a string literal, you'll get compile time checking, otherwise you won’t.

I understand that, but that doesn’t really address my concerns.


Here's a different example though. Suppose you're writing a tool which prints formatted output, and the field width is specified by the user.


Now you NEED to build the format string at runtime, there's no other way

Maybe the problem is using a string to format this in the first place.

For example, you could wrap the object you want to print with an adaptor in charge of padding to the right till you reach the column width.

format("{0}", rPad(col_width, my_object));
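A minimal sketch of what such an adaptor might look like, with std::ostream standing in for raw_ostream. The type name RPadAdaptor is hypothetical and rPad here is only one possible reading of the suggestion. Note that this version renders the value into a temporary buffer in order to measure its width.

```cpp
#include <cstddef>
#include <ostream>
#include <sstream>
#include <string>

// Holds a reference to the value plus the column width to pad to.
template <typename T> struct RPadAdaptor {
  std::size_t Width;
  const T &Value;
};

// Convenience factory so the type parameter is deduced.
template <typename T> RPadAdaptor<T> rPad(std::size_t Width, const T &Value) {
  return RPadAdaptor<T>{Width, Value};
}

// Prints the value, then spaces until Width columns have been written.
template <typename T>
std::ostream &operator<<(std::ostream &OS, const RPadAdaptor<T> &P) {
  std::ostringstream Tmp;
  Tmp << P.Value;
  std::string Text = Tmp.str();
  OS << Text;
  if (Text.size() < P.Width)
    OS << std::string(P.Width - Text.size(), ' ');
  return OS;
}
```

The adaptor keeps only a reference, so it should be streamed within the same full expression that creates it, as in OS << rPad(5, 42).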

Zachary Turner via llvm-dev

Oct 12, 2016, 3:47:14 PM
to Mehdi Amini, llvm-dev
That's less efficient, more verbose, involves extra copies, and doesn't allow you to take full advantage of the library's mechanism for formatting user-defined types using different presentation styles.  

Just to be clear, no other format libraries that exist today mandate string literal format strings.  And it would be an understatement to say that I would be strongly opposed to such a requirement.

I would be fine providing UDLs for the case where you have a string literal format string and encouraging people to use it wherever possible, but I don't consider providing *only* UDL-based formatting (or any mechanism that mandates string literals) a viable option.

Tim Shen via llvm-dev

Oct 12, 2016, 4:01:51 PM
to Zachary Turner, Mehdi Amini, llvm-dev
On Wed, Oct 12, 2016 at 12:35 PM Zachary Turner via llvm-dev <llvm...@lists.llvm.org> wrote:
You get compile time checking automatically when we can use c++14 though. If you use it with a string literal, you'll get compile time checking, otherwise you won't.

Even with C++14, os.format("literal format", a, b, c) cannot do the compile-time checking (I may be wrong in my understanding of C++14 constexpr). You probably need to add an overloaded version like `os.format(static_format("literal format"), a, b, c)`, or `os.format("literal format"_fmt, a, b, c)` to hold the compile-time checked version.

But anyway, the current interface os.format(const char*, ...) is forward-compatible.

Zachary Turner via llvm-dev

Oct 12, 2016, 4:07:39 PM
to Tim Shen, Mehdi Amini, llvm-dev
Couldn't you define a class FormatString like this:

class FormatString {
  template<int N>
  constexpr FormatString(const char (&S)[N]) {
    tokenize();
  }

  FormatString(const char *s) {}
};

Then define the format function as format(const FormatString &S, Ts &&...Args)

The implicit conversion from string literal would go to the constexpr constructor which could tokenize the string at compile time, while implicit conversion from non-literal would be tokenized at runtime.

If that doesn't work, then like you said, you could use a UDL to force the checking at compile time.
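As a rough illustration of the kind of validation a constexpr tokenizer could perform, here is a minimal C++14 sketch that computes how many arguments a format string needs. The name requiredArgs is hypothetical, only the index part of each field is inspected, and returning -1 for a malformed field is a convention chosen for this example. The check happens at compile time only when the call is forced into a constant expression, for instance by static_assert; an ordinary call runs the same code at runtime.

```cpp
#include <cstddef>

// Returns highest index + 1, or -1 if a field is malformed.
constexpr int requiredArgs(const char *Fmt) {
  int Needed = 0;
  for (std::size_t I = 0; Fmt[I] != '\0'; ++I) {
    if (Fmt[I] != '{')
      continue;
    ++I;
    if (Fmt[I] < '0' || Fmt[I] > '9')
      return -1; // the index must start with a digit
    int Index = 0;
    while (Fmt[I] >= '0' && Fmt[I] <= '9') {
      Index = Index * 10 + (Fmt[I] - '0');
      ++I;
    }
    // Skip any ",align" and ":options" text up to the closing brace.
    while (Fmt[I] != '\0' && Fmt[I] != '}')
      ++I;
    if (Fmt[I] == '\0')
      return -1; // missing '}'
    if (Index + 1 > Needed)
      Needed = Index + 1;
  }
  return Needed;
}

// Forced compile-time evaluation: a mismatch would be a build error.
static_assert(requiredArgs("{0} and {1}") == 2, "two arguments expected");
```

A formatting function could compare this result against sizeof...(Args) at runtime and assert on a mismatch, while literal format strings could additionally be checked through a static_assert or a UDL.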

Aaron Ballman via llvm-dev

Oct 12, 2016, 4:10:58 PM
to Mehdi Amini, llvm-dev
On Wed, Oct 12, 2016 at 3:23 PM, Mehdi Amini <mehdi...@apple.com> wrote:
>
> On Oct 12, 2016, at 12:08 PM, Zachary Turner <ztu...@google.com> wrote:
>
> I thought I did. :) Passing format strings between functions is very
> useful. For example, imagine wanting to write a function like
> printRange(const char *Fmt, std::vector<int> Items);
>
>
> I’m not sure I understand your example?
> Do you mean you want the range to be in the format? If so Why? I would
> rather write something like:
>
> printRange("{per_elts_fmt}", /* separator */ ", ", begin, end);
>
> This isn't possible if your format string MUST be a string literal
>
>
> I haven’t seen a convincing example yet to support this. I may miss the
> obvious, but you haven’t shown it either.

Internationalization is one common reason for a format string to
not be a string literal. I could see us wanting to translate our
diagnostic messages, for instance.

~Aaron

Tim Shen via llvm-dev

Oct 12, 2016, 4:22:58 PM
to Zachary Turner, Mehdi Amini, llvm-dev
On Wed, Oct 12, 2016 at 1:07 PM Zachary Turner <ztu...@google.com> wrote:
Couldn't you define a class FormatString like this:

class FormatString {
  template<int N>
  constexpr FormatString(const char (&S)[N]) {
    tokenize();
  }

  FormatString(const char *s) {}
};

Then define the format function as format(const FormatString &S, Ts &&...Args)

The implicit conversion from string literal would go to the constexpr constructor which could tokenize the string at compile time, while implicit conversion from non-literal would be tokenized at runtime.

Actually it might be doable even in C++11. For example:

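  #include <cstdlib>

  // Not constexpr: reaching this during constant evaluation is a compile-time error.
  template<typename Ret>
  Ret error_empty_string() { std::abort(); }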

  constexpr char get_first_char(const char* a) {
    return *a == '\0' ? error_empty_string<char>() : a[0];
  }

  int main() {
    // { constexpr char c = get_first_char(""); }  // compile-time error
    { constexpr char c = get_first_char("a"); }  // good
    { char c = get_first_char(""); }  // runtime error handling
    { char c = get_first_char("a"); }  // good
  }


I was thinking about using static_assert, which seems hard to integrate into only the compile-time version but not the runtime version.

Chris Bieneman via llvm-dev

Oct 12, 2016, 5:44:11 PM
to Zachary Turner, llvm-dev
+1

I really like this proposal. Throughout LLVM sub-projects there is a lot of string formatting that we do and it would be great to have a more modern, flexible, portable, and safe string formatting API.

-Chris

Joerg Sonnenberger via llvm-dev

Oct 12, 2016, 5:50:26 PM
to llvm...@lists.llvm.org
On Wed, Oct 12, 2016 at 11:50:25AM -0700, Duncan P. N. Exon Smith via llvm-dev wrote:
> Boost.Format:
> http://www.boost.org/doc/libs/1_62_0/libs/format/doc/format.html

It's quite heavy, e.g.:

https://github.com/fmtlib/fmt#compile-time-and-code-bloat

I've been using that library for a couple of projects in an older
version, I think the newer version would primarily be quite a bit less
verbose. It has a modern BSD license.

Joerg

Zachary Turner via llvm-dev

Oct 12, 2016, 11:07:46 PM
to Mehdi Amini, llvm-dev
On Wed, Oct 12, 2016 at 12:40 PM Mehdi Amini <mehdi...@apple.com> wrote:
On Oct 12, 2016, at 12:35 PM, Zachary Turner <ztu...@google.com> wrote:

You get compile time checking automatically when we can use c++14 though. If you use it with a string literal, you'll get compile time checking, otherwise you won’t.

I understand that, but that doesn’t really address my concerns.


Here's a different example though. Suppose you're writing a tool which prints formatted output, and the field width is specified by the user.


Now you NEED to build the format string at runtime, there's no other way

Maybe the problem is using a string to format this in the first place.

For example, you could wrap the object you want to print with an adaptor in charge of padding to the right till you reach the column width.

format("{0}", rPad(col_width, my_object));
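A minimal sketch of the adaptor idea, assuming plain `std::ostream` rather than `raw_ostream`. The names `RPadded` and `toString` are hypothetical, and `rPad` here is only an illustration of the shape such a helper could take, not code from the patch:

```cpp
#include <cstddef>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical adaptor: a runtime column width plus a reference to the value.
template <typename T>
struct RPadded {
  std::size_t Width;
  const T &Value;
};

// Helper so callers do not have to spell out the template argument.
template <typename T>
RPadded<T> rPad(std::size_t Width, const T &Value) {
  return RPadded<T>{Width, Value};
}

// Render the value first, then append spaces until the column is filled.
template <typename T>
std::ostream &operator<<(std::ostream &OS, const RPadded<T> &P) {
  std::ostringstream Tmp;
  Tmp << P.Value;
  const std::string Text = Tmp.str();
  OS << Text;
  if (Text.size() < P.Width)
    OS << std::string(P.Width - Text.size(), ' ');
  return OS;
}

// Demonstration helper: stream any value into a std::string.
template <typename T>
std::string toString(const T &V) {
  std::ostringstream OS;
  OS << V;
  return OS.str();
}
```

The width never passes through a format string, so nothing has to be built or parsed at runtime. The adaptor holds a reference, so it should only be used within the expression that creates it.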

FWIW I do think that literal format strings will handle 90% or more of uses.  I just don't see the benefit of needlessly banning the other cases.  Because all that's going to happen is someone is going to resort to using snprintf etc, which is exactly the problem I'm trying to solve.  It's literally no extra effort to support runtime format strings, and it makes the library more flexible as a result.

I'm willing to start with UDLs only because I think it will get us quite far, but as soon as I need to pass a format string through an intermediate function or something like that, I will probably check in the 3 extra lines of code to add a const char* overload format function.

FWIW, there's no easy way to add compile time checking of format strings until C++14, regardless of whether we use UDLs or not.  So that doesn't change either way.
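For illustration, here is roughly what a C++14 check could look like. This is a sketch under assumptions: `requiredArgs` is a hypothetical helper, it only understands fields that start with `{` followed by digits, and it is not the checking code from the patch:

```cpp
#include <cstddef>
#include <stdexcept>

// Hypothetical helper: returns how many arguments a format string needs,
// i.e. one more than the highest "{N...}" index it mentions (0 if none).
// In a constant expression, reaching a `throw` becomes a compile error;
// at runtime the same function throws std::logic_error instead.
constexpr std::size_t requiredArgs(const char *Fmt) {
  std::size_t Needed = 0;
  for (std::size_t I = 0; Fmt[I] != '\0'; ++I) {
    if (Fmt[I] != '{')
      continue;
    ++I;
    if (Fmt[I] < '0' || Fmt[I] > '9')
      throw std::logic_error("expected an index after '{'");
    std::size_t Index = 0;
    while (Fmt[I] >= '0' && Fmt[I] <= '9')
      Index = Index * 10 + static_cast<std::size_t>(Fmt[I++] - '0');
    // Skip any options up to the closing brace.
    while (Fmt[I] != '}') {
      if (Fmt[I] == '\0')
        throw std::logic_error("unterminated '{'");
      ++I;
    }
    if (Index + 1 > Needed)
      Needed = Index + 1;
  }
  return Needed;
}

// With a literal, the check runs entirely at compile time.
static_assert(requiredArgs("{0} {1} {0}") == 2, "two arguments expected");
static_assert(requiredArgs("no fields") == 0, "no arguments expected");
```

The loop in a `constexpr` function is what needs C++14; C++11 would force a recursive formulation. The same function still works on runtime strings, which matches the "literal gets checked, everything else doesn't" behaviour described above.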
 

Mehdi Amini via llvm-dev

Oct 12, 2016, 11:20:04 PM
to Zachary Turner, llvm-dev
On Oct 12, 2016, at 8:07 PM, Zachary Turner <ztu...@google.com> wrote:

On Wed, Oct 12, 2016 at 12:40 PM Mehdi Amini <mehdi...@apple.com> wrote:
On Oct 12, 2016, at 12:35 PM, Zachary Turner <ztu...@google.com> wrote:

You get compile time checking automatically when we can use c++14 though. If you use it with a string literal, you'll get compile time checking, otherwise you won’t.

I understand that, but that doesn’t really address my concerns.


Here's a different example though. Suppose you're writing a tool which prints formatted output, and the field width is specified by the user.


Now you NEED to build the format string at runtime, there's no other way

Maybe the problem is using a string to format this in the first place.

For example, you could wrap the object you want to print with an adaptor in charge of padding to the right till you reach the column width.

format("{0}", rPad(col_width, my_object));

FWIW I do think that literal format strings will handle 90% or more of uses.  I just don't see the benefit of needlessly banning the other cases.  Because all that's going to happen is someone is going to resort to using snprintf etc, which is exactly the problem I'm trying to solve.

Sorry but you’re totally missing the point. If there is a need for dynamism, this should be supported, that’s not the question. My point is that generating a string that will be parsed by a format function can’t be the only solution.

  It's literally no extra effort to support runtime format strings, and it makes the library more flexible as a result.

No: it does *not* make it more flexible than a non-string based solution that has the same functionality.

— 
Mehdi

Zachary Turner via llvm-dev

Oct 12, 2016, 11:34:02 PM
to Mehdi Amini, llvm-dev
AFAICT this appears to be the first time you've clarified that you're talking about a situation where the compile-time checking happens using something other than format strings.  In Pavel's original email, he suggested compile time checking and you mentioned that I didn't object to it.  But if you go back and read my response, I said we can do the compile time checking *of the format strings* using C++14.  So no I didn't object to it in principle, but I never strayed from the desire to use format strings.

To respond to your other point, no it doesn't make it more flexible than a non-string based solution.  But does anyone want a non string-based solution?  We already have one, it's called raw_ostream.  And STL has another one in iostreams.  sprintf and llvm::format are not more flexible than streaming operators either, and yet people still flock to them because it yields the nicest looking code.  James Knight pointed out earlier that "any time someone invents a new formatting library, everyone always ends up using printf anyway".  There's a reason for that, and it's because printf is string-based.  That's what people want.

So if we're talking about string-based versus non string-based, then yes, I'm married to the idea of a string based solution.  

That doesn't mean we can't *also* expose the underlying format functionality via an additional set of non format based functions.  But string-based formatting is necessary if there is to be any adoption at all.

Mehdi Amini via llvm-dev

Oct 13, 2016, 12:16:10 AM
to Zachary Turner, llvm-dev
On Oct 12, 2016, at 8:33 PM, Zachary Turner <ztu...@google.com> wrote:

AFAICT this appears to be the first time you've clarified that you're talking about a situation where the compile-time checking happens using something other than format strings.

I thought I was clear in the thread (in the history below) when I wrote "Maybe the problem is using a string to format this in the first place" followed by this example:  format("{0}", rPad(col_width, my_object));  where the padding is *not* in the format string.

Another example earlier in the thread was the range (let's say printing elements 10 to 20 of a vector<int>). Instead of a syntax like:

/* Format string is {eltid, separator, <range>} */
print("{0:,<10-20>}",  /* std::vector<int> */ v);

and having to actually generate the format string in the first place:

std::string format = format_string("\{0:,<{0}-{1}>\}", /* begin */ 10, /* end */ 20);

I'd rather have something like:

print("{0}", Range(", ", v.begin()+10, v.begin()+20));
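A sketch of such a range adaptor, assuming `std::ostream` and hypothetical names (`SepRange`, `makeSepRange`, `toString`); it only shows that the sub-range and the separator can travel as ordinary arguments:

```cpp
#include <ostream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical adaptor: a separator plus a half-open iterator range.
template <typename Iter>
struct SepRange {
  std::string Sep;
  Iter Begin;
  Iter End;
};

template <typename Iter>
SepRange<Iter> makeSepRange(std::string Sep, Iter Begin, Iter End) {
  return SepRange<Iter>{std::move(Sep), Begin, End};
}

// Stream the elements, writing the separator between neighbours only.
template <typename Iter>
std::ostream &operator<<(std::ostream &OS, const SepRange<Iter> &R) {
  for (Iter I = R.Begin; I != R.End; ++I) {
    if (I != R.Begin)
      OS << R.Sep;
    OS << *I;
  }
  return OS;
}

// Demonstration helper: stream any value into a std::string.
template <typename T>
std::string toString(const T &V) {
  std::ostringstream OS;
  OS << V;
  return OS.str();
}
```

With `std::vector<int> V = {0, 1, 2, 3, 4, 5};`, the expression `toString(makeSepRange(", ", V.begin() + 2, V.begin() + 5))` yields `2, 3, 4`.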


  In Pavel's original email, he suggested compile time checking and you mentioned that I didn't object to it.  But if you go back and read my response, I said we can do the compile time checking *of the format strings* using C++14.  So no I didn't object to it in principle, but I never strayed from the desire to use format strings.

To respond to your other point, no it doesn't make it more flexible than a non-string based solution.  But does anyone want a non string-based solution?  We already have one, it's called raw_ostream.  And STL has another one in iostreams.  sprintf and llvm::format are not more flexible than streaming operators either, and yet people still flock to them because it yields the nicest looking code.  James Knight pointed out earlier that "any time someone invents a new formatting library, everyone always ends up using printf anyway".  There's a reason for that, and it's because printf is string-based.  That's what people want.

So if we're talking about string-based versus non string-based, then yes, I'm married to the idea of a string based solution.  

That doesn't mean we can't *also* expose the underlying format functionality via an additional set of non format based functions.  But string-based formatting is necessary if there is to be any adoption at all.

I am not convinced that just because previous attempts didn't succeed, we're stuck forever with printf.

At this point I'm not sure it is worth continuing the discussion; we can just agree to disagree on the principle.

— 
Mehdi

Zachary Turner via llvm-dev

Oct 13, 2016, 12:32:35 AM
to Mehdi Amini, llvm-dev
I think my range example was misunderstood because that isn't really what I had in mind. Apologies if that's what led to us getting off track.

However, the syntax you proposed should work just fine. Since it is extensible, you need only give the Range class in your example a format method. The only thing I would change is that I would put the separator in the string instead of the object:

print("{0:,}", Range(s.begin(), s.begin()+20))

This way you have the freedom to display it multiple times with different presentation. E.g.

print("{0:,} {0: }", Range(s.begin(), s.begin()+20))

Zachary Turner via llvm-dev

Oct 13, 2016, 12:46:15 AM
to Mehdi Amini, llvm-dev
Perhaps I should elaborate on something I said in the original post.  

"There are other ways to customize the formatting behavior, but I'll keep going with some more "

What I meant here is that I currently have a two-stage lookup for a format provider.

1. If the type you passed in is a class, and it contains a format method with the appropriate signature, that method is invoked.
2. Otherwise, it looks for a specialization of llvm::format_provider<T> with a format method of the appropriate signature.
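The two-stage lookup can be sketched with standard SFINAE. This is a simplified illustration, assuming `std::ostream` and `std::string` in place of `raw_ostream` and `StringRef`, and a member signature without the alignment parameter. `has_format_member`, `formatImpl`, `formatOne` and `Point` are hypothetical names:

```cpp
#include <ostream>
#include <sstream>
#include <string>
#include <type_traits>
#include <utility>

// Stage 2 hook: specialize this for types that cannot be modified.
template <typename T>
struct format_provider; // primary template intentionally left undefined

// Detect a member `format(std::ostream &, const std::string &) const`.
template <typename T, typename = void>
struct has_format_member : std::false_type {};

template <typename T>
struct has_format_member<
    T, decltype(static_cast<void>(std::declval<const T &>().format(
           std::declval<std::ostream &>(),
           std::declval<const std::string &>())))> : std::true_type {};

// Stage 1: the type formats itself.
template <typename T>
void formatImpl(std::ostream &OS, const T &V, const std::string &Style,
                std::true_type) {
  V.format(OS, Style);
}

// Stage 2: fall back to the provider specialization.
template <typename T>
void formatImpl(std::ostream &OS, const T &V, const std::string &Style,
                std::false_type) {
  format_provider<T>::format(V, OS, Style);
}

template <typename T>
std::string formatOne(const T &V, const std::string &Style = "") {
  std::ostringstream OS;
  formatImpl(OS, V, Style, has_format_member<T>{});
  return OS.str();
}

// A class that opts in through a member function.
struct Point {
  int X, Y;
  void format(std::ostream &OS, const std::string &) const {
    OS << '(' << X << ", " << Y << ')';
  }
};

// A built-in type handled through a provider specialization.
template <>
struct format_provider<int> {
  static void format(const int &V, std::ostream &OS, const std::string &Style) {
    if (Style == "x")
      OS << std::hex;
    OS << V;
  }
};
```

A type with neither a member nor a specialization hits the undefined primary template, so the call fails to compile rather than printing garbage.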

So in your range example, it would be perfectly reasonable to write:

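template<typename Iter>
class Range {
  Iter Begin;
  Iter End;
public:
  Range(Iter Begin, Iter End) : Begin(Begin), End(End) {}
  void format(raw_ostream &S, int Align, StringRef Style) {
    // Format the elements of [Begin, End) into S here.
  }
};

// Deduce the iterator type from the container instead of naming it by hand.
template<typename T>
auto format_range(T &&t) -> Range<decltype(std::begin(t))> {
  return Range<decltype(std::begin(t))>(std::begin(t), std::end(t));
}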

and then write:

std::vector<int> X = {1, 2, 3};
os.format("{0: }", format_range(X));

And just to be clear, I think that syntax has clear advantages.  What I don't like is this:

os << "blah blah" << format<std::milli>(end-start) << "blah blah" 

I would much prefer to write that

os.format("blah blah {0:ms} blah blah", end-start);

Nicolai Hähnle via llvm-dev

Oct 14, 2016, 3:35:48 AM
to Mehdi Amini, Zachary Turner, llvm-dev
On 12.10.2016 05:59, Mehdi Amini via llvm-dev wrote:
>> If you change a const char * to a StringRef, it can silently succeed
>> while passing your StringRef object to printf. It should fail to compile!
>
> llvm::format now fails to compile as well :)
>
> However this does not address other issues, like: `format(“%d”, float_var)`

This may be a good time to point at https://reviews.llvm.org/D25018

But if someone ends up doing a full overhaul of the formatting that
makes that patch unnecessary, I'm happy too.

Cheers,
Nicolai

Zachary Turner via llvm-dev

Oct 31, 2016, 6:46:01 PM
to Nicolai Hähnle, Mehdi Amini, llvm-dev
Hi all,

Tentatively final version is up here: https://reviews.llvm.org/D25587

It has a verbal LGTM, but I plan to wait a bit longer just in case anyone has some additional thoughts.  It's a large patch, but if you're interested, one way you can help without doing a full-blown review is to look at the large comment blocks in FormatVariadic.h and FormatProviders.h.  Here I provide a formal description of the grammar of the replacement sequences and format syntax.  So you can look at this without looking at the code behind it and see if you have comments just on the format language.

Here's a summary of (most) everything contained in this patch:

1) UDL Syntax for outputting to a stream or converting to a string.
    outs() << "{0}"_fmt.stream(1);
    std::string S = "{0}"_fmt.string(1);

2) Built-in format providers for various common types:
     outs() << "{0}"_fmt.stream("test");             // "test"
     outs() << "{0}"_fmt.stream(StringRef("test"));  // "test"
     outs() << "{0}"_fmt.stream(true);               // "true"
     outs() << "{0}"_fmt.stream((void*)nullptr);     // "0x00000000"

3) Customizable formatting of ranges with optionally customizable separator
    std::vector<int> X = {1, 2, 3, 4};
    outs() << "{0}"_fmt.stream(make_range(X.begin(), X.end()));             // "1, 2, 3, 4"
    outs() << "{0:@[  -  ]}"_fmt.stream(make_range(X.begin(), X.end()));    // "1  -  2  -  3  -  4"

4) Left, center, and right-alignment:
    outs() << "{0:-3}"_fmt.stream(3);    // "3  "
    outs() << "{0:=3}"_fmt.stream(3);    // " 3 "
    outs() << "{0:3}"_fmt.stream(3);     // "  3"

5) Type-specific style options:
    outs() << "{0,N}"_fmt.stream(123456);      // "123,456"
    outs() << "{0,P}"_fmt.stream(0.25);        // "25.00%"
    outs() << "{0,X}"_fmt.stream(0xDEADBEEF);  // "0xDEADBEEF"
    outs() << "{0,X-}"_fmt.stream(0xDEADBEEF); // "DEADBEEF"
    outs() << "{0,x-}"_fmt.stream(0xDEADBEEF); // "deadbeef"
And many others

6) Adapters for specifying alignment, padding, and repetition with runtime values so you don't have to dynamically manipulate a format string.
    outs() << "{0}"_fmt.stream(fmt_pad(7, 3, 5));    // "   7     "
    outs() << "{0}{1}{2}"_fmt.stream(fmt_repeat("/\\", 3), 7, fmt_repeat("\\/", 3));   // "/\/\/\7\/\/\/"

7) Compilation failures if the type cannot be formatted.
    struct Foo {};
    outs() << "{0}"_fmt.stream(Foo{});   // compilation failure.

8) Extensible format provider mechanism to allow formatting of your own types.
    struct AddressRange { uint64_t Begin; uint64_t End; };
    template<> class format_provider<AddressRange> {
    public:
        static void format(const AddressRange &R, raw_ostream &S, StringRef Style) {
            S << "[{0:X} - {1:X}]"_fmt.stream(R.Begin, R.End);
        }
    };
    AddressRange AR{0, 0xDEADBEEF};
    outs() << "{0}"_fmt.stream(AR);    // "[0x0 - 0xDEADBEEF]"
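To make the core mechanism concrete, here is a toy version of index-based substitution. It is a sketch only: `miniFormat` and `collectArgs` are hypothetical names, it relies on `operator<<` for every argument, and it supports bare `{N}` fields with no alignment or style options, unlike the patch:

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Base case: no arguments left to convert.
inline void collectArgs(std::vector<std::string> &) {}

// Convert each argument to text with operator<<; the static type of the
// argument selects the formatting, so no "%d"-style specifier is needed.
template <typename T, typename... Rest>
void collectArgs(std::vector<std::string> &Out, const T &First,
                 const Rest &...Others) {
  std::ostringstream OS;
  OS << First;
  Out.push_back(OS.str());
  collectArgs(Out, Others...);
}

template <typename... Ts>
std::string miniFormat(const std::string &Fmt, const Ts &...Args) {
  std::vector<std::string> Texts;
  collectArgs(Texts, Args...);

  std::string Result;
  for (std::size_t I = 0; I < Fmt.size(); ++I) {
    if (Fmt[I] != '{') {
      Result += Fmt[I];
      continue;
    }
    const std::size_t Close = Fmt.find('}', I);
    if (Close == std::string::npos) { // unterminated field: copy verbatim
      Result.append(Fmt, I, std::string::npos);
      break;
    }
    // Parse the decimal index between the braces.
    std::size_t Index = 0;
    bool Valid = Close > I + 1;
    for (std::size_t J = I + 1; J < Close; ++J) {
      if (Fmt[J] < '0' || Fmt[J] > '9') {
        Valid = false;
        break;
      }
      Index = Index * 10 + static_cast<std::size_t>(Fmt[J] - '0');
    }
    if (Valid && Index < Texts.size())
      Result += Texts[Index];
    else
      Result.append(Fmt, I, Close - I + 1); // leave unknown fields untouched
    I = Close;
  }
  return Result;
}
```

Because arguments are looked up by index, back-references such as `miniFormat("{0} {1} {0}", 1, "a")` need no extra syntax. A type without `operator<<` is rejected at compile time.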

I'm planning to submit this towards the end of the week unless someone has further suggestions or complaints.

Chandler Carruth via llvm-dev

Oct 31, 2016, 8:21:33 PM
to Zachary Turner, Nicolai Hähnle, Mehdi Amini, llvm-dev
On Mon, Oct 31, 2016 at 3:46 PM Zachary Turner via llvm-dev <llvm...@lists.llvm.org> wrote:
Hi all,

Tentatively final version is up here: https://reviews.llvm.org/D25587

It has a verbal LGTM, but I plan to wait a bit longer just in case anyone has some additional thoughts.  It's a large patch, but if you're interested, one way you can help without doing a full-blown review is to look at the large comment blocks in FormatVariadic.h and FormatProviders.h.  Here I provide a formal description of the grammar of the replacement sequences and format syntax.  So you can look at this without looking at the code behind it and see if you have comments just on the format language.

Here's a summary of (most) everything contained in this patch:

1) UDL Syntax for outputting to a stream or converting to a string.
    outs() << "{0}"_fmt.stream(1);
    std::string S = "{0}"_fmt.string(1);

I continue to have a strong objection to using UDLs for this (or anything else in LLVM).

I think this feature is poorly known by many programmers. I think it will produce error messages that are confusing and hard to debug. I think it will have a significant negative impact on compile time. I also think that it will exercise substantially less well tested parts of every host compiler for LLVM and subject us to an increased rate of mysterious host compiler bugs.

I also think it forces programmers to be aware of a "magical" construct that doesn't really fit with the rest of the language.

It isn't that any of these issues in isolation cannot be overcome, it is that I think the value provided by the UDL specifically is substantially smaller than the cost.

I would *very strongly* prefer that this is accomplished with "normal" C++ syntax, and that compile time checking is done with constexpr when available. I think that will give the overwhelming majority of the benefit with dramatically lower cost.

Zachary Turner via llvm-dev

Oct 31, 2016, 8:45:29 PM
to Chandler Carruth, Nicolai Hähnle, Mehdi Amini, llvm-dev
Ahh, I must have missed where you voiced that objection earlier. Do you prefer that UDL syntax be explicitly disallowed, or do you only prefer that normal C++ syntax be possible? It is currently possible; I just didn't demonstrate it in the previous message since almost all the feedback I had seen so far seemed to prefer UDL syntax due to the brevity and similarity to Python.

I recall you mentioned the verbosity of llvm::format as something you would like to see this improve, so I had assumed you would be happy with UDL syntax.

Compile time checking may not be possible without UDLs unless we wrap the format string in a macro, which may hurt readability even more. With a UDL we can get it via the GNU literal operator template though, and the check can be #ifdef'ed out on any compiler that doesn't support that extension.

In any case, both syntaxes are currently supported. Is that acceptable?

Zachary Turner via llvm-dev

Oct 31, 2016, 9:01:23 PM
to Chandler Carruth, Nicolai Hähnle, Mehdi Amini, llvm-dev
Another advantage of the UDL syntax is that GNU literal operator templates are supported by clang today, so we can get compile time checking immediately rather than having to wait for C++14 (and then suffer an inferior syntax to boot).

Sean Silva via llvm-dev

Nov 1, 2016, 2:41:18 AM
to Chandler Carruth, llvm-dev
+1, the UDL seems a bit too automagical.
`format_string("{0}", 1)` is not that much longer than `"{0}"_fmt.string(1)`, but significantly less magical.

Simple example: what should I type into a search engine to find the LLVM doxygen for the UDL? I know to search "llvm format_string" for the format string, but just from looking at a use of the UDL syntax it might not be clear that format_string is even called.

-- Sean Silva

Zachary Turner via llvm-dev

Nov 1, 2016, 8:39:21 AM
to Sean Silva, Chandler Carruth, llvm-dev
The big problem I see is that to get compile time checking without the UDL we're going to have to do something like FORMAT_STRING("{0}") where this is a macro. It just seems really gross. It is true that it is harder to find the documentation, but that could be alleviated by putting all of this in its own namespace like llvm::formatv; then one could search the namespace.

Zachary Turner via llvm-dev

Nov 2, 2016, 6:55:06 PM
to Sean Silva, Chandler Carruth, llvm-dev
* UDL Syntax is removed in the latest version of the patch.  
* Name changed to `formatv` since `format_string` is too much to type.
* Added conversion operators for `std::string` and `llvm::SmallString`.

I had some feedback offline (not on this thread, unfortunately) that it might be worth using a printf style syntax instead of this Python-esque syntax.  FTR, I actually somewhat object to this, for a couple of reasons:

1) It makes back-reference syntax ugly.   "{0} {1} {0}" is much clearer to me than "%0$ %1$ %0$".  The latter syntax is also not a very well known feature of printf and so unlikely to be used by people with a printf-style implementation, whereas it's un-missable with the python-style syntax.

2) I don't see why we should need to specify the type of the argument with %d if the compiler knows it's an integer.  Even if we can add compile-time checking to make it an error, it seems unnecessary to even encounter this situation in the first place.  I believe the compiler should simply format what you give it.

3) One of the most useful aspects of the current approach is the ability to plug in custom formatters for application specific data types.  This is not straightforward with a printf-style syntax.  

You might be able to hook up a template-specialization-like mechanism to the processing of %s (similar to my current approach), but it's not obvious how you proceed from there to get custom format strings for individual types.  For example, a formatter which can print a TimeSpan in different units depending on style options you pass in.  This is especially useful when trying to print ranges where you often want to be able to specify a different separator, or control the formatting of the underlying type.  (e.g. it's not clear how you would elegantly format a range of integers in hex using this style of approach).
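As a rough illustration of per-type style options, a provider for integers might interpret a style string like this. It is a sketch only: `formatInteger` is a hypothetical name, the `N`, `X`, `X-` and `x-` spellings are taken from the summary earlier in the thread, and treating a bare `x` as "lowercase with prefix" is an assumption:

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Interpret a small style vocabulary for unsigned integers:
//   "N"        decimal with thousands separators
//   "X" / "x"  hex with "0x" prefix, upper or lower case digits
//   "X-"/"x-"  the same without the prefix
// Anything else falls back to plain decimal.
std::string formatInteger(std::uint64_t Value, const std::string &Style) {
  if (Style == "N") {
    const std::string Digits = std::to_string(Value);
    std::string Out;
    for (std::size_t I = 0; I < Digits.size(); ++I) {
      // Insert a comma whenever a multiple of three digits remains.
      if (I != 0 && (Digits.size() - I) % 3 == 0)
        Out += ',';
      Out += Digits[I];
    }
    return Out;
  }
  if (!Style.empty() && (Style[0] == 'X' || Style[0] == 'x')) {
    const char *Alphabet =
        Style[0] == 'X' ? "0123456789ABCDEF" : "0123456789abcdef";
    std::string Hex;
    do {
      Hex.insert(Hex.begin(), Alphabet[Value % 16]);
      Value /= 16;
    } while (Value != 0);
    const bool NoPrefix = Style.size() > 1 && Style[1] == '-';
    return NoPrefix ? Hex : "0x" + Hex;
  }
  return std::to_string(Value);
}
```

The format string only carries the style text; what `X` means is decided by the provider for the argument's type, which is the extensibility point a fixed printf vocabulary lacks.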

I'm open to feedback here, so if you have an opinion one way or the other, please LMK.

Zachary Turner via llvm-dev

Nov 7, 2016, 12:05:18 PM
to Sean Silva, Chandler Carruth, llvm-dev
Wanted to ping again on this now that the LLVM Developer Conference is over and people are (presumably) back to normal.

chandlerc@: Are your concerns sufficiently addressed?
Anyone: Does anyone feel strongly that the proposed syntax is or is not the way to go, based on the arguments outlined previously?

Chandler Carruth via llvm-dev

Nov 7, 2016, 12:38:55 PM
to Zachary Turner, Sean Silva, Chandler Carruth, llvm-dev
My high-level design concerns are addressed. I've made some nit-picky code comments on the patch.

James Y Knight via llvm-dev

Nov 7, 2016, 4:57:54 PM
to Zachary Turner, llvm-dev
On Wed, Nov 2, 2016 at 3:54 PM, Zachary Turner via llvm-dev <llvm...@lists.llvm.org> wrote:
* UDL Syntax is removed in the latest version of the patch.  
* Name changed to `formatv` since `format_string` is too much to type.
* Added conversion operators for `std::string` and `llvm::SmallString`.

I had some feedback offline (not on this thread, unfortunately) that it might be worth using a printf style syntax instead of this Python-esque syntax.  FTR, I actually somewhat object to this, for a couple of reasons:

1) It makes back-reference syntax ugly.   "{0} {1} {0}" is much clearer to me than "%0$ %1$ %0$".  The latter syntax is also not a very well known feature of printf and so unlikely to be used by people with a printf-style implementation, whereas it's un-missable with the python-style syntax

Don't forget the trailing "s" (or whatever) -- "%0$s %1$s %0$s". (The possibility to forget the trailing letter is a downside of the printf syntax, certainly.)

2) I don't see why we should need to specify the type of the argument with %d if the compiler knows it's an integer.  Even if we can add compile-time checking to make it an error, it seems unnecessary to even encounter this situation in the first place.  I believe the compiler should simply format what you give it.

Well, it would seem fine to me if "%s" converted integers/floats/etc to a string for you as well. What %d really gives you is ability to specify the numeric formatting options instead of the string formatting options.
 
3) One of the most useful aspects of the current approach is the ability to plug in custom formatters for application specific data types.  This is not straightforward with a printf-style syntax.  

It is true, printf style formatting doesn't allow something like this:
  formatv("{0:DD/MM/YYYY hh:mm:ss}", std::chrono::system_clock::now());

On the other hand, what printf does have is a single description of what formatting strings are valid. With your proposed system, you need to know what type the argument is, in order to know what the D in "{0:D}" is going to mean. IMO, that the printf syntax is both well-known and NOT extensible seems an advantage to me, not a disadvantage.

You might be able to hook up a template-specialization-like mechanism to the processing of %s (similar to my current approach), but it's not obvious how you proceed from there to get custom format strings for individual types.  For example, a formatter which can print a TimeSpan in different units depending on style options you pass in.  This is especially useful when trying to print ranges where you often want to be able to specify a different separator, or control the formatting of the underlying type.  (e.g. it's not clear how you would elegantly format a range of integers in hex using this style of approach).

It is entirely unclear to me that putting everything in a format string:
  formatv("Here's a range: {0:$[ + ]@[x]}", range);
is better than composing functions with usual function-call syntax, e.g. something like this:
  format("Here's a range: %s", Join(range, " + ", Formatter("%x")));

I think that's the meat of the disagreement: I'd prefer to just see a safe printf-replacement; something that's able to basically drop-in replace C printf, nothing super fancy. I don't see the justification for being able to specify everything you'd ever want to be able to do directly in a complex format string language.

On the other hand, you see value in being able to specify the entirety of the output in the format string, and aren't concerned about the syntax being new and complicated.

That said -- it doesn't appear that my point of view is widely held, and that's fine -- this is a matter of opinion, not right or wrong. So, continue on. :)

Justin Bogner via llvm-dev

Nov 7, 2016, 5:54:21 PM
to James Y Knight via llvm-dev

FWIW, I'm also not entirely sold that we need a complex formatting
language here. The printf modifiers are easy to remember and are good
enough 90% of the time, whereas with something like this I feel like I'd
need to look up the syntax every time I used it.

Like James though, I'm fine with conceding to the majority on this one.

Zachary Turner via llvm-dev

Nov 7, 2016, 5:58:55 PM
to Justin Bogner, James Y Knight via llvm-dev
FWIW, if you're only ever formatting numbers and strings (which I agree is likely a majority of use cases), the syntax should be very easy to remember.  Most of the time you don't need to specify anything other than the placeholder index.  In that respect I expect it to catch on very quickly as there's really nothing to remember.  

Only if you want to customize the behavior will you maybe have to look up the syntax, and in that case you would have to do something equally funky with printf (such as not using it and writing 4 lines of streaming stuff to an ostream instead).

Chris Lattner via llvm-dev

Nov 8, 2016, 1:45:39 AM
to Zachary Turner, James Y Knight via llvm-dev
On Nov 7, 2016, at 2:58 PM, Zachary Turner via llvm-dev <llvm...@lists.llvm.org> wrote:

FWIW, if you're only ever formatting numbers and strings (which I agree is likely a majority of use cases), the syntax should be very easy to remember.  Most of the time you don't need to specify anything other than the placeholder index.  In that respect I expect it to catch on very quickly as there's really nothing to remember.  

Only if you want to customize the behavior will you maybe have to look up the syntax, and in that case you would have to do something equally funky with printf (such as not using it and writing 4 lines of streaming stuff to an ostream instead).

I haven’t looked at your most recent patch to see if you have already done this, but it would be great to add a section about this new API to "Important and useful LLVM APIs” in http://llvm.org/docs/ProgrammersManual.html.

-Chris

Zachary Turner via llvm-dev

Nov 8, 2016, 2:27:40 PM
to Chris Lattner, James Y Knight via llvm-dev
This is now done in the latest diff.  Thanks for the suggestion.

Zachary Turner via llvm-dev

Nov 8, 2016, 6:29:40 PM
to Chris Lattner, James Y Knight via llvm-dev
Will leave this up until this Friday before committing unless there are further blockers.

Zachary Turner via llvm-dev

Nov 10, 2016, 2:07:22 PM
to Chris Lattner, James Y Knight via llvm-dev
Just a friendly ping that I'm looking to get this in tomorrow if there are no further blockers.