new_obj_num_array_.begin(), new_obj_num_array_.end()));Andy PhanDoes `new_obj_num_array_` need to be put into a set? Does it contain duplicate values?
No, my goal was for performance when finding values in large collections, but since the input is guaranteed sorted, binary search works. Replaced with more spans.
const CPDF_Object* obj_to_write = obj.Get();
auto it = font_obj_overrides.find(objnum);
if (it != font_obj_overrides.end()) {
obj_to_write = it->second.Get();
}Andy PhanUse a ternary operator here:
```
auto it = font_obj_overrides.find(objnum);
const CPDF_Object* obj_to_write = it != font_obj_overrides.end()
? it->second.Get(): obj.Get();
```
Done
// Mapping from font file streams to subset candidates.Andy PhanIs the key a PDF object number?
Yes. This was still WIP, but clarified in comment.
void AddUsedText(SubsetCandidate& candidate,Andy PhanMay be slightly more intuitive to order the parameters to follow the ordering in the comment.
Done
#include "hb-subset.h"Andy PhanIs this the best way to include HarfBuzz includes?
Not sure why, but that's how they're included in Chromium and Skia.
new_obj_nums_ = new_obj_nums;Andy PhanAs-is: Use a non-const/non-ref parameter and std::move() here. Or pass in the vector (span?) from the caller and uniquify internally. Either way will avoid a copy.
Chose span.
if (font->IsCIDFont()) {Andy PhanIf this is true, then `font` can be casted to CPDF_CIDFont, right? In which case, does that mean `descendants` will never be null, since CPDF_CIDFont::Load() presumably succeeded?
More generally, can CPDF_Font and subclasses help simplify the code here?
Yes, although I don't see any methods that particularly help here. Able to convert these to CHECKs, though?
CPDF_Document* CPDFDocumentFromFPDFDocument(FPDF_DOCUMENT document) {Ehh... Found two other code spots that do this.
std::vector<uint32_t> kTestObjNums{5, 6, 7, 8, 9, 10, 11, 12, 13};Is there a better way to do these new object numbers?
}Do we want to test rendering? If so, should we test the rendering here or in fpdf_save_embeddertest.cpp?
In future CLs, these tests will check the root font, CID font, and font descriptor objects, so I'm not sure if that would make sense to be moved fpdf_save_embeddertest.cpp.
it = overrides.find(16);Hardcoded to find a specific object number. Not sure how to avoid. This will get worse when the root font, CID font, and font descriptor objects are added to the overrides list.
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
Do we want to test rendering? If so, should we test the rendering here or in fpdf_save_embeddertest.cpp?
In future CLs, these tests will check the root font, CID font, and font descriptor objects, so I'm not sure if that would make sense to be moved fpdf_save_embeddertest.cpp.
We already test a little bit of rendering in fpdf_save_embeddertest.cpp, but not to this extent.
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
#include "hb-subset.h"Andy PhanIs this the best way to include HarfBuzz includes?
Not sure why, but that's how they're included in Chromium and Skia.
Looks like the latest patchset changed it.
CPDF_Document* CPDFDocumentFromFPDFDocument(FPDF_DOCUMENT document) {Ehh... Found two other code spots that do this.
PDFium has all 3 ways of testing:
1) Only fpdfsdk/ embedder tests, but they cheat and cast to CPDF objects.
2) Separate fpdfsdk/ embedder tests and core/ embedder tests.
3) Separate fpdfsdk/ embedder tests and core/ unit tests.
If it's easier to group everything together, maybe just do (1)?
std::vector<uint32_t> kTestObjNums{5, 6, 7, 8, 9, 10, 11, 12, 13};Is there a better way to do these new object numbers?
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
pdfium::span<uint32_t> new_obj_nums);`const uint32_t` if CPDF_FontSubsetter is only reading the span.
if (!std::binary_search(new_obj_nums.begin(), new_obj_nums.end(),std::ranges::binary_search()
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
`const uint32_t` if CPDF_FontSubsetter is only reading the span.
Done
if (!std::binary_search(new_obj_nums.begin(), new_obj_nums.end(),Andy Phanstd::ranges::binary_search()
Done
if (font->IsCIDFont()) {Andy PhanIf this is true, then `font` can be casted to CPDF_CIDFont, right? In which case, does that mean `descendants` will never be null, since CPDF_CIDFont::Load() presumably succeeded?
More generally, can CPDF_Font and subclasses help simplify the code here?
Yes, although I don't see any methods that particularly help here. Able to convert these to CHECKs, though?
Done
std::vector<uint32_t> kTestObjNums{5, 6, 7, 8, 9, 10, 11, 12, 13};Lei ZhangIs there a better way to do these new object numbers?
std::iota()?
Done
Andy PhanDo we want to test rendering? If so, should we test the rendering here or in fpdf_save_embeddertest.cpp?
In future CLs, these tests will check the root font, CID font, and font descriptor objects, so I'm not sure if that would make sense to be moved fpdf_save_embeddertest.cpp.
We already test a little bit of rendering in fpdf_save_embeddertest.cpp, but not to this extent.
Decided to keep the tests as-is and have the rendering/text extraction tests occur in fpdf_save_embeddertest.cpp. CPDFFontSubsetterTest can test dictionary values and stream lengths.
Hardcoded to find a specific object number. Not sure how to avoid. This will get worse when the root font, CID font, and font descriptor objects are added to the overrides list.
For input: using a large range of new obj nums.
For output: using gMock matchers.
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
ScopedFPDFFont font(FPDFText_LoadFont(document(), font_data.data(),BTW, I think this still has to be an embedder test for FPDFText_LoadFont.
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
std::map<uint32_t, CPDF_FontSubsetter::SubsetCandidate> candidates_;Can be shortened, like line 68.
CPDF_Font* font,Isn't `font` always `text->GetFont()`? Omit?
std::map<uint32_t, RetainPtr<const CPDF_Object>> GenerateObjectOverrides(Can the caller call this method multiple times without issues? Should `candidates_` be cleared at the start of the method, or should it be created on the stack and passed around?
std::vector<uint8_t> GenerateFontSubset(CPDF_Document* doc,Return DataVector<uint8_t> instead. Then use the CPDF_Stream ctor that takes a DataVector<uint8_t> (+ dictionary with /Length1 set) to avoid a copy.
unsigned int out_len = hb_blob_get_length(subset_blob.get());Instead of separately getting the length this way, use the second parameter of hb_blob_get_data().
auto old_stream_acc =Maybe "old" and "new" can be "original" and "subsetted", respectively.
if (!page_obj->IsText()) {I think this can call AsText() like line 154 here, and then skip the entry if `text` is null.
if (!font) {Based on other GetFont() callers, this shouldn't ever happen.
if (!root_font) {Same. Never happens except during FPDF_DestroyLibrary().
void InsertNewTextObject(const FPDF_PAGE& page,Some of the test refactoring can happen separately.
float x,Use a CFX_PointF for x and y coordinates.
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
std::map<uint32_t, CPDF_FontSubsetter::SubsetCandidate> candidates_;Can be shortened, like line 68.
Done
Isn't `font` always `text->GetFont()`? Omit?
Done
std::map<uint32_t, RetainPtr<const CPDF_Object>> GenerateObjectOverrides(Can the caller call this method multiple times without issues? Should `candidates_` be cleared at the start of the method, or should it be created on the stack and passed around?
At the moment, it won't cause any issues, but it will in future CLs. So clearing at start of method.
std::vector<uint8_t> GenerateFontSubset(CPDF_Document* doc,Return DataVector<uint8_t> instead. Then use the CPDF_Stream ctor that takes a DataVector<uint8_t> (+ dictionary with /Length1 set) to avoid a copy.
Done
unsigned int out_len = hb_blob_get_length(subset_blob.get());Instead of separately getting the length this way, use the second parameter of hb_blob_get_data().
Done
Maybe "old" and "new" can be "original" and "subsetted", respectively.
Done
I think this can call AsText() like line 154 here, and then skip the entry if `text` is null.
Done
Based on other GetFont() callers, this shouldn't ever happen.
Done
Same. Never happens except during FPDF_DestroyLibrary().
Done
Some of the test refactoring can happen separately.
Use a CFX_PointF for x and y coordinates.
Done at https://pdfium-review.googlesource.com/c/pdfium/+/143410 and here.
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
| Code-Review | +1 |
auto subsetted_stream = pdfium::MakeRetain<CPDF_Stream>(Local variable no longer needed. Just assign directly.
std::vector<uint32_t> new_obj_nums;Probably don't need this variable, just pass `{}` and GetTestNewObjNums() in directly.
size_t original_size = font_data.size();Make this const. More below.
ASSERT_GT(original_size, 0u);Can probably put in a larger number here. Also line 167. An exact number comparion is probably ok, since the test fonts rarely changes. If they do change, it might affect the tests anyway and someone may want to adjust the numbers. More below.
ScopedFPDFFont font(FPDFText_LoadFont(document(), font_data.data(),BTW, I think this still has to be an embedder test for FPDFText_LoadFont.
OK. Copying the relevant part of FPDFText_LoadFont() itself actually isn't that bad, but it would be unreasonable to copy LoadCompositeFont().
EXPECT_THAT(overrides, UnorderedElementsAre(StreamSizeIsWithinRange(But there's only 1 element in the vector. More below.
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
Local variable no longer needed. Just assign directly.
Done
Probably don't need this variable, just pass `{}` and GetTestNewObjNums() in directly.
Done
"Not a text object." ?
Done
Make this const. More below.
Done
Can probably put in a larger number here. Also line 167. An exact number comparion is probably ok, since the test fonts rarely changes. If they do change, it might affect the tests anyway and someone may want to adjust the numbers. More below.
Makes sense, done.
EXPECT_THAT(overrides, UnorderedElementsAre(StreamSizeIsWithinRange(But there's only 1 element in the vector. More below.
Will use Contain() for now, but not for long when we override more objects.
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
24 is the latest approved patch-set.
The change was submitted with unreviewed changes in the following files:
```
The name of the file: core/fpdfapi/edit/cpdf_fontsubsetter_embeddertest.cpp
Insertions: 20, Deletions: 24.
@@ -22,6 +22,7 @@
#include "testing/utils/file_util.h"
#include "testing/utils/path_service.h"
+using ::testing::Contains;
using ::testing::IsEmpty;
using ::testing::Matcher;
using ::testing::UnorderedElementsAre;
@@ -113,16 +114,16 @@
CPDF_FontSubsetter subsetter(CPDFDocumentFromFPDFDocument(document()));
- std::vector<uint32_t> new_obj_nums;
- EXPECT_THAT(subsetter.GenerateObjectOverrides(new_obj_nums), IsEmpty());
+ EXPECT_THAT(subsetter.GenerateObjectOverrides({}), IsEmpty());
- new_obj_nums = GetTestNewObjNums();
- EXPECT_THAT(subsetter.GenerateObjectOverrides(new_obj_nums), IsEmpty());
+ EXPECT_THAT(subsetter.GenerateObjectOverrides(GetTestNewObjNums()),
+ IsEmpty());
- // No-text object.
+ // Not a text object.
FPDF_PAGEOBJECT rect = FPDFPageObj_CreateNewRect(20, 100, 50, 50);
FPDFPage_InsertObject(page.get(), rect);
- EXPECT_THAT(subsetter.GenerateObjectOverrides(new_obj_nums), IsEmpty());
+ EXPECT_THAT(subsetter.GenerateObjectOverrides(GetTestNewObjNums()),
+ IsEmpty());
}
TEST_F(CPDFFontSubsetterTest, OpenType) {
@@ -133,10 +134,9 @@
GetTestFontFilePath("NotoSansCJKjp-Regular.otf");
ASSERT_FALSE(font_path.empty());
- // The file size for `font_path` is ~16.4 MB.
std::vector<uint8_t> font_data = GetFileContents(font_path.c_str());
- size_t original_size = font_data.size();
- ASSERT_GT(original_size, 0u);
+ const size_t original_size = font_data.size();
+ ASSERT_EQ(16427228u, original_size);
ScopedFPDFFont font(FPDFText_LoadFont(document(), font_data.data(),
font_data.size(), FPDF_FONT_TRUETYPE,
@@ -150,7 +150,7 @@
ASSERT_EQ(1u, overrides.size());
// Subset size is ~2.5% of the original font file, i.e. ~450 KB.
- EXPECT_THAT(overrides, UnorderedElementsAre(StreamSizeIsWithinRange(
+ EXPECT_THAT(overrides, Contains(StreamSizeIsWithinRange(
original_size * 0.02, original_size * 0.03)));
}
@@ -161,10 +161,9 @@
const std::string font_path = GetTestFontFilePath("Arimo-Regular.ttf");
ASSERT_FALSE(font_path.empty());
- // The file size for `font_path` is ~436 KB.
std::vector<uint8_t> font_data = GetFileContents(font_path.c_str());
- size_t original_size = font_data.size();
- ASSERT_GT(original_size, 0u);
+ const size_t original_size = font_data.size();
+ ASSERT_EQ(436180u, original_size);
ScopedFPDFFont font(FPDFText_LoadFont(document(), font_data.data(),
font_data.size(), FPDF_FONT_TRUETYPE,
@@ -179,7 +178,7 @@
ASSERT_EQ(1u, overrides.size());
// Subset size is ~3% of the original font file, i.e. ~13 KB.
- EXPECT_THAT(overrides, UnorderedElementsAre(StreamSizeIsWithinRange(
+ EXPECT_THAT(overrides, Contains(StreamSizeIsWithinRange(
original_size * 0.025, original_size * 0.035)));
}
@@ -190,10 +189,9 @@
const std::string font_path = GetTestFontFilePath("Arimo-Regular.ttf");
ASSERT_FALSE(font_path.empty());
- // The file size for `font_path` is ~436 KB.
std::vector<uint8_t> font_data = GetFileContents(font_path.c_str());
- size_t original_size = font_data.size();
- ASSERT_GT(original_size, 0u);
+ const size_t original_size = font_data.size();
+ ASSERT_EQ(436180u, original_size);
ScopedFPDFFont font(FPDFText_LoadFont(document(), font_data.data(),
font_data.size(), FPDF_FONT_TRUETYPE,
@@ -210,7 +208,7 @@
ASSERT_EQ(1u, overrides.size());
// Subset size is ~3.5% of the original font file, i.e. ~15 KB.
- EXPECT_THAT(overrides, UnorderedElementsAre(StreamSizeIsWithinRange(
+ EXPECT_THAT(overrides, Contains(StreamSizeIsWithinRange(
original_size * 0.03, original_size * 0.04)));
}
@@ -223,14 +221,12 @@
const std::string font_path2 = GetTestFontFilePath("Arimo-Regular.ttf");
ASSERT_FALSE(font_path2.empty());
- // The file size for `font_path1` is ~48 KB.
std::vector<uint8_t> font_data1 = GetFileContents(font_path1.c_str());
- size_t original_size1 = font_data1.size();
- ASSERT_GT(original_size1, 0u);
- // The file size for `font_path2` is ~436 KB.
+ const size_t original_size1 = font_data1.size();
+ ASSERT_EQ(48908u, original_size1);
std::vector<uint8_t> font_data2 = GetFileContents(font_path2.c_str());
- size_t original_size2 = font_data2.size();
- ASSERT_GT(original_size2, 0u);
+ const size_t original_size2 = font_data2.size();
+ ASSERT_EQ(436180u, original_size2);
ScopedFPDFFont font1(FPDFText_LoadFont(document(), font_data1.data(),
font_data1.size(), FPDF_FONT_TRUETYPE,
```
```
The name of the file: core/fpdfapi/edit/cpdf_fontsubsetter.cpp
Insertions: 1, Deletions: 2.
@@ -122,9 +122,8 @@
// TrueType fonts requires a Length1 entry.
subsetted_font_dict->SetNewFor<CPDF_Number>(
"Length1", static_cast<int>(subsetted_font_data.size()));
- auto subsetted_stream = pdfium::MakeRetain<CPDF_Stream>(
+ overrides[obj_num] = pdfium::MakeRetain<CPDF_Stream>(
std::move(subsetted_font_data), std::move(subsetted_font_dict));
- overrides[obj_num] = subsetted_stream;
}
return overrides;
}
```
Introduce CPDF_FontSubsetter to subset new fonts on PDF save
Add CPDF_FontSubsetter, which, when given a list of new PDF object
numbers, will output object overrides required to subset any new fonts
while saving the PDF. The objects do not affect the original copy and
are purely used for writing the new PDF, avoiding any in-memory PDF
mutation. CPDF_FontSubsetter uses HarfBuzz internally to subset fonts.
third_party/test_fonts is already included in PDFium, so use it to test
font subsetting. Some of the fonts are multiple megabytes large, so some
tests may take ~2-3 seconds.
There are still some remaining TODOs to make the subsetted font
compliant with the PDF spec, but some PDF viewers will be able to
properly render the subsetted font with this CL alone.
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |