
Extract TIFF Image from PDF Stream


Figment

Feb 7, 2002, 12:19:36 PM
Hello,
I'm trying to extract a TIFF image from a pdf file using Java. I am
able to read the file, and get to the stream data and read all the
bytes. However, when I try to interpret the data as a TIFF (either by
writing the bytes to a separate TIFF file, or using Java Advanced
Imaging to construct a TIFF in memory) it is not valid. I even went
so far as to look at the beginning of the data, and it did not meet
the TIFF standard (i.e. the first 8 bytes don't signify a TIFF image
header). I have posted the partial contents of the PDF object I'm
trying to extract below (in case it helps).
What am I missing to make this work? If the 'filter' in the PDF
says it is 'CCITTFaxDecode' shouldn't the data be a valid TIFF stream?
Any and all help appreciated.
Thanks,
Ryan

19 0 obj
<<
/Type /XObject
/Subtype /Image
/Name /Image1
/Filter /CCITTFaxDecode
/Width 2537
/Height 3380
/BitsPerComponent 1
/ColorSpace /DeviceGray
/Length 80544
/DecodeParms <<
/K -1 /Columns 2537
>> /Decode [0 1]
>>
stream
˙ňÝU 4B"\Ąu Á¬ŤA.....
endstream
endobj

John Doherty

Feb 7, 2002, 1:08:59 PM
In article <8b839693.0202...@posting.google.com>,
rsta...@yahoo.com (Figment) wrote:

> I'm trying to extract a TIFF image from a pdf file using Java. I am
> able to read the file, and get to the stream data and read all the
> bytes. However when I try to interpret the data as a tiff (either by
> writing the bytes to separate tiff file, or using Java Advanced
> Imaging to construct a TIFF in memory) it is not valid.

The image data in the PDF is not a TIFF. It's basically just a 2D
array of pixels.

> What am I missing to make this work? If the 'filter' in the PDF
> says it is 'CCITTFaxDecode' shouldn't the data be a valid tiff
> stream?

No. You should refer to the PDF spec, specifically pp. 263-275.

--

max

Feb 7, 2002, 3:03:29 PM
Ryan,
the CCITTFaxDecode filter does not mean that a TIFF file is included inside
the PDF. It means that CCITT fax data is included. It just HAPPENS that the
most popular kind of TIFF file is based on the same standard.
To be able to save CCITT fax data as a TIFF file you need the appropriate
TIFF header, for which you should refer to the TIFF spec. To sum up: you are
missing the TIFF header!!!
cheers,
max.
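
To make the "missing header" point concrete, here is a minimal Python sketch of the check Ryan describes doing by hand on the first 8 bytes. The function name looks_like_tiff is hypothetical and the sample bytes are made up for illustration; the only facts relied on are the standard TIFF header fields (byte-order mark "II" or "MM", the number 42, and a 4-byte offset to the first IFD).

```python
import struct


def looks_like_tiff(data: bytes) -> bool:
    """Return True if data begins with a classic 8-byte TIFF file header."""
    if len(data) < 8:
        return False
    order = data[:2]
    if order == b"II":      # little-endian ("Intel") byte order
        prefix = "<"
    elif order == b"MM":    # big-endian ("Motorola") byte order
        prefix = ">"
    else:
        return False
    # Bytes 2-3: the number 42. Bytes 4-7: offset of the first IFD.
    magic, first_ifd = struct.unpack(prefix + "HL", data[2:8])
    return magic == 42 and first_ifd >= 8


# Hypothetical raw stream bytes: compressed fax data has no such header.
raw_stream = b"\xff\xf2\xdd\x55\x20\x34\x42\x22"
print(looks_like_tiff(raw_stream))
print(looks_like_tiff(b"II*\x00\x08\x00\x00\x00"))
```

A raw CCITTFaxDecode stream is expected to fail this check, which is consistent with what Ryan saw.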


Figment

Feb 7, 2002, 5:55:57 PM
> > What am I missing to make this work? If the 'filter' in the PDF
> > says it is 'CCITTFaxDecode' shouldn't the data be a valid tiff
> > stream?
>
> No. You should refer to the PDF spec, specifically pp. 263-275.
>
> --
Thanks for replying!
I have the pdf Spec 1.4 and I'm assuming you mean the part under
graphics-->Images right? I read it once, and after reading it again,
I think I understand a little more, but I was hoping you might be able
to help me out again.

A PDF image stream (as you said) is essentially a 2D array of pixels.
However, if it is encoded with 'CCITTFaxDecode' then I would first need
to run this stream through a 'decoder' of some sort, and then
reference that 2D array of pixels... is that correct? If it is, I'm
assuming that the 'CCITTFaxDecode' algorithm would be the same
encoding/decoding that is described in the specification for the TIFF
file format, right? But how do I find out if the image is Group 4 or
Group 3? I'm pretty sure I'm still missing something...
Thanks for the help.
-Ryan

John Doherty

Feb 7, 2002, 6:41:25 PM

> > > What am I missing to make this work? If the 'filter' in the PDF
> > > says it is 'CCITTFaxDecode' shouldn't the data be a valid tiff
> > > stream?
> >
> > No. You should refer to the PDF spec, specifically pp. 263-275.
> >
> > --
> Thanks for replying!
> I have the pdf Spec 1.4 and I'm assuming you mean the part under
> graphics-->Images right? I read it once, and after reading it again,
> I think I understand a little more, but I was hoping you might be able
> to help me out again.
>
> A PDF image stream (as you said) is essentially a 2D array of pixels.
> However, if it is encoded with 'CCITTFaxDecode' then I would first need
> to run this stream through a 'decoder' of some sort, and then
> reference that 2D array of pixels... is that correct?

That would be the general idea. You can probably find java code out
there somewhere to do the decoding.

> If it is, I'm assuming that the 'CCITTFaxDecode' algorithm would be the
> same encoding/decoding that is described in the specification for the TIFF
> File format right? But how do I find out if the image is Group 4 or
> Group 3?

> 19 0 obj
> <<
> /Type /XObject
> /Subtype /Image
> /Name /Image1
> /Filter /CCITTFaxDecode
> /Width 2537
> /Height 3380
> /BitsPerComponent 1
> /ColorSpace /DeviceGray
> /Length 80544
> /DecodeParms <<
> /K -1 /Columns 2537

"/K -1" indicates that it's Group 4.

--
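
The rule behind that answer can be written down as a small Python sketch. The function name ccitt_scheme is hypothetical; the mapping itself follows the PDF Reference description of the /K entry of /DecodeParms, where an absent /K is treated as 0.

```python
def ccitt_scheme(k: int = 0) -> str:
    """Map the /K entry of /DecodeParms to the CCITT encoding scheme."""
    if k < 0:
        return "Group 4 (pure two-dimensional)"
    if k == 0:
        return "Group 3, 1-D (pure one-dimensional)"
    return "Group 3, 2-D (mixed one- and two-dimensional)"


# The object posted above has /K -1.
print(ccitt_scheme(-1))
```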

sergey.a...@gmail.com

Jan 1, 2016, 5:46:38 AM
Python implementation:

import PyPDF2
import struct

"""
Links:
PDF format: http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/pdf_reference_1-7.pdf
CCITT Group 4: https://www.itu.int/rec/dologin_pub.asp?lang=e&id=T-REC-T.6-198811-I!!PDF-E&type=items
Extract images from pdf: http://stackoverflow.com/questions/2693820/extract-images-from-pdf-without-resampling-in-python
Extract images coded with CCITTFaxDecode in .net: http://stackoverflow.com/questions/2641770/extracting-image-from-pdf-with-ccittfaxdecode-filter
TIFF format and tags: http://www.awaresystems.be/imaging/tiff/faq.html
"""


def tiff_header_for_CCITT(width, height, img_size, CCITT_group=4):
    # Layout: 8-byte file header, 2-byte tag count, eight 12-byte IFD
    # entries, then the 4-byte offset of the next IFD (hence the final 'l').
    tiff_header_struct = '<' + '2s' + 'h' + 'l' + 'h' + 'hhll' * 8 + 'l'
    return struct.pack(tiff_header_struct,
                       b'II',  # Byte order indication: little-endian
                       42,  # Version number (always 42)
                       8,  # Offset to first IFD
                       8,  # Number of tags in IFD
                       256, 4, 1, width,  # ImageWidth, LONG, 1, width
                       257, 4, 1, height,  # ImageLength, LONG, 1, height
                       258, 3, 1, 1,  # BitsPerSample, SHORT, 1, 1
                       259, 3, 1, CCITT_group,  # Compression, SHORT, 1, 3 = CCITT Group 3, 4 = CCITT Group 4
                       262, 3, 1, 0,  # PhotometricInterpretation, SHORT, 1, 0 = WhiteIsZero
                       273, 4, 1, struct.calcsize(tiff_header_struct),  # StripOffsets, LONG, 1, length of header
                       278, 4, 1, height,  # RowsPerStrip, LONG, 1, height
                       279, 4, 1, img_size,  # StripByteCounts, LONG, 1, size of image
                       0  # Offset of next IFD (0 = this is the last IFD)
                       )


pdf_filename = 'scan.pdf'
pdf_file = open(pdf_filename, 'rb')
cond_scan_reader = PyPDF2.PdfFileReader(pdf_file)
for i in range(0, cond_scan_reader.getNumPages()):
    page = cond_scan_reader.getPage(i)
    xObject = page['/Resources']['/XObject'].getObject()
    for obj in xObject:
        if xObject[obj]['/Subtype'] == '/Image':
            """
            The CCITTFaxDecode filter decodes image data that has been encoded using
            either Group 3 or Group 4 CCITT facsimile (fax) encoding. CCITT encoding is
            designed to achieve efficient compression of monochrome (1 bit per pixel) image
            data at relatively low resolutions, and so is useful only for bitmap image data, not
            for color images, grayscale images, or general data.

            K < 0 --- Pure two-dimensional encoding (Group 4)
            K = 0 --- Pure one-dimensional encoding (Group 3, 1-D)
            K > 0 --- Mixed one- and two-dimensional encoding (Group 3, 2-D)
            """
            if xObject[obj]['/Filter'] == '/CCITTFaxDecode':
                # Any negative K means Group 4, not only -1.
                if xObject[obj]['/DecodeParms']['/K'] < 0:
                    CCITT_group = 4
                else:
                    CCITT_group = 3
                width = xObject[obj]['/Width']
                height = xObject[obj]['/Height']
                data = xObject[obj]._data  # raw stream bytes; sorry, getData() does not work for CCITTFaxDecode
                img_size = len(data)
                tiff_header = tiff_header_for_CCITT(width, height, img_size, CCITT_group)
                img_name = obj[1:] + '.tiff'
                with open(img_name, 'wb') as img_file:
                    img_file.write(tiff_header + data)
                #
                # import io
                # from PIL import Image
                # im = Image.open(io.BytesIO(tiff_header + data))
pdf_file.close()

bellamkon...@gmail.com

Jun 7, 2018, 6:17:25 AM
This is not working for me. How are people getting the data out of the PDF? I can't work it out; could you let me know? My output is:
'/Contents': {'/Length': '33'},
'/CropBox': ['0', '0', '614.4', '792'],
'/MediaBox': ['0', '0', '614.4', '792'],
'/Parent': {'/Count': '2',
'/Kids': [{...},
{'/Contents': {'/Length': '33'},
'/CropBox': ['0', '0', '614.4', '822'],
'/MediaBox': ['0', '0', '614.4', '822'],
'/Parent': {...},
'/Resources': {'/ProcSet': ['/PDF', '/Text', '/ImageC'],
'/XObject': {'/Im1': {'/BitsPerComponent': '1',
'/ColorSpace': '/DeviceGray',
'/DecodeParms': [{'/BlackIs1': 'false',
'/Columns': '2560',
'/K': '-1',
'/Rows': '3425'}],
'/Filter': ['/CCITTFaxDecode'],
'/Height': '3425',
'/Length': '30572',
'/Name': '/Im1',
'/Subtype': '/Image',
'/Type': '/XObject',
'/Width': '2560'}}},
'/Thumb': {'/BitsPerComponent': '1',
'/ColorSpace': '/DeviceGray',
'/DecodeParms': [{'/BlackIs1': 'false',
'/Columns': '79',
'/K': '-1',
'/Rows': '106'}],
'/Filter': ['/CCITTFaxDecode'],
'/Height': '106',
'/Length': '463',
'/Width': '79'},
'/Type': '/Page'}],
'/Type': '/Pages'},
'/Resources': {'/ProcSet': ['/PDF', '/Text', '/ImageC'],
'/XObject': {'/Im0': {'/BitsPerComponent': '1',
'/ColorSpace': '/DeviceGray',
'/DecodeParms': [{'/BlackIs1': 'false',
'/Columns': '2560',
'/K': '-1',
'/Rows': '3300'}],
'/Filter': ['/CCITTFaxDecode'],
'/Height': '3300',
'/Length': '45897',
'/Name': '/Im0',
'/Subtype': '/Image',
'/Type': '/XObject',
'/Width': '2560'}}},
'/Thumb': {'/BitsPerComponent': '1',
'/ColorSpace': '/DeviceGray',
'/DecodeParms': [{'/BlackIs1': 'false',
'/Columns': '82',
'/K': '-1',
'/Rows': '106'}],
'/Filter': ['/CCITTFaxDecode'],
'/Height': '106',
'/Length': '726',
'/Width': '82'},
'/Type': '/Page'}
How do I get the data from this?
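
One possible reason the script above finds nothing in a file like this: the dump shows /Filter and /DecodeParms as one-element arrays, while the script compares /Filter against a bare name and indexes /DecodeParms as a dictionary. The Python sketch below normalizes both forms. The helper names as_list and ccitt_params are hypothetical, plain dicts and lists stand in for the PDF library's objects, and the values are written as Python ints and booleans for illustration. It is an assumption that array entries are already resolved (a real library may hand back indirect references that need dereferencing first).

```python
def as_list(value):
    """Wrap a single value in a list; pass lists through; None becomes []."""
    if value is None:
        return []
    if isinstance(value, (list, tuple)):
        return list(value)
    return [value]


def ccitt_params(image):
    """Return the CCITT decode parameters of an image dictionary.

    Returns None unless /CCITTFaxDecode is the only filter, because the
    raw stream can only be wrapped in a TIFF header in that case.
    """
    filters = as_list(image.get("/Filter"))
    if filters != ["/CCITTFaxDecode"]:
        return None
    parms = as_list(image.get("/DecodeParms"))
    return dict(parms[0]) if parms and parms[0] else {}


# Shaped like the /Im0 entry in the dump above.
im0 = {
    "/Filter": ["/CCITTFaxDecode"],
    "/DecodeParms": [{"/BlackIs1": False, "/Columns": 2560, "/K": -1, "/Rows": 3300}],
    "/Width": 2560,
    "/Height": 3300,
}
parms = ccitt_params(im0)
print(parms["/K"], parms["/Columns"])
```

With the parameters in hand, the /K test and the header construction from the script can proceed as before.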

darbydayst...@gmail.com

Dec 6, 2019, 9:48:49 AM
Thank you so much! Works well. For PDFs that have /BlackIs1 set, just change the value packed for tag 262 in the struct (the threshold setting) from 0 to 1.
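
That adjustment could be made automatic along these lines. This is only a sketch of the tip above: the function name photometric_for is hypothetical, it assumes /BlackIs1 arrives as a Python boolean, and it ignores the image's /Decode array (a /Decode of [1 0] would invert the image again and is not handled here).

```python
def photometric_for(decode_parms):
    """Pick the value for TIFF tag 262 from the PDF /DecodeParms dictionary.

    0 = WhiteIsZero (the value the script hard-codes), 1 = BlackIsZero.
    /BlackIs1 defaults to false when absent.
    """
    return 1 if decode_parms.get("/BlackIs1", False) else 0


print(photometric_for({"/K": -1, "/BlackIs1": True}))
print(photometric_for({"/K": -1}))
```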
