Extract embedded file from specific sheet

262 views
Skip to first unread message

Josh Grou

unread,
Jul 5, 2023, 9:53:20 AM7/5/23
to openpyxl-users
Hi All,

In an excel file that some people are using they embed a file that is output from a measurement instrument in a sheet that contains some metadata about the measurement. I have seen ways to extract all of the embedded files by unzipping the excel file but I need to link the metadata to the embedded file in the case that there are multiple of these sheets in a workbook. Does anyone know of a way to tie these embedded files to a specific worksheet?

Charlie Clark

unread,
Jul 5, 2023, 11:39:36 AM7/5/23
to openpyxl-users
I don't think openpyxl exposes files directly but you should be able to use some of the tools to get what you need. For example, if a file is embedded within a worksheet then there should be a relationship element to help you link the two.

Charlie

--
Charlie Clark
Managing Director
Clark Consulting & Research
German Office
Sengelsweg 34
Düsseldorf
D- 40489
Tel: +49-203-3925-0390
Mobile: +49-178-782-6226

Josh Grou

unread,
Jul 5, 2023, 4:08:54 PM7/5/23
to openpyxl-users
Thanks for the pointer Charlie,

I didn't end up using the openpyxl handling for relationships but that got me going in the right direction. Instead, I directly accessed the xml files and looked for relationships with targets in the embedding folder which seems to do the job. The only drawback I am having is that I am not entirely sure where the actual sheet name is stored in the excel zip archive, so for now I just have to go by the order of the sheets which does seem to match up with the workbook.worksheets method. My implementation if anyone has a similar problem:

import zipfile
import os
import io
from xml.etree import cElementTree as ElementTree
import pandas as pd

from copy import copy

def xmltodict(r,root=True):
    if root:
        return {r.tag : xmltodict(r, False)}
    d=copy(r.attrib)
    if r.text:
        d["_text"]=r.text
    for x in r.findall("./*"):
        if x.tag not in d:
            d[x.tag]=[]
        d[x.tag].append(xmltodict(x,False))
    return d

filename = "embedd_test.xlsx"
rel_path = "xl/worksheets/_rels"
embedd_path = "../embeddings"


with zipfile.ZipFile(filename,"r") as fd:
    for fn in fd.namelist():
        if fn.startswith(rel_path):
            tree = ElementTree.parse(io.BytesIO(fd.read(fn)))
            root = tree.getroot()
            xmldict = xmltodict(root)
            for item in xmldict['{http://schemas.openxmlformats.org/package/2006/relationships}Relationships']['{http://schemas.openxmlformats.org/package/2006/relationships}Relationship']:
                if item["Target"].startswith(embedd_path):
                    location = item["Target"]
                    print("Attachment Found for sheet: " + str(fn))
                    df = pd.read_excel(io.BytesIO(fd.read(location.replace('..',"xl"))))
                    print(df)

Charlie Clark

unread,
Jul 6, 2023, 5:29:56 AM7/6/23
to openpyxl-users

On 5 Jul 2023, at 22:08, Josh Grou wrote:

Thanks for the pointer Charlie,

I didn't end up using the openpyxl handling for relationships but that got
me going in the right direction. Instead, I directly accessed the xml files
and looked for relationships with targets in the embedding folder which
seems to do the job. The only drawback I am having is that I am not
entirely sure where the actual sheet name is stored in the excel zip
archive, so for now I just have to go by the order of the sheets which does
seem to match up with the workbook.worksheets method. My implementation if
anyone has a similar problem:

You're making a rod for your own back because openpyxl provides all the primitives for handling the XML source reliably.

  • The workbook parser will provide you the worksheet names and file names
  • Use those to find the worksheets with relationships
  • Parse the relationships using RelationshipList and filter (the find method might be useful) out the files you're looking for
  • Pass the list of files to your processor (it's always good to decouple this from the finding loop)
Message has been deleted

Josh Grou

unread,
Jul 7, 2023, 10:35:47 AM7/7/23
to openpyxl-users
Thanks for the feedback and advice Charlie,

I was being a bit lazy and not understanding the use of the openpyxl RelationshipList class, I spent some time to refactor using the openpyxl handling which turned out to be simpler than I thought it would be.
There may be a better method, but this seems to solve the task that I had for this project in a fairly robust way. In the future I may add a file extension filter to the find_workbook_attachments function as I am not sure what all can be included in the openxml package content type.

Again, I'll post my implementation in case anyone is facing a similar issue:

import openpyxl as op
import zipfile
import io

def find_workbook_attachments(wb_filename, content_type):
    """Searches a workbook for embedded files based on content type and returns the archived filepath to each attachments grouped by the worksheet.

    Args:
        wb_filename (string): File path to identify attachments
        content_type (string): openxml content type to filter by

    Returns:
        targets (dictionary): Dictionary where keys are worksheet names and values are archive file paths of attachments
    """    
   
    wb = op.load_workbook(wb_filename)
    targets = {}
    for ws_name in wb.sheetnames:
        ws = wb[ws_name]
        rel_list = ws._rels
        for rel in rel_list.find(content_type):
            target_file = rel.Target
            if ws_name in targets:
                targets[ws_name].append(target_file)
            else:
                targets[ws_name] = target_file
    return targets

def extract_workbook_attachments_by_sheet(wb_filename,targets):
    """Collects the embedded files based on a list of targets grouped by worksheet.

    Args:
        wb_filename (string): File path to extract attachments from
        targets (dictionary): Dictionary where keys are worksheet names and values are archive file paths of attachments

    Returns:
        extracted_contents (dictionary): Dictionary where keys are worksheet names and values are the BytesIO object of the attachments
    """    
    extracted_contents = {}
    with zipfile.ZipFile(wb_filename,"r") as fd:
        for ws, target in targets.items():
            data_obj = io.BytesIO(fd.read(target))
            if ws in extracted_contents:
                extracted_contents[ws].append(data_obj)
            else:
                extracted_contents[ws] = data_obj
    return extracted_contents

targets = find_workbook_attachments("embedd_test.xlsx", content_type = 'http://schemas.openxmlformats.org/officeDocument/2006/relationships/package')
attachments = extract_workbook_attachments_by_sheet("embedd_test.xlsx", targets)

Charlie Clark

unread,
Jul 7, 2023, 12:02:26 PM7/7/23
to openpyxl-users

On 7 Jul 2023, at 16:31, Josh Grou wrote:

Thanks for the feedback and advice Charlie,

I was being a bit lazy and not understanding the use of the openpyxl
RelationshipList class, I spent some time to refactor using the openpyxl
handling which turned out to be simpler than I thought it would be.
There may be a better method, but this seems to solve the task that I had
for this project in a fairly robust way. In the future I may add a file
extension filter to the find_workbook_attachments function as I am not sure
what all can be included in the openxml package content type.

The code looks fine though I don't know why you're loading the whole workbook first. You're also depending upon the private ws._rels attribute, which I wouldn't as I plan to do away with it.

A couple of notes:

  • Please don't alias openpyxl when importing. It doesn't expose much at the top level that you can't import it directly. I know it's standard with numpy where it kind of makes sense, and pandas, where it exposes an overly ambitious API IMO.
  • Use ws.title instead of ws_name; you can just loop over the sheets in a workbook
  • I don't quite understand the dictionary logic: are values lists or single items? defaultdict could be your friend.

I've done some refactoring of RelationshipList this week, more related to its use when writing files but also looking forward to making reading dependencies more deterministic. Procedural is where we came from but it does make the code a tad repetitive.

Reply all
Reply to author
Forward
0 new messages