Caching content during import

Filippo Tampieri

Apr 16, 2016, 6:08:39 PM
to Beancount
I have a CSV importer class that is instantiated 4 times so I can use it as an importer for 4 different accounts.
I was expecting the cache.FileMemo object to reduce the number of times a CSV file is parsed when running bean-extract on it.
In fact, I expected my CSV parser to run only once, since I wrapped it in a method of my importer class like this:

    def parse(self, file):
        return file.convert(parse)

where the inner parse is the actual parser function:

    def parse(filename):
        # Do all the work here.
        ...

and the rest of my importer class only calls self.parse(file) to get to the results of the parser.

Unfortunately, looking at the bean-extract code in beancount/ingest/extract.py, I see that the cache.FileMemo object passed to my importer is constructed by extract_from_file() once for each combination of importer and filename. Since the cache is stored as an attribute of the FileMemo object (the 'file' variable), there is no chance to share it across importers.

On the other hand, the code in beancount.ingest.identify.find_imports does share the FileMemo object across all the importers when calling their identify() method.

Since I need the parsed contents of my CSV files in both the identify() and extract() methods of my importers, I end up with each CSV file being loaded and parsed twice.
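
To make the effect concrete, here is a small self-contained sketch. Memo is a hypothetical stand-in for cache.FileMemo (it is not the Beancount class), and the parser only records its calls. It assumes the behaviour described above: a fresh object per importer in the extract step, one shared object in the identify step.

```python
class Memo:
    """Hypothetical stand-in for cache.FileMemo; not the Beancount class."""

    def __init__(self, name):
        self.name = name
        self._cache = {}  # The cache lives on the object itself.

    def convert(self, converter):
        # Run the converter at most once for this particular object.
        if converter not in self._cache:
            self._cache[converter] = converter(self.name)
        return self._cache[converter]


calls = []


def parse(filename):
    calls.append(filename)  # Record every real parse.
    return ['row from ' + filename]


# A fresh object per importer: the cache is never reused.
for _ in range(4):
    Memo('bank.csv').convert(parse)
fresh_count = len(calls)

# One object shared by all importers: the parser runs once.
calls.clear()
shared = Memo('bank.csv')
for _ in range(4):
    shared.convert(parse)
shared_count = len(calls)
```

With four importers, fresh_count is the number of parses when each importer gets its own object, and shared_count is the number when they all share one.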

Martin, are you aware of this?
If this is the way it is going to be, please let me know so I can use a lighter-weight parsing step for the identify() method and a beefier one for the extract() method.
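
For reference, the lighter-weight step I have in mind for identify() would read only the header row. This is just a sketch: EXPECTED_HEADER and looks_like_my_csv are hypothetical names, and the column names are made up for illustration.

```python
import csv

# Hypothetical header, for illustration only.
EXPECTED_HEADER = ['Date', 'Description', 'Amount']


def looks_like_my_csv(filename):
    """Cheap check for identify(): read only the first row of the file."""
    with open(filename, newline='') as infile:
        first_row = next(csv.reader(infile), None)  # None for an empty file.
    return first_row == EXPECTED_HEADER
```

The full parser would then be called only from extract().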

Thank you

Martin Blais

Apr 17, 2016, 9:53:54 PM
to Beancount
On Sat, Apr 16, 2016 at 3:08 PM, Filippo Tampieri <filippo....@gmail.com> wrote:
Martin, are you aware of this?

That's a very good observation. I hadn't noticed. I had intended to make it work in the way that you want, but in the big conversion from LedgerHub I moved forward and failed to notice this.

 
If this is the way it is going to be, please let me know so I can use a lighter-weight parsing step for the identify() method and a beefier one for the extract() method.

I will fix this. 
FileMemo objects should be valid for the duration of the process, as you describe.
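
One way this could look, as a sketch only and not the actual fix: keep a module-level registry so that every request for the same file returns the same memo object. Memo, _memos and get_file are hypothetical names used for illustration, not the Beancount API.

```python
import os


class Memo:
    """Hypothetical stand-in for cache.FileMemo; not the Beancount class."""

    def __init__(self, name):
        self.name = name
        self._cache = {}

    def convert(self, converter):
        # Run each converter at most once per file object.
        if converter not in self._cache:
            self._cache[converter] = converter(self.name)
        return self._cache[converter]


_memos = {}  # Lives for the duration of the process.


def get_file(filename):
    """Return the single memo object for this file, creating it on first use."""
    key = os.path.abspath(filename)  # Normalize so './a.csv' and 'a.csv' match.
    if key not in _memos:
        _memos[key] = Memo(key)
    return _memos[key]
```

If both the identify and extract steps obtained their file objects through something like get_file(), a converter would run once per file no matter how many importers ask for it.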

(In addition, there ought to be an on-disk cache as well. I need to build that in, because I have this large PDF file which somehow takes forever to convert on Mac OS X.)
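
A rough sketch of what an on-disk cache might look like, under stated assumptions: the converter returns text, and a file counts as unchanged when its path, size and modification time are the same. cached_convert and cache_dir are hypothetical names; nothing like this exists in the code yet.

```python
import hashlib
import os


def cached_convert(filename, converter, cache_dir):
    """Return converter(filename), reusing a copy saved on disk when the
    source file has the same path, size and modification time."""
    stat = os.stat(filename)
    key = '{}:{}:{}:{}'.format(os.path.abspath(filename), stat.st_size,
                               stat.st_mtime_ns, converter.__name__)
    digest = hashlib.sha256(key.encode('utf-8')).hexdigest()
    cache_path = os.path.join(cache_dir, digest)

    # Cache hit: skip the slow conversion entirely.
    if os.path.exists(cache_path):
        with open(cache_path, encoding='utf-8', newline='') as infile:
            return infile.read()

    # Cache miss: convert, then save the result for the next run.
    text = converter(filename)
    os.makedirs(cache_dir, exist_ok=True)
    with open(cache_path, 'w', encoding='utf-8', newline='') as outfile:
        outfile.write(text)
    return text
```

Keying on size and modification time avoids hashing the whole input, at the cost of missing an edit that preserves both.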




 


Martin Blais

Apr 17, 2016, 11:01:14 PM
to Beancount