the Brill tagger demo refuses to pickle

Stan Szpakowicz

Oct 8, 2017, 2:11:26 PM
to nltk-users
I ran (in IDLE 3.5.2) the Brill tagger demo at <http://www.nltk.org/_modules/nltk/tbl/demo.html>. Next, I tried to rerun it and save the trained tagger. I made the simplest change, True instead of None in one definition:
 
def postag(
    #[...]
    serialize_output=True,
    #[...]

Now I am getting a runtime error.

Traceback (most recent call last):
  File "nltk.tbl.demo.py", line 382, in <module>
    demo_learning_curve()
  File "nltk.tbl.demo.py", line 111, in demo_learning_curve
    postag(incremental_stats=True, separate_baseline_data=True, learning_curve_output="learningcurve.png")
  File "nltk.tbl.demo.py", line 300, in postag
    pickle.dump(brill_tagger, print_rules)
TypeError: write() argument must be str, not bytes

The relevant lines 297-304:

    if serialize_output is not None:
        taggedtest = brill_tagger.tag_sents(testing_data)
        with open(serialize_output, 'w') as print_rules:
            pickle.dump(brill_tagger, print_rules)
        print("Wrote pickled tagger to {0}".format(serialize_output))
        with open(serialize_output, "r") as print_rules:
            brill_tagger_reloaded = pickle.load(print_rules)
        print("Reloaded pickled tagger from {0}".format(serialize_output))

I see nothing wrong here. I tried to change 'w' to 'wb' (no error now) and "r" to 'rb' (a new error):

  File "nltk.tbl.demo.py", line 302, in postag
    with open(serialize_output, 'rb') as print_rules:
OSError: [Errno 9] Bad file descriptor

The same error appears if I keep "r".

Please help!

Stan

Dimitriadis, A. (Alexis)

Oct 9, 2017, 12:46:21 AM
to nltk-...@googlegroups.com
I see nothing wrong here. I tried to change 'w' to 'wb' (no error now) and "r" to 'rb' (a new error):

  File "nltk.tbl.demo.py", line 302, in postag
    with open(serialize_output, 'rb') as print_rules:

The first argument to `open` must be a filename.
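
One possible reading of the `Bad file descriptor` message, which nobody in this thread confirms: besides a filename, `open` also accepts an integer file descriptor, and `True` is an `int` equal to 1, so `open(True, ...)` would refer to descriptor 1 rather than to a file on disk. The sketch below shows only that mechanism. It uses a private pipe (the names `read_fd`, `write_fd`, `sink`, and `source` are made up for illustration) so that the real standard output is left alone.

```python
import os

# bool is a subclass of int, so True can stand in for the integer 1.
print(isinstance(True, int), int(True))

# open() accepts an integer file descriptor as well as a path.
# A private pipe is used here so the process's real stdout is untouched.
read_fd, write_fd = os.pipe()

with open(write_fd, "wb") as sink:
    sink.write(b"hello")
# Leaving the with-block closed the descriptor behind `sink`.

with open(read_fd, "rb") as source:
    data = source.read()

print(data)
```

If this reading is correct, passing a string such as "tagger.pcl" for serialize_output avoids the descriptor issue altogether.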


Stan Szpakowicz

Oct 9, 2017, 7:09:32 PM
to nltk-users
On Sunday, October 8, 2017 at 9:46:21 PM UTC-7, Alexis wrote:
I see nothing wrong here. I tried to change 'w' to 'wb' (no error now) and "r" to 'rb' (a new error):

  File "nltk.tbl.demo.py", line 302, in postag
    with open(serialize_output, 'rb') as print_rules:

The first argument to `open` must be a filename.

You are absolutely right. My bad. But the problem persists. The demo program runs demo_learning_curve(), which worked wonderfully. I then ran another function, demo_serialize_tagger(), which calls postag(serialize_output="tagger.pcl") with a legitimate file name.

Here is what I got, quite the same error again:

Traceback (most recent call last):
  File "<pyshell#0>", line 1, in <module>
    demo_serialize_tagger()
  File "nltk.tbl.demo.py", line 126, in demo_serialize_tagger
    postag(serialize_output="tagger.pcl")
  File "nltk.tbl.demo.py", line 300, in postag
    pickle.dump(brill_tagger, print_rules)
TypeError: write() argument must be str, not bytes

Now what?

Thanks for any help!

Stan

Stan Szpakowicz

Oct 12, 2017, 3:24:26 AM
to nltk-users
Never mind. The author of the demo at <http://www.nltk.org/_modules/nltk/tbl/demo.html> kindly helped me (thanks!!). The program runs fine in Python 2. In Python 3, pickled files must be opened in binary mode: change 'r' to 'rb' and 'w' to 'wb' in the two blocks that begin with these lines:

if cache_baseline_tagger:

if serialize_output is not None:

Stan

Marcus

Oct 12, 2017, 3:24:51 AM
to nltk-users
This is a bug -- pickled files should always be read and written in binary mode ('rb' and 'wb', rather than 'r' and 'w'). With text mode, it'll work only by happy coincidence, under Python 2 or if the protocol used happens to be 0.
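
For anyone who wants to see the difference without training a tagger, here is a minimal sketch under Python 3. The dictionary `tagger_stand_in` is a made-up placeholder for the trained tagger, and the temporary path is illustrative; neither comes from the demo file.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the trained tagger; any picklable object will do.
tagger_stand_in = {"name": "stand-in tagger", "rules": ["placeholder rule"]}

path = os.path.join(tempfile.mkdtemp(), "tagger.pcl")

# Text mode: Python 3's pickle writes bytes, which a text-mode file rejects.
text_mode_error = None
try:
    with open(path, "w") as print_rules:
        pickle.dump(tagger_stand_in, print_rules)
except TypeError as exc:
    text_mode_error = exc

# Binary mode: the round trip succeeds.
with open(path, "wb") as print_rules:
    pickle.dump(tagger_stand_in, print_rules)
with open(path, "rb") as print_rules:
    reloaded = pickle.load(print_rules)

print(type(text_mode_error).__name__)
print(reloaded == tagger_stand_in)
```

The first half reproduces the TypeError from the traceback earlier in the thread; the second half is the same pattern with 'wb' and 'rb'.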

There are four occurrences in total in the file you refer to. I will submit a patch, but if you are impatient, here it is:

251c251
<             with open(cache_baseline_tagger, 'w') as print_rules:
---
>             with open(cache_baseline_tagger, 'wb') as print_rules:
254c254
<         with open(cache_baseline_tagger, "r") as print_rules:
---
>         with open(cache_baseline_tagger, "rb") as print_rules:
310c310
<         with open(serialize_output, 'w') as print_rules:
---
>         with open(serialize_output, 'wb') as print_rules:
313c313
<         with open(serialize_output, "r") as print_rules:
---
>         with open(serialize_output, "rb") as print_rules:


thanks,

Marcus