Efficiently switch io.Reader to another decoder on error

Rory Campbell-Lange

Jan 12, 2025, 6:53:29 PM
to golang-nuts
I'm looking to develop an alternative to an existing piece of code that reads email parts into byte slices and then returns these after decoding.

As library users may not wish to use these email parts, and because there are multiple byte slice copies being made, I'm attempting to rationalise the process by simply wrapping the provided io.Reader with the necessary decoders to reduce memory usage and unnecessary processing.

The wrapping strategy seems to work ok. However there is a particular issue in detecting base64.StdEncoding versus base64.RawStdEncoding, which requires draining the io.Reader using base64.StdEncoding and (based on the current implementation) switching to base64.RawStdEncoding if an io.ErrUnexpectedEOF is found.

I'd be grateful for any thoughts on the most efficient way of dealing with this type of issue, avoiding the need for lots of in-memory copies of, say, a 50MB email attachment. Unfortunately neither net/mail.Message.Body nor mime/multipart.Part, which provide the input to this func, provides a ReadSeeker.

Code snippet below.

Thanks!
Rory


// decodeContent wraps the content io.Reader in either a base64 or
// quoted printable decoder if applicable. It further wraps the reader
// in a transform character decoder if an encoding is supplied.
func decodeContent(content io.Reader, e encoding.Encoding, cte ContentTransferEncoding) io.Reader {

	var contentReader io.Reader

	switch cte {
	case cteBase64:
		contentReader = base64.NewDecoder(base64.StdEncoding, content)
		// ideally check for errors.Is(err, io.ErrUnexpectedEOF); switch decoder to
		// contentReader = base64.NewDecoder(base64.RawStdEncoding, content)

	case cteQuotedPrintable:
		contentReader = quotedprintable.NewReader(content)
	default:
		contentReader = content
	}

	if e == nil {
		return contentReader
	}
	return transform.NewReader(contentReader, e.NewDecoder())
}

robert engels

Jan 12, 2025, 7:05:59 PM
to Rory Campbell-Lange, golang-nuts
Create a ReadSeeker that wraps the Reader and provides the buffering (mark & reset). Normally the buffer only needs to be large enough to detect the format contained in the Reader.

You can search Google for PushbackReader in Go and you’ll get a basic implementation.
> --
> You received this message because you are subscribed to the Google Groups "golang-nuts" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts...@googlegroups.com.
> To view this discussion visit https://groups.google.com/d/msgid/golang-nuts/Z4QPbTZ4gemg9kwV%40campbell-lange.net.

Rory Campbell-Lange

Jan 12, 2025, 8:46:25 PM
to robert engels, golang-nuts
Thanks for the suggestion of a ReadSeeker to wrap an io.Reader.

My google fu must be deserting me. I can find PushbackReader implementations in Java, but the only similar thing for Go I could find was https://gitlab.com/osaki-lab/iowrapper. If you have a specific recommendation for a ReadSeeker wrapper to an io.Reader that would be great to know.

Since the base64 decoding error I'm looking for is an EOF, I guess the wrapper approach will not work when the EOF byte position is greater than the io.ReadSeeker buffer size.

Rory


Robert Engels

Jan 12, 2025, 8:54:28 PM
to Rory Campbell-Lange, golang-nuts

But yea, the basic premise is that you buffer the data so you can rewind if needed 

Are you certain it is reading to the end to return EOF? It may be returning eof once the parsing fails. 

Otherwise I would expect this is being decoded wrong - eg the mime type or encoding type should tell you the correct format before you start decoding. 


robert engels

Jan 12, 2025, 8:58:13 PM
to Rory Campbell-Lange, golang-nuts
Also, this is what Gemini provided which looks basically correct - but I think encapsulating it with a Rewind() method would be easier to understand.



While Go doesn't have a built-in PushbackReader like some other languages (e.g., Java), you can implement similar functionality using a custom struct and a buffer. 

Here's an example implementation: 

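package main

import (
	"bytes"
	"fmt"
	"io"
)

// PushbackReader wraps an io.Reader and lets callers push bytes back
// so that later Read calls return them again.
type PushbackReader struct {
	reader io.Reader
	buffer *bytes.Buffer // pushed-back bytes, served before reader
}

func NewPushbackReader(r io.Reader) *PushbackReader {
	return &PushbackReader{
		reader: r,
		buffer: new(bytes.Buffer),
	}
}

// Read serves pushed-back bytes first, then falls through to the
// underlying reader.
func (p *PushbackReader) Read(b []byte) (n int, err error) {
	if p.buffer.Len() > 0 {
		return p.buffer.Read(b)
	}
	return p.reader.Read(b)
}

// UnreadByte pushes a single byte back.
func (p *PushbackReader) UnreadByte(c byte) error {
	return p.Unread([]byte{c})
}

// Unread pushes buf back in front of any bytes still pending.
func (p *PushbackReader) Unread(buf []byte) error {
	pending := make([]byte, 0, len(buf)+p.buffer.Len())
	pending = append(pending, buf...)
	pending = append(pending, p.buffer.Bytes()...)
	p.buffer = bytes.NewBuffer(pending)
	return nil
}

func main() {
	// Example usage
	r := NewPushbackReader(bytes.NewBufferString("Hello, World!"))
	buf := make([]byte, 5)
	n, _ := r.Read(buf) // reads "Hello"
	r.Unread(buf[:n])   // push it back
	n, _ = r.Read(buf)  // reads "Hello" again
	fmt.Println(string(buf[:n]))
}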

Explanation: 
  • PushbackReader struct: This struct holds the underlying io.Reader and a buffer to store the pushed-back bytes. 
  • NewPushbackReader: This function creates a new PushbackReader from an existing io.Reader. 
  • Read method: This method reads bytes from either the buffer (if it contains data) or the underlying reader. 
  • UnreadByte method: This method pushes back a single byte into the buffer. 
  • Unread method: This method pushes back a slice of bytes into the buffer. 
Important Considerations: 
  • The buffer size is not managed automatically. You may need to adjust the buffer size based on your use case. 
  • This implementation does not handle pushing back beyond the initially read data. If you need to support arbitrary pushback, you'll need a more complex solution. 

Generative AI is experimental.


robert engels

Jan 12, 2025, 9:00:54 PM
to Rory Campbell-Lange, golang-nuts
Also, see this https://stackoverflow.com/questions/69753478/use-base64-stdencoding-or-base64-rawstdencoding-to-decode-base64-string-in-go as I expected the error should be reported earlier than the end of stream if the chosen format is wrong.

Rory Campbell-Lange

Jan 12, 2025, 9:29:32 PM
to robert engels, golang-nuts
Thanks very much for the links, pointers and possible solution.

Trying to read base64 standard (padded) encoded data with base64.RawStdEncoding can produce an error such as

illegal base64 data at input byte <n>

Reading base64 raw (unpadded) encoded data produces the EOF error.

I'll go with trying to read the standard encoded data up to maybe 1MB and then switch to base64.RawStdEncoding if I hit the "illegal base64 data" problem, maybe with reference to bufio.Reader which has most of the methods suggested below.

Yes, the use of a "Rewind" method would be crucial. I guess this would need to:
1. error if more than one buffer of data has been read
2. else re-read from byte 0

Thanks again very much for these suggestions.

Robert Engels

Jan 12, 2025, 9:42:29 PM
to Rory Campbell-Lange, golang-nuts
No worries - happy to help. One last thing: base64 coding is fairly trivial, and a cursory look shows that the padded version uses = signs. I suspect you could write a decoder that handles either form during the decoding.


Axel Wagner

Jan 13, 2025, 9:32:44 AM
to Rory Campbell-Lange, golang-nuts
Hi,

one way to solve your problem is to wrap the body into an io.Reader that strips off everything after the first `=` it finds. That can then be fed to base64.RawStdEncoding. This approach requires no extra buffering or copying and is easy to implement: https://go.dev/play/p/CwcVz7oietI

The downside is that this will not verify that the body is *either* correctly padded Base64 *or* unpadded Base64. So, it will not report an error if fed something like "AAA=garbage".
That can be remedied by buffering up to four bytes and, when encountering an EOF, check that there are at most three trailing `=` and that the total length of the stream is divisible by four. It's more finicky to implement, but it should also be possible without any extra copies and only requires a very small extra buffer.
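
For readers who cannot open the playground link, the simple strip-at-first-`=` idea might look roughly like the following. This is an independent sketch rather than the linked code; stripPadding and decodeLenient are hypothetical names, and it deliberately keeps the "AAA=garbage" weakness described above.

```go
package main

import (
	"bytes"
	"encoding/base64"
	"fmt"
	"io"
	"strings"
)

// stripPadding passes bytes through until the first '=' and then
// reports io.EOF, discarding the '=' and everything after it.
type stripPadding struct {
	r    io.Reader
	done bool
}

func (s *stripPadding) Read(p []byte) (int, error) {
	if s.done {
		return 0, io.EOF
	}
	n, err := s.r.Read(p)
	if i := bytes.IndexByte(p[:n], '='); i >= 0 {
		s.done = true
		return i, io.EOF
	}
	return n, err
}

// decodeLenient decodes padded or unpadded standard base64.
func decodeLenient(s string) (string, error) {
	r := &stripPadding{r: strings.NewReader(s)}
	b, err := io.ReadAll(base64.NewDecoder(base64.RawStdEncoding, r))
	return string(b), err
}

func main() {
	fmt.Println(decodeLenient("QUJDRA=="))    // padded input
	fmt.Println(decodeLenient("QUJDRA"))      // unpadded input
	fmt.Println(decodeLenient("AAA=garbage")) // accepted without error
}
```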


Axel Wagner

Jan 13, 2025, 10:39:27 AM
to Rory Campbell-Lange, golang-nuts
Just realized: If you twist the idea around, you get something easy to implement and more correct.
Instead of stripping padding if it exists, you can ensure that the body *is* padded to a multiple of 4 bytes: https://go.dev/play/p/SsPRXV9ZfoS
You can then feed that to base64.StdEncoding. If the wrapped Reader returns padded Base64, this does nothing. If it returns unpadded Base64, it adds padding. If it returns incorrect Base64, it will create a padded stream, that will then get rejected by the Base64 decoder.

Brian Candler

Jan 13, 2025, 10:41:05 AM
to golang-nuts
> I'm looking to develop an alternative to an existing piece of code that reads email parts into byte slices and then returns these after decoding

Aside: are you actually seeing invalid (unpadded) base64-encoded MIME parts in the wild?

Since a padded base64-encoded block will always be a multiple of 4 characters (excluding whitespace) I guess you could make a reader which counts these as it goes; if it sees EOF when N%4 is 2 or 3, it adds 2 or 1 "=" characters respectively.  (N%4 == 1 can never occur unless the base64 stream is corrupt).
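
A minimal sketch of that counting reader, assuming only '\r' and '\n' need to be excluded from the count. The names padReader and decodePadded are hypothetical. For N%4 == 1 this version appends three "=" so that the padded decoder rejects the stream instead of silently accepting it.

```go
package main

import (
	"encoding/base64"
	"fmt"
	"io"
	"strings"
)

// padReader counts significant bytes and, at EOF, appends the '='
// characters needed to reach a multiple of four.
type padReader struct {
	r     io.Reader
	count int    // non-newline bytes seen so far
	pad   []byte // padding still to emit once r is exhausted
	eof   bool
}

func (pr *padReader) Read(p []byte) (int, error) {
	if len(p) == 0 {
		return 0, nil
	}
	if !pr.eof {
		n, err := pr.r.Read(p)
		for _, c := range p[:n] {
			if c != '\r' && c != '\n' {
				pr.count++
			}
		}
		if err != io.EOF {
			return n, err
		}
		pr.eof = true
		if rem := pr.count % 4; rem != 0 {
			pr.pad = []byte("===")[:4-rem]
		}
		if n > 0 {
			return n, nil // padding follows on the next call
		}
	}
	if len(pr.pad) == 0 {
		return 0, io.EOF
	}
	n := copy(p, pr.pad)
	pr.pad = pr.pad[n:]
	return n, nil
}

// decodePadded always decodes with the padded decoder.
func decodePadded(s string) (string, error) {
	r := &padReader{r: strings.NewReader(s)}
	b, err := io.ReadAll(base64.NewDecoder(base64.StdEncoding, r))
	return string(b), err
}

func main() {
	fmt.Println(decodePadded("QUJDRA"))      // unpadded: "==" is added
	fmt.Println(decodePadded("QUJDRA=="))    // already padded: unchanged
	fmt.Println(decodePadded("QUJD\r\nRA"))  // newlines are not counted
	fmt.Println(decodePadded("QUJDR"))       // corrupt: decoder reports an error
}
```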

Rory Campbell-Lange

Jan 13, 2025, 1:10:05 PM
to Axel Wagner, golang-nuts
Thanks very much for the playground link and thoughts.

The use case is reading base64 email parts, which could be of a very large size. It is unclear when processing these parts if they are base64 padded or not.

I'm trying to avoid reading the entire email part into memory. Consequently I think your earlier idea of adding padding (or removing it) in a wrapper could work. Perhaps wrapping the reader with another using a bufio.Reader to track bytes read and detect EOF. At EOF the wrapper could add padding if needed.

Rory

Rory Campbell-Lange

Jan 13, 2025, 10:43:51 PM
to Axel Wagner, golang-nuts
As I wrote earlier, I'm trying to avoid reading the entire email part into memory to discover if I should use base64.StdEncoding or base64.RawStdEncoding.

The following seems to work reasonably well:

type B64Translator struct {
	br *bufio.Reader
}

func NewB64Translator(r io.Reader) *B64Translator {
	return &B64Translator{
		br: bufio.NewReader(r),
	}
}

// Read reads off the buffered reader expecting base64.StdEncoding bytes
// with (potentially) 1-3 '=' padding characters at the end.
// RawStdEncoding can be used for both StdEncoded and RawStdEncoded data
// if the padding is removed.
func (b *B64Translator) Read(p []byte) (n int, err error) {
	h := make([]byte, len(p))
	n, err = b.br.Read(h)
	if err != nil {
		return n, err
	}
	// to be optimised
	c := bytes.Count(h, []byte("="))
	copy(p, h[:n-c])
	// fmt.Println(string(h), n, string(p), n-c)
	return n - c, nil
}

https://go.dev/play/p/H6ii7Vy-8as

One odd thing is that I'm getting extraneous newlines (shown by stars in the output), eg:

--
raw: Bonjour joyeux lion
Qm9uam91ciwgam95ZXV4IGxpb24K
ok: false
decoded: Bonjour, joyeux lion* <-------------------- e.g. here
--
std: "Bonjour, joyeux lion"
IkJvbmpvdXIsIGpveWV1eCBsaW9uIg==
ok: true
decoded: "Bonjour, joyeux lion"
--

Any thoughts on that would be gratefully received.

Rory

robert engels

Jan 13, 2025, 11:09:14 PM
to Rory Campbell-Lange, Axel Wagner, golang-nuts
As has been pointed out, you don’t need to read the whole thing into memory: just wrap the data provider with one that adds the padding if it doesn’t exist, and always read with the padded decoder.

To add the padding you only need to keep track of the count of characters read before EOF to determine how many padding characters to synthetically add. If the original data is already correctly padded, this will be 0.

Rory Campbell-Lange

Jan 13, 2025, 11:35:26 PM
to robert engels, Axel Wagner, golang-nuts
I'm just doing the reverse of that, I think, by removing the padding.

I can't seem to trigger an EOF with this code below:

robert engels

Jan 13, 2025, 11:45:15 PM
to Rory Campbell-Lange, Axel Wagner, golang-nuts
You wouldn’t get an eof if the data is properly encoded. Not sure what the problem is.

You need to be doing something with the Reader - most likely writing to a file, streaming to a database record, etc.

I would simplify the code to a single test case that demonstrates the issue you are having with the code.

Brian Candler

Jan 14, 2025, 10:07:53 AM
to golang-nuts
> AS I wrote earlier, I'm trying to avoid reading the entire email part into memory to discover if I should use base64.StdEncoding or base64.RawStdEncoding.

As I asked before, why would you ever need to use RawStdEncoding? It just means the MIME part was invalid, most likely corrupted/truncated.

> One odd thing is that I'm getting extraneous newlines (shown by stars in the output), eg:

You are feeding two different inputs which do not differ by truncation alone.

% echo -n "Qm9uam91ciwgam95ZXV4IGxpb24K" | base64 -D | hexdump -c
0000000   B   o   n   j   o   u   r   ,       j   o   y   e   u   x
0000010   l   i   o   n  \n
0000015

% echo -n "IkJvbmpvdXIsIGpveWV1eCBsaW9uIg==" | base64 -D | hexdump -c
0000000   "   B   o   n   j   o   u   r   ,       j   o   y   e   u   x
0000010       l   i   o   n   "
0000016

The second one has encoded double-quotes before and after the content.

Brian Candler

Jan 14, 2025, 10:10:22 AM
to golang-nuts
Sorry ignore that, I hadn't checked your playground link.

Brian Candler

Jan 14, 2025, 10:23:50 AM
to golang-nuts
I was more or less right. The input string, which you encoded to "Qm9uam91ciwgam95ZXV4IGxpb24K", contains an encoded newline at the end. It's not spurious.

Confirmed by the "echo" pipeline I gave above, or in Go itself:
You can also confirm it by multiplying the length of the input by 3/4 

% echo -n "Qm9uam91ciwgam95ZXV4IGxpb24K" | wc -c
      28

28*3/4 = 21
B o n j o u r
, _ j o y e u
x _ l i o n \n

Rory Campbell-Lange

Jan 14, 2025, 2:54:22 PM
to Brian Candler, golang-nuts
Thanks for finding that foolish error, Brian.

To wrap the thread up, the implementation below seems to work ok for reading both base64.RawStdEncoding and base64.StdEncoding encoded data using the base64.RawStdEncoding decoder.

Example usage:

b64 := NewB64Translator(bytes.NewReader(encodedBytes))
b, err := io.ReadAll(base64.NewDecoder(base64.RawStdEncoding, b64))

The implementation:

type B64Translator struct {
	br *bufio.Reader
}

func NewB64Translator(r io.Reader) *B64Translator {
	return &B64Translator{
		br: bufio.NewReader(r),
	}
}

// Read reads off the buffered reader expecting base64.StdEncoding bytes
// with (potentially) 1-3 '=' padding characters at the end.
// RawStdEncoding can be used for both StdEncoded and RawStdEncoded data
// if the padding is removed.
func (b *B64Translator) Read(p []byte) (n int, err error) {
	h := make([]byte, len(p))
	n, err = b.br.Read(h)
	if n == 0 {
		return 0, err
	}
	// check if there is any padding in the last three bytes
	tail := h[:n]
	if n > 3 {
		tail = h[n-3 : n]
	}
	c := bytes.Count(tail, []byte("="))
	copy(p, h[:n-c])
	// pass err through so bytes returned alongside an error are not lost
	return n - c, err
}

For larger data the "tail" approach seems to have a tiny speed improvement over a naive bytes.Count(b, []byte("=")) over the whole buffer.

Thanks to everyone for their help.

Rory

roger peppe

Jan 14, 2025, 10:41:50 PM
to Rory Campbell-Lange, Brian Candler, golang-nuts
Tangentially related to this thread, a while back, I wrote a Go implementation of the base64 command that is agnostic about which encoding it reads (and can write all the possible encodings). It can be installed with:

It's arguably a little too lenient in what it accepts, but it works for me :)


Rory Campbell-Lange

Jan 14, 2025, 11:23:18 PM
to roger peppe, Brian Candler, golang-nuts
Thanks for the pointer, Roger.

After finally getting the normalising to rawstd base64 encoding to work I was trying to get my head around the fact that base64 content seems to often have several newlines around it.

Then I found encoding/base64, which has the func (r *newlineFilteringReader) Read(p []byte) (int, error) which elegantly resolves this.
https://cs.opensource.google/go/go/+/refs/tags/go1.23.4:src/encoding/base64/base64.go;l=622

I stole the function and simply added '=' in addition to '\n' and '\r' to the list of runes to skip. I'll see how I go with that but might need to look at your longer list of "garbage" runes.
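
A sketch of what such an adapted filter might look like. This is a reconstruction from the description above, not the actual code: the names padFilter and decodeFiltered are hypothetical, and the loop is only modelled on the standard library's unexported newlineFilteringReader.

```go
package main

import (
	"encoding/base64"
	"fmt"
	"io"
	"strings"
)

// padFilter drops '\r', '\n' and '=' from the wrapped stream so that
// the result can be fed to base64.RawStdEncoding.
type padFilter struct {
	wrapped io.Reader
}

func (f *padFilter) Read(p []byte) (int, error) {
	n, err := f.wrapped.Read(p)
	for n > 0 {
		offset := 0
		for _, b := range p[:n] {
			if b != '\r' && b != '\n' && b != '=' {
				p[offset] = b // compact kept bytes towards the front
				offset++
			}
		}
		if offset > 0 || err != nil {
			return offset, err
		}
		// Everything read so far was filtered out: read again.
		n, err = f.wrapped.Read(p)
	}
	return n, err
}

// decodeFiltered decodes padded or unpadded input, ignoring newlines.
func decodeFiltered(s string) (string, error) {
	r := &padFilter{wrapped: strings.NewReader(s)}
	b, err := io.ReadAll(base64.NewDecoder(base64.RawStdEncoding, r))
	return string(b), err
}

func main() {
	fmt.Println(decodeFiltered("QUJD\r\nRA==\r\n")) // padded, with newlines
	fmt.Println(decodeFiltered("QUJDRA"))           // unpadded
}
```

Like the strip-at-first-`=` approach earlier in the thread, this accepts `=` anywhere in the stream, so it trades validation for simplicity.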

I'm going to enjoy looking through the code. Thank you!

Rory