The limits you hit are among the reasons I think the library may have run its course. A Webduino server is very limited. One of those limits is that it can serve only one session at a time, so each request blocks every other request until it finishes. That makes it a poor fit for serving files to a modern web browser, which expects to be able to open several connections to a server at once. With just a couple of active clients, you quickly overload the connection queue in the W5100 chip.
As for optimization, it's tough. We have already done a huge amount to push strings into flash and to process connections byte by byte instead of storing buffers of data. There could be some additional savings from changing some of the string comparison work to use flash-based strings, but I doubt you'd get more than 100 bytes of SRAM back.
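To make the flash-based comparison idea concrete, here is a rough sketch of what I mean. This is not code from Webduino: the names `equalsFlash` and `kGetMethod` are made up for illustration, and the `#else` branch is only a host-side fallback I added so the snippet compiles off-target. On AVR, avr-libc already ships `strcmp_P`, which would likely do the same job.

```cpp
#include <stddef.h>
#include <stdint.h>

#ifdef __AVR__
#include <avr/pgmspace.h>
#else
// Assumption: host fallback so the sketch compiles off-target,
// where flash and RAM share one address space.
#ifndef PROGMEM
#define PROGMEM
#endif
#ifndef pgm_read_byte
#define pgm_read_byte(addr) (*(const uint8_t *)(addr))
#endif
#endif

// Hypothetical constant kept in flash instead of SRAM.
static const char kGetMethod[] PROGMEM = "GET";

// Returns true when the NUL-terminated RAM string `ram` matches the
// NUL-terminated flash string `flash`. Reads flash one byte at a time,
// so no copy of the flash string is ever made in SRAM.
static bool equalsFlash(const char *ram, const char *flash) {
  for (;;) {
    char f = (char)pgm_read_byte(flash++);
    if (*ram != f) return false;   // mismatch, or one string ended early
    if (f == '\0') return true;    // both strings ended together
    ++ram;
  }
}
```

You would call it as `equalsFlash(requestMethod, kGetMethod)`, where `requestMethod` is whatever RAM buffer holds the parsed token. The saving is only the bytes of each literal that would otherwise be copied into SRAM at startup, which is why I don't expect much from it.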
Writing to SD is more complicated than reading: you not only need to parse a FAT file structure, you also have to be able to allocate new structures, write directories, and buffer data to output to a file. There is some code out there for data logging to an SD card, but I've never used it.