converting publication lists to html

jack...@gmail.com

May 11, 2013, 1:26:16 AM

I have to convert a slew of publication lists to html and because this would be quite tedious, I decided to write a short script to do the work for me.

The simple script below reads each line of the file. If the line has text, it just prints the line. If the line is blank, it prints the closing "</div>" and the opening "<div>" with an in-line style set for the time being. When it encounters a 2nd consecutive blank line, it just prints the blank line; a flag records whether the previous line was blank.

while IFS="
" read -r line
do
if [ "$( echo "$line" |grep \"[0-z]\" )" = "" ]
then
if [ "$flag" = "false" ]
then
printf "%s\n" "</div>"
printf "%s\n" '<div style="margin-bottom: 0.75em">'
flag="true"
else
printf "%s\n" ""
fi
else
printf "%s\n" "$line"
flag="false"
fi
done < publications.doc

However, since this is often a Microsoft Word document, I also encounter Microsoft's non-ascii characters such as their:

beginning quote mark ( “ )
ending quote mark ( ” )
hyphen ( – )

Is there any way to test for these characters as I read each line? I just want to replace them with standard ascii quote marks and dashes.

Thanks.

Janis Papanagnou

May 11, 2013, 5:39:53 AM

On 11.05.2013 07:26, jack...@gmail.com wrote:
> I have to convert a slew of publication lists to html and because this would
> be quite tedious, I decided to write a short script to do the work for me.
>
> The simple script below reads each line of the file. If the line has text,
> it just prints the line. If the line is blank, it prints the closing "</div>"
> and the opening "<div>" with an in-line style set for the time-being. When it
> encounters a 2nd consecutive blank line, it just prints the blank line using
> the flags.

For such tasks (reading all lines consecutively, matching patterns, inserting
or replacing text) it is advantageous to switch to a more appropriate tool
(like awk, perl, ...); it will make the task a lot easier. For example...

awk '
NF { bl=0; print; next }
!bl { bl++; print "</div>\n<div style=\"...\">"; next }
{ print "" }
' publications.txt

(This is untested code.) Mind that on WinDOS you may have quoting issues;
in that case put the awk program (everything between the single quotes) in
a file and call that file with awk -f awkfile .

Some more notes on your shell code...

>
>
> while IFS="
> " read -r line
> do
> if [ "$( echo "$line" |grep \"[0-z]\" )" = "" ]

You can use a command pipeline with grep directly after 'if' and avoid
the test expression and comparison. To invert expressions use grep -v,
and newer shells allow negation of the whole command after 'if' with '!'.
The expression [0-z] is not safe; better use character classes.
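
A minimal sketch of those two suggestions applied to the quoted loop, assuming a POSIX shell and grep: grep -q runs directly after 'if', and the character class [[:alnum:]] replaces [0-z]. The function name wrap_divs is hypothetical. The flag starts as "true" so that a leading blank line is passed through, which is what the original script does while its flag is still unset.

```shell
#!/bin/sh
# wrap_divs (hypothetical name): reads text on stdin, writes HTML-ish text on stdout.
wrap_divs() {
    flag=true    # no text seen yet; a leading blank line is passed through
    while IFS= read -r line
    do
        # grep -q prints nothing; only its exit status is used. It succeeds
        # when the line contains at least one letter or digit.
        if printf '%s\n' "$line" | grep -q '[[:alnum:]]'
        then
            printf '%s\n' "$line"
            flag=false
        elif [ "$flag" = false ]
        then
            # First blank line after text: close one block, open the next.
            printf '%s\n' '</div>'
            printf '%s\n' '<div style="margin-bottom: 0.75em">'
            flag=true
        else
            # Further consecutive blank lines are printed as blank lines.
            printf '\n'
        fi
    done
}
```

It could be called as: wrap_divs < publications.txt > publications.html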

> then
> if [ "$flag" = "false" ]
> then
> printf "%s\n" "</div>"
> printf "%s\n" '<div style="margin-bottom: 0.75em">'
> flag="true"
> else
> printf "%s\n" ""
> fi
> else
> printf "%s\n" "$line"
> flag="false"
> fi
> done < publications.doc
>
>
> However, since this is often a Microsoft Word document, I also encounter
> Microsoft's non-ascii characters such as their:
>

> beginning quote mark ( “ )
> ending quote mark ( ” )
> hyphen ( – )

Note that such characters *may* be multi-byte encodings (not sure
what Word uses here). I'd export from Word in a portable standard
format, or convert the output with iconv before further processing.

>
>
> Is there any way to test for these characters as I read each line? I just
> want to replace them with standard ascii quote marks and dashes.

How such characters are handled depends on the tools used and
whether they support multi-byte character encodings. For character
replacements I use the tr command, for string replacements I use
the sed command. If you want the substitution embedded in the above
awk program you'd use the awk function gsub(), for example as in

{ gsub(/–/,"-",$0); gsub(/[“”]/,"\"",$0) }

Use the GNU gawk version of awk to be on the safe side.
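
A sketch of how the substitutions might sit inside the awk program from earlier in this post, assuming the input is already UTF-8. The function name pubs_to_html is hypothetical. The quotes are matched with an alternation rather than a bracket expression, so each one is treated as a literal string; that should also work in an awk that processes input byte by byte.

```shell
#!/bin/sh
# pubs_to_html (hypothetical name): reads the files named as arguments, or stdin.
pubs_to_html() {
    awk '
    {
        # Replace curly quotes and the en dash with ASCII equivalents.
        # Alternation avoids putting multi-byte characters in brackets.
        gsub(/“|”/, "\"")
        gsub(/–/, "-")
    }
    # Line with text: reset the blank-line counter and print the line.
    NF  { bl = 0; print; next }
    # First blank line after text: close one div and open the next.
    !bl { bl = 1; print "</div>\n<div style=\"margin-bottom: 0.75em\">"; next }
    # Any further consecutive blank line is printed as an empty line.
        { print "" }
    ' "$@"
}
```

It could be called as: pubs_to_html publications.txt > publications.html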

Janis

>
> Thanks.
>

reinhard...@aon.at

May 12, 2013, 9:16:48 AM

In article <83807149-be9e-4b6b...@googlegroups.com>,
jack...@gmail.com says...

>
> I have to convert a slew of publication lists to html and because this would be quite tedious, I decided to write a short script to do the work for me.
>
> The simple script below reads each line of the file. If the line has text, it just prints the line. If the line is blank, it prints the closing "</div>" and the opening "<div>" with an in-line style set for the time-being. When it encounters a 2nd consecutive blank line, it just prints the blank line using the flags.
>
>
> while IFS="
> " read -r line
> do
> if [ "$( echo "$line" |grep \"[0-z]\" )" = "" ]
> then
> if [ "$flag" = "false" ]
> then
> printf "%s\n" "</div>"
> printf "%s\n" '<div style="margin-bottom: 0.75em">'
> flag="true"
> else
> printf "%s\n" ""
> fi
> else
> printf "%s\n" "$line"
> flag="false"
> fi
> done < publications.doc
>
>
> However, since this is often a Microsoft Word document, I also encounter Microsoft's non-ascii characters such as their:
>

> beginning quote mark ( “ )
> ending quote mark ( ” )
> hyphen ( – )

>
>
> Is there any way to test for these characters as I read each line? I just want to replace them with standard ascii quote marks and dashes.
>
> Thanks.

Hi Jack,
I would use tr and awk, like this:
tr -d '<unliked.chars>' < inputfile | awk '<an awk script>' > output.html
If you show me an example of your input, I could give you a better hint.
Regards, Reinhard

richard...@googlemail.com

May 14, 2013, 11:09:15 AM

On Saturday, May 11, 2013 6:26:16 AM UTC+1, jack...@gmail.com wrote:
> [...]

>
> However, since this is often a Microsoft Word document, I also encounter Microsoft's non-ascii characters such as their:
>
> beginning quote mark ( “ )
> ending quote mark ( ” )
> hyphen ( – )
>
> Is there any way to test for these characters as I read each line? I just want to replace them with standard ascii quote marks and dashes.
>

iconv will be able to fix it ...

$ cat x

beginning quote mark ( “ )

ending quote mark ( ” )

hyphen ( – )

$ iconv -f UTF8 -t ASCII//TRANSLIT x

beginning quote mark ( " )

ending quote mark ( " )

hyphen ( - )

You may need WINDOWS-1252 (or ISO_8859-1, or ...) instead of UTF8, depending on which encoding Word is using.
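
The result of //TRANSLIT can differ between iconv implementations, so here is a more explicit sketch as an alternative: convert to UTF-8 first, then replace only the three characters named in this thread with sed. The function name ascii_punct is hypothetical, and the source encoding is an assumption the caller has to supply as the first argument.

```shell
#!/bin/sh
# ascii_punct (hypothetical name): $1 is the encoding of the text on stdin,
# for example WINDOWS-1252 or UTF-8. Output is UTF-8 with the curly quotes
# and the en dash replaced by ASCII characters; everything else is untouched.
ascii_punct() {
    iconv -f "$1" -t UTF-8 |
    sed -e 's/“/"/g' -e 's/”/"/g' -e 's/–/-/g'
}
```

It could be called as: ascii_punct WINDOWS-1252 < publications.txt > clean.txt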