New line characters in HTML text


Qi Z

Jun 24, 2021, 9:24:01 AM
to Google Cloud Translation API
Hi,
I'm having trouble using the Google Cloud Translation API to translate HTML text. In HTML mode, the translator returns a string without any newline characters, so the translated HTML is hard to read because everything ends up on one line. Can I customize the translator's behavior to keep the line breaks? Using "text" mode instead sometimes translates the HTML tags and modifies my HTML entities, which is not ideal.
More details can be found in my post on StackOverflow:
https://stackoverflow.com/questions/68112551/how-to-use-google-cloud-translator-for-html-text-and-preserve-the-line-breaks
Any help is appreciated!

Music Li

Jun 28, 2021, 1:50:53 PM
to Google Cloud Translation API
Hi,

In HTML, a line break is expressed with a <br> tag rather than a newline character. If you want to preserve newlines, replace them with <br> tags and the HTML translation will keep them.
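
A minimal sketch of that round trip, assuming the newlines sit in text content rather than inside a tag. The helper names `newlines_to_br` and `br_to_newlines` are made up for illustration, and the API call itself is left out:

```python
import re

def newlines_to_br(text):
    """Replace each newline with a <br> tag before sending the text for translation."""
    return text.replace("\r\n", "\n").replace("\n", "<br>")

def br_to_newlines(html):
    """Turn <br>, <br/> and <br /> (any letter case) back into newline characters."""
    return re.sub(r"<br\s*/?>", "\n", html, flags=re.IGNORECASE)

original = "First line\nSecond line"
encoded = newlines_to_br(original)
print(encoded)
print(br_to_newlines(encoded) == original)
```

Note that any <br> tags already present in the input would also come back as newlines, so this only round-trips cleanly when the source contains none.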

Best,
Music

Qi Z

Jul 2, 2021, 3:21:21 AM
to Google Cloud Translation API
Hi,
My intent is to preserve the line breaks in the original HTML source, not in the rendered output. For example, I want
<img
 src="..."
 alt="text"
 title="title">
to be translated as
<img
 src="..."
 alt="translated text"
 title="translated title">,
but the Translator API outputs <img src="..." alt="translated text" title="translated title">. This does not affect how the HTML is rendered, but it hurts the readability of the HTML source for the programmer when a tag is long.
Your answer suggests replacing \n with <br> before sending the text for translation, and replacing <br> with \n in the translated output. However, this is not an ideal approach. In the example above, the input becomes <img <br> src="..." <br> alt="text" <br> title="title">, which disrupts the structure of the <img> tag and produces undesired output. The StackOverflow answer uses a similar approach and causes the same problem.
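
The problem can be reproduced without calling the API. The sketch below applies the naive replacement to the <img> example and uses a simple regular expression (an illustration, not a real HTML parser) to flag a <br> that lands between a tag's `<` and `>`. The function name `has_br_inside_tag` is hypothetical:

```python
import re

tag = '<img\n src="..."\n alt="text"\n title="title">'

# Naive approach: every newline becomes <br>, including those inside the tag.
naive = tag.replace("\n", "<br>")
print(naive)  # <img<br> src="..."<br> alt="text"<br> title="title">

def has_br_inside_tag(html):
    """True if a <br> appears after a '<' that has not yet been closed by '>'."""
    return re.search(r"<[^<>]*<br>", html) is not None

print(has_br_inside_tag(naive))
print(has_br_inside_tag("<p>line one<br>line two</p>"))
```

The check returns True for the mangled <img> tag and False for a <br> that sits in ordinary text content. It can be fooled by a literal `<` or `>` inside an attribute value.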

What would be an alternative solution?

Juan Carlos Barroso Ruiz

Jul 6, 2021, 9:26:42 AM
to Google Cloud Translation API
Hello,

Here's my reproduction using the Python client library. I used the PyPI package "beautifulsoup4" to reformat the one-line string returned by the API into indented, readable HTML. The code is a slightly modified version of what you can find in this Quickstart.

```
from bs4 import BeautifulSoup as bs # to format html text

def translate_text(target, text):
    """Translates text into the target language.

    Target must be an ISO 639-1 language code.
    """
    import six
    from google.cloud import translate_v2 as translate

    translate_client = translate.Client()

    if isinstance(text, six.binary_type):
        text = text.decode("utf-8")

    # Text can also be a sequence of strings, in which case this method
    # will return a sequence of results for each text.
    result = translate_client.translate(text, target_language=target, format_="html")  # note that format_ is set to "html"

    return result["translatedText"]

html_str = """
<p><a class="selfLink" id="notes" href="#notes" rel="help"><strong>Notes</strong></a>
<ul>
<li><a class="selfLink" id="disclaimer" href="#disclaimer" rel="help">DISCLAIMER OF LIABILITY</a> 
"""
result = translate_text("es", html_str) # translate to Spanish
soup = bs(result, "html.parser")  # name the parser explicitly to avoid bs4's parser-guessing warning
print(soup.prettify())
```
Output:

```
<p>
 <a class="selfLink" href="#notes" id="notes" rel="help">
  <strong>
   Notas
  </strong>
 </a>
 <ul>
  <li>
   <a class="selfLink" href="#disclaimer" id="disclaimer" rel="help">
    RENUNCIA DE RESPONSABILIDAD
   </a>
  </li>
 </ul>
</p>

```
This is not exactly the input format, but it is still easy to read.
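
If the source layout has to survive exactly, one option is to leave the markup alone and send only the human-readable strings to the API. Here is a minimal sketch under several assumptions: only double-quoted `alt` and `title` attributes are handled, a regular expression stands in for a real HTML parser, and `translate_attributes` and its `translate` callback are hypothetical names (in practice the callback could wrap the `translate_text` function above).

```python
import re

# Matches alt="..." or title="..." with double-quoted values only (an assumption).
# The lookbehind skips names such as data-alt or subtitle.
ATTR_RE = re.compile(r'((?<![\w-])(?:alt|title)\s*=\s*")([^"]*)(")')

def translate_attributes(html, translate):
    """Translate alt/title values in place; every other character,
    including newlines and indentation, is left untouched."""
    return ATTR_RE.sub(
        lambda m: m.group(1) + translate(m.group(2)) + m.group(3), html
    )

source = '<img\n src="..."\n alt="text"\n title="title">'
# str.upper stands in for the real translation call so the sketch runs offline.
print(translate_attributes(source, str.upper))
```

Because only the attribute values are substituted, the line breaks inside the tag stay where they were. The pattern would also match text content that happens to look like `alt="..."`, and a translation containing a double quote would need escaping before it is written back.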

I hope this was of help to you.

Kind regards,

Juan Carlos