Google App Engine: Python encoding does not work properly

Maksym Polynskyi

Sep 12, 2015, 11:39:52 AM

to Google App Engine

I am parsing a site. When I run the code on my PC, I get clean HTML. But when I run it on Google App Engine, I get wrongly encoded text like this:

 o�Ƭ"3��&�C�]���lu}�j   ���?��l�.�3 Y�e?�bG���c�����No�ј�}�����e�(�NK��S$T�8]I�G֤�TZְ
The encoding is ISO-8859-1. Python can't parse the page on Google App Engine. How can I fix this?

Here is a video: https://www.youtube.com/watch?v=MnIUk5QkHZU

Here is the parser:

# -*- coding: utf-8 -*-
import requests
import lxml.html

class Rutor:
    def __init__(self, title, year='', qu=''):
        self.title = title
        self.year = year
        self.qu = qu
        self.main_domain = 'http://www.rutor.org/'
        self.search_params = '/search/0/1/100/0/'  # only New movies
        self.search_text = ""
        self.count = 0
        self.result = {}

    def construct_search_text(self):
        l = [self.title, self.year, self.qu]
        l = filter(None, l)
        search_text = " ".join(l)
        self.search_text = search_text
        return self.search_text[:]  # [:] - magic

    def construct_search_url(self):
        search_link = "".join((self.main_domain, self.search_params, self.construct_search_text()))
        print(search_link)
        return search_link

    def get_page_sourse(self):
        r = requests.get(self.construct_search_url())
        print("encoding is: "+r.encoding)
        return r.text.encode(r.encoding)  # r.encoding returns the codec that was used

    def parse_it(self):
        all_torrent_links_xpath = "//div[@id='index']//a[starts-with(@href, '/torrent')]"
        page = lxml.html.document_fromstring(self.get_page_sourse())
        print(self.get_page_sourse())  # printing the page source here for Stack Overflow (note: this fetches the page a second time)
        all_torrent_links = page.xpath(all_torrent_links_xpath)
        if all_torrent_links:
            for link in all_torrent_links:
                print(link)
                if u'трейлер' not in link.text.lower():  # we don't need trailers
                    title = link.text_content()
                    torrent_file = link.getprevious().getprevious().attrib['href']
                    magnet = link.getprevious().attrib['href']
                    self.result[self.count] = {'title': title[:], 'torrent_file': torrent_file, 'magnet': magnet}
                    # [:] is used because the type of title is 'lxml.etree._ElementUnicodeResult', not <unicode>,
                    # as a result of lxml.html fromstring()
                    self.count += 1


if __name__ == '__main__':
    m = Rutor('Avengers: Age of Ultron', '2015', '1080p')
    m.parse_it()
    print(m.result)


If I run it in Sublime Text, I get a clean HTML page source and m.result is not empty. However, if I run this code on Google App Engine with Flask:

from Rutor import Rutor  # import the class, not just the module
...
@app.route('/test')
def test():
    m = Rutor('Avengers: Age of Ultron', '2015', '1080p')
    m.parse_it()
    pprint(m.result)
    return 'test'

I get wrongly encoded text in my console, and m.result is empty.

I haven't been able to fix this for more than 2 weeks. Please help me.

Nick (Cloud Platform Support)

Sep 14, 2015, 12:27:19 PM

to Google App Engine

Hey Maksym,

A thread like this is off-topic for Google Groups and should be posted to Stack Overflow. While I'd like to help you, this isn't the place to do it. If you post to stackoverflow.com, which we also monitor, you'll be in touch with a much larger user base of people who can help you, in a format designed for that purpose. This forum isn't meant for specific 1-on-1 technical issues, but for general discussion of the platform and its services.

One piece of advice I'd have before you post there is to print out the headers of the request and the headers of the response, and to also provide the printed output (or as much as is relevant, probably leaving out the actual response body other than a few characters) of the program when you post your question, rather than sending a video.
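A minimal sketch of the kind of summary described above, using only the standard library. The helper name `summarize_exchange` and the sample header values are hypothetical, made up for illustration; they are not taken from the actual rutor.org response.

```python
def summarize_exchange(request_headers, response_headers, body, preview=16):
    """Build a printable summary: both header sets plus the first raw body bytes."""
    lines = ["--- request headers ---"]
    for name in sorted(request_headers):
        lines.append("%s: %s" % (name, request_headers[name]))
    lines.append("--- response headers ---")
    for name in sorted(response_headers):
        lines.append("%s: %s" % (name, response_headers[name]))
    lines.append("--- first %d body bytes ---" % preview)
    # repr() keeps non-printable bytes readable instead of dumping garbage to the console
    lines.append(repr(body[:preview]))
    return "\n".join(lines)


# Hypothetical sample values, for illustration only
sample_request = {"Accept-Encoding": "gzip, deflate", "User-Agent": "example"}
sample_response = {"Content-Type": "text/html", "Content-Encoding": "gzip"}
sample_body = b"\x1f\x8b\x08\x00" + b"\x00" * 40

summary = summarize_exchange(sample_request, sample_response, sample_body)
print(summary)
```

With the `requests` library from the original post, the three arguments would presumably be `r.request.headers`, `r.headers`, and `r.content` (the raw bytes, not `r.text`).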

From a quick glance at the code and output, it seems that the encoding is wrong for the text, although I wasn't able to take the response text and use a utf8-to-latin1 (ISO-8859-1) transcoder to find anything meaningful. So it's also possible that the response payload is gzipped and the devserver doesn't automatically un-gzip it, although the native Python implementation does? These are just guesses, so you'll need to create a Stack Overflow question and provide more information in order to get a 100% solution.

I wish you the best of luck in getting your question addressed on Stack Overflow,
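The gzip guess can be checked without any network access: a gzip stream always starts with the two bytes 0x1f 0x8b. The sketch below is only an illustration of that check. The helper names are hypothetical, and the utf-8 default charset is an assumption, since the thread does not say which charset the site really uses.

```python
import gzip
import io
import zlib

GZIP_MAGIC = b"\x1f\x8b"  # every gzip stream starts with these two bytes


def looks_gzipped(body):
    """True if the raw response body starts with the gzip magic number."""
    return body[:2] == GZIP_MAGIC


def maybe_gunzip(body):
    """Decompress the body if it is gzipped, otherwise return it unchanged."""
    if looks_gzipped(body):
        # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer
        return zlib.decompress(body, 16 + zlib.MAX_WBITS)
    return body


def decode_page(body, charset="utf-8"):
    """Decode raw bytes to text; the charset default is an assumption."""
    return maybe_gunzip(body).decode(charset, "replace")


# Self-contained demo: gzip a small page in memory, then recover it
text = u"<a href='/torrent/1'>\u0442\u0440\u0435\u0439\u043b\u0435\u0440</a>"
sample = text.encode("utf-8")
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb") as f:
    f.write(sample)
compressed = buf.getvalue()

print(looks_gzipped(compressed))
print(decode_page(compressed) == text)
```

If `looks_gzipped` returns true for the raw body received on App Engine, the body would need to be decompressed before being handed to `lxml`; if it returns false, the problem lies elsewhere.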

Nick

Nick (Cloud Platform Support)

Sep 14, 2015, 12:30:21 PM

to Google App Engine

Hey Maksym,

I've also just noticed that you posted this issue to the public issue tracker. Please refrain from cross-posting and instead try to find the most accurate place to make your thread. Given that this could possibly be a problem in the SDK local implementation of the UrlFetch service (which the App Engine version of "requests" uses), posting there is also a valid choice. Please feel free to provide the information requested in my last message over there at the public issue tracker thread, and I'll be glad to assist.

Sincerely,

Nick
