Advice needed


serikbek...@gmail.com

Aug 8, 2017, 3:31:46 PM
To: Common Crawl
Hello, 
I have a problem with my code :( I'm trying to access the data in WARC files in order to get the IP address, but it doesn't show up. Could you please help me with it? Also, do you know how I can get information such as which libraries sites are using, which scripting technologies, and so on?

Thank you so much ;)))


import ujson as json
import logging
import pprint

from sparkcc import CCSparkJob


class ServerCountJob(CCSparkJob):
    """ Count server names sent in HTTP response header
        (WARC and WAT is allowed as input)"""

    name = "CountServers"
    fallback_server_name = '(no server in HTTP header)'
    fallback_ip_address = '(no ip_address in WARC header)'

    def process_record(self, record):
        server_name = None
        ip_address = None
        if self.is_wat_json_record(record):
            # WAT (response) record
            record = json.loads(record.content_stream().read())
            try:
                payload = record['Envelope']['Payload-Metadata']
                if 'HTTP-Response-Metadata' in payload:
                    server_name = payload['HTTP-Response-Metadata'] \
                                         ['Headers'] \
                                         ['Server'] \
                                         .strip()
                else:
                    server_name = 'NULL'

                payload = record['Envelope']
                if 'WARC-IP-Address' in payload:
                    ip_address  = payload['WARC-IP-Address'] \
                                         .strip()
                else:
                    ip_address = 'NULL'
            except KeyError:
                pass
        elif record.rec_type == 'response':
            # WARC response record
            server_name = record.http_headers.get_header('server', None)
        else:
            # warcinfo, request, non-WAT metadata records
            return

        if server_name and server_name != '':
            yield server_name, 1
        else:
            yield ServerCountJob.fallback_server_name, 1

        if ip_address and ip_address  != '':
            yield ip_address, 1
        else:
            yield ServerCountJob.fallback_ip_address, 1


if __name__ == "__main__":
    job = ServerCountJob()
    job.run()

Sebastian Nagel

Aug 9, 2017, 7:26:52 AM
To: common...@googlegroups.com
Hi,

At first glance, you're on the right track:

- the correct path to get the IP address from a WAT record is
record['Envelope']['WARC-Header-Metadata']['WARC-IP-Address']
(it's a WARC header field, so it's not part of the payload metadata or the HTTP response headers)

- for a WARC record you'll get the IP address via
record.rec_headers.get_header('WARC-IP-Address')

- if you only need the IP address, the smartest way is to use the
robots.txt data set
http://commoncrawl.org/2016/09/robotstxt-and-404-redirect-data-sets/
It's smaller than the main WARC or WAT data sets, and a robots.txt
is fetched (successfully or not) for every server crawled
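
The WAT lookup described above can be sketched roughly as follows. This is only an illustration: it assumes the WAT record has already been parsed into a dict (as the job does with json.loads), and the helper name `extract_server_and_ip` and the trimmed sample record are made up for this example, not taken from a real crawl.

```python
import json


def extract_server_and_ip(wat_record):
    """Return (server, ip_address) from a parsed WAT JSON record.

    Either value is None when the corresponding field is missing.
    (Hypothetical helper, not part of sparkcc.)
    """
    envelope = wat_record.get('Envelope', {})
    # WARC-IP-Address sits in the WARC header metadata, not in the payload
    ip_address = envelope.get('WARC-Header-Metadata', {}).get('WARC-IP-Address')
    # The Server header is part of the HTTP response metadata
    server = (envelope.get('Payload-Metadata', {})
                      .get('HTTP-Response-Metadata', {})
                      .get('Headers', {})
                      .get('Server'))
    return (server.strip() if server else None,
            ip_address.strip() if ip_address else None)


# Made-up, heavily trimmed WAT record for illustration only
sample = json.loads('''
{"Envelope": {
    "WARC-Header-Metadata": {"WARC-Type": "response",
                             "WARC-IP-Address": "192.0.2.10"},
    "Payload-Metadata": {"HTTP-Response-Metadata": {
        "Headers": {"Server": " nginx "}}}}}
''')

print(extract_server_and_ip(sample))  # ('nginx', '192.0.2.10')
```

In the WARC branch of process_record, the equivalent would be the record.rec_headers.get_header('WARC-IP-Address') call mentioned above, next to the existing lookup of the server header.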

> what libraries their using, what scripting technologies and so on ...

Just a couple of examples in this direction:

- servers and backends
http://norvigaward.github.io/naward01/doc/index.html
- usage of RSS feeds
https://exascale.info/Quantifying-Syndication-Feeds-Usage-on-the-Web/
- wordpress themes
https://medium.com/@paulrim/mining-common-crawl-with-php-39e14082c55c
- Google Analytics snippets
https://habrahabr.ru/post/268205/
- tracking snippets
https://ssc.io/trackingthetrackers/
http://smerity.com/cs205_ga/

You will probably need to analyze the HTML payload in WARC records
directly; the WAT extracts may not be sufficient.
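
As a rough idea of what analyzing the HTML payload could look like, here is a minimal sketch using only Python's standard library. It collects external script URLs and the `<meta name="generator">` value as simple hints about libraries and site software. The class `TechHintParser`, the function `tech_hints`, and the sample page are invented for illustration; real detection would need many more rules.

```python
from html.parser import HTMLParser


class TechHintParser(HTMLParser):
    """Collect external script URLs and the <meta name="generator"> value."""

    def __init__(self):
        super().__init__()
        self.script_srcs = []
        self.generator = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'script' and attrs.get('src'):
            # External scripts often reveal the JavaScript libraries in use
            self.script_srcs.append(attrs['src'])
        elif tag == 'meta' and (attrs.get('name') or '').lower() == 'generator':
            # Many CMSs announce themselves in a generator meta tag
            self.generator = attrs.get('content')


def tech_hints(html):
    """Return (generator, [script src, ...]) for an HTML string."""
    parser = TechHintParser()
    parser.feed(html)
    return parser.generator, parser.script_srcs


# Made-up page for illustration only
sample_html = '''<html><head>
<meta name="generator" content="WordPress 4.8">
<script src="/js/jquery.min.js"></script>
<script>var x = 1;</script>
</head><body></body></html>'''

print(tech_hints(sample_html))  # ('WordPress 4.8', ['/js/jquery.min.js'])
```

In the Spark job, the HTML string would presumably come from the payload of a WARC response record, decoded to text (for example with errors='replace') before being passed to such a function.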


Best,
Sebastian
> --
> You received this message because you are subscribed to the Google Groups "Common Crawl" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to
> common-crawl...@googlegroups.com <mailto:common-crawl...@googlegroups.com>.
> To post to this group, send email to common...@googlegroups.com
> <mailto:common...@googlegroups.com>.
> Visit this group at https://groups.google.com/group/common-crawl.
> For more options, visit https://groups.google.com/d/optout.

Aigerim Serikbekova

Aug 9, 2017, 9:33:33 AM
To: common...@googlegroups.com
Thank you so much!!!

