Advice needed

serikbek...@gmail.com

Aug 8, 2017, 3:31:46 PM

to Common Crawl

Hello,

I have a problem with my code :( I'm trying to access the data in WARC files in order to get the IP address, but it doesn't show up. Could you please help me with it? Also, do you know how I can get information such as which libraries the sites are using, which scripting technologies, and so on?

Thank you so much ;)))

import ujson as json
import logging
import pprint

from sparkcc import CCSparkJob


class ServerCountJob(CCSparkJob):
    """ Count server names sent in HTTP response header
        (WARC and WAT is allowed as input)"""

    name = "CountServers"
    fallback_server_name = '(no server in HTTP header)'
    fallback_ip_address = '(no ip_address in WARC header)'

    def process_record(self, record):
        server_name = None
        ip_address = None

        if self.is_wat_json_record(record):
            # WAT (response) record
            record = json.loads(record.content_stream().read())
            try:
                payload = record['Envelope']['Payload-Metadata']
                if 'HTTP-Response-Metadata' in payload:
                    server_name = payload['HTTP-Response-Metadata'] \
                        ['Headers'] \
                        ['Server'] \
                        .strip()
                else:
                    server_name = 'NULL'
                payload = record['Envelope']
                if 'WARC-IP-Address' in payload:
                    ip_address = payload['WARC-IP-Address'] \
                        .strip()
                else:
                    ip_address = 'NULL'
            except KeyError:
                pass
        elif record.rec_type == 'response':
            # WARC response record
            server_name = record.http_headers.get_header('server', None)
        else:
            # warcinfo, request, non-WAT metadata records
            return

        if server_name and server_name != '':
            yield server_name, 1
        else:
            yield ServerCountJob.fallback_server_name, 1
        if ip_address and ip_address != '':
            yield ip_address, 1
        else:
            yield ServerCountJob.fallback_ip_address, 1


if __name__ == "__main__":
    job = ServerCountJob()
    job.run()

Sebastian Nagel

Aug 9, 2017, 7:26:52 AM

to common...@googlegroups.com

Hi,

At first glance you're on the right track:

- the correct path to get the IP address from a WAT record is
record['Envelope']['WARC-Header-Metadata']['WARC-IP-Address']
(it's not part of the payload metadata or the HTTP response headers)

- for a WARC record you'll get the IP address via
record.rec_headers.get_header('WARC-IP-Address')

- if it's only about the IP address: the smartest way is to use the
robots.txt data set
http://commoncrawl.org/2016/09/robotstxt-and-404-redirect-data-sets/
It's smaller than the main WARC or WAT data sets, and the robots.txt
is fetched (successfully or not) for every server crawled
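To make the first two points concrete, here is a minimal sketch of both lookups. The helper names wat_ip_address and warc_ip_address and the sample record are made up for illustration; only the key path and the rec_headers.get_header() call come from the points above. It assumes the WAT record has already been parsed with json.loads() and that the WARC record object exposes rec_headers, as in the job code.

```python
def wat_ip_address(wat_record):
    """Return the IP address from a parsed WAT JSON record, or None."""
    try:
        # The IP address lives in the WARC header metadata,
        # not in 'Payload-Metadata' and not directly under 'Envelope'.
        ip = wat_record['Envelope']['WARC-Header-Metadata']['WARC-IP-Address']
    except (KeyError, TypeError):
        return None
    return ip.strip() or None


def warc_ip_address(record):
    """Return the IP address from a WARC record object, or None."""
    ip = record.rec_headers.get_header('WARC-IP-Address')
    return ip.strip() if ip else None


# Hand-written sample for illustration only, not real crawl data.
sample_wat = {
    'Envelope': {
        'WARC-Header-Metadata': {'WARC-IP-Address': '192.0.2.10'},
        'Payload-Metadata': {},
    }
}
print(wat_ip_address(sample_wat))
```

In process_record() the WAT branch would then set ip_address = wat_ip_address(record) after the json.loads() call, and the WARC response branch would set ip_address = warc_ip_address(record).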

> what libraries their using, what scripting technologies and so on ...

Here are a few examples that go in this direction:

- servers and backends
http://norvigaward.github.io/naward01/doc/index.html
- usage of RSS feeds
https://exascale.info/Quantifying-Syndication-Feeds-Usage-on-the-Web/
- wordpress themes
https://medium.com/@paulrim/mining-common-crawl-with-php-39e14082c55c
- Google Analytics snippets
https://habrahabr.ru/post/268205/
- tracking snippets
https://ssc.io/trackingthetrackers/
http://smerity.com/cs205_ga/

You will probably need to analyze the HTML payload in the WARC records
directly; the WAT extracts may not be sufficient.
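As a rough illustration of what analyzing the HTML payload could look like, the sketch below collects external script URLs and "generator" meta tags, which often hint at the JavaScript libraries or CMS in use. It uses only Python's standard html.parser module; the class TechHintParser and the function tech_hints are hypothetical names, and the sketch assumes the HTML has already been read from the WARC response record and decoded to a string. Real detection would need many more signals than these two.

```python
from html.parser import HTMLParser


class TechHintParser(HTMLParser):
    """Collect <script src=...> URLs and <meta name="generator"> values."""

    def __init__(self):
        super().__init__()
        self.script_sources = []
        self.generators = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        attrs = dict(attrs)
        if tag == 'script' and attrs.get('src'):
            self.script_sources.append(attrs['src'])
        elif tag == 'meta' and (attrs.get('name') or '').lower() == 'generator':
            if attrs.get('content'):
                self.generators.append(attrs['content'])


def tech_hints(html):
    """Return (script URLs, generator strings) found in an HTML string."""
    parser = TechHintParser()
    parser.feed(html)
    return parser.script_sources, parser.generators


# Hand-written sample page for illustration only.
sample_html = (
    '<html><head>'
    '<meta name="Generator" content="WordPress 4.8">'
    '<script src="/js/jquery.min.js"></script>'
    '<script>var x = 1;</script>'
    '</head><body></body></html>'
)
scripts, generators = tech_hints(sample_html)
print(scripts, generators)
```

Inside a job like the one above, each script URL or generator string could be yielded with a count of 1, the same way server names are counted.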

Best,
Sebastian

> --
> You received this message because you are subscribed to the Google Groups "Common Crawl" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to
> common-crawl...@googlegroups.com <mailto:common-crawl...@googlegroups.com>.
> To post to this group, send email to common...@googlegroups.com
> <mailto:common...@googlegroups.com>.
> Visit this group at https://groups.google.com/group/common-crawl.
> For more options, visit https://groups.google.com/d/optout.

Aigerim Serikbekova

Aug 9, 2017, 9:33:33 AM

to common...@googlegroups.com

Thank you so much!!!

