Fraud News


Juan Pablo Torino

Jan 3, 2017, 2:23:56 PM
to Common Crawl
Hi all

Hope you can help me shape an idea.

We are trying to analyze how electronic fraud has evolved since 2015. Based on certain criteria using keywords, we are planning to navigate through WARC or WET files, filtering them on these keywords; those that match will initially be processed with a basic word-count script using another list of fraud keywords.
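
As a rough sketch of that filter-then-count step in Python (the warcio library is one way to read WET files; the file name and both keyword lists below are only placeholders):

    from collections import Counter
    from warcio.archiveiterator import ArchiveIterator  # pip install warcio

    # Placeholder keyword lists -- substitute the real criteria.
    FILTER_KEYWORDS = {"phishing", "scam", "card fraud"}
    COUNT_KEYWORDS = {"phishing", "scam", "card fraud", "chargeback"}

    counts = Counter()
    with open("example.warc.wet.gz", "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "conversion":  # WET plain-text records
                continue
            text = record.content_stream().read().decode(
                "utf-8", errors="replace").lower()
            # Stage 1: keep only pages matching at least one filter keyword.
            if not any(kw in text for kw in FILTER_KEYWORDS):
                continue
            # Stage 2: basic word count against the fraud keyword list.
            for kw in COUNT_KEYWORDS:
                counts[kw] += text.count(kw)

    print(counts.most_common())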

My doubt is how we can pre-filter a certain number of domains based on keywords instead of checking all the domains in the index. My first idea is to manually gather a significant amount of related domains that deal with fraud; however, this approach requires heavy manual effort...
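
One possible shortcut, sketched below (assuming the requests library; the crawl label and seed domain are placeholders): the Common Crawl URL index at index.commoncrawl.org can be queried per domain, returning the WARC file, offset, and length of every captured page, so a manual seed list only has to cover domains, not pages.

    import json
    import requests  # pip install requests

    # Placeholder crawl label -- adjust to the crawl of interest.
    INDEX = "https://index.commoncrawl.org/CC-MAIN-2016-50-index"
    resp = requests.get(INDEX,
                        params={"url": "example.com/*", "output": "json"})

    # One JSON record per line, each pointing at a WARC byte range,
    # so only those ranges need to be fetched -- not the whole crawl.
    for line in resp.text.splitlines():
        rec = json.loads(line)
        print(rec["filename"], rec["offset"], rec["length"])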

Any suggestion please?


Thanks

Greg Lindahl

Jan 3, 2017, 4:38:46 PM
to common...@googlegroups.com
On Tue, Jan 03, 2017 at 11:23:55AM -0800, Juan Pablo Torino wrote:

> We are trying to analyze how electronic fraud has evolved since 2015.

Blekko had a "pharma spam" word list that was pretty good for that
class of webspam, but of course it's hard to know if the sales website
behind it is fraudulent or not. (Kwikmed, for example, appears to be
a fully licensed pharmacy, but of course who knows what's going on
with their affiliates.)

Other kinds of fraud, like reverse mortgage scams, are hard to figure
out using keywords. One basic rule is that any lead-generation website
-- fill out your details and a salesman will call! -- is likely going
to hook your elderly relative up with a high-pressure "boiler room"
sales organization. But you can't stare at the webpage contents and
figure out if they're pushing high-sales-fee icky-terms reverse
mortgages or not.

-- greg


Juan Pablo Torino

Jan 4, 2017, 4:36:20 AM
to Common Crawl
Hi Greg

First of all, thanks for your comment. I may not have explained myself correctly; I'm quite sure it's my English :(

What we are trying to do is analyze the text of web pages, matching them against our list of "fraud-related keywords". If there is a sufficient match, we will then count the fraud-related words and identify trends or areas where fraud is peaking, for instance.
The intention is not to conclude whether a site (domain) is fraudulent or not.

Any idea on how we can compile a manageable set of domains to analyze, without checking all the existing WARCs/WETs?

Thanks!

Ivan Habernal

Jan 4, 2017, 5:45:35 AM
to Common Crawl
Hi Juan,

This looks like a typical NLP (natural language processing) or IR (information retrieval) task to me. Unfortunately, "analysing web pages" is very vague and can mean many things, from simple domain-origin counting to some sort of deep information extraction from the text. However, you can start with keyword matching over a _subset_ of Common Crawl (just a couple of random WARCs, not the entire corpus) to get a gist of what's there.

Regarding keyword search, we did something similar on a "clean" (plain-text only) version of Common Crawl; you may have a look here: https://zoidberg.ukp.informatik.tu-darmstadt.de/jenkins/job/DKPro%20C4Corpus/org.dkpro.c4corpus$dkpro-c4corpus-doc/doclinks/1/#_use_case_search_for_patterns_in_c4corpus

But your problem seems too broad at the moment to give any good practical advice.
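
As a sketch of that random-subset idea (each crawl publishes a wet.paths.gz listing of its WET files; the crawl label and sample size below are placeholders):

    import gzip
    import random
    import requests  # pip install requests

    # Fetch the list of WET files for one crawl and sample a handful.
    PATHS_URL = ("https://commoncrawl.s3.amazonaws.com/crawl-data/"
                 "CC-MAIN-2016-50/wet.paths.gz")
    paths = gzip.decompress(
        requests.get(PATHS_URL).content).decode("utf-8").splitlines()

    for p in random.sample(paths, 5):  # a small random subset
        print("https://commoncrawl.s3.amazonaws.com/" + p)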

Best,

Ivan

Juan Pablo Torino

Jan 4, 2017, 11:38:39 AM
to Common Crawl
Hi Ivan

Let me check... thanks for your comments.

Regards.