Cyber Crime Dataset Csv


Carlos Beirise

Jul 27, 2024, 5:48:46 PM

to coftherelea


As internet usage continues to increase, so does the amount of personal information and data that is made available online. This could be out of choice, for example, somebody providing personal details to a social network in order to use their service - or it could be unwillingly, as a victim of a cybercrime attack or data breach. With the development of various AI tools, cyber crime is currently transforming. Therefore, the risks to individuals, companies, organizations, and governments have never been greater.


DOWNLOAD ->>->>->> https://urlgoal.com/2zS0OK

Cybercrime can take many forms - identity fraud, data theft, ransomware attacks, copyright infringement, and phishing campaigns are just some examples. Organizations consider the loss of personally identifiable information of customers or employees as one of the most dangerous outcomes of cyberattacks. The loss of sensitive information can have serious repercussions for companies, such as damaged reputation and loss in revenue.

Organizations worldwide not only pay to recover data lost in cyberattacks; they also suffer downtime and disruption to operations caused by cybercrime. The average cost of a data breach worldwide is around 4.35 million U.S. dollars, but financial repercussions differ greatly depending on the region, organization size, and industry. The average cost of a data breach in the healthcare sector is 10.1 million U.S. dollars.

While governments around the world take steps toward improved personal data protection, users are now more aware and interested in minimizing the risks in the online environment. As of 2023, seven in ten internet users stated having taken steps to protect their identity online. At the same time, some users are willing to accept risks for more convenient internet usage. Nearly 70 percent of global respondents said they felt more vulnerable to identity theft now than they did years ago.

The GDPR (General Data Protection Regulation) was introduced in the EU in 2018 in an attempt to better regulate the handling of data and personal information by companies and organizations and to provide greater protection for citizens' rights and privacy. It remains the most comprehensive data privacy regulation globally. In recent years, other countries have developed similar laws protecting users' online information.

Statista offers comprehensive data on the ever-evolving issue of cyber crime. In this section, users can find information on the most common types of cyber attacks, the biggest online data breaches, and the regions and industries most targeted by cyber intrusions worldwide.

The database contains statistics on the number of cyber incidents and compromised data records as well as a breakdown of the costs associated with cyber crime and cyber security among companies. These financial figures are further complemented with survey data and other relevant insights into the forms and frequency of cyber bullying among internet users.

We "scrape" a number of publicly available underground forums where there is discussion of cybercrime and advertising of the results of cybercrime. Some of these forums have been operating for many years and we have now amassed a complete collection of posts (excluding those that have been actively deleted of course). Currently we have over 100 million posts, some dating back more than 10 years.

It is possible to use this data to determine both what has been posted about a particular cybercrime technique (and when) and also what some particular person (hidden behind a pseudonym of course) might have been posting about.
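As a rough illustration of the "what was posted about a technique, and when" query, here is a minimal Python sketch. It assumes the posts have been exported to a CSV file with columns named `post_id`, `author`, `posted_at` (an ISO 8601 timestamp) and `body`. These column names and the sample rows are hypothetical and are not taken from the actual dataset.

```python
import csv
import io
from collections import Counter


def mentions_by_month(csv_file, keyword):
    """Count posts per YYYY-MM whose body mentions `keyword` (case-insensitive)."""
    counts = Counter()
    needle = keyword.lower()
    for row in csv.DictReader(csv_file):
        if needle in row["body"].lower():
            # The first seven characters of an ISO 8601 timestamp are "YYYY-MM".
            counts[row["posted_at"][:7]] += 1
    return dict(sorted(counts.items()))


# Tiny in-memory sample standing in for a real export (hypothetical data).
SAMPLE = """post_id,author,posted_at,body
1,alice,2019-03-02T10:00:00,Selling a booter service
2,bob,2019-03-15T12:30:00,Anyone tried this Booter?
3,alice,2019-04-01T09:00:00,Unrelated post
"""

result = mentions_by_month(io.StringIO(SAMPLE), "booter")
print(result)
```

Filtering on the `author` column instead of `body` would give the per-pseudonym view in the same way.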

We are expanding our forum collection to include material from "extremist" forums. Although there are some cybercrime aspects to this material it will mainly be of interest to those who are studying hate groups, extremism and radicalisation. We will shortly have more than 40 million posts.

We have also been collecting from a range of "Incel" (INvoluntary CELibates) forums. These forums support online subcultures, where members are unable to find a romantic partner despite desiring one. Extremist thoughts and opinions are commonly found on these forums. Our dataset already holds more than 7 million posts and 700,000 threads and it is being scraped on an ongoing basis.

We do not currently scrape any underground marketplace websites, but plan to expand into this area soon. However, one of the underground forums that we do scrape has introduced a service for processing "contracts" and from this we collect a range of valuable information such as the nature of the goods and services being exchanged, maker/taker obligations, contract values, agreement term and reputation ratings of the parties involved. It may also contain payment details, including bitcoin wallets and transaction hashes. This is a ground-truth dataset, which can be used to understand part of the underground economy and its underlying social network. The dataset contains roughly 180,000 contracts at present and it is being collected on a regular basis.

Some people choose to boast about their hacking ability by breaking into websites, defacing pages and then publishing details of the defaced page online. We are building a dataset of these boasts ... we currently have about 550,000 sets of details (notifier, location, IP address, domain, webserver information and snapshot of defaced page) and expect this total to grow markedly over the next few months.

A number of publicly accessible channels on Discord and Telegram are used for discussions of cybercrime topics such as illicit markets and booter (DDoS) services. We currently have a collection of over 3 million Telegram messages (from 50+ channels) and 2.5 million Discord messages (from over 3000 channels).

Investment fraudsters and financial scam operators aim to lure victims into making investments in fake schemes, which either promise very high rates of return (with very low risk), impersonate some genuine companies or do not exist at all. We have been collecting scam reports from multiple sources including blocklists, scam reporting forums and online social media posts and we currently hold details of more than 150,000 associated web URLs.

Third-party app marketplaces are now filled with large numbers of modded Android apps offering similar (or more) functionality compared with the original application. The diversity of these marketplaces has opened new opportunities for malicious actors: modifying the in-app ad networks, including malware, etc. We are building a dataset of these apps collected from several sources -- 3000 so far and growing.

The sensors respond to packets associated with scanning for 'reflectors' that are often used in distributed reflected amplified denial of service (DDoS) attacks -- and this means that our sensors are often called into play for these attacks, which means that we have a record of the victim IP.
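To illustrate why a sensor ends up with a record of the victim: in a reflected attack the requests arrive carrying the victim's address as their (spoofed) source, so counting requests per source address gives a first approximation of who is being attacked. The Python sketch below assumes a log of `(source_ip, protocol)` tuples and a threshold of 5 packets. Both the log format and the threshold are assumptions made for illustration, not the collection system's actual rules.

```python
from collections import Counter


def likely_victims(packets, min_packets=5):
    """Return source IPs seen at least `min_packets` times, sorted.

    `packets` is an iterable of (source_ip, protocol) tuples; the source IP
    of a reflection request is treated here as the spoofed victim address.
    """
    counts = Counter(src for src, _proto in packets)
    return sorted(ip for ip, n in counts.items() if n >= min_packets)


# Hypothetical sensor log using documentation-only address ranges.
log = [("198.51.100.7", "ntp")] * 6 + [("203.0.113.9", "dns")] * 2

victims = likely_victims(log)
print(victims)
```

A low packet count from one source is more likely to be a scanner looking for reflectors than an attack, which is why a threshold of some kind is needed.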

Our dataset starts in March 2014, though the number of sensors varies over time. A high level description of our collection system and a summary of the data appears in our paper: Daniel R. Thomas, Richard Clayton, and Alastair R. Beresford: "1000 days of UDP amplification DDoS attacks", APWG eCrime, 2017.

The Mirai malware scans for devices that it may be able to compromise by sending out distinctive TCP SYN packets. We collect these packets when they hit our sensors. We have data from about a dozen sensors from mid-November 2016 onwards, but significantly better coverage from a (circa) /16 after April 2017 and from a further (bit more than a) /14 from mid-October 2017 onwards.
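For a sense of scale, the number of IPv4 addresses in a /16 and a /14 can be worked out with Python's standard `ipaddress` module. The two networks below are private-range placeholders chosen only so the prefixes parse; they are not the actual monitored ranges, and the text describes the real coverage as only approximately these sizes.

```python
import ipaddress

# Placeholder networks (assumption): only the prefix lengths matter here.
slash16 = ipaddress.ip_network("10.0.0.0/16")
slash14 = ipaddress.ip_network("10.0.0.0/14")

size16 = slash16.num_addresses  # 2 ** (32 - 16)
size14 = slash14.num_addresses  # 2 ** (32 - 14)

print(size16, size14)
```

So a /14 covers four times as many addresses as a /16, which is consistent with the "significantly better coverage" described for the later periods.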

We operate honeypots that are specifically intended to collect Mirai malware (they also get a certain amount of bycatch -- copies of QBot variants etc.). We have around 15000 binaries (which is not as impressive as it sounds because a Mirai source file is usually compiled ten or more times for different CPU architectures and we collect all the variants we can).

Although the comments are often just lists of URLs to pharma sites, they are sometimes socially engineered to try and encourage us to make them visible -- and from time to time the posters get confused and post their templates rather than the customised result.

We have a very substantial list of phishing URLs going back over 10 years. We obtain these URLs not only from the APWG, but also from other sources so that our list is probably one of the most extensive there is. That said, some of the URLs are on the list in error and this complicates experimental design. If you are considering using this dataset then we can assist by explaining its provenance in more detail.

We have a dataset of phishing emails (sent to a small set of email addresses) from 2005 onwards. The dataset contains numerous duplicates so numbers are inexact, but contains up to 10000 emails per year until 2010 and around 1000 emails a year since then.

We have a dataset of Advanced Fee Fraud (sometimes called 419 scam) emails (sent to a small set of email addresses) from 2001 onwards. The dataset has duplicates so numbers are inexact, but contains around 75000 unique emails. For the past few years offers of loans have been included in this dataset.
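Because the phishing and Advanced Fee Fraud collections both contain duplicates, anyone counting emails will probably want to deduplicate first. One simple approach, sketched below in Python, is to hash a normalised copy of each message body and count distinct hashes. The normalisation chosen here (collapse whitespace, lowercase) is an assumption for illustration and is not how the quoted "unique" figures were produced. Near-duplicates that differ in wording will still be counted separately.

```python
import hashlib
import re


def fingerprint(body):
    """SHA-256 of the body after collapsing whitespace and lowercasing."""
    normalised = re.sub(r"\s+", " ", body).strip().lower()
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def count_unique(bodies):
    """Number of distinct fingerprints among the given message bodies."""
    return len({fingerprint(b) for b in bodies})


# Hypothetical sample: the first two differ only in case and spacing.
sample = [
    "Dear friend,\nI need your help moving funds.",
    "dear friend,   I need your help moving funds.  ",
    "You have won a lottery!",
]

unique = count_unique(sample)
print(unique)
```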

We have a dataset of "spam" email (sent to a small set of email addresses) dating from 2003 onwards (and indeed some spam email from the mid-1990s onwards as well). Numbers of emails vary dramatically from year to year (and there are large numbers of duplicates) but in recent times exceed 2000 emails a month. Note that for relevant periods phishing and Advanced Fee Fraud emails have been extracted into separate datasets.

We also have a very substantial dataset of email spam provided by Abusix. Our data starts in July 2020 and runs at around 10 to 20 million emails a day (c 100G in a highly compressed format) so it is worth considering resource requirements, and talking to us, whilst considering your experimental design.