Fetch_20newsgroups Dataset Download

Pirjo Unzicker

Jul 22, 2024, 6:45:54 AM
to feasmewardded

The 20 newsgroups dataset comprises around 18,000 newsgroup posts on 20 topics, split into two subsets: one for training (or development) and the other for testing (or for performance evaluation). The split between the train and test sets is based upon messages posted before and after a specific date.

This module contains two loaders. The first one, sklearn.datasets.fetch_20newsgroups, returns a list of the raw texts that can be fed to text feature extractors such as sklearn.feature_extraction.text.CountVectorizer with custom parameters so as to extract feature vectors. The second one, sklearn.datasets.fetch_20newsgroups_vectorized, returns ready-to-use features, i.e., it is not necessary to use a feature extractor.
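For instance, a minimal sketch of the two loaders might look like this (the exact matrix shapes depend on your scikit-learn version):

    from sklearn.datasets import fetch_20newsgroups, fetch_20newsgroups_vectorized
    from sklearn.feature_extraction.text import CountVectorizer

    # Raw texts: a list of strings that still needs a feature extractor.
    raw = fetch_20newsgroups(subset='train')
    X_counts = CountVectorizer().fit_transform(raw.data)

    # Ready-to-use features: .data is already a sparse matrix, no extractor needed.
    vectorized = fetch_20newsgroups_vectorized(subset='train')
    print(X_counts.shape, vectorized.data.shape)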

The sklearn.datasets.fetch_20newsgroups function is a data-fetching / caching function that downloads the data archive from the original 20 newsgroups website, extracts the archive contents into the ~/scikit_learn_data/20news_home folder and calls sklearn.datasets.load_files on either the training or testing set folder, or both of them.
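A rough sketch of those calls (the first call downloads and extracts the archive into the cache folder, later calls reuse the cached copy):

    from sklearn.datasets import fetch_20newsgroups

    train = fetch_20newsgroups(subset='train')   # training fold only
    test = fetch_20newsgroups(subset='test')     # testing fold only
    both = fetch_20newsgroups(subset='all')      # both folds together
    print(len(train.data), len(test.data), len(both.data))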

First, the categories are specified to filter the records from the original dataset. Let us say we get only 1200 articles after applying the filter. By default, the original dataset has 20 categories; if we don't specify any categories, the loader pulls articles from all categories into the dataset.
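As a sketch, with an illustrative pair of category names (any of the 20 group names works here):

    from sklearn.datasets import fetch_20newsgroups

    categories = ['rec.autos', 'sci.space']          # illustrative subset of the 20 groups
    subset = fetch_20newsgroups(subset='train', categories=categories)
    print(len(subset.data), subset.target_names)     # far fewer articles than the full set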

The train_test_split() function randomly separates the data into training and testing sets, while fit() trains the classifier on the selected training data (it builds a model whose parameters map the model input to an output) and score() gives us the accuracy on the testing data. The time() function is here just to give us some information about the training duration.
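Something along these lines, where the TF-IDF + logistic regression pipeline is only an illustrative classifier choice:

    from time import time
    from sklearn.datasets import fetch_20newsgroups
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    news = fetch_20newsgroups(subset='all')
    X_train, X_test, y_train, y_test = train_test_split(
        news.data, news.target, test_size=0.25, random_state=42)

    clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    start = time()
    clf.fit(X_train, y_train)                        # train on the training split
    print('training took %.1fs' % (time() - start))
    print('test accuracy:', clf.score(X_test, y_test))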

In the following we will use the built-in dataset loader for 20 newsgroups from scikit-learn. Alternatively, it is possible to download the dataset manually from the website and use the sklearn.datasets.load_files function by pointing it to the 20news-bydate-train subfolder of the uncompressed archive folder.
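A sketch of that manual route; the path below is a placeholder for wherever you uncompressed the archive:

    from sklearn.datasets import load_files

    # Placeholder path: point this at your own copy of the uncompressed archive.
    train = load_files('20news-bydate/20news-bydate-train',
                       encoding='latin-1')           # latin-1 decodes any byte sequence
    print(len(train.data), train.target_names[:3])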

You can notice that the samples have been shuffled randomly (with a fixed RNG seed): this is useful if you select only the first samples to quickly train a model and get a first idea of the results before re-training on the complete dataset later.

Fortunately, most values in X will be zeros, since for a given document fewer than a couple of thousand distinct words will be used. For this reason we say that bags of words are typically high-dimensional sparse datasets. We can save a lot of memory by only storing the non-zero parts of the feature vectors in memory.
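A small sketch that makes the sparsity visible (the exact counts depend on the scikit-learn version):

    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import CountVectorizer

    train = fetch_20newsgroups(subset='train')
    X = CountVectorizer().fit_transform(train.data)  # SciPy sparse (CSR) matrix

    n_docs, n_features = X.shape
    print(n_docs, 'documents x', n_features, 'distinct words')
    print('fraction of non-zero entries: %.4f' % (X.nnz / (n_docs * n_features)))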

The book mentioned that the text data in the 20 newsgroups dataset that we downloaded with fetch_20newsgroups is highly dimensional. I do not understand this statement. It is my understanding that dimension is used to describe the axes that an array has. For example,

The initial dataset contains other metadata, such as a description of the dataset, the names of the target categories and also the location of each sample's file. We don't really care about these for the pure modelling part. So there are only text blocks, called data, and the target categories, called target. Your input is then 1-D: the text blocks.

When you encode a given text, each word is one-hot encoded, so the text becomes a vector whose length equals the size of the dictionary (the classical bag-of-words approach). That usually goes into the thousands (e.g. the Yelp review dataset has over a million unique words), so clearly this is a highly dimensional problem.

Your ML model will be trained on the 20newsgroups dataset, which contains 20,000 newsgroup posts on 20 topics. The 20newsgroups dataset is curated by the Carnegie Mellon University School of Computer Science and is publicly available from scikit-learn.

The scikit-learn dataset is naturally broken down into a training set and a test set. You complete some preprocessing on the test set, then set it aside for use only for testing. You split the training set into training and validation sets to test the model performance during training.

Note: The following approach for splitting the data only works if your entire dataset has been shuffled first. Optionally, you can use the sklearn.model_selection train_test_split API to perform the split and set the random_state seed to a numerical value to ensure repeatability.

The NTM supports both CSV and RecordIO protobuf formats for data in the training, validation, and testing channels. The following helper function converts the raw vectors into RecordIO format and, using the n_parts parameter, optionally breaks the dataset into shards that can be used for distributed training.
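As a very rough sketch, assuming the SageMaker Python SDK helper sagemaker.amazon.common.write_spmatrix_to_sparse_tensor is available; the function name split_convert and the shard handling below are illustrative, not the exact helper from the course:

    import io
    import numpy as np
    # Assumption: the SageMaker Python SDK is installed and provides this helper.
    from sagemaker.amazon.common import write_spmatrix_to_sparse_tensor

    def split_convert(sparse_matrix, labels=None, n_parts=1):
        """Illustrative: shard a SciPy sparse matrix and serialize each shard
        to RecordIO-protobuf, returning the in-memory buffers."""
        buffers = []
        rows_per_part = int(np.ceil(sparse_matrix.shape[0] / n_parts))
        for i in range(n_parts):
            start, stop = i * rows_per_part, (i + 1) * rows_per_part
            chunk_labels = None if labels is None else labels[start:stop]
            buf = io.BytesIO()
            write_spmatrix_to_sparse_tensor(buf, sparse_matrix[start:stop], chunk_labels)
            buf.seek(0)
            buffers.append(buf)
        return buffers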

In this module, you imported and fetched the dataset you use for your content recommendation system. Then, you prepared the dataset through preprocessing, lemmatization and tokenization. Finally, you split the dataset into training and validation sets, then staged them in your Amazon S3 bucket.

The 20 newsgroups dataset comprises around 18,000 newsgroup posts on 20 topics, split into two subsets. The data is organized into 20 different newsgroups, each corresponding to a different topic. Some of the newsgroups are very closely related to each other. Here are the groups:

comp.graphics, comp.os.ms-windows.misc, comp.sys.ibm.pc.hardware, comp.sys.mac.hardware, comp.windows.x
rec.autos, rec.motorcycles, rec.sport.baseball, rec.sport.hockey
sci.crypt, sci.electronics, sci.med, sci.space
misc.forsale
talk.politics.misc, talk.politics.guns, talk.politics.mideast
talk.religion.misc, soc.religion.christian, alt.atheism

This involves removing punctuation and stop words and, if the data is in HTML format, we can do more things like removing HTML, CSS and script content using BeautifulSoup. For the 20 newsgroups dataset, scikit-learn provides a remove argument which can be used to clean the text, for example by removing headers, footers and quotes.
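A minimal sketch of that remove argument:

    from sklearn.datasets import fetch_20newsgroups

    clean = fetch_20newsgroups(subset='train',
                               remove=('headers', 'footers', 'quotes'))
    print(clean.data[0][:200])   # body text only, with headers/footers/quotes stripped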

Start by reading the data. You'll use a dataset included in scikit-learn called 20 newsgroups. This dataset consists of roughly 20,000 newsgroup documents, split into 20 categories. For the sake of simplicity, you'll only use five of those categories in this example.

Once you have this numerical representation, you can pass this dataset to your machine learning model. This is what you'll do with the documents in the 20 newsgroup dataset. Keep in mind that because the dataset has so many documents, you'll end up with a matrix with many more columns than the example above.

For this tutorial, we will use the 20 Newsgroups dataset, which is a collection of approximately 20,000 newsgroup documents, partitioned across 20 different categories. Scikit-Learn provides a function to fetch this dataset; a sketch combining the fetch with the train/test split follows the next paragraph.

Next, we need to split the dataset into training and testing sets. The training set will be used to train the model, while the testing set will be used to evaluate the model's performance. We will use an 80/20 split for training and testing, respectively:
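A minimal sketch covering both the fetch mentioned above and the 80/20 split; the random_state value is arbitrary:

    from sklearn.datasets import fetch_20newsgroups
    from sklearn.model_selection import train_test_split

    newsgroups = fetch_20newsgroups(subset='all')
    X_train, X_test, y_train, y_test = train_test_split(
        newsgroups.data, newsgroups.target,
        test_size=0.2,        # 80/20 train/test split
        random_state=42)      # fixed seed so the split is repeatable
    print(len(X_train), len(X_test))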

The 20 Newsgroups dataset is a popular collection of approximately 20,000 newsgroup documents partitioned across 20 different categories. The dataset is widely used for text classification tasks, serving as a benchmark for various machine learning algorithms. The categories in the dataset cover a diverse range of topics such as religion, politics, sports, and technology.

Naive Bayes models are a group of extremely fast and simple classification algorithms that are often suitable for very high-dimensional datasets. Because they are so fast and have so few tunable parameters, they end up being very useful as a quick-and-dirty baseline for a classification problem. This section will focus on an intuitive explanation of how naive Bayes classifiers work, followed by a couple of examples of them in action on some datasets.

The last two points seem distinct, but they actually are related: as the dimension of a dataset grows, it is much less likely for any two points to be found close together (after all, they must be close in every single dimension to be close overall). This means that clusters in high dimensions tend to be more separated, on average, than clusters in low dimensions, assuming the new dimensions actually add information. For this reason, simplistic classifiers like naive Bayes tend to work as well as or better than more complicated classifiers as the dimensionality grows: once you have enough data, even a simple model can be very powerful.
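A short sketch of such a baseline on the 20 newsgroups data (default parameters throughout; this is not tuned):

    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB

    train = fetch_20newsgroups(subset='train')
    test = fetch_20newsgroups(subset='test')

    vectorizer = TfidfVectorizer()
    X_train = vectorizer.fit_transform(train.data)   # tens of thousands of dimensions
    X_test = vectorizer.transform(test.data)

    nb = MultinomialNB().fit(X_train, train.target)
    print('test accuracy:', nb.score(X_test, test.target))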

One method to validate the number of clusters is the elbow method. The idea of the elbow method is to run k-means clustering on the dataset for a range of values of k (say, k from 1 to 10 in the examples above), and for each value of k calculate the sum of squared errors (SSE).
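A sketch of that loop on toy data (inertia_ is scikit-learn's name for the SSE; the blob data is only for illustration):

    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=500, centers=4, random_state=0)   # toy data

    ks = range(1, 11)
    sse = []
    for k in ks:
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        sse.append(km.inertia_)      # sum of squared distances to the closest centroid

    plt.plot(ks, sse, marker='o')    # look for the "elbow" in this curve
    plt.xlabel('k')
    plt.ylabel('SSE')
    plt.show()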

The value of K needs to be specified beforehand: we can only guess this value if we already have a good idea about our dataset, and if we are working with a new dataset, the elbow method can be used to determine the value of K.

Well, Feature Selection is a very important step in machine learning. When we get a dataset in a tabular form, each and every column is a feature, right? But the question is which features are relevant.

In a nutshell, it is good to know, that if you are struggling with a dataset and not sure where to start from, just take a look at the scikit-learn documentation. I am sure it will have something useful to offer.

Next, we need to load the 20 newsgroups dataset using 'sklearn'. On Colab, we can use the 'fetch_20newsgroups' method to download and load the dataset. If working locally, we can also download the dataset manually and then use the 'load_files' method to load it.

You would see that both the train and test datasets consist of the data (i.e. the text documents), the filenames of these text documents, the target_names i.e. the document labels in text, and the target i.e. the document labels in numbers.
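For example, a quick way to inspect those fields (field names as in the scikit-learn Bunch object):

    from sklearn.datasets import fetch_20newsgroups

    train = fetch_20newsgroups(subset='train')
    test = fetch_20newsgroups(subset='test')

    print(train.data[0][:100])      # raw text of the first document
    print(train.filenames[0])       # where that document is cached on disk
    print(train.target_names[:5])   # label names as text
    print(train.target[:5])         # the same labels as integers
    print(len(test.data))           # the test set has the same structure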

We now need to create the vector representations of all the documents in the training and test datasets using the TfidfVectorizer object from 'sklearn'. We would fit the vectorizer object with the training dataset using the fit_transform method, which would first create features based on all training documents and then transform the training samples into vector representations of these features. We could do that as below.
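Reusing the train and test objects from the sketch above, a minimal version of that step could be:

    from sklearn.feature_extraction.text import TfidfVectorizer

    vectorizer = TfidfVectorizer()
    X_train = vectorizer.fit_transform(train.data)   # learn the vocabulary, vectorize the training docs
    X_test = vectorizer.transform(test.data)         # reuse that vocabulary on the test docs
    print(X_train.shape, X_test.shape)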

For training, we choose the linear support vector machine model (SVC), and then train it with the training dataset as below. The model choice is arbitrary, for this demonstration. In an ideal situation, you should start out with a set of different models and choose the one with the best performance.
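Continuing from the vectorization sketch above, the training step might look like this (LinearSVC is assumed as the linear SVM implementation):

    from sklearn.svm import LinearSVC

    svm = LinearSVC()                              # linear SVM; the choice is arbitrary here
    svm.fit(X_train, train.target)                 # train on the TF-IDF vectors
    print('test accuracy:', svm.score(X_test, test.target))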
