An iterable-style dataset is an instance of a subclass of IterableDataset that implements the __iter__() protocol, and represents an iterable over data samples. This type of dataset is particularly suitable for cases where random reads are expensive or even improbable, and where the batch size depends on the fetched data.
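A minimal sketch of such a dataset, assuming a sequential-only source (the class name and the stand-in generator are illustrative, not part of the API):

```python
import torch
from torch.utils.data import IterableDataset, DataLoader

class StreamDataset(IterableDataset):
    """Streams samples from a source that only supports sequential reads
    (a plain range here stands in for a file or socket)."""

    def __init__(self, n):
        self.n = n

    def __iter__(self):
        # Sequential access only: no __getitem__, no random reads.
        for i in range(self.n):
            yield torch.tensor([i])

loader = DataLoader(StreamDataset(5), batch_size=2)
batches = list(loader)  # 3 batches; the last one holds a single sample
```

Because the dataset has no notion of an index, the loader simply consumes the iterator in order and collates every `batch_size` samples.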
When using an IterableDataset with multi-process data loading, the same dataset object is replicated on each worker process, and thus the replicas must be configured differently to avoid duplicated data. See the IterableDataset documentation for how to achieve this.
For iterable-style datasets, data loading order is entirely controlled by the user-defined iterable. This allows easier implementations of chunk-reading and dynamic batch sizes (e.g., by yielding a batched sample each time).
The rest of this section concerns the case with map-style datasets. torch.utils.data.Sampler classes are used to specify the sequence of indices/keys used in data loading. They represent iterable objects over the indices to datasets. E.g., in the common case with stochastic gradient descent (SGD), a Sampler could randomly permute a list of indices and yield each one at a time, or yield a small number of them for mini-batch SGD.
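The mini-batch SGD case can be sketched with the built-in samplers (the dataset contents here are arbitrary):

```python
import torch
from torch.utils.data import TensorDataset, RandomSampler, BatchSampler

dataset = TensorDataset(torch.arange(10).float())

# RandomSampler permutes the indices 0..9; BatchSampler groups the
# permuted indices into mini-batches of 3, as for mini-batch SGD.
sampler = RandomSampler(dataset)
batch_sampler = BatchSampler(sampler, batch_size=3, drop_last=False)

batches = list(batch_sampler)  # e.g. [[7, 2, 9], [0, 4, 1], ...]
```

Every index appears exactly once per epoch; only the order (and grouping) changes between epochs.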
This is the most common case, and corresponds to fetching a minibatch of data and collating them into batched samples, i.e., containing Tensors with one dimension being the batch dimension (usually the first).
When batch_size (default 1) is not None, the data loader yields batched samples instead of individual samples. The batch_size and drop_last arguments are used to specify how the data loader obtains batches of dataset keys. For map-style datasets, users can alternatively specify batch_sampler, which yields a list of keys at a time.
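A short sketch of both options, assuming a small map-style dataset:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.arange(10))

# drop_last=True discards the final, incomplete batch (1 leftover sample).
loader = DataLoader(dataset, batch_size=3, drop_last=True)
sizes = [batch[0].shape[0] for batch in loader]  # [3, 3, 3]

# Alternatively, hand the loader explicit lists of keys per batch;
# batch sizes may then vary freely.
loader2 = DataLoader(dataset, batch_sampler=[[0, 1], [5, 6, 7]])
sizes2 = [batch[0].shape[0] for batch in loader2]  # [2, 3]
```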
The batch_size and drop_last arguments are essentially used to construct a batch_sampler from the sampler. For map-style datasets, the sampler is either provided by the user or constructed based on the shuffle argument. For iterable-style datasets, the sampler is a dummy infinite one. See this section for more details on samplers.
When both batch_size and batch_sampler are None (the default value for batch_sampler is already None), automatic batching is disabled. Each sample obtained from the dataset is processed with the function passed as the collate_fn argument.
When automatic batching is disabled, collate_fn is called with each individual data sample, and the output is yielded from the data loader iterator. In this case, the default collate_fn simply converts NumPy arrays into PyTorch tensors.
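For instance, with a plain list of NumPy arrays as the dataset, disabling batching makes the loader yield one converted sample at a time:

```python
import numpy as np
import torch
from torch.utils.data import DataLoader

data = [np.array([i, i + 1]) for i in range(3)]

# batch_size=None disables automatic batching: the default collate_fn
# receives one sample at a time and just converts it to a tensor.
loader = DataLoader(data, batch_size=None)
samples = list(loader)  # three 1-D tensors, no batch dimension added
```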
When automatic batching is enabled, collate_fn is called with a list of data samples each time. It is expected to collate the input samples into a batch for yielding from the data loader iterator. The rest of this section describes the behavior of the default collate_fn (default_collate()).
For instance, if each data sample consists of a 3-channel image and an integral class label, i.e., each element of the dataset returns a tuple (image, class_index), the default collate_fn collates a list of such tuples into a single tuple of a batched image Tensor and a batched class label Tensor. In particular, the default collate_fn has the following properties:
It preserves the data structure, e.g., if each sample is a dictionary, it outputs a dictionary with the same set of keys but batched Tensors as values (or lists if the values cannot be converted into Tensors). The same holds for lists, tuples, namedtuples, etc.
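The dictionary case can be checked directly by calling default_collate on a hand-built list of samples:

```python
import torch
from torch.utils.data import default_collate

# Each sample is a dict; default_collate returns a dict with the same
# keys, stacking tensor values along a new leading batch dimension and
# collecting the integer labels into a single tensor.
samples = [
    {"image": torch.zeros(3, 4, 4), "label": 0},
    {"image": torch.ones(3, 4, 4), "label": 1},
]
batch = default_collate(samples)
```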
Within a Python process, the Global Interpreter Lock (GIL) prevents fully parallelizing Python code across threads. To avoid blocking computation code with data loading, PyTorch provides an easy switch to perform multi-process data loading by simply setting the argument num_workers to a positive integer.
In this mode, data fetching is done in the same process in which the DataLoader is initialized. Therefore, data loading may block computation. However, this mode may be preferred when the resources used for sharing data among processes (e.g., shared memory, file descriptors) are limited, or when the entire dataset is small and can be loaded entirely in memory. Additionally, single-process loading often shows more readable error traces and is thus useful for debugging.
After several iterations, the loader worker processes will consume the same amount of CPU memory as the parent process for all Python objects in the parent process which are accessed from the worker processes. This can be problematic if the Dataset contains a lot of data (e.g., you are loading a very large list of filenames at Dataset construction time) and/or you are using a lot of workers (overall memory usage is number of workers * size of parent process). The simplest workaround is to replace Python objects with non-refcounted representations such as pandas, NumPy or PyArrow objects. Check out issue #13246 for more details on why this occurs and example code for how to work around these problems.
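One way the workaround can look in practice, sketched here with NumPy and hypothetical paths: storing the filenames as a fixed-width byte-string array means no per-element refcounts are touched on access, so the copy-on-write pages stay shared with forked workers.

```python
import numpy as np

# Hypothetical example: a large list of filename strings is replaced by
# a NumPy byte-string array. Accessing a Python list element touches its
# refcount and dirties the memory page; reading from the NumPy array
# does not, so forked workers keep sharing the parent's pages.
filenames = [f"/data/img_{i:06d}.jpg" for i in range(1000)]
filenames_np = np.array(filenames).astype(np.bytes_)

def get_path(idx):
    # Decoding produces a fresh str each call; the backing array itself
    # is never mutated.
    return filenames_np[idx].decode("utf-8")
```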
In this mode, each time an iterator of a DataLoader is created (e.g., when you call enumerate(dataloader)), num_workers worker processes are created. At this point, the dataset, collate_fn, and worker_init_fn are passed to each worker, where they are used to initialize and fetch data. This means that dataset access, together with its internal IO and transforms (including collate_fn), runs in the worker process.
torch.utils.data.get_worker_info() returns various useful information in a worker process (including the worker id, dataset replica, initial seed, etc.), and returns None in the main process. Users may use this function in dataset code and/or worker_init_fn to individually configure each dataset replica, and to determine whether the code is running in a worker process. For example, this can be particularly helpful in sharding the dataset.
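A sharding sketch along these lines (class name illustrative): each worker replica takes every num_workers-th element, so the workers together cover the data exactly once.

```python
import torch
from torch.utils.data import IterableDataset, get_worker_info

class ShardedRange(IterableDataset):
    """Yields 0..n-1, sharded across worker replicas."""

    def __init__(self, n):
        self.n = n

    def __iter__(self):
        info = get_worker_info()
        if info is None:
            # Main process (num_workers=0): no sharding needed.
            worker_id, num_workers = 0, 1
        else:
            # Worker process: this replica takes every num_workers-th item.
            worker_id, num_workers = info.id, info.num_workers
        return iter(range(worker_id, self.n, num_workers))

ds = ShardedRange(8)
main_process_items = list(ds)  # [0, 1, ..., 7] when iterated directly
```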
For map-style datasets, the main process generates the indices using the sampler and sends them to the workers. So any shuffle randomization is done in the main process, which guides loading by assigning indices to load.
It is generally not recommended to return CUDA tensors in multi-process loading because of many subtleties in using CUDA and sharing CUDA tensors in multiprocessing (see CUDA in multiprocessing). Instead, we recommend using automatic memory pinning (i.e., setting pin_memory=True), which enables fast data transfer to CUDA-enabled GPUs.
On Windows or macOS, spawn() is the default multiprocessing start method. With spawn(), another interpreter is launched which runs your main script, followed by the internal worker function that receives the dataset, collate_fn and other arguments through pickle serialization.
Make sure that any custom collate_fn, worker_init_fn or dataset code is declared as a top-level definition, outside of the __main__ check. This ensures that they are available in worker processes. (This is needed since functions are pickled as references only, not bytecode.)
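A spawn-safe script layout therefore looks roughly like this (the function names are illustrative): everything a worker needs sits at module top level, and only the driver code lives under the __main__ check.

```python
import torch
from torch.utils.data import DataLoader

# Top-level definitions: pickled by reference, so spawn-based workers on
# Windows/macOS can import them from the main module.
def my_collate(batch):
    return torch.stack(batch)

dataset = [torch.tensor([i]) for i in range(4)]

def main():
    loader = DataLoader(dataset, batch_size=2, collate_fn=my_collate)
    return [b.shape for b in loader]

if __name__ == "__main__":
    # Driver code stays inside the guard so the re-imported module does
    # not start loading data again in each worker.
    shapes = main()
```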
In worker_init_fn, you may access the PyTorch seed set for each worker with either torch.utils.data.get_worker_info().seed or torch.initial_seed(), and use it to seed other libraries before data loading.
The default memory pinning logic only recognizes Tensors and maps and iterables containing Tensors. By default, if the pinning logic sees a batch that is a custom type (which will occur if you have a collate_fn that returns a custom batch type), or if each element of your batch is a custom type, the pinning logic will not recognize them, and it will return that batch (or those elements) without pinning the memory. To enable memory pinning for custom batch or data type(s), define a pin_memory() method on your custom type(s).
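A minimal sketch of such a custom batch type (the class name is illustrative); note that actually pinning memory requires a CUDA-enabled build, so the call is guarded here:

```python
import torch

class SimpleBatch:
    """Custom batch type that opts into memory pinning."""

    def __init__(self, inputs, targets):
        self.inputs = inputs
        self.targets = targets

    def pin_memory(self):
        # Mirror what the default logic does for plain tensors: pin each
        # field and return the batch itself.
        self.inputs = self.inputs.pin_memory()
        self.targets = self.targets.pin_memory()
        return self

batch = SimpleBatch(torch.zeros(2, 3), torch.zeros(2))
if torch.cuda.is_available():
    batch = batch.pin_memory()  # only meaningful with a CUDA device
```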
The len(dataloader) heuristic is based on the length of the sampler used. When dataset is an IterableDataset, it instead returns an estimate based on len(dataset) / batch_size, with proper rounding depending on drop_last, regardless of multi-process loading configurations. This represents the best guess PyTorch can make, because PyTorch trusts user dataset code to correctly handle multi-process loading to avoid duplicate data.
All datasets that represent a map from keys to data samples should subclass it. All subclasses should overwrite __getitem__(), supporting fetching a data sample for a given key. Subclasses could also optionally overwrite __len__(), which is expected to return the size of the dataset by many Sampler implementations and the default options of DataLoader. Subclasses could also optionally implement __getitems__() to speed up batched sample loading. This method accepts a list of indices of samples in a batch and returns a list of samples.
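A small map-style dataset implementing all three methods (the class and its contents are illustrative):

```python
import torch
from torch.utils.data import Dataset

class SquaresDataset(Dataset):
    """Map-style dataset: key i -> (i, i*i) as tensors."""

    def __init__(self, n):
        self.n = n

    def __len__(self):
        return self.n

    def __getitem__(self, idx):
        return torch.tensor(idx), torch.tensor(idx * idx)

    def __getitems__(self, indices):
        # Optional batched fetch: one call returns a list of samples,
        # letting the loader avoid per-sample dispatch overhead.
        return [self[i] for i in indices]

ds = SquaresDataset(5)
```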
When used in a worker_init_fn passed over to DataLoader, this method can be useful to set up each worker process differently, for instance, using worker_id to configure the dataset object to only read a specific fraction of a sharded dataset, or using seed to seed other libraries used in dataset code.
Every Sampler subclass has to provide an __iter__() method, providing a way to iterate over indices or lists of indices (batches) of dataset elements, and a __len__() method that returns the length of the returned iterator.
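For example, a custom sampler that walks the dataset indices in reverse order (the class name is illustrative):

```python
from torch.utils.data import Sampler

class ReverseSampler(Sampler):
    """Yields dataset indices from last to first."""

    def __init__(self, data_source):
        self.data_source = data_source

    def __iter__(self):
        # Indices len-1, len-2, ..., 0.
        return iter(range(len(self.data_source) - 1, -1, -1))

    def __len__(self):
        return len(self.data_source)

indices = list(ReverseSampler(range(4)))  # [3, 2, 1, 0]
```

Passing such a sampler via DataLoader's sampler argument makes the loader fetch samples in exactly this order.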
It is especially useful in conjunction with torch.nn.parallel.DistributedDataParallel. In such a case, each process can pass a DistributedSampler instance as a DataLoader sampler, and load a subset of the original dataset that is exclusive to it.
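The partitioning can be illustrated without launching distributed processes by passing num_replicas and rank explicitly; in real use these come from the initialized process group:

```python
import torch
from torch.utils.data import TensorDataset
from torch.utils.data.distributed import DistributedSampler

dataset = TensorDataset(torch.arange(8))

# Explicit num_replicas/rank here are only for illustration; normally
# DistributedSampler reads them from torch.distributed.
shard0 = list(DistributedSampler(dataset, num_replicas=2, rank=0, shuffle=False))
shard1 = list(DistributedSampler(dataset, num_replicas=2, rank=1, shuffle=False))
# The two shards are disjoint and together cover all 8 indices.
```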