ndumdore dagney island


Catherine Rubeo

Aug 2, 2024, 8:01:07 PM

to bapisigimp

One thing people don't quite get as they enter the field of ML is how much of it deals with data: putting together datasets, exploring the data, wrangling the data, etc. The key points of this lecture are:

There are many possibilities for the sources of data. You might have images, text files, logs, or database records. In deep learning, you need to get that data into a local filesystem disk next to a GPU. How you send data from the sources to training is different for each project.

The filesystem is a fundamental abstraction. Its fundamental unit is a file, which can be text or binary, is not versioned, and is easily overwritten. The filesystem is usually on a disk connected to your machine: physically connected on-prem, attached in the cloud, or even distributed.

The first thing to know about disks is that their speed and bandwidth vary widely, from hard disks to solid-state disks. There are two orders of magnitude of difference between the slowest (SATA SSD) and the fastest (NVMe SSD) disks. Below are some latency numbers you should know, with the human-scale numbers in parentheses:

Object storage is an API over the filesystem. Its fundamental unit is an object, usually in a binary format (an image, a sound file, a text file, etc.). We can build versioning or redundancy into the object storage service. It is not as fast as the local filesystem, but it can be fast enough within the cloud.

Databases are persistent, fast, and scalable systems for storing and retrieving structured data. A helpful mental model is this: all the data that the database holds is actually in the computer's RAM, but the database software ensures that if the computer gets turned off, everything is safely persisted to disk. If too much data is in RAM, it scales out to disk in a performant way.

You should not store binary data in the database; store the object-store URLs instead. Postgres is the right choice most of the time. It is an open-source database that supports unstructured JSON and queries over that JSON. SQLite is also perfectly good for small projects.
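To make the "URLs, not binaries" advice concrete, here is a minimal sketch using SQLite from the Python standard library. The table name, column names, and the `s3://example-bucket/...` URLs are placeholders invented for illustration, not something from the lecture.

```python
import sqlite3

# In-memory database for illustration; a real project would use a file or Postgres.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE images (
        id INTEGER PRIMARY KEY,
        label TEXT NOT NULL,
        object_url TEXT NOT NULL  -- pointer to the binary, not the binary itself
    )
    """
)

# Hypothetical bucket and keys, used only as placeholders.
rows = [
    ("cat", "s3://example-bucket/images/0001.jpg"),
    ("dog", "s3://example-bucket/images/0002.jpg"),
    ("cat", "s3://example-bucket/images/0003.jpg"),
]
conn.executemany("INSERT INTO images (label, object_url) VALUES (?, ?)", rows)
conn.commit()

# Query the small structured metadata; fetch the large binaries from object storage later.
cat_urls = [
    url
    for (url,) in conn.execute(
        "SELECT object_url FROM images WHERE label = ? ORDER BY id", ("cat",)
    )
]
print(cat_urls)
```

The database stays small and fast because it only holds metadata, while the images themselves live in object storage.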

Most coding projects that deal with collections of objects that reference each other will eventually implement a crappy database. Using a database from the beginning will likely save you time. In fact, most MLOps tools are databases at their core (e.g., W&B is a database of experiments, HuggingFace Hub is a database of models, and LabelStudio is a database of labels).

Data warehouses are stores for online analytical processing (OLAP), as opposed to databases being the data stores for online transaction processing (OLTP). You get data into the data warehouse through a process called ETL (Extract-Transform-Load): given a number of data sources, you extract the data, transform it into a uniform schema, and load it into the data warehouse. From the warehouse, you can run business intelligence queries. The main difference between OLAP and OLTP systems is that OLAP stores are column-oriented, while OLTP stores are row-oriented.
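The row-oriented versus column-oriented distinction can be illustrated with plain Python containers. This is only a toy sketch of the two layouts, with made-up order records, and says nothing about how any particular warehouse is implemented.

```python
# Row-oriented layout (OLTP style): each record is stored together.
rows = [
    {"order_id": 1, "customer": "ana", "amount": 30.0},
    {"order_id": 2, "customer": "ben", "amount": 12.5},
    {"order_id": 3, "customer": "ana", "amount": 7.5},
]

# Column-oriented layout (OLAP style): each column is stored together.
columns = {
    "order_id": [r["order_id"] for r in rows],
    "customer": [r["customer"] for r in rows],
    "amount": [r["amount"] for r in rows],
}

# Transactional access: fetch one whole record by key.
order_2 = next(r for r in rows if r["order_id"] == 2)

# Analytical access: aggregate a single column; the other columns are never touched.
total_amount = sum(columns["amount"])

print(order_2, total_amount)
```

Reading or updating a single order is natural in the row layout, while aggregates over one field only need to scan one list in the column layout.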

Data lakes are unstructured aggregations of data from multiple sources. The main difference between them and data warehouses is that data lakes use an ELT (Extract-Load-Transform) process: dumping all the data in and transforming it for specific needs later.

The big trend is unifying both data lake and data warehouse, so that structured data and unstructured data can live together. The two big platforms for this are Snowflake and Databricks. If you are really into this stuff, "Designing Data-Intensive Applications" is a great book that walks through it from first principles.

To explore the data, you must speak its language: mostly SQL and, increasingly, DataFrames. SQL is the standard interface for structured data and has existed for decades. Pandas is the main DataFrame library in the Python ecosystem, and it lets you do SQL-like things. Our advice is to become fluent in both so that you can work with transactional databases as well as analytical warehouses and lakes.
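As a small illustration of "SQL-like things" in Pandas, the sketch below runs the same aggregation twice: once in SQL against an in-memory SQLite database and once on a DataFrame. The `photos` table and its columns are invented for the example.

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE photos (user_id INTEGER, likes INTEGER)")
conn.executemany(
    "INSERT INTO photos VALUES (?, ?)",
    [(1, 10), (1, 5), (2, 7), (2, 1), (3, 4)],
)
conn.commit()

# SQL: total likes per user.
sql_result = conn.execute(
    "SELECT user_id, SUM(likes) FROM photos GROUP BY user_id ORDER BY user_id"
).fetchall()

# DataFrame: load the raw rows, then express the same GROUP BY in Pandas.
df = pd.read_sql_query("SELECT user_id, likes FROM photos", conn)
df_result = df.groupby("user_id")["likes"].sum().sort_index()

print(sql_result)
print(df_result)
```

Both paths compute the same per-user totals; which one is more convenient tends to depend on where the data already lives.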

Our ultimate task is to train the photo predictor model, but first we need to output data from the database, compute the logs, and run classifiers to output their predictions. As a result, we have task dependencies. Some tasks can't start until others are finished, so finishing a task should kick off the tasks that depend on it.

Airflow is a standard scheduler for Python, where it's possible to specify the DAG (directed acyclic graph) of tasks using Python code. The operators in that graph can be SQL operations or Python functions.
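The idea of a task DAG can be sketched without Airflow at all, using `graphlib.TopologicalSorter` from the Python standard library (Python 3.9+). This is not Airflow's API, just an illustration of dependency-ordered execution; the task names are hypothetical and loosely follow the example above.

```python
from graphlib import TopologicalSorter

# Hypothetical task names; each key maps to the set of tasks it depends on.
dag = {
    "export_db": set(),
    "compute_logs": set(),
    "run_classifiers": {"export_db"},
    "train_model": {"export_db", "compute_logs", "run_classifiers"},
}


def run(task: str) -> None:
    # Placeholder for real work (a SQL query, a Python function, ...).
    print(f"running {task}")


sorter = TopologicalSorter(dag)
sorter.prepare()

execution_order = []
while sorter.is_active():
    # get_ready() returns every task whose dependencies have all finished.
    for task in sorted(sorter.get_ready()):
        run(task)
        execution_order.append(task)
        sorter.done(task)  # marking a task done unblocks the tasks that depend on it
```

A real scheduler adds what this sketch leaves out: retries, running ready tasks in parallel across machines, and triggering on a schedule.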

The primary advice here is not to over-engineer things. You can get machines with many CPU cores and a lot of RAM nowadays. For example, UNIX has powerful parallelism, streaming, and highly optimized tools.

Let's say your data processing generates artifacts you need for training. How do you make sure that, in production, the trained model sees the same processing that took place during training? How do you avoid recomputation during retraining?

The first mention of feature stores came from this Uber blog post describing their ML platform, Michelangelo. They had an offline training process and an online prediction process, so they built an internal feature store for both processes to be in sync.
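The core idea, one feature definition serving both the offline and the online path, can be shown with a toy in-memory store. Everything here (the `account_age_days` feature, the dictionary used as a store, the function names) is invented for illustration and is not how Michelangelo or any real feature store is built.

```python
from datetime import date


def account_age_days(signup: date, as_of: date) -> int:
    """Single feature definition shared by training and serving."""
    return (as_of - signup).days


# Hypothetical in-memory "store" keyed by user id.
feature_store: dict[int, dict[str, int]] = {}


def materialize(user_id: int, signup: date, as_of: date) -> None:
    # Compute the feature once and save it, so retraining does not recompute it.
    feature_store[user_id] = {"account_age_days": account_age_days(signup, as_of)}


def get_features(user_id: int) -> dict[str, int]:
    return feature_store[user_id]


materialize(42, date(2024, 1, 1), date(2024, 1, 31))
training_row = get_features(42)  # offline: used to build the training set
serving_row = get_features(42)   # online: the same values at prediction time
print(training_row, serving_row)
```

Because both paths read from the same store, the model cannot see one version of a feature in training and another in production.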

HuggingFace Datasets is a great source of machine learning-ready data. There are 8000+ datasets covering a wide variety of tasks, like computer vision, NLP, etc. The Github-Code dataset on HuggingFace is a good example of how these datasets are well-suited for ML applications. Github-Code can be streamed, is in the modern Apache Parquet format, and doesn't require you to download 1TB+ of data in order to properly work with it. Another sample dataset is RedCaps, which consists of 12M image-text pairs from Reddit.

Self-supervised learning is a very important idea that allows you to avoid painstakingly labeling all of your data. You can use parts of your data to label other parts of your data. This is very common in NLP right now. This is further covered in the foundation model lecture. The long and short of it is that models can have elements of their data masked (e.g., the end of a sentence can be omitted), and models can use earlier parts of the data to predict the masked parts (e.g., I can learn from the beginning of the sentence and predict the end). This can even be used across modalities (e.g., computer vision and text), as OpenAI CLIP demonstrates.
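Here is a minimal sketch of how raw text can label itself: every prefix of a sentence becomes an input, and the word that follows becomes the target. The function name is made up, and real language models work on tokens rather than whitespace-separated words.

```python
def make_next_word_examples(sentence: str) -> list[tuple[str, str]]:
    """Turn one unlabeled sentence into (context, target) training pairs."""
    words = sentence.split()
    # The earlier part of the sentence is the input; the masked next word is the label.
    return [(" ".join(words[:i]), words[i]) for i in range(1, len(words))]


examples = make_next_word_examples("the cat sat down")
for context, target in examples:
    print(f"{context!r} -> {target!r}")
```

No human annotation is involved: a four-word sentence yields three labeled examples for free.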

Image data augmentation is an almost compulsory technique to adopt, especially for vision tasks. Frameworks like torchvision help with this. In data augmentation, samples are modified (e.g., brightened) without actually changing their core "meaning." Interestingly, augmentation can actually replace labels. SimCLR is a model that demonstrates this: its learning objective is to maximize agreement between augmented views of the same image and minimize agreement between different images.

For other forms of data, there are a couple of augmentation tricks that can be applied. You can delete some cells in tabular data to simulate missing data. In text, there aren't established techniques, but ideas include changing the order of words or deleting words. In speech, you could change the speed, insert pauses, etc.
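Two of these tricks, deleting tabular cells and deleting words, are simple enough to sketch with the standard library. The function names and the drop probability `p` are assumptions made for this example; sensible values depend on your data.

```python
import random


def drop_cells(row: dict, p: float, rng: random.Random) -> dict:
    """Simulate missing data by blanking each cell with probability p."""
    return {k: (None if rng.random() < p else v) for k, v in row.items()}


def drop_words(text: str, p: float, rng: random.Random) -> str:
    """Delete each word with probability p; fall back to the original if all are dropped."""
    words = text.split()
    kept = [w for w in words if rng.random() >= p]
    return " ".join(kept) if kept else text


rng = random.Random(0)  # fixed seed so augmentations are reproducible
print(drop_cells({"age": 30, "city": "Oslo", "income": 52000}, p=0.3, rng=rng))
print(drop_words("the quick brown fox jumps over the lazy dog", p=0.3, rng=rng))
```

Passing an explicit `random.Random` instance keeps the augmentation reproducible, which helps when debugging a training run.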

Synthetic data is an underrated idea. You can synthesize data based on your knowledge of the label. For example, you can create receipts if you need to learn how to recognize receipts from images. This can get very sophisticated and deep, so tread carefully.

You can also get creative and ask your users to label data for you. Google Photos, as any user of the app knows, regularly gets users to label whether the people in two photos are the same or different.

Labeling has standard annotation features, like bounding boxes, that help capture information properly. Training annotators properly is more important than the particular kind of annotation. Standardizing how annotators approach a complex, subjective task is crucial. Labeling guidelines can help capture the exact right label from an annotator. Quality assurance is key to ensuring annotation and labeling are happening properly.

Full-service companies offer a great solution that abstracts away the need to build software, manage labor, and perform quality checks. It makes sense to use one. Before settling on one, make sure to dedicate time to vet several. Additionally, label some gold standard data yourself, both to understand the data and to evaluate contenders. Take calls with several contenders, ask for work samples on your data, and compare them to your own labeling performance.

LabelStudio is an open-source solution for performing annotation yourself, with a companion enterprise version. It has a great set of features that allow you to design your interface and even plug in models for active learning!

Snorkel is a dataset management and labeling platform that uses weak supervision, which is a similar concept. You can leverage composable rules (e.g., all sentences that have the term "amazing" are positive sentiments) that allow you to label data faster than if you were to treat every data point the same.
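The flavor of composable rules can be shown in plain Python. This sketch does not use Snorkel's API, and it combines rules with a simple majority vote as a stand-in for the more sophisticated label aggregation a real weak supervision system performs. The rules themselves are toy examples.

```python
POSITIVE, NEGATIVE, ABSTAIN = 1, 0, -1


def lf_amazing(text: str) -> int:
    return POSITIVE if "amazing" in text.lower() else ABSTAIN


def lf_great(text: str) -> int:
    return POSITIVE if "great" in text.lower() else ABSTAIN


def lf_terrible(text: str) -> int:
    return NEGATIVE if "terrible" in text.lower() else ABSTAIN


LABELING_FUNCTIONS = [lf_amazing, lf_great, lf_terrible]


def weak_label(text: str) -> int:
    """Combine the rules by majority vote; abstain on ties or when no rule fires."""
    votes = [lf(text) for lf in LABELING_FUNCTIONS]
    pos, neg = votes.count(POSITIVE), votes.count(NEGATIVE)
    if pos == neg:
        return ABSTAIN
    return POSITIVE if pos > neg else NEGATIVE


for sentence in ["This movie was amazing", "Terrible plot", "It was fine"]:
    print(sentence, "->", weak_label(sentence))
```

Each rule is noisy on its own, but a handful of them can label a large pool of text quickly, leaving only the abstentions and conflicts for closer inspection.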

In conclusion, try to avoid labeling by using techniques like self-supervised learning. If you can't, use labeling software and eventually outsource the work to the right vendor. If you can't afford vendors, consider hiring part-time workers rather than crowdsourcing the work, to ensure quality.

Level 0 is bad. In this case, data just lives on some filesystem. The problem is that the models end up unversioned because their data is unversioned; models are part code, part data. As a result, you will be unable to get back to a previous level of performance if need be.

DVC is a great tool for this. DVC helps upload your data asset to a remote storage location every time you commit changes to the data file or trigger a commit; it functions like a fancier git-lfs. It adds features like lineage for data and model artifacts, allowing you to recreate pipelines.
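The general idea behind versioning data next to code can be sketched as follows: hash the data file and write the hash to a small pointer file that is cheap to commit to git. This is only an illustration of the concept. The `.pointer.json` format, the function names, and the choice of SHA-256 are assumptions for this example and do not reproduce DVC's actual file format or commands.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Content hash of a file, read in chunks so large files fit in memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def write_pointer(data_path: Path) -> Path:
    """Write a small, git-friendly pointer file describing the data's current contents."""
    pointer = data_path.with_name(data_path.name + ".pointer.json")
    pointer.write_text(
        json.dumps({"path": data_path.name, "sha256": file_sha256(data_path)})
    )
    return pointer


with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "train.csv"

    data.write_text("x,y\n1,2\n")
    v1 = json.loads(write_pointer(data).read_text())

    data.write_text("x,y\n1,2\n3,4\n")  # the dataset changes
    v2 = json.loads(write_pointer(data).read_text())

print(v1["sha256"] != v2["sha256"])
```

Committing the pointer file alongside the training code ties each model to the exact data it saw, which is what makes it possible to return to a previous level of performance.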