The first is the UnstructuredFileLoader. This has a simple interface (you just pass it a file path), but under the hood Unstructured does a lot of smart work to infer the file type (PDF, PowerPoint, image, etc.) and extract its text.
The second is the DirectoryLoader. Again, this has a pretty simple interface: it takes a path to a directory and an optional glob pattern to match files against. But under the hood it loops over all matching files and uses the above UnstructuredFileLoader to load them. This makes it possible to load files of many types in a single call.
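As a minimal sketch of both loaders (the file and directory paths here are illustrative), using the import paths from classic langchain releases; newer versions expose the same classes under langchain_community.document_loaders:

```python
from langchain.document_loaders import UnstructuredFileLoader, DirectoryLoader

# One file: Unstructured infers the type (PDF, PowerPoint, image, ...).
docs = UnstructuredFileLoader("example_data/report.pdf").load()

# A whole folder: each matching file is loaded with UnstructuredFileLoader,
# and the resulting documents are concatenated into one list.
all_docs = DirectoryLoader("example_data/", glob="**/*").load()
```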
@thorfdbg's solution is simple and elegant, and I recall seeing the same method advocated in a type-in listing in one of the old Atari magazines in the late 80s. For a short while, I had a library of SDX-related commands (CHDIR, MKDIR and such) which I would ENTER as required.
LangChain is a framework for building applications powered by AI models such as Large Language Models (LLMs) that answer queries in natural language. To ground these applications, the developer needs to supply a large pool of data so the model can answer a variety of questions from different users. LangChain allows developers to use directory loaders to pull in data from many locations at once.
To use the file directory loader in LangChain, simply install the langchain, openai, and unstructured packages to load files from a directory. The LangChain framework offers multiple ways of using the DirectoryLoader() function with different strategies. This guide has illustrated the process of using the file directory loader with multiple methods in LangChain.
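Two common variations are sketched below (paths and patterns are illustrative): the loader_cls parameter swaps the per-file loader, and show_progress requires the tqdm package.

```python
from langchain.document_loaders import DirectoryLoader, TextLoader

# Default strategy: every matched file goes through UnstructuredFileLoader.
docs = DirectoryLoader("data/", glob="**/*.md").load()

# Alternate strategy: force plain-text loading with a specific loader class,
# showing a progress bar while the directory is traversed.
text_docs = DirectoryLoader(
    "data/", glob="**/*.txt", loader_cls=TextLoader, show_progress=True
).load()
```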
This example goes over how to load data from folders with multiple files. The second argument is a map of file extensions to loader factories. Each file will be passed to the matching loader, and the resulting documents will be concatenated together.
In previous posts, I kick-started my large language model (LLM) exploration journey with simple and persistent-memory chatbots. LLMs are phenomenal, but if you want to extend the pre-trained corpus of knowledge, you need to insert context or fine-tune the models. In this post, we'll explore the former, i.e. in-context learning, using LlamaIndex for data ingestion and indexing.
LlamaIndex (previously called GPT Index) is an open-source project that provides a simple interface between LLMs and external data sources like APIs, PDFs, SQL etc. It provides indices over structured and unstructured data, helping to abstract away the differences across data sources. It can store context required for prompt engineering, work around the limits of the model's context window when the context is too big to fit, and help make a trade-off between cost and performance during queries.
Let's create a simple index.py file for this tutorial with the code below. We'll use the paul_graham_essay.txt file from the examples folder of the LlamaIndex GitHub repository as the document to be indexed and queried. You can also replace this file with your own document, or extend the code to prompt the user for a file instead. You'll find my complete code here.
In the sample code below, we load and index the documents from the data folder using a simple vector store index, and then query the index for the information requested by the user. For the frontend, we use Streamlit to create a simple question submission field, with the ability to dynamically update the form with the response. We use the text-davinci-003 model by default, but you can replace it.
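A minimal sketch of what that index.py can look like follows; I'm assuming a recent llama_index release here (the core classes live under llama_index.core), so the layout is illustrative rather than the exact original code, which used text-davinci-003.

```python
# index.py -- load documents from data/, build a vector store index,
# and answer questions submitted through a Streamlit form.
import streamlit as st
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

st.title("Document Q&A")

# Load every document in the data/ folder and index it.
documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)

# Simple question-submission field; the page updates with the response.
question = st.text_input("Ask a question about the indexed documents")
if question:
    response = index.as_query_engine().query(question)
    st.write(str(response))
```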
The SDE modules within FME are designed to work with other SDE products. For example, a user with a simple GUI search engine can easily identify all features satisfying a complicated query, then use FME with the SDE reader module to process these features.
That is, these files must be loaded by a special loader (which is part of the Editor/Assembler cartridge, but also of the Extended Basic cartridge). This loader is able to place the contents in almost any memory location, automatically adjusting references to addresses.
Referencing the layout above, there are four major sections we need to deal with. These are the Bootloader, FAT tables, root entries, and data/clusters. I can format the disk into these four sections with the below function.
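Here is a minimal sketch of such a function in Python, writing a FAT16-style disk image; the geometry constants and BPB field values are illustrative assumptions rather than the exact values from the layout above.

```python
import struct

SECTOR_SIZE = 512
SECTORS_PER_FAT = 32
NUM_FATS = 2
ROOT_ENTRIES = 512          # each root directory entry is 32 bytes
TOTAL_SECTORS = 65536

def format_disk(path: str) -> None:
    # 1) Bootloader: build a boot sector with a minimal BPB.
    boot = bytearray(SECTOR_SIZE)
    boot[0:3] = b"\xEB\x3C\x90"                      # jump instruction
    struct.pack_into("<H", boot, 11, SECTOR_SIZE)    # bytes per sector
    boot[13] = 4                                     # sectors per cluster
    struct.pack_into("<H", boot, 14, 1)              # reserved sectors
    boot[16] = NUM_FATS
    struct.pack_into("<H", boot, 17, ROOT_ENTRIES)
    struct.pack_into("<H", boot, 22, SECTORS_PER_FAT)
    boot[510:512] = b"\x55\xAA"                      # boot signature

    with open(path, "wb") as img:
        img.write(boot)

        # 2) FAT tables: clusters 0 and 1 are reserved markers.
        fat = bytearray(SECTORS_PER_FAT * SECTOR_SIZE)
        fat[0:4] = b"\xF8\xFF\xFF\xFF"
        for _ in range(NUM_FATS):
            img.write(fat)

        # 3) Root directory entries, zeroed (all slots free).
        img.write(bytes(ROOT_ENTRIES * 32))

        # 4) Data/clusters: whatever space remains on the disk.
        data_sectors = (TOTAL_SECTORS - 1 - NUM_FATS * SECTORS_PER_FAT
                        - (ROOT_ENTRIES * 32) // SECTOR_SIZE)
        img.write(bytes(data_sectors * SECTOR_SIZE))
```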
I am looking for a simple feature where I can log what a user inputs in a text box, i.e. save it to a CSV/.txt file. At the moment, it seems like I cannot write a file and save it to the repo, or even open an existing file, append text to it, and save it again. Is there some way to record/log user inputs to the repo?
There are different ways to add the repository through yum, dnf, and apt-get; describing them all is beyond the scope of this article. To make it simple, this example will use apt-get, but the idea is similar for the other options.
Accessing Knowledge Graphs from Google, DBPedia, and Wikidata allows you to integrate real world facts and knowledge with your applications. While I mostly work in the field of deep learning I frequently also use Knowledge Graphs in my work and in my personal research. I think that you, dear reader, might find accessing highly structured data in KGs to be more reliable and in many cases simpler than using web scraping.
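As a small example, here is a sketch that queries the public Wikidata SPARQL endpoint; I'm assuming the SPARQLWrapper package here, though any HTTP client would work.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://query.wikidata.org/sparql")
sparql.setQuery("""
SELECT ?country ?countryLabel WHERE {
  ?country wdt:P31 wd:Q6256 .        # instance of: country
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 5
""")
sparql.setReturnFormat(JSON)

# Print the label of each returned entity.
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["countryLabel"]["value"])
```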
If you already use Google Drive to store your working notes and other documents, then you might want to expand the simple example in this chapter to build your own query system for your documents. In addition to Google Drive, I also use Microsoft Office 365 and OneDrive in my work and personal projects.
Using the Zapier service is simple. You register the services you want to interact with on the Zapier developer website, and then you can express how you want to interact with those services using natural language prompts.
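As a rough sketch, this is what that looked like with the Zapier NLA integration shipped in older LangChain releases (the NLA service has since been deprecated); it assumes ZAPIER_NLA_API_KEY is set, and the prompt text is illustrative.

```python
from langchain.llms import OpenAI
from langchain.agents import initialize_agent
from langchain.agents.agent_toolkits import ZapierToolkit
from langchain.utilities.zapier import ZapierNLAWrapper

# Requires OPENAI_API_KEY and ZAPIER_NLA_API_KEY in the environment.
llm = OpenAI(temperature=0)
toolkit = ZapierToolkit.from_zapier_nla_wrapper(ZapierNLAWrapper())
agent = initialize_agent(
    toolkit.get_tools(), llm,
    agent="zero-shot-react-description", verbose=True
)

# The natural language prompt is the interaction with the registered service.
agent.run("Summarize the last email I received and send the summary to Slack.")
```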
I have a long work history of writing natural language interfaces for relational databases that I will review in the chapter wrap up. For now, I invite you to be amazed at how simple it is to write the LangChain scripts for querying a database in natural language.
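As a taste, here is a hedged sketch using the SQLDatabaseChain from older LangChain releases (later versions moved it to langchain_experimental); the SQLite database file is an assumption.

```python
from langchain import OpenAI, SQLDatabase, SQLDatabaseChain

# Point the chain at any SQLAlchemy-compatible database.
db = SQLDatabase.from_uri("sqlite:///chinook.db")
llm = OpenAI(temperature=0)
db_chain = SQLDatabaseChain.from_llm(llm, db, verbose=True)

# Ask a plain-English question; the chain writes and runs the SQL.
db_chain.run("How many employees are there?")
```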
The last book I wrote, Practical Python Artificial Intelligence Programming, used an OpenAI cookbook example (openai-cookbook/blob/main/examples/Backtranslation_of_SQL_queries.py) that shows relatively simple code (relative to my older hand-written Java and Common Lisp code) for an NLP database interface.
While using APIs from OpenAI, Anthropic, and other providers is simple and frees developers from having to run LLMs themselves, new tools like Llama.cpp make it easier and less expensive to run and deploy LLMs yourself. My preference, dear reader, is to have as much control as possible over the software and systems that I depend on and experiment with.
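For example, here is a minimal sketch using the llama-cpp-python bindings; the model path is an assumption, and any local GGUF model file will do.

```python
from llama_cpp import Llama

# Load a quantized model from local disk (path is illustrative).
llm = Llama(model_path="models/llama-2-7b.Q4_K_M.gguf")

# Run a completion entirely on your own hardware.
out = llm("Q: What is LangChain? A:", max_tokens=64)
print(out["choices"][0]["text"])
```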
When I wanted to experiment with generative models backed by my personal recipe data to create new recipes, having the data from my previous project available, along with tools like the OpenAI APIs and LangChain, made this experiment simple to set up and run. It is a common theme in this book that it is now relatively easy to create personal projects based on our own data and interests.
Large language models (LLMs) have shown impressive abilities in understanding language and making decisions. However, combining reasoning and acting in one model is newer work, with some promising results. Here we look at using LLMs to generate both reasoning traces and task-specific actions in an interleaved manner. This allows for better synergy between the two: reasoning traces help the model create and update action plans, while actions let it gather more information from external sources. On question answering and fact verification tasks, ReAct avoids errors by interacting with a simple Wikipedia API and generates human-like solutions. On interactive decision-making tasks, ReAct achieves higher success rates than other methods, even with limited examples.
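The following is a minimal, self-contained sketch of the ReAct loop, not the paper's code: a scripted fake_llm and a one-entry lookup table stand in for a real LLM and the Wikipedia API, just to show how thoughts, actions, and observations interleave.

```python
def wikipedia_lookup(query: str) -> str:
    # Stub standing in for the simple Wikipedia API the paper uses.
    kb = {"Colorado orogeny": "The Colorado orogeny was an episode of "
                              "mountain building in Colorado."}
    return kb.get(query, "No results found.")

def fake_llm(prompt: str) -> str:
    # A real system would call an LLM here; we script two steps for the demo.
    if "Observation:" not in prompt:
        return ("Thought: I should look up the Colorado orogeny.\n"
                "Action: Search[Colorado orogeny]")
    return ("Thought: I now know the answer.\n"
            "Finish[an episode of mountain building in Colorado]")

def react(question: str, max_steps: int = 5) -> str:
    prompt = f"Question: {question}\n"
    for _ in range(max_steps):
        step = fake_llm(prompt)          # reasoning trace + proposed action
        prompt += step + "\n"
        if "Finish[" in step:            # the model decides it is done
            return step.split("Finish[")[1].rstrip("]")
        if "Action: Search[" in step:    # act, then feed the observation back
            query = step.split("Search[")[1].rstrip("]")
            prompt += f"Observation: {wikipedia_lookup(query)}\n"
    return "No answer found."

print(react("What is the Colorado orogeny?"))
```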
I will show one simple example that I run on my laptop to search the contents of all of the books I have written as well as a large number of research papers. You can find my example in the GitHub repository for this book in the directory langchain-book-examples/embedchain_test. As usual, you will need an OpenAI API account and set the environment variable OPENAI_API_KEY to the value of your key.
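As a hedged sketch of the embedchain flow (the file name is illustrative, and recent embedchain releases auto-detect the source type):

```python
from embedchain import App

# Assumes OPENAI_API_KEY is set in the environment.
app = App()
app.add("docs/my_book.pdf")            # index a local document
answer = app.query("What topics does the book cover?")
print(answer)
```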
This book was frustrating to write in the sense that it is now so easy to build applications that just a few years ago would have been impossible. Usually when I write books I have two criteria: I only write about things that I am personally interested in and use, and I hope to figure out non-obvious edge cases and make it easier for my readers to use new tech. Here my frustration is that I am writing about something that is increasingly simple to do, so I feel like my value is diminished.
Query decomposition allows us to break down complex queries into simpler, targeted ones. A keyword index will enable us to route the queries through keyword searches. Finally, a vector store index allows us to process semantic information. Putting all of this together, we can answer a question that requires information from many sources.
First, the transformer breaks the question into simple queries that a single data source can answer. Then, it uses the keyword index to route the simple queries to the right data source and the vector store index to answer the question. Finally, the question transformer combines the information and answers our original, complex query.
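To make the flow concrete, here is an illustrative Python sketch with stubbed components; the sub-questions, keyword table, and source names are hypothetical, and a real implementation would use LLM-driven decomposition and a semantic search over a vector store.

```python
def decompose(question: str) -> list[str]:
    # Stub: a real system would use an LLM to split the complex question.
    return ["What was Lyft's revenue in 2021?",
            "What was Uber's revenue in 2021?"]

KEYWORD_INDEX = {"lyft": "lyft_10k", "uber": "uber_10k"}

def route(sub_query: str) -> str:
    # Keyword routing: pick the source whose keyword appears in the query.
    for keyword, source in KEYWORD_INDEX.items():
        if keyword in sub_query.lower():
            return source
    return "default_source"

def answer_with_vector_store(source: str, sub_query: str) -> str:
    # Stub: a real system would run a semantic search over `source`.
    return f"[answer to '{sub_query}' from {source}]"

def answer(question: str) -> str:
    # Decompose, route each sub-query, then combine the partial answers.
    parts = [answer_with_vector_store(route(q), q) for q in decompose(question)]
    return " ".join(parts)

print(answer("Compare Lyft's and Uber's 2021 revenue."))
```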
We no longer have a simple Python script. We now have a data pipeline, and data pipelines need an orchestrator like Dagster. Dagster makes it fast and easy for us to add this multi-step caching capability, as well as to support additional features like automatic scheduling and sensors that re-run the pipeline on external triggers.
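As a sketch of what that looks like with Dagster's software-defined assets API (the asset names and bodies here are illustrative stubs): each asset is materialized and cached independently, which is what gives us the multi-step caching.

```python
from dagster import asset, Definitions

@asset
def raw_documents() -> list[str]:
    # Step 1: load source files (stubbed here).
    return ["doc one text", "doc two text"]

@asset
def document_index(raw_documents: list[str]) -> dict:
    # Step 2: build the index from the loaded documents (stubbed here).
    # Dagster wires the dependency from the matching parameter name.
    return {i: doc for i, doc in enumerate(raw_documents)}

defs = Definitions(assets=[raw_documents, document_index])
```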
Suppose you have a large corpus of text on economics that you'd like to build an NLP app over. Your corpus may be a mix of text files, PDF documents, HTML web pages, images, and more. Currently, document loaders leverage the Python library Unstructured to convert these raw data sources into text that can be processed.