
The State of Autonomous AI Agents

by Vine Ventures (8 min read)
Jul 06, 2023

The current iteration of AI, specifically GenAI, has already left its mark on design, marketing, and even gaming use cases. Yet, it stumbles when it comes to tackling intricate, real-life tasks that aren’t about answering questions or generating stories. This blog attempts to provide a glimpse into the realm of Large Language Models (LLMs) and the extraordinary potential of Autonomous AI Agents, exploring their current architecture, challenges, and future directions. While LLMs have taken giant leaps in language comprehension, they struggle with issues like inaccuracy, unpredictability, and nailing the context. Autonomous AI Agents are possible game-changers. Equipped with self-learning and adaptability, they might have what it takes to solve these challenges. And if the creators (AI researchers & developers) pull this off, we could be on the brink of a transformative technological shift, changing the game for industries, revving the pace of scientific discovery, and reshaping society as we know it.

LLMs in Focus: The Remarkable Shift to In-Context Learning

LLMs’ departure from conventional ML approaches makes their operation a bit of an enigma, leading to issues like inaccurate answers, hallucinations, and failures on multi-step tasks. Recently, developers have woven ‘in-context learning’ into the LLM fabric to combat these hurdles. The idea is to use LLMs as-is, off the shelf, and steer their behavior via smart prompting.

This practice, in theory, translates to pouring a wealth of data from an existing knowledge base into a GPT-4 prompt paired with your query. Though powerful, this ‘contextualizing’ method doesn’t scale well. GPT-4, for instance, can only process approximately 50 pages of text, and as the context window fills up, accuracy degrades and inference time grows. To optimize LLMs and in-context learning, a three-tiered infrastructure is taking shape (more on these tiers from Matt Bornstein at a16z).

The LLM Stack:

  1. Data Preparation and Embedding: breaking data into chunks that are passed through an embedding model, then stored in a vector database for later retrieval
  2. Prompt Assembly: a mix of a prompt template hard-coded by the developer, examples of valid outputs called few-shot examples, information retrieved from external APIs, and additional contextual data from the vector database (Prompt Engineering guides cover this step in depth)
  3. Prompt Execution: submitting the assembled prompt to a proprietary or open-source LLM for inference, often alongside operational tooling such as caching, logging, and validation (a minimal sketch of all three tiers follows below)
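
To make the three tiers concrete, here is a minimal, self-contained Python sketch. The embed() function is a toy stand-in for a real embedding model, and a plain in-memory list stands in for the vector database; all names and prompts here are illustrative assumptions, not a production design.

```python
import hashlib
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Toy deterministic pseudo-embedding; a stand-in for a real embedding model."""
    seed = int.from_bytes(hashlib.sha256(text.encode()).digest()[:4], "little")
    vec = np.random.default_rng(seed).standard_normal(dim)
    return vec / np.linalg.norm(vec)

# Tier 1 -- Data Preparation and Embedding: chunk the source data and store
# each chunk's vector in a vector database (here, just an in-memory list).
chunks = ["LLMs read a limited context window.",
          "Vector databases store embeddings for later retrieval."]
vector_db = [(embed(c), c) for c in chunks]

# Tier 2 -- Prompt Assembly: embed the query, pull the closest chunks, and
# combine a template with the retrieved context (few-shot examples omitted).
query = "How do LLM apps supply external context?"
q = embed(query)
context = [c for _, c in sorted(vector_db, key=lambda p: -float(p[0] @ q))[:2]]
prompt = ("Answer using the context below.\n"
          "Context:\n- " + "\n- ".join(context) + "\n"
          f"Question: {query}\nAnswer:")

# Tier 3 -- Prompt Execution: the assembled prompt would now be sent to an
# LLM API for inference; here we just print it.
print(prompt)
```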

Though the toolbox for each tier is under constant refinement, its potential has limits. High-quality prompting currently unveils ‘emergent capabilities’ only across a narrow range of tasks, excelling at Q&A and text summarization. In-context learning is effective for dynamic learning, yet it falls short on multi-step tasks that need access to vast external data, APIs, and tools. To overcome these challenges, we need a more advanced technology stack that gives LLMs greater planning, reasoning, and access to external tools. The next section delves into the exciting potential of Autonomous AI Agents, a technology that promises to solve these challenges.

From Demos to Dominance: The Exciting Potential of AI Agents

Artificial General Intelligence (AGI) will take the form of an AI agent, not just one agent, but many. What’s even more interesting is that you guys (developers) are at the forefront of building AI agents today, not the LLM labs.

– Andrej Karpathy, June 24th, 2023.

There’s been quite a buzz around groundbreaking open-source AI agent projects like AutoGPT, BabyAGI, and GPT-Engineer. With their intriguing demos, they’ve captured the attention of not just AI thought leaders on Twitter but also those who tend to be more skeptical about AI. While these demonstrations have an undeniable wow factor, they are mostly cherry-picked and fail to operate in real-life environments. However, agent architectures are evolving rapidly, with exciting new papers being published almost weekly, and we are only at the very beginning of this evolution. As with any emerging technology, the exciting potential is in the fine print – so let’s dive deeper.

The fundamentals of an AI Agent System – Planning, Memory, and Tools:

You can think of an agent as a body with an LLM as its brain. The agent utilizes a blend of memory and external tools in a thoughtful, cyclical process, with the capacity to learn, adapt, and improve from its environment. Each agent has the following components:

Planning:
Complicated tasks usually involve many steps. An agent must be able to identify these steps and plan ahead.

  • Subgoal and decomposition: The agent breaks down large tasks into smaller, manageable subgoals, enabling efficient handling of complex tasks
  • Reflection and refinement: The agent can self-criticize and reflect on past actions, learn from mistakes, and refine its approach for future steps, thereby improving the quality of final results (a minimal sketch of this planning loop follows below)
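
As a rough illustration of how subgoal decomposition and reflection can combine in one loop, here is a hedged Python sketch. The llm() helper is hypothetical (it would wrap a real model API), and the prompts and the naive ‘OK’ stopping check are illustrative assumptions only:

```python
def llm(prompt: str) -> str:
    """Hypothetical stand-in for a call to a real LLM API."""
    raise NotImplementedError("wire this to a model provider")

def run_task(task: str, max_revisions: int = 3) -> list[str]:
    # Subgoal decomposition: ask the model to split the task into ordered steps.
    subgoals = llm(f"Break this task into short, ordered subgoals:\n{task}").splitlines()
    results = []
    for goal in subgoals:
        result = llm(f"Carry out this subgoal and report the outcome:\n{goal}")
        # Reflection and refinement: self-critique the outcome, then revise it.
        for _ in range(max_revisions):
            critique = llm(f"Critique the outcome of '{goal}'. Reply OK if acceptable:\n{result}")
            if critique.strip() == "OK":  # naive stopping check, for illustration only
                break
            result = llm(f"Revise the outcome using this critique:\n{critique}\n\n{result}")
        results.append(result)
    return results
```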

Memory:
Memory can be defined as the processes used to acquire, store, retain, and later retrieve information. There are several types of memory in human brains.

  • Short-term memory: can be considered equivalent to the “in-context learning” described above
  • Long-term memory: allows the agent to retain and recall information over extended periods, often by leveraging an external vector database

Tool use:
Equipping LLMs with external tools can significantly extend the model’s capabilities.

  • The agent learns to call external APIs for extra information missing from the model weights (often hard to change after pre-training), including current information, code execution capability, access to proprietary information sources, and more. Note: specialized LLMs for a specific task can themselves be treated as tools (see the dispatch sketch below)
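
A common pattern for tool use is to expose a registry of named tools and dispatch on whichever name the model emits. The sketch below assumes that pattern; the tool names, the call_tool() helper, and the toy calculator are all illustrative, not a real framework’s API:

```python
import datetime

# A registry of callable tools; a task-specific LLM could be registered the same way.
TOOLS = {
    "current_date": lambda _arg: datetime.date.today().isoformat(),
    "calculator": lambda expr: str(eval(expr, {"__builtins__": {}})),  # demo only -- unsafe for untrusted input
}

def call_tool(name: str, argument: str) -> str:
    """Execute a tool call chosen by the LLM, e.g. name='calculator', argument='17 * 23'."""
    if name not in TOOLS:
        return f"unknown tool: {name}"  # fed back so the agent can recover
    return TOOLS[name](argument)

# In an agent loop, the model emits a tool name and argument, the agent runs it,
# and the result is appended to the context for the next reasoning step.
print(call_tool("calculator", "17 * 23"))  # -> 391
```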

Fig. 1. Overview of an LLM-powered autonomous agent system

The nascent yet evolving implementations of AI Agents:

The implementation of AI Agents lacks a standardized “framework,” pushing developers to devise their own methods. We connected with several leading AI startups to understand how they’re implementing AI agents today. Ideally, one envisions a single all-powerful agent, but in practice the ‘no free lunch’ theorem suggests that specialized agents are better suited to specific tasks, with a master agent overseeing sub-agents. Consider an agent system as a well-coordinated ensemble, where multiple sub-agents each specialize in distinct tasks. These agents may differ on several fronts, such as the LLMs they use, designated tools, data/memory sources, objectives, and their reasoning/planning mechanisms. A small (yet growing) number of startups are implementing slightly more advanced methods, including specialized fine-tuned LLMs for specific agents (oftentimes leveraging proprietary data) and agents that are experts in communicating with other APIs (or other agents). Concrete examples include planning agents, context retrieval agents, and personalized agents (i.e., each user can configure and decide on the agent’s character). We’re aiming to share a post that explores this in much more detail.
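
To illustrate the master/sub-agent pattern described above, here is a hedged sketch. The class names, the skill-matching heuristic, and the canned responses are assumptions for demonstration; a real system would route via an LLM and run a full agent loop inside each specialist:

```python
class SubAgent:
    """A specialist with its own model, tools, memory, and objective (illustrative)."""
    def __init__(self, name: str, skills: set[str]):
        self.name, self.skills = name, skills

    def handle(self, task: str) -> str:
        return f"[{self.name}] handled: {task}"  # a real agent would run its own loop here

class MasterAgent:
    """Oversees sub-agents and routes each task to a matching specialist."""
    def __init__(self, agents: list[SubAgent]):
        self.agents = agents

    def dispatch(self, task: str, skill: str) -> str:
        for agent in self.agents:
            if skill in agent.skills:
                return agent.handle(task)
        return f"no sub-agent covers skill '{skill}'"

master = MasterAgent([
    SubAgent("planner", {"planning"}),
    SubAgent("retriever", {"context-retrieval"}),
    SubAgent("persona", {"personalization"}),
])
print(master.dispatch("draft a rollout plan", "planning"))
```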

The graph below illustrates the transition from solitary LLMs to synergistic autonomous AI Agent Systems.

What’s Next?

Let’s journey through a forward-thinking framework for AI agents inspired by human cognition. With a rich trove of research already on the table, much is to be said. Taking a page from Daniel Kahneman’s book “Thinking, Fast and Slow,” we can imagine AI functioning as a two-part system. System 1, mirroring our automatic cognitive processes, handles tasks on ‘autopilot’ – think of riding a bike or driving. This is similar to prompt engineering routines and tool usage in agents. System 2, conversely, reflects our deliberative and analytical thought, involving critical thinking, planning, and informed action. Thanks to recent advances, AI agents are beginning to show hints of System 2-like mechanisms, tackling complex tasks, engaging in reasoning, and even developing System 1 routines for support.

Yann LeCun’s influential 2022 paper suggests a generalized architecture for autonomous agents. Many of the suggested concepts resonate with existing AI agent designs, albeit with some gaps. Our conversation will revolve around these gaps and the possible pathways for bridging them.

World Model: Envisioning future possibilities based on imagined action sequences is a remarkable human ability, enabling us to learn from few or no demonstrations, particularly when actions carry high stakes. Take driving as an example: our understanding of physics, vehicle control, and traffic rules helps us anticipate the results of sudden acceleration or abrupt braking. Today’s AI agents embed their world model within the LLM. While this works fine for general tasks, it can fall short on specialized ones. Consider an agent using GPT-4 to play Minecraft [1]. It learns much about the game from the vast amount of relevant data used to train the model, like wikis and Reddit discussions. But what happens when it’s thrown into a completely new game environment? The extent to which LLMs reason within the language space is still a hot research topic. A pathway toward more robust world models, built on more efficient self-supervised learning methods in multimodal domains, could greatly enhance an agent’s causal understanding and reasoning skills within and beyond the language domain. Self-supervised training methods that prioritize an overarching understanding of the data, rather than a specific focus on every little detail, are starting to reveal their potential [2].

Actor: The “Actor” in AI agents is the planner, coming up with the best game plan for the task at hand. A reliable world model would help here by judging the actor’s proposed plans upfront: it can rule out steps unlikely to work by forecasting their outcomes. Today’s AI systems already use an LLM to reflect on plans and actions, signaling progress in this area. More advanced methods, like [3], treat the possible actions as a tree structure and navigate through its branches. AI systems also train policy networks to handle simpler, immediate-response actions, similar to how humans form and use tools [4, 5]. An interesting concept that emerged recently is the use of humans as tools, mainly for feedback and guidance along the agent’s sequence of actions. Looking to the future, we expect significant advancements in creating a trainable actor module. The work being done on process supervision seems especially promising [6].
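
As a hedged sketch of the tree-style planning idea, the toy beam search below branches candidate actions from each partial plan and prunes weak branches. Both propose_actions() and estimate_value() are stand-ins: in a real agent, an LLM would propose the actions and a world model (or LLM self-evaluation) would score them.

```python
import heapq
import random

random.seed(0)

def propose_actions(plan: list[str]) -> list[str]:
    """Hypothetical proposal step; in practice an LLM suggests candidate next actions."""
    return [f"step{len(plan)}-{option}" for option in ("a", "b", "c")]

def estimate_value(plan: list[str]) -> float:
    """Hypothetical scoring step; in practice a world model or self-evaluation prompt."""
    return random.random()  # toy stand-in so the example runs

def beam_plan(depth: int = 3, beam_width: int = 2) -> list[str]:
    frontier: list[tuple[float, list[str]]] = [(0.0, [])]
    for _ in range(depth):
        candidates = []
        for _, plan in frontier:
            for action in propose_actions(plan):  # branch the tree
                new_plan = plan + [action]
                candidates.append((estimate_value(new_plan), new_plan))
        frontier = heapq.nlargest(beam_width, candidates)  # prune weak branches
    return max(frontier)[1]

print(beam_plan())
```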

Memory: AI agents currently manage memory using simple key-value pairs of embeddings and associated text. Techniques such as MIPS (Maximum Inner Product Search) are employed to retrieve relevant memories. While this works for simple tasks, it struggles with complex ones, such as writing an in-depth README for a code repository, which requires a conceptual understanding of the code flows and more advanced memory processes. Hierarchical memory, organizing data at varying levels of detail, and associative memory, linking related data, are promising solutions (graph databases are one example). Existing agents have started storing their reflections in memory, aiding abstract knowledge accumulation [7]. Today, AI agents use a combination of memory and tools for ongoing learning and adaptation. Because most leading LLMs aren’t easily fine-tunable (fine-tuning is complex and costly), memory usage takes precedence over fine-tuning. However, a purely memory-based approach leads to performance bottlenecks when retrieving relevant data into context. Our reliable ally, gradient descent, could be the key here again: instead of relying on heuristics and static memory embeddings, training memory-specific models that naturally display human-like characteristics is a promising direction. Future innovations may take a hybrid approach, directly integrating part of the learning process into the language model’s weights while supporting a dedicated memory module.
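
Here is a minimal sketch of the key-value memory and MIPS retrieval described above, using exact inner-product search over random vectors; a production agent would use real embeddings and an approximate-nearest-neighbor index instead.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8
# Key-value memory: embedding keys stored alongside their text payloads.
memory_keys = rng.standard_normal((5, dim))
memory_values = [f"stored reflection #{i}" for i in range(5)]

def recall(query_embedding: np.ndarray, k: int = 2) -> list[str]:
    """Exact MIPS over the keys; real systems use approximate indexes at scale."""
    scores = memory_keys @ query_embedding  # inner product with every key
    top = np.argsort(-scores)[:k]           # indexes of the k largest scores
    return [memory_values[i] for i in top]

print(recall(rng.standard_normal(dim)))
```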

Efficient Training: Propelling all other advancements in the AI sphere, the field is rapidly progressing toward more effective training methodologies, involving strategies such as distilled datasets, the QLoRA method for rapid model fine-tuning, knowledge transfer, prompt embedding editing, collaboration among multiple expert models, and the use of consumer-grade CPUs/GPUs. Alongside these, there’s a trend toward scaling up training to create increasingly advanced models, whose full impact remains to be seen. If scaling laws persist as they did between GPT-3 and GPT-4, we might be closer to AGI than we think.
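
QLoRA builds on LoRA, whose core trick is easy to show: freeze the pretrained weight W and learn only a low-rank update, so y = Wx + (alpha/r)·B(Ax). The numpy sketch below follows that formula; the dimensions and initialization scale are illustrative assumptions, not a real training recipe.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank, alpha = 16, 16, 4, 8

W = rng.standard_normal((d_out, d_in))        # frozen pretrained weight
A = rng.standard_normal((rank, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, rank))                   # trainable up-projection, zero-initialized

def lora_forward(x: np.ndarray) -> np.ndarray:
    """y = Wx + (alpha/rank) * B(Ax); only A and B receive gradients when fine-tuning."""
    return W @ x + (alpha / rank) * (B @ (A @ x))

x = rng.standard_normal(d_in)
# Zero-initialized B means fine-tuning starts exactly at the pretrained model.
assert np.allclose(lora_forward(x), W @ x)
```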

Pioneering This AI Era

AI agents are at the dawn of their evolution. It’s challenging to predict their trajectory a decade from now, but two things are clear: i) we’re a long way from realizing fully autonomous systems that can solve ‘all problems’, ii) innovation in algorithms and architecture will be required to achieve more advanced autonomous AI systems (not just advancements in foundational models). The gap, however, signals not a roadblock but an exciting opportunity for aspiring founders.

There is exhilaration in the race itself: technical founders and developers are at the forefront of implementing novel AI agents and (often when paired with a non-technical co-founder or GTM Lead) are paving the way for the commercialization of the products that leverage them. Establishing and maintaining technological defensibility will require rapid adaptation to regular academic breakthroughs. Companies that focus heavily on R&D to devise unique agent implementations stand a better chance of outpacing rivals. 

Large horizontal AI platforms such as OpenAI, Anthropic, and Inflection AI are well positioned (first movers, early adoption, significant funding, etc.) to solve general enterprise challenges that can be addressed via chatbot or search functionality. The key is zeroing in on specific verticals, where you can tackle problem spaces that seem unsolvable or unfeasible today, leveraging AI Agents to automate complex multi-step tasks. Focusing vertically and winning on differentiated UX (tailored to the vertical user), workflow fit (tailored to the vertical flow), and governance (tailored to the regulatory requirements or centralized rules of the vertical) could be a winning strategy. Success may hinge on a laser-focused go-to-market strategy and an intimate understanding of the problem space.

Should you be working on something within this sector and wish to discuss it, we’re eager to connect! You can contact me at dan@vineventures.com. Alternatively, if you’re deploying agents to production and need technical advice, feel free to reach out to me at ori.kabeli@gmail.com.