Chatbot Arena is an open research project created by members of LMSYS and UC Berkeley SkyLab. Its goal is to build an open crowdsourcing platform for collecting feedback and evaluating large language models (LLMs) in practical applications. Using the platform is completely free and doesn’t require logging in.
Functionalities of the project
The project offers several features:
- Arena (battle): Pits two randomly selected, anonymous chatbots against each other in side-by-side windows (a blind test).
- Arena (side-by-side): Lets you compare two chatbots of your choice side by side.
- Direct Chat: Engage in a one-on-one conversation with a selected text chatbot.
- Vision Direct Chat: Chat with a multimodal chatbot that can also process images.
- Leaderboard: Displays the ranking of the best models.
Our tests
On the platform, users can encounter a wide range of chatbots, from the 2-billion-parameter Gemma to models like GPT-4 and Claude 3 Opus, whose undisclosed parameter counts are rumored to be orders of magnitude larger. This helps users see how model capabilities grow with size (measured by the number of parameters) and with the quality of the architecture and training data. The weakest models may struggle to understand user queries, but even their mistakes can be interesting: sometimes they echo training data, and other times they produce amusing errors from a lack of contextual understanding.
Interestingly, the models didn’t know who the CEO of Anthropic (Dario Amodei) is. While GPT-3.5 Turbo hallucinated an answer outright, Claude 3 Opus simply gave up.
The current leaderboard leader, or the freshly updated GPT-4? Why choose when you can try both! By the way, we might have accidentally jailbroken Claude with this prompt. We do not endorse its behaviour. Bad Claude! Misaligned!
Now, let’s test a model with 4 billion parameters – small enough to run on a mid-range laptop without internet access. It may not be a genius, but let’s see what it can do:
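For readers who want to try something similar at home, here is a minimal sketch of running a chat model of roughly this size offline with the Hugging Face transformers library. The model name below is an illustrative assumption, not necessarily the model we tested in the Arena; any ~4-billion-parameter chat checkpoint downloaded in advance will work:

```python
# A minimal sketch, assuming the transformers library is installed and the
# model weights were downloaded beforehand (after that, no internet is needed).
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="Qwen/Qwen1.5-4B-Chat",  # illustrative ~4B chat model, not the Arena's pick
    device_map="auto",             # uses a GPU if available, otherwise the CPU
)

result = generator(
    "Who is the CEO of Anthropic?",
    max_new_tokens=128,
    do_sample=True,
    temperature=0.7,
)
print(result[0]["generated_text"])
```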
Chatbot Arena – parameters
In Direct Chat mode, in addition to the standard voting buttons, we have access to several generation settings:
- Temperature: Controls how random the responses are. At a temperature of zero, decoding is deterministic (the model always picks the most likely token), while higher values introduce more variation.
- Top P: Nucleus sampling. At each step, the model samples only from the smallest set of tokens whose cumulative probability reaches P; higher values allow for more creative responses (see the sketch below).
- Max Output Tokens: Caps the length of the response, measured in tokens.
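To make these knobs concrete, here is a minimal sketch of how temperature and top-p interact when sampling a single token. This is our own illustrative NumPy code, not the Arena's or any model provider's implementation:

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_p=0.9, rng=None):
    """Sample one token id from raw logits using temperature and top-p (nucleus) filtering."""
    rng = rng or np.random.default_rng()

    if temperature == 0:
        # Temperature zero means greedy decoding: always the most likely token.
        return int(np.argmax(logits))

    # Temperature scaling: values < 1 sharpen the distribution, values > 1 flatten it.
    scaled = logits / temperature
    probs = np.exp(scaled - np.max(scaled))
    probs /= probs.sum()

    # Top-p filtering: keep the smallest set of tokens whose cumulative
    # probability reaches top_p, renormalize, and sample from that set.
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, top_p)) + 1
    kept = order[:cutoff]
    return int(rng.choice(kept, p=probs[kept] / probs[kept].sum()))
```

With temperature zero the same prompt always yields the same reply; raising temperature or top P widens the pool of candidate tokens and makes responses more varied.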
Image processing models
Currently, only three vision-capable models are available. They are not as powerful as leading models like GPT-4, and they are most proficient in English.
Leaderboard
The Chatbot Arena leaderboard brings some surprising results. The open-source model by Cohere ranks sixth, while Claude 3 Opus sits at the top. The French Mistral didn’t make it to the top, and the entire top 9 belongs to North American companies. Claude 3 Haiku, the smallest of the Claude 3 family, offers better performance and lower prices than GPT-3.5, and it ranks higher than all versions of Mistral and the weakest version of GPT-4.
Many experts, including Andrej Karpathy, consider Chatbot Arena to be the most reliable LLM ranking because other rankings are based solely on benchmark tests, which do not always reflect real-world performance in user interactions. There’s also a risk of models being trained specifically for those tests.
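Under the hood, the leaderboard turns those pairwise blind votes into ratings with an Elo-style system (LMSYS has also published Bradley-Terry estimates). Here is a minimal sketch of the classic online Elo update; the K-factor, starting rating, and model names are illustrative choices, not LMSYS’s exact configuration:

```python
def elo_update(rating_a, rating_b, score_a, k=32):
    """One Elo update after a battle; score_a is 1.0 (A wins), 0.0 (B wins), or 0.5 (tie)."""
    expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta

# Hypothetical battle log: (model_a, model_b, winner) from blind side-by-side votes.
battles = [
    ("model-x", "model-y", "model-x"),
    ("model-y", "model-z", "model-y"),
    ("model-x", "model-z", "tie"),
]

ratings = {}
for a, b, winner in battles:
    ra, rb = ratings.get(a, 1000.0), ratings.get(b, 1000.0)
    score = 1.0 if winner == a else 0.0 if winner == b else 0.5
    ratings[a], ratings[b] = elo_update(ra, rb, score)

# Higher rating means a higher leaderboard position.
print(sorted(ratings.items(), key=lambda item: -item[1]))
```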
Chatbot Arena is both a fun curiosity and a credible source of information about LLMs available on the market. We encourage you to try it out and share your experiences!
Don’t miss out; read AI News.