Last week, something big happened in the world of AI. Meta released a new version of their Llama model: Llama 3. It comes in two sizes, 8B and 70B parameters, each available in pre-trained and instruct variants. If you’re not too tech-savvy: pre-trained models have simply learned to predict text broadly, while instruct variants are additionally fine-tuned to follow instructions and hold a conversation, which makes them better suited for chat and assistant-style jobs. So, from now on, we’ll be talking about the instruct version.

What do we know about Llama3?

From a technical standpoint, it features a vocabulary of 128k tokens and an 8k context window (honestly, a weak result compared to the competition). However, Meta assures that this weak point, the small window, will be improved in the near future: larger models, reaching up to 400B parameters and with longer context windows, are planned and already being trained. The current models were trained on over 15T tokens from publicly available sources, covering more than 30 languages.
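If you want to poke at that vocabulary yourself, here’s a minimal sketch using Hugging Face transformers (our own illustration, not part of Meta’s announcement; the meta-llama/Meta-Llama-3-8B-Instruct repo is gated, so you have to accept the licence on Hugging Face first):

```python
# A quick sketch (ours, not from Meta's docs) to inspect the Llama 3 tokenizer.
# Assumes `pip install transformers` and an HF account with access to the gated repo.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

# The base vocabulary: ~128k tokens (Llama 2 had ~32k).
print(tokenizer.vocab_size)

# Every prompt token counts against the 8k context window, and a larger
# vocabulary generally means fewer tokens per sentence.
text = "Meta released Llama 3 in 8B and 70B variants."
print(len(tokenizer(text)["input_ids"]))
```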

How does Llama 3 currently compare to other models?

Meta Llama 3 Instruct model performance

As we can tell from the attached charts, Meta is doing really well… But there’s one thing: where are GPT-4 and Claude 3 Opus in these comparisons? We’re not sure, but we can guess…

So, let’s check out LMSys Arena to see if our guesses are right:

Meta Llama 3 in comparison to other models like Claude 3 or GPT-4 via LMSys Arena

Here it’s quite clear that Llama falls behind models like GPT-4 Turbo and Claude 3 Opus. But let’s take a step back and consider how big those models actually are. Opus? Estimates mention as many as 2T parameters. GPT-4, Gemini Pro? There, too, we’re talking about really big numbers. So, despite its much smaller size, Llama 3 still achieves very good results. It’s worth noting that it’s the only open-source model on this list that we can download and run locally ourselves! That’s definitely something we like.
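And since we can run it locally, here’s a minimal sketch of doing just that with Hugging Face transformers (a setup we’re assuming for illustration; Ollama or llama.cpp would work too):

```python
# A minimal local-inference sketch (our assumption, not this article's setup).
# Requires transformers and torch, roughly 16 GB of VRAM just for the 8B
# weights in bf16, and accepting Meta's licence for the gated HF repo.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# The instruct variant expects Llama 3's chat format; apply_chat_template
# builds the special tokens for us instead of hand-writing them.
messages = [
    {"role": "user", "content": "I have 10 pears and I eat 3 apples. How many pears do I have?"}
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=128)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```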

Meta’s Llama 3 Test

Enough talk about Llama itself – now it’s time to put it to the test. We’ll be using LMSys Arena for this, a platform where we can try out different models.

We’re not ones for asking about ordinary, everyday things or engaging in casual small talk with the model. We prefer questions that challenge the model’s reasoning abilities. Just to give you a heads up – we’ll be pitting Llama against Claude 3 Opus.

So, let’s start with something simple:

Meta Llama 3 in comparison to other models like Claude 3 or GPT-4 via LMSys Arena

As you can see, both Llama and its opponent easily deduced that we still have 10 pears. Smart models!

So, how about a classic geography quiz question?

Meta Llama 3 in comparison to other models like Claude 3 or GPT-4 via LMSys Arena

This time, Llama got completely lost and gave us incorrect information, mixing up the poles. However, when given a second chance, it chose to admit its mistake rather than insist on it.

So, how about something even more challenging? Or tricky: something that might stump a human, a question that requires thinking outside the box? Here you go:

Both models fail to provide the correct answer to this question. Although, from our perspective, Llama handles it better here: unlike Claude, it doesn’t insist on the wrong answer but admits its mistake. And despite both models getting it wrong, this question does have a correct answer. We wonder how many of you know it. 🙂

So, how about one last example? This time, let’s see how both Llama 70B and Llama 8B perform.

Oops! Neither of the Llamas managed to grasp the essence of the riddle. Surprisingly, the larger one, which should perform better, actually did worse, hallucinating an extra light bulb. Well, a model is a model; it’s entitled to make mistakes.

What do we think about Llama?

Llama, like any other model, makes mistakes. But our questions weren’t the easiest, and it must be admitted that the model handled them reasonably well: at least no worse than its counterparts, as reflected in the arena ranking.

We rate Llama positively. One might even say it’s something of a game-changer: with a much smaller architecture, an open-source model is now able to almost match the largest and best proprietary models.