Groq unveils lightning-fast LLM engine; developer base rockets past 280K in 4 months


Groq now allows you to make lightning-fast queries and perform other tasks with leading large language models (LLMs) directly on its website.

The company quietly introduced the capability last week. The results are much faster and smarter than anything the company has demonstrated before. You can type your queries, or speak them using voice commands.

In my tests, Groq replied at around 1,256.54 tokens per second, a speed that appears almost instantaneous and that GPU chips from companies like Nvidia are unable to match, according to Groq. That is up from an already impressive 800 tokens per second the company showed off in April.

By default, Groq’s site engine uses Meta’s open-source Llama3-8b-8192 LLM. You can also choose the larger Llama3-70b, as well as Gemma (Google) and Mistral models, and support for other models is coming soon.


The experience is significant because it demonstrates to developers and non-developers alike just how fast and flexible an LLM chatbot can be. Groq CEO Jonathan Ross says usage of LLMs will increase further once people see how easy they are to use on Groq’s fast engine. The demo offers glimpses of other tasks that become easy at this speed, such as generating job postings or articles and changing them on the fly.

In one example, I asked it to critique the agenda of our VB Transform event about generative AI that starts tomorrow. It provided feedback almost instantly, including suggestions for clearer categorization, more detailed session descriptions and better speaker profiles. When I asked for suggestions of great speakers to make the lineup more diverse, it immediately generated a list, with the organizations they are affiliated with, in the table format I had suggested. I could change the table on the fly, adding a column for contact details, for example.

In a second exercise, I asked it to create a table of my speaking sessions next week, to help me get organized. It not only created the table I asked for, but let me make changes quickly and easily, including spelling corrections. I could also change my mind and ask it to add columns for things I’d forgotten to request, and it can translate the table into different languages too. Sometimes I had to ask a couple of times for it to make a correction, but such bugs are generally at the LLM level, not the processing level. It certainly hints at the new sets of things LLMs can do when operating at this sort of speed.

Groq has gained attention because it promises to perform AI tasks much faster and more affordably than competitors. It says this is possible thanks to its language processing unit (LPU), which is much more efficient than GPUs at such tasks, in part because the LPU operates linearly. While GPUs are important for model training, deployed AI applications performing inference (the actions a trained model takes in response to queries) require more efficiency and lower latency.

So far Groq has offered its service for powering LLM workloads for free, and uptake from developers has been massive: more than 282,000 of them, Ross told VentureBeat, just 16 weeks after the service launched.

Groq offers a console for developers to build their apps, similar to what other inference providers offer. Notably, though, Groq lets developers who have built apps on OpenAI swap them over to Groq in seconds, with a few simple steps.
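Because Groq exposes an OpenAI-compatible API endpoint, the swap typically amounts to changing the base URL, the API key, and the model name while keeping the same chat-completions payload. Here is a minimal sketch using only the Python standard library; the `GROQ_API_KEY` environment variable is an assumption, and the request is built but deliberately not sent, so the snippet runs without a network connection:

```python
import json
import os
import urllib.request

# An OpenAI-based app would point at https://api.openai.com/v1;
# swapping to Groq means pointing at its OpenAI-compatible endpoint instead.
BASE_URL = "https://api.groq.com/openai/v1"

# The payload keeps the familiar OpenAI chat-completions shape;
# only the model name changes (here, Groq's default Llama 3 8B model).
payload = {
    "model": "llama3-8b-8192",
    "messages": [{"role": "user", "content": "Say hello."}],
}

req = urllib.request.Request(
    f"{BASE_URL}/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={
        # GROQ_API_KEY is an assumed environment variable for this sketch.
        "Authorization": f"Bearer {os.environ.get('GROQ_API_KEY', '')}",
        "Content-Type": "application/json",
    },
)

# urllib.request.urlopen(req) would send the request; it is omitted here
# so the sketch stays runnable without a key or network access.
print(req.full_url)
```

Apps using OpenAI’s official client library can make the same switch by passing the Groq base URL and key to the client constructor, leaving the rest of the application code untouched.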

I spoke with Ross in preparation for a talk at VB Transform, where he’s one of the opening speakers tomorrow. The event is focused on enterprise generative AI in deployment, and Ross says Groq is about to sharpen its focus on the enterprise. Large companies are moving to deploy AI applications and require more efficient processing for their workloads, he said.

While you can type in your queries to the Groq engine, you can also now speak your queries after pushing a microphone icon. Groq uses the Whisper Large V3 model, the latest open source automatic speech recognition and speech translation model from OpenAI, to translate your voice into text. That text is then inserted as the prompt for the LLM.

Groq says its technology uses about a third of the power of a GPU at worst, but most of its workloads use as little as a tenth of the power. In a world where it seems like those LLM workloads will never stop scaling, and energy demand will just keep growing, Groq’s efficiency represents a challenge to the GPU-dominated compute landscape.

In fact, Ross claims that by next year, over half of the globe’s inference computing will be running on Groq’s chips. Ross will have the answers and a lot more at VentureBeat’s Transform 2024.
