In this video we demonstrate Modalix running speech-to-text and large multimodal models. Given a speech prompt and an image, Modalix first runs the Whisper-Small model to convert the speech prompt to text. The text and image are then converted into a sequence of tokens: the image alone becomes approximately 600 tokens, and the combined text and image tokens are passed to the LLaVA-7B large multimodal model, which also runs on Modalix. Modalix accelerates the time to first token by processing these prompt tokens in a batch at a rate of 240 tokens per second, and it currently generates output tokens at 6.8 tokens per second. The generated tokens are converted into words and streamed back to the host, where they are converted back to speech.
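
A rough back-of-the-envelope latency estimate based on the numbers quoted above (a minimal sketch; the image token count and throughput figures come from the demo description, while the text-prompt and response token counts are assumptions made purely for illustration):

    # Back-of-the-envelope latency estimate for the Modalix demo pipeline.
    # Image token count and throughput figures are from the post;
    # text-prompt and response token counts are assumed for illustration.

    IMAGE_TOKENS = 600      # image is converted into ~600 tokens
    TEXT_TOKENS = 12        # assumed length of the transcribed speech prompt
    PREFILL_RATE = 240.0    # tokens/s, batched prompt processing on Modalix
    DECODE_RATE = 6.8       # tokens/s, autoregressive token generation

    RESPONSE_TOKENS = 11    # assumed length of the spoken answer

    time_to_first_token = (IMAGE_TOKENS + TEXT_TOKENS) / PREFILL_RATE
    generation_time = RESPONSE_TOKENS / DECODE_RATE

    print(f"Time to first token: ~{time_to_first_token:.1f} s")   # ~2.6 s
    print(f"Response generation: ~{generation_time:.1f} s")       # ~1.6 s
    print(f"Total (excluding Whisper and TTS): "
          f"~{time_to_first_token + generation_time:.1f} s")      # ~4.2 s

This illustrates why batched prompt processing matters: the roughly 600 image tokens dominate the prompt, so the prefill rate largely determines how quickly the model starts answering.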

(Pressing the "Choose File" button and opening the file "car.jpg")
(Pressing the "Begin Record" button)

Who makes the car in the picture?

(Modalix speaks)

"The car in the picture is made by Porsche."

(this concludes the large multimodal model demo)



For details, see the company blog: sima.ai/implementing-multimodal-genai-models-on-modalix/