vak | Our chip describes what it sees

In this video we demonstrate the ability of Modalix to run a large multimodal model to perform scene analysis. A text prompt and an image are both passed to Modalix and converted into a sequence of tokens; the image alone is converted into approximately 600 tokens. The combined text and image tokens are passed to the LLaVA-7B large multimodal model running on Modalix. Modalix accelerates the Time-To-First-Token by processing these tokens in batch at a rate of 240 tokens per second, and it currently generates tokens at a rate of 6.8 tokens per second. The generated tokens are converted into words and streamed back to the host, where they are converted to speech.
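To get a feel for what these rates mean in practice, here is a rough back-of-the-envelope sketch. It uses only the figures quoted in the video (about 600 image tokens, 240 tokens per second for batch processing, 6.8 tokens per second for generation). The function name `estimate_latency` and the example output length of 68 tokens are illustrative assumptions, and the model ignores tokenization, image encoding, and host transfer overhead, so treat the results as approximate lower bounds rather than measured numbers.

```python
# Rough latency model based on the rates quoted in the video.
# Assumptions (not from the source): the function name, the example
# output length, and the fact that all other overhead is ignored.

PREFILL_RATE = 240.0  # prompt tokens processed per second (from the video)
DECODE_RATE = 6.8     # tokens generated per second (from the video)
IMAGE_TOKENS = 600    # approximate token count for one image (from the video)


def estimate_latency(prompt_tokens, output_tokens,
                     prefill_rate=PREFILL_RATE, decode_rate=DECODE_RATE):
    """Return (time_to_first_token, generation_time, total) in seconds."""
    ttft = prompt_tokens / prefill_rate        # batch processing of the prompt
    generation = output_tokens / decode_rate   # token-by-token generation
    return ttft, generation, ttft + generation


if __name__ == "__main__":
    # Hypothetical example: image tokens only, 68 generated tokens.
    ttft, generation, total = estimate_latency(IMAGE_TOKENS, 68)
    print(f"time to first token: {ttft:.1f} s")
    print(f"generation time:     {generation:.1f} s")
    print(f"total:               {total:.1f} s")
```

Under these assumptions the image tokens alone account for roughly 2.5 seconds before the first word appears, and the generation phase dominates the total time for any answer longer than a couple of dozen tokens.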

(pressing button "Analyze Scene")
(Modalix speaks)

"The image features a large airplane flying low over a city with its wings visible. The airplane is positioned above a highway, and there are several cars driving on the road below. The scene also includes a tall building with a large "A" on its side, possibly representing a company or landmark. The overall atmosphere suggests a bustling urban environment with air traffic and busy streets."

(this concludes the scene analysis demo)


Read the details on the company blog: sima.ai/implementing-multimodal-genai-models-on-modalix/
