Multimodal models

In artificial intelligence, a multimodal model is a model that can process and integrate data from multiple different modalities, or types, of input. These inputs can include text, images, audio, video, and more.

1. Audio to text

Whisper from OpenAI

Whisper (OpenAI) is a versatile speech recognition model trained on a large and diverse collection of audio data. Beyond multilingual speech recognition, it can also translate speech and identify the language being spoken.

When using "Audio to text", at least two nodes are required: the model node (currently, only the OpenAI model is available) and an output node.

  • The model node requires a URL pointing to the audio file (e.g., .mp3, .wav).

  • The output node displays the transcription produced by the model.

Alternatively, the result of the model node can be sent to an LLM node for further processing. In the video below, the transcription is sent to an LLM that combines this information with data retrieved and processed from a URL.
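Outside the Stack AI canvas, the same flow (audio URL → Whisper transcription → LLM) can be sketched directly against OpenAI's Python SDK. This is an illustrative assumption, not Stack AI's internal implementation; the URL, model names, and helper names are placeholders.

```python
# Hedged sketch of the "Audio to text" flow using OpenAI's API directly:
# fetch an audio file from a URL, transcribe it with Whisper, then pass
# the transcription to an LLM step. URLs and prompts are placeholders.
import os
import urllib.request

# Formats Whisper commonly accepts; used to validate the input URL.
SUPPORTED_EXTENSIONS = {".mp3", ".wav", ".m4a", ".webm"}

def is_supported_audio(url: str) -> bool:
    """Return True if the URL (ignoring query strings) ends in a supported format."""
    path = url.split("?")[0]
    return os.path.splitext(path)[1].lower() in SUPPORTED_EXTENSIONS

def transcribe_and_summarize(audio_url: str) -> str:
    """Download and transcribe the audio (model node), then summarize it (LLM node)."""
    if not is_supported_audio(audio_url):
        raise ValueError(f"Unsupported audio format: {audio_url}")
    local_path, _ = urllib.request.urlretrieve(audio_url)

    from openai import OpenAI  # requires the `openai` package
    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    with open(local_path, "rb") as audio_file:
        transcription = client.audio.transcriptions.create(
            model="whisper-1", file=audio_file
        )

    # Equivalent of wiring the model node's output into an LLM node.
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": f"Summarize this transcript:\n{transcription.text}"}],
    )
    return completion.choices[0].message.content
```

The validation helper mirrors the platform's requirement that the model node receive a URL to a supported audio file.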

2. Text to audio


ElevenLabs provides some of the most realistic Text to Speech and Voice Cloning models. The node is available in the multimodal section.

To get it running, two nodes are needed: an input node where the text is written and the "Text to Audio" node that generates the audio. The audio can be either played in the Stack AI platform or downloaded.

Default voices are available when using Stack AI's API key (the default configuration). The available voices are:

['Rachel', 'Domi', 'Bella', 'Antoni', 'Elli', 'Josh', 'Arnold', 'Adam', 'Sam']

A custom API key can be configured in the settings section of the "Text to Audio" node. The voice then appears as a parameter that can be supplied at deployment.
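For reference, the same text-to-audio step can be sketched against ElevenLabs' public REST API outside the platform. The voice ID, API key, and output file name below are placeholder assumptions.

```python
# Hedged sketch of the "Text to Audio" step using ElevenLabs' REST API
# directly. The voice ID and API key are placeholders.
import json
import urllib.request

API_BASE = "https://api.elevenlabs.io/v1/text-to-speech"

def build_tts_request(voice_id: str, text: str, api_key: str) -> urllib.request.Request:
    """Assemble the POST request that asks ElevenLabs to speak `text`."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{API_BASE}/{voice_id}",
        data=payload,
        headers={"xi-api-key": api_key, "Content-Type": "application/json"},
        method="POST",
    )

def synthesize(voice_id: str, text: str, api_key: str, out_path: str = "speech.mp3") -> str:
    """Send the request and save the returned audio bytes; returns the file path."""
    request = build_tts_request(voice_id, text, api_key)
    with urllib.request.urlopen(request) as response, open(out_path, "wb") as f:
        f.write(response.read())
    return out_path
```

Splitting request construction from sending keeps the authenticated call isolated, which mirrors how the node separates its voice/key settings from the generated output.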

3. Image to Text

Several image-to-text models are integrated into Stack AI's platform. Below is a list of the models, a brief description of each, and links for further information.



Provides approximate text prompts that can be used with Stable Diffusion to re-create similar-looking versions of the image or painting.


This model generates text conditioned on both text and image prompts. Unlike standard multimodal models, it has also been fine-tuned to follow human instructions.


A vision encoder with a pretrained ViT and Q-Former, a single linear projection layer, and an advanced Vicuna large language model.
