Could Telegram be a competitor of voice assistants, like Amazon Alexa or Google Assistant?

An open letter to Pavel Durov, containing some change requests to enable voice integration in the Telegram bot ecosystem

Giorgio Robino
ConvComp.it

--

photo credit: Pavel Durov, founder of Telegram Messenger, in 2020.

I started writing this article almost two years ago, and the topic still comes to mind from time to time, so maybe the question in the title is still valid.
In the first part of the article I analyze what is happening in the smartspeaker-based voice assistant market, and in the second part I introduce some change requests that could allow Telegram to become a winning competitor in this restless vertical.

Let's recap the current smartspeaker market landscape

For a few years, Amazon Alexa and Google Assistant have been competitors in the consumer smartspeaker market, with similar market shares. To be precise, Amazon Echo devices are a bit more widespread around the world, but the gap between the two competitors is not that large.

Bixby could have been an emerging third competitor, but Samsung seems to have given up on releasing a branded smartspeaker device and has reserved the technology for its own smartphones.
End of the game for a splendid conversational technology conceived by Viv Labs' Adam Cheyer and others.

After more than a year of stagnation, things are now changing because Google recently announced the sunset of Conversational Actions. This means that:

In about one year, Google Assistant will no longer support actions designed only for smartspeakers.

Google's focus on Android actions had been announced and foreseen for a while. But now the company seems to have suspended its investments in smartspeaker-based voice technologies.

Recently Sonos announced a new “service”, its own voice assistant, which could be a competitor of Alexa (de facto, the current monopolist). But it's not clear to me whether the Sonos service will be implemented as a hardware device (a hi-fi smartspeaker / home multi-hub, aka the “soundbar”?).
The company, having acquired the great open-source-oriented snips.ai in 2019, will certainly focus on data privacy, proposing some local processing of user voice data (on the smartspeaker, not in the cloud!).

Apple has its Siri voice assistant, running on HomePod devices, right! But so far these Apple smartspeakers haven't achieved a significant market share. The first release of the HomePod was too expensive, and Siri's third-party application integration has never gained much success.

A Microsoft smartspeaker for Cortana was never born, and the entire Cortana project seems to be dead. Game never started.

Last but not least, Facebook, pardon, Meta, is apparently out of the game too:

who remembers Facebook Portal devices?

And what happened to the Facebook-Amazon agreement that established the coexistence of Facebook Portal and Amazon Alexa on the same devices produced and sold by Facebook? Nothing has been heard of it since! Also,

Whatsapp for Business has not shone in the enterprise chatbot space.

Facebook, sorry, Whatsapp, selected a short list of system integrator companies (2nd parties) to “filter” 3rd party enterprise companies. In my opinion this has been an unsuccessful path that duplicated data privacy concerns and over-complicated account requests, creating a lot of frustration. For example, so far it's impossible to set up a chatbot on Whatsapp even for academic or non-profit purposes (e.g. my account access request to Whatsapp for CPIAbot, coming from me as a researcher at a public national research institute, has never received a reply). Unclear business strategy.

Is the unique voice assistant a failed model?

So why are smartspeaker-based voice assistants in this stagnation? In this second quarter of 2022, we are in a situation where

Amazon Alexa seems to be the only competitor left in the market of voice assistants based on smartspeakers.

But the question now is:

Does a unique smartspeaker-based central voice assistant (as the main hub at home or in the office) have a future?

I remember discussions among experts a few years ago, when pretty much everyone agreed that people need (at home) a unique voice assistant, not many assistants!
Is that still so?

Alexa now seems to be the winner of this prediction: the unique “1st party” assistant, the only one leading the market. But this win is maybe due more to the fact that Amazon's competitors are losing out. In fact, Alexa devices and investments also didn't grow as we expected. I have the feeling of a general stagnation inside the Alexa departments too: for example, the developer communities are no longer supported as they were years ago, and many smart guys moved from Alexa to AWS. A weird but significant signal.

So the winner-take-all ambition of all the above-mentioned companies is probably failing, and there are many reasons. All these big companies aimed to be the exclusive winner, proposing “walled gardens” and cloud-based ecosystems initially perceived by the market as disruptive, but these models are failing:

  • For final users, because data privacy concerns haven't yet been solved (partly because of the proposed cloud-based architectures).
  • For 3rd party developer companies, because they were not offered a real technical advantage or a clear and profitable business model.

My personal vision is that we need a completely different approach, one that exits the big players' proprietary walled gardens, with two fundamental requirements:

  • Hardware: we preferably need an open-hardware smartspeaker device with some embedded open software (and common protocols) on top.
  • Software: we need an open architecture based on the coexistence of multiple “peer-to-peer” (voice-first) assistants, operating on a common open platform.

In other words,

  • As a final user, you want a smartspeaker where you can use one or more voice assistants (made by “3rd parties”).
  • As an application developer (service supplier/enterprise company), you want a common protocol to plug your service into the above open-hardware smartspeaker ecosystem.

What does all this have to do with Telegram?

Even if the famous instant messaging (mobile) app is still a cloud-based closed-source system, an important and well-engineered feature is the “by-design” possibility to enable 3rd party bots, partly following the multi-assistant architecture I mentioned above. Let's deep-dive!

What is Telegram and what are the Telegram Bot APIs?

Telegram is, for many reasons, above all user experience and proven security, probably the best instant messaging app available on common smartphone and desktop operating systems!

This is not only because of the usual reasons Telegram fans mention when comparing this app with the unloved Whatsapp; for developers, Telegram is great mainly because they can easily build chatbot applications, the so-called Telegram Bots.

As all you programmers know, Telegram supplies a totally free, easy, and performant way to build chatbots, using a well-designed Bot API (just updated to version 6.1 in June 2022). You can set up your chatbot in a few minutes using a high-level API wrapper in your preferred programming language. Last but not least, Telegram generously allows you to store a really huge amount of files (gigabytes) for free. So far so good.

My experience as a researcher and developer of CPIAbot

When I was a researcher at ITD-CNR, from 2018 to 2020, I conceived and implemented CPIAbot, a Telegram voice chatbot to help foreign (immigrant) students of CPIA (Italian public adult schools) learn Italian as a second language at a basic level (L2/pre-A1).

As you can imagine, for a person who has to learn a foreign language, spoken language understanding is a fundamental goal. So I allowed my chatbot to receive inbound voice messages and to reply to the user with outbound (synthetic) voice messages. By the way, according to the research statistics we elaborated in the experimentation phase, the voice channel was the learners' preferred way to interact with the bot!

From the UX perspective, using CPIAbot, students send text and/or voice messages to the bot. A custom server produces a text transcript with an ASR engine (at the time I used Facebook's pretty good free wit.ai service). Afterward, the user transcript is processed by a dialog manager engine (my own open-sourced naifjs) and the response is returned to the user, again as a text and/or voice message, using a Google TTS voice or a human-spoken audio recording. More info about CPIAbot in my old article here and on the academic project home page.
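The flow above can be sketched roughly like this. The ASR and TTS steps are hypothetical placeholders (the real bot used wit.ai and a Google TTS voice), and the toy rule-based dialog manager only hints at what an engine like naifjs does:

```python
def transcribe(audio_bytes: bytes) -> str:
    """Placeholder for the ASR step (e.g. a wit.ai speech endpoint)."""
    raise NotImplementedError


def synthesize(text: str) -> bytes:
    """Placeholder for the TTS step (e.g. a Google TTS voice)."""
    raise NotImplementedError


def dialog_manager(transcript: str) -> str:
    """Toy stand-in for a dialog engine such as naifjs:
    map the user's transcript to a reply."""
    rules = {
        "ciao": "Ciao! Come ti chiami?",
        "come stai": "Sto bene, grazie!",
    }
    return rules.get(transcript.strip().lower(),
                     "Non ho capito, puoi ripetere?")


def on_voice_message(audio_bytes: bytes) -> tuple[str, bytes]:
    """Full round trip: audio in -> (reply text, reply audio) out."""
    transcript = transcribe(audio_bytes)
    reply_text = dialog_manager(transcript)
    return reply_text, synthesize(reply_text)
```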

What's missing from the almost perfect Telegram Bot API?

So far you can develop a Telegram voicebot following the message-based paradigm. That's not totally natural, I have to admit (compared with natural spoken-language interaction), even if nowadays people are used to communicating by exchanging (also spoken) messages through instant messaging apps. Let's consider audio messages OK for now.

Now I propose some change requests that could turn Telegram into a competitor of the big players' voice-assistant masterbots (someone says metabots) I mentioned above. Simply speaking:

Imagine your Telegram app as a smart-speaker embedded in your phone, able to connect you to any 3rd party bot without itself being an all-around “masterbot” assistant.

Sounds good, doesn't it? But how would it work in practice? To enable a great voice-interface user experience, some features are currently missing.

Change request #1: 🔊 Voice/Audio Messages Auto-Play

When a bot answers a user's (spoken) request, its voice message (the response) should be auto-played by the device (at least for the bot you are currently interacting with). Now, instead, the user must tap the voice message icon to play it.

Of course, also for trivial privacy reasons, I would like to configure (opt in to/opt out of) the auto-play feature, with a general mute-all flag and/or a per-bot flag (I probably want to un-mute just the most frequently used bots).
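A minimal sketch of the opt-in logic I have in mind, assuming a hypothetical client-side settings object (all names here are made up, not part of any Telegram API):

```python
def should_autoplay(settings: dict, bot_username: str) -> bool:
    """Auto-play a bot's voice reply only if allowed globally or per bot.

    Hypothetical settings keys:
      mute_all      -- global flag, defaults to muted for privacy
      unmuted_bots  -- per-bot opt-in set, overrides a global mute
      muted_bots    -- per-bot opt-out set, used when mute_all is off
    """
    if settings.get("mute_all", True):
        return bot_username in settings.get("unmuted_bots", set())
    return bot_username not in settings.get("muted_bots", set())


# Example: everything muted by default, except the bot I use most often.
settings = {"mute_all": True, "unmuted_bots": {"CPIAbot"}}
```

Defaulting `mute_all` to true keeps the feature strictly opt-in, which matches the privacy concern above.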

Related to the voice auto-play experience (and in general to any audio or music playback), a smartphone of course doesn't have the loudspeaker power of a smartspeaker (such as the Amazon Echo or Google Nest devices, all great devices, no doubt).
So even with the desired audio auto-play software feature, coupling with an external (Bluetooth-connected?) loudspeaker is highly recommended, but optional.

Change request #2: 🎙️ Voice Wake-Word Detection

This feature is a bit more complex to implement, and controversial. It would replicate the user experience of the invocation sentence on a smartspeaker, when you say “Hey Google…”, sorry, I mean:

Hey Telegram open MyAppName…

You want the named bot to start waiting for your spoken command utterance, recording your voice until silence is detected; afterward the audio message is forwarded to the Telegram server as usual (and ultimately to the MyAppName bot).
There are also alternative ways to invoke a specific bot (for example, you could want to invoke a bot by its specific name, etc.), but you get the idea. Last but not least, the current Telegram client app's push-to-talk chatbot selection mode is not so bad, eliminating many privacy-related concerns.
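The record-until-silence step could be sketched like this. The wake-word detector itself is out of scope here (it's the hard part), and the simple energy-threshold endpointing below is just an illustrative assumption, not how a real client would necessarily implement it:

```python
def frame_energy(frame: list[int]) -> float:
    """Mean absolute amplitude of one frame of PCM samples."""
    return sum(abs(s) for s in frame) / max(len(frame), 1)


def record_until_silence(frames, energy_threshold=100.0, max_silent=3):
    """Collect audio frames until `max_silent` consecutive low-energy
    frames are seen, i.e. until the speaker stops talking.

    `frames` is any iterable of PCM sample lists (e.g. a microphone
    stream); the collected utterance would then be uploaded to the
    Telegram server as a regular voice message.
    """
    utterance, silent_run = [], 0
    for frame in frames:
        utterance.append(frame)
        if frame_energy(frame) < energy_threshold:
            silent_run += 1
            if silent_run >= max_silent:
                break  # end of utterance detected
        else:
            silent_run = 0  # speech resumed, reset the silence counter
    return utterance
```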

Change request #3: 🖧 A Decentralized Architecture Support for Voice-based Bots

Decentralized, headless, or no-brokerage are possible keywords that could differentiate Telegram bots from the centralized (1st party masterbot + 3rd party skill-bots) approach the big players have so far imposed on us.
In these well-known scenarios, the master company supplies a proprietary first-party masterbot (e.g. Google Assistant or Alexa) that runs the whole game. In fact, in this model your bot (the action in Google Assistant parlance, the skill in Amazon Alexa parlance, or the capsule in Samsung Bixby jargon) must follow a rigid contractual and technological framework, and all data “passes through” the central masterbot's cloud servers. That's bad or, at least, controversial in terms of data privacy, vendor lock-in, etc.

So forget the first-party approach of a (big-player-owned) cloud system that monopolizes/centralizes the traffic of all external (third-party) skill-bots. Instead, imagine Telegram just as a light middleware that supplies two things:

  • The common client (the TG app), updated with the above-proposed change requests to integrate enhanced voice support
  • Some server-side services on the Telegram cloud that ultimately redirect to your (private person or enterprise) bot.
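The Bot API's existing webhook mechanism already hints at this light-middleware role: with the real `setWebhook` method, Telegram's servers forward every update straight to the bot owner's own endpoint instead of buffering it for polling. A minimal sketch (the token and endpoint URL are hypothetical placeholders):

```python
import json
import urllib.request


def set_webhook_request(token: str, endpoint: str) -> urllib.request.Request:
    """Build the setWebhook call that tells Telegram where to redirect
    this bot's traffic."""
    payload = json.dumps({"url": endpoint}).encode()
    return urllib.request.Request(
        f"https://api.telegram.org/bot{token}/setWebhook",
        data=payload,
        headers={"Content-Type": "application/json"},
    )


# Hypothetical token and endpoint; urlopen(req) would send the request.
req = set_webhook_request("123456:ABC-DEF", "https://example.org/my-voicebot")
```

From that point on, Telegram acts purely as a relay: the conversational logic, and the data, live on the bot owner's server.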

In other words, as a final user:

Imagine you could access, in your smartphone's Telegram app, a short list of voicebot services provided by independent suppliers that you (and only you) selected!

In this scenario Telegram would not be the unique (voice) assistant, but just a vehicle enabling you to “talk” with a set of independent assistants of your choice. It's a completely different business model (compared with the big players'), requiring just minor architectural and software updates to the current Telegram client (and some optional enablers on Telegram servers).

Is this bot-independent model sustainable for the Telegram company?

That's a critical open point. I guess Telegram could set up the overall architecture by:

  • Giving away some basic features for free (for example, audio auto-play and wake-word detection on the client, etc.)
  • Supplying some server-side enablers as paid services: multilingual speech-to-text and text-to-speech platforms and, why not, some smart “conversational AI” services to simplify/automate chatbot development (an intent/entities classifier, a dialog manager, a semantic search engine, etc.).

Maybe Telegram Premium, extended with a developer program (“Telegram Premium for Enterprises”), could be a possible way to cover all the costs of the upgrade, and a possible economic gain?

I imagine, for example, that a TG Premium for Enterprises contract could be set up in such a way that a chatbot developer company pays for premium services like enhanced network traffic, the use of the above-mentioned server-side enablers, etc.

Please comment!

What do you, developers or final users, think about all this?
Please leave your feedback in the comments!

Giorgio

--


Experienced Conversational AI leader @almawave . Expert in chatbot/voicebot apps. Former researcher at ITD-CNR (I made CPIAbot). Voice-cobots advocate.