Multi-Platform Development of Skills and Chatbots
Alexa, Siri and Skype under one roof
Digital assistants are becoming increasingly popular with end customers. There is hardly a more intuitive way of interacting than via speech. However, developing the corresponding extensions, known as skills, is challenging - especially when they are to be available on multiple platforms.
Digital assistants have been all the rage in human-machine interaction for years. Siri was released back in 2010 and has been an integral part of many Apple products since 2011. Since then, the other big players in IT have followed suit: Amazon Alexa, Google Now/Assistant and Microsoft Cortana. For some time now, third-party providers have been able to add new functions to these assistants. These extensions are usually called skills. DHL, for example, offers its customers the option of requesting their parcel status by voice command. Another option for interacting with IT systems by voice are the so-called chatbots: digital chat partners on platforms such as Skype or Facebook Messenger with whom you communicate via text messages.

These interaction options simplify the use of IT systems - you no longer need to be able to operate or even own a PC to complete tasks in the digital world. The other side of the coin is that competing platforms are vying for the favor of users - and therefore all have to be served by companies. For precisely these companies, it is desirable to serve several platforms with a single software solution. In the area of chatbots, Microsoft's Bot Framework already offers a wide range of options (e.g. Skype, Slack, Facebook Messenger). In the area of digital assistants, on the other hand, things look bleak. The following example of an Alexa skill shows what an architecture that solves this problem can look like.
Multi-Platform Architecture for Digital Assistants and Chatbots
The goal of a multi-platform architecture is extensibility: adding a further chat platform or digital assistant (hereafter referred to as a channel) should no longer require any changes to the backend of the application.
Everything starts with a statement from the user, entered via text or voice. The skill backend accepts this statement. This is the backend system of the channel - in the case of Amazon Alexa, for example, it is called the Alexa Skills Kit. This component is provided by the channel's vendor and must be configured by a developer so that the incoming messages contain the user's statements. For speech-driven channels, the conversion from speech to text and text to speech takes place here.

As mentioned at the outset, the Microsoft Bot Framework already supports a large number of channels out of the box. For connecting additional channels, there is also a REST API for addressing the chatbot backend - the DirectLine API. Since the connected channel cannot communicate directly with this API, a so-called skill mediator sits between them. It takes over the bidirectional conversion and forwarding of the messages: it receives messages from the skill backend and forwards them via the DirectLine API - or vice versa. Each connected channel usually uses its own message format, so the skill mediator is implemented individually per channel.

The chatbot backend now receives the message via the DirectLine API. It is the central component of the architecture and controls the dialogs - in other words, it knows what information needs to be requested from the user for a particular action and how the bot should respond. External APIs are also connected here so that the necessary data (e.g. weather report, customer profile) can be retrieved or manipulated. The chatbot backend can only support new actions if it is adapted accordingly. If data is to be user-specific or access to it is to be restricted, an identity provider must also be integrated.

To identify which action is to be executed, a Natural Language Understanding (NLU) component is used. In this example, the LUIS service, part of Microsoft Cognitive Services, takes on this role. (However, our architecture is not limited to this solution.) The component maps textual statements (utterances) to user intentions (intents). Actions are executed based on the detected intent. It is also important to extract entities from the statements.
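Before looking at how intents are turned into actions, the communication path just described can be outlined in code. The following is a minimal TypeScript sketch (Node.js 18+ with its built-in fetch) of a skill mediator driving the DirectLine 3.0 REST API; the secret, the user id and the omitted error handling are assumptions and would need to be fleshed out in a real mediator.

// Minimal sketch of a skill mediator talking to the chatbot backend via the
// DirectLine 3.0 REST API. DIRECT_LINE_SECRET and the user id "alexa-user" are
// assumptions; error handling and token refresh are omitted.
const DIRECT_LINE = "https://directline.botframework.com/v3/directline";
const SECRET = process.env.DIRECT_LINE_SECRET ?? "";

// Start a new conversation with the chatbot backend.
async function startConversation(): Promise<string> {
  const res = await fetch(`${DIRECT_LINE}/conversations`, {
    method: "POST",
    headers: { Authorization: `Bearer ${SECRET}` },
  });
  const body = (await res.json()) as { conversationId: string };
  return body.conversationId;
}

// Forward the user's statement to the chatbot backend as a message activity.
async function sendUtterance(conversationId: string, text: string): Promise<void> {
  await fetch(`${DIRECT_LINE}/conversations/${conversationId}/activities`, {
    method: "POST",
    headers: { Authorization: `Bearer ${SECRET}`, "Content-Type": "application/json" },
    body: JSON.stringify({ type: "message", from: { id: "alexa-user" }, text }),
  });
}

// Fetch the bot's replies; a production mediator would rather use the WebSocket stream.
async function receiveReplies(conversationId: string, watermark?: string) {
  const url = `${DIRECT_LINE}/conversations/${conversationId}/activities` +
    (watermark ? `?watermark=${watermark}` : "");
  const res = await fetch(url, { headers: { Authorization: `Bearer ${SECRET}` } });
  return (await res.json()) as {
    activities: Array<{ from: { id: string }; text?: string }>;
    watermark: string;
  };
}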
Intent and entities are mapped to an action in the backend. Here, for example, a weather API is queried for the necessary data and a suitable response is generated for the user.
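How intent and entities could be dispatched to such an action is sketched below; the intent name, the entity names and the weather endpoint are purely illustrative assumptions (in a real Bot Framework project, this logic would typically live in the bot's dialogs).

// Sketch of dispatching a recognized intent to an action in the chatbot backend.
// Intent name, entity names and the weather API are illustrative assumptions.
interface Recognition {
  intent: string;                    // e.g. "GetWeather", as configured in LUIS
  entities: Record<string, string>;  // e.g. { city: "Hamburg" }
}

type Action = (entities: Record<string, string>) => Promise<string>;

const actions: Record<string, Action> = {
  // Query a (hypothetical) weather API and build the answer for the user.
  GetWeather: async (entities) => {
    const res = await fetch(
      `https://api.example.com/weather?city=${encodeURIComponent(entities.city)}`
    );
    const data = (await res.json()) as { temperature: number }; // assumed response shape
    return `In ${entities.city} it is currently ${data.temperature} degrees.`;
  },
};

async function handle(recognition: Recognition): Promise<string> {
  const action = actions[recognition.intent];
  return action
    ? action(recognition.entities)
    : "Sorry, I did not understand that.";
}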
To ensure that the architecture operates statelessly, the information collected during the conversation is cached outside the components. This is done by a storage component, e.g. Azure Table Storage. However, the Bot Framework also offers an interface for connecting any persistence system.
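As a rough illustration of such a pluggable state store, a minimal sketch follows; the Bot Framework SDK defines its own storage interfaces (including an Azure Table Storage adapter), so the interface below only conveys the idea of keeping conversation state outside the components.

// Simplified sketch of a pluggable conversation-state store. In production, the
// in-memory variant would be replaced by Azure Table Storage or another persistence system.
interface StateStore {
  read(key: string): Promise<Record<string, unknown> | undefined>;
  write(key: string, state: Record<string, unknown>): Promise<void>;
}

class InMemoryStateStore implements StateStore {
  private items = new Map<string, Record<string, unknown>>();

  async read(key: string) {
    return this.items.get(key);
  }

  async write(key: string, state: Record<string, unknown>) {
    this.items.set(key, state);
  }
}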
So much for the basic structure of a multi-platform architecture for digital assistants and chatbots. It makes it possible to serve chatbots and digital assistants with an identical backend. Nevertheless, adapting the dialog flow per channel may be advisable: depending on the channel's capabilities - chatbots usually offer visual elements, while digital assistants may be limited to audio output - interactions can be handled differently in the backend.
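A possible sketch of such channel-dependent handling follows; the channel ids and the reduced reply structure are assumptions, and the real Bot Framework activity schema is considerably richer.

// Sketch: adapting the response to the channel's capabilities.
interface Reply { text: string; speak?: string; attachments?: unknown[]; }

function buildReply(channelId: string, answer: string): Reply {
  if (channelId === "directline") {
    // Voice-only channels (here: Alexa, connected through the skill mediator
    // via the DirectLine API) get plain speech output.
    return { text: answer, speak: answer };
  }
  // Chat channels such as Skype or Facebook Messenger can additionally show
  // visual elements, e.g. a hero card.
  return {
    text: answer,
    attachments: [{
      contentType: "application/vnd.microsoft.card.hero",
      content: { title: "Weather", text: answer },
    }],
  };
}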
The particular challenges of a multi-platform architecture are illustrated below using the example of connecting Alexa to the chatbot backend.
Variant 1 - Alexa as Relay Station
In this approach, the Alexa skill is built in such a way that no dialogs are configured in the skill backend. The skill only serves to capture the user's statement, convert it into text (speech-to-text) and forward it to the chatbot backend.
Since the Alexa Skills Kit (ASK) does not inherently provide a way to tap the user's statements, a little trick is necessary: a single intent is created that uses a custom slot type to capture the entire statement and provide it as an entity (called a slot in ASK). As a starting point for the skill mediator, the alexa-bridge project was used and extended; details on how to set up the makeshift intent and the custom slot type can also be found there.
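The idea of the makeshift interaction model can be outlined roughly as follows (here as a TypeScript constant); all names are placeholders, and the exact model and its validation rules are documented in the alexa-bridge project.

// Sketch of a "catch-all" interaction model: a single intent whose only sample
// utterance is the slot itself, backed by a custom slot type. Names are placeholders.
const interactionModel = {
  languageModel: {
    invocationName: "my skill",
    intents: [
      {
        name: "CatchAllIntent",
        slots: [{ name: "utterance", type: "CATCH_ALL" }],
        samples: ["{utterance}"],
      },
    ],
    types: [
      {
        name: "CATCH_ALL",
        // A few representative phrases help the speech recognizer transcribe free text.
        values: [
          { name: { value: "what is the weather in hamburg" } },
          { name: { value: "where is my parcel" } },
        ],
      },
    ],
  },
};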
The application now reads the utterance from the slot value and sends it as a message to the chatbot backend. Once the backend returns a reply, it is converted in the skill mediator and sent back to the Alexa backend. From there, it is forwarded to the Amazon Echo device for output.
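Put together, the relay could look roughly like the following sketch, in which askChatbotBackend stands for the DirectLine round trip outlined earlier (start conversation, send the utterance, wait for the bot's reply); the slot name follows the catch-all model above, and session handling is deliberately simplified.

// Sketch of the relay in variant 1: read the utterance from the catch-all slot,
// forward it to the chatbot backend and wrap the answer in an Alexa response.
declare function askChatbotBackend(text: string): Promise<string>;

async function handleAlexaRequest(alexaRequest: any): Promise<any> {
  // The catch-all slot holds the user's complete utterance (see the model above).
  const utterance: string = alexaRequest.request?.intent?.slots?.utterance?.value ?? "";

  const answer = await askChatbotBackend(utterance);

  // Minimal Alexa response envelope: speech output only, keep the session open.
  return {
    version: "1.0",
    response: {
      outputSpeech: { type: "PlainText", text: answer },
      shouldEndSession: false,
    },
  };
}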
The procedure is well suited for very simple dialogs and for Alexa skills that are operated primarily in English. However, for more complex dialogs or a skill in German, variant 2 is the better fit. This is mainly because the natural language understanding (NLU) of German texts still lags behind that of English; depending on the type and form of the statements and entities, comprehension problems can occur.
The big advantage of this variant is that only a single intent has to be configured in the skill backend. If new dialogs are added, no adaptation of the Alexa skill is necessary.
Variant 2 - Alexa Takes Over the Dialog Flow
The skill mediator is also used here, but in a modified form: in the skill backend, the dialogs are modeled completely and the necessary entities are queried. This data is then sent together as one message to the chatbot backend. Since the backend again processes the data in LUIS (see Fig. 1), a format must be defined for this message and stored as an utterance for the corresponding intents in LUIS. Such a format could look like this:
alexa-intent - <Intent-Name-in-Alexa-Backend>;<Slot-Value 1>;…;<Slot-Value n>
This way, the skill mediator can continue to work independently of the configured intents and only needs to construct and send a message according to this scheme.
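A sketch of how the skill mediator could flatten an incoming Alexa IntentRequest into this scheme; the intent and slot names are simply taken from the Alexa request.

// Build the "alexa-intent - <name>;<slot 1>;…;<slot n>" message from an Alexa IntentRequest.
function toBackendMessage(alexaRequest: any): string {
  const intent = alexaRequest.request.intent;
  const slotValues = Object.values(intent.slots ?? {})
    .map((slot: any) => slot.value)
    .filter((value) => value !== undefined && value !== null);
  // e.g. "alexa-intent - GetWeatherIntent;Hamburg;tomorrow"
  return `alexa-intent - ${intent.name};${slotValues.join(";")}`;
}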
In this way, language understanding works more reliably - especially for German skills - because entities can already be marked within the statements in the skill backend and are therefore extracted more accurately. One disadvantage is that the dialogs have to be maintained in both LUIS and the Alexa backend. However, as the language understanding of German texts matures, this "detour" could become superfluous.
Keeping Track With the Skill Mediator
To cope with the growing number of chat and assistant platforms, it is necessary to develop an architectural approach that enables platform-independent development of bots and skills.
The solutions presented here use a skill mediator that unifies the connection of digital assistant platforms. The component handles the coupling to the desired assistant platform and thus creates an efficient extension interface; no customization of the chatbot backend is required. Depending on the quality of the assistant platform's speech-to-text engine and the power of the Natural Language Understanding framework used, maintaining the dialogs in several places can also be avoided. The skill mediator thus not only saves effort and costs, but also enables a uniform user experience across all connected platforms. This increases user acceptance when a skill is used both via Amazon Echo and, for example, via smartphone using Google Assistant.