<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[ConvComp.it - Medium]]></title>
        <description><![CDATA[Conversational Computing: Advancing LLM-Based AI Agents - Medium]]></description>
        <link>https://convcomp.it?source=rss----e9c948ff6ebd---4</link>
        <image>
            <url>https://cdn-images-1.medium.com/proxy/1*TGH72Nnw24QL3iV9IOm4VA.png</url>
            <title>ConvComp.it - Medium</title>
            <link>https://convcomp.it?source=rss----e9c948ff6ebd---4</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Sat, 04 Apr 2026 13:36:00 GMT</lastBuildDate>
        <atom:link href="https://convcomp.it/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Conversation Routines: A Prompt Engineering Framework for Task-Oriented Dialog Systems]]></title>
            <link>https://convcomp.it/conversation-routines-a-prompt-engineering-framework-for-task-oriented-dialog-systems-bd3f1c26ec7f?source=rss----e9c948ff6ebd---4</link>
            <guid isPermaLink="false">https://medium.com/p/bd3f1c26ec7f</guid>
            <category><![CDATA[llm]]></category>
            <category><![CDATA[automation]]></category>
            <category><![CDATA[no-code]]></category>
            <category><![CDATA[conversational-ai]]></category>
            <category><![CDATA[agentic-ai]]></category>
            <dc:creator><![CDATA[Giorgio Robino]]></dc:creator>
            <pubDate>Thu, 09 Jan 2025 10:36:35 GMT</pubDate>
            <atom:updated>2025-02-18T14:21:59.946Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*XAsoxqFw_z4W2blkS44r7g.jpeg" /></figure><p><strong>Abstract</strong><br>This study introduces Conversation Routines (CR), a structured prompt engineering framework for developing task-oriented dialog systems using Large Language Models (LLMs). While LLMs demonstrate remarkable natural language understanding capabilities, engineering them to reliably execute complex business workflows remains challenging.</p><p>The proposed CR framework enables the development of Conversation Agentic Systems (CAS) through natural language specifications, embedding task-oriented logic within LLM prompts. This approach provides a systematic methodology for designing and implementing complex conversational workflows while maintaining behavioral consistency. We demonstrate the framework’s effectiveness through two proof-of-concept implementations: a Train Ticket Booking System and an Interactive Troubleshooting Copilot. These case studies validate CR’s capability to encode sophisticated behavioral patterns and decision logic while preserving natural conversational flexibility.</p><p>Results show that CR enables domain experts to design conversational workflows in natural language while leveraging custom functions (tools) developed by software engineers, creating an efficient division of responsibilities where developers focus on core API implementation and domain experts handle conversation design.</p><p>While the framework shows promise in accessibility and adaptability, we identify key challenges, including computational overhead, non-deterministic behavior, and domain-specific logic optimization. Future research directions include CR evaluation methods based on a prompt engineering framework driven by goal-oriented grading criteria, improving scalability for complex multi-agent interactions, and enhancing system robustness by addressing the identified limitations across diverse business applications.</p><h3>UPDATE (18/02/2025)</h3><p>The article has been published on arXiv! Compared to the original version, the updated version submitted to arXiv includes some experimental results, refinements, and formatting adjustments: <a href="https://arxiv.org/abs/2501.11613">https://arxiv.org/abs/2501.11613</a></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/976/1*QhexFM_cYYzuRcJ3r0-yBg.png" /><figcaption><a href="https://arxiv.org/abs/2501.11613">https://arxiv.org/abs/2501.11613</a></figcaption></figure><hr><p><a href="https://convcomp.it/conversation-routines-a-prompt-engineering-framework-for-task-oriented-dialog-systems-bd3f1c26ec7f">Conversation Routines: A Prompt Engineering Framework for Task-Oriented Dialog Systems</a> was originally published in <a href="https://convcomp.it">ConvComp.it</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[SWARMing Conversational AI]]></title>
            <link>https://convcomp.it/swarming-conversational-ai-e6e0dfd0dc02?source=rss----e9c948ff6ebd---4</link>
            <guid isPermaLink="false">https://medium.com/p/e6e0dfd0dc02</guid>
            <category><![CDATA[llm-applications]]></category>
            <category><![CDATA[workflow-automation]]></category>
            <category><![CDATA[conversation-design]]></category>
            <category><![CDATA[conversational-ai]]></category>
            <category><![CDATA[ai-agent]]></category>
            <dc:creator><![CDATA[Giorgio Robino]]></dc:creator>
            <pubDate>Thu, 17 Oct 2024 09:53:50 GMT</pubDate>
            <atom:updated>2024-11-03T20:10:49.444Z</atom:updated>
            <content:encoded><![CDATA[<h4>Integrating No-Code and Code in Agent-Based Workflows</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/854/1*3h2frseAhLXqbgjc-EZlLQ.png" /><figcaption>source: <a href="https://cookbook.openai.com/examples/orchestrating_agents">https://cookbook.openai.com/examples/orchestrating_agents</a></figcaption></figure><p>A few days ago, the just-released <a href="https://github.com/openai/swarm">SWARM</a> open-source project [<a href="https://github.com/openai/swarm">1</a>] from OpenAI sparked quite a bit of discussion within the agent-based Generative AI community, particularly among those focused on conversational AI. It’s a small, simple project (so far) that the company defines as:</p><blockquote>An educational framework for exploring ergonomic, lightweight multi-agent orchestration. It is managed by the OpenAI Solutions team.</blockquote><p>When comparing SWARM to well-known multi-agent frameworks such as LangGraph, CrewAI, AutoGen, and others, many argue that there is nothing groundbreaking about this small framework, which appears more like a demo than a production-ready platform. Indeed, OpenAI itself tends to characterize its project as a simple cookbook [<a href="https://cookbook.openai.com/examples/orchestrating_agents">2</a>].</p><p>In a certain sense, I strongly agree; however, several key concepts in SWARM’s proof-of-concept align with my perspective on constructing LLM-based conversational agents. It is important to clarify that the term (conversational) agent, as I use it, has a very specific meaning, which only partially overlaps with the concept of AI agents as recently understood in the LLM-based community. For a more detailed discussion, please refer to my previous article: Conversational Agent with a Single Prompt [<a href="https://www.linkedin.com/pulse/conversational-agent-single-prompt-giorgio-robino-vrppf/">3</a>].</p><h3>Agents = Routines + Handoffs</h3><p>The insightful accompanying OpenAI GitHub cookbook [<a href="https://github.com/openai/openai-cookbook/blob/main/examples/Orchestrating_agents.ipynb">4</a>] highlights several key points. The framework introduces the concept of routines that embed conversational workflow logic in a no-code manner (a concept I called Directive Instructions on Conducting the Dialog in my previous article).</p><p>The fundamental premise of SWARM is to decompose a complex conversational workflow (macro-task) into multiple smaller tasks managed by agents, which can be viewed as LLM-based experts in specific domains and policies. These agents collaborate through straightforward yet effective handoff mechanisms based on function-calling design patterns. So far, nothing new — I agree [<a href="https://microsoft.github.io/autogen/dev/user-guide/core-user-guide/design-patterns/handoffs.html">5</a>][<a href="https://platform.openai.com/docs/guides/prompt-engineering/strategy-split-complex-tasks-into-simpler-subtasks">6</a>].</p>
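<p>To make the routine-plus-handoff idea concrete, here is a minimal Python sketch in the spirit of the cookbook example (the agent names and instructions are my own illustrative choices, not the exact cookbook code). Handing off is as simple as a tool function that returns another Agent:</p><pre>from swarm import Swarm, Agent<br><br>def transfer_to_sales():<br>    # Handoff: returning an Agent tells the framework to switch routine.<br>    return sales_agent<br><br>sales_agent = Agent(<br>    name=&quot;Sales Agent&quot;,<br>    instructions=&quot;Follow the sales routine: qualify the request, then propose an order.&quot;,<br>)<br><br>triage_agent = Agent(<br>    name=&quot;Triage Agent&quot;,<br>    instructions=&quot;Understand the user request and hand off to the right specialist.&quot;,<br>    functions=[transfer_to_sales],<br>)<br><br>client = Swarm()<br>response = client.run(agent=triage_agent, messages=[{&quot;role&quot;: &quot;user&quot;, &quot;content&quot;: &quot;I want to buy a ticket&quot;}])<br>print(response.messages[-1][&quot;content&quot;])</pre>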
<h3>Instructions (on Conducting the Dialog)</h3><p>With SWARM, it is possible to define complex workflows where conversational designers (prompt engineers) articulate the business logic of the workflow in natural language. The related backend business logic components, referred to as tools within the context of LLM programming, remain separate and can reside in custom Python code (or any programming language of choice).</p><p>This allows applications — whether conversational or otherwise — to be constructed from distinct components: LLM-based workflows, developed by prompt engineers (or coders), and hard-coded programs, handled by traditional software developers.</p><blockquote>To me, this is the most crucial aspect when building an AI team in practice: bringing together developers (software coders) and conversational designers (no-coders) to work collaboratively!</blockquote><p>Let us now examine a simple explanatory code snippet extracted from the blog example:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/854/1*gI3Pa7A3yU6p2eudW-huTA.png" /><figcaption>source: <a href="https://cookbook.openai.com/examples/orchestrating_agents">https://cookbook.openai.com/examples/orchestrating_agents</a></figcaption></figure><p>Notably, in the example provided in the blog, the LLM-based routine may include fixed (deterministic) steps, such as mandatory dichotomous questions (yes/no), implemented as Python functions. In the snippet above, two routines (agents) are defined: triage_agent and sales_agent, each possessing its own workflow as specified in the instructions prompt, along with a set of associated functions, commonly known as tools, which implement the relevant business logic on the backend.</p><h3>No-code Instructions</h3><p>The instructions consist of simple sequences of conversational step directives, written in natural language or pseudo-code (such as bullet points or any structure expressible in natural language), which may include conditionals and/or loops.</p><blockquote>This is significant as it demonstrates a method for conceptualizing chatbot interactions that are not reliant on hard-coded scripts (in a specific programming language or dedicated conversation workflow tool) but instead are based on high-level directives for conducting dialogue, articulated in natural language within system prompts.</blockquote><h3>Deterministic Workflow Checkpoints</h3><p>In revisiting the snippet analysis, the most significant feature is the execute_order() function outlined earlier within the sales_agent tool. The notable aspect of this function is that when the sales agent determines an order should be executed, it invokes the execute_order() function, which can prompt a yes/no confirmation request from the user.</p><blockquote>SWARM thus enables a synthesis of no-code programming (implemented as instructions in prompts) with workflows that include hard-coded dialog turns (implemented as programming code in invoked functions). I refer to these functions as workflow checkpoints.</blockquote><p>This approach is particularly noteworthy as it allows for the design of complex conversational applications where hard-coded workflows are seamlessly integrated with LLM-based workflows.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ORebUDtFDoi2HSYckpirrw.png" /></figure><h3>Context Variables and Task-Oriented State Machines</h3><p>The framework introduces context variables — a simple yet effective mechanism for retrieving and storing contextual data shared across routines. 
While the implementation may appear basic, its straightforwardness is part of its strength.</p><p>Interestingly, opinions differ on how SWARM handles state management: some consider context variables as part of a potential state-machine-based approach, while others argue that the framework remains fundamentally stateless, given its dependence on stateless calls to the LLM models driving the agents.</p><p>From my perspective, particularly in conversational design, SWARM effectively implements a task-oriented state machine, albeit at a high level of abstraction. In this framework, conversational states are naturally encoded within the logic of routines, enabling both input and output data to be stored in shared context variables. This allows conversation designers to focus on agent-driven tasks without the need to explicitly conceptualize a full state network, as in the LangGraph approach.</p>
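<p>As a rough illustration of how routines share state, here is a minimal sketch of context variables in SWARM (the agent, tool, and variable names are my own hypothetical examples): instructions can be a function of the shared context, and tools can read it without exposing it to the model.</p><pre>from swarm import Swarm, Agent<br><br>def instructions(context_variables):<br>    # Instructions can be computed from the shared context.<br>    name = context_variables.get(&quot;user_name&quot;, &quot;there&quot;)<br>    return f&quot;You are a support agent. Greet the user as {name} and collect the issue.&quot;<br><br>def open_ticket(context_variables, description: str):<br>    # A backend tool reading context populated outside the conversation.<br>    user_id = context_variables[&quot;user_id&quot;]<br>    return f&quot;Ticket opened for user {user_id}: {description}&quot;<br><br>support_agent = Agent(name=&quot;Support Agent&quot;, instructions=instructions, functions=[open_ticket])<br><br>client = Swarm()<br>response = client.run(<br>    agent=support_agent,<br>    messages=[{&quot;role&quot;: &quot;user&quot;, &quot;content&quot;: &quot;My printer is not working&quot;}],<br>    context_variables={&quot;user_name&quot;: &quot;Giorgio&quot;, &quot;user_id&quot;: 42},<br>)</pre>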
<p>Additionally, the framework adopts a minimal yet effective testing method through evals. Once again, I appreciate this clear and practical methodology for validating routine behaviors.</p><h3>Conclusion</h3><p>While SWARM may not yet rival more established multi-agent platforms like <a href="https://www.langchain.com/langgraph">LangGraph</a>, <a href="https://www.crewai.com/">CrewAI</a>, and <a href="https://microsoft.github.io/autogen/">AutoGen</a> (to mention the most prominent) in terms of sophistication, it introduces a promising approach to orchestrating LLM-based conversational agents. Its ability to decompose workflows into smaller, specialized tasks managed by individual agents demonstrates a practical framework for agent-based orchestration.</p><p>What may differentiate this approach from other frameworks is its emphasis on workflow development through the seamless integration of no-code and hard-coded processes. This allows conversational designers to define overarching logic using natural language prompts, while developers handle more complex backend functions through traditional coding methods.</p><p>This hybrid design creates a fluid workflow where the roles of both no-code and coded components are clearly delineated, allowing for flexible, collaborative development of conversational applications.</p><p>Moreover, SWARM’s minimalistic reliance on context variables and simple testing methodologies strikes a balance between simplicity and functionality, making it a pragmatic choice for developing medium-complexity agentic systems. Whether SWARM will evolve into a robust, production-ready framework remains to be seen. However, its approach represents an important step toward bridging the gap between conversational designers and software developers, fostering a more collaborative environment for building next-generation conversational AI systems.</p><p>As the community further explores its potential, SWARM may prove to be more than just an educational tool: a viable option in the growing landscape of LLM-based multi-agent frameworks.</p><h3>Update (2024–11–03): Insights from Initial Experiments</h3><p>Following several hands-on tests with SWARM, I would like to provide further insights into my experience with this framework. The key advantage of adopting an agentic approach (exemplified by SWARM, along with other LLM-based agentic frameworks) lies in a principle reminiscent of the <em>structured programming</em> paradigm of the 1960s: break down the main problem (or workflow in this context) into smaller, manageable sub-problems.</p><blockquote><em>Each sub-problem is assigned to a specialized agent, which operates based on instructions crafted by a conversation designer and utilizes tools or backend functions implemented in Python by a developer.</em></blockquote><p>This approach represents a substantial advancement over the monolithic LLM-based methods that were our primary focus just a year ago. For example, consider a <em>Retrieval-Augmented Generation</em> (RAG) application. In tests, I implemented a search workflow using cooperating agents aligned with SWARM’s design. In this setup, a <em>master orchestrator agent</em> delegates tasks to <em>specialized agents</em>, each tasked with retrieval from sources such as relational databases or vector stores. The results demonstrated minimal hallucinations and an effective response strategy for unmet requests. Overall, this multi-agent approach outperformed the single-LLM model.</p><p>Beyond the design pattern best practices mentioned above, I have reassessed my previously favorable view of what I termed “Deterministic Workflow Checkpoints.” Specifically, the example OpenAI Python function execute_order(), which directly reads user input (via input()) and writes output (using print()), illustrates a suboptimal design pattern. The hard-coded backend logic in this function bypasses both instructional flows and user interface design, leading to several limitations.</p><p>For example, in scenarios where user confirmation is required (such as verifying a purchase or payment), it is valid to include a <em>checkpoint function</em> that prompts a yes/no confirmation. However, it is not ideal for such functions to embed workflow interaction logic directly. Instead, it would be more effective to delegate interaction logic to a specialized <em>user interface agent</em> — this could be the main orchestrator agent or, even better, a dedicated agent attuned to specific user interface modalities (whether chat, voice interface, or email), the bot-persona design, and so on.</p><p>While the principle of using functions as checkpoints is sound, the original OpenAI example relying on direct input() and print() calls could be considered an anti-pattern, as it does not conform to the separation of concerns in agent-based design.</p><p>Regarding LLM models, some have argued that SWARM is specific to OpenAI’s models, but this is not accurate. OpenAI API calls have become a de facto standard, and many non-OpenAI vendors provide compatible interfaces. 
In my tests, I used Azure OpenAI deployments, and the framework operated effectively with only a few lines of additional initialization code, as shown below:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/759/1*RvJt9a2t6zyYv6jSlAeJBw.png" /><figcaption>SWARM initialization using an Azure OpenAI deployment</figcaption></figure>
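<p>In essence, the extra initialization amounts to passing a pre-configured Azure client to SWARM, along the lines of this sketch (the endpoint, key, API version, and deployment name are placeholders, not my actual values):</p><pre>from openai import AzureOpenAI<br>from swarm import Swarm, Agent<br><br># Placeholder credentials: use your own Azure resource values.<br>azure_client = AzureOpenAI(<br>    azure_endpoint=&quot;https://&lt;your-resource&gt;.openai.azure.com&quot;,<br>    api_key=&quot;&lt;your-api-key&gt;&quot;,<br>    api_version=&quot;2024-02-01&quot;,<br>)<br><br># SWARM accepts any OpenAI-compatible client instance.<br>client = Swarm(client=azure_client)<br><br># The agent model must match the name of your Azure deployment.<br>agent = Agent(name=&quot;Agent&quot;, model=&quot;&lt;your-deployment-name&gt;&quot;, instructions=&quot;You are a helpful agent.&quot;)</pre>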
href="https://convcomp.it">ConvComp.it</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Testing the Language Proficiency of Popular LLMs]]></title>
            <link>https://convcomp.it/testing-the-language-proficiency-of-popular-llms-64b2bd7ddb75?source=rss----e9c948ff6ebd---4</link>
            <guid isPermaLink="false">https://medium.com/p/64b2bd7ddb75</guid>
            <category><![CDATA[language]]></category>
            <category><![CDATA[large-language-models]]></category>
            <category><![CDATA[l2]]></category>
            <category><![CDATA[language-proficiency]]></category>
            <category><![CDATA[certification]]></category>
            <dc:creator><![CDATA[Giorgio Robino]]></dc:creator>
            <pubDate>Tue, 16 Jul 2024 16:51:09 GMT</pubDate>
            <atom:updated>2024-07-16T17:12:42.830Z</atom:updated>
            <content:encoded><![CDATA[<h4>A Semi-Serious LLM Self-Evaluation Experiment</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*YfDMvmF-AePP3v0Q-yEo8Q.jpeg" /><figcaption>Image made with <a href="https://firefly.adobe.com/">https://firefly.adobe.com/</a></figcaption></figure><p>Last weekend, just for personal fun, I conducted a non-scientific experiment to test how well the Large Language Models (LLMs) available on the market “know” a specific natural language (Italian, in my case) according to CEFR guidelines.</p><p>I’ll briefly recap what the CEFR classification standard is, detail the experiment, and finally share some thoughts about it.</p><h3>What’s CEFR?</h3><p>CEFR (<a href="https://en.wikipedia.org/wiki/Common_European_Framework_of_Reference_for_Languages">Common European Framework of Reference for Languages</a>), also known in Italy as QCER (Quadro Comune Europeo di Riferimento per le Lingue), is a standard for classifying language proficiency.</p><p>The CEFR is a guideline used to describe the achievements of language learners across Europe and increasingly worldwide. In November 2001, a European Union Council Resolution recommended using the CEFR to establish systems for validating language ability.</p><p>The six reference levels (A1, A2, B1, B2, C1, C2) are widely accepted as the European standard for grading an individual’s language proficiency. These levels cover several competencies: written and oral comprehension, and written and oral production.</p><ul><li><strong>A1 and A2:</strong> Basic users who can understand and use simple phrases and sentences in familiar contexts. A1 indicates very basic comprehension and production, while A2 shows slightly more advanced skills.</li><li><strong>B1 and B2:</strong> Independent users who can handle more complex language. B1 users understand the main points of clear standard input on familiar matters and can produce simple connected text. B2 users comprehend and interact on a wider range of topics, producing detailed and coherent text.</li><li><strong>C1 and C2:</strong> Proficient users with advanced skills. C1 users understand a wide range of demanding texts, recognize implicit meaning, and express themselves fluently. C2 users easily understand virtually everything heard or read, summarizing information from various sources coherently.</li></ul><h3>The LLM Self-Evaluation Experiment</h3><p>I collected the responses of each of the nearly 50 models available on the <a href="https://chat.lmsys.org">Chatbot Arena</a> website to this question (prompt) in Italian:</p><blockquote>“Qual è il tuo livello di conoscenza della lingua italiana, rispetto alla classificazione QCER? Rispondi solo con una parola indicante il tuo livello di competenza:”</blockquote><p>Meaning in English: <em>“What is your level of Italian language proficiency, according to the CEFR classification? Respond with only one word indicating your level of competence”</em>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/969/1*XnEcVi_jZV6UkYl78b6ZYA.png" /><figcaption>Submitting the question on the <a href="https://chat.lmsys.org">Chatbot Arena website</a></figcaption></figure><p>For each model, I collected the response and had to elaborate a bit because, in some cases, the models replied with synonyms, long sentences, or nonsensical answers. I compiled all the results in a table. 
See the attached screenshots:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/762/1*K_MmMPX3Xfmc6w1OgFYFHQ.png" /><figcaption>Models Declaring CEFR C1-C2 Levels</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/787/1*4qPRoHq4RCFMfGQFkK9DYA.png" /><figcaption>Models Declaring CEFR B1-B2 Levels</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/796/1*2Rtq50s1juhuBOqAxNvFdA.png" /><figcaption>Models Declaring CEFR A1-A2 Levels</figcaption></figure><p>The table shows three columns: MODEL-NAME, QCER_LEVEL (the resulting CEFR level as a single word), and RANK (where 1.0 means the LLM replied exactly with the expected level; between 0.0 and 1.0 means the LLM replied with a synonym or a long sentence).</p><p>Almost half of the models, including small and open-weight models, self-evaluate as having good or very good proficiency.</p><blockquote>Hmm… Maybe there is a bit of overestimation in these LLM self-assessments?!</blockquote><h3>Automating Language Proficiency Assessments?</h3><p>My experiment was clearly just a curiosity! I’m not a linguist, so a more thorough investigation should be conducted by domain experts (linguist researchers).</p><p>The quick test I conducted was done via the chat (textual) web interface, so it can’t assess listening and speaking! I admit my simple prompt biased the LLMs to reply with a single word, which we took as valid only for reading/writing abilities. A complete evaluation would require interfacing these LLMs with a voice interface (to test listening and speaking) and the production of content with varying levels of difficulty.</p><p>With a more scientific evaluation, a human language expert, typically a language teacher from a certifying body, could examine an LLM to classify proficiency using the same CEFR metrics we use for humans (reading comprehension, writing production, listening comprehension, speaking production, interaction, mediation).</p><p>A further step could be to develop a comprehensive LLM-based testing application to fully automate the proficiency examination, using a top-level LLM as the examiner to test other LLMs (acting as examinees). So, perhaps, by using high-proficiency LLMs as CEFR experts, we could automate some of the real teachers’ work on CEFR examinations, evaluating the language proficiency of human students… or other LLMs.</p><p>More generally, many e-learning activities (language-learning related and beyond) usually done by human teachers could be partially implemented by LLM-based conversational agents.</p><p>For example, teachers could be assisted by conversational assistant applications to manage the “heavy work” of exercising students and examining them, reporting relevant milestones and events to the teacher. There is a vast range of automation in the edutech sector that is now enhanced by generative AI.</p><p>What do you think?</p><p>#GenerativeAI #LLMs #LargeLanguageModels #LanguageLearning #eLearning #EduTech #CEFR #QCER</p><hr><p><a href="https://convcomp.it/testing-the-language-proficiency-of-popular-llms-64b2bd7ddb75">Testing the Language Proficiency of Popular LLMs</a> was originally published in <a href="https://convcomp.it">ConvComp.it</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[A Conversational Agent with a Single Prompt?]]></title>
            <link>https://convcomp.it/a-conversational-agent-with-a-single-prompt-957c4e804209?source=rss----e9c948ff6ebd---4</link>
            <guid isPermaLink="false">https://medium.com/p/957c4e804209</guid>
            <category><![CDATA[autonomous-agent]]></category>
            <category><![CDATA[prompt-engineering]]></category>
            <category><![CDATA[chatbots]]></category>
            <category><![CDATA[large-language-models]]></category>
            <category><![CDATA[conversational-ai]]></category>
            <dc:creator><![CDATA[Giorgio Robino]]></dc:creator>
            <pubDate>Wed, 05 Jun 2024 20:27:04 GMT</pubDate>
            <atom:updated>2024-06-18T14:41:14.209Z</atom:updated>
            <content:encoded><![CDATA[<h4>Using Large Language Models for Chatbot Development: Specializing in Prompt Design</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*KesWdAgvBTuQ6zQW0M3a9Q.jpeg" /></figure><p>In this article, I share my experience in constructing Generative AI prompts to develop Conversational Agents.</p><p>First, I will clarify the relevant terms. Then, I will provide a brief overview of how we can utilize Large Language Models (LLMs) as intelligent conversationalists. Finally, I will present some compelling use cases where I have refined prompt engineering best practices to implement chatbots solely from no-code requirement specifications (system prompts).</p><h3>Conversational Agents</h3><p>Years ago, during my previous academic career (specifically as an assistant researcher at <a href="https://www.itd.cnr.it/istituto/personale/robino-giorgio.html">ITD-CNR</a>), my research leader and other researchers always referred to chatbots as conversational agents. This perplexed me, as I’m particular about terminology in computer science. I always understood an agent to be any kind of software that acts as an intermediary among humans to perform some task (usually a task delivered by a human).</p><blockquote><em>My point was that not every chatbot is truly an agent in functional terms.</em></blockquote><p>For example, consider a voice system that acts as an assistant (nowadays we might call it a voice copilot) for a worker, assisting them in accomplishing specific real-life working tasks. Is it correct to define this system as a conversational agent? Maybe not, because it lacks agentive functionality. The term assistant may be more appropriate for augmented-reality scenarios like this (read also my previous article, <a href="https://convcomp.it/voice-cobots-in-industry-a-case-study-352294bd0d5a">Voice-cobots in industry. A case study</a>). However, I admit that historically, in the scientific and academic community, conversational agent and chatbot have been used as synonyms.</p><p>Nevertheless, things have become more confusing with recent advancements in LLM-based autonomous agents. In this research area, which is broader than conversational applications, agents can autonomously define and execute micro-tasks based on a human-provided description in natural language (the system prompt) of a specific high-level duty or activity. This is a fascinating area of research with potentially disruptive practical applications, and there are many software frameworks available, but that’s a slightly different topic. Let’s now focus instead on the conversational application verticals.</p><blockquote><em>Overall, I use the term Conversational Agent to refer to a specific type of agent that performs conversational tasks on behalf of a human.</em></blockquote><p>Progress with LLM-based conversational agents allows us to build chat systems with a single prompt based on cognitive architectures. By utilizing advanced state-of-the-art LLMs, developers can describe what the chatbot should do without having to program the conversation as a series of fixed dialog states. 
From the development perspective, this could be a definitive cost-saving alternative to solutions based on intents, slots, states, and hard-coded flow management.</p><h3>LLMs as Core Layers for Agent Engines</h3><p>Long story short, GPT-based Large Language Models have revolutionized the field of conversational AI since the release of GPT-3 by OpenAI. These recent LLMs, trained on vast amounts of text data, can generate human-like responses and engage in meaningful dialogues. Their ability to understand and generate language makes them ideal for building chatbots.</p><h3>Instruction-based Chat Completion Models</h3><p>A basic foundation model (a large language model trained with sufficient data to ‘know’ a specific human language) is not sufficient to be a valid engine capable of holding conversations and reasoning.</p><p>Simply put, the disruptive improvement in GPT-3 models occurred with GPT-3.5-turbo (the model behind the famous ChatGPT; see my previous article: <a href="https://convcomp.it/reflecting-on-chatgpts-anniversary-5873db65bdb8">Reflecting on ChatGPT’s Anniversary</a>). GPT-3.5-turbo is based on the foundation of GPT-3 but enhanced by a supervised training algorithm (RLHF and similar supervised training mechanisms) that enables it to converse with people in fluid natural language, in a polite and ‘controlled’ manner.</p><p>More importantly, the models from GPT-3.5 onwards are also instruction-based models because they are trained with programming code (OpenAI coined the term instruct). This last feature enabled some sort of ‘reasoning’ ability: the LLMs are now able to perform some programmatic ‘logic’ understanding themselves, grasping programming-language concepts such as sequences, conditionals, and iterations.</p><h3>Function Calling Feature</h3><p>Another disruptive feature that nearly all state-of-the-art generative models now possess is the ability to call external functions/APIs (sometimes called tools in LLM-agent jargon). This is achieved through special fine-tuning of the aforementioned models, enabling LLMs to ‘call’ external functionalities, such as programs made in any programming language, to solve specific requests, perform actions, and retrieve real-time data. This is a fundamental need in a cognitive architecture, where the LLM is the core ‘reasoning’ component that autonomously retrieves information from external systems or invokes actuators.</p><p>The function-calling feature is crucial for autonomous agents but not essential for building basic conversational agents. However, function-calling becomes a must-have when the conversational system needs to invoke external APIs. For example, a customer care assistant might need to open a ticket in an internal ticketing system or query the system to monitor the ticket status and inform the customer during the conversation.</p><p>The recent generative language models (instruction-based LLMs fine-tuned for chat completions, also enabled by function calling) can understand logic and instructions (through directives written in natural language in the prompt) and have an improved capacity to conduct human-like conversations in nearly any natural language. Additionally, these models can interact with external (proprietary) APIs. All in all, today’s models like GPT-4 or equivalents are viable engines for building autonomous agents capable of performing task-oriented conversations typically handled by humans.</p>
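<p>To give a concrete sense of the function-calling mechanics, here is a minimal sketch using the OpenAI chat completions API (the open_ticket tool and its fields are hypothetical, echoing the customer care example above):</p><pre>from openai import OpenAI<br><br>client = OpenAI()<br><br># A hypothetical backend tool, exposed to the model as a JSON schema.<br>tools = [{<br>    &quot;type&quot;: &quot;function&quot;,<br>    &quot;function&quot;: {<br>        &quot;name&quot;: &quot;open_ticket&quot;,<br>        &quot;description&quot;: &quot;Open a support ticket in the internal ticketing system.&quot;,<br>        &quot;parameters&quot;: {<br>            &quot;type&quot;: &quot;object&quot;,<br>            &quot;properties&quot;: {<br>                &quot;description&quot;: {&quot;type&quot;: &quot;string&quot;},<br>                &quot;product&quot;: {&quot;type&quot;: &quot;string&quot;},<br>            },<br>            &quot;required&quot;: [&quot;description&quot;, &quot;product&quot;],<br>        },<br>    },<br>}]<br><br>response = client.chat.completions.create(<br>    model=&quot;gpt-4o&quot;,<br>    messages=[<br>        {&quot;role&quot;: &quot;system&quot;, &quot;content&quot;: &quot;You are a customer care assistant.&quot;},<br>        {&quot;role&quot;: &quot;user&quot;, &quot;content&quot;: &quot;My badge reader is broken.&quot;},<br>    ],<br>    tools=tools,<br>)<br><br># When the model decides to call the tool, run the real backend function<br># and send its result back to the model in a follow-up &#39;tool&#39; message.<br>print(response.choices[0].message.tool_calls)</pre>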
<p>In the next paragraphs, I will delve into this with some examples, but first, I will introduce the prompt engineering approach I used.</p><h3>Prompt Design for Task-oriented Conversations</h3><p>Prompt engineering is the practice of designing and refining input prompts to effectively guide the behavior and output of language models. By carefully crafting these prompts, users can enhance the model’s ability to understand and respond to complex instructions, ensuring more accurate and contextually appropriate outputs. This technique is crucial for optimizing the performance of state-of-the-art generative models, enabling them to perform specific tasks, generate creative content, and simulate ‘human-like’ conversations with precision.</p><blockquote><em>The techniques I experimented with are about writing system prompts to instruct the LLM to conduct conversations in specific application domains to achieve particular tasks.</em></blockquote><h3>In-Context Learning</h3><p>In all use cases I’ll introduce, I used a similar approach: the system prompt opens with an introductory context section where I defined:</p><ol><li>the goal of the conversation (or task);</li><li>the bot-persona (the description of the agent’s characteristics/character, using the usual conversation design metrics);</li><li>the user persona (a description of the user profile);</li><li>the contextual data useful for the current conversation session (the core part of the context).</li></ol><p>For example, if the conversation is an interview for a job applicant, this data includes the job description file and the candidate’s curriculum. More generally, the technique is akin to the one made famous by Retrieval Augmented Generation (RAG) applications, where you ‘stuff’ into the prompt data retrieved elsewhere (perhaps from an embeddings database or some vertical-specific data retrieval system).</p><p>When considering the data needed to accomplish a task-oriented conversation, it could be anything that fits into the prompt context window size (4K tokens, 8K tokens, 16K tokens, and so on). In all practical use cases I have experimented with and mentioned below, a context window of 4–7K tokens has been entirely sufficient for the purpose.</p><h3>Directive Instructions on Conducting the Dialog</h3><p>After the context part of the prompt, in the following instruction section, I detailed the required steps (actions to be accomplished in a specific order). This is the tricky part where you instruct the model not just on how to conduct the conversation in terms of social practices and human conventions, but also provide guidelines regarding the topics to cover, possibly including explicit questions or general behaviors to adopt.</p><p>Here, you instruct the LLM on what topics must be covered in the chat, how to conduct the dialogue with more or fewer guidelines, and how to guide the conversation from point A to the desired point B. Finally, the instructions must include criteria for deciding when to end the conversation session, which depends on the specific application and can be a bit tricky to implement.</p>
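<p>Putting the pieces together, the overall layout of such a system prompt looks roughly like this skeleton (the section names and wording are illustrative, not a fixed syntax):</p><pre>GOAL: &lt;the task the conversation must accomplish&gt;<br>BOT-PERSONA: &lt;character, tone of voice, and behavioral traits of the agent&gt;<br>USER-PERSONA: &lt;profile of the expected user&gt;<br>CONTEXT DATA: &lt;session-specific documents and data, e.g. a CV and a job description&gt;<br>INSTRUCTIONS:<br>1. &lt;first topic or question to cover&gt;<br>2. &lt;next step, possibly with conditionals or loops&gt;<br>...<br>N. &lt;criteria for closing the session and the final output to emit&gt;<br>LANGUAGE: Conduct the conversation in informal, fluent Italian.</pre>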
<figure><img alt="" src="https://cdn-images-1.medium.com/max/558/1*QiefQlHmYlXAKP1cWo2AbA.png" /></figure><h3>Some Application Use Cases</h3><p>I introduce three dialogue systems I prototyped for entirely different verticals. All these applications have in common the fact that I wrote the conversation program as a single system prompt for an LLM. In chronological order of my developments:</p><h3>Case 1: A Virtual Caregiver for Patient Telemedicine Visits</h3><p>I have been involved in some prototypes related to the healthcare vertical, specifically in transcribing and extracting data from practitioner-patient visits for Conversational Analysis (CA) using LLMs. As a side project, I developed an emulation of a remote monitoring visit where a virtual assistant (acting as a practitioner or caregiver) contacts a patient every day via an instant messaging app to monitor their health status, particularly considering the patient is potentially affected by COVID-19. The virtual caregiver asks the patient about their health status, chats with them in a very natural way, delves into symptoms, and engages in small talk if the patient initiates it, while keeping the conversation focused on retrieving certain parameters: health status, temperature, blood oxygenation, and a few other variables.</p><p>Once all the requested information is retrieved, the virtual caregiver says goodbye to the patient and closes the conversation, internally returning a data structure (a JSON) containing all the information obtained from the patient. Interestingly, in this case, the end of the conversation is not strictly necessary. After the initial session, the user can re-engage with updates on their symptoms. The virtual caregiver replies to any patient questions or statements about their symptoms and internally emits any data updates via a function call. This example is also interesting for its psychological support aspects, but that’s another story.</p>
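<p>The returned structure is along these lines (a purely hypothetical sketch; field names and values are illustrative):</p><pre>{<br>  &quot;health_status&quot;: &quot;improving&quot;,<br>  &quot;temperature_celsius&quot;: 37.2,<br>  &quot;blood_oxygenation_percent&quot;: 96,<br>  &quot;symptoms&quot;: [&quot;mild cough&quot;],<br>  &quot;notes&quot;: &quot;Patient reports sleeping better than yesterday.&quot;<br>}</pre>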
<p>You could argue that the described conversation is just an old-fashioned form-filling that one could implement with a simple hard-coded chatbot. However, the novelty of the LLM-based conversation is the naturalness of the interaction. This variance in how the system conducts any new conversation session, allowing user digressions while returning to the programmed goal of gathering information, is invaluable!</p><h3>Case 2: A Customer Care Assistant</h3><p>This is a classic chatbot application that I already mentioned in the article. Imagine a virtual assistant helping an employee of a very large company submit requests or report issues that can be tracked by opening tickets on a specific backend system. The user must also be able to ask about the status of previously submitted tickets. This is a very common chatbot application that I delivered in production as a standard state-machine flow tool seamlessly integrated with external REST APIs.</p><p>Subsequently, I tried to re-implement the same application using an LLM-based approach. The initial application involved highly constrained workflows, so what’s the advantage of using an LLM as a dialog conductor? I also struggled with implementing these programmatic steps that are simple to implement with a hard-coded flow. So, what are the advantages of implementing all this logic with a ‘declarative’ approach instead of using a standard software program?</p><p>There are two interesting pros. The first is that the conversation built by the LLM seems more ‘natural’, emulating the behavior of a human being (e.g., a help desk operator), allowing the user to describe an issue in various ways and guiding them to explain the problem concisely to gather all the necessary data.</p><p>The second advantage is the reduction in development time: with the single-prompt approach, the chatbot developer is no longer a software programmer using a chatbot development tool, but rather a prompt engineer with conversational design skills, who writes the chatbot specification as a special text in a natural language (English, Italian, etc.).</p><p>Besides the prompt engineer, we still need a backend developer who knows how to integrate external APIs, but what’s nice is that these two roles are quite distinct, and the software responsibility boundaries are clear.</p><h3>Case 3: A Virtual Job Position Interviewer</h3><p>The most fun and intriguing application I’m experimenting with is in the Human Resources vertical. Using the usual in-context learning prompt-writing approach, I built an emulator of a recruiter conducting an interview with a person who applied for a certain job position.</p><p>In the prompt context, I included the job post description and the candidate’s curriculum vitae. In the instruction section, I taught the LLM to act as a perfect recruiter, asking questions to verify all matches and mismatches between the role description and the candidate’s experience. The results are very impressive, and the virtual interviewer’s behavior is smart enough to detect weaknesses and strengths of the candidate by comparing the CV with the required skills. As a test candidate myself, I have been unable to lie in response to such precise investigative questions.</p><p>Since I’m not an expert recruiter myself, my approach could surely be improved with input from a domain expert in human resource recruiting. Nevertheless, my current experiments are astonishing. The system conducts a natural (similar to a human-to-human dialog) yet very rational interview, exploring points of weakness and verifying the truth of user statements in a polite and positive manner (as I instructed the bot-persona to do).</p><p>Besides the above application, I also created collateral LLM-based tools, such as a ‘pre-interview’ prompt to decide if a candidate deserves to be interviewed and some ‘post-interview’ tools that analyze the interview dialog and produce a structured report with a final ranking, but these are separate, one-shot LLM-based applications.</p><h3>Prompt Development Challenges</h3><p>LLMs are not deterministic. This has certain advantages, such as enabling smooth, fluent, always slightly different conversation variations, but it also presents some drawbacks. When considering the applications covered here, this randomness can potentially create issues. The main challenge I encountered was not the outcomes of the first prompt I designed, but the subsequent editing required to refine it to adjust some incorrect or unexpected runtime behavior.</p><p>Related to this, LLMs suffer from what I call fragility syndrome: you may have an initially well-functioning prompt, but even a minor, seemingly insignificant modification of a statement or a typo (by the way, typos are absolutely forbidden when writing prompts; please use a spell checker!) can cause different and unexpected runtime behaviors. 
Fixing this usually requires a lot of time spent on trial and error, where I rethink the prompt and often have to rewrite or reorganize it following a new, more logical approach.</p><p>For the prototypes I created, I admit I did not use any automated testing tools to validate the LLM outputs. This automatic evaluation is not a trivial task, although there are some emerging tools that can help prompt engineers validate prompts (this is a topic for a future article).</p><h3>Tentative Conclusions</h3><p>There is a lot of hype around optimal use cases for LLMs. Since 2023, I have seen hundreds of papers, articles, and videos concerning RAG/LLM applications. While LLM-enabled data retrieval-based chatbots are certainly an important use case, for me, also as a conversation designer, the perfect use case for state-of-the-art LLMs is to exploit the conversational capabilities embedded in LLMs trained and fine-tuned on human conversations.</p><blockquote><em>Has the no-code dream of developing chatbots now become a reality, requiring just prompt engineering skills?</em></blockquote><p>What are your thoughts?</p><p>#promptEngineering #LLMs #generativeAI #GenAI #nocode #conversationalAgents #AutonomousAgents #chatbots #ConversationDesign #AI #MachineLearning #NaturalLanguageProcessing #AIChatbots #AIApplications</p><hr><p><a href="https://convcomp.it/a-conversational-agent-with-a-single-prompt-957c4e804209">A Conversational Agent with a Single Prompt?</a> was originally published in <a href="https://convcomp.it">ConvComp.it</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Reflecting on ChatGPT’s Anniversary]]></title>
            <link>https://convcomp.it/reflecting-on-chatgpts-anniversary-5873db65bdb8?source=rss----e9c948ff6ebd---4</link>
            <guid isPermaLink="false">https://medium.com/p/5873db65bdb8</guid>
            <category><![CDATA[conversational-ai]]></category>
            <category><![CDATA[llm]]></category>
            <category><![CDATA[co-pilot]]></category>
            <category><![CDATA[chatgpt]]></category>
            <category><![CDATA[prompt-engineering]]></category>
            <dc:creator><![CDATA[Giorgio Robino]]></dc:creator>
            <pubDate>Thu, 30 Nov 2023 10:16:23 GMT</pubDate>
            <atom:updated>2023-11-30T10:16:23.255Z</atom:updated>
            <content:encoded><![CDATA[<h4>What is the path forward after a year of revolutionary strides in conversational AI?</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*jHt0T6FHzwIrjWGjWlr9sw.jpeg" /><figcaption>source: <a href="https://openai.com/blog/chatgpt">https://openai.com/blog/chatgpt</a></figcaption></figure><p>ChatGPT celebrates its first birthday on November 30, 2023. Exactly one year ago, OpenAI announced the birth of ChatGPT in a blog post, which you can find <a href="https://openai.com/blog/chatgpt">here</a>.</p><p>In December 2022, the ChatGPT craze took off immediately, with hundreds of thousands of users and developers (like myself) eagerly engaging with this innovative technology.</p><p>I must confess that before ChatGPT was born, I was skeptical of large language models (LLMs) and statistics-based AI technologies in general. I followed the debates surrounding GPT-3 in 2022 with keen interest, and I must admit I was quite perplexed about the potential successful evolution of what some referred to as ‘stochastic parrots’.</p><p>However, everything changed with the release of GPT-3.5 by OpenAI and subsequent models. These new models were extensively trained not only with human texts, including narratives and conversations, but also with programming code (InstructGPT). This has been a game-changer, in my opinion. These LLMs have been ‘filtered’ (with RLHF techniques, etc.) and fine-tuned for a chat experience interface (ChatGPT), not only demonstrating a nearly perfect natural language syntax but also exhibiting remarkable semantics, as acknowledged even by Walid Saba.</p><p>Inspired by David Shapiro’s amazing live-coding videos, I delved into prompt engineering techniques in December 2022. Throughout 2023, I focused on exploring the applications of large language models, specifically in the realms of natural language processing and conversational interfaces, especially chatbots.</p><p>In my free time, I also studied a lot and participated a little in the growth of some excellent open-source Python development frameworks like LangChain, LlamaIndex, and LiteLLM, just to name a few. I have been overwhelmed by the sheer number of academic papers published at an insane rate throughout the year.</p><p>Now, a crucial development is that the latest GPT-based models have demonstrated the capability to perform some reasoning-based tasks.</p><p>Many researchers are currently grappling with the challenge of constructing ‘autonomous’ agents, or more accurately, copilot assistants — systems designed to interact in real-time with users and collaboratively execute workflows alongside humans. This is a topic that has interested me for quite some time, as in 2020 I developed a research project concerning a <a href="https://convcomp.it/voice-cobots-in-industry-a-case-study-352294bd0d5a">voice cobot</a> in real-time industrial logistics workflows.</p><p>I am currently intrigued by the prospect of utilizing LLMs as foundational layers in cognitive architecture frameworks to develop a new era of goal-oriented conversational agents. 
These systems engage in real-time conversations with users, perhaps in voice-first augmented-reality applications, incorporating voice, chat, vision, physical sensors, and actuators to achieve user goals without depending on hard-coded logic (‘imperative-driven’ software written in some programming language).</p><p>Perhaps we have reached a juncture where we can design these systems, setting ‘business requirements’ almost entirely in natural language, maybe by the end user directly (or with the help of a prompt design expert). Are we witnessing the realization of the ‘no-code’ dream that some of us fantasized about just a few years ago?</p><p>Surely, 2024 will be an exciting year for the conversational AI community!</p><p>What are your thoughts?</p><hr><p><a href="https://convcomp.it/reflecting-on-chatgpts-anniversary-5873db65bdb8">Reflecting on ChatGPT’s Anniversary</a> was originally published in <a href="https://convcomp.it">ConvComp.it</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Non-English Languages Prompt Engineering Trade-offs]]></title>
            <link>https://convcomp.it/non-english-languages-prompt-engineering-trade-offs-7e529866faba?source=rss----e9c948ff6ebd---4</link>
            <guid isPermaLink="false">https://medium.com/p/7e529866faba</guid>
            <category><![CDATA[foreign-language]]></category>
            <category><![CDATA[large-language-models]]></category>
            <category><![CDATA[prompt-engineering]]></category>
            <category><![CDATA[conversational-ai]]></category>
            <dc:creator><![CDATA[Giorgio Robino]]></dc:creator>
            <pubDate>Tue, 05 Sep 2023 13:16:06 GMT</pubDate>
            <atom:updated>2023-09-05T13:15:16.850Z</atom:updated>
            <content:encoded><![CDATA[<h4>To employ or not to employ the language of English, this is the question.</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*qCYPhxaLWCM2S6-PEPVeRQ.jpeg" /><figcaption>Synthetic image I made: <a href="https://creator.nightcafe.studio/studio?open=creation&amp;panelContext=%28jobId%3ACPpndBWWGWRBNGfthD8r%29">https://creator.nightcafe.studio/studio?open=creation&amp;panelContext=%28jobId%3ACPpndBWWGWRBNGfthD8r%29</a></figcaption></figure><p>English stands as the most widely utilized language on the internet, particularly in Western countries. It unquestionably serves as the lingua franca within the fields of computer science and scientific communities across the globe. The detailed breakdown of languages and their respective proportions in the training data for models like GPT-3.5 remains undisclosed by OpenAI and other providers.</p><p>Most contemporary large language models, including GPT-3.5, are likely trained on an extensive and diverse corpus of internet text, encompassing content from a multitude of languages. A rough assumption is that the percentage of training data follows language prominence on the internet. Please refer to the table below for approximate reference.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/748/1*ZuciQibcX3zspA4iucb25Q.png" /><figcaption><strong>Rough Estimates of Language Prominence on the Internet (from a reluctant ChatGPT)</strong></figcaption></figure><p>If we limit the languages to Western countries, we would probably see a higher percentage of English, possibly exceeding 50%. Do you concur?</p><p>What astonishes me, however, is the near-flawless proficiency of state-of-the-art LLMs (Large Language Models) in minor languages, including my own: Italian. Rarely do I encounter syntax and “semantic understanding” errors, even in conversations conducted in highly proficient Italian.</p><p>This is undeniably remarkable!</p><p>So, when it comes to constructing complex LLM-based applications in a non-English language, one might assume that employing prompts exclusively in the non-English language is the straightforward and convenient approach. Nonetheless, I’m not entirely certain that it consistently produces the best results.</p><p>At first glance, there doesn’t appear to be a significant qualitative difference when comparing applications generated using prompts in Italian versus English.</p><p>However, let’s take a moment to consider some important factors in the equation: token usage, cost, latency, procedural understanding, language subtleties, and proficiency.</p><h4>Token usage</h4><p>Several months ago, I participated in a small Twitter thread where Hassan Hayat (<a href="https://twitter.com/TheSeaMouse">@TheSeaMouse</a>) shared a small experiment demonstrating that, when given the same text (an abstract of a famous paper in that case), GPT consumes fewer tokens when processing English compared to other languages. My reply:</p><blockquote>Yes! Italian even more: 465 tokens. Maybe the translation test is not perfect but gives the idea. It seems to me that token composition is based on ASCII char encoding. E.g. 
<h4>Money Cost and Latency</h4><p>These variables are closely tied to token length. If you’re using a cloud-based LLM with a pay-per-usage pricing model (say, an Azure OpenAI deployment), the more tokens you use, the more you pay. But even if you’re running an on-prem model at home, such as a LLaMA 70B or similar, the number of tokens you process represents an indirect cost.</p><p>More tokens processed translate into more computation which, in simplified terms, results in longer latency. This point is crucial when your application is an interactive conversational system like a chatbot or, even more so, a voice-interfaced assistant.</p><p>By the way, I don’t have any benchmarks or comparisons regarding the relationship between context-window token size and latency. Please do let me know if you come across any relevant research or discussions on this topic.</p><h4>Efficiency in ‘Procedural Understanding’</h4><p>I’ve been experimenting with LLM prompts that implement goal-oriented conversational workflows, such as the common customer-care use case where a chatbot guides the user through opening a ticket, among other tasks.</p><p>In such scenarios, your goal is to instruct an LLM to follow a procedural workflow. This workflow may involve conditional statements, a sequence of actions such as slot filling, API requests, and even the generation of events in structured formats like JSON. For a schematic example, refer to:</p><pre>TOPIC: Opening a Support Ticket<br>STEP-BY-STEP WORKFLOW:<br>1. Begin the process of opening a support ticket for the user&#39;s issue.<br>2. Initiate a conversation to gather all the necessary details.<br>3. Collect the following attributes one at a time:<br>   a. Issue Description: Ask the user to provide a detailed description of the problem they are encountering.<br>   b. Product or System: Inquire about the name or model of the product or system they are using.<br>   c. Contact Information:<br>      i. Choose a preferred method of contact:<br>         - If &quot;email&quot; is selected:<br>           - Request the user&#39;s email address.<br>           - Confirm the provided email address.<br>         - If &quot;phone&quot; is selected:<br>           - Request the user&#39;s phone number.<br>           - Confirm the provided phone number.<br>4. Display a summary of the gathered information and request confirmation from the user before proceeding.<br>5. Finally, submit the support ticket with the provided information. Generate the following JSON code without comments:<br>   {&quot;api&quot;: &quot;open_ticket&quot;, &quot;email&quot;: email, &quot;phone&quot;: phone, &quot;description&quot;: description, &quot;product&quot;: product}</pre><p>It has come to my attention that these pseudo-code prompts are better understood when written in English. This may be because recent LLMs are also trained on programming languages, where the most commonly used terms are English.</p><p>I’m not entirely certain about this observation, and I lack quantitative data for a definitive comparison. It’s more of a personal impression, and I would appreciate it if you could share any research studies on this topic.</p>
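<p>To make the approach concrete, here is a minimal sketch of how such an English workflow prompt can be wired to a chat model, adding the one-line language instruction discussed in the conclusions below. It assumes the current OpenAI Python client; the model name is just a placeholder, and WORKFLOW_PROMPT stands for the full ticket workflow above:</p><pre># pip install openai<br>from openai import OpenAI<br><br>client = OpenAI()  # reads OPENAI_API_KEY from the environment<br><br>WORKFLOW_PROMPT = &quot;TOPIC: Opening a Support Ticket ...&quot;  # the English workflow above<br>LANGUAGE = &quot;LANGUAGE: Conduct the conversation in informal, fluent Italian.&quot;<br><br>response = client.chat.completions.create(<br>    model=&quot;gpt-3.5-turbo&quot;,  # placeholder: any chat-completion model<br>    messages=[<br>        {&quot;role&quot;: &quot;system&quot;, &quot;content&quot;: WORKFLOW_PROMPT + &quot;\n&quot; + LANGUAGE},<br>        {&quot;role&quot;: &quot;user&quot;, &quot;content&quot;: &quot;Ciao, ho un problema con la stampante.&quot;},<br>    ],<br>)<br>print(response.choices[0].message.content)  # the model replies in Italian</pre><p>Swapping the single word “Italian” in the LANGUAGE line is, in this scheme, all it takes to redeploy the same English prompt for another language.</p>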
<h4>Temporary Conclusions</h4><p>My current approach is to write prompts in English, even for Italian-language LLM-based applications, whether they are conversational systems or more complex tasks in Italian vertical applications (such as meeting-transcript summarization or spoken-dialogue analysis).</p><p>By the way, to prompt the LLM to reply in Italian, I simply use a straightforward instruction like this:</p><pre>LANGUAGE: Conduct the conversation in informal, fluent Italian.</pre><p>This approach apparently offers several advantages:</p><ul><li>It minimizes token length, thereby reducing costs and latency.</li><li>It arguably maximizes procedural comprehension of workflow instruction prompts.</li><li>It allows for multilingual applications by design: you write your prompts once in English and can deploy your system in any language with a single-word substitution!</li></ul><p>However, I acknowledge that all of my considerations thus far have been quite qualitative and based on my empirical experiments. I invite you to share your experiences and any relevant scientific evidence on the discussed topics.</p><p>Thank you for taking the time to read this article. Your feedback is highly valuable to me, so please feel free to leave a like and a comment below to share your thoughts and insights.</p><p>Giorgio</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=7e529866faba" width="1" height="1" alt=""><hr><p><a href="https://convcomp.it/non-english-languages-prompt-engineering-trade-offs-7e529866faba">Non-English Languages Prompt Engineering Trade-offs</a> was originally published in <a href="https://convcomp.it">ConvComp.it</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Could Telegram be a competitor of voice assistants, like Amazon Alexa or Google Assistant?]]></title>
            <link>https://convcomp.it/could-telegram-be-a-competitor-of-voice-assistants-like-amazon-alexa-or-google-assistant-14f1ac7ec113?source=rss----e9c948ff6ebd---4</link>
            <guid isPermaLink="false">https://medium.com/p/14f1ac7ec113</guid>
            <category><![CDATA[smart-speaker]]></category>
            <category><![CDATA[voice-assistant]]></category>
            <category><![CDATA[telegram]]></category>
            <category><![CDATA[smartspeakersmarketshare]]></category>
            <category><![CDATA[telegram-bot]]></category>
            <dc:creator><![CDATA[Giorgio Robino]]></dc:creator>
            <pubDate>Mon, 04 Jul 2022 20:53:06 GMT</pubDate>
            <atom:updated>2022-07-05T07:52:30.947Z</atom:updated>
            <content:encoded><![CDATA[<h4>An open letter to Pavel Durov, containing some change requests to enable voice integration into the Telegram bot ecosystem</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*TyS9LOsSoh-506qWQURRlw.jpeg" /><figcaption><a href="https://moscow.biohacking.events/en/article/pavel-durov-podelilsya-novim-opitom-biohakinga-on-praktikuet-fasting-98786">photo credit</a>: Pavel Durov, founder of <a href="https://en.wikipedia.org/wiki/Telegram_(software)">Telegram Messenger</a>, in 2020.</figcaption></figure><p>I started writing this article almost two years ago, and from time to time the topic comes back to mind, so maybe the question in the title is still valid.<br>In the first part of the article I analyze what is happening in the market of smartspeaker-based voice assistants, and in the second part I introduce some change requests that could allow Telegram to become a winning competitor in this restless vertical.</p><h4>Let’s recap the current smartspeaker market landscape</h4><p>For a few years, Amazon Alexa and Google Assistant have been competing in the consumer <em>smartspeaker</em> market, with similar market shares. To be precise, Amazon Echo devices are a bit more widespread around the world, but the gap between the two competitors is not huge.</p><p><a href="https://bixbydevelopers.com/">Bixby</a> looked like an emerging third competitor, but Samsung seems to have given up on releasing a branded smartspeaker device and has reserved the technology for its own smartphones.<br>End of the game for a splendid conversational technology conceived by Viv Labs’ Adam Cheyer and others.</p><p>After more than a year of stagnation, things are now changing, because Google recently announced the <a href="https://developers.google.com/assistant/ca-sunset">sunset of <em>conversational actions</em></a>. This means that:</p><blockquote>In about one year, Google Assistant will no longer support actions designed only for smartspeakers.</blockquote><p>Google’s focus on <em>Android actions</em> has been announced and foreseeable for a while. But now the company seems to have suspended its investments in smartspeaker voice technologies.</p><p>Recently Sonos <a href="https://www.theverge.com/2022/5/4/23056149/sonos-voice-assistant-features-release-date?utm_campaign=theverge&amp;utm_content=entry&amp;utm_medium=social&amp;utm_source=twitter">announced</a> a new “service”, its own voice assistant, which could be a competitor of Alexa (de facto, the current monopolist). But it’s not clear to me whether the Sonos service will be implemented as a hardware device (a HI-FI smartspeaker / home multi-hub, aka the “soundbar”?).<br>The company, after having acquired the great open-source-based <a href="https://snips.ai/">snips.ai</a> company in 2019, will certainly focus on data privacy, proposing some local, on-device processing of user voice data (on the smartspeaker, not in the cloud!).</p><p>Apple has its Siri voice assistant, running on its <a href="https://www.apple.com/homepod-mini/">HomePod</a> devices, right! But so far these Apple smartspeakers haven’t achieved a significant market share. The first release of the HomePod was too expensive, and the <a href="https://developer.apple.com/siri/">Siri third-party application integration</a> has never gained huge success.</p><p>Microsoft’s smartspeaker for Cortana was never born, and the entire Cortana project seems to be dead. 
Game never started.</p><p>Last but not least, Facebook — pardon, Meta — is apparently at stake too:</p><blockquote>who remembers <a href="https://about.facebook.com/technologies/portal/">Facebook Portal devices</a>?</blockquote><p>And what happened to the Facebook–Amazon agreement that established the coexistence of Facebook Portal and Amazon Alexa on the same devices produced and sold by Facebook? Nothing has been heard of it since! Also,</p><blockquote><a href="https://business.whatsapp.com/">WhatsApp for Business</a> has not shined in the space of enterprise chatbot solutions.</blockquote><p>Facebook, sorry, <a href="https://www.whatsapp.com/">WhatsApp</a>, selected a short list of system-integrator companies (2nd parties) to “filter” 3rd-party enterprise companies. In my opinion this has been an unsuccessful path that duplicated data-privacy concerns and over-complicated account requests, creating a lot of frustration. For example, so far it has been impossible to set up a chatbot on WhatsApp even for academic or non-profit purposes (e.g. my account access request to WhatsApp for CPIAbot, submitted in my capacity as a researcher at a public national research institute, never received a reply). An unclear business strategy.</p><h4>Is the unique voice assistant a failed model?</h4><p>So why are the smartspeaker-based voice assistants in this stagnation? In this second quarter of 2022, we are in a situation where</p><blockquote>Amazon Alexa seems to be the only competitor left in the market of smartspeaker-based voice assistants.</blockquote><p>But the question now is:</p><blockquote>Does a unique smartspeaker-based central voice assistant have a future (as a main hub at home or in the office)?</blockquote><p>I remember discussions among experts a few years ago, when pretty much everyone agreed that people need (at home) a unique voice assistant, not many assistants!<br>Is it still so?</p><p>Alexa now seems the winner of this prediction, the unique “1st-party” assistant, the only one leading the market. But this win is perhaps due more to the fact that Amazon’s competitors are “losing out”. In fact, Alexa devices and investments didn’t grow as expected either. I have the feeling of a general stagnation inside the Alexa departments too; for example, the developer communities are no longer supported as they were years ago, and many smart people moved from Alexa to AWS. A weird but significant signal.</p><p>So the winner-take-all ambition of all the above-mentioned companies is probably failing, and there are many reasons. 
All these big companies expected to be the exclusive winner, proposing “walled gardens” and cloud-based ecosystems initially perceived by the market as disruptive, but these models are failing:</p><ul><li>For <strong>final users</strong>, because the data-privacy concerns haven’t been solved yet (partly because of the proposed cloud-based architectures).</li><li>For <strong>3rd-party developer companies</strong>, because they are offered neither a complete technical advantage nor a clear and profitable business model.</li></ul><p>My personal vision is that we need a completely different approach, one that exits the big players’ proprietary walled gardens, with two fundamental requirements:</p><ul><li><strong>Hardware</strong>: we preferably need an open-hardware smartspeaker device with embedded open software (and common protocols) on top.</li><li><strong>Software</strong>: we need an open architecture based on the coexistence of multiple “peer-to-peer” (<em>voicefirst</em>) assistants, operating on a common open platform.</li></ul><p>In other words,</p><ul><li>As a <strong>final user</strong>, you want a smartspeaker where you can use one or more voice assistants (made by “3rd parties”).</li><li>As an <strong>application developer (service supplier/enterprise company)</strong>, you want a common protocol to plug your service into the above open-hardware smartspeaker ecosystem.</li></ul><blockquote>What does all this have to do with <a href="https://telegram.org/">Telegram</a>?</blockquote><p>Even if the famous <em>instant messaging</em> (mobile) app is still a cloud-based closed-source system, an important and well-engineered feature is the “by-design” possibility to enable 3rd-party <em>bots</em>, partly following the multi-assistant architecture I mentioned above. Let’s dive in!</p><h4>What is Telegram and what are the Telegram Bot APIs?</h4><p>Telegram is, for many reasons — above all user experience and proven security — probably the best instant messaging app available on common smartphone and personal computer operating systems!</p><p>This is not only because of the usual reasons Telegram fans mention when comparing this app with the unloved WhatsApp; for developers, Telegram is great mainly because they can easily build chatbot applications, the so-called <em>Telegram Bots</em>.</p><p>As all you programmers know, Telegram supplies a totally free, easy and performant way to build chatbots, using the well-designed <a href="https://core.telegram.org/bots/api">Bot APIs</a> (just <a href="https://core.telegram.org/bots/api-changelog#june-20-2022">updated to version 6.1</a> in June 2022). You can set up your chatbot in a few minutes using some high-level API wrapper in your preferred programming language. Last but not least, Telegram generously allows you to store a really huge number of gigabytes of files for free. So far so good.</p><h4>My experience as researcher and developer of CPIAbot</h4><p>When I was a researcher at ITD-CNR, from 2018 to 2020, I conceived and implemented <a href="https://www.itd.cnr.it/ricerca/progetti/cpiabot.html">CPIAbot</a>, a Telegram voice-chatbot that helps foreign (immigrant) students of CPIA (Italian public adult schools) learn Italian as a second language at a basic level (L2/pre-A1).<br><br>As you can imagine, for a person who has to learn a foreign language, verbal/spoken language understanding is a fundamental goal. So I enabled my chatbot to receive inbound voice messages and to reply with outbound (synthetic) voice messages. By the way, according to the research statistics we elaborated in the experimentation phase, the voice channel was the preferred way for learners to interact with the bot!</p><p>From the UX perspective, using CPIAbot, students send text and/or voice messages to the bot. A custom server produces a text transcript with an ASR engine (at times I used the pretty good, free Facebook <a href="https://wit.ai/">wit.ai</a> service). Afterward the user transcript is processed by a dialog manager engine (my own open-sourced <a href="https://github.com/solyarisoftware/naifjs">naifjs</a>) and the response is returned to the user again as a text and/or voice message, using a Google TTS voice or a human-spoken audio recording; the sketch below outlines this round-trip. More info about CPIAbot in my old article <a href="https://convcomp.it/are-alexa-and-google-assistant-both-unfit-as-language-learning-assistant-inside-outside-the-972bfafeddfd">here</a> and on the academic project home <a href="https://www.itd.cnr.it/en/research/projects/cpiabot.html">page</a>.</p>
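<p>As a rough illustration of that pipeline, here is a minimal sketch built directly on the plain Telegram Bot API, assuming an incoming voice-message update; asr_transcribe, dialog_reply and tts_synthesize are hypothetical stand-ins for the ASR, dialog manager and TTS services mentioned above:</p><pre># pip install requests<br>import requests<br><br>TOKEN = &quot;123456:ABC...&quot;  # placeholder bot token from @BotFather<br>API = f&quot;https://api.telegram.org/bot{TOKEN}&quot;<br><br>def asr_transcribe(audio: bytes) -&gt; str: ...   # stand-in, e.g. wit.ai<br>def dialog_reply(text: str) -&gt; str: ...        # stand-in, e.g. a naifjs-like engine<br>def tts_synthesize(text: str) -&gt; bytes: ...    # stand-in, e.g. Google TTS<br><br>def handle_voice_update(update: dict) -&gt; None:<br>    &quot;&quot;&quot;Voice message in -&gt; transcript -&gt; dialog manager -&gt; text + voice reply.&quot;&quot;&quot;<br>    chat_id = update[&quot;message&quot;][&quot;chat&quot;][&quot;id&quot;]<br>    file_id = update[&quot;message&quot;][&quot;voice&quot;][&quot;file_id&quot;]<br><br>    # resolve the file path, then download the OGG/Opus audio<br>    meta = requests.get(f&quot;{API}/getFile&quot;, params={&quot;file_id&quot;: file_id}).json()<br>    audio = requests.get(f&quot;https://api.telegram.org/file/bot{TOKEN}/&quot; + meta[&quot;result&quot;][&quot;file_path&quot;]).content<br><br>    answer = dialog_reply(asr_transcribe(audio))<br><br>    # reply with both a text and a voice message<br>    requests.post(f&quot;{API}/sendMessage&quot;, data={&quot;chat_id&quot;: chat_id, &quot;text&quot;: answer})<br>    requests.post(f&quot;{API}/sendVoice&quot;, data={&quot;chat_id&quot;: chat_id},<br>                  files={&quot;voice&quot;: (&quot;reply.ogg&quot;, tts_synthesize(answer))})</pre>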
<h4>What’s missing in the almost-perfect Telegram Bot API?</h4><p>So far, you can develop a Telegram <em>voicebot</em> following the <em>message-based</em> paradigm. That’s not totally natural, I have to admit (compared with spoken natural-language interaction), even if nowadays people are used to communicating by exchanging (also spoken) messages through instant messaging apps. Let’s consider audio messages OK for now.</p><p>Now let me make some change requests that could bring Telegram to compete with the big players’ voice-assistant <em>masterbots</em> (someone says <em>metabots</em>) I mentioned above. Simply speaking:</p><blockquote>Imagine your Telegram app as a <strong>smart-speaker</strong> embedded in your phone, able to connect you to any 3rd-party bot without being an all-around “masterbot” assistant.</blockquote><p>Sounds good, doesn’t it? But how would it work in practice? To enable a great voice-interface user experience, some features are currently missing.</p><h4>Change request #1: 🔊 Voice/Audio Messages Auto-Play</h4><p>When a bot answers a user’s (spoken) request, the bot’s voice message (response) should be auto-played by the device (at least for the bot you are currently interacting with). Now, instead, the user must tap the voice-message icon to play it.</p><p>Of course, also for trivial privacy reasons, I would like the auto-play feature to be configurable (<em>opt-in/opt-out</em>), with a general mute-all flag and/or a per-bot flag (I would probably want to <em>un-mute</em> just the most frequently used bots).</p><p>Related to the voice auto-play experience (and in general to any audio or music playback), a <em>smart-phone</em> of course doesn’t have the loudspeaker power of a <em>smart-speaker</em> (such as the Amazon Echo or Google Nest devices, all great devices, no doubt).<br>So even with the desired audio auto-play feature, coupling with an external (Bluetooth-connected?) loudspeaker is highly recommended, though optional.</p><h4>Change request #2: 🎙️ Voice Wake-Word Detection</h4><p>This feature is a bit more complex to implement, and controversial. 
You would replicate the user experience of the <em>invocation sentence</em> on a smart-speaker, when you say “<em>Hey Google</em>…”, sorry, I mean:</p><blockquote>Hey Telegram, open MyAppName…</blockquote><p>You want the named bot to start waiting for your spoken command utterance, recording your voice until silence is detected; afterward the audio message is forwarded to the Telegram server as usual (and, at the end of the day, to the MyAppName bot). A minimal sketch of the client-side loop follows below.<br>There are also alternative solutions for invoking a specific bot (for example, you might want to invoke a bot by its specific name, etc.), but you get the idea. Last but not least, the current Telegram client app’s <em>push-to-talk</em> chatbot selection mode is not so bad, and it eliminates many privacy-related concerns.</p>
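<p>Here is a rough sketch of that client-side loop; microphone_frames, detect_wake_word and is_silence are hypothetical helpers standing in for a real audio source, wake-word engine and voice-activity detector:</p><pre>from typing import Callable, Iterator<br><br>def microphone_frames() -&gt; Iterator[bytes]: ...   # hypothetical mic source (PCM frames)<br>def detect_wake_word(frame: bytes) -&gt; bool: ...   # hypothetical engine: True on “Hey Telegram”<br>def is_silence(frame: bytes) -&gt; bool: ...         # hypothetical voice-activity detector<br><br>def wake_word_loop(forward_to_bot: Callable[[bytes], None]) -&gt; None:<br>    &quot;&quot;&quot;Stay idle until the wake word, record until silence, forward the audio.&quot;&quot;&quot;<br>    recording: list[bytes] = []<br>    listening = False<br>    for frame in microphone_frames():<br>        if not listening:<br>            listening = detect_wake_word(frame)   # idle until invoked<br>        elif is_silence(frame) and recording:<br>            forward_to_bot(b&quot;&quot;.join(recording))   # e.g. send as a Telegram voice message<br>            recording, listening = [], False      # back to idle<br>        else:<br>            recording.append(frame)</pre>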
<h4>Change request #3: 🖧 Decentralized Architecture Support for Voice-Based Bots</h4><p><em>Decentralized</em>, <em>headless</em> or <em>no-brokerage</em> are possible keywords that could differentiate Telegram bots from the centralized approach (1st-party <em>master-bot</em> + 3rd-party <em>skill-bots</em>) that the big players have so far imposed on us.<br>In these well-known scenarios, the master company supplies a proprietary first-party master-bot (e.g. Google Assistant or Alexa) that runs the whole game. In fact, in this model your bot (the <em>action</em> in Google Assistant parlance, the <em>skill</em> in Amazon Alexa parlance, or the <em>capsule</em> in Samsung Bixby jargon) must follow a rigid contractual and technological framework, and all data “passes through” the central master-bot cloud servers. That’s bad, or at least controversial, in terms of data privacy, vendor lock-in, etc.</p><p>So forget the first-party approach of a (big-player-owned) cloud system that monopolizes/centralizes the traffic of all external (third-party) skill-bots. Instead, imagine Telegram as just a light middleware that supplies two things:</p><ul><li>The common client (the TG app), updated with the change requests proposed above to integrate enhanced voice support</li><li>Some server-side services on the Telegram cloud that finally redirect to your (private person or enterprise) bot.</li></ul><p>In other words, as a final user:</p><blockquote>Imagine you could access, in your smartphone’s Telegram app, a short list of <em>voicebot services</em> provided by independent suppliers that you (and only you) selected!</blockquote><p>In this scenario Telegram would not be the unique (voice) assistant, but just a vehicle enabling you to “talk” with a set of independent assistants of your choice. It’s a completely different business model (with respect to the big players’), requiring just minor architectural and software updates on the current Telegram client (and some optional enablers on the Telegram servers).</p><h4>Is this bot-independent model sustainable for the Telegram company?</h4><p>That’s a critical open point. I guess Telegram could set up the overall architecture by:</p><ul><li>Giving away for free some basic features (for example the audio auto-play and the wake-word detection on the client, etc.)</li><li>Supplying as paid services some server-side enablers, such as multilingual speech-to-text and text-to-speech platforms and, why not, some smart “conversational AI” services to simplify/automate chatbot development (a speech-to-text engine, a text-to-speech engine, an intent/entity classifier, a dialog manager, a semantic search engine, etc.).</li></ul><blockquote>Maybe <a href="https://telegram.org/blog/700-million-and-premium">Telegram Premium</a>, extended with a developer program (“Telegram Premium for Enterprises”), could be a possible way to cover all the costs of the upgrade and bring a possible economic gain?</blockquote><p>I imagine, for example, that a <em>TG Premium for Enterprises</em> contract could be set up so that a chatbot developer company pays for premium services like enhanced network traffic, the use of the above-mentioned server-side enablers, etc.</p><h4>Please comment!</h4><p>What do you, developers or final users, think about all this?<br>Please leave your feedback in the comments!</p><p>Giorgio</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=14f1ac7ec113" width="1" height="1" alt=""><hr><p><a href="https://convcomp.it/could-telegram-be-a-competitor-of-voice-assistants-like-amazon-alexa-or-google-assistant-14f1ac7ec113">Could Telegram be a competitor of voice assistants, like Amazon Alexa or Google Assistant?</a> was originally published in <a href="https://convcomp.it">ConvComp.it</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Voice-cobots in industry. A case study]]></title>
            <link>https://convcomp.it/voice-cobots-in-industry-a-case-study-352294bd0d5a?source=rss----e9c948ff6ebd---4</link>
            <guid isPermaLink="false">https://medium.com/p/352294bd0d5a</guid>
            <category><![CDATA[industry-5-0]]></category>
            <category><![CDATA[shipping-industry]]></category>
            <category><![CDATA[voice-interfaces]]></category>
            <category><![CDATA[virtual-assistant]]></category>
            <category><![CDATA[cobot]]></category>
            <dc:creator><![CDATA[Giorgio Robino]]></dc:creator>
            <pubDate>Mon, 18 Jan 2021 11:00:57 GMT</pubDate>
            <atom:updated>2021-03-29T06:59:00.055Z</atom:updated>
            <content:encoded><![CDATA[<h4>A voice assistant application in the shipping container industry</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*P-fWyaX5C-4TjuCK9rHMOw.jpeg" /><figcaption>A reachstacker vehicle moving a shipping container, in an intermodal terminal (<a href="https://port.today/contargo-successfully-tests-liebherr-reachstacker-lrs-545-%E2%80%A8/">source</a>).</figcaption></figure><blockquote>Update: I participated in the <a href="http://www.lingofest2021.com">www.lingofest2021.com</a> event on March 26th, 2021, presenting my talk <strong>Enterprise Voice Cobots</strong>, where this case study is explored in depth. Slide and video links are available at the end of the article.</blockquote><p>Voice assistants for consumers at home are nowadays taken for granted, but there is also a huge space for voice virtual assistant applications in enterprise verticals.</p><p>I want to introduce my current R&amp;D project, an innovative voice assistant application for logistics shipping-container operations.</p><p>In a sentence, <strong>the <em>voice-cobot</em> I conceived helps forklift vehicle drivers load and unload shipping containers between yard spots and container trailers.</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/595/1*DvUnKEA_n50QQm3LvpmCYg.jpeg" /><figcaption>A reach stacker (a kind of forklift vehicle) loads a container on a trailer (<a href="https://www.driverknowledgetests.com/resources/whats-a-side-loader-trailer/">source</a>)</figcaption></figure><p>Let me first share the general concept of a <em>voice-cobot</em> as a possible application of conversational <em>assistive-reality computing</em>. Afterward I will show the solution I found for this specific industrial scenario.</p><h4>What’s a cobot?</h4><p>On Wikipedia you find this <a href="https://en.wikipedia.org/wiki/Cobot">definition</a>:</p><p><strong><em>“Cobots</em></strong><em>, or </em><strong><em>collaborative robots</em></strong><em>, are robots intended for direct human robot interaction within a shared space, or where humans and robots are in close proximity.”</em></p><p>The above definition refers to industrial robots: hardware machines — capable of carrying out a complex series of actions automatically — guided by a human being, using an external control device or an embedded controller. Robots may be built along the lines of the human form, but most robots are machines designed to perform a task with no regard to their aesthetics.</p><p>So in general we say cobots meaning industrial robots (commonly articulated electro-mechanical arms) that help automate unergonomic tasks, such as helping people move heavy parts, machine feeding, or assembly operations. But that’s not what I want to talk to you about!</p><h4>So what’s a Voice-Cobot?</h4><blockquote>By <strong>voice-cobot</strong> I mean a voice-interfaced digital assistant that, through a real-time spoken conversation, helps a human operator accomplish a specific working task.</blockquote><p>Now forget the usual virtual assistant scenario that we, as end users, experience with Amazon Alexa or Google Assistant at home, through smartspeaker or smartdisplay devices. 
Instead, I want to talk here about the special case of <strong>private virtual assistants for industrial enterprises</strong>.</p><p>The disruption of a voice assistant in enterprise spaces lies in the concept of an assistive-reality computing automation made by a private, company-owned virtual assistant that collaborates with human operators (employees, skilled workers, professional technicians) to accomplish working tasks, literally having <strong>real-time conversational interactions</strong> (by voice, text, or other UIs).</p><blockquote>Voice is an essential requirement, but the voice interface itself is not the game-changer; the collaborative enterprise-assistant computing is!</blockquote><p>There is a common misconception that the novelty is just the fact that you “talk to the machine” using speech (instead of chatting or using a graphical user interface), but that misses the most important point.</p><p>A <em>voicefirst</em> interface, and maybe a <em>voiceonly</em> interface, is — in many cases — the best way to interact with a computer in situations where the human is working “hands-on”, as in the case of a vehicle driver, a machine operator, a doctor, etc.<br><br>Nevertheless, the best human–machine interface must be evaluated case by case, and it can be built with different strategies: the input could be voice, text, a camera scene, or any IoT sensor, while the output could be voice, earcons, text, graphics, light-signal devices, etc. All of these could also work in parallel, in a multimodal strategy.</p><p>The real innovation is not just the voice interface, but the virtual assistant’s collaborative logic, meaning that</p><blockquote><strong>enterprise processes are controlled by a single conversational-AI assisted-reality computing that interacts with human operators</strong>,</blockquote><p>collaborating with them to accomplish workflows, and leading to time and cost savings and possibly to a better (and more fun) user experience.</p><h4>The case study: empty shipping container handling</h4><p>As a software engineer, consultant and researcher, I was initially asked by <a href="http://www.diten.unige.it/">DITEN</a> — Università di Genova, Dipartimento di Ingegneria Navale, Elettrica, Elettronica e delle Telecomunicazioni — to solve an apparently standard computer-vision text detection/recognition problem.</p><p>The task was to automatically recognize the shipping container <a href="https://en.wikipedia.org/wiki/ISO_6346">marking codes</a> during the container loading/unloading tasks performed by an operator driving an empty-container handler machine.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*sWekriogWRqViU6D3ZVRQQ.jpeg" /><figcaption>A shipping container identified by the ISO 6346 marking MTBU 213401 9 22G1 (<a href="https://www.nzbox.kiwi.nz/wp-content/uploads/2020/11/20ft-container-front-doors.jpg">source</a>).</figcaption></figure><p>After some fun (I’m being ironic) with the text-recognition-from-images algorithms I implemented, I soon realized how difficult it is to detect text in real-life motion scenes. It’s hard to achieve a detection system with accuracy near 100%. And what do you do in the no-detection/erroneous cases?</p><p>That’s why I got a trivial idea:</p><p><em>What if the operator dictated the code to a voice assistant? Just talking!</em></p><p>My friend and account manager at DITEN replied:</p><p><em>Let’s look deeper, Giorgio; maybe it’s not as crazy as it seems</em>.</p><p>And so the Forklift-cobot was born! 
<br>The concept was to supply the forklift vehicle operator — specifically of <a href="https://www.google.com/search?q=empty%20container%20handlers">empty container handlers</a> and <a href="https://en.wikipedia.org/wiki/Reach_stacker">reach stackers</a> — with a simple voice assistant <em>command-and-control</em> software, running on a common tablet/mobile device fixed in the vehicle cabin.</p><p>The foreseen operator user experience is really simple: the operator gives voice commands (<em>pull mode</em>) to insert data, such as the task name, the handled container code, the container-truck plate, the yard spot name, etc.</p><p>The voice assistant checks the dictated/spelled data, searches the company backend database, and stores transactions on task completion, avoiding any stop of the operator just to insert data in a web GUI app (as currently happens).</p>
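<p>As a concrete example of the kind of check the assistant can run on a dictated code, here is a minimal sketch of the standard ISO 6346 check-digit validation: letters map to values skipping multiples of 11, each of the first ten characters is weighted by a power of two, and the sum modulo 11, modulo 10, must match the final digit. When the transcript fails this check, the assistant can immediately ask the operator to repeat or spell the code.</p><pre># ISO 6346 letter values: A=10, B=12, ... skipping multiples of 11 (11, 22, 33)<br>LETTER_VALUES = {}<br>_v = 10<br>for _ch in &quot;ABCDEFGHIJKLMNOPQRSTUVWXYZ&quot;:<br>    if _v % 11 == 0:<br>        _v += 1<br>    LETTER_VALUES[_ch] = _v<br>    _v += 1<br><br>def iso6346_check_digit(code: str) -&gt; int:<br>    &quot;&quot;&quot;Check digit computed over the first 10 characters, e.g. &#39;MTBU213401&#39;.&quot;&quot;&quot;<br>    code = code.upper().replace(&quot; &quot;, &quot;&quot;)[:10]<br>    total = sum(<br>        (LETTER_VALUES[c] if c.isalpha() else int(c)) * (2 ** i)<br>        for i, c in enumerate(code)<br>    )<br>    return (total % 11) % 10<br><br>def iso6346_is_valid(code: str) -&gt; bool:<br>    &quot;&quot;&quot;Validate a full 11-character container code.&quot;&quot;&quot;<br>    code = code.upper().replace(&quot; &quot;, &quot;&quot;)<br>    return len(code) == 11 and iso6346_check_digit(code) == int(code[-1])<br><br>assert iso6346_is_valid(&quot;MTBU 213401 9&quot;)  # the container in the photo above</pre>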
<p>I made two short video demonstrations of the proof-of-concept desktop prototype I implemented, where I show the functionality and the voice-interaction activation on a mobile device.</p><p>I’m Italian and the videos are in Italian, but you can enable YouTube subtitles in English.</p><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2Fpn6ZlG5IyRQ%3Ffeature%3Doembed&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3Dpn6ZlG5IyRQ&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2Fpn6ZlG5IyRQ%2Fhqdefault.jpg&amp;key=a19fcc184b9711e1b4764040d3dc5c07&amp;type=text%2Fhtml&amp;schema=youtube" width="854" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/8e7e496e0558a3d9e13dc2eca12ec199/href">https://medium.com/media/8e7e496e0558a3d9e13dc2eca12ec199/href</a></iframe><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2FNTSZ1Zp5TkI%3Ffeature%3Doembed&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DNTSZ1Zp5TkI&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2FNTSZ1Zp5TkI%2Fhqdefault.jpg&amp;key=a19fcc184b9711e1b4764040d3dc5c07&amp;type=text%2Fhtml&amp;schema=youtube" width="854" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/52eb36f6655dcfcf5b2e5587769b0ba2/href">https://medium.com/media/52eb36f6655dcfcf5b2e5587769b0ba2/href</a></iframe><h4>Research &amp; Development open points</h4><p>The implemented prototype is now in the on-field test phase, where the proposed system is being evaluated by expert senior machine operators.</p><p>Besides, there are many technical open points, mainly related to the voice-recognition/audio subsystems and the in-cabin user-interface ergonomics. Let me introduce some.</p><p><strong>Speech recognition issues</strong><br>Noise is a common problem in industrial environments, and an urgent related topic is the availability of an <strong>on-premise, noise-proof ASR</strong> that avoids any cloud-based service. For security reasons, a key requirement is <strong>data privacy</strong>: all business-process data transactions must not leave the enterprise intranet.</p><blockquote>You need a local and private (on-premise), multilingual, noise-robust speech recognition engine (ASR).</blockquote><p><strong>Voice activation UI, human–machine HW interfaces</strong><br>A non-trivial aspect is finding the suitable hardware interfaces to the cobot, case by case.</p><p>The current speech-activation solution uses multiple concurrent (parallel) <em>push-to-talk</em> options: touchscreen, foot-switch activation, or a physical push-button on the vehicle dashboard (see the figure below).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*xT17bxwPQhZMxuJfQx3mMg.png" /><figcaption>Inside the cabin of a heavy forklift, dashboard details (<a href="https://www.youtube.com/watch?v=mLo2eVywhkY">source</a>)</figcaption></figure><p>The mic/headphones subsystem also has challenges, solvable with several audio options: headset, open-air mic/loudspeakers, etc.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/965/1*OeH-MI76RtqAa8C0yMTDIw.jpeg" /><figcaption>An operator wearing an industrial-avionic headset (<a href="https://www.soneticscorp.com/wp-content/uploads/2015/12/penner-nash_3.jpg">source</a>)</figcaption></figure><p><strong>The webapp-based architecture</strong><br>As shown in the videos, the prototype has been implemented with a client–server web architecture. Implementing the client as an application running in a standard web browser on top of any mobile device has many pros.</p><p>All the audio message exchange is realized with the Web Audio API, and nowadays a web app (on a mobile device) can also access internal devices (even the video camera stream), the GPS geolocation coordinates and accelerometers (helping to localize the vehicle movement), even USB-interfaced external devices (e.g. a long-range RFID reader) and, last but not least, some Bluetooth-interfaced peripherals (audio I/O, custom buttons), etc.</p><p><strong>To wearable or not to wearable?</strong><br>Another advantage of having the client run on a (web app) mobile device is that the cobot “terminal” runs on a very cheap portable handset. You can use pretty much any mobile phone or tablet, mounted inside the cabin and/or used outside the cabin by an operator walking with the handset, for different tasks, such as checking container positions in the yard area, etc.</p><p>That said, the mobile-device web client is just one option among many. A “fixed” client could run on a more powerful edge computer, or even a microcontroller.</p><p>Another alternative to mobile tablets/phones is a smartglasses wearable. Apparently that’s the definitive solution, but it comes with a lot of issues: high costs, lack of standard API interfaces, and poor ergonomics for an operator inside a cabin.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/823/1*InWLD6r0l4L-aT1Srgrd9g.png" /><figcaption>An operator using a famous voice-controlled smart-glasses headset (<a href="https://idnet-us.com/uploads/media/2019/10/20191025041813.pdf">source</a>)</figcaption></figure><p><strong>Conversational design</strong><br>On the (software) user-interface side there are interesting <em>conversational design</em> research topics. The current UI implementation uses a tablet/mobile device. 
The client-side software runs in a web browser and exploits some <em>multimodal paradigms</em> using voice, text and, last but not least, the usual graphical capabilities of a browser.</p><p>For example, in the prototype the screen background color is used to make the status of the conversation turns between the user and the machine visually explicit. Short synthetic-voice (TTS) answers and prompts are accompanied by longer explanations written on the display, plus suggestions about the next steps the user may have to take, etc.</p><p>The <em>command-and-control</em> approach is less simple than it appears, especially if the conversation between the user and the machine has to be short but at the same time pleasant and even engaging.</p><p>A further challenge arises if you want to integrate the essential task-completion features (the minimal requirements) with a domain-specific or open-domain question-answering system, or if you want to manage user-defined alerts and reminders, an inter-operator intercom subsystem, etc.</p><p><strong>Conversational AI backend computing</strong><br>Here’s where it gets tricky. Consider the voice-cobot not just as yet another vertical application solving a specific enterprise task.</p><p>Instead, the cobot becomes the “enterprise’s computer”, able to talk with each user through a possibly user-defined <em>botpersona</em>, adapted to that user’s needs and preferences, and at the same time serving users across many requirements, as a single “company voice”.</p><p>The company cobot appears to users as different (user-defined) bot-personas while being a single business-logic intelligence.</p><p>Last but not least, the kernel component is the dialog manager / conversational AI intelligence.</p><p>For the prototype, because of the pretty “simple” scenario, I used my own dialog engine <a href="https://github.com/solyarisoftware/naifjs">NaifJs</a> — which I open-sourced in 2020 — where dialog tracking is based on a state-machine approach; the sketch below illustrates the general idea.</p>
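<p>As an illustration of state-machine dialog tracking (a generic sketch, not NaifJs’s actual API), here is how the container task’s slot filling can be modeled as states and transitions, reusing the iso6346_is_valid check from the sketch above:</p><pre># Illustrative state-machine dialog tracker (not the real NaifJs API)<br>PROMPTS = {<br>    &quot;ASK_TASK&quot;: &quot;Which task? (load/unload)&quot;,<br>    &quot;ASK_CONTAINER&quot;: &quot;Dictate the container code.&quot;,<br>    &quot;ASK_SPOT&quot;: &quot;Which yard spot?&quot;,<br>    &quot;CONFIRM&quot;: &quot;Confirm and store the transaction? (yes/no)&quot;,<br>}<br><br>def next_state(state: str, slots: dict, utterance: str) -&gt; str:<br>    &quot;&quot;&quot;Fill the slot for the current state, then advance (or re-prompt).&quot;&quot;&quot;<br>    if state == &quot;ASK_TASK&quot;:<br>        slots[&quot;task&quot;] = utterance<br>        return &quot;ASK_CONTAINER&quot;<br>    if state == &quot;ASK_CONTAINER&quot;:<br>        if not iso6346_is_valid(utterance):  # the check-digit validation above<br>            return &quot;ASK_CONTAINER&quot;          # invalid code: re-prompt the operator<br>        slots[&quot;container&quot;] = utterance<br>        return &quot;ASK_SPOT&quot;<br>    if state == &quot;ASK_SPOT&quot;:<br>        slots[&quot;spot&quot;] = utterance<br>        return &quot;CONFIRM&quot;<br>    if state == &quot;CONFIRM&quot;:<br>        return &quot;DONE&quot; if utterance.lower().startswith(&quot;y&quot;) else &quot;ASK_TASK&quot;<br>    return state</pre>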
<p>For more complex scenarios — with many concurrent tasks and a <em>push mode</em>/mixed-initiative style where the assistant starts the conversation, maybe to assign tasks, etc. — the current implementation could be enhanced with a high-level rule-based language. This is an open research topic.</p><p>Another applied research/engineering topic is how to “standardize” the integration of the company knowledge base/database.</p><p>The problem could be solved trivially with APIs and database queries, but the real challenge is to define a standard, common-ground, private company knowledge base, to be queried and used for inferences by the conversational AI.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/554/1*R8ND1o1QYR97bQtqfbZUaQ.jpeg" /><figcaption>An empty container handler operator in action (<a href="https://www.pinterest.com.au/pin/359443613986224962/">source</a>)</figcaption></figure><h4>Temporary conclusion</h4><p>This case study describes a very specific shipping-industry scenario where a voice-cobot helps the working tasks of a container-handler vehicle operator, reducing time and costs.</p><p>The described assistive-reality system is applied to one specific industrial operation, but it could be applied to many other factory tasks and different operator roles. For example, in the shipping-container depot/repair industry, the voice cobot could assist many other kinds of human operator activities: truck-gate automation, container inspection reporting, safety/emergency alerting, a truck drivers’ “help desk”, operator tutoring, etc.</p><p>It’s not all just about another voice-interface application in industry. It’s about rethinking all business processes, so that an assistive enterprise computing can collaborate with many (or all) humans.<br>Isn’t this just Industry 5.0?</p><blockquote>On March 26th, 2021 I participated in the <a href="http://www.lingofest2021.com">www.lingofest2021.com</a> event, presenting my talk <strong>Enterprise Voice Cobots</strong>, where this case study is explored in depth.</blockquote><p>Slides @ #lingofest2021:<br><a href="https://docs.google.com/presentation/d/1ieZnAdREzEGXkcO4C_XPIbS9YAnE76mB0wpP2k-yOlQ/edit#slide=id.g4412d4946c_0_0">https://docs.google.com/presentation/d/1ieZnAdREzEGXkcO4C_XPIbS9YAnE76mB0wpP2k-yOlQ/edit#slide=id.g4412d4946c_0_0</a></p><p>Video @ #lingofest2021:</p><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2FHm85vr6N1CA%3Ffeature%3Doembed&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DHm85vr6N1CA&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2FHm85vr6N1CA%2Fhqdefault.jpg&amp;key=a19fcc184b9711e1b4764040d3dc5c07&amp;type=text%2Fhtml&amp;schema=youtube" width="854" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/656c308a7a2b3be19218dee388d26f09/href">https://medium.com/media/656c308a7a2b3be19218dee388d26f09/href</a></iframe><h4>Contact</h4><p>If you are an enterprise company — maybe in shipping, supply-chain or smart-factory automation, or in any vertical where you think a voice-bot could solve a real workflow — or an R&amp;D ICT company, or an academic organization interested in deepening this applied research context, I’m available to collaborate, as a researcher and as a consultant.</p><p>You can contact me on <a href="http://www.linkedin.com/in/giorgiorobino">linkedin</a> or just send me an email at <a href="mailto:giorgio.robino@gmail.com">giorgio.robino@gmail.com</a>.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=352294bd0d5a" width="1" height="1" alt=""><hr><p><a href="https://convcomp.it/voice-cobots-in-industry-a-case-study-352294bd0d5a">Voice-cobots in industry. A case study</a> was originally published in <a href="https://convcomp.it">ConvComp.it</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Whither will Almond, the Stanford University open virtual assistant, go?]]></title>
            <link>https://convcomp.it/whither-almond-the-stanford-university-open-virtual-assistant-will-go-b4d66167e76c?source=rss----e9c948ff6ebd---4</link>
            <guid isPermaLink="false">https://medium.com/p/b4d66167e76c</guid>
            <category><![CDATA[conversational-ai]]></category>
            <category><![CDATA[privacy-protection]]></category>
            <category><![CDATA[virtual-assistant]]></category>
            <category><![CDATA[privacy]]></category>
            <category><![CDATA[stanford]]></category>
            <dc:creator><![CDATA[Giorgio Robino]]></dc:creator>
            <pubDate>Tue, 08 Sep 2020 17:54:41 GMT</pubDate>
            <atom:updated>2020-09-22T09:29:37.032Z</atom:updated>
            <content:encoded><![CDATA[<h4><strong>Interview with Giovanni Campagna, one of the principal Almond developers</strong></h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*8d2XDrK33LZP8Hbf" /><figcaption>From the left: Jackie Yang, Michael Fischer, Giovanni Campagna, Silei Xu, Mehrad Moradshahi; in the foreground, Prof. Monica Lam. Photo by Brian Flaherty (<a href="https://www.instagram.com/brianflaherty/">https://www.instagram.com/brianflaherty/</a>)</figcaption></figure><p>I learned of <a href="https://almond.stanford.edu/">Almond</a>, the Stanford University open virtual assistant, for the first time one year ago, reading an <a href="https://voicebot.ai/2019/06/20/stanford-scientists-are-developing-an-open-virtual-assistant/">article</a> on voicebot.ai.<br>I was immediately enthusiastic about the core concepts on which the project is based: the user’s data privacy, the need for a web based on linguistic user interfaces, the distributed-computer architecture, the natural language programming approach, and many other topics related to a possible next generation of the web, populated by a federation of humans and their personal assistants.</p><p>Soon I discovered that one of the principal developers of the Almond software is <a href="https://web.stanford.edu/~gcampagn/">Giovanni Campagna</a>, a PhD student in the Computer Science Department at Stanford University and a member of the Stanford Open Virtual Assistant Lab (<a href="https://oval.cs.stanford.edu/">OVAL</a>), who works with Prof. Monica Lam. Hence the idea of my interview with him, about the past and the future of Almond and of personal virtual assistants.</p><h4><strong>Introduction to Almond</strong></h4><p><strong>Giorgio:</strong> Ciao Giovanni! To sketch out what Almond is and what it will become, could you briefly summarize the project’s history? The oldest public presentation I remember was by Monica Lam in <a href="https://www.youtube.com/watch?v=WoLprtB9JnI">fall 2018</a>. Could you give us the basic concepts of the project and explain the role of Almond within the OVAL laboratory’s research?</p><p><strong>Giovanni:</strong> The Almond project started in the spring of 2015 as a class project to explore a new, distributed approach to the popular <a href="https://ifttt.com/">IFTTT</a> service. At the time, it was called Sabrina. Soon after, we realized the need for both a formal language to specify the capabilities of the assistant, and a natural language parser to go along with it, so end users could access those capabilities. The first publication for Almond, and the first reference under its current name, came in April 2017 at the WWW conference, where we described the architecture: the Almond assistant, the Thingpedia repository of knowledge, and the ThingTalk programming language connecting everything together.</p><p>Since then, we have been working on natural language understanding, focusing in particular on the problem of cost. State-of-the-art NLU, as used by Alexa and Google Assistant, requires a lot of annotated data, and is very expensive. We made two recent advancements: first, in <a href="https://dl.acm.org/doi/10.1145/3314221.3314594">PLDI 2019</a> we showed that using synthesized data can greatly reduce the cost of building the assistant for event-driven, IFTTT-style commands. Later, in CIKM this year, my colleague showed how to build a Q&amp;A agent that can understand complex questions over common domains (restaurants, hotels, people, music, movies, and books) with high accuracy at low cost.</p><p>In <a href="https://www.aclweb.org/anthology/2020.acl-main.12/">ACL</a> this year, I presented how the same synthesized-data approach can be used to build a multi-turn conversational interface, achieving state-of-the-art zero-shot (no human-annotated training data) accuracy on the challenging MultiWOZ benchmark.</p><p>Most recently, we have received a grant from the Alfred P. Sloan Foundation to build a truly usable virtual assistant (not just a research prototype), and we hope to release it in 2021.</p>
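<p>(A side note of mine, to give a rough intuition of what “synthesized data” means here: below is a toy sketch of template-based synthesis, where each template pairs a natural-language skeleton with a code skeleton and the slot values expand into (utterance, code) training pairs. The DSL-like syntax is illustrative only, not real ThingTalk.)</p><pre>from itertools import product<br><br># Each template pairs an utterance skeleton with a code skeleton<br># (illustrative DSL syntax, not real ThingTalk)<br>TEMPLATES = [<br>    (&quot;alert me when the price of {asset} is below ${threshold}&quot;,<br>     &quot;monitor @price(asset={asset}) filter value &lt; {threshold} =&gt; notify;&quot;),<br>    (&quot;notify me if {asset} goes under ${threshold}&quot;,<br>     &quot;monitor @price(asset={asset}) filter value &lt; {threshold} =&gt; notify;&quot;),<br>]<br>ASSETS = [&quot;bitcoin&quot;, &quot;ethereum&quot;, &quot;dogecoin&quot;]<br>THRESHOLDS = [1000, 3600, 50000]<br><br>def synthesize() -&gt; list[tuple[str, str]]:<br>    &quot;&quot;&quot;Expand every template against every slot combination.&quot;&quot;&quot;<br>    return [<br>        (nl.format(asset=a, threshold=t), code.format(asset=a, threshold=t))<br>        for (nl, code), a, t in product(TEMPLATES, ASSETS, THRESHOLDS)<br>    ]<br><br>pairs = synthesize()  # 2 templates x 3 assets x 3 thresholds = 18 training pairs</pre>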
<h4><strong>About Giovanni Campagna as researcher and developer</strong></h4><p><strong>Giorgio:</strong> I’m curious about your personal history, Giovanni. I know you are Italian, and presumably you moved to Stanford as a PhD student. What made you start developing Almond? Did you initially work on Almond as part of your PhD thesis? Are you the leader of the software project’s development? And how is the team composed? Besides the Almond project, what is your current role in the OVAL team, and what are your long-term research interests?</p><p><strong>Giovanni:</strong> I moved to the US and to Stanford as a Master’s student, and I started Almond in class. I was interested in programming languages and pursuing the software-theory concentration, which includes PL, compilers, and formal methods. I met Prof. Lam in her Advanced Compilers class. At the time, she was mainly focusing on messaging and social networks, with the goal of disrupting the Facebook monopoly (this was part of the Programmable Open Mobile Internet initiative, aka the Mobisocial Lab). But even then, she had the vision of what would come next, and while Alexa was not popular yet, it was clear that virtual assistants would be the next potential monopoly.</p><p>Personally, I started developing Almond as research on the ThingTalk programming language: designing a programming language that can enrich the power of the assistant beyond simple commands, and give the power of programming to everyone. In parallel with ThingTalk, I worked on converting natural language to code, because natural language is the medium of choice to make programming accessible. Our research found that neural networks are extremely effective for natural language programming tasks, as long as training data is available. ThingTalk is still the core of my PhD thesis, but over time we moved our focus to reducing the cost of training-data acquisition.</p><p>I built the original version of the Almond assistant, and a large chunk of the current code, so in a way I am still its maintainer, but Almond would not have been possible without the help of my colleagues. These include Silei, Michael, Jackie, Mehrad, and Sina (all current PhD students). Silei was the first to join after me and he’s the second-most active dev on Almond. His research is mainly on Q&amp;A over structured data. Michael recently defended his thesis; he was working on multi-modality: bridging GUIs and natural language. Jackie also worked on multi-modality, with a paper in UIST on mobile-app interactions using natural language. Mehrad is working on multilinguality, leveraging machine-translation technology, as well as named-entity recognition. Sina is working on Q&amp;A over free text, paraphrasing, and error correction. 
Additionally, we have a number of MS and undergraduate students who have helped on various projects.</p><h4><strong>What is your view on data privacy, and how is it related to personal assistants?</strong></h4><p><strong>Giorgio:</strong> Data privacy is probably a founding concept of the Almond project. Could you explain why privacy is so important for all of us, citizens and companies? How is all this related to democracy and people’s freedom on the web?</p><p>One problem I see in the current big players’ personal assistants is the fact that people’s data (voice conversations, also containing background ambient audio, e.g. at home) are processed by proprietary cloud platforms. In the best scenario, all this data is used to improve “AI-blackbox” cloud-based proprietary services, feeding machine-learning algorithms. In the worst case, conspiracy theorists suppose a malicious use by such companies, which would harvest personal end-user data to populate people’s knowledge bases for further commercial usage. Do you consider this last scenario an actual concern?<br>To protect privacy, in opposition to cloud-based “walled gardens”, Almond provides a technical architecture based on virtual assistants that can run on users’ devices. Do you think the issue can be fully solved at the architectural level, or do we need government regulations in any case?</p><p><strong>Giovanni:</strong> This is not an easy question, and let me preface it by saying this is my personal opinion, not the project’s. I think privacy is inextricable from freedom: I am not truly free if everything I do is tracked, logged, and stored forever by a company or government. I am not truly free if I can be judged in the future for anything I’ve done at any point in my life. One closes the blinds to be free to do whatever in the privacy of one’s home. And because so much of our lives is now conducted over the Internet, it’s clear that Internet privacy overlaps significantly with real-life privacy.</p><p>Now, as you point out, virtual assistants and conversational AIs in general pose unique challenges to the privacy problem. First, state-of-the-art natural language understanding requires a lot of user data for training, which means the virtual assistant providers are continuously collecting all the conversations performed on the assistant, and have contractors continuously listening to and annotating the data. Having somebody listen to my conversations — that’s not very private. Reducing the need for annotated real data has been a strong focus of our research, and we believe we’re finally getting there.</p><p>Second, and most importantly, virtual assistants inherently have access to all our data, through our accounts: banking, health, IoT, etc. We want the virtual assistant to have access to our accounts because we want help, and we want the convenience of natural language. But what guarantee do we have that a proprietary service won’t suck up all the data and use it for marketing purposes? Why wouldn’t a proprietary assistant provider look at our banking information to promote a credit card or mortgage product? Why wouldn’t a proprietary assistant look at the configured IoT devices to promote similar or compatible products? 
And with Amazon and Google dominating the online retail and ad markets respectively, it would be surprising if they did not eventually start doing that.</p><h4><strong>Tell me about your general vision of open-source software</strong></h4><p><strong>Giorgio:</strong> I see that many Almond software components are written in <a href="https://nodejs.org/en/">Node.js</a>, and I know you have been a member of the Linux/<a href="https://www.gnome.org/">GNOME</a> community. Could you share your point of view on the importance of open-source software in general, such as the Linux operating system (I know you are a Linux desktop user, like me)? How are open source and open data related to data privacy?</p><p><strong>Giovanni:</strong> I have been an advocate of free software for a long time, and I am a strong believer in the four fundamental software freedoms as purely ethical principles. The freedom to study software is what allowed me to learn how to build software before I started college, and the freedom to modify and distribute software allows people to collaborate and build something bigger than any single individual could build.</p><p>I also believe that, unlike proprietary software, free software cannot abuse the trust of users. It is trivial to detect a free-software app collecting more data than it claims, or doing anything shady with the data. It is trivial to fork the app and remove any privacy-invasive functionality. Hence, free-software communities are very careful to earn the trust of users and protect their privacy. You can see it, for example, with Firefox: while Firefox collects data for telemetry, they’re very careful to allow people to disable the telemetry, and they do not collect more than they claim.</p><p><strong>Giorgio:</strong> Why did you decide to develop Almond in <a href="https://nodejs.org/en/">Node.js</a>? As a Node.js developer myself, I’m specifically curious about the engineering reasons that drove you to the JavaScript environment. Is it an opportunistic matter, maybe because Node.js makes multi-platform development easy? Or are there other software-engineering reasons?</p><p><strong>Giovanni:</strong> I chose to build Almond on Node.js because it is the most portable platform. At the beginning, we had the idea that the full Almond assistant could run on the web, on phones (Android and iOS), on the desktop, and on embedded devices. Node.js is the only platform that supports all of that. Over time, we found that running the assistant on Android or iOS is quite challenging, and we moved to an architecture where the user keeps the assistant running on a home server or a smart speaker. Yet I still find Node.js more programmer-friendly and just nicer to work with than the obvious alternative, Python. I should also note that using a type-safe compiled language would have been challenging, given the ever-changing nature of a research prototype.</p><h4><strong>What do you mean by <em>Linguistic Web</em>?</strong></h4><p><strong>Giorgio:</strong> Regarding Almond’s core concepts, I was impressed by the statement “<em>We are witnessing the start of proprietary linguistic webs</em>” that I found in these Almond presentation <a href="https://almond-static.stanford.edu/papers/slides-keynote-mobicom18.pdf">slides</a>. Could you clarify what you mean by Linguistic Web? Are proprietary linguistic webs an implicit reference to the Google Assistant and Amazon Alexa voice-based / smartspeaker-based virtual assistants? 
If so, what are your concerns about the “walled garden” proprietary virtual assistant ecosystems, and how could Almond be an alternative for private citizens and/or companies? What are the strengths of open, non-proprietary platforms in improving (linguistic) democracy and freedom on the web?</p><p><strong>Giovanni:</strong> That is exactly right: what we’re referring to is the third-party skill platforms being walled gardens controlled top-down by the assistant providers. Any company that wishes to have a voice interface must submit to Alexa and Google Assistant. As proprietary systems, these can shut down competing services, or impose untenable fees.</p><p>We believe instead that every company should be able to build their own natural language interface, without depending on Amazon or Google. These natural language interfaces should be accessible to any assistant. One example of work in this direction is <a href="https://arxiv.org/abs/2001.05609">Schema2QA</a> (to appear in CIKM 2020), a tool to build Q&amp;A agents using the standard Schema.org markup. Any website can include the appropriate markup to build a custom Q&amp;A skill for itself, and furthermore the data is available to be aggregated across websites by the assistant.</p><h4><strong>Could you explain what <em>Natural Language Programming</em> is?</strong></h4><p><strong>Giorgio:</strong> Could you define what LUI (Linguistic User Interface) and <em>Natural Language Programming</em> mean to you? With the popularity of smart speakers and current voice-first interfaces, we are moving from the GUI (Graphical User Interface) to the CUI/VUI (Conversational / Voice User Interface). One of Almond’s disruptive milestones is the idea that end users should program their own personal virtual assistants themselves, just by speaking to a computer in natural language. That is currently an unachieved goal in human-machine interfaces.</p><p>What do you think will be, say in the next ten years, the way users program their private virtual assistants? More generally, do you foresee ordinary people interacting with computers (and professional software developers developing applications) through some sort of natural language programming?</p><p><strong>Giovanni:</strong> To me, there is no distinction between LUI, CUI and VUI. Of course, language is conversational, and we expect the assistant to sustain multi-turn conversations, with follow-ups and error correction. This is, by the way, something that Alexa and Google do very well for their first-party skills, but don’t really offer to their third-party skills, which get basic single-shot intent classification and perhaps simple slot filling. (In contrast, Bixby is another assistant that is built conversational from the start, and in many ways has a design similar to Almond’s.) Note also that voice tech is mature and standard STT works really well.</p><p>Where things change is natural language programming. Ultimately, the goal of a virtual assistant is to use natural language to do things, and because the assistant is a machine, all it can do is execute code. So the idea is that every natural language command issued to the assistant can be mapped to an executable statement in a programming language (a domain-specific language, in our case ThingTalk). The job of the assistant then is just to execute the generated code and present the results to the user. Once you frame the assistant this way, the capability of the assistant is only limited by what the DSL can represent.</p>
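<p>To make that framing concrete, here is a minimal Node.js sketch of the loop Giovanni describes. All the function names and the toy program representation below are hypothetical, invented for illustration; they are not Almond’s actual Genie or ThingTalk APIs.</p><pre>// Hypothetical sketch: every utterance becomes an executable program.

// 1. Stand-in for the neural semantic parser: sentence in, program out.
function semanticParse(utterance) {
  if (utterance.toLowerCase().includes('bitcoin')) {
    return { query: 'bitcoin.get_price', params: {} };
  }
  throw new Error('could not parse: ' + utterance);
}

// 2. Stand-in for execution: dispatch the program to a skill API.
async function execute(program) {
  const skills = {
    'bitcoin.get_price': async function () { return { price: 3500 }; },
  };
  return skills[program.query](program.params);
}

// 3. Stand-in for the dialogue agent: verbalize the structured result.
function verbalize(program, result) {
  return program.query + ' returned ' + JSON.stringify(result);
}

async function handleUtterance(utterance) {
  const program = semanticParse(utterance);   // understand = compile
  const result = await execute(program);      // run the generated code
  return verbalize(program, result);          // present the results
}

handleUtterance('what is the bitcoin price?').then(console.log);</pre><p>Everything interesting happens inside the first step, of course; the point is that the control flow really is this simple once understanding is framed as compilation.</p>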
<p>For example, we experimented with a DSL of access control policies, and found it could be used to grant fine-grained access to shared devices and accounts (in <a href="https://www.youtube.com/watch?v=R-BHyvli6c0">Ubicomp 2018</a>).</p><h4><strong>What is the personal assistant end-user experience, according to the Almond vision?</strong></h4><p><strong>Giorgio:</strong> What kind of personal assistant user experience does Almond provide to end users? It seems to me that Almond is built on the task-oriented approach of automating everyday user actions/tasks (especially browsing/querying the web), in the IFTTT-like way (<em>“alert me when the price of BitCoin is below $3600”</em>). This approach reminds me of the Google Assistant mantra “To get things done<em>”</em>, and that’s smart! This kind of personally programmed micro-task completion feature is in fact pretty much absent (or only hinted at) in big-player platforms such as Google Assistant and Amazon Alexa.</p><p>On the other hand, Google and Amazon provide some sort of general-purpose, non-personal basic question answering and (news/music) streaming services, deferring to third-party developers (Actions in Google parlance, skills in Amazon parlance) for any other specific service.</p><p>Could you elaborate on the key values and UX features that differentiate Almond from the big players?</p><p><strong>Giovanni:</strong> First of all, I want to stress that Almond is still a research prototype. It is an experimental platform to test our ideas, both in NLP and in HCI. We have received a grant from Sloan to turn the prototype into a truly usable product. As we do that, we imagine we will also focus on the most important skills: music, news, Q&amp;A, weather, timers, etc. Yet the technical foundation to support end-user programming will remain there, and we will try to support it going forward.</p><p>In terms of differentiating features, I imagine the key differentiator is really privacy, rather than UX. I imagine Almond will be supported on a traditional smart speaker interface, because that’s the most common use case for a voice interface. I also personally like using Almond on the PC, where we have an opportunity on the free software OSes. I recently gave a <a href="https://www.youtube.com/watch?v=ZRNZpGfnu3w">talk at GUADEC</a> (the GNOME conference) about the potential opportunities there.</p><p><strong>Giorgio: </strong>What do you think about the big players’ server-centric (first-party plus third-party) information architecture, especially in terms of quality of service to end users?</p><p><strong>Giovanni: </strong>Because we’re fully open source, I think the distinction between first party and third party will be blurred in our assistant. We give the same technology to everyone, unlike Alexa, for example, which keeps AMRL (the Alexa Meaning Representation Language) only for first-party skills. We imagine that even long-tail skills will be developed in an open repository, and everyone will collaborate to build those skills. <br>The model should be similar to <a href="https://www.home-assistant.io/">Home Assistant</a>, a leader in the open-source IoT space, which is entirely built by the community.</p><h4><strong>About the NLP chain: LUINet, ThingTalk and Thingpedia</strong></h4><p><strong>Giorgio: </strong>Could you introduce the Almond natural language processing chain you conceived?</p><p><strong>Giovanni:</strong> The key idea of our approach to NLU is to factor the domain-independent aspects of natural language out of the specific domains. 
Our goal is to raise the level of abstraction, so that developers don’t have to build the same thing over and over again. Instead, we want developers to specify their APIs and database schemas, with a few bits of natural language on every field. We then use a general state machine of dialogues and a general grammar of English to synthesize millions of dialogues that talk about the domain of interest, which we train on. The tools to build these synthesized training datasets are part of Genie, which is the core NLP technology backing Almond. In diagram form:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*mYghHqlbqaY6Y1Cs" /></figure><p>At inference time, this is what the agent does:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*B7E7BiHtSBFDwSte" /></figure><p>Our pipeline uses a neural semantic parser (the LUINet model, based on the BERT-LSTM architecture) to understand the input sentence and map it to an executable form in the ThingTalk programming language. The ThingTalk code makes use of primitive APIs defined in Thingpedia, such as the Yelp skill in this example. The code is JIT compiled and executed, and returns the results. The results are then passed to a general dialogue state machine that, given the AST of the executed code, the results, and annotations on the APIs, is able to generate both the new formal representation of the dialogue and the agent’s utterance.</p><p>Here is the state machine at a glance:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*l4wfI1-jZ90D0xzg" /></figure><p>The interesting aspect of this state machine is that it does not depend on the particular domain of interest. So the same state machine can be used for restaurants, movies, music, etc. The state machine is built once and for all, and new domains can be plugged in at very little cost. Additionally, when the state machine is refined to add new features, the refinements are shared across all skills. This is also a way in which all skills are “first-party skills”: all skills benefit from the work done to improve other skills, whenever that work is not domain-specific.</p>
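<p>To make the synthesis step concrete, here is a deliberately tiny Node.js sketch of template-based generation in the spirit of what Giovanni describes. The domain spec, the template format and the output syntax are all invented for illustration; Genie’s real template language is much richer and composes recursively.</p><pre>// Toy synthesis sketch (invented format, not Genie's real templates).
// Domain spec: an API schema plus a few bits of natural language per field.
const domain = {
  name: 'restaurant',
  fields: [
    { name: 'cuisine', phrases: ['italian', 'mexican'] },
    { name: 'rating', phrases: ['highly rated', 'well reviewed'] },
  ],
};

// Domain-independent sentence templates, written once for every domain.
const templates = [
  function (d, f, phrase) {
    return {
      utterance: 'find ' + phrase + ' ' + d.name + 's',
      program: d.name + '.search(' + f.name + ' == "' + phrase + '")',
    };
  },
  function (d, f, phrase) {
    return {
      utterance: 'show me ' + d.name + 's that are ' + phrase,
      program: d.name + '.search(' + f.name + ' == "' + phrase + '")',
    };
  },
];

// The cross product of templates and annotations becomes training data.
const dataset = [];
for (const field of domain.fields) {
  for (const phrase of field.phrases) {
    for (const template of templates) {
      dataset.push(template(domain, field, phrase));
    }
  }
}
console.log(dataset.length + ' synthesized examples');
console.log(dataset[0]);</pre><p>The cross product of a domain-independent grammar with a handful of per-field annotations is what lets a few developer-provided phrases expand into millions of domain-specific training examples.</p>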
<h4><strong>About the technology behind the Almond NLP chain</strong></h4><p><strong>Giorgio:</strong> The work you’ve done with LUINet, Genie, and all the other components is impressive! It seems to me that you followed the “classic” semantic parsing approach, where natural language statements are translated into a formal language (ThingTalk). The semantic parsing approach makes absolute sense to me, and it differentiates itself from the currently very popular intent-based probabilistic classifier approach used by Google (Dialogflow), Amazon (Lex) and many other NLU (Natural Language “Understanding”) platforms on the market.</p><p>Could you elaborate on how the Almond approach differs from intent-based classifiers? What are the pros and cons of the two approaches? <br>Don’t you think that, in the long term, intent-based machine learning could be a simpler way to let the assistant learn (“retrain”) new user intents/requests? On the other hand, in the long term, producing a formal language as Almond does would win in terms of AI explainability and possible “machine reasoning”. What do you think?</p><p><strong>Giovanni:</strong> First, let me clarify that when we say semantic parsing, we mean neural semantic parsing, rather than classic semantic parsing. Classic semantic parsing is template or grammar based, and tries to match spans of the sentence to specific primitives in the knowledge base. <br>Neural semantic parsing instead has a lot more in common with machine translation: a sentence is fed to a neural network, and the neural network outputs a program, token by token. <br>That makes neural semantic parsing a superset of intent-and-slot systems: intent-and-slot is the special case of semantic parsing where the target language is a single function call with parameters. <br>The neural semantic parsing approach is more general, because the target language can be any formal language; it need not match the sentence exactly, nor is it limited to a single API call. For example, semantic parsing allows us to translate questions into SQL-like statements with joins, projections and filters, instead of hard-coded API calls, which lets us understand complex questions better.</p><p>The downside is that neural semantic parsing requires more data to train, which, if annotated by hand, must be annotated by an expert. Building a semantic parsing training set by hand is practically infeasible: the closest that have been built are dialogue state tracking datasets such as MultiWOZ (which are known to have annotation problems) or paraphrase-based datasets such as Overnight, WikiSQL and Schema-Guided Dialogues (but performance on paraphrases is known to overestimate performance on real data). Genie, on the other hand, allows us to use synthesis for the training set and annotate only a small amount of data for evaluation, which makes semantic parsing practical again.</p><p>I expect that, in the long term, all assistants will move to semantic parsing. Note for example that Alexa is also using some form of semantic parsing (through AMRL) for first-party skills. Intent classification is too limited in what kinds of sentences it can understand.</p><p>Genie also differs from other assistants in how it approaches state tracking (carrying state across multiple turns). While commonly used technology separates NLU and state tracking, Genie combines both problems into a single neural network. This reduces the problem of “unhappy paths” typical of rule-based state tracking, where all the different ways the user might continue the conversation or change the subject have to be modeled explicitly. <br>The rule-based state machine is still imbued into the neural model through the state-machine-based synthesis, but the neural network can generalize beyond it. In our experiments on the MultiWOZ dataset, we found the state machine could cover 83% of turns, but the neural network could still interpret correctly 47% of the remaining 17%, hence generalizing beyond the state transitions explicitly supported by the agent. <br>See the paper: <a href="https://arxiv.org/pdf/2009.07968.pdf">State-Machine-Based Dialogue Agents with Few-Shot Contextual Semantic Parsers</a>.</p>
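<p>The contrast with intent-and-slot systems is easiest to see side by side. Both structures below are invented for illustration (real Alexa/Dialogflow payloads and real ThingTalk look different):</p><pre>// The user says: "find a cheap italian restaurant near my office"

// Intent-and-slot output: one label plus a flat bag of slots.
// Composition (joins, multiple filters, negation) has nowhere to go.
const intentAndSlot = {
  intent: 'SearchRestaurants',
  slots: { cuisine: 'italian', price: 'cheap' },
};

// Neural semantic parsing output: a program. Filters and projections
// compose like code, so more complex questions stay representable.
const program = [
  { op: 'query', fn: 'restaurant.search' },
  { op: 'filter', field: 'cuisine', equals: 'italian' },
  { op: 'filter', field: 'price', equals: 'cheap' },
  { op: 'filter', field: 'distance_from', equals: 'user.office' },
];

console.log(JSON.stringify(intentAndSlot));
console.log(JSON.stringify(program));</pre><p>An intent classifier has to anticipate every shape of request as a separate intent, while a semantic parser only needs the target language to be able to express it.</p>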
<h4><strong>Will ThingTalk evolve from a smart-command interpreter into a full “conversational companion”?</strong></h4><p><strong>Giorgio:</strong> I see ThingTalk, the Almond natural language programming language inspired by IFTTT, as a first attempt to implement general natural language programming. Using ThingTalk you can set up, in natural language, “actions” triggered by external (web API / local) events (e.g. <em>“When I use my inhaler, get my GPS location; if it is not home, write it to a log file in Box.”</em>). <br>Does ThingTalk also include continuous learning of personal facts and a personal knowledge-base memory? <br>Do you have any news about how Almond could eventually evolve into a general-purpose personal “conversational AI”, able to sustain multi-turn conversations, not only in event-based task-completion contexts, but perhaps also in <a href="https://vimeo.com/288109032">companion-like</a> open-domain / chit-chat dialogs?</p><p><strong>Giovanni:</strong> This is absolutely on our radar. First of all, we’re partnering with Chirpy Cardinal, another Stanford project, which won second place in the Alexa Social Bot Challenge. In the near future, we will integrate Chirpy Cardinal into Almond for companionship and chit-chat capabilities.</p><p>We also imagine that the assistant will learn the user’s profile and preferences, and will have memory of all transactions, inside and outside the agent. We do not have any released work on this yet.</p><h4><strong>About stateful and contextual dialog management</strong></h4><p><strong>Giorgio:</strong> In general, one topic I’m personally obsessed with is how to program multi-turn chatbot dialogs in contextual (closed or open) domains. That’s a goal not yet achieved by the Google and Amazon cloud-based assistants. To date, in fact, both of these famous systems surprisingly fail to maintain dialog context in multi-turn conversations, even in a domain as simple as weather forecasts. The lack of context is not just about the conversational domain, but also about time: the above-mentioned voice assistants cannot remember much of anything about a previous interaction with a specific user. No memory (“stateless”, if we think of a conversation as a state-machine workflow). Worse, there isn’t any incremental learning from conversations.</p><p>Now, in what directions do you think conversational technology will evolve? Personally, I foresee a next generation of personal assistants that will be able to sustain task-based / closed-domain dialogs (say, in the ThingTalk way) and to chat about general open-domain knowledge. The basic personal assistant feature I do not yet see in any state-of-the-art chatbot is the ability to understand and reason about personal user facts. Do you agree with this view?</p><p><strong>Giovanni:</strong> I think state-of-the-art assistants will grow conversational capabilities for task-oriented skills very quickly. Some, like Almond and Bixby, are built to support multi-turn from the start. Others, like Alexa, will require re-engineering for multi-turn, but they will get there very soon. See also <a href="https://developer.amazon.com/en-US/docs/alexa/conversations/about-alexa-conversations.html">Alexa Conversations</a> as an emerging technology for multi-turn, multi-skill experiences.</p><p>Incremental learning is a much more open-ended area. There is a large body of work in this space, starting with <a href="https://igorlabutov.com/static/papers/lia.pdf">LIA</a>, the “teachable” assistant from CMU. I also imagine the assistant will grow a profile of the user, both by data mining the conversation history and by explicitly tracking a KB of the user’s information. In a sense, this is already available: the assistant knows my contacts and family relations, it knows my location, it knows my preferred music provider, and so on. 
It will only grow over time as more features are added.</p><p><strong>Giorgio</strong>: By the way, what is your opinion on any practical usage of “statistical web-crowd-sourcing” (my definition) in systems like the Generative Pre-trained Transformer 3 (<a href="https://arxiv.org/pdf/2005.14165.pdf">GPT-3</a>), the auto-regressive language model that uses deep learning to produce human-like text?</p><p><strong>Giovanni</strong>: Pretraining is at the core of the modern NLP pipeline, whether it’s masked-language-model “fill in the blanks” pretraining (BERT and subsequent works), generative pretraining (GPT 1, 2 and 3) or sequence-to-sequence (T5, BART). It is key to understanding language, because it can be trained unsupervised, so it has significantly lower cost than supervised training. I can only imagine the use of pretraining will grow over time. As for GPT-3 specifically, the few-shot results are honestly impressive on a range of tasks. At the same time, the model is so large that it cannot be easily fine-tuned, so it’s quite difficult to apply it to a downstream task.</p><p><strong>Giorgio:</strong> About closed-domain vs open-domain chatbot building, what do you think of the <a href="https://rasa.com/">RASA</a> approach (the open-source engine for building contextual assistants)?</p><p><strong>Giovanni:</strong> What RASA is doing is quite interesting, in that they’re also trying to push the envelope of conversationality, and they also recognize the limits of intent-based systems. At the same time, I see their current NLU product is still using a classic intent-based dialogue tree. Their dialogue manager requires fully annotated examples of conversations, which are incredibly hard to acquire and annotate well. But I’m looking forward to new stuff, when it becomes available!</p><h4><strong>The Distributed ThingTalk Protocol and the Federated Virtual Assistants Architecture</strong></h4><p><strong>Giorgio:</strong> One of the things I love most about Almond is the vision of a next-generation web made of a network of federated (Almond-based) virtual assistants. As far as I understand, in this model each person would have a virtual assistant acting as a virtual secretary, talking with other people’s assistants or with people directly. The virtual assistant would act as a “programmable interface”, managing access control and sharing personal info based on dynamic programming done by the users themselves. That is, in my opinion, very powerful and disruptive! Could you explain this concept and give us some technical details of the implementation architecture?</p><p><strong>Giovanni:</strong> I think your question summarizes it very well. The idea is that every person would have their personal virtual assistant running on their own trusted device. The virtual assistant executes requests on behalf of the owner, and on behalf of others, with access control. The requests are represented in ThingTalk and exchanged over a messaging protocol; in our prototype, we used the <a href="https://matrix.org/">Matrix</a> messaging protocol. The access control policies are also represented in ThingTalk. Access control is enforced using <a href="https://en.wikipedia.org/wiki/Satisfiability_modulo_theories">Satisfiability Modulo Theories</a>, so the access control is formally verified. I recommend our Ubicomp 18 <a href="https://oval.cs.stanford.edu/papers/ubicomp18.pdf">paper</a> for further technical details.</p>
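<p>As a purely illustrative aside, here is what a fine-grained policy check might look like as plain Node.js predicates. The policy shape and the example scenario are mine; in Almond the policies are ThingTalk expressions, and compliance of a request with the policy set is decided with an SMT solver rather than ad-hoc code like this.</p><pre>// Hypothetical sketch of a fine-grained access-control check in a
// federated assistant. Invented data shapes, for illustration only.
const policies = [
  // "Alice may read my thermostat, but only while I am home."
  {
    requester: 'alice@example.org',
    device: 'thermostat',
    operation: 'read',
    condition: function (ctx) { return ctx.ownerIsHome; },
  },
];

function isAllowed(request, ctx) {
  return policies.some(function (p) {
    if (p.requester !== request.requester) return false;
    if (p.device !== request.device) return false;
    if (p.operation !== request.operation) return false;
    return p.condition(ctx);
  });
}

console.log(isAllowed(
  { requester: 'alice@example.org', device: 'thermostat', operation: 'read' },
  { ownerIsHome: true }
)); // prints: true</pre>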
<p>The interesting thing about this work, though, was seeing how useful fine-grained access control would be: in our user study, across 20 scenarios, we found that the willingness to share data and accounts would double with fine-grained control. We also found that our access control language covered 90% of the enforceable use cases suggested by crowdworkers.</p><h4><strong>On which devices will Almond run: smartphones, smart speakers, personal computers?</strong></h4><p><strong>Giorgio:</strong> I know you spent a lot of energy trying to run Almond on a vast range of personal computing platforms, focusing on the Android app as a common “personal computer”, perhaps because smartphones are the personal computers of this era for most people. Besides, one possible weakness I see in Almond is the absence of a (home-based) voice interface, perhaps through an open-hardware smart speaker. Do you have any plan to let private citizens interface with Almond through a smart speaker or any voice-based platform? What are the pros and cons of voice-first interfaces?</p><p><strong>Giovanni:</strong> We absolutely see Almond on the smart speaker as a first-class citizen. Since fall of 2019 we have partnered with Home Assistant to bundle Almond as an official add-on, so you can use Almond to control a Home Assistant-based smart speaker. That means one can build a fully open-source voice assistant stack using a Raspberry Pi, Home Assistant OS, and Almond. There are a couple of challenges in using Almond with a pure voice interface, mainly around the wake word, for which there is no easy-to-use open-source solution. (Recently, we discovered <a href="https://github.com/castorini/howl/">Howl</a> from UWaterloo, which is also used by <a href="https://voice.mozilla.org/firefox-voice/">Firefox Voice</a>, and we’re investigating it.) Also, building a conversational interface that is friendly to pure speech output is not easy. Even commercial assistants work better on a phone, where they can display links, cards, and interactive interfaces.</p><h4><strong>Could the Almond federated architecture also be a solution for businesses?</strong></h4><p><strong>Giorgio:</strong> A distributed architecture of virtual assistants (where each end user has their own local assistant), allowing people to tune fine-grained access control and select what info is public and what actions external assistants (i.e. other people) can access, seems to me a breakthrough in the current debate on personal data sharing.<br>By the way, you may know that in 2018 Tim Berners-Lee announced he was working on a personal assistant (code name: <a href="https://www.fastcompany.com/90243936/exclusive-tim-berners-lee-tells-us-his-radical-new-plan-to-upend-the-world-wide-web">Charlie</a>). “Unlike with Alexa, on Charlie people would own all their data”. Are you in touch with him or anyone at <a href="https://inrupt.com/">Inrupt</a>?</p><p><strong>Giovanni:</strong> I know that Monica has spoken with Tim Berners-Lee in the past. In any case, I believe this space is quite young, and there is certainly room for multiple open-source projects that focus on different aspects of the stack. Our focus is really on NLP and dialogue management, while their focus seems to be the distributed architecture.</p><p><strong>Giorgio:</strong> Almond seems focused, for now, on providing an assistant to private citizens (private end users). 
Couldn’t the distributed architecture and the access control management you propose for private end users also be applicable to companies that want to provide their services to people? I imagine a scenario where an end user’s assistant talks to a company’s assistant. Could this possible future extension of Almond be coupled with the Thingpedia APIs?</p><p><strong>Giovanni:</strong> Of course! The goal of our research prototype of a distributed virtual assistant was to show how useful access control can be in natural language. The use cases need not be limited to consumer access control: it could be applied in corporate settings, and it could be applied to sharing data between consumers and businesses. For an example of the latter, see this <a href="https://arxiv.org/abs/2003.10128">paper</a> from HTC &amp; NTU, which uses ThingTalk technology to audit the sharing of medical data.</p><h4><strong>First European Open Virtual Assistant Workshop</strong></h4><p><strong>Giorgio:</strong> In June 2020, the <a href="https://oval.hipeac.net/2020/#/">First European Open Virtual Assistant Workshop</a> scheduled by OVAL was cancelled due to the COVID-19 outbreak. The goal of the workshop was to introduce the OVAL lab’s open, federated and privacy-preserving virtual assistant to the European research and business communities. Is there any plan to reschedule the workshop?</p><p><strong>Giovanni:</strong> Unfortunately, as you can imagine, all in-person events have been canceled for the foreseeable future due to COVID-19. I don’t know at this time when the workshop will be rescheduled.</p><p><strong>Giorgio:</strong> In general, what do you think about the recent European citizens’ privacy-preserving initiatives and the laws supporting them (see the <a href="https://gdpr-info.eu/">GDPR</a> regulations, the recent <a href="https://www.data-infrastructure.eu/GAIAX/Navigation/EN/Home/home.html">GAIA-X</a> project and, specifically regarding personal assistants, the <a href="https://www.speaker.fraunhofer.de/en.html">Fraunhofer SPEAKER</a> platform)? Do you see common ground between the current European policy on “data sovereignty” and the Almond goals?</p><p><strong>Giovanni:</strong> I think European efforts in this space are very important in terms of raising awareness of the importance of privacy. Building effective alternatives to the Amazon / Google duopoly is one way to restore privacy, as we’re doing with Almond.</p><p>At Almond, we’re also collaborating with the <a href="https://www.ai4eu.eu/">AI4EU</a> initiative, which aims to build a European cloud of AI infrastructure.</p><p>At the same time, on a purely personal basis, as a European citizen with an opinion, I often disagree with the choices of our Commission, which seems animated more by economic strategy (and fear of American competition) than by sincere values. The goal should be privacy for all, not just making sure the next Google pays European taxes.</p><p>Often, it is also difficult to assess certain projects, because there is no open-source code, no released product, not even a development version. We see a lot of press releases and reports, but there is no sense of a coherent software artifact accessible to developers. Even for the two projects you linked, development has reportedly started, but there is no code accessible anywhere. 
To me, and I stress this is a purely personal opinion, this is not the way to run a successful open-source initiative.</p><h4><strong>How do you see the future of Almond?</strong></h4><p><strong>Giorgio:</strong> Recently Almond received <a href="https://sloan.org/storage/app/media/programs/public_understanding/program%20highlights/2020.6ProgramUpdates.pdf">Sloan support</a>. As far as I understand, the new funds will support the engineering of Almond solutions, with the goal of turning the developed prototypes into real products usable by consumers.</p><p>Could you give more details and describe the next steps of the project, in the short term and in the long term (say, the next five years)?</p><p><strong>Giovanni:</strong> The short-term goal, in the next year, is to use the funds from Sloan and other foundations to build an initial product. We aim for a small initial user base of enthusiasts who care about privacy.</p><p>The long-term goal is then to use this initial product to raise further funds, and then use the established product to build both a successful open-source community and an ecosystem of companies using Almond technology in their products. This should allow Almond to thrive and become self-sustainable.</p><h4>Recent Talks (2020)</h4><p><a href="https://www.youtube.com/watch?v=ZRNZpGfnu3w">Almond: An Open, Programmable Virtual Assistant</a> <br>Giovanni Campagna, GUADEC, the GNOME Conference, ORBIS 2020, <br>July 24, 2020.</p><p><a href="https://www.youtube.com/watch?v=uybmgFHrupE">Building the Smartest and Open Virtual Assistant to Protect Privacy</a><br>Monica Lam, Stanford Online Seminar, April 9, 2020.</p><h4>Links</h4><p><a href="https://web.stanford.edu/~gcampagn/">Giovanni Campagna home</a><br><a href="https://almond.stanford.edu/">Almond project home</a><br><a href="https://oval.cs.stanford.edu/">Stanford Open Virtual Assistant Lab</a><br><a href="https://wiki.almond.stanford.edu/">Open Virtual Assistant Initiative Wiki</a><br><a href="https://community.almond.stanford.edu/">Almond Community Forum</a><br><a href="https://github.com/stanford-oval">Almond open-source repositories</a></p><h4><a href="https://wiki.almond.stanford.edu/contributing"><strong>How to contribute to Almond</strong></a></h4><blockquote>Updates: <br>September 22nd, 2020: in the paragraph “About the technology behind the Almond NLP chain”, the answer was expanded with a link to a new paper.</blockquote><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=b4d66167e76c" width="1" height="1" alt=""><hr><p><a href="https://convcomp.it/whither-almond-the-stanford-university-open-virtual-assistant-will-go-b4d66167e76c">Whither Almond, the Stanford University open virtual assistant, will go?</a> was originally published in <a href="https://convcomp.it">ConvComp.it</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Are Alexa and Google Assistant both unfit as language learning assistant, inside/outside the…]]></title>
            <link>https://convcomp.it/are-alexa-and-google-assistant-both-unfit-as-language-learning-assistant-inside-outside-the-972bfafeddfd?source=rss----e9c948ff6ebd---4</link>
            <guid isPermaLink="false">https://medium.com/p/972bfafeddfd</guid>
            <category><![CDATA[amazon-echo]]></category>
            <category><![CDATA[education]]></category>
            <category><![CDATA[google-assistant]]></category>
            <category><![CDATA[alexa]]></category>
            <category><![CDATA[google-nest]]></category>
            <dc:creator><![CDATA[Giorgio Robino]]></dc:creator>
            <pubDate>Sun, 15 Dec 2019 17:23:49 GMT</pubDate>
            <atom:updated>2020-01-15T17:36:21.787Z</atom:updated>
            <content:encoded><![CDATA[<h3>Are Alexa and Google Assistant both unfit as language learning assistants, inside/outside the classroom?</h3><h4>Some reasons why I’m developing a Telegram chatbot, after giving up on developing an Alexa or Google Assistant application.</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/617/1*QI-StFH5nwFJzfMnplopBA.jpeg" /><figcaption>Students of Italian language at CPIA courses (Italian public adult schools). <a href="https://www.gildavenezia.it/educazione-adulti-oltre-100mila-iscritti-ai-cpia-in-aumento-gli-stranieri/">source</a></figcaption></figure><blockquote>This article is a remake of my original answer to <a href="https://medium.com/u/77381b34c63f">Julie Daniel Davis</a>’s article <a href="https://medium.com/voiceedu/google-assistant-versus-amazon-alexa-which-could-be-queen-of-the-classroom-540314d8f981">Google Assistant versus Amazon Alexa: Which Could be Queen of the Classroom?</a><br> I toned down my original title, “Are Alexa and Google Assistant both losers in the edutech space?”, which was too extreme and generic, I admit. <br>This revised article is above all a way of reflecting on my experience with CPIAbot, a chatbot I’m developing, as part of an <a href="https://www.itd.cnr.it/Progetti_Rispo1.php?PROGETTO=1193">ITD-CNR research project</a>, to assist non-native, almost illiterate (Italian L2, CEFR level Pre-A1) students of some Italian public schools. <br>More broadly, my thoughts concern some limits of today’s Amazon and Google voice assistant and smart speaker technology.</blockquote><p>Almost one year ago I started to develop CPIAbot, a <em>language-first</em>, voice-first multimodal chatbot running on <a href="http://www.telegram.org">Telegram</a>, to help foreigners enrolled in <a href="http://www.retecpialiguria.it/">CPIA</a> courses (Italian public adult schools) learn the basics of the Italian language.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*2Q2Ah8CU0aLV0wqtTxQ4oA.png" /><figcaption>CPIAbot <a href="https://www.slideshare.net/slideshow/embed_code/key/vRtX3MSrkrCW2M">slides</a>, presented at the Italian event <a href="https://twitter.com/solyarisoftware/status/1194644685601546241?s=20">www.c1a0.ai</a></figcaption></figure><p>Let me tell the story. In fall 2018, our initial research goal at <a href="https://www.itd.cnr.it/Progetti_Rispo1.php?PROGETTO=1193">ITD-CNR</a> was to build a smart speaker application (on Alexa or on Google Assistant), but we faced many issues. I detailed a long list of points in an academic paper I have just submitted (soon available), entitled “<em>Un assistente conversazionale a supporto dell’apprendimento dell’italiano L2 per migranti: CPIAbot</em>” (more details at the end of this article).</p><p>Long story short, about using smart speakers in our ongoing experiment:</p><blockquote>From the linguistic/educational perspective, a voice-only application is too demanding in terms of the learner’s cognitive effort, especially in the case of a non-native, almost illiterate learner (Italian L2, CEFR level Pre-A1).</blockquote><h4>Unique User Identification</h4><p>There are also many related technical issues, but for me, in educational realms (inside or outside the classroom),</p><blockquote>the big issue that both systems have is the lack of real <strong>Unique User Identification</strong>.</blockquote><p>That means a way to uniquely identify the user (any student, any teacher) of the conversational/voice application. Having a user ID is, to me, essential for any chat or voice application whose goal is to follow a student’s learning.</p>
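<p>This, incidentally, is a problem a Telegram bot simply does not have: every Bot API update carries a stable numeric user id together with the full message text. Here is a minimal sketch; the handler and the in-memory store are hypothetical glue code, while update.message.from.id and update.message.text are actual Telegram Bot API fields:</p><pre>// Sketch of a raw Telegram Bot API update handler.
const progressByUser = new Map(); // hypothetical in-memory store

function recordProgress(userId, utterance) {
  const history = progressByUser.get(userId) || [];
  history.push(utterance);
  progressByUser.set(userId, history);
}

function onUpdate(update) {
  const userId = update.message.from.id; // stable, unique per student
  const text = update.message.text;      // full, unfiltered utterance
  recordProgress(userId, text);
}

// Example update, shaped as Telegram would deliver it:
onUpdate({ message: { from: { id: 42 }, text: 'Sono delle commesse' } });
console.log(progressByUser.get(42));</pre><p>With a per-user history like this, personalized exercises and progress tracking become straightforward.</p>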
<p>A related big point is that it is impossible to process students’ voice recordings. In short:</p><blockquote>an Alexa Skill or a Google Assistant Conversational Action cannot access user voices. Full stop.</blockquote><p>Below, I go deeper into the voice/text flow that happens when a user interacts with a third-party app on Alexa or Google Assistant (through a smart speaker).</p><h4><strong>Personal Voice Identification</strong></h4><p>Voice signature identification is just a subtopic of the need to identify users (students). We could probably give up identifying a student by his/her voice, but at the end of the day we need to identify the student somehow (especially for the personalized exercises that are part of our application/research goal). If we want to track and improve a specific student’s learning progress, we need to identify him/her.</p><blockquote>So far, neither Alexa nor Google Assistant provides a convincing, definitive solution for identifying the user speaking in front of a smart speaker (a far-field device).</blockquote><blockquote>Giorgio Robino on Twitter: “@4ICT Well, I’m discovering that: 1/ GoogleHome has a ‘voice match’ feature that could allow identifying all users who did the matching procedure. Afterward GA could discard ‘remote waking’, but that seems NOT possible. See: https://t.co/N8rxrLAnkx =&gt; functional privacy bug, IMO”</blockquote><p>Both big players allow identifying a small set of voiceprints (Google calls these <a href="https://support.google.com/googlenest/answer/7342711?hl=en"><em>voice matches</em></a>, whereas Amazon calls them <a href="https://www.amazon.com/gp/help/customer/display.html?nodeId=202199440"><em>voice profiles</em></a>). Google Assistant recognizes a maximum of six different voices, whereas it is not clear whether Amazon Alexa has a limit on the number of recognized voices. The good news is that both systems pass this info to third-party skills. See <a href="https://actions-on-google.github.io/actions-on-google-nodejs/classes/conversation.user.html">here</a> and <a href="https://developer.amazon.com/blogs/alexa/post/1ad16e9b-4f52-4e68-9187-ec2e93faae55/recognize-voices-and-personalize-your-skills">here</a>.</p><p>But the proposed solutions are impractical or even impossible in classroom scenarios, where there could be many students in front of a single smart speaker. All in all, present-day voice recognition is not a suitable way to identify a student.</p><h4>Users’ Speech (voice recordings)</h4><p>Voice recognition aside, there are other basic limits concerning voice recordings in general. The biggest issue involves privacy, but let’s leave that thin ice aside for now and just talk about some plain technical points:</p><blockquote>both Alexa and Google Assistant “by design” do not forward the users’ (students’, in our case) <strong>voice recordings</strong> to third-party skills/actions.</blockquote><p>Generally speaking, this is quite understandable, because the big players do not want any possible malicious use of people’s voices by third-party applications. But it inhibits a lot of smart processing that the application could do with voice analysis. I already pointed out the need for voiceprint recognition to identify speakers, and…</p><blockquote>without the students’ audio/voice recordings, the application can’t do pronunciation analysis, sentence intonation recognition, emotion detection (“sentiment analysis”), etc.</blockquote>
<h4><strong>On Alexa, user utterances are not forwarded to third-party skills</strong></h4><p>In the case of Alexa, user utterances (transcribed from voice to <strong>text</strong>) are not forwarded to third-party skills “as is”. In fact:</p><blockquote>An Alexa skill does not even receive the full sentence (the voice-to-text transcript) of the user’s speech!</blockquote><p>Instead, the skill gets just a label (an <em>intent</em>, in “conversational AI” jargon) that the developer has planned up front (during the “<a href="https://developer.amazon.com/en-US/docs/alexa/custom-skills/create-the-interaction-model-for-your-skill.html">Alexa Skill interaction model</a>” design phase) and that matches the current user sentence. Strange, but this is the way!</p><blockquote>Why does Alexa apply this “censorship”?</blockquote><p>As far as I know, it has never been officially explained by Amazon, but I do believe it was a deliberate strategic decision. My guess is that the Alexa “interaction model” mitigates possible malicious uses/abuses by third-party skills, giving Amazon an automatable way to control third-party apps, avoiding privacy issues, etc. That’s a defensible “customer first” dogma.</p><p>On the other hand, the intent-based interaction prevents third-party skills from processing the user’s full utterance, limiting the NLU (natural language understanding).</p><blockquote>Giorgio Robino on Twitter: “new episode: to filter (Alexa with ‘interaction model’) or not to filter (Assistant Actions SDK) user utterances? Details: https://t.co/UTpk1tYa6j Google wins this play.”</blockquote><p>Google Assistant, with the <em>Actions SDK</em>, gives developers more freedom: they can use a classic intent-based classifier (the <a href="https://dialogflow.com/">Dialogflow</a> platform), as Alexa does, but Google also offers an alternative “low-level” pass-through (the <em>Actions SDK API</em>), where the user utterances are passed to the application without any filtering. Thanks, Google, for that!</p><p>Why is this “pass-through” so important for an e-learning app, or for any (language) assistant bot?</p><blockquote>Having the student’s complete text and voice input is paramount to analyzing utterances in a human-machine conversation, e.g. a linguistic exercise.</blockquote><p>Let’s imagine a simple conversation where an assistant chatbot asks the student to describe a scene displayed in an image or a video. In the example below, the CPIAbot exercise “guess the word” asks the student to guess the word (part of a glossary) describing the people who work in the scene.</p><p>For this specific image, the students could answer: <em>sales girls</em>, <em>women</em>, <em>cashiers</em>, <em>shop girls</em>, <em>clerks</em>, and many other definitions that it would be interesting for the bot to catch and analyze. This is feasible with the pass-through, but pretty hard to implement with the Alexa interaction model (see the sketch below).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/772/1*wIR2UOlXRIXZVG_3OBvKBA.png" /><figcaption>screenshot from CPIAbot chatbot exercise “indovina la parola”</figcaption></figure>
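<p>Once the full transcript is available, the matching itself is simple. Here is a minimal Node.js sketch of what such answer checking could look like; the helper names and the matching strategy are mine, for illustration, not CPIAbot’s actual code:</p><pre>// Accept any of several glossary answers inside a free-form utterance.
const acceptedAnswers = ['sales girls', 'women', 'cashiers', 'shop girls', 'clerks'];

function normalize(text) {
  // lowercase, strip punctuation, collapse to plain words
  return text.toLowerCase().replace(/[^a-z\s]/g, ' ').trim();
}

function checkAnswer(utterance) {
  const normalized = normalize(utterance);
  return acceptedAnswers.some(function (answer) {
    return normalized.includes(answer);
  });
}

console.log(checkAnswer('I think they are shop girls!')); // true
console.log(checkAnswer('They are teachers'));            // false</pre><p>With an intent-based model, every acceptable phrasing would instead have to be enumerated up front in the interaction model.</p>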
<h4>Device costs</h4><p>The other point the original <a href="https://medium.com/voiceedu/google-assistant-versus-amazon-alexa-which-could-be-queen-of-the-classroom-540314d8f981">article</a> mentions is the end-user cost of smart speaker devices. The author is absolutely right when she says that an <em>Amazon Echo Dot</em> is cheaper than a <em>Google Nest Home</em>. The same goes for <a href="https://www.amazon.com/Echo-Buds/dp/B07F6VM1S3">Echo Buds</a> (Amazon offers the cheapest hearable devices on the market). By the way, earbuds (personal, near-field devices) solve the user identification problem mentioned earlier.</p><p>In a language learning context (and in any discipline), we want students to use the (voice bot) application outside the classroom too, and</p><blockquote>the cheapest and “easiest to own” device for people, especially refugees, is a smartphone.</blockquote><p>So, back to our comparison: both Google Assistant and Alexa again lose in practice, because although both assistants are available as mobile apps, there are limits on voice/text interaction; for example, with Alexa students can’t interact by texting. Google Assistant is a bit better, allowing users to write (or speak), but (if I remember correctly) there is no way, on the action application side, to distinguish whether the user wrote or spoke (I could be wrong; I have to double-check).</p><h4>Minor notes about device audio quality</h4><p>The <em>Amazon Echo</em> has a jack audio output and the <em>Google Nest</em> does not. That’s true. But both devices can be paired via Bluetooth. Problem solved.</p><p>As an “audiophile”, in terms of audio quality, comparing my <em>Google Home Mini</em> with an <em>Amazon Echo Dot (v3)</em>, I much prefer Google’s device for its more natural sound; conversely, I literally dislike the over-compressed audio of the <em>Amazon Echo</em> devices. Well, that’s subjective, and on the other hand, in a classroom, the Echo’s bigger sound is a plus, I admit.</p><h4>Authoring tools for teachers</h4><p><a href="https://blueprints.amazon.com/">Alexa Skill Blueprints</a>? Maybe they are a nice tool for initial engagement and gaming, but you can’t develop interesting or serious educational applications with them.</p><p>The most important point for me is that, so far, neither Google nor Amazon provides really simple, convincing tools for non-developers in educational realms, such as teachers.</p><p>Visual (GUI) versus language-based (CUI) tooling is an old but hotly discussed debate among conversational designers and developers. For my part, regarding skill/action development, and more generally conversational application design and development, I have been a supporter for a few years now of <strong>non-visual</strong>, high-level <strong>authoring tools</strong>: declarative, near-natural-language programming languages (the sketch below shows the kind of thing I mean). So far, I do believe that</p><blockquote>both Amazon and Google fail to provide authoring tools that allow teachers, content creators and non-developers in general to easily create serious/complex custom applications.</blockquote><p>Maybe Amazon is moving in the right direction with the <a href="http://www.litexa.com">www.litexa.com</a> approach (I’ll go deeper into that topic in a future article). On the other hand, I have to say, as a developer, that the <em>Google Actions</em> programming paradigm is better than Amazon’s skill programming proposal. There are many reasons, a bit too technical/off-topic here, that I have often pointed out in my <a href="https://twitter.com/solyarisoftware">tweets</a>.</p><p>Concluding, I confess I’m not too happy about how the two biggest players currently support application development in educational realms. We need much more! :-)</p>
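<p>To show what I mean by a declarative, near-natural-language authoring format, here is a purely hypothetical example, invented for this article (it is neither Litexa’s syntax nor CPIAbot’s actual configuration):</p><pre>// Hypothetical declarative exercise definition a teacher could author.
const exercise = {
  type: 'guess-the-word',
  image: 'shop-scene.png',
  prompt: 'Chi lavora in questo negozio?', // "Who works in this shop?"
  acceptedAnswers: ['commesse', 'cassiere', 'negozianti'],
  onCorrect: 'Brava! Sono delle commesse.',
  onRetry: 'Riprova: guarda bene la foto.',
};

module.exports = exercise; // a generic engine would interpret this spec</pre><p>A teacher fills in content; a generic runtime engine interprets it. No general-purpose code required.</p>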
<blockquote>CPIAbot academic papers:</blockquote><blockquote>1. F. Ravicchio, G. Robino, G. Trentin.<br><strong>CPIAbot: un chatbot nell’insegnamento dell’Italiano L2 per stranieri</strong>. 2019.<br>Published in the Didamatica 2019 proceedings; Best Paper Award in the section “BYOD, Mobile e Mixed Learning” (ISBN 978-88-98091-50-8, <a href="https://www.aicanet.it/didamatica2019/atti-2019">https://www.aicanet.it/didamatica2019/atti-2019</a>, pp. 77-86).</blockquote><blockquote>2. F. Ravicchio, G. Robino, S. Torsani, G. Trentin.<br><strong>Un assistente conversazionale a supporto dell’apprendimento dell’italiano L2 per migranti: CPIAbot.</strong> Nov 2019.<br>Submitted to the Italian <a href="https://ijet.itd.cnr.it/">Journal of Educational Technology (IJET)</a>.</blockquote><blockquote>Related article: <a href="https://convcomp.it/stateful-alexa-skills-e9a64c10d902">Stateful Alexa Skills</a>?</blockquote><h3><strong>I’m happy to read your opinions. Please let me know about your experience!</strong></h3><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=972bfafeddfd" width="1" height="1" alt=""><hr><p><a href="https://convcomp.it/are-alexa-and-google-assistant-both-unfit-as-language-learning-assistant-inside-outside-the-972bfafeddfd">Are Alexa and Google Assistant both unfit as language learning assistant, inside/outside the…</a> was originally published in <a href="https://convcomp.it">ConvComp.it</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
    </channel>
</rss>