Can AI Sniff Like Humans?
This study investigates how well AI can interpret human descriptions of scents. Using a system named Sniff AI, 40 participants engaged in interactive tasks to help the AI guess scents based on their descriptions. Findings revealed limited perceptual alignment, with certain biases in scent identification, highlighting challenges in enhancing human-AI alignment in olfactory perception.
SNIFF AI: IS MY 'SPICY' YOUR 'SPICY'? EXPLORING LLM'S PERCEPTUAL ALIGNMENT WITH HUMAN SMELL EXPERIENCES
Shu Zhong 1 , Zetao Zhou 2 , Christopher Dawes 1 , Giada Brianz 1 , and Marianna Obrist 1
1 Department of Computer Science, University College London, United Kingdom 2 Division of Psychology and Language Sciences, University College London, United Kingdom
ABSTRACT
Aligning AI with human intent is important, yet perceptual alignment (how AI interprets what we see, hear, or smell) remains underexplored. This work focuses on olfaction, that is, human smell experiences. We conducted a user study with 40 participants to investigate how well AI can interpret human descriptions of scents. Participants performed "sniff and describe" interactive tasks, with our designed AI system attempting to guess what scent the participants were experiencing based on their descriptions. These tasks evaluated the Large Language Model's (LLM's) contextual understanding and representation of scent relationships within its internal states, namely its high-dimensional embedding space. Both quantitative and qualitative methods were used to evaluate the AI system's performance. Results indicated limited perceptual alignment, with biases towards certain scents, like lemon and peppermint, and a continued failure to identify others, like rosemary. We discuss these findings in light of human-AI alignment advancements, highlighting the limitations and opportunities for enhancing HCI systems with multisensory experience integration.
1 Introduction
Aligning Artificial Intelligence (AI) behaviour with human preference is critical for the future of AI. An important yet often overlooked aspect of this alignment is perceptual alignment. Perceptual alignment refers to the agreement between AI assessments and human subjective judgments across different sensory modalities, such as vision, hearing, taste, touch, and smell [1, 2, 3]. It enables AI to better understand the physical world as humans experience it, ensuring that AI applications are reliable and beneficial in real-world settings. For example, consider autonomous vehicles: if the "AI eye" misinterprets data from sensors such as cameras and fails to recognize obstacles or pedestrians, it poses significant safety risks [4]. Beyond safety considerations, perceptual alignment plays a critical role in everyday AI applications [3], yet olfactory alignment remains completely unexplored. Imagine a future where AI assistants are capable of controlling environmental factors like lighting and scents based on user requests. For instance, instead of asking "Alexa, play uplifting workout music", you can ask Alexa to "spice up my workout session" or "help me remember my holiday to Madrid". Here, the challenge for AI goes beyond playing music and ventures into "AI sniff", to select the ideal scent aligned with human descriptions. This poses the question of how AI would understand and interpret scents in a way that resonates with our personal sensory experiences. Or simply: Is my "spicy" AI's "spicy"?
1. The Scent Description Task: This task (Task 1 in Figure 3) tests the LLM encoder's ability to match a specific scent based on human-provided descriptions in the latent embedding space. The AI identifies the closest match within its scent embedding space by matching the internal semantic similarity of scent representations.
2. The Interactive Scent Comparison Task: This task (Task 2 in Figure 3) evaluates the LLM encoder's ability to understand and represent the transitions between different scents. This task examines whether the AI can reflect the progression from one scent to another using comparative descriptions provided by humans. For example, we examine if the vector from "the scent of mint" to "the scent of rose" in the LLM's embedding space reflects a shift from a fresh scent to a more floral one and aligns with human smell experience.
- OpenAI GPT-4 and GPT-4o [56]: gpt-4-turbo-2024-04-09 and gpt-4o-2024-08-06
- Google Gemini 1.0 and 1.5 [57]: gemini-1.0-pro and gemini-1.5-pro
- Anthropic Claude 3 and 3.5 [62]: claude-3-opus-20240229 and claude-3-5-sonnet-20240620
- [1] Shu Zhong, Elia Gatti, Youngjun Cho, and Marianna Obrist. Exploring human-ai perception alignment in sensory experiences: Do llms understand textile hand? arXiv preprint arXiv:2406.06587, 2024.
- [2] Raja Marjieh, Ilia Sucholutsky, P v Rijn, Nori Jacoby, and Thomas L Griffiths. Large language models predict human sensory judgments across six modalities. arXiv preprint arXiv:2302.01308, 2023.
- [3] Ilia Sucholutsky, Lukas Muttenthaler, Adrian Weller, Andi Peng, Andreea Bobu, Been Kim, Bradley C. Love, Erin Grant, Iris Groen, Jascha Achterberg, Joshua B. Tenenbaum, Katherine M. Collins, Katherine L. Hermann, Kerem Oktar, Klaus Greff, Martin N. Hebart, Nori Jacoby, Qiuyi Zhang, Raja Marjieh, Robert Geirhos, Sherol Chen, Simon Kornblith, Sunayana Rane, Talia Konkle, Thomas P. O'Connell, Thomas Unterthiner, Andrew K. Lampinen, Klaus-Robert Müller, Mariya Toneva, and Thomas L. Griffiths. Getting aligned on representational alignment, 2023.
- [4] Alberto Broggi, Alex Zelinsky, Ümit Özgüner, and Christian Laugier. Intelligent vehicles. In Springer Handbook of Robotics, pages 1627-1656. Springer, 2016.
- [5] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
- [6] Yinhan Liu. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
- [7] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research, 21(140):1-67, 2020.
- [8] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877-1901, 2020.
- [9] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730-27744, 2022.
- [10] Yonadav Shavit, Sandhini Agarwal, Miles Brundage, Steven Adler, Cullen O'Keefe, Rosie Campbell, Teddy Lee, Pamela Mishkin, Tyna Eloundou, Alan Hickey, et al. Practices for governing agentic ai systems. Research Paper, OpenAI, December, 2023.
- [11] Dan Hendrycks, Collin Burns, Steven Basart, Andrew Critch, Jerry Li, Dawn Song, and Jacob Steinhardt. Aligning ai with shared human values. In International Conference on Learning Representations, 2020.
- [12] Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, et al. Toy models of superposition. arXiv preprint arXiv:2209.10652, 2022.
- [13] Neel Nanda, S Rajamanoharan, J Kramár, and R Shah. Fact finding: Attempting to reverse-engineer factual recall on the neuron level. In AI Alignment Forum, 2023. URL https://www.alignmentforum.org/posts/iGuwZTHWb6DFY3sKB/fact-finding-attempting-to-reverse-engineer-factual-recall, page 19, 2023.
- [56] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
- [57] Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
- [58] Arvind Neelakantan, Tao Xu, Raul Puri, Alec Radford, Jesse Michael Han, Jerry Tworek, Qiming Yuan, Nikolas Tezak, Jong Wook Kim, Chris Hallacy, et al. Text and code embeddings by contrastive pre-training. arXiv preprint arXiv:2201.10005, 2022.
- [59] Grant Sanderson. How might llms store facts | chapter 7, deep learning, 2024.
- [60] Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. In International Conference on Machine Learning, pages 28492-28518. PMLR, 2023.
- [61] Susan T Fiske and Martha G Cox. Person concepts: The effect of target familiarity and descriptive purpose on the process of describing others. Journal of Personality, 47(1):136-161, 1979.
- [62] AI Anthropic. Introducing the next generation of claude, 2024.
- [63] Christina Strauch, Thu-Huong Hoang, Frank Angenstein, and Denise Manahan-Vaughan. Olfactory information storage engages subcortical and cortical brain regions that support valence determination. Cerebral Cortex, 32(4):689-708, 2022.
- [64] Pierre-Marie Lledo, Gilles Gheusi, and Jean-Didier Vincent. Information processing in the mammalian olfactory system. Physiological reviews, 85(1):281-317, 2005.
- [65] Mary Ann Drake and Gail Vance Civille. Flavor lexicons. Comprehensive reviews in food science and food safety, 2(1):33-40, 2003.
- [66] Manuel Zarzo. Relevant psychological dimensions in the perceptual space of perfumery odors. Food Quality and Preference, 19(3):315-322, 2008.
AI systems, and Large Language Models (LLMs) in particular, lack human-like perception; rather, they process human-provided language inputs using algorithms that run on binary systems to analyse information [5, 6, 7, 8, 9]. We place a special emphasis on LLMs as they are increasingly viewed as the interface for human interaction, with agentic AI systems and general alignment among their core research domains [10, 11]. LLMs, like any neural network-based learning system, represent concepts and ideas in what is commonly known as an embedding space: a learned internal high-dimensional vector space [12, 13]. In this space, semantically similar items cluster closely together, and the semantic differences between items are preserved [14]. This property offers a natural way to analyse how closely aligned humans and AI are. If human-AI alignment exists, two items deemed similar within the AI system's embedding space should also be perceived as similar by humans. Our work exploits exactly this approach and focuses on analyzing the model's embedding space. We leverage LLM-based embedding models to develop an AI system capable of performing the "Human sniff and describe and AI guesses" task. Here, the LLM encoder translates human language descriptors of scents into the embedding space, and the system then makes scent suggestions based on the semantic similarities measured in this vector space. Given the recent advancements in LLMs' ability to interpret human language, an intriguing question arises: can LLMs effectively understand scents based on user descriptions? For instance, will both LLMs and humans agree on Jasmine and Ylang-Ylang being perceptually similar?
To investigate this question, we conducted an in-person user study with 40 participants, who engaged in interactive tasks in which an AI system had to guess what scent they were experiencing. These participants, who were non-domain experts, were specifically chosen to reflect common perceptions of scents as conceptualized by Henning's Odour Prism [15], the Fragrance Wheel [16], and attributes such as the fresh, citrusy, and zesty qualities typically associated with lemon. We analysed the system's performance using both quantitative and qualitative methods to capture the participant feedback. Our findings suggest that scent-related semantics are represented in the embedding space, though to a limited extent. There is some degree of perceptual alignment, but it was biased toward certain scents and characteristics. For instance, sometimes the AI believed the human description of "an aromatic plant that probably you use for like stew or for like chicken and it's very green and is fresh" refers to eucalyptus rather than rosemary. Occasionally, participants were surprised by certain emergent behaviours; for instance, the AI correctly identified an "intense masculine scent" as oakmoss. We discuss these findings in light of recent advancements in LLMs and efforts towards improved human-AI alignment, with the opportunity to enhance HCI systems with multisensory experience integration.
2 Background and Related Work
2.1 Human-AI Interaction and Perceptual Alignment
Human-AI alignment involves designing, developing, and refining AI systems to understand, predict, and enhance human intentions and actions, whilst generating informative, harmless, and helpful responses [11, 17, 3]. A notable contribution is from Hendrycks et al. [11], who introduced the ETHICS dataset to evaluate language models' understanding of fundamental moral concepts, such as justice and well-being. Recent studies emphasize integrating human-in-the-loop to reduce toxicity and foster ethical behaviour [9, 18, 19].
There is growing research into representational alignment between AI and humans. Representational alignment refers to the extent to which the internal representations of two or more information processing systems are aligned [3]. Peter et al. [20] compared the performance of general-purpose embedding models, such as OpenAI's text-embedding-ada-002, with domain-specific models, such as the CLAP text-audio embedding model [21], in capturing nuances in expressive piano performances. In this particular case, all models showed a degree of alignment, and general-purpose models outperformed domain-specific ones.
Extending this exploration of human-AI alignment into sensory judgments, Lee et al. [22] introduced the VisAlign dataset, designed to evaluate the alignment between AI and human visual perception. This dataset aids in understanding how to better align AI with human perceptual processes, particularly in vision. Another example is the work of Zhong et al. [23], who investigated the agreement between humans and six vision-based Multimodal Large Language Models (MLLMs) in describing tactile qualities of textiles. They found that these models' descriptions were detached from human descriptions in both sentiment and word usage. In contrast, Marjieh et al. [2] demonstrated that GPT-4 can effectively interpret certain human sensory judgments (e.g., colour, sound and taste) based on textual sensory inputs. For example, they presented the same pair of colours (red and blue) to both humans and GPT models, requesting each to provide a similarity rating and then comparing the resulting scores. Their findings showed that GPT models produced judgments that correlated with those of humans. To the best of our knowledge, the alignment between AI and human olfaction remains unexplored. Our research aims to extend these foundational studies by integrating semantic embeddings [24] to investigate olfactory perception.
AI systems process information differently from humans, by representing concepts and ideas within a latent space or embedding space [12, 13]. These spaces allow the AI to efficiently encode, process, and manipulate complex information, capturing relationships and patterns in the data [25, 14]. Embeddings are learned representations: in an embedding space, similar items are grouped closely together, while differences are preserved through distance and direction [14, 26]. The embedding model is a fundamental component of the LLM architecture, and studying the encoder helps explore how the model interprets and organizes information at its core, in both its dimensional and categorical structure [26]. Understanding the encoder's function is crucial, as it provides insights into how the LLM structures its internal knowledge [12]. Zhong et al. [1] explored textile tactile experiences using an LLM embedding model, finding limited perceptual alignment and biases toward specific textiles. In this study, we investigate the ability of LLMs to understand human smell experiences by analysing how they represent scents within their embedding models. By examining how these models encode olfactory concepts, we aim to understand how they capture both the similarities and the relationships between different scents. This understanding can offer a more meaningful grasp of the internal workings of LLMs and potentially enhance their performance [12, 27].
2.2 Growing Importance of Smell in HCI and Experience Design
Smell has been relatively underused in HCI compared to vision and sound technologies [28, 29]. However, olfactory experiences hold the potential to significantly enhance virtual reality environments and add a new dimension to HCI applications, particularly by facilitating attention shifts [30]. In traditional audio-visual interactive media, scent is frequently used as a supplementary tool to deepen immersion [31, 32]. From an anthropological perspective, olfaction is deeply intertwined with human culture and may serve as a primary sensory modality to enrich multisensory experiences [33], especially in digital applications that emphasise voluntary engagements, such as memory recall, attention, spatial orientation, and wellbeing [28, 34, 35, 36]. Yet, many questions about human perception and experience of smell remain unanswered. What we do know, however, is that smell is uniquely linked to our emotional responses and memories [35, 32], making it a compelling modality for enhancing HCI and experience design, especially in light of AI advancements.
There is a growing interest within the HCI community in integrating olfactory devices into everyday life [31]. For example, Brooks et al. [37] introduced the "Smell & Paste" toolkit, a scratch-and-sniff prototyping tool that allows novices to quickly create personalized olfactory experiences. Additionally, toolkits like OWidgets have been developed to enable the creation and replication of olfactory experiences in HCI, beyond the traditional audio-visual design space [38].
2.3 Bridging Human and Machine Understanding of Smell Experiences
The olfactory sense presents a unique challenge for cognitive science research, particularly in the realms of perception, memory, and language [39, 40, 41]. To understand olfaction, researchers employ a variety of behavioural, neurophysiological, computational, and theoretical methods to study how smells are perceived and represented in the brain and language [42, 43, 44, 45, 46, 39]. From a human perspective, conveying smell experiences is more challenging than describing colours, which benefit from a standardized vocabulary (e.g., adjectives like "red" and hex codes). Scents lack an equivalent standardized language [47], making it challenging for the general public, without specialised knowledge, to identify scents based solely on chemical names [48]. To address this, classification systems such as Henning's Odour Prism [15] and the Fragrance Wheel [16] organize scents based on their olfactory characteristics.
Building on the challenges of scent categorization, recent linguistic and AI research has explored the connection between molecular structures and odour perception or multimodal representation [49, 50, 51, 52]. Studies like the DREAM Olfaction Prediction Challenge by Keller et al. [50] have aimed to understand how humans perceive different molecules as scents. Similarly, the Odeuropa project [52] has significantly contributed to digital heritage by creating a smell-linguistic odour dataset [51] and launching a multimodal data challenge that integrates vision and text to categorize sniff behaviours in digital heritage [53]. Lee et al. [49] used a Graph Neural Network (GNN) to predict olfactory descriptors from molecular structures, showing AI's growing capability in olfaction. Additionally, mainstream AI in this field often utilizes electronic noses (e-nose) with Interdigitated Electrode (IDE) structures and Molecular Imprinted Polymer (MIP) sensors for detecting specific chemicals such as limonene [54]. These developments indicate that with sufficient data, AI models have the potential to match or even exceed human capabilities in olfactory perception, whereas the smell experience in existing state-of-the-art models, like LLMs, remains underexplored.
Previous studies have used word embeddings to learn sensory description languages in natural language processing [55]. However, word embeddings often fail to capture the nuanced sentiment and contextual meanings of entire texts, as they represent words as static vectors without considering variations in usage across different contexts [27]. In addition, the nature of scent is multidimensional, extending beyond a single descriptor. To address these complexities, our work uses sentence-level embeddings generated from LLM encoders. LLM encoders process entire sentences rather than individual words, allowing them to capture semantic meaning rather than focusing solely on lexical or stylistic elements. This results in context-aware embeddings that adapt to the surrounding text, providing a richer and more comprehensive analysis of the content's sentiment and meaning [5, 8].
3 The Sniff AI System Design
This section introduces Sniff AI, our system to investigate whether LLMs can effectively understand smells based on human descriptions. We would like to understand how AI interprets human-described scent differences by analyzing the vector relationships in the LLM's latent embedding space. We begin by discussing how LLMs encode concepts like scent descriptions into high-dimensional vector spaces (embeddings) and provide an overview of the Sniff AI system design in Section 3.1. We then detail the design of our Sniff AI system with five core components as shown in Figure 2: Automatic Speech Recognition (ASR) and pre-built scent embedding generation (described in Section 3.2), mapping human smell experiences into the AI's embedding space (Section 3.3), the AI guessing mechanism (Section 3.4), and scent delivery (Section 3.5). We also discuss how we selected the scents in Section 3.5.
3.1 Scent Embedding Space and the Sniff AI System Overview
AI systems "think" differently from humans, as they represent concepts and ideas within an embedding space [12, 13]. While decoder-only models like GPT-4 [56] and Gemini [57] generate output in an auto-regressive fashion, encoder-based models, or embedding models, are fundamental for understanding how the model interprets and organizes information [58, 26]. For example, in an LLM embedding space, the distance between "man" and "woman" is comparable to that between "king" and "queen", reflecting their analogous relationships. The direction between vectors in the model's embedding space can also reflect semantic differences and similarities, such as the shift from "man" to "woman" mirroring the shift from "king" to "queen", both indicating a change in gender [59]. By studying how LLMs encode human scent descriptions, we gain insight into how they capture both similarities and the nature of relationships between scents.
3.1.1 Scent Embedding Space
In a scent embedding space, if AI's perception aligns with human experience, similar scents should be located near each other (as measured by a vector norm of their difference, such as an $\ell_p$ norm), while dissimilar ones should be positioned further apart, with the direction and distance between them representing their differences. For example, "lemon" and "lime" might be near each other due to their similar citrus characteristics, while "rose" and "jasmine" could be close as they are both floral. The placement, distance, and direction of these scent vectors, obtained from the LLM embedding model, should reflect meaningful olfactory relationships. If a scent is described as sweeter than "lemon," we would expect the AI to place "vanilla" closer than "peppermint", suggesting a transition from a fresh to a sweet scent. Conversely, an embedding vector from "sandalwood" to "peppermint" might indicate a transition from a warm, woody scent to a cool, minty one. These vector mappings are key to understanding how AI interprets and differentiates between various scents.
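To make the geometric intuition concrete, the following minimal Python sketch computes $\ell_p$ distances and a direction vector between scents. The three-dimensional vectors and their axis labels are invented purely for illustration; they are not values from the study, whose embeddings have thousands of unnamed dimensions.

```python
def lp_distance(u, v, p=2):
    """Distance between u and v under the l_p norm of their difference."""
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1.0 / p)

def direction(u, v):
    """Difference vector pointing from u to v."""
    return [b - a for a, b in zip(u, v)]

# Hypothetical toy vectors with made-up axes (citrus, floral, sweet).
toy = {
    "lemon":   [0.9, 0.1, 0.2],
    "lime":    [0.8, 0.1, 0.1],
    "rose":    [0.1, 0.9, 0.4],
    "vanilla": [0.1, 0.3, 0.9],
}

d_lemon_lime = lp_distance(toy["lemon"], toy["lime"])
d_lemon_rose = lp_distance(toy["lemon"], toy["rose"])

# In this toy space, moving from lemon to vanilla lowers "citrus" and raises "sweet".
lemon_to_vanilla = direction(toy["lemon"], toy["vanilla"])

print(d_lemon_lime < d_lemon_rose)  # lemon sits nearer to lime than to rose
```

Under perceptual alignment one would expect the same ordering of distances to hold in the real embedding space, which is what the study's tasks probe.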
3.1.2 An overview of the "Sniff AI" study and system
We conducted a user study in which we used the Sniff AI system to identify scents based on human descriptions. The Sniff AI system recognizes human voice input, predicts, and delivers scents in real-time. The study's objective is to evaluate the alignment between human sensory descriptions and the AI's scent judgement via two tasks: the Scent Description Task and the Interactive Scent Comparison Task. We describe these tasks further in Section 3.3 and the detailed user study in Section 4.
3.2 Pre-encoded Scent Embeddings
To evaluate the AI's ability to understand and distinguish between scents, we first need to generate scent representations in a form suitable for AI processing. We call this process pre-encoding scent embeddings. It involves converting scent-related data into numerical vectors within the LLM's high-dimensional embedding space.
We consulted and worked with domain experts to create catalogue descriptions for each scent. These catalogue descriptions provide essential information about each scent's source and composition. For example, "The scent of Rosemary is from its essential oil. The essential oil of Rosemary is extracted from the Rosmarinus Officinalis plant (CAS: 8000-25-7)". These catalogue descriptions ($x_i$) were encoded into embedding vectors using OpenAI's text-embedding-3-large model with 3072 dimensions [58], defined as:
$$v_i = f_{\text{encoder}}(x_i) \tag{1}$$
Since we have 20 scents, a total of 20 unique vectors were generated, each ($v_i$) representing a different scent. These vectors, collected as $E_{\text{scent}} = \{v_1, v_2, \ldots, v_{20}\}$, form the basis of the AI's scent knowledge and are used repeatedly throughout this study, as referenced in Table 1. Detailed information on the scents, including their concentrations, manufacturers, and catalogue descriptions, is provided in the supplementary materials.
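The pre-encoding step in Equation (1) can be sketched as a simple mapping from catalogue descriptions to vectors. The sketch below is an assumption-laden illustration: `toy_encoder` is a hypothetical hashed bag-of-words stand-in so the example runs offline, whereas the study used OpenAI's text-embedding-3-large; the lemon description is invented wording, and only the rosemary sentence is quoted from the text above.

```python
import hashlib
import math

DIM = 64  # toy dimensionality; the study's encoder outputs 3072 dimensions

def toy_encoder(text, dim=DIM):
    """Hypothetical stand-in for f_encoder: hashed bag-of-words, L2-normalised."""
    vec = [0.0] * dim
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def pre_encode(catalogue, encoder=toy_encoder):
    """Apply v_i = f_encoder(x_i) to every catalogue description x_i."""
    return {name: encoder(description) for name, description in catalogue.items()}

catalogue = {
    "rosemary": "The scent of Rosemary is from its essential oil.",
    "lemon": "The scent of Lemon is from its essential oil.",  # illustrative wording only
}

# E_scent maps each scent name to its pre-computed embedding vector.
E_scent = pre_encode(catalogue)
```

Passing the encoder as a parameter keeps the catalogue logic independent of the embedding provider, so a real embedding call could be swapped in without changing `pre_encode`.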
3.3 ASR and Mapping Human Olfactory Experiences to the AI Embedding Space
To explore the LLM's embedding space representations, we designed two interactive tasks involving human participants: the Scent Description Task (Task 1) and the Interactive Scent Comparison Task (Task 2). In both tasks, participants sniff scents and provide descriptions of their smell experiences, which are then encoded by the LLM encoder and serve as representations of the LLM's olfactory perception, as illustrated in Figure 1, with the specifics of the implementation detailed in Figure 3.
The mapping methods and AI prediction mechanisms differ slightly between the two tasks. Each participant verbally describes their smell experiences. In Task 1, participants describe the smell of a single scent, while in Task 2, they describe the differences between two different scents. These verbal descriptions were captured and transcribed into text using ASR, specifically OpenAI's whisper-1 model [60]. The transcribed texts were then encoded into numerical vectors using the same encoder, $f_{\text{encoder}}$, OpenAI's text-embedding-3-large model [58].
3.3.1 Task 1: Scent Description
Participants provided descriptions for a single scent, which were transcribed as $x^{h}_{\text{single}}$ and processed by $f_{\text{encoder}}$, resulting in a query vector $v^{h}_{\text{query}}$, calculated as follows:
$$v^{h}_{\text{query}} = f_{\text{encoder}}(x^{h}_{\text{single}}) \tag{2}$$
3.3.2 Task 2: Interactive Scent Comparison
Participants described differences between a target scent ($i_{\text{tar}}$) and a reference scent ($i_{\text{ref}}$). The AI system is initially given the embedding of the reference scent, $v_{\text{ref}} \in E_{\text{scent}}$, as the starting point for identifying the target scent. The AI uses an updated vector $v^{h}_{\text{update}}$ to make predictions, following a multi-step process. First, the descriptive differences $x^{h}_{\text{diff}}$ were encoded into a vector $v^{h}_{\text{diff}}$ using Equation 3. Then, the embedding vector of the reference scent, $v_{\text{ref}}$, was combined with $v^{h}_{\text{diff}}$. The resultant vector was normalized to obtain the updated query vector $v^{h}_{\text{update}}$:
$$v^{h}_{\text{diff}} = f_{\text{encoder}}(x^{h}_{\text{diff}}) \tag{3}$$
$$v^{h}_{\text{update}} = \frac{v_{\text{ref}} + v^{h}_{\text{diff}}}{\lVert v_{\text{ref}} + v^{h}_{\text{diff}} \rVert} \tag{4}$$
Here, $\lVert v \rVert$ is the Euclidean norm of $v$, calculated as $\sqrt{v v^{T}}$.
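The update in Equation (4) is an element-wise sum followed by normalisation to unit length. A minimal Python sketch follows; the function names and the three-dimensional example vectors are illustrative assumptions, not part of the described system.

```python
import math

def l2_norm(v):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(x * x for x in v))

def update_query(v_ref, v_diff):
    """Equation (4): add the difference embedding to the reference, then normalise."""
    combined = [r + d for r, d in zip(v_ref, v_diff)]
    norm = l2_norm(combined)
    if norm == 0.0:
        raise ValueError("v_ref + v_diff is the zero vector and cannot be normalised")
    return [c / norm for c in combined]

# Toy vectors standing in for the reference embedding and the encoded description.
v_ref = [1.0, 0.0, 0.0]
v_diff = [0.0, 1.0, 0.0]
v_update = update_query(v_ref, v_diff)
```

Because the result has unit length, only its direction matters when it is later compared with the stored scent embeddings by cosine similarity.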
3.4 The AI Guessing Mechanism
The general principle of relying on the AI's embedding to make a guess is the same in Tasks 1 and 2. The AI guessing mechanism performs information retrieval using the cosine similarity measure, defined as:
$$\hat{v}_{\text{query}} = \arg\max_{v_i \in E_{\text{scent}}} \frac{v_{\text{query}} \cdot v_i}{\lVert v_{\text{query}} \rVert \, \lVert v_i \rVert} \tag{5}$$
where $\hat{v}_{\text{query}}$ denotes the scent embedding retrieved for the query $v_{\text{query}}$.
For each prediction, the AI aims to identify a target scent $i_{\text{tar}}$ within $E_{\text{scent}}$. The system selects the most similar embedding from the set of 20 possible embedding vectors $E_{\text{scent}} = \{v_1, v_2, \ldots, v_{20}\}$:
$$\mathrm{ID} = \arg\max_{i} \, \mathrm{Cos}(v_{\text{query}}, v_i), \quad v_i \in E_{\text{scent}} \tag{6}$$
where ID represents the index of the embedding most similar to the query, and Cos computes the cosine similarity between vectors.
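The retrieval step described by Equations (5) and (6) amounts to a nearest-neighbour search under cosine similarity. The sketch below is a minimal illustration in plain Python; the function names and the toy three-dimensional vectors are assumptions made for the example, not the study's data or code.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of Euclidean norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def guess_scent(v_query, e_scent):
    """Return (scent_id, similarity) for the stored embedding closest to v_query."""
    best_id = max(e_scent, key=lambda scent_id: cosine(v_query, e_scent[scent_id]))
    return best_id, cosine(v_query, e_scent[best_id])

# Hypothetical toy embeddings; the study's set holds 20 vectors of 3072 dimensions.
e_scent = {
    "lemon":   [0.9, 0.1, 0.2],
    "rose":    [0.1, 0.9, 0.4],
    "vanilla": [0.1, 0.3, 0.9],
}

scent_id, score = guess_scent([0.7, 0.2, 0.1], e_scent)
print(scent_id)  # lemon
```

With only 20 candidates an exhaustive scan like this is sufficient; no approximate nearest-neighbour index is needed at that scale.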
3.4.1 Task 1: Scent Description
Here, $v^{h}_{\text{query}}$ from Section 3.3 is used as $v_{\text{query}}$ in Equation (5) to identify the scent embedding that most closely matches the description. The system then provides feedback by audibly announcing and visually displaying the guessed scent ID on the interface.
To initiate the task, a target scent $i_{\text{tar}}$ is allocated and diffused to the participant via the novel scent-delivery device, activated by pressing the "sniff the scent" button in our UI. Participants are instructed to describe their smell experiences aloud to the system. The AI then processes the descriptions and makes an informed guess, presenting the guessed ID both audibly and on the display interface. If the AI successfully identifies the scent, that round is completed. If not, participants can refine their descriptions, prompting the AI to make another guess based on the new description. This iterative process continues until the scent is correctly identified or a predefined limit is reached (see Section 4.3).
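The describe-guess-refine loop can be sketched as follows. This is only an illustration under stated assumptions: `max_attempts=3` is a placeholder rather than the study's actual limit (which is given in Section 4.3), and `keyword_guesser` is a hypothetical stand-in for the real pipeline of transcribing, embedding, and retrieving.

```python
def run_round(target, descriptions, guess_fn, max_attempts=3):
    """Guess from successive descriptions until the target is named or attempts run out.

    max_attempts is a placeholder value, not the limit used in the study.
    Returns (succeeded, guesses_made).
    """
    guesses = []
    for description in descriptions[:max_attempts]:
        guess = guess_fn(description)
        guesses.append(guess)
        if guess == target:
            return True, guesses
    return False, guesses

def keyword_guesser(description):
    """Hypothetical stand-in for embedding the description and retrieving a scent."""
    text = description.lower()
    if "stew" in text or "chicken" in text:
        return "rosemary"
    if "green" in text or "fresh" in text:
        return "eucalyptus"
    return "lemon"

succeeded, history = run_round(
    "rosemary",
    ["very green and fresh", "an aromatic plant used for stew"],
    keyword_guesser,
)
print(succeeded, history)  # True ['eucalyptus', 'rosemary']
```

Each guess here depends only on the latest description, matching the statement that a refined description prompts a new guess.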
3.4.2 Task 2: Interactive Scent Comparison
In Task 2, participants are involved in an interactive scent comparison. They provide multiple comparative descriptions of the target scent, $x^{h}_{\text{diff}}$, defined in Section 3.3. The AI predicts the position of the target scent in the embedding space, $E_{\text{scent}}$, using an updated vector $v^{h}_{\text{update}}$. This vector is generated by combining the embedding of the reference scent with the human-provided comparative description (Equation (4)). A cumulative description is learned. The system then delivers the guessed scent via the scent-delivery device.
Each round starts with participants receiving an initial pair of scents: a reference scent ($i_{\text{ref}}$) and a target scent ($i_{\text{tar}}$). The AI aims to identify $i_{\text{tar}}$. Participants describe the differences between the reference and target scents aloud. These descriptions are captured via ASR and processed by the system to generate a comparative description vector ($v^{h}_{\text{diff}}$), as described in Section 3.3. The updated query vector $v^{h}_{\text{update}}$ in the later rounds is calculated by integrating $v_{\text{ref}}$ with $v^{h}_{\text{diff}}$, as defined in Equation 4.
3.5 Scent Delivery and Selection
The scents were delivered using a digital scent delivery device developed by OW Smell Made Digital 3 . The device pumps air through individually separated scent channels. We added 250 microlitres of each scent to a cellulose sponge (25mm x 10mm x 1mm) placed into the device. The level of fragrance was determined based on previous studies using the same device [34].
(b) A close view of the device and user interface.
We investigate smell experiences using the Fragrance Wheel [16], a well-established scent classification system based on scent profile characteristics. For our study, we selected 20 scents, shown in Table 1, representing a diverse range of profiles by choosing five scents from each of the four Fragrance Wheel families: Floral, Fresh, Woody, and Oriental [16]. Careful consideration was given to ensure a balanced variety of scents within each category to avoid making the AI's task too simple, which could lead to ceiling effects. For example, if participants describe the only floral scent as "floral", the AI would immediately guess correctly. However, by having multiple but distinct floral scents, such as rose, geranium, and lavender, the task becomes more challenging. On the other hand, we avoided focusing exclusively on a single scent family (e.g., only florals) to prevent the task from becoming too difficult and to ensure the implications for human-AI alignment extend beyond just one scent family. We cross-referenced our selection with the literature presented in Section 2.3 on the Fragrance Wheel [16]. To maintain authenticity, all scents used were natural essential oils and extracts, so that each fragrance closely matched its label (e.g., the lemon scent was derived from lemon peel oil extract). Additionally, we sourced fragrances from the same supplier whenever possible to maintain consistency across scent quality and formulation.
4 User Study
The user study, as described before, utilized the Sniff AI system to identify scents based on human descriptions, where participants sniffed and provided detailed descriptions of the scents. Our aim was to evaluate the alignment between human sensory descriptions and the AI's scent judgements through two tasks: the Scent Description Task (Task 1) and the Interactive Scent Comparison Task (Task 2). Both tasks were designed to assess how well the AI could understand and interpret human olfactory experiences. Task 2, in particular, focused more specifically on evaluating the alignment between human and AI olfactory perception in their respective representations, as discussed in Section 3. The study employed a mixed-methods approach using a repeated-measures, within-subjects design to ensure a robust evaluation of AI's ability to capture human scent perception.
4.1 Participants
We aimed to explore the integration of smell experiences into daily activities through natural interactions. Therefore, we recruited members of the general public rather than domain experts for our AI alignment task, as detailed in Section 2.3. A total of 40 participants (22 female, 18 male; aged 19-50, mean = 28.95, SD = 5.99) took part in an in-person user study. None of the participants had any sensory impairments that could affect their olfactory perception. The participants came from 14 countries across four continents: Europe (21), Asia (17), North America (1), and Oceania (1), and all were native or highly proficient English speakers. Their professions were diverse, including computer scientists (6), IT professionals and engineers (6), HCI and psychology students (4), bio/medical students (4), neuroscience researchers (3), local government workers (2), and a psychotherapist, among others. Participants provided written informed consent before participating in the 60-minute study and were compensated with a gift voucher. The study was approved by the local University's Research Ethics Committee.
4.2 Study Set-up and Procedure
We hosted the Sniff AI system on a local machine (13in, MacBook Pro Intel Core i7), allowing participants to complete tasks autonomously. Participants were provided with on-screen instructions to initiate the study, diffuse the scent, sniff it, and then describe it to the system. Participants also provided their subjective judgments via questions and rating options available on the interface, as shown in Figure 4b. The detailed procedure of the study is illustrated in Figure 5.
Pre-screening and Setup Before joining the study, participants completed a pre-screening survey to ensure they had no olfactory or speaking impairments that could affect their participation. Upon arrival, they filled out a demographics questionnaire to collect essential background information. Participants were then introduced to the scent delivery device, as shown in Figure 4a, and instructed to maintain a distance of approximately 20 cm from the output nozzle. Each scent was delivered for ten seconds upon activation via an interface button to ensure standardized exposure throughout the study. To ensure a balanced evaluation, we limited the number of guesses the AI system could make in each round to three for Task 1 and five for Task 2. This threshold was established based on internal pilot testing, aimed at balancing the AI's opportunities with maintaining participant engagement.
Participants were reminded that the task focused on investigating AI's ability to learn about human sensory descriptions, rather than the human ability to recognize scents. They were instructed to avoid naming the scent (e.g., lime) in their descriptions and instead focus on describing the characteristics of the scent they experienced (e.g., sweet, punchy), how it made them feel (e.g., pleasant, unpleasant), and any other sensory qualities they observed.
Task 1: Scent Description Each participant completed two rounds of Task 1. In each round, participants sniffed a target scent i tar delivered via the novel scent-delivery device and described their olfactory experiences aloud to the Sniff AI system. The AI processed these descriptions and made an informed guess, presenting the results audibly and visually through an interface. Participants rated the scent on intensity and familiarity after the first sniff. If the AI correctly identified the scent, the round was considered complete. If the AI's guess was incorrect, participants had the opportunity to refine their descriptions, allowing the AI system to make up to three attempts in total per round. Subsequently, a different scent was introduced to continue with the next round.
Task 2: Interactive Scent Comparison Participants completed four rounds of the Interactive Scent Comparison Task, each involving a new pair of reference and target scents. After sniffing both scents, participants described their differences and rated each scent on familiarity, intensity, and similarity. The AI then attempted to identify the target scent by asking, "Is this your target scent?" and delivering its guessed scent through the device. If the AI's prediction was correct, the round was completed, with the validity and similarity scores automatically recorded as 10. If the guess was incorrect, the mistakenly guessed scent became the new reference for subsequent descriptions and ratings. Both similarity and validity scores are key metrics for evaluating human-AI perceptual alignment. Definitions of these metrics are detailed in Section 4.3.
Interview After completing both tasks, participants were invited to a follow-up interview.
4.3 Evaluating AI Sniff Performance
The study used a mixed-methods approach with a repeated-measures design to evaluate the performance of the LLM encoder. To ensure comprehensive coverage across scents and their families and minimize comparison biases, Task 1 involved randomly assigning two target scents to each participant while counterbalancing the total number of scents tested. Task 2 utilised a Latin Square arrangement with additional intra-group sequencing. This structured approach helped effectively control confounding variables through deliberate pairing and repeated-measures design, ensuring that each scent family, as well as each scent, was equally represented in the total number of trials. To measure the degree of perceptual alignment between humans and the AI model, we used the following evaluation metrics.
AI Success Rate The success rate was calculated by dividing the number of correct predictions by the total number of rounds in each task. This metric aims to quantify the AI's effectiveness in accurately identifying scents.
Validity Scores In Task 2, each AI guess receives a validity score, measuring how well the guess matches what the participant expected given their description. A higher score indicates the AI accurately understood the user's input and chose a relevant scent. The validity of each AI guess is rated on an 11-point scale, ranging from 0 ("Completely Invalid") to 10 ("Completely Valid"). Correct predictions automatically receive a score of 10, while participants rate incorrect predictions.
Similarity Scores In Task 2, each pair of reference scent and target scent (including the initial reference) receives a similarity score, with which participants evaluate the perceived similarity between them. This score reflects how closely the AI's guess matches the actual scent's characteristics. The similarity of each pair is rated on an 11-point scale, ranging from 0 ("Not Similar at All") to 10 ("Completely Similar"). As with validity, correct predictions automatically receive a score of 10, and participants rate incorrect predictions. Our AI system learns from cumulative descriptions to predict a target scent (see Section 3.4), with each similarity score representing a different scent pair. By analyzing these scores when the AI guesses incorrectly, we can assess whether its guesses move progressively closer to the target scent, even when not perfectly accurate.
Familiarity and Intensity Scores In scent-related studies, familiarity and intensity are commonly used metrics. Familiarity refers to how well participants recognize a scent, while intensity measures the perceived strength of the scent. We hypothesize that a higher familiarity with a scent may lead to a higher success rate in identifying or describing it, as people tend to describe things more accurately when they are familiar with them [61].
Qualitative Insights from Semi-Structured Interviews After completing the tasks, participants took part in semistructured interviews to provide qualitative feedback on their experience with the AI system. These interviews explored their perceptions of the AI's performance, its alignment with human scent perception, areas for improvement, and their vision for future human-AI scent interactions. The questions asked during the interviews are provided in the Supplementary Materials.
5 Study Results
We first present the overall performance of the Sniff AI system in both tasks. In Task 1 (Scent Description), the AI system made 213 guesses in 80 rounds, averaging 2.66 guesses per round (SD = 0.67). In Task 2 (Interactive Scent Comparison), it made 648 guesses in 160 rounds, averaging 4.05 guesses per round (SD = 1.47). Next, we provide a detailed analysis of scent-specific performance, categorizing results into scent families (e.g., Fresh) as shown in Table 1, and exploring the validity scores for the guessed scents and the similarity between the reference and target scents as rated by human participants. Finally, we present qualitative insights from the interviews, capturing participants' subjective experiences and feedback on the AI guesses.
5.1 Success Rates for Sniff AI Tasks
5.1.1 Overall performance
The AI's overall performance was measured by its success rates across the two tasks. The success rate was calculated as the number of correct predictions divided by the total rounds for each task. In Task 1 (Scent Description), the AI system correctly identified the target scent in 22 out of 80 rounds, achieving a success rate of 27.50% (SD = 0.29). Of these correct rounds, 9 were solved on the first guess, 9 on the second, and 4 on the third. For Task 2 (Interactive Scent Comparison), the overall success rate was 37.50% (SD = 0.23), with 60 correct predictions out of 160 rounds. Of these correct rounds, 19 were solved on the first guess, 14 on the second, 12 on the third, 10 on the fourth, and 5 on the fifth. A two-proportion z-test suggested that the performance increase in Task 2 was not statistically significant (z = 1.54, p = 0.0618).
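The figures in this subsection can be cross-checked with a short sketch. It assumes that every unsuccessful round used the full guess limit (three in Task 1, five in Task 2) and that the z-test used a pooled standard error with a one-tailed alternative; the test variant is not stated in the text, so both are assumptions. The function names are hypothetical.

```python
import math

def total_guesses(correct_by_guess, rounds, max_guesses):
    """Guesses used in successful rounds plus max_guesses for every failed round."""
    successes = sum(correct_by_guess.values())
    used_in_successes = sum(k * n for k, n in correct_by_guess.items())
    return used_in_successes + (rounds - successes) * max_guesses

def two_proportion_z(x1, n1, x2, n2):
    """Pooled two-proportion z statistic and one-tailed p-value for p2 > p1."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p2 - p1) / se
    p_one_tailed = 0.5 * math.erfc(z / math.sqrt(2))  # upper tail of N(0, 1)
    return z, p_one_tailed

# Counts of correct rounds by the guess on which they were solved (from the text).
task1_total = total_guesses({1: 9, 2: 9, 3: 4}, rounds=80, max_guesses=3)
task2_total = total_guesses({1: 19, 2: 14, 3: 12, 4: 10, 5: 5}, rounds=160, max_guesses=5)

# 22/80 correct rounds in Task 1 versus 60/160 in Task 2.
z, p = two_proportion_z(22, 80, 60, 160)

print(task1_total, task2_total, round(z, 2), round(p, 4))
```

Under these assumptions the sketch recovers the 213 and 648 total guesses reported in Section 5, and the z statistic and p-value agree with the reported z = 1.54 and p = 0.0618.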
We then categorized this rate per scent family as detailed in Table 2. For Task 1, the success rates per scent family (Fresh, Floral, Oriental, and Woody) were 40.00%, 30.00%, 25.00% and 15.00%, respectively. To determine the statistical significance of these differences, we conducted a Chi-Squared test, resulting in χ² = 8.43 (p = 0.75). Task 2 exhibited rates of 45.00%, 42.50%, 40.00% and 22.50% for the same categories, with a Chi-Squared test resulting in χ² = 16 (p = 0.38). No statistically significant difference was found between families.
(a) Success rates for each scent in Task 1.
(b) Success rates for each scent in Task 2.
5.1.2 Scent-specific Performance
Turning to the individual scents, we present the descriptive results summarised in Figure 6a, Figure 6b and Table 2. As shown, success rates differed widely. For example, in Task 1, Lemon (ID 4) was identified in all rounds, whereas Rose (ID 8), Jasmin (ID 9), and Rosemary (ID 1) were never identified. In Task 2, Peppermint (ID 5) was identified in most rounds and achieved the highest success rate at 87.5%, while Rosemary (ID 1) and Black pepper (ID 19) were never identified (0%).
5.1.3 Confusion Matrices
Next, we created confusion matrices to understand how the AI was misaligned, that is, which scent it most commonly guessed when it was incorrect. The confusion matrices that compare the AI's predictions (guesses) to the actual target scents are displayed in Figure 7a for Task 1 and Figure 7b for Task 2. The horizontal axes represent the actual scent sample IDs, and the vertical axes show the AI's predicted scent IDs (guessed IDs). Each cell in the matrix represents how many times the AI predicted a specific scent ID (row) for a given actual scent (column). Darker cells indicate a higher frequency of predictions. The diagonal cells (from top left to bottom right) represent correct predictions, where the predicted scent matches the actual scent. If human-AI alignment were strong, we would expect fewer total guesses and a dark diagonal line from the top left to the bottom right of the graph.
As the scent IDs are grouped by family, a slight deviation from this pattern would suggest that incorrect guesses fell more commonly on same-family scents than on scents from a different family. From these graphs we see, for example, a bias for the AI model to incorrectly guess Peppermint (ID 5) for the actual targets of Rosemary (ID 1), Eucalyptus (ID 2), Pine (ID 16) and Black pepper (ID 19), which are all fresh or trigeminal scents. Similarly, the actual scent of Gardenia (ID 7) was most commonly guessed as Rose (ID 8).
Additionally, some rows show higher totals, indicating a bias in the AI's predictions towards certain scents. For example, in Task 1, the AI never guessed Rosemary (ID 1) but guessed Peppermint (ID 5) 30 times. In Task 2, the AI predicted Peppermint (ID 5) 137 times, while it made very few guesses for Cinnamon (ID 20), with only 5 predictions.
5.2 Familiarity Scores
To determine whether the success rate is related to participants' familiarity with the scent, we analyzed the relationship between familiarity scores and success rates for the target scents in both tasks using Pearson correlation. In Task 1, the Pearson correlation showed a moderate but non-significant positive correlation (Pearson r = 0.4004, n = 20, p = 0.0802). Task 2 exhibited a weak, non-significant negative correlation (Pearson r = -0.1577, n = 20, p = 0.5068), suggesting no reliable linear relationship.
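As a worked illustration of why neither correlation reaches significance, the reported r values can be converted to t statistics with n - 2 = 18 degrees of freedom. This is the standard significance test for a Pearson correlation and is assumed, not confirmed, to be the one used here. The function name `pearson_t` is hypothetical.

```python
import math

def pearson_t(r, n):
    """t statistic (df = n - 2) for testing whether a Pearson r differs from zero."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

t_task1 = pearson_t(0.4004, 20)   # familiarity vs. success rate, Task 1
t_task2 = pearson_t(-0.1577, 20)  # familiarity vs. success rate, Task 2

# Two-tailed critical value of t at alpha = 0.05 with 18 degrees of freedom.
T_CRIT_DF18 = 2.101

print(round(t_task1, 2), round(t_task2, 2))
print(abs(t_task1) > T_CRIT_DF18, abs(t_task2) > T_CRIT_DF18)
```

Both statistics fall below the critical value in magnitude, consistent with the reported non-significant p-values.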
5.3 Validity Scores for Interactive Scent Comparison (Task 2) and its Relationship to the AI's Performance
The validity score assesses how well the AI's interpretations (the guessed scent's characteristics) align with the participants' expectations in relation to their given descriptors. In Task 2, the overall validity score was 5.30 (SD = 2.98), which falls between "neither valid nor invalid" and "marginally valid". A one-sample t-test revealed that this mean score was significantly different from the neutral midpoint of 5, with a minimal effect size (t(647) = 2.55, p = 0.011, d = 0.10), suggesting that users overall rated the AI's judgments as marginally valid. We then excluded the successful rounds and examined the validity scores in rounds where the AI failed on every guess; all scores fell around "marginally invalid", as illustrated in Figure 8a. Due to the non-normality of the data (Shapiro-Wilk test, all p < 0.05), we used the Friedman test, which showed no significant differences across guesses (p > 0.05).
(b) Mean similarity scores from the initial pair to each guess, with error bars representing standard deviations (rounds where the AI failed to guess correctly).
We further calculated the relationship between the average validity score and the success rate across all scents to understand how subjective perceptions of validity correlate with objective AI performance. This helps determine whether participants' perceptions of the AI's interpretations agree with its actual effectiveness. The results showed a moderate, significant positive correlation (Pearson r = 0.64, n = 20, p = 0.0026), supporting the view that subjective perceptions of validity relate positively to AI effectiveness.
5.4 Similarity Scores for Interactive Scent Comparison (Task 2)
The similarity score measures the perceptual similarity between two scents as judged by participants. We focused on the similarities across guesses and explored whether the AI model improved in providing a more closely matched scent to the user's description. The average similarity score for the AI's guesses is 5.33 (SD = 2.92, excluding the similarity between the initial target and reference), which falls between "neither similar nor dissimilar" and "marginally similar". A one-sample t-test revealed that this mean score was significantly different from the neutral midpoint of 5, with a minimal effect size (t(647) = 2.88, p = 0.004, d = 0.11), suggesting that users overall rated the guessed scents as slightly more similar than neutral.
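The effect sizes reported for the validity (Section 5.3) and similarity scores follow from the means and standard deviations. The sketch below assumes Cohen's d for a one-sample test, d = (mean - midpoint) / SD, with n = 648 ratings inferred from the 647 degrees of freedom. The function name is hypothetical, and small differences from the reported t values are expected because the inputs are rounded.

```python
import math

def one_sample_effect(mean, sd, n, mu0):
    """Cohen's d against a fixed value mu0, and the matching one-sample t statistic."""
    d = (mean - mu0) / sd
    t = d * math.sqrt(n)
    return d, t

d_validity, t_validity = one_sample_effect(5.30, 2.98, 648, 5)
d_similarity, t_similarity = one_sample_effect(5.33, 2.92, 648, 5)

print(round(d_validity, 2), round(t_validity, 2))
print(round(d_similarity, 2), round(t_similarity, 2))
```

With 648 ratings, even a difference of about a third of a scale point from the midpoint becomes statistically significant, which is why both results pair a small p-value with a minimal effect size.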
To analyze if there was an increase in similarity as the task progressed, we compared these ratings for where the AI did not correctly predict the scent, as successful rounds may have fewer than five guesses. Figure 8 shows the mean similarity scores from "initial similarity" to the "5th guess", with error bars representing standard deviations.
Due to the non-normality of the data (Shapiro-Wilk test, all p < 0.05), we used the Friedman test, which indicated significant differences across guesses (p = 0.0032). Follow-up Wilcoxon signed-rank tests with False Discovery Rate (FDR) corrections revealed that similarity scores significantly improved from the initial similarity (mean = 3.68, SD = 2.57, "marginally dissimilar") to the 1st guess (mean = 4.70, SD = 2.52, "neither similar nor dissimilar", p = 0.0051, r = -0.3084). However, scores did not improve at any of the subsequent guesses (all p > 0.05, r < -0.13), suggesting that users did not feel the AI improved in providing them a more similar scent after the initial guess.
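The FDR correction applied to the follow-up tests can be sketched as follows. The text does not name the procedure, so the widely used Benjamini-Hochberg method is assumed here, and the raw p-values in the example are hypothetical placeholders, not values from the study.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical raw p-values for five consecutive pairwise comparisons.
raw = [0.001, 0.20, 0.35, 0.60, 0.80]
adj = benjamini_hochberg(raw)
significant = [a < 0.05 for a in adj]

print([round(a, 4) for a in adj], significant)
```

In this illustrative input only the first comparison survives correction, mirroring the pattern reported above, where only the step from the initial pair to the first guess was significant.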
5.5 Interview results
We summarized the results of our semi-structured interviews in Table 3, using thematic analysis with double-blind coding by two co-authors. Four main themes were identified in a bottom-up fashion and are presented in Table 3 under the Addressed Question section, each with its own subthemes: (1) Perceived degree of human-AI alignment , which categorised participants' views into High, Medium, and Low; (2) Unaddressed factors in AI scent selection , including Emotion, Personal Experiences, and Perception; (3) User reflections on AI scent interaction , with subthemes of Difficulty in Verbalizing Experiences and Observed AI Behaviour; (4) Future prospects of Human-AI olfactory interaction . Specifically, to better understand the AI's behaviour, we identified 4 subthemes: Words, Deviations from Descriptions, Repetitive Jumping, and Miscellaneous (comments on the AI's working mechanism). To ensure consistency, we counted the frequency of each theme once per participant, even if they raised multiple points related to the same theme.

We observed varying degrees of alignment between participants' perceptions and the AI's guesses. Over half of the participants (n=30) reported lower-than-expected alignment. Many (n=18) emphasised the importance of emotional attachment and personal sensory experiences in understanding scents, areas where the AI struggled. This was particularly evident as the AI failed to capture nuances in scent categories and intensities, relying mostly on concrete terms that describe physical attributes. Consequently, AI guesses often deviated from participants' descriptions (n=21). The AI also showed a pattern of repetitive and alternating guesses, shifting between nearly correct scents and unrelated ones (n=6). Additionally, nearly half of the participants (n=19) reported difficulty verbalising their olfactory experiences due to limited vocabulary, lack of prior experience describing scents, and the transient nature of the experience. Despite these challenges, 14 participants found the interaction novel and engaging. Participants also anticipated potential future applications for AI-driven scent technologies in entertainment (n=34), healthcare (n=8), and safety regulation (n=11), although the AI's current limitations in processing subjective and culturally contextual scent descriptions were clear.
5.6 Replacing Human Participants with GenAI Models
(a) Human (dots) and LLMs (crosses) scent description t-SNE.
(b) Human smell experience description across 20 scents t-SNE.
Given the limited alignment observed in Task 1, we conducted further exploratory experiments. Specifically, we designed a simple experiment using generative language models to perform Task 1 (Scent Description) instead of human participants, with each model describing its interpretation of each selected scent. We consider the current most powerful LLM families f LLM : OpenAI GPT [56], Google Gemini [57] and Anthropic Claude [62], and compare the following models via their official APIs:
We prompt these LLMs to describe how a scent smells without mentioning the name, aiming to reflect the description of an average person, each result in f LLM ( x i ) . Equation (1) would then generate a set of vectors: V LLM = { f encoder ( f LLM ( x 1 )) , ..., f encoder ( f LLM ( x 20 ))} where each value in V LLM represents an embedding vector generated by encoding the response returned by an LLM. The prompts and experimental settings are detailed in supplementary material.
To compare the descriptors generated by LLMs, V LLM , with those from humans, V human , we employ the t-distributed Stochastic Neighbor Embedding (t-SNE) algorithm, a commonly used explainable AI tool, to visually compare their embeddings in 2D. It is an unsupervised dimensionality reduction method that transforms high-dimensional Euclidean distances between data points into conditional probabilities that reflect their similarities, and then finds a low-dimensional (in our case, 2D) layout that preserves them.
We first calculate the centroid of the vectors for human descriptions of each scent. These centroids capture the average semantic space of human perceptions for each scent. We then compared and visualized these centroids with V LLM in the same embedding space using t-SNE as illustrated in Figure 9. This allows us to observe the clustering and dispersion patterns, comparing how closely AI-generated descriptions resemble human perceptions. Most GenAI-generated points are far from human points, suggesting that the AI's scent representations are not well aligned with human sensory experiences. However, for the scent "lemon," where the AI achieved a 100% success rate, the points for GenAI and human descriptions are close, becoming the only point with a high alignment.
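The centroid step described above can be sketched in a few lines. The two-dimensional vectors are hypothetical stand-ins for the high-dimensional description embeddings, and the Euclidean gap computed at the end is only an illustrative proxy for the visual distance read from the t-SNE plot, not a measure used in the study.

```python
import math

def centroid(vectors):
    """Element-wise mean of a list of equal-length embedding vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical embeddings of human descriptions, grouped by scent.
human_vectors = {
    "lemon": [[1.0, 0.0], [0.8, 0.2]],
    "rose": [[0.0, 1.0], [0.2, 0.6]],
}
# Hypothetical embeddings of one LLM's description per scent.
llm_vectors = {"lemon": [0.9, 0.2], "rose": [0.9, 0.9]}

human_centroids = {scent: centroid(vs) for scent, vs in human_vectors.items()}
gaps = {scent: euclidean(human_centroids[scent], llm_vectors[scent])
        for scent in human_centroids}

print({scent: round(g, 3) for scent, g in gaps.items()})
```

In this toy example the lemon description sits close to its human centroid while the rose description does not, which is the kind of contrast Figure 9 is meant to reveal.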
Additionally, we explore the linguistic differences between GenAI and human descriptions by analyzing term frequency. We first excluded non-substantive words (e.g. "it", "this") and study-specific terms (e.g. "feel", "smells") using Python NLTK. We then used WordCloud for visualisation. This approach emphasizes the most significant content words used in the descriptions, providing insight into the focus and variability of the language used, as depicted in Figure 10. The word cloud for humans prominently features words like "sweet", "fresh", "strong", "woody" and "reminds". The word cloud for AI highlights terms such as "undertone", "aroma", "slightly", "reminiscent" and "hint".
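The term-frequency step can be approximated with the standard library alone. The sketch below replaces the NLTK stop-word list with a small hand-written set and omits the WordCloud rendering, so it is a simplified stand-in and not the pipeline used in the study. The word lists and example sentences are hypothetical.

```python
import re
from collections import Counter

# Hypothetical stand-ins for the NLTK stop-word list and the study-specific terms.
STOPWORDS = {"it", "this", "a", "the", "and", "of", "is", "like", "very"}
STUDY_TERMS = {"feel", "feels", "smell", "smells", "scent"}

def content_word_counts(descriptions):
    """Count content words across descriptions after removing excluded terms."""
    counts = Counter()
    for text in descriptions:
        tokens = re.findall(r"[a-z']+", text.lower())
        counts.update(t for t in tokens
                      if t not in STOPWORDS and t not in STUDY_TERMS)
    return counts

# Hypothetical participant descriptions.
human_counts = content_word_counts([
    "It smells sweet and fresh",
    "This is very fresh, a strong scent",
])

print(human_counts.most_common(3))
```

The resulting counts are the kind of input from which a word cloud would be drawn, with more frequent words rendered larger.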
(a) Wordcloud for human
(b) Wordcloud for AI
6 Discussion
The primary objective of this work was to assess how well LLMs align with humans in smell experiences. We conducted a user study where participants sniffed and described scents, and an LLM-based embedding model integrated into an AI system guessed the scents in real time. Task 1 focused on assessing the LLM encoder's ability to match a specific scent based on the human-provided description during the study, in other words, its general semantic understanding of scent. In Task 2, we evaluated the LLM encoder's ability to understand and represent the relationships between different scents. The AI system demonstrated moderate success in identifying scents, with an overall success rate of 27.50% in Task 1 and 37.50% in Task 2. Our results indicate that LLMs can, to some extent, represent scent semantics within their embedding spaces, though this alignment is limited and biased toward certain scents. There is some alignment and promising potential, but significant challenges remain for LLMs to fully understand human smell experiences. Below, we discuss reasons for this limited alignment and how future work could improve performance.
6.1 AI's Performance in Understanding Scents
A basic measurement for AI's understanding of scent is through our Scent Description Task (Task 1) as discussed in Section 3.3. Our quantitative findings presented in Section 5.1 show that there is limited alignment in the LLM encoder model to comprehend smell experiences, with only a 27.50% success rate in Scent Description (Task 1). Also, most scents received only one correct guess (8 instances) or none at all (7 instances); suggesting human-AI alignment varies depending on the specific scent. Notably, the encoder frequently confused eucalyptus (ID2) with peppermint (ID5) and gardenia (ID7) with rose (ID8), misclassifying each pair five times; we further discuss this divergence in Section 6.3.
This limited performance could be due to limitations in the LLM's ability, and also to the challenges humans face in describing scents. Scents lack a standardized language and are often linked to personal experiences and memories (e.g., events, times, and people), as also suggested by our interview results in Table 3 [47, 63, 64]. People tend to describe scents by referencing their sources, using tangible objects to help others understand the smell [65, 66]. In our study, participants were instructed not to name the scent but to describe the things that came to their mind, focusing on scent characteristics. Nearly half of the participants (n=19) mentioned difficulty verbalising their smell experiences, as shown in Table 3. They employed direct, sensory-focused terminology like "fresh", "strong", and "woody", and often related the scent to what it "reminds" them of in their personal experience (see Figure 10a). These descriptions are straightforward and resonate with everyday experiences. This may make it harder for LLMs to match the scent, as these personal experiences may be scenarios the model has not seen. For example, the AI did not associate "sandalwood" with descriptions such as "an expensive candle or home incense" (P7), but it did successfully link "lavender" to "a teddy bear sleep product" (P38), as shown in Figure 1. Additionally, human descriptors focus on the immediate sensory impact: for example, describing the rose as "a little bit sweet and maybe purple" (P23; the AI misrecognised it as geranium (ID7)) or noting that "it smells like some kind of flowers. The smell is quite light, not heavy at all, and it makes me feel very relaxed and comfortable" (P29; the AI misrecognised it as gardenia (ID10)).
To further investigate the abovementioned phenomenon, we had the AI tackle Task 1 in the same way as a human participant, as detailed in Section 5.6. As shown in Figure 10, LLMs tend to utilize more abstract, expert terminology such as "undertone," which may feel detached from common everyday usage and occasionally lacks intuitive sense. We found that LLMs provide a richer narrative that is sometimes distant from typical human descriptions. For instance, GPT-4 describes a rose as "sweet and floral, like a blooming garden full of delicate petals, carrying a soft, romantic fragrance with a hint of a nutty undertone" ( gpt-4o ). Similarly, Claude-3 proposes "a delightful floral aroma that is both sweet and subtly spicy, evoking the essence of a lush garden in full bloom, with a warm, nutty undertone that adds depth" ( claude-3-opus ). This difference is important to consider, as AI-generated descriptions might not always align with how people naturally talk about scents, and mapping personal experiences to scents remains a challenge (also reflected in our interviews, Table 3). All these factors may affect the effectiveness of everyday human-AI communication. This phenomenon is likely caused by the fact that these large models are pre-trained on more professional, formal, or technical datasets. We will further discuss this in the limitation section.
Additionally, the t-SNE visualization (Figure 9) clearly demonstrates a separation between AI-generated and human-given scent descriptions. Interestingly, lemon, identified in 100% of rounds in the study, is an exception where human descriptions cluster closely to those of the AI, suggesting alignment in human-AI descriptions for this scent. Despite this anomaly, the separation highlights a significant gap in how the AI and humans perceive and describe scents. This observation aligns with findings from Zhong et al. [23], who explored AI's abilities in describing tactile sensations. Moreover, our results indicate that AI models show more variance in tasks related to scent than those involving tactile descriptions. This could be due to the models being trained on different data sources. Also, descriptions of scents vary more than tactile descriptions, leading to less consistent performance across various models.
6.2 Limited Alignment in Smell Experiences
The Interactive Scent Comparison Task (Task 2) evaluated the degree of alignment between the LLM's scent representations and human descriptions, focusing on the embedding model's dimensional structure, through comparative descriptions. Although the success rate of 37.50% marks an improvement from Task 1, it still remains moderate. We observe that LLM's performance is improved across all scent families (Fresh, Floral, Oriental, and Woody), with the most notable gains in the Floral category. This improvement could be due to participants gaining experience in communicating about scents from Task 1. Additionally, the presence of a reference point in Task 2 seemed to aid participants in describing scent differences, as suggested by our interview data. Despite these insights, the specific reasons for the marked improvement in the Floral family are not fully understood and need further investigation.
To address the challenges of articulating scent experiences discussed in Section 6.1, we collected subjective assessments, namely validity and similarity scores, during the study. For instance, when describing the comparison between geranium and pine, a participant (P30) noted, "The target scent [geranium] is less spicy, more flowery, though not pleasantly so, and more akin to nature." Based on this description, the AI predicted the scent as cedarwood and, accordingly, diffused cedarwood. Although the AI's guess was incorrect, the participant rated this prediction as significantly valid (8) but significantly dissimilar (2). In this case, the participant found the prediction valid as an interpretation of their description (cedarwood fits "less spicy, more flowery, not pleasantly, akin to nature") but not perceptually similar to the target. We still view this as an aligned case, as the AI gave a scent that largely matched the human description.
Essentially, the validity score reflects the AI's success in interpreting human descriptions, while the similarity score indicates how well the AI matches the scent's perceptual profile. Overall, participant ratings for the AI's interpretation of validity and similarity were slightly above neutral, leaning "marginally valid/similar". Although the AI occasionally aligned with participants' descriptions, its overall capability to match human scent perceptions remained limited, as validity scores hovered around marginally invalid during unsuccessful rounds. We then focused on identifying the trend for perceptual alignment concerning similarity. Our system and task design enable the AI to refine its understanding through cumulative descriptions. We tracked changes in similarity scores over time, as shown in Figure 8. Initially, there was a significant increase in similarity, followed by a decrease between the first and second guesses. The scores showed minimal iterative improvement but were not significantly different. This pattern could be attributed to our study's design, which balanced reference and target pairs through intra-group sequencing. Each participant encounters three pairs from different groups and one from the same group. Although the AI's guesses progressed to be more perceptually similar to the target, the increased number of guesses needed for a correct prediction (Section 5.1.1) suggests that accuracy is most likely on the first attempt. Subsequent attempts may not necessarily enhance prediction accuracy. This pattern raises questions about the cumulative effectiveness and the LLM's ability to understand context, especially considering the lack of significant changes in similarity after the initial guess. If the AI's first output is inaccurately directed, it becomes challenging to provide progressive and cumulative contrastive explanations that guide the AI toward the correct answer.
More importantly, olfaction alignment is relatively low compared to other modalities such as vision [22] or sound [20]. Also, Marjieh et al. [2] showed that GPT-4 could effectively interpret human sensory judgments (e.g., colour, sound, and taste) based on textual inputs. This divergent performance in olfaction may stem from the underrepresentation of smell experiences in AI training data and the focus of AI on linking olfactory descriptors to chemical structures in the domain [50, 49]. As a result, there has been a surge in calls for datasets and labelling efforts aimed at empowering AI to "sniff" with olfactory descriptions, like the DREAM challenge [50] and the Odeuropa project [52].
6.3 AI Exhibits a Bias toward Specific Scents and Particular Scent Characteristics
The success rates of both tasks varied across the four scent families and individual scents, indicating the AI's biases in these tasks. The LLM encoder excels with Fresh family scents but struggles with Floral family scents. The performance also varies with Woody family scents. When considering individual scents, the AI shows high accuracy with distinctive and common scents such as lemon and peppermint. This variability in performance not only suggests a potential bias but also points to gaps in the AI's ability to consistently process different scent categories.
We first explore whether this "bias" arises from the AI's limited capability or originates from the participants themselves. Previous studies suggest people tend to describe things more accurately when they are familiar with them [61]. However, we found no significant correlation between success rate and familiarity. In some cases, the results even ran in the opposite direction; for example, in Task 1 the scent with the highest familiarity, lavender (ID 6), had only a 25% accuracy rate, while the scent with the lowest familiarity, black pepper, achieved a 50% accuracy rate (see Table 2).
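The familiarity check above can be sketched as a simple correlation between per-scent familiarity and per-scent success rate. The snippet below is only an illustration: the text does not state which correlation test was used, so Pearson's r is an assumption, and the numbers are made-up placeholders rather than values from Table 2.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Hypothetical per-scent values, NOT taken from Table 2.
familiarity = [6.1, 3.2, 5.0, 4.4, 2.8]          # mean familiarity rating
success_rate = [0.25, 0.50, 0.40, 0.10, 0.30]    # fraction of correct guesses

r = pearson(familiarity, success_rate)
print(f"r = {r:.2f}")
```

A coefficient near zero would be consistent with the observation that familiarity alone does not explain the success rate; a significance test would still be needed before drawing that conclusion from real data.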
We then investigate the confusion matrices in Figure 7 to see where the AI had difficulty distinguishing between scents. Both confusion matrices in Figure 7 present a bias in the perceptual alignment across scents: we observed a scattering of bright spots rather than a concentrated diagonal line in those square matrices. Ideally, in an instance of perfect alignment between the AI and human judgments, the confusion matrix should exhibit its highest values only on the diagonal line running from the top left to the bottom right. Our confusion matrices suggest that while the AI is somewhat aligned with human judgments for some scent families, there is significant variability, especially in distinguishing certain scent families such as Floral and Woody. For example, the AI frequently confused peppermint (ID 5) with both rosemary (ID 1) and eucalyptus (ID 2). This confusion could stem from their similar cool scent profiles. Similarly, gardenia (ID 7) was often mistaken for rose (ID 8), likely due to their perceptual similarity within floral characteristics. We also observed that the AI repeatedly guessed peppermint (ID 5) in both tasks; one participant described the AI as "obsessed with minty and non-minty" (P08). This may be attributed to peppermint's distinctive minty/fresh characteristics and its prevalence in training data. As a result, the AI often mislabels other scents as peppermint, indicating a bias toward more familiar or common descriptions.
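The diagonal-versus-scatter reading of a confusion matrix can be made concrete with a short sketch. It counts (target, guess) pairs, reports the share of rounds that fall on the diagonal, and lists the most frequent off-diagonal confusions. The round log and the helper names (`confusion_counts`, `diagonal_share`, `top_confusions`) are hypothetical and are not taken from the study data or the Sniff AI code.

```python
from collections import Counter


def confusion_counts(pairs):
    """Count (target, guess) pairs; each key is one confusion-matrix cell."""
    return Counter(pairs)


def diagonal_share(counts):
    """Fraction of all rounds in which the guess equals the target."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no rounds recorded")
    correct = sum(n for (target, guess), n in counts.items() if target == guess)
    return correct / total


def top_confusions(counts, k=3):
    """The k most frequent off-diagonal cells, ties broken alphabetically."""
    off_diagonal = [(cell, n) for cell, n in counts.items() if cell[0] != cell[1]]
    return sorted(off_diagonal, key=lambda item: (-item[1], item[0]))[:k]


# Hypothetical log of (target scent, AI guess) rounds, for illustration only.
rounds = [
    ("peppermint", "peppermint"),
    ("rosemary", "peppermint"),
    ("eucalyptus", "peppermint"),
    ("rosemary", "peppermint"),
    ("lemon", "lemon"),
    ("gardenia", "rose"),
]

counts = confusion_counts(rounds)
share = diagonal_share(counts)
worst = top_confusions(counts, k=1)
print(f"diagonal share: {share:.2f}, most frequent confusion: {worst}")
```

With perfect alignment the diagonal share would be 1.0 and the list of confusions would be empty; a guess that dominates one column, as peppermint does in this toy log, is the pattern described above.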
Interview feedback from participants also highlighted an observed bias in the AI towards particular scents, especially those that are more uniquely identifiable, such as mint and lemon. When participants included descriptors related to "minty" or "citric", the AI often defaulted to predicting peppermint or lemon, even if these descriptors were used to indicate an absence of these qualities. For instance, participants noted that the AI could "identify citrusy more accurately than other scents" (P2), that "whenever I mention citric it will give me the smell of some kind of lemon" (P17), and that "AI is quite obsessed with minty and non-minty flavour" (P08). This also confirms our finding that the LLM has a limited contextual understanding of scent profiles and relationships and leans toward literal interpretations. Conversely, the AI struggled significantly with scents like rosemary, which had a 0% success rate in both identification tasks. This suggests that certain scents may lack the distinctive characteristics that the AI can readily identify or are underrepresented in the training data. Such discrepancies highlight the AI's difficulty in recognizing less distinctive or less commonly trained scents.
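The literal-interpretation effect can be illustrated with a deliberately simple toy. The sketch below replaces the LLM embedding model with bag-of-words vectors and picks the scent whose profile has the highest cosine similarity to the description. This is an assumption made purely for illustration: the scent profiles, `embed`, `cosine`, and `guess` are hypothetical, and a real embedding model is far richer. Still, it shows how a descriptor mentioned in order to exclude a quality can pull the match toward that very scent.

```python
from collections import Counter
from math import sqrt


def embed(text):
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Hypothetical scent profiles, not the descriptions used by Sniff AI.
SCENT_PROFILES = {
    "peppermint": "minty cool fresh sharp",
    "lemon": "citrus sour fresh bright",
    "rosemary": "herbal woody green savoury",
}


def guess(description):
    """Return the scent whose profile is most similar to the description."""
    query = embed(description)
    return max(SCENT_PROFILES, key=lambda name: cosine(query, embed(SCENT_PROFILES[name])))


# The description negates mint, yet the shared words still favour peppermint.
print(guess("less minty and not cool"))
```

Because word overlap ignores "less" and "not", the negated description still lands on peppermint, which mirrors the "minty and non-minty" behaviour participants reported. Whether the actual embedding model fails for the same reason is not established by this toy.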
7 Limitations and future work
This study provides insights into the AI's capabilities and current limitations in scent recognition. However, it is important to acknowledge several challenges that might affect the broader applicability of these findings. First, we have a limited scent sample size that is based on the Fragrance Wheel [16]. Future work is needed to extend the sample selection and explore a larger diversity of scents. This could be done by combining in-person studies, as in our work, with online surveys, where more descriptions of people's smell experiences can be collected [41].
Second, this study primarily recruited non-experts to reflect the AI's intended use by the general public, focusing on human-like rather than "textbook" language. This only represents the general population and not necessarily professionals, limiting the generalizability of the conclusions. Future work could compare the same settings with both experts and non-experts to better understand how expertise influences AI's performance in scent recognition.
Third, to enhance AI's capabilities in processing scent-related language, it is crucial to enrich training data with a diverse array of descriptive terms and emotional expressions associated with scents. This aims to reduce biases and limitations evident in LLMs. Current research in computer science typically focuses on chemical descriptors, which are not readily applicable to everyday uses by the general public. Future studies could improve by integrating multimodal data inputs, and combining chemical scent analysis with human descriptions to create a more comprehensive AI-based scent prediction framework. Furthermore, employing Human-in-the-Loop strategies like Reinforcement Learning from Human Feedback (RLHF) [9] can enable AI systems to adapt and improve their predictions continuously based on user feedback, gradually increasing their effectiveness and relevance. Last but not least, there is growing interest in olfactory experiences within the HCI community [29, 32, 35].
8 Conclusion
In this work, we explored human-AI perceptual alignment in smell experiences. We developed an AI system named Sniff AI, which leverages an LLM embedding model. Sniff AI recognizes human voice input, predicts scents, and delivers them in real time. To investigate how well the LLM encoder aligns with human perception, we conducted a user study involving 40 participants who interacted with the Sniff AI system. During the study, participants sniffed various scents and described them, whilst the AI attempted to identify them based on their descriptions. Our findings indicate that while the LLM's embedding space captures scent-related semantics, it exhibits limited accuracy and a bias toward certain scents. Additionally, participants responded positively to their interactions with Sniff AI, describing the experience as "very fun" and "interesting", and appreciating the novel ways of exploring scents. Feedback highlighted the potential uses of scent-focused AI across entertainment, healthcare, and security sectors. For instance, AI might assist in choosing perfumes, aid those with olfactory impairments, or even detect substances in security settings. These findings and insights highlight both the potential and challenges of AI in aligning with human sensory experiences related to smell. Existing AI models demonstrate capability in recognizing distinct scents from descriptions; however, they encounter significant limitations in processing nuanced or subjective scent descriptions. This reveals critical avenues for future research to enhance AI's understanding of and interaction with human olfactory experiences.
References


![Sniff AI
[14] Geoffrey E Hinton and Sam Roweis. Stochastic neighbor embedding. Advances in neural …](https://d2z384uprhdr6y.cloudfront.net/GTYB4JUgiD9FwLaHftayWnAu6J9SuP7gKSiSnK5yi2w/rt:fill/q:100/w:1280/h:0/gravity:sm/czM6Ly9qYXVudC1wcm9kdWN0aW9uLXVwbG9hZHMvMjAyNC8xMS8yMi8yN2FjNzQ0OS1jNWQxLTQ5MWMtYmQ4MC0yNGVhZjE3MzRkYzkvc2xpZGVfMjAtbC53ZWJw.webp)
![Sniff AI
[35] Emanuela Maggioni, Robert Cobden, Dmitrijs Dmitrenko, Kasper Hornbæk, and Marianna O…](https://d2z384uprhdr6y.cloudfront.net/UtfEweq8xgDIMiUhf5EgB6NZFF3tgb4RuV97XCRcjN8/rt:fill/q:100/w:1280/h:0/gravity:sm/czM6Ly9qYXVudC1wcm9kdWN0aW9uLXVwbG9hZHMvMjAyNC8xMS8yMi8yN2FjNzQ0OS1jNWQxLTQ5MWMtYmQ4MC0yNGVhZjE3MzRkYzkvc2xpZGVfMjEtbC53ZWJw.webp)
![Sniff AI
[56] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leon…](https://d2z384uprhdr6y.cloudfront.net/8ogj7ge2sqB4vWSXEcoGORlELR33-OuwG5mfq3MdDYY/rt:fill/q:100/w:1280/h:0/gravity:sm/czM6Ly9qYXVudC1wcm9kdWN0aW9uLXVwbG9hZHMvMjAyNC8xMS8yMi8yN2FjNzQ0OS1jNWQxLTQ5MWMtYmQ4MC0yNGVhZjE3MzRkYzkvc2xpZGVfMjItbC53ZWJw.webp)