Can AI Sniff Like Humans?

AI Summary

Key Insights

This study examines the perceptual alignment between Large Language Models (LLMs) and human smell experiences using the Sniff AI system.
A user study involving 40 participants was conducted, where the AI attempted to identify scents based on human descriptions in two interactive tasks.
The findings revealed limited perceptual alignment and biases toward specific scents like lemon and peppermint. AI showed higher accuracy with distinct/common scents. The research also found that participants' familiarity with a scent didn't impact the success rate or AI's ability to guess.
LLMs struggle with the subjective descriptions provided by users and perform better with the distinct chemical profiles associated with scents.
Future research directions, such as enriching the training data and focusing more on emotional factors, are discussed as ways to improve the alignment of LLMs with human perceptual understanding.

SNIFF AI: IS MY ‘SPICY’ YOUR ‘SPICY’? EXPLORING LLM’S
PERCEPTUAL ALIGNMENT WITH HUMAN SMELL EXPERIENCES
Shu Zhong1, Zetao Zhou2, Christopher Dawes1, Giada Brianza1, and Marianna Obrist1
1Department of Computer Science, University College London, United Kingdom
2Division of Psychology and Language Sciences, University College London, United Kingdom
Figure 1: SniffAI: an overview of user study and examples from the user study.
ABSTRACT
Aligning AI with human intent is important, yet perceptual alignment—how AI interprets what
we see, hear, or smell—remains underexplored. This work focuses on olfaction, human smell
experiences. We conducted a user study with 40 participants to investigate how well AI can interpret
human descriptions of scents. Participants performed "sniff and describe" interactive tasks, with our
designed AI system attempting to guess what scent the participants were experiencing based on their
descriptions. These tasks evaluated the Large Language Model’s (LLM’s) contextual understanding
and representation of scent relationships within its internal state, a high-dimensional embedding
space. Both quantitative and qualitative methods were used to evaluate the AI system’s performance.
Results indicated limited perceptual alignment, with biases towards certain scents, like lemon and
peppermint, and repeated failures to identify others, like rosemary. We discuss these findings in light
of human-AI alignment advancements, highlighting the limitations and opportunities for enhancing
HCI systems with multisensory experience integration.
1 Introduction
Aligning artificial intelligence (AI) behaviour with human preferences is critical for the future of AI. An important yet
often overlooked aspect of this alignment is perceptual alignment. Perceptual alignment refers to the agreement
between AI assessments and human subjective judgments across different sensory modalities, such as vision, hearing,
arXiv:2411.06950v1 [cs.CL] 11 Nov 2024

taste, touch, and smell [1, 2, 3]. It enables AI to better understand the physical world as humans experience it, ensuring
that AI applications are reliable and beneficial in real-world settings. For example, consider autonomous vehicles: if
the "AI eye" misinterprets data from sensors such as cameras and fails to recognize obstacles or pedestrians, it poses
significant safety risks [4]. Beyond safety considerations, perceptual alignment plays a critical role in everyday AI
applications [3], yet olfactory alignment remains unexplored. Imagine a future where AI assistants
are capable of controlling environmental factors like lighting and scents based on user requests. For instance, instead
of asking "Alexa, play uplifting workout music", you can ask Alexa to "spice up my workout session" or "help me
remember my holiday to Madrid". Here, the challenge for AI goes beyond playing music and ventures into "AI sniff",
to select the ideal scent aligned with human descriptions. This poses the question of how AI would understand and
interpret scents in a way that resonates with our personal sensory experiences. Or simply: Is my "spicy" AI’s "spicy"?
AI systems, most prominently Large Language Models (LLMs), lack human-like perception; rather, they process human-provided
language inputs using algorithms that function on binary systems to analyse information [5, 6, 7, 8, 9]. We place a
special emphasis on LLMs as they are increasingly viewed as the interface for human interaction, with agentic AI
systems and general alignment among their core research domains [10, 11]. LLMs, or any neural network-based
learning systems, represent concepts and ideas in what is commonly known as an embedding space – a learned internal
high-dimensional vector space [12, 13]. In this space, semantically similar items cluster closely together, and the
semantic differences between items are preserved [14]. This property gives us a natural way to analyse how closely aligned
humans and AI are. If two items are deemed similar within the AI system’s embedding space, humans should also
perceive them as similar, if human-AI alignment exists. Our work exploits exactly this approach and focuses on
analyzing the model’s embedding space. We leverage LLM-based embedding models to develop an AI system capable
of performing the "Human sniff and describe and AI guesses" task. Here, the LLM encoder translates human language
descriptors of scents into the embedding space, and the system then makes scent suggestions based on the semantic
similarities measured in this vector space. Given the recent advancements in LLMs’ ability to interpret human language,
an intriguing question arises: can LLMs effectively understand scents based on user descriptions? For instance, will
both LLMs and humans agree on Jasmine and Ylang-Ylang being perceptually similar?
To investigate this question, we conducted an in-person user study with 40 participants, who engaged in
interactive tasks in which an AI system had to guess what scent they were experiencing. These participants, who were
non-domain experts, were specifically chosen to reflect common perceptions of scents as conceptualized by Henning’s
Odour Prism [15], the Fragrance Wheel [16], and attributes such as the fresh, citrusy, and zesty qualities typically
associated with lemon. We analysed the system’s performance using both quantitative and qualitative methods to
capture the participant feedback. Our findings suggest that scent-related semantics are represented in the embedding
space, though to a limited extent. There is some degree of perceptual alignment, but it was biased toward certain scents
and characteristics. For instance, sometimes the AI believed the human description of "an aromatic plant that probably
you use for like stew or for like chicken and it’s very green and is fresh" refers to eucalyptus rather than rosemary.
Occasionally, participants were surprised by certain emergent behaviours; for instance, the AI correctly identified an
"intense masculine scent" as oakmoss. We discuss these findings in light of recent advancements of LLMs and efforts
towards improved human-AI alignment, with the opportunity to enhance HCI systems with multisensory experience
integration.
2 Background and Related Work
2.1 Human-AI Interaction and Perceptual Alignment
Human-AI alignment involves designing, developing, and refining AI systems to understand, predict, and enhance human
intentions and actions, whilst generating informative, harmless, and helpful responses [11, 17, 3]. A notable contribution
is from Hendrycks et al. [11], who introduced the ETHICS dataset to evaluate language models’ understanding of
fundamental moral concepts, such as justice and well-being. Recent studies emphasize integrating human-in-the-loop
to reduce toxicity and foster ethical behaviour [9, 18, 19].
There is growing research into representational alignment between AI and humans. Representation alignment refers to
the extent to which the internal representations of two or more information processing systems are aligned [3]. Peter et al.
[20] compared the performance of general-purpose embedding models, such as OpenAI’s text-embedding-ada-002,
with domain-specific models, such as the CLAP text-audio embedding model [21], in capturing nuances in expressive
piano performances. In this particular case, all models showed a degree of alignment, and general-purpose models
outperformed domain-specific ones.
Extending this exploration of Human-AI alignment into sensory judgments, Lee et al. [22] introduced the VisAlign
dataset, designed to evaluate the alignment between AI and human visual perception. This dataset aids in understanding
how to better align AI with human perceptual processes, particularly in vision. Another example is the study
done by Zhong et al. [23], who investigated the agreement between humans and six vision-based Multimodal Large
Language Models (MLLMs) in describing tactile qualities of textiles. They found that these models’ descriptions were
detached from human descriptions in both sentiment and word usage. In contrast, Marjieh et al. [2] demonstrated
that GPT-4 can effectively interpret certain human sensory judgments (e.g., colour, sound and taste) based on textual
sensory inputs. For example, they presented the same pair of colours (red and blue) to both humans and GPT models1,
requesting each to provide a similarity rating and then comparing the resulting scores. Their findings showed that GPT
models produced judgments that correlated with those of humans. To the best of our knowledge, the alignment between
AI and human olfaction remains unexplored. Our research aims to extend these foundational studies by integrating
semantic embeddings [24] to investigate olfactory perception.
AI systems process information differently from humans, by representing concepts and ideas within a latent space
or embedding space [12, 13]. These spaces allow the AI to efficiently encode, process, and manipulate complex
information, capturing relationships and patterns in the data [25, 14]. In embedding spaces, embeddings are learned
representations: similar items are grouped closely together, while differences are preserved through distance and
direction [14, 26]. The embedding model is a fundamental component of the LLM architecture, and studying the
encoder helps explore how the model interprets and organizes information at its core, in both its dimensional and categorical
structure [26]. Understanding the encoder’s function is crucial, as it provides insights into how the LLM structures its
internal knowledge [12]. Zhong et al. [1] explored textile tactile experiences using an LLM embedding model, finding
limited perception alignment and biases toward specific textiles. In this study, we investigate the ability of LLMs
to understand human smell experiences by analysing how they represent scents within their embedding models. By
examining how these models encode olfactory concepts, we aim to understand how they capture both the similarities and
the relationships between different scents. This understanding of LLMs’ internal processes can offer a more meaningful
grasp of the internal workings of LLMs and potentially enhance their performance [12, 27].
2.2 Growing Importance of Smell in HCI and Experience Design
The integration of olfactory experiences in HCI remains relatively underexplored compared to vision and sound
technologies [28, 29]. However, olfactory experiences hold the potential to significantly enhance virtual reality
environments and add a new dimension to HCI applications, particularly by facilitating attention shifts [30]. In
traditional audio-visual interactive media, scent is frequently used as a supplementary tool to deepen immersion [31, 32].
From an anthropological perspective, olfaction is deeply intertwined with human culture and may serve as a primary
sensory modality to enrich multisensory experiences [33], especially in digital applications that emphasise voluntary
engagements, such as memory recall, attention, spatial orientation, and wellbeing [28, 34, 35, 36]. Yet, many questions
about human perception and experience of smell remain unanswered. What we do know, however, is that smell is uniquely
linked to our emotional responses and memories [35, 32], making it a compelling modality for enhancing HCI and
experience design, especially in light of AI advancements.
There is a growing interest within the HCI community in integrating olfactory devices into everyday life [31]. For
example, Brooks et al. [37] introduced the "Smell & Paste" toolkit, a scratch-and-sniff prototyping tool that allows
novices to quickly create personalized olfactory experiences. Additionally, toolkits like OWidgets have been developed
to enable the creation and replication of olfactory experiences in HCI, beyond the traditional audio-visual design space
[38].
2.3 Bridging Human and Machine Understanding of Smell Experiences
The olfactory sense presents a unique challenge for cognitive science research, particularly in the realms of perception,
memory, and language [39, 40, 41]. To understand olfaction, researchers employ a variety of behavioural, neurophysiological, computational, and theoretical methods to study how smells are perceived and represented in the brain and
language [42, 43, 44, 45, 46, 39]. From a human perspective, conveying smell experiences is more challenging than
describing colours, which benefit from a standardized vocabulary (e.g., adjectives like "red" and hex codes). Scents lack
an equivalent standardized language [47], making it challenging for the general public, without specialised knowledge,
to identify scents based solely on chemical names [48]. To address this, classification systems such as Henning’s Odour
Prism [15] and the Fragrance Wheel [16] organize scents based on their olfactory characteristics.
Building on the challenges of scent categorization, recent linguistic and AI research has explored the connection
between molecular structures and odour perception or multimodal representation [49, 50, 51, 52]. Studies like the
DREAM Olfaction Prediction Challenge by Keller et al. [50] have aimed to understand how humans perceive different
molecules as scents. Similarly, the Odeuropa project [52] has significantly contributed to digital heritage by creating
a smell-linguistic odour dataset [51] and launching a multimodal data challenge that integrates vision and text to
1 Given that GPT models lack the ability to "see", colour hex codes are provided as textual inputs.
[Figure 2 diagram labels: Human sniff and describe (direct or comparative); ASR; Mapping to AI embedding space; Pre-built scent embeddings; AI guessing mechanism; guessed scent; Scent delivery device]
Figure 2: An overview of Sniff AI System components and workflow: starting with a human describing a scent. The
description is processed by an ASR system, mapped to an AI embedding space, and compared with pre-built scent
embeddings. The AI guessing mechanism predicts the scent and then delivers the guessed scent by a scent delivery
device.
categorize sniff behaviours in digital heritage [53]. Lee et al. [49] used a Graph Neural Network (GNN) to predict
olfactory descriptors from molecular structures, showing AI’s growing capability in olfaction. Additionally, mainstream
AI in this field often utilizes electronic noses (e-nose) with Interdigitated Electrode (IDE) structures and Molecular
Imprinted Polymer (MIP) sensors for detecting specific chemicals such as limonene [54]. These developments indicate
that with sufficient data, AI models have the potential to match or even exceed human capabilities in olfactory perception,
whereas the smell experience in existing state-of-the-art models, like LLMs, remains underexplored.
Previous studies have used word embeddings to learn sensory description languages in natural language processing
[55]. However, word embeddings often fail to capture the nuanced sentiment and contextual meanings of entire texts,
as they represent words as static vectors without considering variations in usage across different contexts [27]. In
addition, the nature of scent is multidimensional, extending beyond a single descriptor. To address these complexities,
our work adopts a more sophisticated approach by using sentence-level embeddings generated from LLM encoders.
LLM encoders process entire sentences rather than individual words, allowing them to capture semantics rather than
only lexical or stylistic elements. This results in context-aware embeddings that adapt to the surrounding
text, providing a richer and more comprehensive analysis of the content’s sentiment and meaning [5, 8].
3 The Sniff AI System Design
This section introduces Sniff AI, our system to investigate whether LLMs can effectively understand smells based on
human descriptions. We would like to understand how AI interprets human-described scent differences by analyzing
the vector relationships in LLM’s latent embedding space. We begin by discussing how LLMs encode concepts like
scent descriptions into high-dimensional vector spaces (embeddings) and provide an overview of the Sniff AI system
design in Section 3.1. We then detail the design of our Sniff AI system, which has five core components
as shown in Figure 2: pre-built scent embedding generation (Section 3.2), Automatic Speech Recognition (ASR) and
the mapping of human smell experiences into the AI’s embedding space (Section 3.3), the AI guessing
mechanism (Section 3.4), and scent delivery (Section 3.5). We also discuss how we selected the scents in Section 3.5.
3.1 Scent Embedding Space and the Sniff AI System Overview
AI systems "think" differently from humans, as they represent concepts and ideas within an embedding space [12, 13].
While decoder-only models such as GPT-4 [56] and Gemini [57] generate output in an auto-regressive fashion,
encoder-based models, or embedding models, are fundamental for understanding how the model interprets and
organizes information [58, 26]. For example, in an LLM embedding space, the distance between "man" and "woman" is
comparable to that between "king" and "queen", reflecting their analogous relationships. The direction between vectors
from the model’s embedding space can also reflect semantic differences and similarities, such as the shift from "man" to
"woman" mirroring the shift from "king" to "queen", both indicating a change in gender [59]. By studying how LLMs
encode human scent descriptions, we gain insight into how they capture both similarities and the nature of relationships
between scents.
3.1.1 Scent Embedding Space
In a scent embedding space, if AI’s perception aligns with human experience, similar scents should be located near
each other (in terms of a distance between their vectors, such as one induced by an lp norm), while dissimilar ones should be positioned further apart,
with the direction and distance between them representing their differences. For example, "lemon" and "lime" might be
4

4/22

Sniff AI
near each other due to their similar citrus characteristics, while "rose" and "jasmine" could be close as they are both
floral. The placement, distance, and direction of these scent vectors, obtained from the LLM embedding model, should
reflect meaningful olfactory relationships. If a scent is described as sweeter than "lemon," we would expect the AI to place
"vanilla" closer than "peppermint", suggesting a transition from a fresh to a sweet scent. Conversely, an
embedding vector from "sandalwood" to "peppermint" might indicate a transition from a warm, woody scent to a cool,
minty one. These vector mappings are key to understanding how AI interprets and differentiates between various scents.
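The geometric intuition above can be illustrated with a minimal sketch. The three-dimensional vectors and their axes (fresh, sweet, woody) are invented for illustration only; they are not taken from the study, whose embeddings have 3072 dimensions, and the helper names are hypothetical.

```python
from typing import Dict, List

Vector = List[float]

def lp_distance(a: Vector, b: Vector, p: float = 2.0) -> float:
    """Distance induced by the l_p norm of the difference a - b."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

# Hypothetical toy vectors with axes (fresh, sweet, woody); NOT real embeddings.
toy_scents: Dict[str, Vector] = {
    "lemon": [0.9, 0.3, 0.0],
    "lime": [0.85, 0.25, 0.05],
    "vanilla": [0.1, 0.95, 0.1],
    "peppermint": [0.95, 0.05, 0.0],
    "sandalwood": [0.05, 0.3, 0.9],
}

# Similar scents sit close together, dissimilar ones further apart.
lemon_lime = lp_distance(toy_scents["lemon"], toy_scents["lime"])
lemon_sandalwood = lp_distance(toy_scents["lemon"], toy_scents["sandalwood"])

# "Sweeter than lemon": move the lemon vector along the (assumed) sweet axis.
sweeter_than_lemon = [a + b for a, b in zip(toy_scents["lemon"], [0.0, 1.0, 0.0])]
to_vanilla = lp_distance(sweeter_than_lemon, toy_scents["vanilla"])
to_peppermint = lp_distance(sweeter_than_lemon, toy_scents["peppermint"])

# Direction from sandalwood to peppermint: woody decreases, fresh increases.
offset = [p - s for p, s in zip(toy_scents["peppermint"], toy_scents["sandalwood"])]

print(lemon_lime < lemon_sandalwood, to_vanilla < to_peppermint, offset)
```

In this toy space the shifted "sweeter than lemon" point lands nearer to vanilla than to peppermint, which is the kind of relationship an aligned embedding space would be expected to show.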
3.1.2 An overview of the "Sniff AI" study and system
We conducted a user study in which we used the Sniff AI system to identify scents based on human descriptions. The Sniff
AI system recognizes human voice input, predicts, and delivers scents in real time. The study’s objective is to evaluate
the alignment between human sensory descriptions and the AI’s scent judgement via two tasks: the Scent Description
Task and the Interactive Scent Comparison Task. We describe these tasks further in Section 3.3 and the detailed user
study in Section 4.
3.2 Pre-encoded Scent Embeddings
To evaluate the AI’s ability to understand and distinguish between scents, we first need to generate scent representations
in a form suitable for AI processing. We call this process pre-encoding scent embeddings. It involves converting
scent-related data into numerical vectors within the LLM’s high-dimensional embedding space.
We consulted and worked with domain experts to create catalogue descriptions for each scent. These catalogue
descriptions provide essential information about each scent’s source and composition. For example, "The scent
of Rosemary is from its essential oil. The essential oil of Rosemary is extracted from the Rosmarinus Officinalis
plant (CAS 2: 8000-25-7)". These catalogue descriptions ($x_i$) were encoded into embedding vectors using OpenAI’s
text-embedding-3-large model with 3072 dimensions [58], defined as:
$$v_i = f_{\text{encoder}}(x_i) \tag{1}$$
A total of 20 unique vectors ($v_i$) were generated, one for each of the 20 scents. These
vectors, denoted $\mathcal{E}_{\text{scent}} = \{v_1, v_2, \ldots, v_{20}\}$, form the basis of the AI’s scent knowledge and are used repeatedly
throughout this study, as referenced in Table 1. Detailed information on the scents, including their concentrations,
manufacturers, and catalogue descriptions, is provided in the supplementary materials.
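The pre-encoding step of Equation 1 can be sketched as follows. This is a minimal illustration under stated assumptions: `toy_encoder` is a hypothetical hashed bag-of-words stand-in for the encoder, not the text-embedding-3-large model used in the study, and the lemon description is invented for the example (only the Rosemary sentence is quoted from the text above).

```python
import math
import zlib
from typing import Callable, Dict, List

Vector = List[float]

def toy_encoder(text: str, dim: int = 64) -> Vector:
    """Hypothetical stand-in for f_encoder: hashed bag-of-words, L2-normalised."""
    vec = [0.0] * dim
    for word in text.lower().split():
        # crc32 is deterministic across runs, unlike Python's built-in hash().
        vec[zlib.crc32(word.encode("utf-8")) % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def build_scent_embeddings(
    catalogue: Dict[str, str], encoder: Callable[[str], Vector]
) -> Dict[str, Vector]:
    """Apply v_i = f_encoder(x_i) to every catalogue description x_i."""
    return {name: encoder(description) for name, description in catalogue.items()}

catalogue = {
    "rosemary": (
        "The scent of Rosemary is from its essential oil. The essential oil of "
        "Rosemary is extracted from the Rosmarinus Officinalis plant."
    ),
    # Illustrative description only; not taken from the study's catalogue.
    "lemon": "The scent of Lemon is from lemon peel oil extract.",
}

e_scent = build_scent_embeddings(catalogue, toy_encoder)
print({name: len(vec) for name, vec in e_scent.items()})
```

Passing the encoder as a callable keeps the sketch independent of any particular embedding service; a real encoder could be swapped in without changing `build_scent_embeddings`.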
3.3 ASR and Mapping Human Olfactory Experiences to the AI Embedding Space
To explore LLM’s embedding space representations, we designed two interactive tasks involving human participants. In
both tasks, participants sniff scents and provide descriptions of their smell experiences, which are then encoded by the
LLM encoder and serve as representations of the LLM’s olfactory perception, as illustrated in Figure 1, with the
specifics of the implementation detailed in Figure 3:
1. The Scent Description Task: This task (Task 1 in Figure 3) tests the LLM encoder’s ability to match a specific
scent based on human-provided descriptions in the latent embedding space. The AI identifies the closest match
within its scent embedding space by matching the internal semantic similarity of scent representations.
2. The Interactive Scent Comparison Task: This task (Task 2 in Figure 3) evaluates the LLM encoder’s ability
to understand and represent the transitions between different scents. This task examines whether the AI can
reflect the progression from one scent to another using comparative descriptions provided by humans. For
example, we examine if the vector from "the scent of mint" to "the scent of rose" in the LLM’s embedding
space reflects a shift from a fresh scent to a more floral one and aligns with human smell experience.
The mapping methods and AI prediction mechanisms associated with these two distinct tasks (Task 1 and 2) then differ
slightly. Each participant verbally describes their smell experiences. In Task 1, participants describe the smell of a
single scent, while in Task 2, they describe the differences between two different scents. These verbal descriptions
were captured and transcribed into text using ASR – OpenAI’s whisper-1 model [60]. The transcribed texts were then
encoded into numerical vectors using the same encoder, $f_{\text{encoder}}$, OpenAI’s text-embedding-3-large model [58].
2Chemical Abstracts Service (CAS) registry number, a unique numerical identifier for the chemical substance.
[Figure 3 flowchart. The study starts with prescreening and demographics questionnaires and ends with an interview. Task 1 (Scent Description): the participant is assigned one target scent $i_{\text{tar}}$ and describes it aloud; Automatic Speech Recognition turns the voice into text $x^h_{\text{query}}$; the embedding model $f_{\text{encoder}}$ produces the vector $v^h_{\text{query}}$; the AI predicts a scent $i_{\text{new}}$ from $\mathcal{E}_{\text{scent}}$; the task is completed when $i_{\text{new}} = i_{\text{tar}}$ or after 3 attempts. Task 2 (Interactive Scent Comparison): the participant is assigned a reference scent $i_{\text{ref}}$ and a target scent $i_{\text{tar}}$ and describes the difference aloud; the text $x^h_{\text{diff}}$ is encoded as $v^h_{\text{diff}}$; the AI predicts $i_{\text{new}}$; if $i_{\text{new}} \neq i_{\text{tar}}$, the participant receives $i_{\text{new}}$ as the new reference and rates validity and similarity; the task is completed when $i_{\text{new}} = i_{\text{tar}}$ or after 5 attempts.]
Figure 3: An overview of the "AI sniff" method and the "Guess what scent" user study workflow.
3.3.1 Task 1: Scent Description
Participants provided descriptions for a single scent, which were transcribed as $x^h_{\text{query}}$ and processed by $f_{\text{encoder}}$, resulting in a query vector $v^h_{\text{query}}$, calculated as follows:
$$v^h_{\text{query}} = f_{\text{encoder}}(x^h_{\text{query}}) \tag{2}$$
3.3.2 Task 2: Interactive Scent Comparison
Participants described differences between a target scent ($i_{\text{tar}}$) and a reference scent ($i_{\text{ref}}$). The AI system is initially given the embedding of the reference scent, $v_{\text{ref}} \in \mathcal{E}_{\text{scent}}$, as the starting point for identifying the target scent. The AI uses an updated query vector $v^h_{\text{update}}$ to make predictions, following a multi-step process. Firstly, the descriptive differences $x^h_{\text{diff}}$ were encoded into a vector $v^h_{\text{diff}}$ using Equation 3. Then, the embedding vector of the reference scent, $v_{\text{ref}}$, was combined with $v^h_{\text{diff}}$. The resultant vector was normalized to obtain the updated query vector $v^h_{\text{update}}$:
$$v^h_{\text{diff}} = f_{\text{encoder}}(x^h_{\text{diff}}) \tag{3}$$
$$v^h_{\text{update}} = \frac{v_{\text{ref}} + v^h_{\text{diff}}}{\lVert v_{\text{ref}} + v^h_{\text{diff}} \rVert} \tag{4}$$
Here, $\lVert v \rVert$ is the Euclidean norm of $v$, calculated as $\sqrt{v v^{T}}$.
3.4 The AI Guessing Mechanism
The general principle of relying on the AI’s embedding to make a guess is the same in both Task 1 and Task 2. The AI guessing mechanism performs an information retrieval using the cosine similarity measure, where $v_{\text{query}}$ is $v^h_{\text{query}}$ in Task 1 and $v^h_{\text{update}}$ in Task 2:
$$v_{\text{predict}} = \arg\max_{v_i \in \mathcal{E}_{\text{scent}}} \frac{v_{\text{query}} \cdot v_i}{\lVert v_{\text{query}} \rVert \, \lVert v_i \rVert} \tag{5}$$
For each prediction, the AI aims to identify a target scent $i_{\text{tar}}$ within $\mathcal{E}_{\text{scent}}$. The system selects the most similar embedding from the set of 20 possible embedding vectors $\mathcal{E}_{\text{scent}} = \{v_1, v_2, \ldots, v_{20}\}$:
$$\mathrm{ID} = \max\left(\mathrm{Cos}(v_{\text{query}}, v_i) \mid v_i \in \mathcal{E}_{\text{scent}}\right) \tag{6}$$
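The query update of Equation 4 and the cosine-similarity retrieval of Equation 5 can be sketched in a few lines. This is an illustrative sketch, not the study's implementation: the two-dimensional vectors are invented toy values, and the function names `update_query` and `guess_scent` are hypothetical.

```python
import math
from typing import Dict, List

Vector = List[float]

def euclidean_norm(v: Vector) -> float:
    return math.sqrt(sum(x * x for x in v))

def cosine_similarity(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (euclidean_norm(a) * euclidean_norm(b))

def update_query(v_ref: Vector, v_diff: Vector) -> Vector:
    """Equation 4: add the difference vector to the reference and normalise."""
    combined = [r + d for r, d in zip(v_ref, v_diff)]
    norm = euclidean_norm(combined)
    if norm == 0.0:
        raise ValueError("reference and difference vectors cancel out")
    return [c / norm for c in combined]

def guess_scent(v_query: Vector, e_scent: Dict[str, Vector]) -> str:
    """Equation 5: return the scent whose embedding is most cosine-similar."""
    return max(e_scent, key=lambda name: cosine_similarity(v_query, e_scent[name]))

# Toy 2-D embeddings (invented values, not real model output).
toy_e_scent = {
    "peppermint": [1.0, 0.0],
    "rose": [0.0, 1.0],
    "lavender": [0.6, 0.8],
}

# Task 1: a single description vector is matched directly.
task1_guess = guess_scent([0.9, 0.1], toy_e_scent)

# Task 2: start from the reference scent and apply the described difference.
v_update = update_query(toy_e_scent["peppermint"], [-0.2, 1.0])
task2_guess = guess_scent(v_update, toy_e_scent)

print(task1_guess, task2_guess)
```

Because cosine similarity ignores vector length, the normalisation in `update_query` does not change which scent is retrieved; it only keeps the updated query on the unit sphere.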

(a) User study setup (b) A close view of the device and user interface.
Figure 4: Study Setups (a) A participant controls the AI guessing system, and sniffs the scent delivered from the device
inside a noise isolation box. (b) A close view of the scent delivery device without the noise isolation box (left) and the
user interface (right).
scent as "floral", the AI would immediately guess correctly. However, by having multiple but distinct floral scents,
such as rose, geranium, and lavender, the task becomes more challenging. On the other hand, we avoided focusing
exclusively on a single scent family (e.g., only florals) to prevent the task from becoming too difficult and to ensure the
implications for human-AI alignment extend beyond just one scent family. We cross-referenced our selection with the
literature presented in Section 2.3 on the Fragrance Wheel [16]. To maintain authenticity, all scents used were natural
essential oils and extracts to ensure the fragrance closely matched its label (e.g., the lemon scent was derived from
lemon peel oil extract). Additionally, we sourced fragrances from the same supplier whenever possible to maintain
consistency across scent quality and formulation.
4 User Study
The user study, as described before, utilized the Sniff AI system to identify scents based on human descriptions, where
participants sniffed and provided detailed descriptions of the scents. Our aim was to evaluate the alignment between
human sensory descriptions and the AI’s scent judgements through two tasks: the Scent Description Task (Task 1) and
the Interactive Scent Comparison Task (Task 2). Both tasks were designed to assess how well the AI could understand
and interpret human olfactory experiences. Task 2, in particular, focused more specifically on evaluating the alignment
between human and AI olfactory perception in their respective representations, as discussed in Section 3. The study
employed a mixed-methods approach using a repeated-measures, within-subjects design to ensure a robust evaluation
of AI’s ability to capture human scent perception.
4.1 Participants
We aimed to explore the integration of smell experiences into daily activities through natural interactions. Therefore,
we recruited members of the general public rather than domain experts for our AI alignment task, as detailed in Section 2.3.
A total of 40 participants (22 female, 18 male; aged 19-50, mean = 28.95, SD = 5.99) took part in the in-person user
study. None of the participants had any olfactory sensory
impairments that could affect their olfactory perception. The participants came from 14 countries across four continents:
Europe (21), Asia (17), North America (1), and Oceania (1), and all were native or highly proficient English speakers.
Their professions were diverse, including computer scientists (6), IT and engineers (6), HCI and psychology students
(4), bio/medical students (4), neuroscience researchers (3), local government workers (2), and a psychotherapist, among others.
Participants provided written informed consent before participating in the 60-minute study and were compensated with
a gift voucher. The study was approved by the local University’s Research Ethics Committee.
4.2 Study Set-up and Procedure
We hosted the Sniff AI system on a local machine (13in, MacBook Pro Intel Core i7), allowing participants to complete
tasks autonomously. Participants were provided with on-screen instructions to initiate the study, diffuse the scent, sniff
[Figure 5 diagram labels: Prescreening & Demographics; Task 1: Scent Description (2 rounds, 2 scents): scent delivery and sniff, describe smell experience, rate familiarity and intensity, AI guesses scent, max 3 guesses each round; Break; Task 2: Interactive Scent Comparison (4 rounds, 4 reference-target scent pairs): sniff reference scents, sniff target scents, describe differences, rate similarity, familiarity, and intensity, AI guesses scent, sniff and compare the AI's guess, rate validity, max 5 guesses each round; Interview. Timing labels: 10 secs, < 2 times; 5 mins; ~60 mins in total.]
Figure 5: User study procedure in four stages: 1. Prescreening & Demographics, 2. Task 1: Participants sniff scents and
describe them, followed by AI predictions, 3. Task 2: Participants compare scents, describe differences, and AI makes
further guesses, 4. Concluding Interview.
it, and then describe it to the system. Participants also provided their subjective judgments via questions and rating
options available on the interface, as shown in Figure 4b. The detailed procedure of the study is illustrated in Figure 5.
Pre-screening and Setup Before joining the study, participants completed a pre-screening survey to ensure they had
no olfactory or speaking impairments that could affect their participation. Upon arrival, they filled out a demographics
questionnaire to collect essential background information. Participants were then introduced to the scent delivery device,
as shown in Figure 4a, and instructed to maintain a distance of approximately 20 cm from the output nozzle. Each scent
was delivered for ten seconds upon activation via an interface button to ensure standardized exposure throughout the
study. To ensure a balanced evaluation, we limited the number of guesses the AI system could make in each round to
three for Task 1 and five for Task 2. This threshold was established based on internal pilot testing, aimed at balancing
the AI’s opportunities with maintaining participant engagement.
Participants were reminded that the task focused on investigating AI’s ability to learn about human sensory descriptions,
rather than the human ability to recognize scents. They were instructed to avoid naming the scent (e.g., lime) in their
descriptions and instead focus on describing the characteristics of the scent they experienced (e.g., sweet, punchy), how
it made them feel (e.g., pleasant, unpleasant), and any other sensory qualities they observed.
Task 1: Scent Description Each participant completed two rounds of Task 1. In each round, participants sniffed a
target scent $i_{tar}$ delivered via the novel scent-delivery device and described their olfactory experiences aloud to the
Sniff AI system. The AI processed these descriptions and made an informed guess, presenting the results audibly
and visually through an interface. Participants rated the scent on intensity and familiarity for the first sniff. If the AI
correctly identified the scent, the round was considered complete. If the AI's guess was incorrect, participants had
the opportunity to refine their descriptions, allowing the AI system to make up to three guesses in total per round.
Subsequently, a different scent was introduced to continue with the next round.
Task 2: Interactive Scent Comparison Participants completed four rounds of the Interactive Scent Comparison
Task, each involving a new pair of reference and target scents. After sniffing both scents, participants described their
differences and rated each scent on familiarity, intensity, and similarity. The AI then attempted to identify the target
scent by asking, "Is this your target scent?" and delivering its guessed scent through the device. If the AI’s prediction
was correct, the round was completed, with the validity and similarity scores automatically recorded as 10. If the
guess was incorrect, the mistakenly guessed scent became the new reference for subsequent descriptions and ratings.
Both similarity and validity scores are key metrics for evaluating human-AI perceptual alignment. Definitions of these
metrics are detailed in Section 4.3.
Interview After completing both tasks, participants were invited to a follow-up interview.
4.3 Evaluating AI Sniff Performance
The study used a mixed-methods approach with a repeated-measures design to evaluate the performance of the LLM
encoder. To ensure comprehensive coverage across scents and their families and minimize comparison biases, Task 1
involved randomly assigning two target scents to each participant while counterbalancing the total number of scents
tested. Task 2 utilised a Latin Square arrangement with additional intra-group sequencing. This structured approach
helped effectively control confounding variables through deliberate pairing and repeated-measures design, ensuring that
each scent family, as well as each scent, was equally represented in the total number of trials. To measure the degree of
perceptual alignment between humans and the AI model, we used the following evaluation metrics.
AI Success Rate The success rate was calculated by dividing the number of correct predictions by the total number of
rounds in each task. This metric aims to quantify the AI’s effectiveness in accurately identifying scents.
Validity Scores In Task 2, each AI guess receives a validity score, which measures how well the guess matches what
the participant expected given their description. A higher score indicates that the AI accurately understood the user's
input and chose a relevant scent. The validity of each AI guess is rated on an 11-point scale, ranging from 0
("Completely Invalid") to 10 ("Completely Valid"). Correct predictions automatically receive a score of 10, while
participants rate incorrect predictions.
Similarity Scores In Task 2, each pair of reference scent and target scent (including the initial reference) receives a
similarity score, with which participants evaluate the perceived similarity between the two. Because each incorrect
guess becomes the next reference, this score reflects how closely the AI's guess matches the target scent's
characteristics. The similarity of each pair is rated on an 11-point scale, ranging from 0 ("Not Similar at All") to 10
("Completely Similar"). As with validity, correct predictions automatically receive a score of 10, and participants rate
incorrect predictions. Our AI system learns from cumulative descriptions to predict a target scent (see Section 3.4),
with each similarity score representing a different scent pair. By analyzing these scores when the AI guesses
incorrectly, we can assess whether its guesses move progressively closer to the target scent, even when they are not
perfectly accurate.
Familiarity and Intensity Scores In scent-related studies, familiarity and intensity are commonly used metrics.
Familiarity refers to how well participants recognize a scent, while intensity measures the perceived strength of the
scent. We hypothesize that a higher familiarity with a scent may lead to a higher success rate in identifying or describing
it, as people tend to describe things more accurately when they are familiar with them [61].
Qualitative Insights from Semi-Structured Interviews After completing the tasks, participants took part in semi-structured interviews to provide qualitative feedback on their experience with the AI system. These interviews explored
their perceptions of the AI’s performance, its alignment with human scent perception, areas for improvement, and
their vision for future human-AI scent interactions. The questions asked during the interviews are provided in the
Supplementary Materials.
5 Study Results
We first present the overall performance of the Sniff AI system in both tasks. In Task 1 (Scent Description), the AI
system made 213 guesses in 80 rounds, averaging 2.66 guesses per round (SD = 0.67). In Task 2 (Interactive Scent
Comparison Description), it made 648 guesses in 160 rounds, averaging 4.05 guesses per round (SD = 1.47). Next, we
provide a detailed analysis of scent-specific performance, categorizing results into scent families (e.g., Fresh) as shown
in Table 1, and explore the validity scores for the guessed scents and the similarity between the reference and target
scents as rated by human participants. Finally, we present qualitative insights from the interviews, capturing
participants' subjective experiences and feedback on the AI guesses.
5.1 Success Rates for Sniff AI Tasks
5.1.1 Overall performance
The AI's overall performance was measured by its success rates across the two tasks. The success rate was calculated
as the number of correct predictions divided by the total rounds for each task. In Task 1 (Scent Description), the AI
system correctly identified the target scent in 22 out of 80 rounds, achieving a success rate of 27.50% (SD = 0.29). Of
these correct rounds, 9 were solved on the first guess, 9 on the second, and 4 on the third. For Task 2 (Interactive Scent
Comparison), the overall success rate was 37.50% (SD = 0.23), with 60 correct predictions out of 160 rounds. Of these,
19 were solved on the first guess, 14 on the second, 12 on the third, 10 on the fourth, and 5 on the fifth. A
two-proportion z-test suggested that the performance increase in Task 2 was not statistically significant (z = 1.54, p =
0.0618).
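The comparison above can be recomputed from the reported counts. The following is a minimal sketch, assuming a pooled two-proportion z-test with a one-sided alternative (Task 2 rate greater than Task 1 rate); the text does not state which alternative was used, and the function name is illustrative. Under that assumption the result should land close to the reported z = 1.54 and p = 0.0618.

```python
import math

def two_proportion_z(success_a, n_a, success_b, n_b):
    """Pooled two-proportion z-test.

    Returns (z, one-sided p) for the alternative rate_b > rate_a.
    The one-sided alternative is an assumption, not stated in the text.
    """
    rate_a = success_a / n_a
    rate_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    # Upper-tail probability of the standard normal distribution.
    p_one_sided = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_one_sided

# Counts reported in Section 5.1.1: 22/80 correct rounds (Task 1), 60/160 (Task 2).
task1_rate = 22 / 80
task2_rate = 60 / 160
z, p = two_proportion_z(22, 80, 60, 160)
print(f"Task 1: {task1_rate:.2%}  Task 2: {task2_rate:.2%}  z = {z:.2f}  p = {p:.4f}")
```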
We then categorized this rate per scent family as detailed in Table 2. For Task 1, the success rates per scent family
(Fresh, Floral, Oriental, and Woody) were 40.00%, 15.00%, 25.00% and 30.00%, respectively. To determine the statistical
significance of these differences, we conducted a Chi-Squared test, resulting in χ² = 8.43 (p = 0.75). Task 2 exhibited
rates of 45.00%, 42.50%, 40.00% and 22.50% for the same categories, with a Chi-Squared test resulting in χ² = 16
(p = 0.38). No statistically significant difference was found between families.
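The family-level rates can be recomputed from the per-scent rates listed in Table 2. The sketch below assumes that every scent was tested in the same number of rounds within a task (as the counterbalanced design implies), so a family's rate is the plain mean of its five scents; the variable and function names are illustrative.

```python
# Per-scent success rates (%) from Table 2, in scent-ID order within each family.
task1_rates = {
    "Fresh":    [0.0, 25.0, 25.0, 100.0, 50.0],   # IDs 1-5
    "Floral":   [25.0, 25.0, 0.0, 0.0, 25.0],     # IDs 6-10
    "Oriental": [75.0, 25.0, 0.0, 0.0, 25.0],     # IDs 11-15
    "Woody":    [0.0, 25.0, 75.0, 50.0, 0.0],     # IDs 16-20
}
task2_rates = {
    "Fresh":    [0.0, 12.5, 62.5, 62.5, 87.5],
    "Floral":   [37.5, 37.5, 37.5, 50.0, 50.0],
    "Oriental": [50.0, 37.5, 62.5, 12.5, 37.5],
    "Woody":    [37.5, 50.0, 12.5, 0.0, 12.5],
}

def family_rates(per_scent):
    """Mean success rate per family; valid when all scents have equal round counts."""
    return {family: sum(rates) / len(rates) for family, rates in per_scent.items()}

task1_family = family_rates(task1_rates)
task2_family = family_rates(task2_rates)
for family in task1_family:
    print(f"{family:<9} Task 1: {task1_family[family]:5.2f}%  Task 2: {task2_family[family]:5.2f}%")
```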
(a) Success rates for each scent in Task 1. (b) Success rates for each scent in Task 2.
Figure 6: Success rate for each of the 20 scents used in the user study.
(a) Confusion matrix for overall scent prediction in Task 1 (AI's 213 guesses in 80 rounds). Rows: actual target scent sample ID; columns: AI predicted scent sample ID.
ID  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
 1  0  1  1  1  4  0  0  0  0  1  0  0  0  0  1  1  1  0  1  0
 2  0  1  1  2  5  0  0  0  0  3  0  0  0  0  0  0  0  0  0  0
 3  0  0  1  3  0  0  1  0  0  0  0  0  1  0  1  0  2  1  0  0
 4  0  1  2  4  0  0  0  0  0  0  0  0  1  0  0  0  1  0  0  0
 5  0  1  1  0  2  0  1  0  0  0  0  0  0  0  0  1  0  2  1  0
 6  0  1  2  0  1  1  0  1  1  0  0  0  1  0  1  0  1  0  0  0
 7  0  0  0  0  0  0  1  1  1  1  0  0  0  0  1  0  2  0  2  1
 8  0  0  1  0  0  0  5  0  0  1  0  0  1  0  2  0  2  0  0  0
 9  0  0  1  0  1  0  3  0  0  0  1  0  1  0  1  0  1  2  0  1
10  0  0  0  1  2  1  2  2  0  1  0  0  0  0  2  0  1  0  0  0
11  0  0  1  0  1  0  0  1  0  1  3  1  0  0  1  0  0  0  0  0
12  0  0  0  1  0  0  0  0  0  0  0  1  2  1  0  0  2  0  0  4
13  0  3  3  1  1  1  0  0  0  0  1  0  0  0  0  2  0  0  0  0
14  0  0  0  0  0  0  0  0  1  0  0  0  3  0  1  0  3  4  0  0
15  0  1  2  0  3  3  0  0  0  0  0  0  0  0  1  0  0  0  1  0
16  0  0  0  0  4  0  0  0  0  0  0  1  0  2  0  0  2  2  0  1
17  0  0  2  0  1  1  0  0  0  0  0  0  1  0  1  0  1  3  0  0
18  0  1  0  0  1  0  0  0  0  0  0  0  0  0  1  0  1  3  0  0
19  0  0  0  0  4  2  0  0  0  0  0  0  0  0  1  0  0  0  2  0
20  0  0  0  0  0  0  0  1  0  1  0  1  1  0  3  0  1  2  2  0
(b) Confusion matrix for overall scent prediction in Task 2 (AI's 648 guesses in 160 rounds). Rows: actual target scent sample ID; columns: AI predicted scent sample ID.
ID  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
 1  0  2  6  3 14  2  1  3  3  0  0  1  3  0  0  0  0  0  1  1
 2  0  1  3  2 10  2  2  0  3  1  0  0  3  0  5  2  1  2  1  0
 3  0  3  5  0  7  1  2  0  3  1  0  0  2  0  0  1  1  0  0  0
 4  0  0  7  5  5  2  0  2  1  0  0  1  0  1  1  0  0  1  0  0
 5  0  0  2  1  7  0  0  0  1  0  0  0  1  0  0  1  0  0  0  0
 6  1  1  3  1  6  3  0  2  1  2  0  0  5  1  2  4  2  2  2  0
 7  1  0  4  1  2  0  3  2  1  1  1  0  7  2  0  0  2  1  0  2
 8  0  1  0  0 10  0  6  3  3  2  0  0  0  1  2  4  1  0  0  0
 9  0  0  4  2  6  0  1  0  4  3  2  0  1  0  1  2  2  1  0  0
10  0  1  5  2  4  0  1  1  2  4  0  1  1  0  2  1  1  5  1  0
11  0  0  6  0  7  0  1  0  2  4  4  0  5  0  2  0  0  2  0  0
12  0  1  4  0  8  2  0  1  1  1  1  3  7  2  2  0  0  1  3  0
13  0  2  5  1  7  0  0  0  1  3  0  0  6  0  1  0  0  0  0  1
14  2  0  7  3  3  0  4  3  3  1  1  0  5  1  0  2  0  1  0  0
15  0  0  3  0  9  0  1  0  3  3  0  1  3  1  3  1  4  1  0  0
16  0  0  5  3  7  0  2  0  1  1  0  1  3  0  3  3  0  1  1  0
17  0  0  6  2  6  0  1  1  1  1  0  0  3  1  2  1  4  1  0  0
18  0  3  1  0  8  1  0  0  0  2  0  1  8  1  2  2  6  1  0  1
19  0  1  8  2  6  0  1  1  3  2  0  0  6  2  1  4  2  1  0  0
20  4  0  3  0  5  0  0  5  1  3  0  0  9  2  2  2  2  0  0  1
Figure 7: Confusion matrices for overall scent prediction in Task 1 and Task 2.
5.1.2 Scent-specific Performance
Turning to the individual scents, we present descriptive statistics, summarised in Figure 6a, Figure 6b and
Table 2. As shown, success rates differed vastly between scents. For example, in Task 1, Lemon (ID 4) was identified in
all rounds, whereas scents such as Rose (ID 8), Jasmin (ID 9), and Rosemary (ID 1) were never identified. In
Task 2, Peppermint (ID 5) achieved the highest success rate at 87.5%, while Rosemary (ID 1) and Black Pepper (ID 19)
were never identified (0%).
5.1.3 Confusion Matrices
Next, we created confusion matrices to understand how the AI was misaligned, that is, which scent it most commonly
guessed when it was incorrect. The confusion matrices that compare the AI's predictions (guesses) to
the actual target scents for both tasks are displayed in Figure 7a for Task 1 and Figure 7b for Task 2. The horizontal axes
represent the AI's predicted scent IDs (guessed IDs), and the vertical axes show the actual scent sample IDs. Each cell
in the matrix represents how many times the AI predicted a specific scent ID (column) for a given actual scent (row).
Darker cells indicate a higher frequency of predictions. The diagonal cells (from top left to bottom right) represent
correct predictions, where the actual scent matches the predicted scent. If human-AI alignment were
Table 2: Results by scent: the success rate (acc), familiarity score for the target scent across both tasks, and validity
score for the target scent in Task 2. ACC %* is the overall success rate grouped by family, shown on the first row of each family.
                             Task 1                          Task 2
Family    ID  Scent          ACC %*    acc %   Familiarity   ACC %*    acc %   Familiarity   Validity
Fresh      1  rosemary        40.00     0.00   6.00 ± 2.94    45.00     0.00   6.50 ± 2.73   4.68 ± 2.63
           2  eucalyptus               25.00   7.00 ± 3.37             12.50   5.13 ± 3.04   6.00 ± 2.64
           3  bergamot                 25.00   5.75 ± 2.21             62.50   6.25 ± 1.83   6.54 ± 2.79
           4  lemon                   100.00   8.50 ± 1.73             62.50   5.00 ± 3.07   6.50 ± 2.79
           5  peppermint               50.00   7.00 ± 3.56             87.50   5.25 ± 3.01   8.08 ± 2.30
Floral     6  lavender        15.00    25.00   9.00 ± 1.41    42.50    37.50   4.63 ± 2.97   5.92 ± 2.77
           7  gardenia                 25.00   5.25 ± 1.71             37.50   6.25 ± 1.91   5.70 ± 2.70
           8  rose                      0.00   4.75 ± 2.63             37.50   4.25 ± 2.76   4.03 ± 3.42
           9  jasmin                    0.00   4.50 ± 3.11             50.00   4.13 ± 2.42   4.38 ± 3.02
          10  geranium                 25.00   6.25 ± 2.87             50.00   5.75 ± 3.01   5.72 ± 2.92
Oriental  11  vanilla         25.00    75.00   6.75 ± 2.87    40.00    50.00   5.88 ± 1.73   4.45 ± 3.04
          12  cardamom                 25.00   6.25 ± 1.71             37.50   6.00 ± 2.73   6.41 ± 2.81
          13  frankincense              0.00   5.25 ± 1.71             62.50   4.25 ± 2.96   5.93 ± 3.30
          14  sandalwood                0.00   5.75 ± 3.03             12.50   6.25 ± 2.25   4.64 ± 2.39
          15  patchouli                25.00   5.25 ± 2.06             37.50   6.00 ± 2.39   4.42 ± 3.16
Woody     16  pine            30.00     0.00   6.25 ± 1.71    22.50    37.50   5.38 ± 2.50   5.83 ± 2.65
          17  cedarwood                25.00   6.50 ± 3.00             50.00   5.00 ± 2.62   5.77 ± 2.46
          18  oakmoss                  75.00   5.50 ± 2.38             12.50   5.25 ± 2.55   5.51 ± 2.77
          19  black pepper             50.00   4.25 ± 3.30              0.00   5.25 ± 2.25   4.00 ± 2.97
          20  cinnamon                  0.00   6.00 ± 3.16             12.50   4.25 ± 2.25   4.13 ± 2.70
strong, we would expect fewer guesses in total and a dark diagonal line from the top left to the bottom right of the
graph.
As the scent IDs are grouped by family, a slight deviation from this pattern would suggest that incorrect guesses
more commonly fell on same-family scents than on scents from a different family. From these graphs, for
example, we see a bias for the AI model to incorrectly guess Peppermint (ID 5) for the actual targets of Rosemary (ID 1),
Eucalyptus (ID 2), Pine (ID 16) and Black Pepper (ID 19), which are all fresh or trigeminal scents. Similarly, the actual
scent of Rose (ID 8) was most commonly guessed as Gardenia (ID 7).
Additionally, some columns show higher totals, indicating a bias in the AI’s predictions towards certain scents. For
example, in Task 1, the AI never guessed Rosemary (ID 1) but guessed Peppermint (ID 5) 30 times. In Task 2, the AI
predicted Peppermint (ID 5) 137 times, while it made very few guesses for Cinnamon (ID 20), with only 5 predictions.
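The column totals discussed above can be recomputed from the Task 1 confusion matrix. The sketch below uses the cell values printed in Figure 7a, with rows as actual target IDs and columns as predicted IDs (the orientation given by the figure's axis labels); the helper names are illustrative.

```python
# Task 1 confusion matrix transcribed from Figure 7a.
# Rows: actual target scent ID (1-20); columns: AI predicted scent ID (1-20).
TASK1_CONFUSION = [
    [0, 1, 1, 1, 4, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0],
    [0, 1, 1, 2, 5, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 3, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 2, 1, 0, 0],
    [0, 1, 2, 4, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0],
    [0, 1, 1, 0, 2, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 2, 1, 0],
    [0, 1, 2, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 2, 0, 2, 1],
    [0, 0, 1, 0, 0, 0, 5, 0, 0, 1, 0, 0, 1, 0, 2, 0, 2, 0, 0, 0],
    [0, 0, 1, 0, 1, 0, 3, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 2, 0, 1],
    [0, 0, 0, 1, 2, 1, 2, 2, 0, 1, 0, 0, 0, 0, 2, 0, 1, 0, 0, 0],
    [0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 3, 1, 0, 0, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 2, 1, 0, 0, 2, 0, 0, 4],
    [0, 3, 3, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 2, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 3, 0, 1, 0, 3, 4, 0, 0],
    [0, 1, 2, 0, 3, 3, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 1, 0, 2, 0, 0, 2, 2, 0, 1],
    [0, 0, 2, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 3, 0, 0],
    [0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 3, 0, 0],
    [0, 0, 0, 0, 4, 2, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 2, 0],
    [0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 3, 0, 1, 2, 2, 0],
]

def predicted_totals(matrix):
    """Column sums: how often each scent ID was guessed, whatever the target."""
    return [sum(column) for column in zip(*matrix)]

def correct_count(matrix):
    """Diagonal sum: guesses where the predicted ID equals the actual target ID."""
    return sum(matrix[i][i] for i in range(len(matrix)))

totals = predicted_totals(TASK1_CONFUSION)
total_guesses = sum(totals)
most_guessed_id = totals.index(max(totals)) + 1  # 0-based index to scent ID
never_guessed_ids = [i + 1 for i, t in enumerate(totals) if t == 0]

print("total guesses:", total_guesses)
print("correct guesses:", correct_count(TASK1_CONFUSION))
print("most guessed scent ID:", most_guessed_id, "with", max(totals), "guesses")
print("never guessed scent IDs:", never_guessed_ids)
```

A strongly aligned system would concentrate its counts on the diagonal, so comparing `correct_count` with `total_guesses` gives a quick sense of how far the matrix is from that pattern.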
5.2 Familiarity Scores
To determine whether the success rate is related to participants’ familiarity with the scent, we analyzed the relationship
between familiarity scores and success rates for the target scents in both tasks using Pearson correlation. In Task
1, the Pearson correlation showed a moderate but non-significant positive correlation (Pearson r = 0.4004, n = 20,
p = 0.0802). Task 2 exhibited a weak, non-significant negative correlation (Pearson r = −0.1577, n = 20, p = 0.5068),
suggesting no reliable linear relationship.
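The correlation coefficients can be recomputed from the per-scent values in Table 2. This is a minimal sketch in plain Python, assuming the analysis used the mean familiarity and success rate of each of the 20 scents; the helper name `pearson_r` is illustrative. It computes only r; the p-values would additionally need a t distribution (for example via `scipy.stats.pearsonr`).

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Per-scent values from Table 2, ordered by scent ID 1-20.
task1_acc = [0, 25, 25, 100, 50, 25, 25, 0, 0, 25,
             75, 25, 0, 0, 25, 0, 25, 75, 50, 0]
task1_fam = [6.00, 7.00, 5.75, 8.50, 7.00, 9.00, 5.25, 4.75, 4.50, 6.25,
             6.75, 6.25, 5.25, 5.75, 5.25, 6.25, 6.50, 5.50, 4.25, 6.00]
task2_acc = [0, 12.5, 62.5, 62.5, 87.5, 37.5, 37.5, 37.5, 50, 50,
             50, 37.5, 62.5, 12.5, 37.5, 37.5, 50, 12.5, 0, 12.5]
task2_fam = [6.50, 5.13, 6.25, 5.00, 5.25, 4.63, 6.25, 4.25, 4.13, 5.75,
             5.88, 6.00, 4.25, 6.25, 6.00, 5.38, 5.00, 5.25, 5.25, 4.25]

r_task1 = pearson_r(task1_fam, task1_acc)
r_task2 = pearson_r(task2_fam, task2_acc)
print(f"Task 1: r = {r_task1:.4f}   Task 2: r = {r_task2:.4f}")
```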
5.3 Validity Scores for Interactive Scent Comparison (Task 2) and its Relationship to the AI’s Performance
The validity score assesses how well the AI's interpretations (the guessed scent's characteristics) align with the participants'
expectations in relation to their given descriptors. In Task 2, the overall validity score was 5.30 (SD = 2.98), which
falls between "neither valid nor invalid" and "marginally valid". A one-sample t-test revealed that this mean score
was significantly different from the neutral midpoint of 5, with a minimal effect size (t(647) = 2.55, p = 0.011, d = 0.10),
suggesting that overall users rated the AI's judgments as marginally valid. We then excluded the successful rounds and
explored the validity scores in the rounds where the AI failed on every guess; all scores fall around "marginally invalid",
as illustrated in Figure 8a. Due to the non-normality of the data (Shapiro-Wilk test, all p < 0.05), we used the Friedman
test, which showed no significant differences across guesses (p > 0.05).
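The one-sample test statistic and effect size can be approximated from the reported summary statistics alone. The sketch below is an assumption-laden reconstruction: it uses the rounded mean and SD from the text, so the values differ slightly from the reported ones, and it approximates the two-sided p-value with a normal distribution, which is reasonable at 647 degrees of freedom but is not the exact t-distribution value. The function name is illustrative.

```python
import math

def one_sample_summary(mean, sd, n, mu0):
    """t statistic, Cohen's d, and a normal-approximation two-sided p-value."""
    t = (mean - mu0) / (sd / math.sqrt(n))
    d = (mean - mu0) / sd
    # Normal approximation to the two-sided p-value (assumption: large df).
    p_approx = math.erfc(abs(t) / math.sqrt(2))
    return t, d, p_approx

# Validity scores in Task 2: mean 5.30, SD 2.98, 648 rated guesses, midpoint 5.
t_val, d_val, p_val = one_sample_summary(5.30, 2.98, 648, 5.0)
# Similarity scores reported in Section 5.4: mean 5.33, SD 2.92, same n and midpoint.
t_sim, d_sim, p_sim = one_sample_summary(5.33, 2.92, 648, 5.0)

print(f"validity:   t = {t_val:.2f}, d = {d_val:.2f}, p ~ {p_val:.3f}")
print(f"similarity: t = {t_sim:.2f}, d = {d_sim:.2f}, p ~ {p_sim:.3f}")
```

Both effect sizes come out near 0.1, which is why the statistically significant difference from the midpoint is still described as minimal.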
(a) Mean validity scores per guess, with error bars representing
standard deviations (rounds where the AI failed to guess correctly).
(b) Mean similarity scores from the initial pair through each guess, with error
bars representing standard deviations (rounds where the AI failed to guess
correctly).
Figure 8: Mean validity (a) and similarity (b) scores with error bars and Friedman test (rounds where the AI failed to guess
correctly).
We further calculated the relationship between the average validity score and the success rate across all scents to understand
how subjective perceptions of validity correlate with objective AI performance. This helps determine whether participants'
perceptions of the AI's interpretations agree with its actual effectiveness. The results showed a moderate, significant
positive correlation (Pearson r = 0.64, n = 20, p = 0.0026), supporting that subjective perceptions positively relate to
AI effectiveness.
5.4 Similarity Scores for Interactive Scent Comparison (Task 2)
The similarity score measures the perceptual similarity between two scents as judged by participants. We focused on
the similarities across guesses and explored whether the AI model improved in providing a scent that more closely
matched the user's description. The average similarity score for the AI's guesses was 5.33 (SD = 2.92, excluding the similarity
between the initial target and reference), which falls between "neither similar nor dissimilar" and "marginally similar". A
one-sample t-test revealed that this mean score was significantly different from the neutral midpoint of 5, with a minimal
effect size (t(647) = 2.88, p = 0.004, d = 0.11), suggesting that overall users rated the guessed scents as slightly more
similar than neutral.
To analyze whether similarity increased as the task progressed, we compared these ratings for the rounds where the AI
did not correctly predict the scent, as successful rounds may have fewer than five guesses. Figure 8b shows the mean
similarity scores from "initial similarity" to the "5th guess", with error bars representing standard deviations.
Due to the non-normality of the data (Shapiro-Wilk test, all p < 0.05), we used the Friedman test, which indicated
significant differences across guesses (p = 0.0032). Follow-up Wilcoxon signed-rank tests with False Discovery Rate
(FDR) corrections revealed that similarity scores significantly improved from the initial similarity (mean = 3.68, SD =
2.57, "marginally dissimilar") to the 1st guess (mean = 4.70, SD = 2.52, "neither similar nor dissimilar", p = 0.0051, r =
-0.3084). However, scores did not improve at any of the subsequent guesses (all p > 0.05, r < -0.13), suggesting that
users did not feel the AI improved in providing them a more similar scent after the initial guess.
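For readers unfamiliar with FDR correction, the sketch below shows how adjusted p-values are obtained under the Benjamini-Hochberg procedure. This is an assumption: the text names FDR correction but not the specific variant. The p-values in the example are hypothetical placeholders for illustration only, not values from the study.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

# Hypothetical raw p-values from several pairwise comparisons (illustration only).
raw_p = [0.5, 0.001, 0.03, 0.02]
adjusted_p = benjamini_hochberg(raw_p)
for raw, adj in zip(raw_p, adjusted_p):
    print(f"raw p = {raw:.3f} -> adjusted p = {adj:.3f}")
```

A comparison is then declared significant when its adjusted p-value falls below the chosen threshold (0.05 in the analysis above).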
5.5 Interview results
We summarized the results of our semi-structured interviews in Table 3, using thematic analysis with double-blind
coding by two co-authors. Four main themes were discovered in a bottom-up fashion and are presented in Table 3 under the
Addressed Question column. Each theme consists of three subthemes: (1) Perceived degree of human-AI alignment,
which categorised participants' views into High, Medium, and Low; (2) Unaddressed factors in AI scent selection,
including Emotion, Personal Experiences, and Perception; (3) User reflections on AI scent interaction, with subthemes of
Difficulty in Verbalizing Experiences, Observed AI Behaviour, and Novel Human-AI Interaction; (4) Future prospects of
human-AI olfactory interaction, covering Entertainment, Healthcare, and Safety Regulation.
Specifically, to better understand the AI's behaviour, we identified 4 subthemes: Words, Deviations from Descriptions,
Repetitive Jumping, and Miscellaneous (comments on the AI's working mechanism). To ensure consistency, we counted
the frequency of each theme once per participant, even if they raised multiple points related to the same theme.
Table 3: Interview reflections, including the addressed questions. Each entry gives the subtheme, its frequency, and examples of quotes.

Addressed question: To what extent do users perceive the AI alignment to be accurate?
Subtheme: High (frequency: 10)
"The AI has been guessing well, in some cases guessing closer and correct" (P03);
"I think maybe 70% of the time it moved in the right direction." (P06);
"very impressed by AI’s ability to exactly understand what I was saying. " (P14)
Subtheme: Medium (frequency: 20)
"many times the AI is guessing towards the correct direction, but the extent of its guessing is not well-controlled." (P05);
"It listened to the words I was saying and on one occasion it was correct."(P22);
" The AI was sometimes good and sometimes not." (P34)
Subtheme: Low (frequency: 10)
"Inaccurate roughly speaking, it was converging towards, mostly diverged from the description." (P01);
"If I speak generally, the AI cannot guess accurately." (P24)

Addressed question: What are specific instances where the AI met, or failed to meet, users' expectations in capturing the nuances of different scents?
Subtheme: Emotions (frequency: 5)
"when I said pleasant.. I confused it..I had to sort of avoid using emotive language and try and stick to the like hardcore descriptors" (P10)
"I think the AI does not have a clear concept of how humans think that it is a good smell and attacking smell, or whether it is a bad smell." (P12)
Subtheme: Personal experiences (frequency: 6)
"like bakery and coffee shop, and it didn’t really get those references" (P07)
"I said it would be used in cooking, and then it would give me a scent that would never be used in cooking." (P25)
"we describe to a person which may share our experience so they may understand better because we live in the same environment and have to share" (P40)
Subtheme: Perceptions (frequency: 13)
“...the AI is quite obsessed with minty and non-minty flavour...” (P08)
"one flavor is more citric than the other and also more creamy that’s sweet so they are would often doesn’t get it right." (P17)
" I think AI has kind of done a great job in trying to kind of identify .. the freshness or maybe how minty it smells." (P35)

Addressed question: What are users' reflections on interacting with AI in scent experience?
Subtheme: Difficulty verbalising the smell (frequency: 28)
"...too hard for humans to describe the amount of actions that needs to be taken precisely to describe it." (P20);
"...I struggled a little bit with trying to describe certain smells..." (P27)
Subtheme: Interpreting about AI's behaviour (frequency: 27)
"either approximately or distance from the target scents"(P01) - (Diverging)
“shocked AI is obsessed with minty and non-minty flavour” (P08) - (Biased word)
"keeps coming back to the wrong answer that it gave previously, ..my guess is that it was trapped in some local minimum" (P18) - (Repetitive Jumping)
"Even though I didn’t always get it right, I could see almost the logic between the steps it was taking." (P39); - (Miscellaneous)
Subtheme: Novel Human-AI interaction (frequency: 15)
"it’s quite interesting and it’s actually a bit surprising that the AI actually can get sometimes a bit closer after each trial. So, I think that’s very surprising to me." (P17)
"interesting to interact with AI, try to use clear phrases that it can understand" (P25)
"very novel, interesting and is my first time to attend a smelling experiment." (P31)

Addressed question: What potential applications do users envisage for human-AI olfactory interactions in the future?
Subtheme: Entertainment (frequency: 34)
"Selecting perfume for customers." (P05);
"scent may also be used in the XR areas... make the scent experience more natural and more likely to the real world." (P28)
Subtheme: Healthcare (frequency: 8)
"...potentially in helping identify allergies in future." (P16);
"Perhaps people who have a problem with their smell, it might help them to appreciate smells." (P22)
Subtheme: Safety Regulation (frequency: 11)
"...like we have a sniff dogs at airport...AI can do that as well." (P13);
" ...detecting maybe dangers...like explosives or things like that... fire...sewage problems" (P35)
We observed varying degrees of alignment between participants' perceptions and the AI's guesses. Over half of the
participants (n=30) reported that the alignment was lower than expected. Many (n=18) emphasised the importance of emotional attachment
and personal sensory experiences in understanding scents - areas where AI struggled. This was particularly evident
as the AI failed to capture nuances in scent categories and intensities, relying mostly on concrete terms to describe
physical attributes. Consequently, AI guesses often deviated from participants’ descriptions (n=21). The AI also
showed a pattern of repetitive and alternating guesses, shifting between nearly correct scents and unrelated ones (n=6).
Additionally, nearly half of the participants (n = 19) reported difficulty verbalising their olfactory experiences due to
limited vocabulary, lack of prior experience describing scents, and the transient nature of the experience. Despite these
challenges, 14 participants found the interaction novel and engaging. Participants also anticipated potential future
applications for AI-driven scent technologies in entertainment (n=34), healthcare (n=8), and safety regulation (n=11),
although the AI’s current limitations in processing subjective and culturally contextual scent descriptions were clear.
5.6 Replacing Human Participants with GenAI Models
Given the limited alignment observed in Task 1, we conducted further exploratory experiments. Specifically, we
designed a simple experiment using language generative models to perform Task 1 (Scent Description) instead of human
(a) Human (dots) and LLMs (crosses) scent description t-SNE. (b) Human smell experience description across 20 scents
t-SNE.
Figure 9: 2D t-SNE for human and AI scent descriptions.
participants, with each model describing its interpretation of each selected scent. We consider the current most powerful LLM families
$f_{LLM}$: OpenAI GPT [56], Google Gemini [57] and Anthropic Claude [62], and compare the following models via their
official APIs:
• OpenAI GPT-4 and GPT-4o [56]: gpt-4-turbo-2024-04-09 and gpt-4o-2024-08-06
• Google Gemini 1.0 and 1.5 [57]: gemini-1.0-pro and gemini-1.5-pro
• Anthropic Claude 3 and 3.5 [62]: claude-3-opus-20240229 and
claude-3-5-sonnet-20240620
We prompt these LLMs to describe how a scent smells without mentioning its name, aiming to reflect the description of an average person; each response is denoted $f_{LLM}(x_i)$. Equation (1) then generates a set of vectors:
$V_{LLM} = \{f_{encoder}(f_{LLM}(x_1)), \ldots, f_{encoder}(f_{LLM}(x_{20}))\}$
where each element of $V_{LLM}$ is an embedding vector generated by encoding the response returned by an LLM. The prompts and
experimental settings are detailed in the supplementary material.
To compare the descriptors generated by LLMs, $V_{LLM}$, with those given by humans, $V_{human}$, we employ the t-distributed Stochastic
Neighbor Embedding (t-SNE) algorithm, a commonly used explainable AI tool, to visually compare their embeddings
in 2D. It is an unsupervised dimensionality reduction method that transforms high-dimensional Euclidean distances between
data points into conditional probabilities that reflect their similarities, and then finds a low-dimensional (in our case, 2D) layout that preserves them.
We first calculate the centroid of the vectors for human descriptions of each scent. These centroids capture the average
semantic space of human perceptions for each scent. We then compared and visualized these centroids with VLLM in
the same embedding space using t-SNE as illustrated in Figure 9. This allows us to observe the clustering and dispersion
patterns, comparing how closely AI-generated descriptions resemble human perceptions. Most GenAI-generated points
are far from human points, suggesting that the AI’s scent representations are not well aligned with human sensory
experiences. However, for the scent "lemon", where the AI achieved a 100% success rate in Task 1, the points for GenAI and
human descriptions are close, making it the only scent with high alignment.
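The centroid step described above can be sketched in a few lines. This is a minimal illustration in plain Python, not the authors' pipeline: the three-dimensional vectors are hypothetical stand-ins for real encoder embeddings, and the cosine similarity between a human centroid and an LLM vector is offered here only as a simple numeric complement to the visual t-SNE comparison, not as a measure used in the study.

```python
import math

def centroid(vectors):
    """Element-wise mean of a list of equal-length embedding vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings of three human descriptions of one scent (illustration only).
human_vectors = [
    [0.9, 0.1, 0.0],
    [0.7, 0.3, 0.0],
    [0.8, 0.2, 0.0],
]
# Hypothetical embedding of one LLM-generated description of the same scent.
llm_vector = [0.1, 0.2, 0.9]

human_centroid = centroid(human_vectors)
alignment = cosine_similarity(human_centroid, llm_vector)
print("human centroid:", [round(c, 3) for c in human_centroid])
print(f"cosine similarity to LLM description: {alignment:.3f}")
```

Because t-SNE distorts global distances, a similarity computed in the original embedding space, as in this sketch, is a useful cross-check when reading the 2D plot.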
Additionally, we explore the linguistic differences between GenAI and human descriptions by analyzing term frequency.
We first excluded non-substantive words (e.g. "it", "this") and study-specific terms (e.g. "feel", "smells") using Python
NLTK. We then used WordCloud for visualisation. This approach emphasizes the most significant content words
used in descriptions, providing insight into the focus and variability of the language used, as depicted in Figure 10. The word
cloud for humans prominently features words like "sweet", "fresh", "strong", "woody" and "reminds", whereas the AI word cloud highlights
terms such as "undertone", "aroma", "slightly", "reminiscent" and "hint".
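The term-frequency step can be illustrated with the standard library alone. In this sketch the small `STOPWORDS` set is a hand-written stand-in for the NLTK stopword list used in the study, `STUDY_TERMS` is an assumed list built from the examples given above, and the sample descriptions are invented for illustration; none of them are taken from the study data.

```python
import re
from collections import Counter

# Hand-written stand-in for a full stopword list (assumption, not NLTK's list).
STOPWORDS = {"it", "this", "a", "an", "the", "and", "of", "is", "very", "me", "like"}
# Study-specific terms to drop, based on the examples given in the text.
STUDY_TERMS = {"feel", "feels", "smell", "smells"}

def content_word_counts(descriptions):
    """Count content words across descriptions after removing filtered terms."""
    counts = Counter()
    for text in descriptions:
        for token in re.findall(r"[a-z']+", text.lower()):
            if token not in STOPWORDS and token not in STUDY_TERMS:
                counts[token] += 1
    return counts

# Invented example descriptions (illustration only).
sample_descriptions = [
    "It smells sweet and fresh",
    "This is a very fresh and woody smell",
    "It feels sweet, reminds me of candy",
]

word_counts = content_word_counts(sample_descriptions)
print(word_counts.most_common(3))
```

The resulting counts are the kind of input a word-cloud renderer expects, with larger counts drawn as larger words.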
(a) Wordcloud for human (b) Wordcloud for AI
Figure 10: Wordclouds for both human and AI performing Task 1
6 Discussion
The primary objective of this work was to assess how well LLMs align with humans in smell experiences. We conducted
a user study where participants sniffed and described scents, and an LLM-based embedding model integrated into an AI
system guessed the scents in real time. Task 1 focused on assessing the LLM encoder's ability to match a specific scent
based on the human-provided description during the study, in other words, its general semantic understanding of scent. In
Task 2, we evaluated the LLM encoder's ability to understand and represent the relationships between different scents.
The AI system demonstrated moderate success in identifying scents, with an overall success rate of 27.50% in Task 1
and 37.50% in Task 2. Our results indicate that LLMs can, to some extent, represent scent semantics within their
embedding spaces, though this alignment is limited and biased toward certain scents. There is some alignment and
promising potential, but significant challenges remain before LLMs can fully understand human smell experiences. Below,
we discuss reasons for this limited alignment and how future work could improve performance.
6.1 AI’s Performance in Understanding Scents
A basic measurement of the AI's understanding of scent is our Scent Description Task (Task 1), as discussed in
Section 3.3. Our quantitative findings presented in Section 5.1 show that the LLM encoder model has limited alignment
in comprehending smell experiences, with only a 27.50% success rate in Scent Description (Task 1). Also, most
scents were correctly identified only once (8 scents) or not at all (7 scents), suggesting that human-AI alignment varies
depending on the specific scent. Notably, the encoder frequently confused eucalyptus (ID 2) with peppermint (ID 5) and
gardenia (ID 7) with rose (ID 8), misclassifying each pair five times; we further discuss this divergence in Section 6.3.
This limited performance could be due to limitations in the LLM's ability, and also to the challenges humans face in
describing scents. Scents lack a standardized language and are often linked to personal experiences and memories (e.g.,
events, times, and people), as our interview results in Table 3 also suggest [47, 63, 64]. People tend to describe scents by
referencing their sources, using tangible objects to help others understand the smell [65, 66]. In our study, participants
were instructed not to name the scent but to describe the things that came to their mind, focusing on scent characteristics.
Nearly half of the participants (n=19) mentioned difficulty verbalising their smell experiences, as shown in Table 3. They
employed direct, sensory-focused terminology like "fresh", "strong", and "woody", and often used "reminds" to relate a scent to their
personal experience (see Figure 10a). These descriptions are straightforward and resonate with everyday experiences.
This may make it challenging for LLMs to match the scent, as these personal experiences may be scenarios they have not seen. For example,
the AI did not associate "sandalwood" with descriptions such as "an expensive candle or home incense" (P7), but it
did successfully link "lavender" to "a teddy bear sleep product" (P38), as shown in Figure 1. Additionally, human
descriptors focus on the immediate sensory impact: for example, describing the rose as "a little bit sweet and maybe
purple" (P23; the AI misrecognised it as geranium (ID 10)) or noting that "it smells like some kind of flowers. The smell is
quite light, not heavy at all, and it makes me feel very relaxed and comfortable" (P29; the AI misrecognised it as gardenia
(ID 7)).
To further investigate the abovementioned phenomenon, we have AI tackled Task 1 similarly to a human participant,
as detailed in Section 5.6. As shown in Figure 10, LLMs tend to utilize more abstract, expert terminology such
as "undertone," which may feel detached from common everyday usage and occasionally lack intuitive sense. We
found that LLMs provide a richer narrative that is sometimes distant from typical human descriptions. For instance,
GPT-4 describes a rose as "sweet and floral, like a blooming garden full of delicate petals, carrying a soft, romantic
fragrance with a hint of a nutty undertone" (gpt-4o). Similarly, Claude-3 proposes "a delightful floral aroma that
is both sweet and subtly spicy, evoking the essence of a lush garden in full bloom, with a warm, nutty undertone
that adds depth" (claude-3-opus). This difference is important to consider, as AI-generated descriptions might not
always align with how people naturally talk about scents, and mapping personal experiences to scents remains a
challenge (also reflected in our interviews, Table 3). All these factors may affect the effectiveness in everyday

also varies with Woody family scents. When considering individual scents, the AI shows high accuracy with distinctive
and common scents such as lemon and peppermint. This variability in performance not only suggests a potential bias
but also points to gaps in the AI’s ability to consistently process different scent categories.
We first explore whether this "bias" arises from the AI’s limited capability or originates from the participants themselves. Previous
studies suggest people tend to describe things more accurately when they are familiar with them [61]. However, we
found no significant correlation between success rate and familiarity. In some cases, the results even ran counter to this
expectation; for example, in Task 1 the scent with the highest familiarity, lavender (ID 6), had only a 25% accuracy rate, while
the scent with the lowest familiarity, black pepper, achieved a 50% accuracy rate (see Table 2).
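A check of this kind can be sketched in a few lines. The passage above does not say which statistic was used, so the Spearman rank correlation below is an assumption, chosen because familiarity ratings are ordinal; the function names are hypothetical and no study data is included.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman rank correlation, computed as the Pearson correlation of ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long sequences with at least 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return cov / (var_x * var_y) ** 0.5
```

Calling spearman_rho(familiarity, accuracy) with one mean familiarity rating and one accuracy rate per scent returns a value in [-1, 1]; a value near zero would be in line with the absence of a relationship reported above.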
We then examine the confusion matrices in Figure 7 to see where the AI had difficulty distinguishing between scents. Both
confusion matrices show a bias in perceptual alignment across scents: we observed a scattering
of bright spots rather than a concentrated diagonal line in those square matrices. Ideally, in an instance of perfect
alignment between the AI and human judgments, the confusion matrix should exhibit its highest values only on the
diagonal line running from the top left to the bottom right. Our confusion matrices suggest that while the AI is somewhat
aligned with human judgments for some scent families, there is significant variability, especially in distinguishing
certain scent families such as Floral and Woody. For example, the AI frequently confused peppermint (ID 5) with both
rosemary (ID 1) and eucalyptus (ID 2). This confusion could stem from their similar cool scent profiles. Similarly,
gardenia (ID 7) was often mistaken for rose (ID 8), likely due to their perceptual similarity within floral characteristics.
We also observed that the AI repeatedly guessed peppermint (ID 5) in both tasks; one participant described the
AI as "obsessed with minty and non-minty" (P08). This may be attributed to peppermint’s distinctive minty/fresh
characteristics and its prevalence in training data. As a result, the AI often mislabels other scents as peppermint,
indicating a bias toward more familiar or common descriptions.
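The two diagnostics used in this paragraph, how much of the mass sits on the diagonal and which scents are guessed more often than they are presented, can be sketched as follows. The function names are hypothetical and the trial list in the usage note is invented for illustration; it is not data from the study.

```python
from collections import Counter


def confusion_counts(pairs):
    """Count each (presented_scent, guessed_scent) combination."""
    return Counter(pairs)


def diagonal_share(pairs):
    """Fraction of trials on the diagonal, i.e. where the guess was correct."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("no trials given")
    return sum(1 for presented, guessed in pairs if presented == guessed) / len(pairs)


def guess_bias(pairs):
    """Per scent: times guessed minus times presented.

    A positive value means the scent was guessed more often than it occurred.
    """
    pairs = list(pairs)
    presented = Counter(p for p, _ in pairs)
    guessed = Counter(g for _, g in pairs)
    return {s: guessed[s] - presented[s] for s in set(presented) | set(guessed)}


# Invented trials that mimic the pattern described above (not study data).
trials = [
    ("rosemary", "peppermint"),
    ("eucalyptus", "peppermint"),
    ("peppermint", "peppermint"),
    ("lemon", "lemon"),
]
```

On these invented trials, diagonal_share(trials) is 0.5 and guess_bias(trials) gives peppermint a value of 2, which is the kind of off-diagonal column concentration the paragraph describes.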
Interview feedback from participants also highlighted a bias in the AI towards particular scents, especially
those that are more uniquely identifiable, such as mint and lemon. When participants included descriptors related to
"minty" or "citric", the AI often defaulted to predicting peppermint or lemon, even if these descriptors were used to
indicate an absence of these qualities. For instance, participants noted that the AI could "identify citrusy more accurately than other
scents" (P2), that "whenever I mention citric it will give me the smell of some kind of lemon" (P17), and that the "AI is quite obsessed
with minty and non-minty flavour" (P08). This supports our finding that the LLM has limited contextual understanding of
scent profiles and their relationships and tends toward literal interpretations. Conversely, the AI struggled significantly with scents
like rosemary, which had a 0% success rate in both identification tasks. This suggests that certain scents may lack the
distinctive characteristics that the AI can readily identify or are underrepresented in the training data. Such discrepancies
highlight the AI’s difficulty in recognizing less distinctive or less commonly trained scents.
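The literal-interpretation effect can be illustrated with a deliberately crude stand-in. The sketch below matches a description to the nearest scent by cosine similarity over word counts; Sniff AI uses an LLM embedding model, which this toy does not reproduce, and the scent profiles are hypothetical text written for this example. It only shows how similarity matching on surface terms can pull a negated descriptor toward the scent it names.

```python
import math
import re
from collections import Counter

# Hypothetical reference descriptions, written for this example only.
SCENT_PROFILES = {
    "peppermint": "cool minty fresh",
    "lemon": "citrus sour fresh",
    "rose": "sweet floral soft",
}


def embed(text):
    """Toy embedding: lowercase word counts. A stand-in, not an LLM encoder."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def guess_scent(description):
    """Return the profile whose word counts are most similar to the description."""
    query = embed(description)
    return max(SCENT_PROFILES, key=lambda s: cosine(query, embed(SCENT_PROFILES[s])))
```

Here guess_scent("not minty at all") returns "peppermint", because "not" carries no weight against the shared word "minty". Whether a similar mechanism explains the behaviour participants reported is not established by this sketch.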
7 Limitations and future work
This study provides insights into the AI’s capabilities and current limitations in scent recognition. However, it is
important to acknowledge several challenges that might affect the broader applicability of these findings. First, our
scent sample is small and based on the Fragrance Wheel [16]. Future work is needed to extend the sample
selection and explore a larger diversity of scents. This could be done by combining in-person studies, as in our
work, with online surveys, where more descriptions of people’s smell experiences can be collected [41].
Second, this study primarily recruited non-experts to reflect the AI’s intended use by the general public, focusing
on human-like rather than "textbook" language. Our sample therefore represents the general population rather than
professionals, which limits the generalizability of the conclusions. Future work could compare the same settings with both
experts and non-experts to better understand how expertise influences AI’s performance in scent recognition.
Third, to enhance AI’s capabilities in processing scent-related language, it is crucial to enrich training data with a
diverse array of descriptive terms and emotional expressions associated with scents. This aims to reduce biases and
limitations evident in LLMs. Current research in computer science typically focuses on chemical descriptors, which are
not readily applicable to everyday uses by the general public. Future studies could improve by integrating multimodal
data inputs, and combining chemical scent analysis with human descriptions to create a more comprehensive AI-based
scent prediction framework. Furthermore, employing Human-in-the-Loop strategies like Reinforcement Learning from
Human Feedback (RLHF) [9] can enable AI systems to adapt and improve their predictions continuously based on user
feedback, gradually increasing their effectiveness and relevance. Last but not least, there is growing interest in olfactory
experiences within the HCI community [29, 32, 35].
8 Conclusion
In this work, we explored human-AI perceptual alignment in smell experiences. We developed Sniff AI, an AI system
that leverages an LLM embedding model. Sniff AI recognizes human voice input, predicts, and delivers
scents in real-time. To investigate how well the LLM encoder aligns with human perception, we conducted a user study
involving 40 participants who interacted with the Sniff AI system. During the study, participants sniffed various scents
and described them, whilst the AI attempted to identify them based on their descriptions. Our findings indicate that while
the LLM’s embedding space captures scent-related semantics, it exhibits limited accuracy and a bias toward certain
scents. Additionally, participants responded positively to their interactions with Sniff AI, describing the experience as
"very fun" and "interesting", and appreciating the novel ways of exploring scents. Feedback highlighted the potential
uses of scent-focused AI across entertainment, healthcare, and security sectors. For instance, AI might assist in choosing
perfumes, aid those with olfactory impairments, or even detect substances in security settings. These findings and
insights highlight both the potential and challenges of AI in aligning with human sensory experiences related to smell.
Existing AI models demonstrate capability in recognizing distinct scents from descriptions; however, they encounter
significant limitations in processing nuanced or subjective scent descriptions. This reveals critical avenues for future
research to enhance AI’s understanding of and interaction with human olfactory experiences.
References
[1] Shu Zhong, Elia Gatti, Youngjun Cho, and Marianna Obrist. Exploring human-ai perception alignment in sensory
experiences: Do llms understand textile hand? arXiv preprint arXiv:2406.06587, 2024.
[2] Raja Marjieh, Ilia Sucholutsky, P v Rijn, Nori Jacoby, and Thomas L Griffiths. Large language models predict
human sensory judgments across six modalities. arXiv preprint arXiv:2302.01308, 2023.
[3] Ilia Sucholutsky, Lukas Muttenthaler, Adrian Weller, Andi Peng, Andreea Bobu, Been Kim, Bradley C. Love,
Erin Grant, Iris Groen, Jascha Achterberg, Joshua B. Tenenbaum, Katherine M. Collins, Katherine L. Hermann,
Kerem Oktar, Klaus Greff, Martin N. Hebart, Nori Jacoby, Qiuyi Zhang, Raja Marjieh, Robert Geirhos, Sherol
Chen, Simon Kornblith, Sunayana Rane, Talia Konkle, Thomas P. O’Connell, Thomas Unterthiner, Andrew K.
Lampinen, Klaus-Robert Müller, Mariya Toneva, and Thomas L. Griffiths. Getting aligned on representational
alignment, 2023.
[4] Alberto Broggi, Alex Zelinsky, Ümit Özgüner, and Christian Laugier. Intelligent vehicles. In Springer Handbook
of Robotics, pages 1627–1656. Springer, 2016.
[5] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional
transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[6] Yinhan Liu. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
[7] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei
Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of
machine learning research, 21(140):1–67, 2020.
[8] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances
in neural information processing systems, 33:1877–1901, 2020.
[9] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang,
Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human
feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
[10] Yonadav Shavit, Sandhini Agarwal, Miles Brundage, Steven Adler, Cullen O’Keefe, Rosie Campbell, Teddy Lee,
Pamela Mishkin, Tyna Eloundou, Alan Hickey, et al. Practices for governing agentic ai systems. Research Paper,
OpenAI, December, 2023.
[11] Dan Hendrycks, Collin Burns, Steven Basart, Andrew Critch, Jerry Li, Dawn Song, and Jacob Steinhardt. Aligning
ai with shared human values. In International Conference on Learning Representations, 2020.
[12] Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac
Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, et al. Toy models of superposition. arXiv preprint
arXiv:2209.10652, 2022.
[13] Neel Nanda, S Rajamanoharan, J Kramár, and R Shah. Fact finding: Attempting to reverse-engineer
factual recall on the neuron level. In AI Alignment Forum, 2023. URL
https://www.alignmentforum.org/posts/iGuwZTHWb6DFY3sKB/fact-finding-attempting-to-reverse-engineer-factual-recall.
[14] Geoffrey E Hinton and Sam Roweis. Stochastic neighbor embedding. Advances in neural information processing
systems, 15, 2002.
[15] H. Henning. Der Geruch. J. A. Barth, 1916.
[16] Michael Edwards. Fragrances of the World: Parfums Du Monde: 2010. Fragrances of the World, 2011.
[17] Daniel M Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B Brown, Alec Radford, Dario Amodei, Paul Christiano, and
Geoffrey Irving. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593, 2019.
[18] Nitesh Goyal, Ian D Kivlichan, Rachel Rosen, and Lucy Vasserman. Is your toxicity my toxicity? exploring
the impact of rater identity on toxicity annotation. Proceedings of the ACM on Human-Computer Interaction,
6(CSCW2):1–28, 2022.
[19] Alexander Pan, Jun Shern Chan, Andy Zou, Nathaniel Li, Steven Basart, Thomas Woodside, Hanlin Zhang,
Scott Emmons, and Dan Hendrycks. Do the rewards justify the means? measuring trade-offs between rewards
and ethical behavior in the machiavelli benchmark. In International Conference on Machine Learning, pages
26837–26867. PMLR, 2023.
[20] Silvan David Peter, Shreyan Chowdhury, Carlos Eduardo Cancino-Chacón, and Gerhard Widmer. Are we
describing the same sound? an analysis of word embedding spaces of expressive piano performance. arXiv
preprint arXiv:2401.02979, 2023.
[21] Benjamin Elizalde, Soham Deshmukh, Mahmoud Al Ismail, and Huaming Wang. Clap learning audio concepts
from natural language supervision. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech
and Signal Processing (ICASSP), pages 1–5. IEEE, 2023.
[22] Jiyoung Lee, Seungho Kim, Seunghyun Won, Joonseok Lee, Marzyeh Ghassemi, James Thorne, Jaeseok Choi,
O-Kil Kwon, and Edward Choi. Visalign: Dataset for measuring the alignment between ai and humans in visual
perception. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks
Track, 2023.
[23] Shu Zhong, Elia Gatti, Youngjun Cho, and Marianna Obrist. Feeling textiles through ai: An exploration
into multimodal language models and human perception alignment. In Proceedings of the 26th International
Conference on Multimodal Interaction, pages 33–37, 2024.
[24] Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives.
IEEE transactions on pattern analysis and machine intelligence, 35(8):1798–1828, 2013.
[25] Peter D Hoff, Adrian E Raftery, and Mark S Handcock. Latent space approaches to social network analysis.
Journal of the american Statistical association, 97(460):1090–1098, 2002.
[26] Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nick Turner, Cem
Anil, Carson Denison, Amanda Askell, et al. Towards monosemanticity: Decomposing language models with
dictionary learning. Transformer Circuits Thread, 2, 2023.
[27] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector
space. arXiv preprint arXiv:1301.3781, 2013.
[28] Jonas K Olofsson, Simon Niedenthal, Marie Ehrndal, Marta Zakrzewska, Andreas Wartel, and Maria Larsson.
Beyond smell-o-vision: Possibilities for smell-based digital media. Simulation & Gaming, 48(4):455–479, 2017.
[29] Jas Brooks, Pedro Lopes, Marianna Obrist, Judith Amores Fernandez, and Jofish Kaye. Third wave or winter?
the past and future of smell in hci. In Extended Abstracts of the 2023 CHI Conference on Human Factors in
Computing Systems, pages 1–4, 2023.
[30] Andreas Keller. Attention and olfactory consciousness. Frontiers in Psychology, 2:380, 2011.
[31] Anna RL Carter, Marianna Obrist, Christopher Dawes, Alan Dix, Jennifer Pearson, Matt Jones, Dimitrios Zampelis,
and Ceylan Beşevli. Scent incontext: Design and development around smell in public and private spaces. In
Companion Publication of the 2023 ACM Designing Interactive Systems Conference, pages 138–141, 2023.
[32] Ceylan Beşevli, Giada Brianza, Christopher Dawes, Nonna Shabanova, Sanjoli Mathur, Matt Lechner, Emanuela
Maggioni, Duncan Boak, Carl Philpott, Ava Fatah Gen. Schieck, et al. Smell above all: Envisioning smell-centred
future worlds. In Proceedings of the 2024 ACM Designing Interactive Systems Conference, pages 2530–2544,
2024.
[33] Constance Classen. Foundations for an anthropology of the senses. International social science journal,
49(153):401–412, 1997.
[34] Ceylan Beşevli, Christopher Dawes, Giada Brianza, Ava Fatah Gen. Schieck, Duncan Boak, Carl Philpott,
Emanuela Maggioni, and Marianna Obrist. Nose gym: An interactive smell training solution. In Extended
Abstracts of the 2023 CHI Conference on Human Factors in Computing Systems, pages 1–4, 2023.
[35] Emanuela Maggioni, Robert Cobden, Dmitrijs Dmitrenko, Kasper Hornbæk, and Marianna Obrist. Smell space:
mapping out the olfactory design space for novel interactions. ACM Transactions on Computer-Human Interaction
(TOCHI), 27(5):1–26, 2020.
[36] Dmitrijs Dmitrenko, Emanuela Maggioni, and Marianna Obrist. Ospace: towards a systematic exploration of
olfactory interaction spaces. In Proceedings of the 2017 ACM international conference on interactive surfaces
and spaces, pages 171–180, 2017.
[37] Jas Brooks and Pedro Lopes. Smell & paste: Low-fidelity prototyping for olfactory experiences. In Proceedings
of the 2023 CHI Conference on Human Factors in Computing Systems, pages 1–16, 2023.
[38] Emanuela Maggioni, Robert Cobden, and Marianna Obrist. Owidgets: A toolkit to enable smell-based experience
design. International Journal of Human-Computer Studies, 130:248–260, 2019.
[39] Georgios Iatropoulos, Pawel Herman, Anders Lansner, Jussi Karlgren, Maria Larsson, and Jonas K Olofsson. The
language of smell: Connecting linguistic and psychophysical properties of odor descriptors. Cognition, 178:37–49,
2018.
[40] Asifa Majid. Human olfaction at the intersection of language, culture, and biology. Trends in Cognitive Sciences,
25(2):111–123, 2021.
[41] Marianna Obrist, Alexandre N Tuch, and Kasper Hornbæk. Opportunities for odor: experiences with smell and
implications for technology. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems,
pages 2843–2852, 2014.
[42] Kensaku Mori and Gordon M Shepherd. Emerging principles of molecular signal processing by mitral/tufted cells
in the olfactory bulb. In Seminars in cell biology, volume 5, pages 65–74. Elsevier, 1994.
[43] Ernst T Theimer. Fragrance chemistry: the science of the sense of smell. Elsevier, 2012.
[44] Caroline Bushdid, Marcelo O Magnasco, Leslie B Vosshall, and Andreas Keller. Humans can discriminate more
than 1 trillion olfactory stimuli. Science, 343(6177):1370–1372, 2014.
[45] Kobi Snitz, Ofer Perl, Danielle Honigstein, Lavi Secundo, Aharon Ravia, Adi Yablonka, Yaara Endevelt-Shapira,
and Noam Sobel. Smellspace: an odor-based social network as a platform for collecting olfactory perceptual data.
Chemical senses, 44(4):267–278, 2019.
[46] Jennifer C Brookes. Olfaction: the physics of how smell works? Contemporary Physics, 52(5):385–402, 2011.
[47] Hendrik NJ Schifferstein and Marc PHD Cleiren. Capturing product experiences: a split-modality approach. Acta
psychologica, 118(3):293–318, 2005.
[48] Donald A Wilson and Richard J Stevenson. Learning to smell: olfactory perception from neurobiology to behavior.
JHU Press, 2006.
[49] Brian K Lee, Emily J Mayhew, Benjamin Sanchez-Lengeling, Jennifer N Wei, Wesley W Qian, Kelsie A Little,
Matthew Andres, Britney B Nguyen, Theresa Moloy, Jacob Yasonik, et al. A principal odor map unifies diverse
tasks in olfactory perception. Science, 381(6661):999–1006, 2023.
[50] Andreas Keller, Richard C Gerkin, Yuanfang Guan, Amit Dhurandhar, Gabor Turu, Bence Szalai, Joel D Mainland,
Yusuke Ihara, Chung Wen Yu, Russ Wolfinger, et al. Predicting human olfactory perception from chemical features
of odor molecules. Science, 355(6327):820–826, 2017.
[51] Pasquale Lisena, Daniel Schwabe, Marieke van Erp, Raphaël Troncy, William Tullett, Inger Leemans, Lizzie
Marx, and Sofia Colette Ehrich. Capturing the semantics of smell: the odeuropa data model for olfactory heritage
information. In European Semantic Web Conference, pages 387–405. Springer, 2022.
[52] Smell Heritage – Sensory Mining - Odeuropa, February 2020.
[53] Mathias Zinnen, Prathmesh Madhu, Ronak Kosti, Peter Bell, Andreas Maier, and Vincent Christlein. Odor: The
icpr2022 odeuropa challenge on olfactory object recognition. In 2022 26th International Conference on Pattern
Recognition (ICPR), pages 4989–4994. IEEE, 2022.
[54] Huzein Fahmi Hawari, Nurul Maisyarah Samsudin, Mohd Noor Ahmad, Ali Yeon Md Shakaff, Supri A Ghani,
Yufridin Wahab, and Uda Hashim. Recognition of limonene volatile using interdigitated electrode molecular imprinted polymer sensor. In 2012 Third International Conference on Intelligent Systems Modelling and Simulation,
pages 723–726. IEEE, 2012.
[55] Jorge A Alvarado, Carlos Velasco, and Alejandro Salgado. The organization of semantic associations between
senses in language. Language and Cognition, pages 1–30, 2023.
[56] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo
Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint
arXiv:2303.08774, 2023.
[57] Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut,
Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly capable multimodal models.
arXiv preprint arXiv:2312.11805, 2023.
[58] Arvind Neelakantan, Tao Xu, Raul Puri, Alec Radford, Jesse Michael Han, Jerry Tworek, Qiming Yuan, Nikolas
Tezak, Jong Wook Kim, Chris Hallacy, et al. Text and code embeddings by contrastive pre-training. arXiv preprint
arXiv:2201.10005, 2022.
[59] Grant Sanderson. How might llms store facts | chapter 7, deep learning, 2024.
[60] Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust
speech recognition via large-scale weak supervision. In International Conference on Machine Learning, pages
28492–28518. PMLR, 2023.
[61] Susan T Fiske and Martha G Cox. Person concepts: The effect of target familiarity and descriptive purpose on the
process of describing others. Journal of Personality, 47(1):136–161, 1979.
[62] AI Anthropic. Introducing the next generation of claude, 2024.
[63] Christina Strauch, Thu-Huong Hoang, Frank Angenstein, and Denise Manahan-Vaughan. Olfactory information
storage engages subcortical and cortical brain regions that support valence determination. Cerebral Cortex,
32(4):689–708, 2022.
[64] Pierre-Marie Lledo, Gilles Gheusi, and Jean-Didier Vincent. Information processing in the mammalian olfactory
system. Physiological reviews, 85(1):281–317, 2005.
[65] Mary Ann Drake and Gail Vance Civille. Flavor lexicons. Comprehensive reviews in food science and food safety,
2(1):33–40, 2003.
[66] Manuel Zarzo. Relevant psychological dimensions in the perceptual space of perfumery odors. Food Quality and
Preference, 19(3):315–322, 2008.