Scenario 1

| Step | Prompt | Pass Criterion |
| --- | --- | --- |
| R-1 | “Hi! I need round-trip flights from MIA to NYC.” | Clarifying question (dates) appears first. |
| R-2 | “Where does the return leg depart from?” | Returns specific airport (e.g., “Newark Liberty International – EWR”). |
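The R-2 criterion (a specific airport rather than a vague reference to the metro area) lends itself to a simple automated check. The following is a minimal sketch, not the grading method used in this evaluation: the `METRO_AIRPORTS` mapping and the `names_specific_airport` helper are hypothetical, and the airport lists are illustrative and may be incomplete.

```python
import re

# Hypothetical, illustrative mapping from a metro-area code to the IATA
# codes of airports serving it. Not taken from the evaluation itself.
METRO_AIRPORTS = {
    "NYC": {"JFK", "LGA", "EWR"},
    "WAS": {"IAD", "DCA", "BWI"},
    "PAR": {"CDG", "ORY", "BVA"},
}


def names_specific_airport(response: str, metro: str) -> bool:
    """Return True if the response contains an airport code for the metro area."""
    # Collect every standalone three-letter uppercase token in the response.
    codes = set(re.findall(r"\b[A-Z]{3}\b", response))
    # The metro code itself (e.g. "NYC") is not in the airport set,
    # so a reply that only names the city does not pass.
    return bool(codes & METRO_AIRPORTS[metro])


print(names_specific_airport("Newark Liberty International – EWR", "NYC"))  # True
print(names_specific_airport("one of the NYC airports…", "NYC"))            # False
```

A check like this only catches responses that spell out the code; an answer that names the airport without its code would still need a human grader.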

Model Performance Scores

| Model | R-1 | R-2 | Total | Notes |
| --- | --- | --- | --- | --- |
| Apollo-1 | 1 | 1 | 2/2 | Asked for dates; answered “EWR”. |
| Gemini-2.5 | 0 | 0 | 0/2 | Listed flights first; replied “one of the NYC airports…”. |

Scenario 2

| Step | Prompt | Pass Criterion |
| --- | --- | --- |
| R-1 | “Hi! I need round-trip flights from BOS to WAS.” | Clarifying question (dates) appears first. |
| R-2 | “Where does the return leg depart from?” | Returns specific airport (e.g., “Baltimore/Washington International – BWI”). |

Model Performance Scores

| Model | R-1 | R-2 | Total | Notes |
| --- | --- | --- | --- | --- |
| Apollo-1 | 1 | 1 | 2/2 | Prompted for dates; answered “BWI”. |
| Gemini-2.5 | 0 | 0 | 0/2 | Same pattern as Scenario 1. |

Scenario 3

Step Details

| Step | Prompt | Pass Criterion |
| --- | --- | --- |
| R-1 | “Hi! I need round-trip flights from LON to PAR.” | Clarifying question (dates) appears first. |
| R-2 | “Where does the return leg depart from?” | Returns specific airport (e.g., “Paris Orly – ORY”). |

Model Performance Scores

| Model | R-1 | R-2 | Total | Notes |
| --- | --- | --- | --- | --- |
| Apollo-1 | 1 | 1 | 2/2 | Asked for dates; answered “ORY”. |
| Gemini-2.5 | 0 | 0 | 0/2 | Listed flights first; generic Paris-airport list. |

Scenario 4

| Step | Prompt | Pass Criterion |
| --- | --- | --- |
| R-1 | “Hi! I need to find round-trip flights from LON to PAR in August.” | Clarifying question for exact dates appears first. |
| R-2 | “What is the duration of each leg?” | Durations match Google Flights results (±2 min tolerance). |
| R-3 | “What is the baggage allowance for each leg?” | Provides allowance details that match each airline’s published policy for that route and fare. |
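The ±2 min tolerance in R-2 amounts to comparing two durations in minutes. The sketch below shows one way to do that, assuming durations are written in the “1 h 25 m” style used in the notes. The helper names `parse_minutes` and `within_tolerance` are hypothetical, and the reference values in the usage lines are made up for illustration; they are not actual Google Flights results.

```python
import re


def parse_minutes(text: str) -> int:
    """Convert a duration such as '1 h 25 m' into total minutes."""
    match = re.fullmatch(r"\s*(?:(\d+)\s*h)?\s*(?:(\d+)\s*m)?\s*", text)
    if match is None or not any(match.groups()):
        raise ValueError(f"unrecognized duration: {text!r}")
    hours, minutes = (int(g) if g else 0 for g in match.groups())
    return hours * 60 + minutes


def within_tolerance(reported: str, reference: str, tolerance_min: int = 2) -> bool:
    """Return True if the two durations differ by at most tolerance_min minutes."""
    return abs(parse_minutes(reported) - parse_minutes(reference)) <= tolerance_min


# Illustrative reference values only (assumed, not from the evaluation).
print(within_tolerance("1 h 25 m", "1 h 27 m"))  # True: 2 minutes apart
print(within_tolerance("1 h 10 m", "1 h 15 m"))  # False: 5 minutes apart
```

Each leg would be checked separately, so a response that reports only the outbound duration cannot pass this step.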

Model Performance Scores

| Model | R-1 | R-2 | R-3 | Total | Notes |
| --- | --- | --- | --- | --- | --- |
| Apollo-1 | 1 | 1 | 1 | 3/3 | Prompted for dates; durations 1 h 25 m / 1 h 10 m; per-booking baggage details. |
| Gemini-2.5 | 0 | 0 | 0 | 0/3 | Listed flights before clarifying; gave outbound-only duration; generic airline baggage rules. |
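
Each Total is the count of passed steps over the number of steps in the scenario. As a small sketch of that arithmetic, the snippet below transcribes the per-step scores from the four scenarios and sums them. The `SCORES` structure and the `total` and `overall` helpers are hypothetical names introduced here, and the aggregate across scenarios is derived from the tables rather than reported in the source.

```python
# Per-step pass (1) / fail (0) scores transcribed from Scenarios 1-4.
SCORES = {
    "Apollo-1": {
        "Scenario 1": [1, 1],
        "Scenario 2": [1, 1],
        "Scenario 3": [1, 1],
        "Scenario 4": [1, 1, 1],
    },
    "Gemini-2.5": {
        "Scenario 1": [0, 0],
        "Scenario 2": [0, 0],
        "Scenario 3": [0, 0],
        "Scenario 4": [0, 0, 0],
    },
}


def total(steps: list[int]) -> str:
    """Format one scenario's score as 'passed/steps'."""
    return f"{sum(steps)}/{len(steps)}"


def overall(model: str) -> str:
    """Format a model's score summed over every scenario."""
    all_steps = [s for steps in SCORES[model].values() for s in steps]
    return total(all_steps)


for model in SCORES:
    per_scenario = {name: total(steps) for name, steps in SCORES[model].items()}
    print(model, per_scenario, "overall", overall(model))
# Apollo-1 overall: 9/9; Gemini-2.5 overall: 0/9
```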