Can We Trust AI Chatbots to Teach University Physics? A Performance Comparison of AI Chatbots — Academic Article in Scopus

abstract

  • Generative artificial intelligence (AI) is transforming the learning and teaching landscape, and large language models, such as ChatGPT, Gemini, or Copilot, demonstrate an outstanding ability to hold conversations and generate relevant content. However, given their probabilistic nature, their academic performance cannot yet be fully trusted. A detailed comparison of the accuracy of two AI chatbots, ChatGPT-4o and Gemini 1.5 Pro, is presented by measuring their ability to correctly answer multiple-choice questions from standard undergraduate physics questionnaires. A total sample of 790 questions was analyzed, covering 13 mechanics themes, from units and measurements to gravitation. Each chatbot was given the same prompt and asked to solve each question. Its answers were compared with the correct answers, and an accuracy indicator was derived for each chatbot. In partial agreement with similar studies, ChatGPT achieved an overall accuracy of only 58 (±7) points (on a 0-100 scale), while Gemini achieved 79 (±6), a 36% improvement. An analysis of variance showed this difference to be significant. The stability of ChatGPT was also tested across different runs, which yielded essentially the same overall performance, although its average accuracy may vary by up to 11 points for a given theme. ChatGPT running with the Wolfram Alpha plugin did not achieve a systematic improvement, and its average accuracy may vary by up to 17 points for a given theme. A PhysicsGPT by pulsr.co.uk from the OpenAI gallery was also tested, achieving an overall accuracy of only 57. When using these AI chatbots to solve multiple-choice questions on undergraduate physics topics requiring mathematical procedures, teachers and students should take these results as a warning about their current lack of precision and potential to mislead. © 2025 IEEE.
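The accuracy indicator described above can be sketched as follows. This is a minimal illustration of the idea, not the authors' actual analysis code: the function names, the answer data, and the 10-question sample are all hypothetical, chosen only to show how per-theme accuracy on a 0-100 scale could be computed from an answer key.

```python
# Hypothetical sketch of the accuracy indicator: each chatbot answer is
# compared with the answer key, and accuracy is expressed on a 0-100 scale.
# All data values below are illustrative, not taken from the study.

def accuracy(chatbot_answers, answer_key):
    """Percentage of multiple-choice answers matching the key (0-100)."""
    correct = sum(a == k for a, k in zip(chatbot_answers, answer_key))
    return 100 * correct / len(answer_key)

# Illustrative example: 10 questions from one theme.
key     = list("ABCDABCDAB")
chatbot = list("ABCDABCDCC")   # 8 of 10 answers match the key
print(accuracy(chatbot, key))  # -> 80.0
```

In the study, this kind of score was computed over 790 questions grouped into 13 mechanics themes, which is what allows both the overall accuracy and the per-theme variation (up to 11 points between runs for ChatGPT) to be reported.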

publication date

  • January 1, 2025