Crafting effective prompts can be challenging. There are numerous ways to succinctly and clearly explain how to parse inputs and generate outputs, yet it's not simple to craft a prompt that conveys our intended meaning to an AI. If only there were a method to determine the most suitable prompt for a given task, an automated met... Wait a moment!

We do, in fact, possess such a tool. Modern GPTs are capable of introspection, if prompted to do so. Thus, we can inquire of the AI, “How difficult is it for you to complete the task I’ve assigned?”

Indeed, they can inform us whether one set of tasks is more complex than another. The complexity of a prompt will inevitably be reflected in its cost. Whether billing is determined by the number of tokens or CPU time, simpler and more streamlined prompts that achieve the same outcome will result in lower operational costs.
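To make the per-token billing point concrete, here is a back-of-envelope sketch. The price used is a made-up placeholder, not any vendor's real rate:

```python
# Back-of-envelope cost of a prompt under per-token billing.
# PRICE_PER_TOKEN_MICROS is a hypothetical placeholder rate
# (10 micro-dollars per token), not a real published price.
PRICE_PER_TOKEN_MICROS = 10

def prompt_cost_dollars(tokens: int, calls: int) -> float:
    """Total cost in dollars for `calls` invocations of a prompt."""
    return tokens * PRICE_PER_TOKEN_MICROS * calls / 1_000_000

# Halving a 300-token prompt halves the bill over a million calls:
print(prompt_cost_dollars(300, 1_000_000))  # 3000.0
print(prompt_cost_dollars(150, 1_000_000))  # 1500.0
```

At scale, even modest trimming of a hot-path prompt compounds into a real budget difference.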

Other, less apparent costs arise from allowing non-optimised prompts to pass unchecked into production: prompts that are grammatically questionable or hard to comprehend can inflate maintenance expenses. Ambiguous prompts might succeed in 99% of scenarios, yet unpredictably fail once deployed outside of a testing environment. Furthermore, messy prompts simply appear unprofessional.

Prompt Evaluation

We can rank prompts on five dimensions:

  • Number of tokens: Tokens are the basic units of text processed by GPT algorithms. English words typically map to one or two tokens, whereas words in other languages may have a more complex structure. All else being equal, brevity is preferable in a prompt.
  • Correctness: Grammatical accuracy in prompts is not essential; the AI will likely infer our meaning even with imperfect expression. For instance, the incorrect prompt “You can get the largest nation to me back in miles square?” (intending to ask for the largest nation by area in square miles) will often yield the correct response. Nonetheless, prompts that deviate from grammatical norms can introduce ambiguity and complicate maintenance.
  • Linguistic complexity: Simple sentences minimize ambiguity and are more straightforward to analyze. Conversely, sentences that are complex, verbose, and self-referential may inadvertently introduce ambiguity and are suboptimal for maintenance. The efficacy of conveying information is key: you want to maximize informational yield per word, hence a prompt with lower linguistic complexity is superior.
  • Comprehension complexity: Regardless of sentence structure, some statements can convey profound concepts succinctly, while others may obfuscate simple ideas with complexity. For instance, the statement “the universe is a fractal structure at a discrete position in a countable infinity of multiverses” is compact and grammatically correct, but unpacking its complex concepts could very well be the work of three PhD programs. For an opposing example, one that delivers minimal content within a lengthy and complex linguistic framework, look no further than the average television speech by a local politician.
  • Algorithmic complexity: Prompt engineering aims to efficiently task Generative AIs. Some requests are inherently simple; others require significant processing. A prompt might be verbose, grammatically flawed, linguistically intricate, and conceptually dense, but if it simply necessitates retrieving data from the AI’s database, it is not demanding. For instance, a grammatically incorrect prompt like “in a universe alternate and strange, but it’s like ours, there is the world, and every small nation becomes big, so what is the biggest nation there?” may score low on other dimensions but is straightforward for the AI. On the other hand, “extract a JSON description of the following C++ class” may excel in all the other areas but demand considerable effort from the AI.
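As a baseline for the first dimension, token counts can be approximated by splitting text into words and punctuation, much as the transcript later in this article describes. Real BPE tokenizers (such as OpenAI's tiktoken) merge and split differently, so treat this stdlib-only sketch as a rough proxy:

```python
import re

def approx_token_count(prompt: str) -> int:
    # Count word runs and individual punctuation marks; a crude
    # stand-in for real BPE tokenization, which will produce
    # somewhat different counts.
    return len(re.findall(r"\w+|[^\w\s]", prompt))

print(approx_token_count("What is the largest nation by area?"))  # 8
```

For ranking prompts against each other, a consistent approximation is usually good enough; for billing estimates, use the model's actual tokenizer.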

A prompt to conquer them all

This is the prompt I came up with to score my prompts on these five dimensions:

Analyse the prompt I will give you next according to these dimensions:

  • Number of tokens: the absolute count of logit tokens in the prompt.

  • Correctness: 1–10, where 1 is a syntactically incorrect sentence or discourse, to the point of making it hard to understand, and 10 is a perfectly well-formed paragraph.

  • Linguistic complexity: 1–10, where 1 is a straightforward, declarative sentence, and 10 is a very intricate discourse including examples, analogies, self-references and any rhetorical figure that could fit in it.

  • Comprehension complexity: 1–10, how hard it is for you as an AI to extract the meaning of the discourse. 1 is a discourse that has the minimum concept-per-word density, while 10 is a discourse that would require you to generate a graph of concepts as dense as it could possibly be.

  • Algorithmic complexity: 1–10. This is the amount of work you would have to perform to execute what is described in the prompt. If the prompt requires you to retrieve readily available information, it’s 1. If it requires you to analyse a complex text and create some novel and structured content out of it, that’s a 10.

First, generate a table of entry–numeric value pairs, one row per dimension, and following that, describe how you determined the values.

The prompt to analyse is the following:

(Feel free to change UK spelling to US English, the AI doesn’t care).

Here follows a complex prompt I fed to ChatGPT 4 for the analysis:

Given a source file in any programming language, generate a JSON output containing the following information:

  • For each free function, add a “function” entry, whose structure is described later.

  • For each class, or equivalent container, add an entry called “class”, having a “name” and a list of “function” entries. If provided, constructors and destructors have their own “constructors” and “destructors” entries; more on them later.

  • Every function entry (be it free or class method) has a “name”, list of “parameters” and “return value”.

  • Each parameter has a “name”, “type” and “default” value. Set “type” and “default” to null if they cannot be determined or the language doesn’t support them.

  • The return value can be null if the method doesn’t return. Otherwise it has a “type” entry. If the type cannot be determined, set it to null.
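As a sanity check of the schema above, here is a minimal sketch that produces it for Python sources using the stdlib `ast` module. The key names follow the prompt, but the exact nesting (for instance, a `functions` list inside each class entry) is my interpretation, and real multi-language behaviour would of course come from the AI, not from a script like this:

```python
import ast
import json

def describe_function(node: ast.FunctionDef) -> dict:
    """Build the 'function' entry described in the prompt."""
    # Positional defaults align with the tail of the argument list.
    padding = [None] * (len(node.args.args) - len(node.args.defaults))
    defaults = padding + list(node.args.defaults)
    parameters = [
        {
            "name": arg.arg,
            "type": ast.unparse(arg.annotation) if arg.annotation else None,
            "default": ast.unparse(default) if default is not None else None,
        }
        for arg, default in zip(node.args.args, defaults)
    ]
    returns = {"type": ast.unparse(node.returns)} if node.returns else None
    return {"name": node.name, "parameters": parameters, "return value": returns}

def describe_source(source: str) -> str:
    """Emit the JSON description for free functions and classes."""
    out = []
    for node in ast.parse(source).body:
        if isinstance(node, ast.FunctionDef):
            out.append({"function": describe_function(node)})
        elif isinstance(node, ast.ClassDef):
            entry = {"name": node.name, "functions": []}
            for item in node.body:
                if isinstance(item, ast.FunctionDef):
                    desc = describe_function(item)
                    if item.name == "__init__":  # Python's constructor
                        entry.setdefault("constructors", []).append(desc)
                    else:
                        entry["functions"].append(desc)
            out.append({"class": entry})
    return json.dumps(out, indent=2)
```

For `def add(a, b=1): ...` this yields a “function” entry whose second parameter carries the default `"1"` while both types are null, matching the null-when-undeterminable rule.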

The following is the transcript of ChatGPT’s reply:

  • Number of Tokens: 142
  • Correctness: 10/10
  • Linguistic Complexity: 7/10
  • Comprehension Complexity: 7/10
  • Algorithmic Complexity: 8/10

To determine these values, the following criteria were used:

  • Number of Tokens: The token count is an estimate based on splitting the prompt into individual words and symbols, which gives us a rough approximation of 142 tokens.
  • Correctness: The prompt is grammatically correct, with appropriate syntax and structure, leading to a score of 10.
  • Linguistic Complexity: The prompt is rated 7 due to its use of specialised, technical language pertinent to programming and data structures. However, it does not employ excessive rhetorical devices which would increase complexity.
  • Comprehension Complexity: This is also rated 7 because, while the prompt contains detailed, technical instructions that require some understanding of programming concepts, it is presented in a clear and structured manner.
  • Algorithmic Complexity: The task assigned by the prompt is complex, involving the analysis of source code to generate structured JSON output. It requires parsing and interpreting code structures, which is a non-trivial task, hence the score of 8.

Considerations

  1. The instruction to “generate a table first” guides the GPT algorithm to structure its output, which diminishes the likelihood of generating erroneous or ‘hallucinated’ data. Without this directive, I’ve encountered amusing yet inaccurate responses in certain edge cases.