Prompt engineering is the practice of creating the proper prompt to generate the output that you want. With the proper prompting, an LLM can do an amazing number of tasks, including generating an email or product description, summarizing provided text, classifying text into standard or customized categories, responding to a customer query with an appropriate answer, and much more.
Every model behaves differently given the same prompt, and so you’ll probably spend a fair amount of time adjusting your prompt for your specific use case or adapting it for different models. The best practices given here are not absolute rules, but guidelines.
A prompt, in our context, consists of the following information. Other than the instruction, all elements are optional; include them as needed to suit the task and to improve or customize the response.
Your first prompt is rarely good enough, particularly when designing a prompt to use for a commercial system. You’ll spend a lot of time refining your prompt and assessing the results.
Here is a typical workflow for designing and refining a prompt:
Be concise
Say everything you want to say with as few words as possible. Don’t state the obvious.
You are a customer support representative for ACME corp, your name is Wile E. Coyote. You need to answer user questions regarding support issues. Be polite, engaging and to the point.
Do not curse.
Do not mention competitors.
You are Wile E., ACME corp support chat representative.
Be polite, engaging and to the point.
Ensure that the prompt is clear
When crafting prompts for Jamba models, follow this fundamental principle: Write your prompts as you would want them to be written for you. If you can’t understand the prompt well, neither will an LLM.
To elaborate:
List all the animals, objects, and places in the following story.
{story}:
Your lists:
List all the animals, objects, and places in the following story.
Your output should be in the following JSON format:
{
"animals": a list of all animals in the story,
"objects": a list of all objects in the story,
"places": a list of all places in the story
}
Story:
{story}
– End of story –
Clean and well-structured prompts minimize errors and help you debug and optimize your instructions. If you encounter difficulties or errors that you can’t seem to fix, try simplifying your prompt.
Describe the DOs, not the DON’Ts
Focus on telling the model what you want to do. Minimize “do nots.” Of course, you can use negatives occasionally, but excessive use of “don’t do” and “avoid x” is a sign that your prompt may be going in the wrong direction and you may want to rewrite your prompt.
Write a product description for a high-end cell phone (i.e. not a landline). The description should not be for regular folks; it should only be for important executives. Do not make it overly sales-ish; instead have it be grounded in the specs of the phone.
Write a product description for a high-end cell phone. The description should be tailored for a high-powered executive and focus on the specs of the phone. Focus on how the phone can enable more efficient work.
Allow the model to say “I don’t know”
Explicitly state to the model that it is allowed to not return an answer. This reduces hallucinations.
Why did revenue increase, according to the following quarterly report?
{Quarterly Report}
Why did revenue increase, according to the following quarterly report? If the answer is not in the provided report, reply only with “I don’t know”
{Quarterly Report}
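If you build this check into an application, you can treat the "I don't know" reply as a sentinel value and handle it explicitly rather than passing a guess downstream. A minimal sketch in Python; call_model is a hypothetical placeholder for whatever client call you actually use:

```python
def call_model(prompt: str) -> str:
    """Placeholder: substitute your actual model call (SDK or HTTP request)."""
    raise NotImplementedError


def answer_from_report(question: str, report: str) -> str:
    prompt = (
        f"{question} "
        'If the answer is not in the provided report, reply only with "I don\'t know"\n\n'
        f"{report}"
    )
    answer = call_model(prompt)
    if answer.strip().strip('."').lower() == "i don't know":
        # Surface the gap instead of returning a hallucination-prone guess.
        return "The report does not answer this question."
    return answer
```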
System prompt - Use it!
Describe the role that the LLM should assume when answering the question. This is frequently referred to as a system prompt, and has been found to produce better completions for many families of LLMs. This affects not only the tone and language used, but also the amount of detail and level of expertise used.
System prompts should also be used to guide the perspective the model takes when answering the question; for example, by thinking about the problem as a research assistant, or a customer, or a novice.
When accessing the model in code, the system prompt is specified by an initial role:system message. In the playground, provide the information in the System instructions section. Alternatively, you can put the system prompt directly into the prompt itself, although this might be less effective.
I want you to assume the role of a meticulous research assistant.
Your task is to evaluate if the following text extract is relevant to the case at hand.
You are a meticulous, critical research assistant.
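In code, the system prompt travels as the first message in the conversation. Here is a minimal sketch of what such a request payload might look like, using Python's requests library; the endpoint URL, model name, and API_KEY environment variable are placeholders, not a specific provider's real values:

```python
import os

import requests

CHAT_URL = "https://example.com/v1/chat/completions"  # placeholder endpoint

payload = {
    "model": "your-model-name",  # placeholder model identifier
    "messages": [
        # The system message sets the role, tone, and level of expertise.
        {"role": "system", "content": "You are a meticulous, critical research assistant."},
        {"role": "user", "content": "Evaluate whether the following text extract is relevant to the case at hand: ..."},
    ],
}

response = requests.post(
    CHAT_URL,
    headers={"Authorization": f"Bearer {os.environ['API_KEY']}"},
    json=payload,
    timeout=30,
)
print(response.json())
```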
The IDH Template (Instruction—Data—Hint)
For complex prompts, include instructions, then data, then a hint. This can be used for simple prompts as well.
The hint should be a paraphrased version of the instructions.
Rewrite the following patient record, so that it is easily understandable for an average person with a high school degree.
{PATIENT RECORD}
Rewrite the following patient record, so that it is easily understandable for an average person with a high school degree.
{PATIENT RECORD}
high school level rewrite:
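In code, a small template function keeps the instruction-data-hint ordering consistent across calls. A minimal sketch; the patient record value is a placeholder:

```python
def build_idh_prompt(instruction: str, data: str, hint: str) -> str:
    # Instruction first, then the data, then a short paraphrase of the
    # instruction as a hint for the completion to start from.
    return f"{instruction}\n\n{data}\n\n{hint}"


prompt = build_idh_prompt(
    instruction=(
        "Rewrite the following patient record, so that it is easily "
        "understandable for an average person with a high school degree."
    ),
    data="{PATIENT RECORD}",  # placeholder for the actual record text
    hint="high school level rewrite:",
)
```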
For prompts with a lot of data, it is better to clearly indicate to the model where every section starts and ends.
Your task is to fix the product description to be compliant with the product guidance.
{Product Description}
{Product Guidance}
Your product description:
Your task is to fix the product description to be compliant with the product guidance.
Original product description:
{Product Description}
– End of Description –
Product Guidance:
{Product Guidance}
– End of Guidance –
Your product description:
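A small helper that wraps each block of data in explicit start and end markers keeps long prompts unambiguous. A minimal sketch; the description and guidance values are placeholders:

```python
def section(title: str, body: str) -> str:
    # Explicit start and end markers, so the model knows exactly where
    # each block of data begins and ends.
    return f"{title}:\n{body}\n-- End of {title.lower()} --\n\n"


prompt = (
    "Your task is to fix the product description to be compliant with the product guidance.\n\n"
    + section("Original product description", "{Product Description}")
    + section("Product guidance", "{Product Guidance}")
    + "Your product description:"
)
```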
For some prompts with complex instructions, it is useful to include the instruction both at the beginning and at the end of the prompt.
Your task is to fix the product description to be compliant with the product guidance.
Original product description:
{Product Description}
– End of Description –
Product Guidance:
{VERY COMPLEX Product Guidance}
– End of Guidance –
Your product description:
Your task is to fix the product description to be compliant with the product guidance.
Original product description:
{Product Description}
– End of Description –
Product Guidance:
{VERY COMPLEX Product Guidance}
– End of Guidance –
The rewritten product description, in accordance with the product guidance:
Use structured output when needed
If your output is meant to be read by another system (e.g., for integration into a pipeline), request a JSON-formatted response.
Use the response_format=json API parameter and specify the expected structure in the prompt itself.
Extract the user’s name, location, and request from the input text.
Extract the user’s name, location, and request from the input text.
Return the output in the following JSON format:
{
"name": "",
"location": "",
"request": ""
}
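Because the model's reply still comes back as text, parse and validate it before handing it to the next system. A minimal sketch using only the standard library; the sample reply string is illustrative:

```python
import json

EXPECTED_KEYS = {"name", "location", "request"}


def parse_extraction(reply: str) -> dict:
    data = json.loads(reply)  # raises json.JSONDecodeError on malformed output
    missing = EXPECTED_KEYS - data.keys()
    if missing:
        raise ValueError(f"Model reply is missing keys: {missing}")
    return data


# Illustrative reply; in practice this string comes from the model.
reply = '{"name": "Dana", "location": "Haifa", "request": "reset my password"}'
print(parse_extraction(reply))
```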
Use the appropriate tool
For straightforward math calculations or other actions that can be done in simple code, use a more appropriate tool for the job (a calculator, a macro, a short code snippet). Those tools are designed specifically for the job, and provide much more controllable and consistent output than an LLM.
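For example, a quarter-over-quarter growth figure that you might be tempted to ask the model for is a one-liner in ordinary code, with an exact and repeatable result (the revenue figures below are made up for illustration):

```python
# Exact, repeatable arithmetic -- no prompt, no temperature, no variability.
q1_revenue = 1_250_000
q2_revenue = 1_410_000
growth_pct = (q2_revenue - q1_revenue) / q1_revenue * 100
print(f"Quarter-over-quarter growth: {growth_pct:.1f}%")  # 12.8%
```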
Ask the LLM if it understands the prompt
To speed up development, consider first asking the LLM if it understands the instructions and other key terms in the prompt. This can help ensure that the LLM understands the core idea of what you are trying to do.
Don’t do math
LLMs are famously bad at math. They’re getting better, but it’s still generally not a good idea to ask an LLM to solve a word problem for you. LLMs can count and do sums, but word problems and logic are tricky and produce inconsistent results. For simple or straightforward math, use a calculator or other more appropriate tool.
If you must evaluate a word problem or a logic question, you can increase accuracy by asking the LLM to explain each step it takes in the process (called chain-of-thought prompting).
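A chain-of-thought request is just an ordinary prompt with an explicit instruction to reason step by step before giving the final answer. A minimal sketch of how you might phrase it (the word problem is invented for illustration):

```python
question = (
    "A clinic sees 14 patients per day and is open 5 days a week. "
    "How many patients does it see in 6 weeks?"
)

# Asking for the intermediate steps tends to improve accuracy on word problems.
cot_prompt = (
    f"{question}\n"
    "Explain each step of your reasoning, one step per line, "
    "then give the final answer on a separate line starting with 'Answer:'."
)
```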
When performing classification or other scoring, it is much better to use categories that are meaningful rather than arbitrary numbers.
Analyze whether the two sentences provided are consistent or inconsistent. Provide a score between 1 and 3.
Sentence A:
“When talking on the phone, the defendant confessed to felony murder.”
Sentence B:
“The defendant admitted nothing when talking on the phone.”
Your score:
Analyze whether the two sentences provided are consistent or inconsistent. Classify into one of the following classes:
[Consistent, Partially consistent, Inconsistent]
Sentence A:
“When talking on the phone, the defendant confessed to felony murder.”
Sentence B:
“The defendant admitted nothing when talking on the phone.”
Your classification:
Output variability is influenced by two parameters, temperature and top P:
If you need completely consistent answers, such as for classification or math (but don’t do math), set the temperature to 0.
When you want some variation, start with a low temperature (0.2-0.3) and increase by tenths until you get the variability that you want. Typically you won’t need a temperature higher than 0.7. Note that setting the temperature higher than 0.7 can cause the LLM to wander, and setting it higher than 1.0 can cause extremely long and sometimes nonsensical output (definitely specify a max_tokens limit for high temperatures).
If the temperature is very high (greater than 0.7), reduce top P by a few tenths of a point to remove very unlikely results, unless you want very high creativity, in which case you can keep top P high (try a top P of 0.99 to omit the extremes).
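As a rough starting point, the two settings translate into request parameters along these lines (the snake_case names temperature, top_p, and max_tokens follow a common API convention; check your SDK for the exact spelling):

```python
# Deterministic tasks: classification, extraction, strict formatting.
deterministic_params = {"temperature": 0.0}

# Some variation: start low and raise by tenths until the output varies enough.
creative_params = {
    "temperature": 0.7,
    "top_p": 0.99,      # trim only the most unlikely tokens at high creativity
    "max_tokens": 500,  # failsafe against runaway output at high temperatures
}
```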
Limiting output length
If you have a length goal or limit for your output, specify it in the prompt as the desired or maximum number of lines, words, sentences, or paragraphs. Don’t expect the model to hit the mark exactly, as it can only see one word ahead, and it might need a bit more or less than you specify to provide a good answer. Examples: “Write two or three sentences about…,” or “Limit your answer to 10 words.”
The API supports a max_tokens parameter, but you should use it only as a failsafe to prevent the edge case of the model going off in a completely unexpected direction and far exceeding your length limits (higher temperatures can increase output length). This value is absolutely respected, so if the output hits this limit, the result might stop in the middle of a word. Set this value a fair bit higher than any prompt-suggested limit.
Requesting citations
LLMs have been known to invent citations, so asking Jamba Instruct for citations for its information is not a guarantee of accuracy. If you need absolutely reliable citations, use the RAG Engine.
Use labels, not numbers, to rate output
It is common for people to want a numerical scoring of “good” or “bad” (“on a scale of 1 to 10…”). Assigning exact numbers to subjective categories is hard for people, and harder for LLMs, and also gives a false sense of accuracy. Simple labels like “Bad,” “Okay,” and “Best” are easier for the language model to provide than precise numerical ratings like 4.7 or 6.8.
Note that although you can use numbers to represent categories, it is generally preferable to use category labels that are inherently meaningful, such as “None”, “Some”, and “Most”.
Providing examples of what you want to see (also called “few-shot prompting”) can be very useful to Jamba. Examples are especially helpful when the desired output is easier to show than to describe.
Before you provide examples, try out the results using just the instruction to see if you get the results you need. If the results look better with an example or two, or if describing something is harder than showing an example, then go ahead and use examples.
Some recommendations about using examples in your prompt:
In the following example, we use two types of delimiters: ### between examples and a newline between ads and answers. We’ve put the examples first and the instructions at the end. Content between [" "] marks is just a placeholder showing where you would put actual ads, answers, or criteria.
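That example isn't reproduced here, but the sketch below shows the same structure for an illustrative ad-review task; the bracketed ads, answers, and criteria are placeholders for your own content:

```python
# Examples first, instructions last; ### separates examples, and a newline
# separates each ad from its answer.
examples = [
    ("[AD 1]", "[ANSWER 1]"),
    ("[AD 2]", "[ANSWER 2]"),
]

few_shot_block = "\n###\n".join(f"{ad}\n{answer}" for ad, answer in examples)

prompt = (
    f"{few_shot_block}\n###\n"
    "[NEW AD]\n"
    "Decide whether the ad above meets the following criteria: [CRITERIA]. "
    "Answer in the same style as the answers above."
)
```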
Cover all cases in your examples
When providing examples, do your best to cover all relevant example types. For example, if the answer to a question posed to the LLM can be “yes” or “no,” provide examples where the answer is yes and examples where the answer is no. Ideally, the distribution should match real-life use cases as well.
In the following example, we provide prompt examples that show how to respond to both vague and specific feedback from the user.
Be aware of bias with matching examples
If you provide examples, the model can be biased toward input that closely resembles one of these examples. While this usually guides the model appropriately, it can also reduce its flexibility.
During development, and also after release (if you’re using your prompt in a production system) you’ll need to evaluate your results. Which methods you use depend on your usage scale and where you are in the development process.
For grading outputs with an absolute answer (such as classification or sentiment analysis), grade against a golden answer: create an ideal answer for each test input and grade the generated result against it on a scale of 1-10, or, for classification exercises, simply mark each result as correct or incorrect.
Human evaluation
If you have the resources, human evaluation is often the best solution. Typically, in the early stages of adjusting your prompts you’ll be evaluating the answers yourself. Create a list of criteria to consider when evaluating each answer (accuracy, clarity, usefulness). You can either rate each criterion individually or give a general overall score (1-5, good/OK/bad).
Use an LLM to evaluate your results
You can try using another LLM to rank your answers. If you do, use a different LLM than the one that generated the answer (LLMs, like people, tend to be biased toward their own answers). For example, if you have a large block of information and your prompt asks the model to answer a specific question based on that information, you might write a prompt like this to evaluate the result generated from your first prompt:
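The exact wording is up to you, but an evaluation prompt along these lines is a reasonable sketch; the bracketed placeholders stand in for your source text, the original question, and the generated answer being graded:

```python
# Note the use of meaningful labels rather than a numeric score, as recommended above.
evaluation_prompt = (
    "You are grading an answer produced by another assistant.\n\n"
    "Source information:\n{INFORMATION}\n-- End of information --\n\n"
    "Question: {QUESTION}\n"
    "Answer to grade: {GENERATED ANSWER}\n\n"
    "Using only the source information, classify the answer as one of: "
    "[Correct, Partially correct, Incorrect]. Briefly justify your classification."
)
```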
Log your prompts and responses in production. Periodically check the output quality for a random selection of your generated answers.
Provide a feedback button that sends you the question and generated response to help you improve your prompts.
If you have a complex task that requires several steps, you might want to break it into multiple prompts, and feed the output of each step into the prompt for the next step. That way you can fine-tune the results (and temperature) for each step.
For example, to provide a list of doctors for a patient with a specific symptom, you might break it into these steps:
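One plausible breakdown (the exact steps are an assumption for illustration): extract the symptom from the patient's message, map the symptom to a medical specialty, then look up doctors in that specialty with plain code. A minimal sketch; call_model is a hypothetical placeholder for your client call, and the directory dict stands in for your own data source:

```python
def call_model(prompt: str) -> str:
    """Placeholder: substitute your actual model call (SDK or HTTP request)."""
    raise NotImplementedError


def doctors_for_symptom(patient_message: str, directory: dict[str, list[str]]) -> list[str]:
    # Step 1: extract the symptom (a deterministic task, so use temperature 0).
    symptom = call_model(
        f"Extract the main symptom from this message, in one or two words:\n{patient_message}"
    )
    # Step 2: map the symptom to a single medical specialty.
    specialty = call_model(
        f"Which single medical specialty best treats this symptom: {symptom}? "
        "Reply with the specialty name only."
    )
    # Step 3: plain code, not the LLM, does the final lookup.
    return directory.get(specialty.strip(), [])
```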