Optimize Code Development with LLMs on the Intel® Developer Cloud

Whether you are a seasoned developer or learning to code in a new language, large language models (LLMs) can be an invaluable tool for many programming tasks. Code Llama is a generative text model developed by Meta* researchers that is designed for general code synthesis and understanding. The Code Llama family of large language models comes in a range of sizes and specializations: a base model, a Python* specific model, and an instruction-following model. Each model is available in sizes of 7 billion, 13 billion, and 34 billion parameters.

The models used in this guide are powered by Intel® Data Center GPU Max Series. This general-purpose, discrete GPU packs over 100 billion transistors into one package and contains up to 128 Xe-cores, Intel’s foundational GPU compute building block. The GPU provides further acceleration for end-to-end AI and data analytics pipelines. You get the best performance by using libraries optimized for Intel architectures and configurations tuned for HPC and AI workloads, together with high-capacity storage and high-bandwidth memory.

You can try out the Intel-optimized code generation chatbot and access the full code in this tutorial under the Generative AI Essentials training section on the Intel Developer Cloud console. The Intel Developer Cloud offers early technology access to the Intel Data Center GPU Max Series as well as additional Intel hardware platforms, like Intel® Gaudi®2 AI accelerator deep learning servers and 4th generation Intel® Xeon® Scalable processors.

Before beginning this tutorial, go to the Intel Developer Cloud and create an account. Following are the steps to get started:

  1. Register for an Intel Developer Cloud account as a Standard user.
  2. Sign into your account and go to the Training and Workshops section.
  3. Under Gen AI Essentials, go to Optimize Code Generation with LLMs, and then select Launch as shown in Figure 1.

Figure 1: GenAI Essentials Notebooks on the Intel Developer Cloud.

In a new terminal, verify that you are running on an Intel Data Center GPU Max Series 1100 with the following command:

clinfo -l

The output from this command should display the following four Intel GPUs:

Platform #0: Intel(R) OpenCL Graphics
+-- Device #0: Intel(R) Data Center GPU Max 1100
+-- Device #1: Intel(R) Data Center GPU Max 1100
+-- Device #2: Intel(R) Data Center GPU Max 1100
`-- Device #3: Intel(R) Data Center GPU Max 1100
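
You can also confirm the devices from Python. The following is a minimal sketch, assuming an XPU-enabled build of Intel Extension for PyTorch, which registers the xpu device with PyTorch when imported:

import torch
import intel_extension_for_pytorch as ipex  # registers the "xpu" device with PyTorch

print(torch.xpu.device_count())      # expect 4 on this instance
print(torch.xpu.get_device_name(0))  # e.g. "Intel(R) Data Center GPU Max 1100"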

To display GPU usage statistics for Device #0 at one-second intervals, you can use the following command:

xpu-smi dump -d 0 -m 0,5,18

Your output should look similar to:

getpwuid error: Success
Timestamp, DeviceId, GPU Utilization (%), GPU Memory Utilization (%), GPU Memory Used (MiB)
13:34:51.000, 0, 0.02, 0.05, 28.75
13:34:52.000, 0, 0.00, 0.05, 28.75
13:34:53.000, 0, 0.00, 0.05, 28.75

Once the Code Llama models have been loaded into the notebook, Intel® Extension for PyTorch* accelerates model inferencing and code generation. This extension enhances the performance of PyTorch models on Intel® architectures with the newest features and optimizations that have not yet been released in the open source framework. This extension efficiently uses Intel hardware capabilities like the Intel® Xᵉ Matrix Extensions (Intel® XMX) on Intel discrete GPUs. Two of the key functions of the Intel Extension for PyTorch are:

import intel_extension_for_pytorch as ipex
ipex.optimize_transformers(model, dtype=torch_dtype)

and:

ipex.optimize(model, dtype=torch_dtype)

where model is the pretrained Code Llama model and torch_dtype is the data precision type. Set the data precision type to torch.bfloat16 to boost performance on Intel discrete GPUs.
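
Putting these pieces together, the following is a minimal sketch (not the notebook’s exact code) that loads a Code Llama checkpoint with Hugging Face Transformers, moves it to the Intel GPU, and applies the optimization. The codellama/CodeLlama-7b-hf model ID and the xpu device name are assumptions based on the public Hugging Face checkpoints and the Intel Extension for PyTorch GPU backend:

import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "codellama/CodeLlama-7b-hf"  # assumed Hugging Face checkpoint; requires accepting Meta's license
torch_dtype = torch.bfloat16            # bfloat16 boosts performance on Intel discrete GPUs

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch_dtype)
model = model.eval().to("xpu")          # "xpu" is the device name exposed by the IPEX GPU backend

# Apply the transformer-specific optimizations described above
model = ipex.optimize_transformers(model, dtype=torch_dtype)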

Choose from three Code Llama models in the notebook, depending on your needs; a sketch of their Hugging Face checkpoint IDs follows this list:

  1. Code Llama: The foundational code model for general-purpose programming tasks and code completion.
  2. Code Llama – Python: This model is specialized for Python code generation.
  3. Code Llama – Instruct: This model is specialized to understand and follow natural language instructions.
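
If you want to run these models outside the notebook, the three variants correspond to publicly released Hugging Face checkpoints. The IDs below are the 7-billion-parameter versions and are assumptions based on Meta’s published model cards; the 13B and 34B variants follow the same naming pattern:

# Assumed Hugging Face model IDs for the 7B variants
CODE_LLAMA_MODELS = {
    "Code Llama": "codellama/CodeLlama-7b-hf",
    "Code Llama - Python": "codellama/CodeLlama-7b-Python-hf",
    "Code Llama - Instruct": "codellama/CodeLlama-7b-Instruct-hf",
}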

Figure 2 shows the user interface of the Code Llama chatbot within the Jupyter* Notebook. To get started, choose one of the three Code Llama models from the dropdown menu that best suits your programming task, select whether you would like to interact with or without context, and use the sliders to adjust the following parameters (a sketch of how they map to a generate() call follows the list):

  • Temperature: The temperature for controlling randomness in a Boltzmann distribution. Higher values increase randomness, while lower values make the generation more deterministic.
  • Top P: The cumulative distribution function threshold for nucleus sampling. This helps in controlling the trade-off between randomness and diversity.
  • Top K: The number of highest probability vocabulary tokens to keep for top-k filtering.
  • Num Beams: The number of beams for beam search. This controls the breadth of the search.
  • Rep Penalty: The repetition penalty applied for repeating tokens.
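
As an illustration, here is a minimal sketch of how these settings map to a Hugging Face transformers generate() call, using the model and tokenizer prepared earlier; the values shown are illustrative defaults, not the notebook’s exact configuration:

prompt = "def fibonacci(n):"
inputs = tokenizer(prompt, return_tensors="pt").to("xpu")

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=128,
        do_sample=True,
        temperature=0.7,         # higher = more random
        top_p=0.95,              # nucleus-sampling threshold
        top_k=50,                # keep only the 50 most likely tokens
        num_beams=1,             # breadth of beam search (1 = no beam search)
        repetition_penalty=1.1,  # penalize repeated tokens
    )

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))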

Figure 2: User interface of the Code Llama chatbot.

The following figure shows the Code Llama chatbot for Python in action, generating code using the default model parameters. You can adjust these parameters as needed to optimize the model’s responses.

Figure 3: Code generated by the Code Llama chatbot for Python.

Try out this model and the other Code Llama LLMs, like the foundational code model and the Instruct model, on the Intel Developer Cloud today. Thank you for reading and happy coding!

Notices and Disclaimers

More information on Code Llama can be found in the paper Code Llama: Open Foundation Models for Code or its arXiv page. Before using these models, review the Llama Responsible Use Guide: Your Resource for Building Responsibly and the license agreement at Llama Access Request Form - Meta* AI.

Next Steps

Come chat with us on the DevHub Discord* server to keep interacting with fellow developers.