Running Llama3 model inference on Intel CPU and Intel GPU (1)

Introduction

On April 18, 2024, Meta released Llama 3, its latest and most capable open-source large language model (LLM), a major leap over the previous Llama 2. This release features pretrained and instruction-fine-tuned language models with 8B and 70B parameters. According to Meta, the Llama 3 8B and 70B models are just the beginning, with more to come, including models with over 400B parameters that are still training.

Meta has made the Llama 3 models available for download on the Llama 3 website and provided a Getting Started Guide with the latest list of all available platforms. After reading this, I started to try out these models. One piece of great news is that on the same day (4/18/2024), Intel announced that its CPUs and GPUs have been validated to support the Llama 3 8B and 70B models (refer to Llama 3 with Intel AI Solutions). As my initial experiments with these latest LLM models, I had the opportunity to run Meta Llama 3 8B inference on Intel CPUs and on an Intel® Arc™ GPU using IPEX-LLM. IPEX-LLM is a PyTorch library for running LLMs on Intel CPU and GPU with very low latency.

Run the Llama3 8B inference on Intel CPUs

We can use the IPEX-LLM optimize_model API to accelerate Llama 3 models on the CPU. Here are the steps:

  1. Install IPEX-LLM and set the environment variables on Linux:

$ git clone https://github.com/intel-analytics/ipex-llm.git

$ pip install ipex-llm

$ source ipex-llm-init

2. Create a conda environment to manage the Python environment:

  • Install the latest 64-bit version of the Miniconda installer and then clean up the installer file:

$mkdir -p ~/miniconda3

$wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh

  $bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3

  $rm -rf ~/miniconda3/miniconda.sh

  • Initialize the newly installed Miniconda. Run the following commands to initialize it for bash and zsh shells:

  $~/miniconda3/bin/conda init bash
  $~/miniconda3/bin/conda init zsh

  • Create a Python environment for IPEX-LLM and install the required packages:

   $ conda create -n llm python=3.11   # Python 3.11 is recommended
   $ conda activate llm

   # Install the latest ipex-llm nightly build with the 'all' option
   $ pip install --pre --upgrade ipex-llm[all]

   # Install transformers 4.37.0 (>=4.33.0 is required for Llama3 with IPEX-LLM optimizations)
   $ pip install transformers==4.37.0
3. Run the inference example for a Llama 3 model to predict the next N tokens using the generate() API, with IPEX-LLM INT4 optimizations (a minimal sketch of such a script appears after the examples below):

For example, we give the prompt 'What is AI?' to run inference with this Llama 3 model.

$ python ./generate.py --prompt 'What is AI?'

With the default setting of 32 output tokens, it produced output like that shown in the following screenshot:

We can also give a different prompt to run inference with this Llama 3 model.

$ python ./generate.py --prompt 'What is llama3?'

With the default setting of 32 output tokens, the result looks like this:

As part of its output, the program reports the inference time for the 32 tokens (default value). For example, the inference time in the run above is about 2.944 seconds.
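For reference, here is a minimal Python sketch (not the actual generate.py from the ipex-llm repository) of how such INT4 CPU inference can be written with the IPEX-LLM optimize_model API. The model path is a placeholder, and the prompt formatting simply uses the tokenizer's Llama 3 chat template.

# Minimal sketch: Llama 3 8B INT4 inference on an Intel CPU with IPEX-LLM.
# The model path is a placeholder; point it at a local Llama 3 8B Instruct checkpoint.
import time

from transformers import AutoModelForCausalLM, AutoTokenizer
from ipex_llm import optimize_model

model_path = "/path/to/Meta-Llama-3-8B-Instruct"   # placeholder

model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype="auto", low_cpu_mem_usage=True)
model = optimize_model(model)   # applies IPEX-LLM low-bit optimization (INT4 by default)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Build the Llama 3 chat prompt from a single user message
messages = [{"role": "user", "content": "What is AI?"}]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")

start = time.perf_counter()
output = model.generate(input_ids, max_new_tokens=32)   # predict the next 32 tokens
end = time.perf_counter()

print(tokenizer.decode(output[0], skip_special_tokens=True))
print(f"Inference time: {end - start:.6f} s")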

Run the Llama3 8B inference on an Intel Arc A770 GPU

I ran the Llama 3 8B inference on a system with Intel® Arc™ A770 Graphics, which has 16 GB of memory and 32 Xe-cores.

Here are the steps:

  1. Create a conda virtual environment:

$conda create -n llama3-test python=3.11
$conda activate llama3-test

$cd llama3

  2. Install IPEX-LLM for XPU:

$git clone https://github.com/intel-analytics/ipex-llm.git

# The command below installs intel_extension_for_pytorch==2.1.10+xpu by default
$ pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/

  3. Install transformers 4.37.0:

# transformers>=4.33.0 is required for Llama3 with IPEX-LLM optimizations 

$pip install transformers==4.37.0

  4. Set the environment variables:

$source /opt/intel/oneapi/setvars.sh

$export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
$export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
$export SYCL_CACHE_PERSISTENT=1
$export ENABLE_SDP_FUSION=1

  5. Run the inference example for a Llama 3 model to predict the next N tokens using the generate() API, with IPEX-LLM INT4 optimizations (a minimal sketch of the XPU version appears at the end of this section):

$ python ./generate.py --repo-id-or-model-path /mnt/disk1/models/llama3-8b-instruction-hf --prompt 'What is AI?'

As part of its output, the program reports the inference time. For example, the inference time for the 32 tokens (default value) in the run above is about 0.474730 seconds.
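For reference, here is a minimal Python sketch (again, not the actual generate.py from the ipex-llm GPU examples) of how the same INT4 inference can be run on the Arc GPU, i.e. the XPU device, using IPEX-LLM's transformers-style API. The model path is a placeholder, and a real benchmark would normally add a warm-up generation before timing.

# Minimal sketch: Llama 3 8B INT4 inference on an Intel Arc GPU (XPU) with IPEX-LLM.
# The model path is a placeholder; point it at a local Llama 3 8B Instruct checkpoint.
import time

import torch
import intel_extension_for_pytorch  # registers the XPU device with PyTorch
from ipex_llm.transformers import AutoModelForCausalLM  # IPEX-LLM drop-in with low-bit loading
from transformers import AutoTokenizer

model_path = "/path/to/Meta-Llama-3-8B-Instruct"   # placeholder

# load_in_4bit=True converts the weights to INT4 while loading
model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True, use_cache=True)
model = model.half().to("xpu")                     # move the model to the Arc GPU
tokenizer = AutoTokenizer.from_pretrained(model_path)

messages = [{"role": "user", "content": "What is AI?"}]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to("xpu")

with torch.inference_mode():
    start = time.perf_counter()
    output = model.generate(input_ids, max_new_tokens=32)   # predict the next 32 tokens
    torch.xpu.synchronize()                                 # wait for the GPU to finish before stopping the timer
    end = time.perf_counter()

print(tokenizer.decode(output[0].cpu(), skip_special_tokens=True))
print(f"Inference time: {end - start:.6f} s")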

Posted April 28, 2024 by kyuoracleblog in Uncategorized

