Scikit-LLM: Bridging the Gap Between LLM and Scikit-learn

Leveraging the Power of Advanced Language Models in Python for Natural Language Processing Workflows

Author: Esmaeil Alizadeh
Published: June 6, 2023

Introduction

Scikit-LLM is a Python package that integrates large language models (LLMs), such as OpenAI’s GPT-3.5, into the scikit-learn framework for text analysis tasks.

Scikit-LLM is designed to work within the scikit-learn framework, so if you’re familiar with scikit-learn, you’ll feel right at home with Scikit-LLM. The library offers a range of features, of which we will cover the following [1]:

  • Zero-shot text classification
  • Multi-label zero-shot text classification
  • Text vectorization
  • Text translation
  • Text summarization

Installation

You can install the library via pip:

pip install scikit-llm

Configuration

Before you start using Scikit-LLM, you need to pass your OpenAI API key to the library. You can check out this post on setting up your OpenAI API key.

import os

from dotenv import load_dotenv
from skllm.config import SKLLMConfig

# Load the OpenAI credentials from a .env file
load_dotenv()

OPENAI_SECRET_KEY = os.environ["OPENAI_SECRET_KEY"]
OPENAI_ORG_ID = os.environ["OPENAI_ORG_ID"]

# Pass the credentials to Scikit-LLM
SKLLMConfig.set_openai_key(OPENAI_SECRET_KEY)
SKLLMConfig.set_openai_org(OPENAI_ORG_ID)
Caution

Please note that Scikit-LLM provides a convenient interface to OpenAI’s GPT models, which are not free to use and require an API key. While the per-call cost is relatively low, it can add up depending on the volume of your data and the frequency of calls. Therefore, it’s important to plan and manage your usage carefully to control costs. Always review OpenAI’s pricing details and terms of use before getting started with Scikit-LLM.

To give you a rough idea, I ran this notebook at least five times to make this tutorial, and the total cost was US $0.02. I have to say, I thought this would be higher!
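
If you want a rough, local estimate of how many tokens (and hence how much money) a job will consume before sending anything to the API, you can count tokens with OpenAI’s tiktoken package. The snippet below is a minimal sketch; the per-1K-token price is a placeholder that you should replace with the current value from OpenAI’s pricing page.

import tiktoken

# Tokenizer matching the gpt-3.5-turbo model
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")

texts = [
    "This movie was absolutely wonderful.",
    "I didn't enjoy the film that much.",
]

# Count the prompt tokens locally (completion tokens are not included)
n_tokens = sum(len(encoding.encode(text)) for text in texts)

# Placeholder price (USD per 1K tokens) -- check OpenAI's pricing page
price_per_1k_tokens = 0.002

print(f"~{n_tokens} prompt tokens, ~${n_tokens / 1000 * price_per_1k_tokens:.5f}")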

Zero-Shot Text Classification

One of the features of Scikit-LLM is the ability to perform zero-shot text classification. Scikit-LLM provides two classes for this purpose:

  • ZeroShotGPTClassifier: used for single-label classification (e.g., sentiment analysis),
  • MultiLabelZeroShotGPTClassifier: used for multi-label classification tasks.

Single-Label ZeroShotGPTClassifier

Let’s do a sentiment analysis of a few movie reviews. We define the sentiment of each review in the variable movie_review_labels, train the model on these reviews and labels, and then use it to predict the sentiment of new movie reviews.

The sample dataset for the movie reviews is given below:

movie_reviews = [
    "This movie was absolutely wonderful. The storyline was compelling and the characters were very realistic.",
    "I really loved the film! The plot had a few unexpected twists which kept me engaged till the end.",
    "The movie was alright. Not great, but not bad either. A decent one-time watch.",
    "I didn't enjoy the film that much. The plot was quite predictable and the characters lacked depth.",
    "This movie was not to my taste. It felt too slow and the storyline wasn't engaging enough.",
    "The film was okay. It was neither impressive nor disappointing. It was just fine.",
    "I was blown away by the movie! The cinematography was excellent and the performances were top-notch.",
    "I didn't like the movie at all. The story was uninteresting and the acting was mediocre at best.",
    "The movie was decent. It had its moments but was not consistently engaging."
]

movie_review_labels = [
    "positive", 
    "positive", 
    "neutral", 
    "negative", 
    "negative", 
    "neutral", 
    "positive", 
    "negative", 
    "neutral"
]

new_movie_reviews = [
    # A positive review
    "The movie was fantastic! I was captivated by the storyline from beginning to end.",

    # A negative review
    "I found the film to be quite boring. The plot moved too slowly and the acting was subpar.",

    # A neutral review
    "The movie was okay. Not the best I've seen, but certainly not the worst."
]

Let’s train the model and then check what the model predicts for each new review.

from skllm import ZeroShotGPTClassifier

# Initialize the classifier with the OpenAI model
clf = ZeroShotGPTClassifier(openai_model="gpt-3.5-turbo")

# Train the model 
clf.fit(X=movie_reviews, y=movie_review_labels)  

# Use the trained classifier to predict the sentiment of the new reviews
predicted_movie_review_labels = clf.predict(X=new_movie_reviews)  

for review, sentiment in zip(new_movie_reviews, predicted_movie_review_labels):
    print(f"Review: {review}\nPredicted Sentiment: {sentiment}\n\n")
Review: The movie was fantastic! I was captivated by the storyline from beginning to end.
Predicted Sentiment: positive


Review: I found the film to be quite boring. The plot moved too slowly and the acting was subpar.
Predicted Sentiment: negative


Review: The movie was okay. Not the best I've seen, but certainly not the worst.
Predicted Sentiment: neutral

As can be seen above, the model predicted the sentiment of each movie review correctly.
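
Because Scikit-LLM estimators follow the scikit-learn API, the predictions can be evaluated with the usual scikit-learn utilities. The snippet below is a minimal sketch; the expected_labels list is something I added for illustration and is not part of the original dataset.

from sklearn.metrics import accuracy_score

# Hypothetical ground-truth labels for the three new reviews (for illustration only)
expected_labels = ["positive", "negative", "neutral"]

accuracy = accuracy_score(expected_labels, predicted_movie_review_labels)
print(f"Accuracy on the new reviews: {accuracy:.2f}")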

Multi-Label ZeroShotGPTClassifier

In the previous section, we had a single-label classifier ([β€œpositive”, β€œnegative”, β€œneutral”]). Here, we are going to use the MultiLabelZeroShotGPTClassifier estimator to assign multiple labels to a list of restaurant reviews.

restaurant_reviews = [
    "The food was delicious and the service was excellent. A wonderful dining experience!",
    "The restaurant was in a great location, but the food was just average.",
    "The service was very slow and the food was cold when it arrived. Not a good experience.",
    "The restaurant has a beautiful ambiance, and the food was superb.",
    "The food was great, but I found it to be a bit overpriced.",
    "The restaurant was conveniently located, but the service was poor.",
    "The food was not as expected, but the restaurant ambiance was really nice.",
    "Great food and quick service. The location was also very convenient.",
    "The prices were a bit high, but the food quality and the service were excellent.",
    "The restaurant offered a wide variety of dishes. The service was also very quick."
]

restaurant_review_labels = [
    ["Food", "Service"],
    ["Location", "Food"],
    ["Service", "Food"],
    ["Atmosphere", "Food"],
    ["Food", "Price"],
    ["Location", "Service"],
    ["Food", "Atmosphere"],
    ["Food", "Service", "Location"],
    ["Price", "Food", "Service"],
    ["Food Variety", "Service"]
]

new_restaurant_reviews = [
    "The food was excellent and the restaurant was located in the heart of the city.",
    "The service was slow and the food was not worth the price.",
    "The restaurant had a wonderful ambiance, but the variety of dishes was limited."
]

Let’s train the model and then predict the labels for new reviews.

from skllm import MultiLabelZeroShotGPTClassifier

# Initialize the classifier, allowing at most 3 labels per review (uses the default OpenAI model)
clf = MultiLabelZeroShotGPTClassifier(max_labels=3)

# Train the model 
clf.fit(X=restaurant_reviews, y=restaurant_review_labels)

# Use the trained classifier to predict the labels of the new reviews
predicted_restaurant_review_labels = clf.predict(X=new_restaurant_reviews)

for review, labels in zip(new_restaurant_reviews, predicted_restaurant_review_labels):
    print(f"Review: {review}\nPredicted Labels: {labels}\n\n")
Review: The food was excellent and the restaurant was located in the heart of the city.
Predicted Labels: ['Food', 'Location']


Review: The service was slow and the food was not worth the price.
Predicted Labels: ['Service', 'Price']


Review: The restaurant had a wonderful ambiance, but the variety of dishes was limited.
Predicted Labels: ['Atmosphere', 'Food Variety']

The predicted labels for each review are spot-on.
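
If you need the multi-label predictions as a matrix, for example to compute multi-label metrics or to feed them into another model, scikit-learn’s MultiLabelBinarizer can convert the lists of labels into a binary indicator matrix. The snippet below is a minimal sketch using the predictions from above.

from sklearn.preprocessing import MultiLabelBinarizer

# Convert the predicted label lists into a binary indicator matrix
mlb = MultiLabelBinarizer()
label_matrix = mlb.fit_transform(predicted_restaurant_review_labels)

print(mlb.classes_)  # the label vocabulary discovered from the predictions
print(label_matrix)  # one row per review, one column per label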

Text Vectorization

Scikit-LLM provides the GPTVectorizer class to convert the input text into a fixed-dimensional vector representation. Each resulting vector is an array of floating-point numbers that represents the corresponding sentence.

Let’s get a vectorized representation of the following sentences.

X = [
    "AI can revolutionize industries.",
    "Robotics creates automated solutions.",
    "IoT connects devices for data exchange."
]
from skllm.preprocessing import GPTVectorizer

# Initialize the vectorizer and convert the sentences into embedding vectors
vectorizer = GPTVectorizer()
vectors = vectorizer.fit_transform(X)

print(vectors)
[[-0.00818074 -0.02555227 -0.00994665 ... -0.00266894 -0.02135153
   0.00325925]
 [-0.00944166 -0.00884305 -0.01260475 ... -0.00351341 -0.01211498
  -0.00738735]
 [-0.01084771 -0.00133671  0.01582962 ...  0.01247486 -0.00829649
  -0.01012453]]

In practice, these vectors serve as inputs to other machine learning models for tasks like classification, clustering, or regression, rather than being examined directly.
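
Because GPTVectorizer behaves like any other scikit-learn transformer, it can be combined with a downstream estimator in a regular scikit-learn Pipeline. The sketch below reuses the movie reviews from the earlier section and trains a logistic regression classifier on top of the embeddings; the choice of classifier is just an illustration, not a recommendation.

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from skllm.preprocessing import GPTVectorizer

# Chain the GPT-based embeddings with a standard scikit-learn classifier
pipeline = Pipeline([
    ("vectorizer", GPTVectorizer()),
    ("classifier", LogisticRegression(max_iter=1000)),
])

# Fit on the movie reviews from the zero-shot classification section
pipeline.fit(movie_reviews, movie_review_labels)
print(pipeline.predict(new_movie_reviews))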

Text Translation

GPT models can produce accurate translations from one language to another. We can translate a text into a language of interest using the GPTTranslator module.

from skllm.preprocessing import GPTTranslator

# Initialize the translator to translate text into English
translator = GPTTranslator(openai_model="gpt-3.5-turbo", output_language="English")

text_to_translate = ["Je suis content que vous lisiez ce post."]
# "I am happy that you are reading this post."

translated_text = translator.fit_transform(text_to_translate)

print(
    f"Text in French: \n{text_to_translate[0]}\n\nTranslated text in English: {translated_text[0]}"
)
Text in French: 
Je suis content que vous lisiez ce post.

Translated text in English: I am glad that you are reading this post.

Text Summarization

GPT models are very useful for summarizing texts. The Scikit-LLM library provides the GPTSummarizer estimator for text summarization. Let’s see that in action by summarizing the long reviews given below.

reviews = [
    "I dined at The Gourmet Kitchen last night and had a wonderful experience. The service was impeccable, the food was exquisite, and the ambiance was delightful. I had the seafood pasta, which was cooked to perfection. The wine list was also quite impressive. I would highly recommend this restaurant to anyone looking for a fine dining experience.",
    "I visited The Burger Spot for lunch today and was pleasantly surprised. Despite being a fast food joint, the quality of the food was excellent. I ordered the classic cheeseburger and it was juicy and flavorful. The fries were crispy and well-seasoned. The service was quick and the staff was friendly. It's a great place for a quick and satisfying meal.",
    "The Coffee Corner is my favorite spot to work and enjoy a good cup of coffee. The atmosphere is relaxed and the coffee is always top-notch. They also offer a variety of pastries and sandwiches. The staff is always welcoming and the service is fast. I enjoy their latte and the blueberry muffin is a must-try."
]
from skllm.preprocessing import GPTSummarizer

# Initialize the GPT summarizer model
gpt_summarizer = GPTSummarizer(openai_model="gpt-3.5-turbo", max_words=15)

summaries = gpt_summarizer.fit_transform(reviews)

print(summaries)
['The Gourmet Kitchen offers impeccable service, exquisite food, delightful ambiance, and impressive wine list. Highly recommended.'
 'The Burger Spot offers excellent quality fast food with friendly service.'
 'The Coffee Corner is a great place to work with good coffee and food.']

A short summary of each review is generated. The max_words parameter sets a rough upper bound on the summary’s length; in practice, it may be slightly longer.
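
Since GPTSummarizer is also implemented as a scikit-learn transformer, it can be chained with other transformers such as GPTVectorizer. The sketch below is purely illustrative: it first shortens each review and then embeds the summaries; whether this is useful depends on your downstream task.

from sklearn.pipeline import Pipeline
from skllm.preprocessing import GPTSummarizer, GPTVectorizer

# Summarize each review first, then embed the (shorter) summaries
summarize_then_embed = Pipeline([
    ("summarizer", GPTSummarizer(openai_model="gpt-3.5-turbo", max_words=15)),
    ("vectorizer", GPTVectorizer()),
])

review_vectors = summarize_then_embed.fit_transform(reviews)
print(review_vectors.shape)  # (number of reviews, embedding dimension)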

Jupyter Notebook

πŸ““ You can find the Jupyter notebook for this tutorial on GitHub.

Conclusion

Scikit-LLM is a powerful tool that adds the capabilities of advanced language models like GPT-3.5 to the well-known scikit-learn framework. In this tutorial, we looked at some of Scikit-LLM’s most important features: zero-shot text classification, multi-label zero-shot text classification, text vectorization, text translation, and text summarization.

Scikit-LLM is a promising tool that opens up new possibilities in the realm of text analysis using large language models. This library might be a useful addition to your toolbox; therefore, I suggest giving it a try.

References

[1]
Iryna Kondrashchenko, β€œScikit-LLM: Sklearn Meets Large Language Models,” 2023. https://github.com/iryna-kondr/scikit-llm

Citation

BibTeX citation:
@online{alizadeh2023,
  author = {Alizadeh, Esmaeil},
  title = {Scikit-LLM: {Bridging} the {Gap} {Between} {LLM} and
    {Scikit-learn}},
  date = {2023-06-06},
  url = {https://ealizadeh.com/blog/tutorial-scikit-llm},
  langid = {en}
}
For attribution, please cite this work as:
E. Alizadeh, β€œScikit-LLM: Bridging the Gap Between LLM and Scikit-learn,” Jun. 06, 2023. https://ealizadeh.com/blog/tutorial-scikit-llm
