from skllm.config import SKLLMConfig
import os
from dotenv import load_dotenv
load_dotenv()
OPENAI_SECRET_KEY = os.environ["OPENAI_SECRET_KEY"]
OPENAI_ORG_ID = os.environ["OPENAI_ORG_ID"]
SKLLMConfig.set_openai_key(OPENAI_SECRET_KEY)
SKLLMConfig.set_openai_org(OPENAI_ORG_ID)
Introduction
Scikit-LLM is a Python package that integrates large language models (LLMs) like OpenAI's GPT-3 into the scikit-learn framework for text analysis tasks.
Because it follows scikit-learn's conventions, you'll feel right at home with Scikit-LLM if you're already familiar with scikit-learn. The library offers a range of features, of which we will cover the following [1]:
- Zero-shot text classification
- Multi-label zero-shot text classification
- Text vectorization
- Text translation
- Text summarization
Installation
You can install the library via pip:
pip install scikit-llm
Configuration
Before you start using Scikit-LLM, you need to pass your OpenAI API key to Scikit-LLM. You can check out this post to set up your OpenAI API key.
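If a key is missing from your environment, a plain os.environ lookup fails with a bare KeyError. Below is a minimal sketch of a helper that fails with a clearer message instead. Note that require_env is a hypothetical name of my own, not part of Scikit-LLM or python-dotenv.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"Environment variable {name!r} is not set; "
            "add it to your shell or to a .env file before configuring Scikit-LLM."
        )
    return value


# Usage (variable names match the configuration code in this tutorial):
# SKLLMConfig.set_openai_key(require_env("OPENAI_SECRET_KEY"))
# SKLLMConfig.set_openai_org(require_env("OPENAI_ORG_ID"))
```

The helper only reads the environment, so it works the same whether the variables come from a .env file loaded by load_dotenv() or from your shell.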
Please note that Scikit-LLM provides a convenient interface to access OpenAI's GPT-3 models. Use of these models is not free and requires an API key. While the API cost is relatively cheap, depending on the volume of your data and frequency of calls, these costs can add up. Therefore, it's important to plan and manage your usage carefully to control costs. Always remember to review OpenAI's pricing details and terms of use before getting started with Scikit-LLM.
To give you a rough idea, I ran this notebook at least five times to make this tutorial, and the total cost was US $0.02. I have to say, I thought this would be higher!
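If you want a ballpark figure before running a larger dataset, a back-of-the-envelope estimate can help. The sketch below is only that: estimate_cost_usd is a hypothetical helper, the price passed in is a placeholder rather than an actual quote, and the rule of thumb of roughly four characters per token applies to English text only. It also ignores the prompt template and the tokens in the model's reply, so treat the result as a lower bound.

```python
def estimate_cost_usd(texts, price_per_1k_tokens, chars_per_token=4.0):
    """Roughly estimate the prompt cost of sending `texts` to a paid API.

    Both `price_per_1k_tokens` and `chars_per_token` are assumptions: look up
    the current price for your model, and treat ~4 characters per token as a
    rule of thumb for English text only.
    """
    total_chars = sum(len(text) for text in texts)
    estimated_tokens = total_chars / chars_per_token
    return estimated_tokens / 1000 * price_per_1k_tokens


sample_texts = ["The movie was fantastic!"] * 1000  # 24 characters each
# Hypothetical price of $0.002 per 1,000 tokens; not an actual quote.
print(f"${estimate_cost_usd(sample_texts, price_per_1k_tokens=0.002):.4f}")
```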
Zero-Shot Text Classification
One of the features of Scikit-LLM is the ability to perform zero-shot text classification. Scikit-LLM provides two classes for this purpose:
- ZeroShotGPTClassifier: used for single-label classification (e.g. sentiment analysis)
- MultiLabelZeroShotGPTClassifier: used for multi-label classification tasks
Single label ZeroShotGPTClassifier
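Let's do a sentiment analysis of a few movie reviews. For training purposes, we define the sentiment of each review in the variable movie_review_labels. We then fit the model on these reviews and labels so that it can predict the sentiment of new movie reviews. Note that, because the classifier is zero-shot, fitting does not fine-tune the GPT model itself; the labels tell the classifier which categories it can choose from.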
The sample dataset for the movie reviews is given below:
movie_reviews = [
"This movie was absolutely wonderful. The storyline was compelling and the characters were very realistic.",
"I really loved the film! The plot had a few unexpected twists which kept me engaged till the end.",
"The movie was alright. Not great, but not bad either. A decent one-time watch.",
"I didn't enjoy the film that much. The plot was quite predictable and the characters lacked depth.",
"This movie was not to my taste. It felt too slow and the storyline wasn't engaging enough.",
"The film was okay. It was neither impressive nor disappointing. It was just fine.",
"I was blown away by the movie! The cinematography was excellent and the performances were top-notch.",
"I didn't like the movie at all. The story was uninteresting and the acting was mediocre at best.",
"The movie was decent. It had its moments but was not consistently engaging."
]
movie_review_labels = [
"positive",
"positive",
"neutral",
"negative",
"negative",
"neutral",
"positive",
"negative",
"neutral"
]
new_movie_reviews = [
# A positive review
"The movie was fantastic! I was captivated by the storyline from beginning to end.",
# A negative review
"I found the film to be quite boring. The plot moved too slowly and the acting was subpar.",
# A neutral review
"The movie was okay. Not the best I've seen, but certainly not the worst."
]
Let's train the model and then check what the model predicts for each new review.
from skllm import ZeroShotGPTClassifier
# Initialize the classifier with the OpenAI model
clf = ZeroShotGPTClassifier(openai_model="gpt-3.5-turbo")
# Train the model
clf.fit(X=movie_reviews, y=movie_review_labels)
# Use the trained classifier to predict the sentiment of the new reviews
predicted_movie_review_labels = clf.predict(X=new_movie_reviews)
for review, sentiment in zip(new_movie_reviews, predicted_movie_review_labels):
    print(f"Review: {review}\nPredicted Sentiment: {sentiment}\n\n")
Review: The movie was fantastic! I was captivated by the storyline from beginning to end.
Predicted Sentiment: positive
Review: I found the film to be quite boring. The plot moved too slowly and the acting was subpar.
Predicted Sentiment: negative
Review: The movie was okay. Not the best I've seen, but certainly not the worst.
Predicted Sentiment: neutral
As can be seen above, the model predicted the sentiment of each movie review correctly.
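To build some intuition for how a zero-shot classifier can work without real training, here is a rough sketch of the general idea: the candidate labels and the text go into a prompt, and the model's answer is mapped back onto a known label. Both build_zero_shot_prompt and normalize_label are hypothetical helpers, and the prompt wording is my own illustration, not the prompt Scikit-LLM actually sends.

```python
def build_zero_shot_prompt(text, candidate_labels):
    """Build an illustrative classification prompt (hypothetical wording)."""
    label_list = ", ".join(candidate_labels)
    return (
        "Classify the text into exactly one of these categories: "
        f"{label_list}.\n"
        f"Text: {text}\n"
        "Answer with the category name only."
    )


def normalize_label(raw_answer, candidate_labels, default=None):
    """Map a free-form model answer back onto one of the known labels."""
    cleaned = raw_answer.strip().strip(".'\"").lower()
    for label in candidate_labels:
        if label.lower() == cleaned:
            return label
    return default


# In this sketch only the distinct labels matter, not the labeled texts.
labels = sorted(set(["positive", "positive", "neutral", "negative"]))
prompt = build_zero_shot_prompt("The movie was okay.", labels)
print(prompt)
print(normalize_label(" Neutral.", labels))
```

Mapping the answer back onto the label set is a defensive choice: a language model may add punctuation or change the capitalization, and anything that does not match a known label falls back to the default.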
Multi-label ZeroShotGPTClassifier
In the previous section, we had a single-label classifier with the labels ["positive", "negative", "neutral"]. Here, we are going to use the MultiLabelZeroShotGPTClassifier estimator to assign multiple labels to each review in a list of restaurant reviews.
restaurant_reviews = [
"The food was delicious and the service was excellent. A wonderful dining experience!",
"The restaurant was in a great location, but the food was just average.",
"The service was very slow and the food was cold when it arrived. Not a good experience.",
"The restaurant has a beautiful ambiance, and the food was superb.",
"The food was great, but I found it to be a bit overpriced.",
"The restaurant was conveniently located, but the service was poor.",
"The food was not as expected, but the restaurant ambiance was really nice.",
"Great food and quick service. The location was also very convenient.",
"The prices were a bit high, but the food quality and the service were excellent.",
"The restaurant offered a wide variety of dishes. The service was also very quick."
]
restaurant_review_labels = [
["Food", "Service"],
["Location", "Food"],
["Service", "Food"],
["Atmosphere", "Food"],
["Food", "Price"],
["Location", "Service"],
["Food", "Atmosphere"],
["Food", "Service", "Location"],
["Price", "Food", "Service"],
["Food Variety", "Service"]
]
new_restaurant_reviews = [
"The food was excellent and the restaurant was located in the heart of the city.",
"The service was slow and the food was not worth the price.",
"The restaurant had a wonderful ambiance, but the variety of dishes was limited."
]
Let's train the model and then predict the labels for new reviews.
from skllm import MultiLabelZeroShotGPTClassifier
# Initialize the classifier with the OpenAI model
clf = MultiLabelZeroShotGPTClassifier(max_labels=3)
# Train the model
clf.fit(X=restaurant_reviews, y=restaurant_review_labels)
# Use the trained classifier to predict the labels of the new reviews
predicted_restaurant_review_labels = clf.predict(X=new_restaurant_reviews)
for review, labels in zip(new_restaurant_reviews, predicted_restaurant_review_labels):
    print(f"Review: {review}\nPredicted Labels: {labels}\n\n")
Review: The food was excellent and the restaurant was located in the heart of the city.
Predicted Labels: ['Food', 'Location']
Review: The service was slow and the food was not worth the price.
Predicted Labels: ['Service', 'Price']
Review: The restaurant had a wonderful ambiance, but the variety of dishes was limited.
Predicted Labels: ['Atmosphere', 'Food Variety']
The predicted labels for each review are spot-on.
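Judging multi-label predictions by eye gets harder as the dataset grows. One common way to score them is the Jaccard similarity between the predicted and the expected label sets. The sketch below uses plain Python; the expected labels are hypothetical annotations I made up for illustration, so the scores say nothing about the model's real accuracy.

```python
def jaccard(predicted, expected):
    """Jaccard similarity between two label collections (1.0 = identical sets)."""
    predicted, expected = set(predicted), set(expected)
    if not predicted and not expected:
        return 1.0
    return len(predicted & expected) / len(predicted | expected)


# Predictions copied from the output above.
predicted = [
    ["Food", "Location"],
    ["Service", "Price"],
    ["Atmosphere", "Food Variety"],
]
# Hypothetical hand-made annotations, for illustration only.
expected = [
    ["Food", "Location"],
    ["Service", "Food", "Price"],
    ["Atmosphere", "Food Variety"],
]

scores = [jaccard(p, e) for p, e in zip(predicted, expected)]
print(scores)
```

Because the comparison is set-based, the order of the labels within a prediction does not affect the score.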
Text Vectorization
Scikit-LLM provides the GPTVectorizer class to convert the input text into a fixed-dimensional vector representation. Each resulting vector is an array of floating-point numbers that represents the corresponding sentence.
Let's get a vectorized representation of the following sentences.
X = [
"AI can revolutionize industries.",
"Robotics creates automated solutions.",
"IoT connects devices for data exchange."
]
from skllm.preprocessing import GPTVectorizer
vectorizer = GPTVectorizer()
vectors = vectorizer.fit_transform(X)
print(vectors)
[[-0.00818074 -0.02555227 -0.00994665 ... -0.00266894 -0.02135153
0.00325925]
[-0.00944166 -0.00884305 -0.01260475 ... -0.00351341 -0.01211498
-0.00738735]
[-0.01084771 -0.00133671 0.01582962 ... 0.01247486 -0.00829649
-0.01012453]]
In practice, rather than being examined directly, these vectors serve as inputs to other machine learning models for tasks like classification, clustering, or regression.
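One simple thing you can do with such vectors is to compare two sentences by the cosine similarity of their embeddings. The sketch below uses plain Python and made-up three-dimensional toy vectors as stand-ins, since real embeddings require an API call; cosine_similarity is a helper written for this example, not a Scikit-LLM function.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy 3-dimensional stand-ins for real embeddings (made-up numbers).
toy_vectors = {
    "ai": [0.9, 0.1, 0.0],
    "robotics": [0.8, 0.3, 0.1],
    "cooking": [0.0, 0.2, 0.9],
}

print(cosine_similarity(toy_vectors["ai"], toy_vectors["robotics"]))
print(cosine_similarity(toy_vectors["ai"], toy_vectors["cooking"]))
```

With these toy numbers, the first pair scores much higher than the second. With real embeddings, you would expect sentences on related topics to score higher than unrelated ones.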
Text Translation
GPT models can also translate text from one language to another. We can translate a text into a language of interest using the GPTTranslator estimator.
from skllm.preprocessing import GPTTranslator
from skllm.datasets import get_translation_dataset
translator = GPTTranslator(openai_model="gpt-3.5-turbo", output_language="English")
text_to_translate = ["Je suis content que vous lisiez ce post."]  # "I am happy that you are reading this post."
translated_text = translator.fit_transform(text_to_translate)
print(
    f"Text in French: \n{text_to_translate[0]}\n\nTranslated text in English: {translated_text[0]}"
)
Text in French:
Je suis content que vous lisiez ce post.
Translated text in English: I am glad that you are reading this post.
Text Summarization
GPT models are very useful for summarizing texts. The Scikit-LLM library provides the GPTSummarizer estimator for text summarization. Let's see that in action by summarizing the long reviews given below.
reviews = [
"I dined at The Gourmet Kitchen last night and had a wonderful experience. The service was impeccable, the food was exquisite, and the ambiance was delightful. I had the seafood pasta, which was cooked to perfection. The wine list was also quite impressive. I would highly recommend this restaurant to anyone looking for a fine dining experience.",
"I visited The Burger Spot for lunch today and was pleasantly surprised. Despite being a fast food joint, the quality of the food was excellent. I ordered the classic cheeseburger and it was juicy and flavorful. The fries were crispy and well-seasoned. The service was quick and the staff was friendly. It's a great place for a quick and satisfying meal.",
"The Coffee Corner is my favorite spot to work and enjoy a good cup of coffee. The atmosphere is relaxed and the coffee is always top-notch. They also offer a variety of pastries and sandwiches. The staff is always welcoming and the service is fast. I enjoy their latte and the blueberry muffin is a must-try."
]
from skllm.preprocessing import GPTSummarizer
# Initialize the GPT summarizer model
gpt_summarizer = GPTSummarizer(openai_model="gpt-3.5-turbo", max_words=15)
summaries = gpt_summarizer.fit_transform(reviews)
print(summaries)
['The Gourmet Kitchen offers impeccable service, exquisite food, delightful ambiance, and impressive wine list. Highly recommended.'
'The Burger Spot offers excellent quality fast food with friendly service.'
'The Coffee Corner is a great place to work with good coffee and food.']
A short summary of each review is generated. The max_words parameter sets a rough upper bound on the summary's length; in practice, a summary may be slightly longer.
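If you need a hard limit, you can count the words yourself and trim anything that runs over. The sketch below does this in plain Python on the summaries printed above; word_count and truncate_words are hypothetical helpers of my own, and splitting on whitespace is only one of several reasonable ways to count words.

```python
def word_count(text):
    """Count whitespace-separated words."""
    return len(text.split())


def truncate_words(text, max_words):
    """Hard-cap a text at `max_words` words, marking any cut with an ellipsis."""
    words = text.split()
    if len(words) <= max_words:
        return text
    return " ".join(words[:max_words]) + "..."


# Summaries copied from the output above.
summaries = [
    "The Gourmet Kitchen offers impeccable service, exquisite food, delightful ambiance, and impressive wine list. Highly recommended.",
    "The Burger Spot offers excellent quality fast food with friendly service.",
    "The Coffee Corner is a great place to work with good coffee and food.",
]

print([word_count(s) for s in summaries])  # [16, 11, 14]
capped = [truncate_words(s, max_words=15) for s in summaries]
```

Here the first summary comes out at 16 words, one over the requested 15, which is a concrete case of the soft limit being exceeded.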
You can find the Jupyter notebook for this tutorial on GitHub.
Conclusion
Scikit-LLM brings advanced language models like GPT-3 to the well-known scikit-learn framework. In this tutorial, we looked at some of Scikit-LLM's most important features: zero-shot text classification, multi-label zero-shot text classification, text vectorization, text translation, and text summarization.
Scikit-LLM opens up new possibilities for text analysis with large language models. It could be a useful addition to your toolbox, so I suggest giving it a try.
References
Citation
@online{alizadeh2023,
author = {Alizadeh, Esmaeil},
title = {Scikit-LLM: {Bridging} the {Gap} {Between} {LLM} and
{Scikit-learn}},
date = {2023-06-06},
url = {https://ealizadeh.com/blog/tutorial-scikit-llm/},
langid = {en}
}