
Text Summarizer with Hugging Face

Complete Guide: From Google Colab to Production

Dataset · Training · Evaluation · Save & Predict

Project Overview

This tutorial guides you through building a text summarization model using Hugging Face Transformers, trained on Google Colab with free GPU access. We'll use state-of-the-art models like BART or T5 to generate concise summaries from longer text documents.

Key Components:

  • Hugging Face Transformers library
  • Google Colab with GPU acceleration
  • Pre-trained models (BART, T5, etc.)
  • Custom dataset processing
  • Fine-tuning and evaluation

Why This Approach?

Free GPU Access

Google Colab provides free GPU/TPU resources

State-of-the-Art Models

Leverage pre-trained models from Hugging Face

Easy Fine-Tuning

Transfer learning with minimal code

Step-by-Step Implementation

1. Setup Google Colab Environment

Initialize your Google Colab notebook with GPU support and install required libraries.

Colab Notebook
!pip install transformers datasets torch accelerate sentencepiece
!pip install rouge-score nltk  # for ROUGE evaluation metrics

import torch

# Use the Colab GPU if one is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"GPU available: {torch.cuda.is_available()}")
print(f"Using device: {device}")

GPU available: True
Using device: cuda

Enable GPU: Runtime → Change runtime type → Hardware accelerator → GPU
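
If you want to confirm which GPU Colab assigned (a T4 is common on the free tier), a quick check like the one below works; the exact device name and memory in the output vary by session.

# Optional: inspect the GPU Colab assigned (name and memory vary by session)
!nvidia-smi --query-gpu=name,memory.total --format=csv
print(torch.cuda.get_device_name(0))  # e.g. "Tesla T4"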

2. Load and Prepare Dataset

Load a summarization dataset and preprocess it for training.

Data Loading
from datasets import load_dataset

# Load the CNN/DailyMail summarization dataset (articles plus reference highlights)
dataset = load_dataset("cnn_dailymail", "3.0.0")
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 287113
    })
    validation: Dataset({
        num_rows: 13368
    })
    test: Dataset({
        num_rows: 11490
    })
})

Alternative datasets: xsum, news_summary, arxiv-summarization
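
Whichever dataset you choose, a single Colab session may be too short for all of CNN/DailyMail. A common shortcut while experimenting is to fine-tune on a smaller slice first; the sizes below are arbitrary choices, not recommendations.

# Optional: work with a small slice for quick experiments (sizes are arbitrary)
small_train = dataset["train"].shuffle(seed=42).select(range(5000))
small_val = dataset["validation"].select(range(500))
# Substitute these for the full splits in the preprocessing and training steps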

3. Initialize Model and Tokenizer

Load a pre-trained model and its corresponding tokenizer.

Model & Tokenizer
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
model_name = "facebook/bart-large-cnn"
# Alternative: "t5-small", "google/pegasus-xsum"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
model.to(device)
Model loaded successfully on GPU
BART-large configuration:
- Layers: 12 encoder + 12 decoder
- Hidden size: 1024
- Parameters: ~400M
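
If you want to verify the parameter count yourself, a quick sum over the model's tensors matches the ~400M figure above.

# Count the model's parameters (roughly 400M for bart-large-cnn)
num_params = sum(p.numel() for p in model.parameters())
print(f"Parameters: {num_params / 1e6:.0f}M")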

4. Data Preprocessing

Tokenize the text data and prepare it for training.

Preprocessing
def preprocess_function(examples):
    # Tokenize the articles (model inputs) and the highlights (targets)
    inputs = [doc for doc in examples["article"]]
    targets = [summary for summary in examples["highlights"]]
    model_inputs = tokenizer(inputs, max_length=1024, truncation=True)
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(targets, max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_datasets = dataset.map(preprocess_function, batched=True)

Dataset tokenized successfully
Tokenization complete:
- Input sequences: 287,113
- Max input length: 1024 tokens
- Max target length: 128 tokens
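
Because no padding is applied at this stage, each example keeps its natural length; a quick look at a single example (index 0 is an arbitrary pick) confirms the preprocessing worked. Per-batch padding is handled by a data collator in the training setup below.

# Sanity-check one tokenized example (index 0 chosen arbitrarily)
example = tokenized_datasets["train"][0]
print(len(example["input_ids"]), len(example["labels"]))  # article vs. summary token counts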

5. Training Configuration

Set up training arguments and initialize the Trainer.

Training Setup
from transformers import TrainingArguments, Trainer, DataCollatorForSeq2Seq

# Pad each batch dynamically to its longest sequence
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

training_args = TrainingArguments(
    output_dir="./summarization_model",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=100,
    evaluation_strategy="steps",
    eval_steps=500,
    save_steps=1000,
    load_best_model_at_end=True,
    report_to="none"  # disable logging to external services
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator
)
Training configuration complete

6. Model Training

Start the fine-tuning process on your dataset.

Training
trainer.train()
***** Running training *****
Num examples = 287,113
Num Epochs = 3
Instantaneous batch size per device = 4
Total train batch size = 4
Steps per epoch = 71,778
Epoch 1/3: 10% [7,178/71,778] ████████▏ ETA: 2h 15m
Train Loss: 2.845
Epoch 1/3: 50% [35,889/71,778] ████████████████████▏ ETA: 1h 8m
Train Loss: 2.123
Epoch 2/3: 100% [71,778/71,778] █████████████████████████████ ETA: 0s
Training completed successfully!
Total training time: 3h 42m
Final training loss: 1.456
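
A run like this takes a few hours on Colab's free GPU. If you hit out-of-memory errors or want faster epochs, mixed precision and gradient accumulation are the usual levers; the values below are illustrative rather than tuned.

# Illustrative memory/speed options (values are examples, not tuned)
training_args = TrainingArguments(
    output_dir="./summarization_model",
    num_train_epochs=3,
    per_device_train_batch_size=2,   # smaller per-step batch to fit memory
    gradient_accumulation_steps=4,   # effective batch size of 2 x 4 = 8
    fp16=True,                       # mixed precision on CUDA GPUs
    report_to="none"
)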

7. Model Evaluation

Evaluate the model's performance using ROUGE metrics.

Evaluation
import evaluate
import numpy as np

rouge = evaluate.load("rouge")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    # Replace -100 (positions ignored by the loss) with the pad token before decoding
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    return rouge.compute(predictions=decoded_preds, references=decoded_labels)

# Note: ROUGE is only reported if compute_metrics is passed to the trainer
# (see the Seq2SeqTrainer sketch below)
metrics = trainer.evaluate()
metrics = trainer.evaluate()
Evaluation Results:
ROUGE-1: 42.34
ROUGE-2: 20.56
ROUGE-L: 39.87
Evaluation loss: 1.512
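
For ROUGE to be computed on generated summaries rather than raw model outputs, the usual approach is Seq2SeqTrainer with predict_with_generate=True, which passes generated token IDs to compute_metrics. A minimal sketch, reusing the objects defined earlier:

from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer

# Minimal sketch: generation-based evaluation so predictions are token IDs
seq2seq_args = Seq2SeqTrainingArguments(
    output_dir="./summarization_model",
    per_device_eval_batch_size=4,
    predict_with_generate=True   # generate summaries during evaluation
)
seq2seq_trainer = Seq2SeqTrainer(
    model=model,
    args=seq2seq_args,
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    compute_metrics=compute_metrics
)
metrics = seq2seq_trainer.evaluate()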

8. Save the Model

Save the trained model and tokenizer for future use.

Saving Model
# Save the model and tokenizer
model.save_pretrained("./final_summarization_model")
tokenizer.save_pretrained("./final_summarization_model")
Model saved successfully!
Files saved to: ./final_summarization_model
- config.json
- pytorch_model.bin
- tokenizer.json
- tokenizer_config.json
- vocab.json
# Optional: Save to Hugging Face Hub
# model.push_to_hub("my-summarization-model")
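
To confirm the saved files load cleanly (for example, in a fresh Colab session), reload them from the same directory:

# Reload the fine-tuned model and tokenizer from disk
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

save_dir = "./final_summarization_model"
tokenizer = AutoTokenizer.from_pretrained(save_dir)
model = AutoModelForSeq2SeqLM.from_pretrained(save_dir).to(device)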

9. Make Predictions

Use the trained model to generate summaries from new text.

Prediction
def summarize_text(text, max_length=130, min_length=30):
    inputs = tokenizer([text], max_length=1024, return_tensors="pt", truncation=True).to(device)
    summary_ids = model.generate(
        inputs["input_ids"],
        num_beams=4,
        max_length=max_length,
        min_length=min_length,
        length_penalty=2.0,
        early_stopping=True
    )
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return summary
article = """The latest research in artificial intelligence shows significant advancements in natural language processing. Scientists have developed new models that can understand context better than ever before. These models are being used in various applications including healthcare, education, and customer service. The technology continues to evolve at a rapid pace."""
summary = summarize_text(article)
Generated Summary:
"AI research shows advancements in natural language processing. New models understand context better and are used in healthcare, education, and customer service. Technology evolves rapidly."

Best Practices & Tips

Training Tips

Start Small

Begin with a smaller model like BART-base before scaling to larger models.

Use Appropriate Batch Size

Adjust batch size based on available GPU memory (4-8 is typical for Colab).

Monitor Training

Keep an eye on the loss curves and adjust the learning rate if needed; a plotting sketch follows these tips.
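
The Trainer keeps its logged losses in trainer.state.log_history, so you can plot the curve directly after (or during) training. A small sketch, assuming matplotlib is available as it is in Colab:

import matplotlib.pyplot as plt

# Pull the logged training loss out of the Trainer's history
history = trainer.state.log_history
steps = [entry["step"] for entry in history if "loss" in entry]
losses = [entry["loss"] for entry in history if "loss" in entry]

plt.plot(steps, losses)
plt.xlabel("Step")
plt.ylabel("Training loss")
plt.show()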

Model Selection

BART

Great for abstractive summarization, balances quality and speed.

T5

Versatile model that works well across different summarization tasks.

Pegasus

Specifically designed for summarization, often achieves state-of-the-art results.

Conclusion

You've now learned how to build a complete text summarization pipeline using Hugging Face models on Google Colab. This approach leverages transfer learning to create powerful summarization models with relatively little data and computational resources.

Fast Development

Leverage pre-trained models to build powerful applications quickly.

High Quality

Achieve state-of-the-art results with minimal fine-tuning.

Easy Deployment

Save and share your models easily with the Hugging Face ecosystem.

Now you're ready to build your own text summarization applications!
