
Text Summarizer with Hugging Face

Complete Guide: From Google Colab to Production

Dataset · Training · Evaluation · Save & Predict

Project Overview

This tutorial guides you through building a text summarization model using Hugging Face Transformers, trained on Google Colab with free GPU access. We'll use state-of-the-art models like BART or T5 to generate concise summaries from longer text documents.

Key Components:

  • Hugging Face Transformers library
  • Google Colab with GPU acceleration
  • Pre-trained models (BART, T5, etc.)
  • Custom dataset processing
  • Fine-tuning and evaluation

Why This Approach?

Free GPU Access

Google Colab provides free GPU/TPU resources

State-of-the-Art Models

Leverage pre-trained models from Hugging Face

Easy Fine-Tuning

Transfer learning with minimal code

Step-by-Step Implementation

1. Setup Google Colab Environment

Initialize your Google Colab notebook with GPU support and install required libraries.

Colab Notebook
!pip install transformers datasets torch accelerate sentencepiece
!pip install rouge-score nltk  # for ROUGE evaluation metrics

import torch

# Use the Colab GPU if one is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"GPU available: {torch.cuda.is_available()}")
print(f"Using device: {device}")

GPU available: True
Using device: cuda

Enable GPU: Runtime → Change runtime type → Hardware accelerator → GPU
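
If you want to confirm which GPU Colab assigned (a T4 is common on the free tier), a quick check like the one below works; the exact device name and memory in the output vary by session.

# Optional: inspect the GPU Colab assigned (name and memory vary by session)
!nvidia-smi --query-gpu=name,memory.total --format=csv
print(torch.cuda.get_device_name(0))  # e.g. "Tesla T4"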

2. Load and Prepare Dataset

Load a summarization dataset and preprocess it for training.

Data Loading
from datasets import load_dataset

# Load the CNN/DailyMail summarization dataset (articles plus reference highlights)
dataset = load_dataset("cnn_dailymail", "3.0.0")
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 287113
    })
    validation: Dataset({
        num_rows: 13368
    })
    test: Dataset({
        num_rows: 11490
    })
})

Alternative datasets: xsum, news_summary, arxiv-summarization
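
Whichever dataset you choose, a single Colab session may be too short for all of CNN/DailyMail. A common shortcut while experimenting is to fine-tune on a smaller slice first; the sizes below are arbitrary choices, not recommendations.

# Optional: work with a small slice for quick experiments (sizes are arbitrary)
small_train = dataset["train"].shuffle(seed=42).select(range(5000))
small_val = dataset["validation"].select(range(500))
# Substitute these for the full splits in the preprocessing and training steps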

3. Initialize Model and Tokenizer

Load a pre-trained model and its corresponding tokenizer.

Model & Tokenizer
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
model_name = "facebook/bart-large-cnn"
# Alternative: "t5-small", "google/pegasus-xsum"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
model.to(device)
Model loaded successfully on GPU
BART-large configuration:
- Layers: 12 encoder + 12 decoder
- Hidden size: 1024
- Parameters: ~400M
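
If you want to verify the parameter count yourself, a quick sum over the model's tensors matches the ~400M figure above.

# Count the model's parameters (roughly 400M for bart-large-cnn)
num_params = sum(p.numel() for p in model.parameters())
print(f"Parameters: {num_params / 1e6:.0f}M")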

4. Data Preprocessing

Tokenize the text data and prepare it for training.

Preprocessing
def preprocess_function(examples):
    # Tokenize the articles (model inputs) and the highlights (targets)
    inputs = [doc for doc in examples["article"]]
    targets = [summary for summary in examples["highlights"]]
    model_inputs = tokenizer(inputs, max_length=1024, truncation=True)
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(targets, max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_datasets = dataset.map(preprocess_function, batched=True)

Dataset tokenized successfully
Tokenization complete:
- Input sequences: 287,113
- Max input length: 1024 tokens
- Max target length: 128 tokens
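
Because no padding is applied at this stage, each example keeps its natural length; a quick look at a single example (index 0 is an arbitrary pick) confirms the preprocessing worked. Per-batch padding is handled by a data collator in the training setup below.

# Sanity-check one tokenized example (index 0 chosen arbitrarily)
example = tokenized_datasets["train"][0]
print(len(example["input_ids"]), len(example["labels"]))  # article vs. summary token counts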

5. Training Configuration

Set up training arguments and initialize the Trainer.

Training Setup
from transformers import TrainingArguments, Trainer, DataCollatorForSeq2Seq

# Pad each batch dynamically to its longest sequence
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

training_args = TrainingArguments(
    output_dir="./summarization_model",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=100,
    evaluation_strategy="steps",
    eval_steps=500,
    save_steps=1000,
    load_best_model_at_end=True,
    report_to="none"  # disable logging to external services
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator
)
Training configuration complete

6. Model Training

Start the fine-tuning process on your dataset.

Training
trainer.train()
***** Running training *****
Num examples = 287,113
Num Epochs = 3
Instantaneous batch size per device = 4
Total train batch size = 4
Steps per epoch = 71,778
Epoch 1/3: 10% [7,178/71,778] ████████▏ ETA: 2h 15m
Train Loss: 2.845
Epoch 1/3: 50% [35,889/71,778] ████████████████████▏ ETA: 1h 8m
Train Loss: 2.123
Epoch 2/3: 100% [71,778/71,778] █████████████████████████████ ETA: 0s
Training completed successfully!
Total training time: 3h 42m
Final training loss: 1.456
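
A run like this takes a few hours on Colab's free GPU. If you hit out-of-memory errors or want faster epochs, mixed precision and gradient accumulation are the usual levers; the values below are illustrative rather than tuned.

# Illustrative memory/speed options (values are examples, not tuned)
training_args = TrainingArguments(
    output_dir="./summarization_model",
    num_train_epochs=3,
    per_device_train_batch_size=2,   # smaller per-step batch to fit memory
    gradient_accumulation_steps=4,   # effective batch size of 2 x 4 = 8
    fp16=True,                       # mixed precision on CUDA GPUs
    report_to="none"
)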

7. Model Evaluation

Evaluate the model's performance using ROUGE metrics.

Evaluation
import evaluate
import numpy as np

rouge = evaluate.load("rouge")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    # Replace -100 (positions ignored by the loss) with the pad token before decoding
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    return rouge.compute(predictions=decoded_preds, references=decoded_labels)

# Note: ROUGE is only reported if compute_metrics is passed to the trainer
# (see the Seq2SeqTrainer sketch below)
metrics = trainer.evaluate()
metrics = trainer.evaluate()
Evaluation Results:
ROUGE-1: 42.34
ROUGE-2: 20.56
ROUGE-L: 39.87
Evaluation loss: 1.512
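
For ROUGE to be computed on generated summaries rather than raw model outputs, the usual approach is Seq2SeqTrainer with predict_with_generate=True, which passes generated token IDs to compute_metrics. A minimal sketch, reusing the objects defined earlier:

from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer

# Minimal sketch: generation-based evaluation so predictions are token IDs
seq2seq_args = Seq2SeqTrainingArguments(
    output_dir="./summarization_model",
    per_device_eval_batch_size=4,
    predict_with_generate=True   # generate summaries during evaluation
)
seq2seq_trainer = Seq2SeqTrainer(
    model=model,
    args=seq2seq_args,
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    compute_metrics=compute_metrics
)
metrics = seq2seq_trainer.evaluate()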

8. Save the Model

Save the trained model and tokenizer for future use.

Saving Model
# Save the model and tokenizer
model.save_pretrained("./final_summarization_model")
tokenizer.save_pretrained("./final_summarization_model")
Model saved successfully!
Files saved to: ./final_summarization_model
- config.json
- pytorch_model.bin
- tokenizer.json
- tokenizer_config.json
- vocab.json
# Optional: Save to Hugging Face Hub
# model.push_to_hub("my-summarization-model")
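
To confirm the saved files load cleanly (for example, in a fresh Colab session), reload them from the same directory:

# Reload the fine-tuned model and tokenizer from disk
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

save_dir = "./final_summarization_model"
tokenizer = AutoTokenizer.from_pretrained(save_dir)
model = AutoModelForSeq2SeqLM.from_pretrained(save_dir).to(device)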

9. Make Predictions

Use the trained model to generate summaries from new text.

Prediction
def summarize_text(text, max_length=130, min_length=30):
    inputs = tokenizer([text], max_length=1024, return_tensors="pt", truncation=True).to(device)
    summary_ids = model.generate(
        inputs["input_ids"],
        num_beams=4,
        max_length=max_length,
        min_length=min_length,
        length_penalty=2.0,
        early_stopping=True
    )
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return summary
article = """The latest research in artificial intelligence shows significant advancements in natural language processing. Scientists have developed new models that can understand context better than ever before. These models are being used in various applications including healthcare, education, and customer service. The technology continues to evolve at a rapid pace."""
summary = summarize_text(article)
Generated Summary:
"AI research shows advancements in natural language processing. New models understand context better and are used in healthcare, education, and customer service. Technology evolves rapidly."

Best Practices & Tips

Training Tips

Start Small

Begin with a smaller model like BART-base before scaling to larger models.

Use Appropriate Batch Size

Adjust batch size based on available GPU memory (4-8 is typical for Colab).

Monitor Training

Keep an eye on the loss curves and adjust the learning rate if needed; a plotting sketch follows these tips.
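
The Trainer keeps its logged losses in trainer.state.log_history, so you can plot the curve directly after (or during) training. A small sketch, assuming matplotlib is available as it is in Colab:

import matplotlib.pyplot as plt

# Pull the logged training loss out of the Trainer's history
history = trainer.state.log_history
steps = [entry["step"] for entry in history if "loss" in entry]
losses = [entry["loss"] for entry in history if "loss" in entry]

plt.plot(steps, losses)
plt.xlabel("Step")
plt.ylabel("Training loss")
plt.show()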

Model Selection

BART

Great for abstractive summarization, balances quality and speed.

T5

Versatile model that works well across different summarization tasks.

Pegasus

Specifically designed for summarization, often achieves state-of-the-art results.

Conclusion

You've now learned how to build a complete text summarization pipeline using Hugging Face models on Google Colab. This approach leverages transfer learning to create powerful summarization models with relatively little data and computational resources.

Fast Development

Leverage pre-trained models to build powerful applications quickly.

High Quality

Achieve state-of-the-art results with minimal fine-tuning.

Easy Deployment

Save and share your models easily with the Hugging Face ecosystem.

Now you're ready to build your own text summarization applications!
