Using a Graph Neural Network to predict movie ratings

Hanati Tuoken
4 min read · Nov 14, 2021


Recommendation matters: every day we are recommended books, movies, and other things we might be interested in. The usual approach in recommender systems is matrix factorization (https://en.wikipedia.org/wiki/Matrix_factorization_(recommender_systems)).

Here I would like to predict movie ratings using a Graph Neural Network (GNN). The idea is to first set up a heterogeneous graph with three node types: User, Movie, and Genre, where the edge between a User and a Movie carries the rating as a property. I will then use the GNN to predict this edge property between Users and Movies.

So let’s start with the MovieLens data

Let’s first look at the data. It is provided by the GroupLens research lab (https://grouplens.org/datasets/movielens/). I am using the small dataset, which contains 100,000 ratings applied to 9,000 movies by 600 users, last updated 9/2018.
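
If you want to follow along, the dataset can be downloaded and unzipped into the working directory. A minimal sketch (the URL is the one listed on the GroupLens download page):

import urllib.request, zipfile

url = "https://files.grouplens.org/datasets/movielens/ml-latest-small.zip"
urllib.request.urlretrieve(url, "ml-latest-small.zip")
with zipfile.ZipFile("ml-latest-small.zip") as z:
    z.extractall(".")  # creates ./ml-latest-small/ with movies.csv and ratings.csv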

Let’s check out the movie and rating tables.

import pandas as pd

movie_path = './ml-latest-small/movies.csv'
rating_path = './ml-latest-small/ratings.csv'

movie = pd.read_csv(movie_path)
rating = pd.read_csv(rating_path)
print(movie.shape, rating.shape)

The shape of the movie dataframe is (9742, 3), and it contains the title and genre information.

[Figure: preview of the movie dataframe]

First, I introduce a function to parse the genres column and create a one-hot vector for each genre.

## get genre index
## input a list of genre strings, return the dict {"genre": idx}
def genres2index(genres):
    genre_2_idx = {}
    idx = 0
    for x in genres:
        for xi in x.split("|"):
            if xi not in genre_2_idx.keys():
                genre_2_idx[xi] = idx
                idx += 1
    return genre_2_idx

genre_index = genres2index(movie.genres.tolist())

## use mid instead of movieId, since movieId is not continuous
movie["mid"] = movie.index
movie_2_genre = []
for mid, genres in movie[["mid", "genres"]].values:
    for gx in genres.split("|"):
        movie_2_genre.append([mid, genre_index[gx]])

## prepare genre_x, a 20x20 one-hot encoding
genre_x = []
for k, v in genre_index.items():
    x = [0 for i in range(len(genre_index))]
    x[v] = 1
    genre_x.append(x)
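
A quick sanity check on the result (the counts below match the graph summary printed later in the post):

print(len(genre_index))     # 20 distinct genre strings
print(len(movie_2_genre))   # 22084 movie-genre pairs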

For the title column, I used a pretrained GloVe model to convert each title to a 300-dimensional vector.

import torch
import torchtext

## load the pretrained GloVe vectors (300-dimensional by default for 840B)
glove840b = torchtext.vocab.GloVe("840B")

def title2vector(x):
    ## drop the "(year)" suffix and average the word vectors of the title
    x = x.split("(")[0]
    x2v = glove840b.get_vecs_by_tokens([xi.lower() for xi in x.split(" ")])
    if len(x2v.size()) == 2:
        x2v = x2v.mean(dim=0)
    return x2v.view(1, 300)

titles = []
for title in movie.title:
    titles.append(title2vector(title))
titles_tensor = torch.cat(titles)
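
The graph construction below uses a movie_x feature matrix that is not defined explicitly in the post; given the printed shape of [9742, 300], it is presumably just the stacked title embeddings:

movie_x = titles_tensor  # assumption: the GloVe title vectors serve as the Movie node features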

Now we have prepared the features for Movie and Genre.

And the shape of the rating dataframe is (100836, 4). The userId and movieId with the corresponding rating are available.

rating_m = rating.merge(movie, on="movieId")
n_users = len(rating.userId.unique())

## userId is 1-based in MovieLens, so shift it to 0-based; mid is already 0-based
edges = rating_m[["userId", "mid"]].values.T.copy()
edges[0, :] = edges[0, :] - 1
user_rates_movie = torch.from_numpy(edges)
user_rates_movie_attr = torch.from_numpy(rating_m["rating"].values).float().view(len(rating_m), 1)

## one-hot features for the users
user_x = []
for i in range(n_users):
    v = [0 for j in range(n_users)]
    v[i] = 1
    user_x.append(v)

The script above converts the rating table to the edge index user_rates_movie and the edge attributes user_rates_movie_attr, and builds a one-hot feature vector for each user.
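
Two more details are needed before building the graph. The printed HeteroData below stores tensors, so the Python lists are presumably converted first; and the train_flag / test_flag masks used there are not constructed anywhere in the post. A minimal sketch, assuming a random 80/20 split over the rating edges:

## convert the list-based features and edges to tensors (shapes match the printed output below)
user_x = torch.tensor(user_x, dtype=torch.float)                        # [610, 610]
genre_x = torch.tensor(genre_x, dtype=torch.float)                      # [20, 20]
movie_2_genre = torch.tensor(movie_2_genre, dtype=torch.long).t().contiguous()  # [2, 22084]

## hypothetical train/test masks; the post does not show how they were built
n_edges = len(rating_m)
perm = torch.randperm(n_edges)
train_flag = torch.zeros(n_edges, dtype=torch.bool)
train_flag[perm[: int(0.8 * n_edges)]] = True
test_flag = ~train_flag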

So we are roughly done with the preparation; now we can look into the GNN library.

Graph Neural Network library

Here I am using the PyG library. PyG (PyTorch Geometric) is a library built upon PyTorch to easily write and train Graph Neural Networks (GNNs) for a wide range of applications related to structured data.
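
Installation is typically a single pip command, though the exact extras depend on your PyTorch/CUDA version (see the PyG installation docs):

pip install torch-geometric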

The following script puts the data into the HeteroData structure.

from torch_geometric.data import HeteroData

data = HeteroData()

#data['user'].num_nodes = n_users  # alternative: leave users featureless
data['user'].x = user_x
data['movie'].x = movie_x
data['genre'].x = genre_x

data['user', 'rates', 'movie'].edge_index = user_rates_movie
data['user', 'rates', 'movie'].train_mask = train_flag
data['user', 'rates', 'movie'].test_mask = test_flag
data['user', 'rates', 'movie'].edge_label = user_rates_movie_attr

data['movie', 'belongto', 'genre'].edge_index = movie_2_genre

print(data)

The output looks as follows.

HeteroData(
  user={ x=[610, 610] },
  movie={ x=[9742, 300] },
  genre={ x=[20, 20] },
  (user, rates, movie)={
    edge_index=[2, 100836],
    train_mask=[100836],
    test_mask=[100836],
    edge_label=[100836, 1]
  },
  (movie, belongto, genre)={ edge_index=[2, 22084] }
)

We can see from the output the features for each node type (User, Movie, Genre), as well as the edges between User and Movie. The train and test masks are simple boolean vectors that determine which edges are used for computing the training and test losses.

Now let’s look at the GNN model. I just create a simple GNN model and convert it to a heterogeneous GNN model.

import torch_geometric.transforms as T
from torch_geometric.nn import SAGEConv, to_hetero

class GNN(torch.nn.Module):
    def __init__(self, hidden_channels, out_channels):
        super().__init__()
        self.conv1 = SAGEConv((-1, -1), hidden_channels)
        self.conv2 = SAGEConv((-1, -1), out_channels)

    def forward(self, x, edge_index):
        x = self.conv1(x, edge_index).relu()
        x = self.conv2(x, edge_index)
        return x


model = GNN(hidden_channels=128, out_channels=64)
model = to_hetero(model, data.metadata(), aggr='sum')

The model will embed the node features into vectors of size 64, and these vectors will be fed into the following LinkP model to predict the rating.

import torch.nn.functional as F

class LinkP(torch.nn.Module):
    def __init__(self, in_channels, hidden_channels, out_channels):
        super(LinkP, self).__init__()

        self.lins = torch.nn.ModuleList()
        self.lins.append(torch.nn.Linear(in_channels, hidden_channels))
        self.lins.append(torch.nn.Linear(hidden_channels, out_channels))

    def reset_parameters(self):
        for lin in self.lins:
            lin.reset_parameters()

    def forward(self, x_i, x_j):
        ## element-wise product of the user and movie embeddings, followed by an MLP
        x = x_i * x_j
        for lin in self.lins[:-1]:
            x = lin(x)
            x = F.relu(x)
        x = self.lins[-1](x)
        return x

linkp = LinkP(64, 64, 1)

Turn on the GPU for training

Now we start the training.

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
data = data.to(device)
model = model.to(device)
linkp = linkp.to(device)
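
## The loop below uses uids, mids, rates (training edges) and uids_te, mids_te, rates_te
## (test edges), plus lossfunc, none of which are defined in the post. A plausible
## reconstruction, assuming MSE loss and the train/test masks stored in the graph:
lossfunc = torch.nn.MSELoss()
edge_index = data['user', 'rates', 'movie'].edge_index
edge_label = data['user', 'rates', 'movie'].edge_label
train_mask = data['user', 'rates', 'movie'].train_mask
test_mask = data['user', 'rates', 'movie'].test_mask
uids, mids, rates = edge_index[0, train_mask], edge_index[1, train_mask], edge_label[train_mask]
uids_te, mids_te, rates_te = edge_index[0, test_mask], edge_index[1, test_mask], edge_label[test_mask]
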
with torch.no_grad():  # Initialize lazy modules.
    out = model(data.x_dict, data.edge_index_dict)

optimizer = torch.optim.Adam(
    list(model.parameters()) + list(linkp.parameters()),
    lr=0.0002)

for epoch in range(1000):
    model.train()
    linkp.train()
    optimizer.zero_grad()
    out = model(data.x_dict, data.edge_index_dict)
    p1 = linkp(out["user"][uids, :], out["movie"][mids, :])

    loss = lossfunc(p1, rates)
    loss.backward()

    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    torch.nn.utils.clip_grad_norm_(linkp.parameters(), 1.0)

    ## eval
    linkp.eval()
    p1_te = linkp(out["user"][uids_te, :], out["movie"][mids_te, :])
    loss_te = lossfunc(p1_te, rates_te)

    if epoch % 50 == 0:
        print(epoch, loss.item(), loss_te.item())
    optimizer.step()

Now we can wait and see how the loss curve looks. Each printed row shows the epoch, the training loss, and the test loss.

0 13.54012393951416 13.520957946777344
50 7.229877948760986 7.23165225982666
100 1.2005460262298584 1.2366523742675781
...
950 0.5994845032691956 0.760612428188324

The prediction also looks good: the x-axis is the test rating and the y-axis is the prediction.
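
The scatter plot itself is not reproduced here, but here is a sketch of how it could be generated (matplotlib is assumed; the prediction uses the trained models in eval mode):

import matplotlib.pyplot as plt

model.eval()
linkp.eval()
with torch.no_grad():
    out = model(data.x_dict, data.edge_index_dict)
    pred_te = linkp(out["user"][uids_te, :], out["movie"][mids_te, :])

plt.scatter(rates_te.cpu().numpy(), pred_te.cpu().numpy(), s=2, alpha=0.3)
plt.xlabel("test rating")
plt.ylabel("prediction")
plt.show()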

Some words

The PyG Python library comes very naturally to data scientists who are familiar with PyTorch, and it is very simple and easy to use. Please feel free to leave your comments, and enjoy.
