Skip to content

ezamyatin/spark-logfac

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

20 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

spark-logfac

Logistic Matrix Factorizations over Spark

The package contains highly scalable matrix factorization algorithms with a logistic loss function:

  • LMF [1]
  • Item2Vec [2]

Setup

resolvers += "Spark Packages Repo" at "https://repos.spark-packages.org/"

libraryDependencies += "ezamyatin" % "spark-logfac" % "0.1.0"

Overview

Both algorithms implement the general gradient descent method for matrix factorizations. The key principle of the algorithm is as follows: each training epoch consists of n subiterations, where n is the number of partitions into which the embeddings and data are partitioned. The value of n should be chosen so that 1/n part of the embeddings fits into the memory of a single executor. Data and embeddings are partitioned using two hash functions, one for “rows” and another for “columns”. Thus one table of embeddings (user-side) is partitioned using the first hash function and the second (item-side) using the second hash function. The data is a set of user-item pairs. On each subiteration, only those pairs whose embeddings of both elements are present on the excecutor are processed. To process all pairs, each of the n subiterations corresponds to a k-th cyclic shift of item-side partitioning.

Within the subiteration, local SGD is performed. The key difficulty here is negative sampling for implicit approaches. To solve this problem we used the method described in the paper "Distributed negative sampling for word embeddings" [3], the essence of which is that to use embeddings available at the executor.

References

[1] Logistic Matrix Factorization

[2] Item2Vec

[3] Distributed negative sampling for word embeddings

About

Logistic Matrix Factorization over Spark

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors