SANSA ML Readme

The SANSA ML stack is currently under major refactoring. It is moving towards support for Scala 2.12 and Spark 3. The functionality is covered by Scala unit tests and documented in the Scaladoc. This Readme provides an overview of the current modules.

Current Modules

The current stack provides:

SmartFeatureExtractor

This feature extractor turns an Apache Spark Dataset of Apache Jena Triples into a DataFrame containing entity feature information for further machine learning approaches. After an initial entity-selection step, all features of the specified URIs are collected via Spark's pivot function. The resulting feature columns are collapsed so that each entity ends up as one row in the DataFrame. If an extracted feature is of type Literal, its column is cast to the corresponding literal data type, e.g. StringType, DoubleType, Timestamp, ...

val dataset: Dataset[graph.Triple] = [...]

/** Smart Feature Extractor */
val sfeNoFilter = new SmartFeatureExtractor()
  // .setEntityColumnName("s")
  // .setObjectFilter("http://data.linkedmdb.org/movie/film")
  // .setSparqlFilter("SELECT ?s WHERE { ?s ?p <http://data.linkedmdb.org/movie/film> }")

/** Feature Extracted DataFrame */
val feDf = sfeNoFilter
  .transform(dataset)
feDf
  .show(false)

Here is an example with an initial filter by object:

val sfeObjectFilter = new SmartFeatureExtractor()
  // .setEntityColumnName("s")
  .setObjectFilter("http://data.linkedmdb.org/movie/film")

val feDf1 = sfeObjectFilter
  .transform(dataset)
feDf1
  .show(false)

And finally an example with an initial SPARQL filter:

val sfeSparqlFilter = new SmartFeatureExtractor()
  // .setEntityColumnName("s")
  .setSparqlFilter("SELECT ?s WHERE { ?s ?p <http://data.linkedmdb.org/movie/film> }")

val feDf2 = sfeSparqlFilter
  .transform(dataset)
feDf2
  .show()

The result looks like this:

+--------------------+--------------------+--------------------+-------+--------------------+---------------------+--------------------+--------------------+
|                   s|               actor|initial_release_date|runtime|              writer|22-rdf-syntax-ns#type|    rdf-schema#label|               genre|
+--------------------+--------------------+--------------------+-------+--------------------+---------------------+--------------------+--------------------+
|https://sansa.sam...|[https://sansa.sa...| 2002-01-01 00:00:00|  141.0|http://data.linke...| http://data.linke...| Catch Me If You Can|[https://sansa.sa...|
|https://sansa.sam...|[https://sansa.sa...| 1994-01-01 00:00:00|  142.0|http://data.linke...| http://data.linke...|The Shawshank Red...|[https://sansa.sa...|
|https://sansa.sam...|[https://sansa.sa...| 1999-01-01 00:00:00|  189.0|http://data.linke...| http://data.linke...|          Green Mile|[https://sansa.sa...|
+--------------------+--------------------+--------------------+-------+--------------------+---------------------+--------------------+--------------------+

Literal2Feature AutoSparql Generation for Feature Extraction

AutoSparql Generation for Feature Extraction: This module (scaladocs) creates a SPARQL query by traversing the graph to reach literals which can be used as features for common feature-based machine learning approaches. The user only needs to specify a WHERE clause describing how to reach the entities that should be considered as seeds/roots of the graph traversal. The traversal then produces a SPARQL query that fetches the connected literal features. A sample usage would be:

val inputFilePath: String = this.getClass.getClassLoader.getResource("utils/test.ttl").getPath
val seedVarName = "?seed"
val whereClauseForSeed = "?seed a <http://dig.isi.edu/Person>"
val maxUp: Int = 5
val maxDown: Int = 5
val seedNumber: Int = 0
val seedNumberAsRatio: Double = 1.0

// setup spark session
val spark = SparkSession.builder
  .appName(s"tryout sparql query transformer")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .config("spark.sql.crossJoin.enabled", true)
  .getOrCreate()
spark.sparkContext.setLogLevel("ERROR")

implicit val nodeTupleEncoder = Encoders.kryo(classOf[(Node, Node, Node)])

// first mini file:
val dataset: Dataset[org.apache.jena.graph.Triple] = spark.rdf(Lang.TURTLE)(inputFilePath).toDS().cache()


val (totalSparqlQuery: String, var_names: List[String]) = FeatureExtractingSparqlGenerator.createSparql(
  ds = dataset,
  seedVarName = seedVarName,
  seedWhereClause = whereClauseForSeed,
  maxUp = maxUp,
  maxDown = maxDown,
  numberSeeds = seedNumber,
  ratioNumberSeeds = seedNumberAsRatio
)
println(totalSparqlQuery)

This sample is taken from a Scala unit test. A sample query created and adjusted by the Literal2Feature module produced the following feature-extracting SPARQL query for the Linked Movie Database dataset:

SELECT
?movie
?movie__down_genre__down_film_genre_name
?movie__down_title
(<http://www.w3.org/2001/XMLSchema#int>(?movie__down_runtime) as ?movie__down_runtime_asInt)
?movie__down_runtime
?movie__down_actor__down_actor_name
WHERE {
    ?movie <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://data.linkedmdb.org/movie/film> .
    ?movie <http://data.linkedmdb.org/movie/genre> ?movie__down_genre . ?movie__down_genre <http://data.linkedmdb.org/movie/film_genre_name> ?movie__down_desiredGenre__down_film_genre_name .
    
    OPTIONAL { ?movie <http://purl.org/dc/terms/title> ?movie__down_title . }
    OPTIONAL { ?movie <http://data.linkedmdb.org/movie/runtime> ?movie__down_runtime . }
    OPTIONAL { ?movie <http://data.linkedmdb.org/movie/actor> ?movie__down_actor . ?movie__down_actor <http://data.linkedmdb.org/movie/actor_name> ?movie__down_actor__down_actor_name . }
    OPTIONAL { ?movie <http://data.linkedmdb.org/movie/genre> ?movie__down_genre . ?movie__down_genre <http://data.linkedmdb.org/movie/film_genre_name> ?movie__down_genre__down_film_genre_name . }
    
    FILTER (?movie__down_desiredGenre__down_film_genre_name = 'Superhero' || ?movie__down_desiredGenre__down_film_genre_name = 'Fantasy' )
}

SparqlFrame Feature Extractor

With SparqlFrame we provide a Transformer which takes a String representing a SPARQL query. You can also use our Literal2Feature AutoSparql generation for feature extraction to produce this query. SparqlFrame uses Sparqlify from the query layer to gain query results. The values are cast to String if not all elements in a respective feature column share a common type such as Integer.

/**
* transformer that collects the features from the Dataset[Triple] into a common Spark Dataframe,
* optionally collapsed by key
*/
val sparqlFrame = new SparqlFrame()
    .setSparqlQuery(sparqlString)
    .setCollapsByKey(true) // optional, default is false
    .setCollapsColumnName("seed") // optional

/**
* dataframe with the resulting features,
* in this case collapsed by the seed column
*/
val extractedFeaturesDf = sparqlFrame
  .transform(dataset)
  .cache()

This creates a dataframe of, e.g., the following shape:

+--------------------------+--------------+---------------+
|seed                      |seed__down_age|seed__down_name|
+--------------------------+--------------+---------------+
|http://dig.isi.edu/Mary   |25            |Mary           |
|http://dig.isi.edu/John   |28            |John           |
|http://dig.isi.edu/John_jr|2             |John Jr.       |
+--------------------------+--------------+---------------+

This dataframe can then be manipulated by native Apache Spark MLlib transformers for the desired scenario, or by the Smart Vector Assembler.

Smart Vector Assembler

This Transformer creates the Dataframe needed for common ML approaches in Spark MLlib. The resulting Dataframe consists of a features column holding a numeric vector for each entity, an id/identifier column such as the node id, and an optional label column.

val smartVectorAssembler = new SmartVectorAssembler()
    .setEntityColumn("seed")
    .setLabelColumn("age")
    .setNullReplacement("string", "") // optional
    .setNullReplacement("digit", -1) // optional
    .setWord2VecSize(5) // optional
    .setWord2VecMinCount(1) // optional

val assembledDf: DataFrame = smartVectorAssembler
    .transform(postprocessedFeaturesDf)

This creates a dataframe of, e.g., the following shape:

+--------------------------+-----+------------------------+
|id                        |label|features                |
+--------------------------+-----+------------------------+
|http://dig.isi.edu/Mary   |25   |[28.0,-1.0,2.0,0.0,-1.0]|
|http://dig.isi.edu/John   |28   |[25.0,-1.0,1.0,1.0,-1.0]|
|http://dig.isi.edu/John_jr|2    |[-1.0,25.0,0.0,-1.0,1.0]|
|http://dig.isi.edu/John_jr|2    |[-1.0,28.0,0.0,-1.0,0.0]|
+--------------------------+-----+------------------------+

ML2Graph

The module ML2Graph can be used to transform the tabular representation of ML results into an RDF knowledge graph to enrich the initial input KG.

// Make predictions.
val predictions = model.transform(testData)

// Select example rows to display.
predictions.select("entityID", "prediction", "label", "features").show(10)
predictions.show()

val ml2Graph = new ML2Graph()
  .setEntityColumn("entityID")
  .setValueColumn("prediction")

val metagraph: RDD[Triple] = ml2Graph.transform(predictions)
metagraph.take(10).foreach(println(_))

SimE4KG Transformer

This module provides a semantic similarity estimation Transformer. It takes a dataset of triples as input. You then specify which seeds to consider, either via a SPARQL filter or via the object filter. Next, the DistSim similarity estimation is reused to gather candidate pairs. Based on these candidate pairs, the multi-modal similarity estimation is calculated; its features are extracted by the SmartFeatureExtractor.

val lang = Lang.TURTLE
val originalDataRDD = spark.rdf(lang)("/Users/.../Datasets/sampleMovieDB.nt").persist()

val dataset: Dataset[Triple] = originalDataRDD
  .toDS()
  .cache()

val dse = new DaSimEstimator()
  // .setSparqlFilter("SELECT ?o WHERE { ?s <https://sansa.sample-stack.net/genre> ?o }")
  .setObjectFilter("http://data.linkedmdb.org/movie/film")
  .setDistSimFeatureExtractionMethod("os")
  .setSimilarityValueStreching(false)
  .setImportance(Map("initial_release_date_sim" -> 0.2, "rdf-schema#label_sim" -> 0.0, "runtime_sim" -> 0.2, "writer_sim" -> 0.1, "22-rdf-syntax-ns#type_sim" -> 0.0, "actor_sim" -> 0.3, "genre_sim" -> 0.2))

val resultSimDf = dse
  .transform(dataset)

resultSimDf.show(false)

val metagraph: RDD[Triple] = dse
   .semantification(resultSimDf)

Apart from the code snippets here, we also provide a sample Databricks notebook.

Sparql Transformer

Sparql Transformer: The SPARQL Transformer is implemented as a Spark MLlib Transformer. It reads RDF data as a Dataset and produces a DataFrame of Apache Jena Nodes. Currently, up to 5 projection variables are supported. A sample usage could be:

val spark = SparkSession.builder()
    .appName(sc.appName)
    .master(sc.master)
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .config("spark.kryo.registrator", "net.sansa_stack.rdf.spark.io.JenaKryoRegistrator")
    .config("spark.sql.crossJoin.enabled", "true")
    .getOrCreate()

private val dataPath = this.getClass.getClassLoader.getResource("utils/test_data.nt").getPath
val data = spark.read.rdf(Lang.NTRIPLES)(dataPath).toDS()

val sparqlQueryString =
    "PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> " +
    "PREFIX owl: <http://www.w3.org/2002/07/owl#> " +
    "SELECT ?s " +
    "WHERE {" +
    "  ?s rdf:type owl:ObjectProperty " +
    "}"
val sparqlQuery: SPARQLQuery = SPARQLQuery(sparqlQueryString)

val res: DataFrame = sparqlQuery.transform(data)
val resultNodes: Array[Node] = res.as[Node].collect()

This sample is taken from a Scala unit test.

DistAD Distributed Anomaly Detection

DistAD: This module is a generic, scalable, and distributed framework for anomaly detection on large RDF knowledge graphs. DistAD offers end-users fine-grained control, with a vast number of different algorithms, methods, and (hyper-)parameters to choose from for detecting outliers. The framework performs anomaly detection by extracting semantic features from entities for calculating similarity, applying clustering on the entities, and running multiple anomaly detection algorithms to detect the outliers on different levels and granularities. The output of DistAD is the list of anomalous RDF triples. DistAD needs a config file which looks as follows:

verbose=true
writeResultToFile=false

inputData="PATH/TO/dbpediaInfoBox.nt"
resultFilePath="PATH/TO/output.nt"

anomalyDetectionType="Predicate" //Possible values: NumericLiteral, Predicate, MultiFeature
clusteringMethod="BisectingKmeans" //Possible values: BisectingKmeans, MinHashLSH
clusteringType="Full" //Possible values: Full, Partial
anomalyDetectionAlgorithm="ZSCORE" //Possible values: IQR, MAD, ZSCORE
featureExtractor="PIVOT" //Possible values:PIVOT, LITERAL2FEATURE

anomalyListSize = 10
numberOfClusters = 2
silhouetteMethod = false
silhouetteMethodSamplingRate = 0.1

//CONOD
pairWiseDistanceThreshold = 0.43
  
//IsolationForest
maxSampleForIF = 256

//Literal2Feature
depth=1
seedNumber=20

After providing a config file, users can easily start the anomaly detection process. The process will read the config file and, based on its values, initiate the corresponding classes.

val configFilePath = "PATH/TO/config.conf"
AnomalyDetectionDispatcher.main(Array(configFilePath))
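To illustrate the idea behind one of the configurable detectors (anomalyDetectionAlgorithm="ZSCORE" above), here is a minimal, non-distributed plain-Scala sketch of a z-score outlier check over numeric values. This is only an illustration of the statistic, not DistAD's actual implementation, which runs distributed on Spark; all names here are made up for the example.

```scala
// Illustrative sketch of the z-score idea: flag values whose distance
// from the mean exceeds `threshold` standard deviations.
object ZScoreSketch {
  def outliers(values: Seq[Double], threshold: Double = 2.0): Seq[Double] = {
    val mean = values.sum / values.size
    val std = math.sqrt(values.map(v => math.pow(v - mean, 2)).sum / values.size)
    if (std == 0.0) Seq.empty // all values identical: nothing anomalous
    else values.filter(v => math.abs((v - mean) / std) > threshold)
  }

  def main(args: Array[String]): Unit = {
    // e.g. numeric literals of one predicate, with one anomalous value
    val populations = Seq(100.0, 110.0, 95.0, 105.0, 98.0, 102.0, 97.0, 103.0, 99.0, 100000.0)
    println(outliers(populations)) // → List(100000.0)
  }
}
```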

Moreover, DistAD offers options to select between different algorithms and hyperparameters, and to prepare a customized pipeline for anomaly detection. The following table shows all the possibilities:

For more information about DistAD visit here.

Feature Based Semantic Similarity Estimations

DistSim - Feature Based Semantic Similarity Estimations (code): DistSim is the scalable, distributed, in-memory semantic similarity estimation framework for RDF knowledge graphs which has been integrated into the SANSA Machine Learning package. The Scaladoc is available here, the respective similarity estimation models are in this GitHub directory, and further needed utils can be found here.

Usage of Modules

Feature Extraction: how to use the semantic similarity pipeline modules:

val featureExtractorModel = new FeatureExtractorModel()
       .setMode("an")
val extractedFeaturesDataFrame = featureExtractorModel
       .transform(triplesDf)
       .filter(t => t.getAs[String]("uri").startsWith("m"))
extractedFeaturesDataFrame.show()

Transform features to indexed feature representation:

val cvModel: CountVectorizerModel = new CountVectorizer()
        .setInputCol("extractedFeatures")
        .setOutputCol("vectorizedFeatures")
        .fit(filteredFeaturesDataFrame)
val tmpCvDf: DataFrame = cvModel.transform(filteredFeaturesDataFrame)

(Optional but recommended) Filter out feature vectors that do not contain any features:

val isNoneZeroVector = udf({ v: Vector => v.numNonzeros > 0 }, DataTypes.BooleanType)
val countVectorizedFeaturesDataFrame: DataFrame = tmpCvDf
  .filter(isNoneZeroVector(col("vectorizedFeatures")))
  .select("uri", "vectorizedFeatures")
countVectorizedFeaturesDataFrame.show()

Semantic Similarity Estimations

Now the data is prepared to run semantic similarity estimations.

There are always two options:

  • Option 1:
    • nearestNeighbors provides, for one feature vector and a DataFrame, the k nearest neighbors in the DataFrame to the key feature vector. A feature vector key could be: val sample_key: Vector = countVectorizedFeaturesDataFrame.take(1)(0).getAs[Vector]("vectorizedFeatures")
  • Option 2:
    • similarityJoin calculates all pairwise similarities between two DataFrames of feature vectors. The resulting DataFrame is limited by a minimal threshold.

Currently, we provide these similarity estimation models:

  • Batet
  • Braun-Blanquet
  • Dice
  • Jaccard
  • MinHash(probabilistic Jaccard)
  • Ochiai
  • Simpson
  • Tversky
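To illustrate what these set-based measures compute, here is a plain-Scala, non-distributed sketch of three of them on two small feature sets. The SANSA models estimate the same measures distributed over Spark feature vectors; the object name and sample feature strings below are made up for the example.

```scala
// Set-based similarity measures on plain Scala sets, for illustration only.
object SimilarityMeasures {
  def jaccard(a: Set[String], b: Set[String]): Double =
    a.intersect(b).size.toDouble / a.union(b).size

  def dice(a: Set[String], b: Set[String]): Double =
    2.0 * a.intersect(b).size / (a.size + b.size)

  // Tversky generalizes both: alpha = beta = 1 gives Jaccard,
  // alpha = beta = 0.5 gives Dice.
  def tversky(a: Set[String], b: Set[String], alpha: Double, beta: Double): Double = {
    val common = a.intersect(b).size.toDouble
    common / (common + alpha * a.diff(b).size + beta * b.diff(a).size)
  }

  def main(args: Array[String]): Unit = {
    // hypothetical extracted features of two movie entities
    val movieA = Set("genre:Drama", "actor:TomHanks", "runtime:141")
    val movieB = Set("genre:Drama", "actor:TomHanks", "runtime:142")
    println(f"Jaccard: ${jaccard(movieA, movieB)}%.2f")               // 2 common / 4 total = 0.50
    println(f"Dice:    ${dice(movieA, movieB)}%.2f")                  // 2*2 / (3+3)      = 0.67
    println(f"Tversky: ${tversky(movieA, movieB, 1.0, 1.0)}%.2f")     // equals Jaccard   = 0.50
  }
}
```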

Usage of MinHash

val minHashModel: MinHashLSHModel = new MinHashLSH()
      .setInputCol("vectorizedFeatures")
      .setOutputCol("hashedFeatures")
      .fit(countVectorizedFeaturesDataFrame)
minHashModel.approxNearestNeighbors(countVectorizedFeaturesDataFrame, sample_key, 10, "minHashDistance").show()
minHashModel.approxSimilarityJoin(countVectorizedFeaturesDataFrame, countVectorizedFeaturesDataFrame, 0.8, "distance").show()

Usage of Jaccard

val jaccardModel: JaccardModel = new JaccardModel()
      .setInputCol("vectorizedFeatures")
jaccardModel.nearestNeighbors(countVectorizedFeaturesDataFrame, sample_key, 10).show()
jaccardModel.similarityJoin(countVectorizedFeaturesDataFrame, countVectorizedFeaturesDataFrame, threshold = 0.5).show()

Usage of Tversky

val tverskyModel: TverskyModel = new TverskyModel()
       .setInputCol("vectorizedFeatures")
       .setAlpha(1.0)
       .setBeta(1.0)
tverskyModel.nearestNeighbors(countVectorizedFeaturesDataFrame, sample_key, 10).show()
tverskyModel.similarityJoin(countVectorizedFeaturesDataFrame, countVectorizedFeaturesDataFrame, threshold = 0.5).show()

Module Roadmap

  • Domain Aware Semantic Similarity Estimation
  • Anomaly Detection
  • KGE

Several further algorithms are in development. Please create a pull request and/or contact Jens Lehmann if you are interested in contributing algorithms to SANSA-ML.

Research and Experimental Projects

In recent research projects, further experimental approaches have been implemented. Due to the ongoing refactoring and re-design of the data analytics functionality, these methods are available in the Release 0.7.1 Machine Learning Layer. They are currently not maintained but can serve as inspiration for further developments. The developed approaches cover:

  • Classification (Spark) (Flink)
  • Clustering
    • Paper, Clustering Pipelines of large RDF POI Data by Rajjat Dadwal, Damien Graux, Gezim Sejdiu, Hajira Jabeen, and Jens Lehmann
    • Master's thesis, Distributed RDF Clustering Framework, Tina Boroukhian
  • Kernel
  • Kge/Linkprediction
  • Mining/AmieSpark
    • Bachelor's thesis, Association Rule Mining of Linked Data Using Apache Spark by Theresa Nathan
  • Outliers/Anomaly Detection
    • Paper, Divided we stand out! Forging Cohorts fOr Numeric Outlier Detection in large scale knowledge graphs (CONOD) by Hajira Jabeen, Rajjat Dadwal, Gezim Sejdiu, and Jens Lehmann
    • Master's thesis, Scalable Numerical Outlier Detection in Knowledge Graphs, Rajjat Dadwal
  • WordNetDistance

Some further usage examples of these modules are available in the archived SANSA_Examples repository.

How to Contribute

We always welcome new contributors to the project! Please see our contribution guide for more details on how to get started contributing to SANSA.