Loading RDF
Scala
We suggest importing the net.sansa_stack.rdf.spark.io package, which adds an rdf() function to a Spark session.
You can either explicitly specify the type of RDF serialization or let the API guess the format based on the file extension.
For example, the following Scala code shows how to read an RDF file in N-Triples syntax (be it a local file or a file residing in HDFS) into a Spark RDD:
Load as RDD
Scala
import net.sansa_stack.rdf.spark.io._
import org.apache.jena.riot.Lang
import org.apache.spark.sql.SparkSession

val spark: SparkSession = ...
val lang = Lang.NTRIPLES
// path points to the input file (local or in HDFS)
val triples = spark.rdf(lang)(path)
triples.take(5).foreach(println)
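The elements of the resulting RDD are Jena Triple objects, so the usual Jena accessors are available. A small sketch (the rdf:type filter is merely illustrative):
import org.apache.jena.vocabulary.RDF

// count triples whose predicate is rdf:type
val typeCount = triples
  .filter(t => t.getPredicate.getURI == RDF.`type`.getURI)
  .count()
println(s"rdf:type triples: $typeCount")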
Load as DataFrame
Scala
import net.sansa_stack.rdf.spark.io._
import org.apache.jena.riot.Lang
import org.apache.spark.sql.SparkSession

val spark: SparkSession = ...
val lang = Lang.NTRIPLES
// spark.read.rdf returns a DataFrame rather than an RDD
val triples = spark.read.rdf(lang)(path)
triples.take(5).foreach(println)
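Since the result is a plain DataFrame, Spark SQL can be used on it directly. A minimal sketch, assuming subject/predicate/object columns named s, p and o (check triples.printSchema() for the actual layout):
// register the triples as a temporary view and query it with SQL
triples.createOrReplaceTempView("triples")
val predicates = spark.sql("SELECT DISTINCT p FROM triples")
predicates.show(10, truncate = false)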
Java
The main class for loading RDDs with Java is net.sansa_stack.spark.io.rdf.input.impl.RdfSourceFactoryImpl:
Java
import net.sansa_stack.spark.io.rdf.input.api.RdfSource;
import net.sansa_stack.spark.io.rdf.input.api.RdfSourceFactory;
import net.sansa_stack.spark.io.rdf.input.impl.RdfSourceFactoryImpl;
import org.apache.jena.graph.Triple;
import org.apache.jena.query.Dataset;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.sparql.core.Quad;
import org.apache.spark.rdd.RDD;
import org.apache.spark.sql.SparkSession;
SparkSession sparkSession = SparkSession.builder().config(sparkConf).getOrCreate();
RdfSourceFactory rdfSourceFactory = RdfSourceFactoryImpl.from(sparkSession);
RdfSource rdfSource = rdfSourceFactory.get("path");
RDD<Triple> rddOfTriples = rdfSource.asTriples();
RDD<Quad> rddOfQuads = rdfSource.asQuads();
RDD<Model> rddOfModels = rdfSource.asModels();
RDD<Dataset> rddOfDatasets = rdfSource.asDatasets();
Note that a JavaSparkContext is not necessary for basic RDF loading. Should you need one, it can easily be obtained with:
JavaSparkContext javaSparkContext = JavaSparkContext.fromSparkContext(sparkSession.sparkContext());
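The RDDs returned by RdfSource are Scala RDDs; for the Java-friendly collection API they can be wrapped via toJavaRDD(). A brief sketch, reusing rddOfTriples from the block above:
import org.apache.spark.api.java.JavaRDD;

// wrap the Scala RDD to get the Java-friendly API
JavaRDD<Triple> javaRddOfTriples = rddOfTriples.toJavaRDD();
System.out.println("Number of triples: " + javaRddOfTriples.count());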