{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "aw_lp5wfXRSl"
},
"source": [
"# JSON-LD: A Simple Introduction Using a Person\n",
"\n",
"**JSON-LD** (JavaScript Object Notation for Linked Data) is a lightweight syntax to express linked data using JSON. It allows you to add semantic meaning to data by referencing concepts from ontologies or controlled vocabularies like [schema.org](https://schema.org/).\n",
"\n",
"In this notebook, we use the example of a person to demonstrate how to:\n",
"- Enrich regular JSON with semantic context\n",
"- Link data to external definitions using URIs\n",
"- Enable data sharing in a machine-readable, interoperable format\n",
"\n",
"You can think of JSON-LD as \"JSON + semantics\".\n",
"\n",
"## From JSON to JSON-LD\n",
"\n",
"Here’s what JSON-LD adds to regular JSON:\n",
"- `@context`: A mapping between your terms (e.g. `\"firstName\"`) and standardized URIs that define their meaning (e.g. `\"schema:givenName\"`).\n",
"- `@id`: A globally unique identifier (IRI) for the entity being described.\n",
"- `@type`: A type indicator, often from a vocabulary like `schema:Person`.\n",
"\n",
"These additions allow machines—not just humans—to understand what your data is about.\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## A Person in JSON-LD\n",
"\n",
"Below is an example of a person described using JSON-LD. This is not just a person named \"Simon Clark\" — it is a semantically described entity with a globally unique identifier (`@id`), and relationships (such as their employer) that are also fully structured as linked data.\n",
"\n",
"The `@context` block maps the terms used in the JSON document to well-defined concepts from external vocabularies.\n",
"\n",
"For example:\n",
"\n",
"```json\n",
"\"@context\": \"https://schema.org/\"\n",
"```\n",
"\n",
"This indicates that the terms used (like givenName, birthDate, or affiliation) come from the [schema.org](https://schema.org) vocabulary. This mapping enables software agents and data systems to interpret the data consistently, beyond just reading key-value pairs.\n",
"\n",
"Using JSON-LD in this way makes the data both human-readable and machine-interpretable, opening the door to powerful integration, validation, and reasoning across systems."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"id": "tNjkjGNEXJXP"
},
"outputs": [],
"source": [
"import jsonschema\n",
"from jsonschema import validate\n",
"import json\n",
"import rdflib\n",
"\n",
"# Regular JSON representation of a person\n",
"person_data = {\n",
" \"@context\": \"https://schema.org/\",\n",
" \"@id\": \"https://orcid.org/0000-0002-8758-6109\",\n",
" \"@type\": \"Person\",\n",
" \"givenName\": \"Simon\",\n",
" \"familyName\": \"Clark\",\n",
" \"gender\": {\"@type\": \"Male\"},\n",
" \"birthDate\": \"1987-04-23\",\n",
" \"affiliation\": {\n",
" \"@id\": \"https://ror.org/01f677e56\",\n",
" \"name\": \"SINTEF\",\n",
" \"@type\": \"ResearchOrganization\"\n",
" }\n",
"}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Validating JSON-LD Structure with a JSON Schema\n",
"\n",
"While JSON-LD enriches data with semantic meaning, it is still fundamentally JSON — which means we can use **JSON Schema** to validate its structure.\n",
"\n",
"In the code below, we define a **JSON Schema** to validate the structure of a person object. This schema enforces that:\n",
"- `givenName` and `familyName` are required strings,\n",
"- `birthDate` must follow the `YYYY-MM-DD` format (validated with both a format and a regex),\n",
"- `affiliation` and `gender` must be valid objects.\n",
"\n",
"The `validate_json()` function uses the `jsonschema` Python package to validate the `person_data` object against this schema. If the data is valid, it confirms success; otherwise, it prints a validation error.\n",
"\n",
"This is especially useful when:\n",
"- Receiving data from users or external systems\n",
"- Validating linked data before publishing or storage\n",
"- Integrating structured data into APIs or semantic pipelines"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "HDHJuV62bTby",
"outputId": "d5537f6e-4325-40ef-dfa4-c7d55159623f"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"JSON data is valid according to the schema.\n"
]
}
],
"source": [
"person_schema = {\n",
" \"type\": \"object\",\n",
" \"properties\": {\n",
" \"@context\": {\n",
" \"type\": [\"string\", \"object\"] # object form if using inline mappings\n",
" },\n",
" \"@type\": {\n",
" \"type\": \"string\",\n",
" },\n",
" \"@id\": {\n",
" \"type\": \"string\",\n",
" \"format\": \"uri\"\n",
" },\n",
" \"givenName\": {\n",
" \"type\": \"string\"\n",
" },\n",
" \"familyName\": {\n",
" \"type\": \"string\",\n",
" \"minLength\": 1\n",
" },\n",
" \"birthDate\": {\n",
" \"type\": \"string\",\n",
" \"format\": \"date\",\n",
" \"pattern\": \"^[0-9]{4}-[0-1][0-9]-[0-3][0-9]$\"\n",
" },\n",
" \"gender\": {\n",
" \"type\": \"object\"\n",
" },\n",
" \"affiliation\": {\n",
" \"type\": \"object\"\n",
" }\n",
" },\n",
" \"required\": [\"@context\", \"@type\", \"@id\", \"givenName\", \"familyName\", \"birthDate\", \"affiliation\"]\n",
"}\n",
"\n",
"# Function to validate JSON data against the schema\n",
"def validate_json(data, schema):\n",
" try:\n",
" validate(instance=data, schema=schema)\n",
" return True, \"JSON data is valid according to the schema.\"\n",
" except jsonschema.exceptions.ValidationError as ve:\n",
" return False, ve.message\n",
"\n",
"# Validate the sample JSON data\n",
"is_valid, message = validate_json(person_data, person_schema)\n",
"print(message)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Querying JSON-LD Data with SPARQL and RDFLib\n",
"\n",
"In this section, we demonstrate how to use `rdflib` to work with JSON-LD data and execute SPARQL queries against it.\n",
"\n",
"### Step 1: Create an RDF Graph\n",
"\n",
"We start by creating an RDF graph using `rdflib.Graph()`, which serves as a container for all the triples (subject-predicate-object statements) derived from our data.\n",
"\n",
"### Step 2: Load Schema.org Vocabulary\n",
"\n",
"We load the full [Schema.org](https://schema.org) vocabulary into the graph from its latest official JSON-LD release. This gives us access to the class hierarchy and definitions used in our person data, including terms like `schema:Person` and `schema:Organization`.\n",
"\n",
"### Step 3: Load JSON-LD Person Data\n",
"\n",
"We convert the `person_data` dictionary into a JSON string and parse it into the RDF graph. This integrates our structured data with the schema definitions, allowing us to query both vocabulary and instance data together.\n",
"\n",
"### Step 4: Run a SPARQL Query\n",
"\n",
"We execute a SPARQL query to retrieve all subclasses (direct or indirect) of `schema:Organization` using the `rdfs:subClassOf*` path operator. This is useful when you want to identify all organization-related types defined in Schema.org.\n",
"\n",
"### Output\n",
"\n",
"The result is a list of IRIs for types that are (transitively) subclasses of `schema:Organization`. This could include entities like `schema:EducationalOrganization`, `schema:Corporation`, or `schema:ResearchOrganization`.\n",
"\n",
"This approach demonstrates how JSON-LD + Schema.org + SPARQL can provide a powerful way to:\n",
"- Enrich data with formal semantics\n",
"- Query both vocabulary and data in a unified RDF graph\n",
"- Integrate data across schemas and domains"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "R-mO4FGtbr-L",
"outputId": "c0ae5e87-6ade-4ef8-a639-02bf287071a9"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(rdflib.term.URIRef('http://schema.org/Organization'),)\n",
"(rdflib.term.URIRef('http://schema.org/GovernmentOrganization'),)\n",
"(rdflib.term.URIRef('http://schema.org/Consortium'),)\n",
"(rdflib.term.URIRef('http://schema.org/PerformingGroup'),)\n",
"(rdflib.term.URIRef('http://schema.org/TheaterGroup'),)\n",
"(rdflib.term.URIRef('http://schema.org/MusicGroup'),)\n",
"(rdflib.term.URIRef('http://schema.org/DanceGroup'),)\n",
"(rdflib.term.URIRef('http://schema.org/OnlineBusiness'),)\n",
"(rdflib.term.URIRef('http://schema.org/OnlineStore'),)\n",
"(rdflib.term.URIRef('http://schema.org/LibrarySystem'),)\n",
"(rdflib.term.URIRef('http://schema.org/SearchRescueOrganization'),)\n",
"(rdflib.term.URIRef('http://schema.org/PoliticalParty'),)\n",
"(rdflib.term.URIRef('http://schema.org/Corporation'),)\n",
"(rdflib.term.URIRef('http://schema.org/Project'),)\n",
"(rdflib.term.URIRef('http://schema.org/FundingAgency'),)\n",
"(rdflib.term.URIRef('http://schema.org/ResearchProject'),)\n",
"(rdflib.term.URIRef('http://schema.org/NewsMediaOrganization'),)\n",
"(rdflib.term.URIRef('http://schema.org/MedicalOrganization'),)\n",
"(rdflib.term.URIRef('http://schema.org/Dentist'),)\n",
"(rdflib.term.URIRef('http://schema.org/MedicalClinic'),)\n"
]
}
],
"source": [
"# Step 1: Create an RDF Graph\n",
"g = rdflib.Graph()\n",
"\n",
"# Step 2: Load Schema.org Vocabulary\n",
"g.parse(\"https://schema.org/version/latest/schemaorg-current-http.jsonld\", format=\"json-ld\")\n",
"\n",
"# Step 3: Load JSON-LD Person Data\n",
"person_data_str = json.dumps(person_data)\n",
"g.parse(data=person_data_str, format=\"json-ld\")\n",
"\n",
"# Step 4: Run a SPARQL Query\n",
"sparql_query = \"\"\"\n",
"PREFIX schema: \n",
"SELECT DISTINCT ?type WHERE {\n",
" ?type rdfs:subClassOf* schema:Organization .\n",
"}\n",
"LIMIT 20\n",
"\"\"\"\n",
"\n",
"# Execute the SPARQL query\n",
"results = g.query(sparql_query)\n",
"\n",
"# Print the results\n",
"for row in results:\n",
" print(row)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Querying Instances of `schema:Organization`\n",
"\n",
"In this example, we go one step further by querying for actual **instances** of `schema:Organization` (or any of its subclasses) present in the RDF graph.\n",
"\n",
"### What This SPARQL Query Does\n",
"\n",
"This SPARQL query performs two key operations:\n",
"\n",
"1. It uses:\n",
" ```sparql\n",
" ?subclass rdfs:subClassOf* schema:Organization .\n",
" ```\n",
" to find all types that are subclasses of `schema:Organization`. The `*` means it includes both direct and indirect subclasses.\n",
"\n",
"2. It then finds:\n",
" ```sparql\n",
" ?instance rdf:type ?subclass .\n",
" ```\n",
" all **instances** in the graph whose `rdf:type` is one of these subclasses — meaning they are some kind of organization.\n",
"\n",
"### Why This Matters\n",
"\n",
"This allows us to extract not just definitions (as in the previous example), but **real data entries** that correspond to organizations — such as companies, research institutes, or educational organizations — described in your JSON-LD.\n",
"\n",
"Since our `person_data` includes an `affiliation` field that references a `schema:ResearchOrganization`, this query will match that and return it.\n",
"\n",
"### Output\n",
"\n",
"The output is a list of IRIs identifying each organization instance in the graph. This provides a powerful way to:\n",
"- Discover all known organizations in your data\n",
"- Use these IRIs for follow-up queries (e.g., get their name, address, or related persons)\n",
"- Analyze structured relationships between people and institutions\n",
"\n",
"This pattern is central to working with linked data: describing entities with types, and then querying them using semantic relationships."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "h4vvw5hNhih2",
"outputId": "77828fc9-bafe-414b-f6f6-65fb117aabc5"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"https://ror.org/01f677e56\n"
]
}
],
"source": [
"# Define and execute a SPARQL query for all instances of Organization\n",
"sparql_query = \"\"\"\n",
"PREFIX rdf: \n",
"PREFIX rdfs: \n",
"PREFIX schema: \n",
"\n",
"SELECT ?instance WHERE {\n",
" ?subclass rdfs:subClassOf* schema:Organization .\n",
" ?instance rdf:type ?subclass .\n",
"}\n",
"LIMIT 10\n",
"\"\"\"\n",
"\n",
"# Execute the SPARQL query\n",
"results = g.query(sparql_query)\n",
"\n",
"# Print the results\n",
"for row in results:\n",
" print(row[0])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Querying Birth Dates of Persons\n",
"\n",
"In this example, we execute a SPARQL query to retrieve the birth dates of individuals in the graph who are typed as `schema:Person`.\n",
"\n",
"### What the Query Does\n",
"\n",
"This SPARQL query looks for:\n",
"\n",
"1. Individuals explicitly typed as `schema:Person`:\n",
" ```sparql\n",
" ?subject rdf:type schema:Person .\n",
" ```\n",
"\n",
"2. The associated birth date of each person using the `schema:birthDate` property:\n",
" ```sparql\n",
" ?subject schema:birthDate ?bday .\n",
" ```\n",
"\n",
"3. It selects and returns only the `?bday` values, which represent literal dates.\n",
"\n",
"4. The query includes:\n",
" ```sparql\n",
" LIMIT 10\n",
" ```\n",
" to restrict the results to the first 10 entries (useful for inspection or previewing large datasets).\n",
"\n",
"### Why This Matters\n",
"\n",
"This kind of query is useful when you want to extract **attribute values** from structured data. In this case, we’re retrieving **dates of birth** for people in the graph. These values can then be used for analytics, filtering, or even plotting demographics.\n",
"\n",
"### Assumptions\n",
"\n",
"- It assumes that `schema:birthDate` is used directly with a literal (e.g., `\"1987-04-23\"`).\n",
"- If the birth date is represented as a nested object or typed node, additional handling would be required in the query.\n",
"\n",
"### Result\n",
"\n",
"The query prints a list of birth dates (as literals) for up to 10 individuals defined in your RDF graph.\n"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "HJfpCj4-sww3",
"outputId": "c4ba6b30-7d7e-45d9-c3e4-b0d311e0fdd0"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1987-04-23\n"
]
}
],
"source": [
"# Define and execute a SPARQL query for all instances of Organization\n",
"sparql_query = \"\"\"\n",
"PREFIX rdf: \n",
"PREFIX rdfs: \n",
"PREFIX schema: \n",
"\n",
"SELECT ?bday WHERE {\n",
" ?subject rdf:type schema:Person .\n",
" ?subject schema:birthDate ?bday .\n",
"}\n",
"LIMIT 10\n",
"\"\"\"\n",
"\n",
"# Execute the SPARQL query\n",
"results = g.query(sparql_query)\n",
"\n",
"# Print the results\n",
"for row in results:\n",
" print(row[0])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Summary\n",
"\n",
"In this notebook, we explored how **JSON-LD** can transform regular JSON into semantically enriched, machine-readable data using well-defined vocabularies like [schema.org](https://schema.org).\n",
"\n",
"### Key Concepts Covered\n",
"\n",
"- **JSON-LD Basics**: We structured a `Person` object with fields like `@context`, `@type`, and `@id`, connecting each field to a formal semantic definition.\n",
"- **JSON Schema Validation**: We used `jsonschema` to ensure that our JSON-LD documents are syntactically valid before graph conversion.\n",
"- **RDF Graph Construction**: Using `rdflib`, we converted JSON-LD data and schema.org into an RDF graph that supports reasoning and querying.\n",
"- **SPARQL Queries**: We demonstrated several SPARQL queries to:\n",
" - Retrieve all types derived from `schema:Organization`\n",
" - Find all instances of those types\n",
" - Count people with gender set to `schema:Male`\n",
" - List birth dates of individuals\n",
"\n",
"By combining JSON-LD, RDFLib, and SPARQL:\n",
"- You can enrich your data with standardized semantics\n",
"- Enable interoperability across systems and domains\n",
"- Perform structured, meaningful queries over data\n",
"- Integrate your metadata with larger knowledge graphs (e.g., Wikidata, Google Knowledge Graph)\n",
"\n",
"This notebook serves as a practical introduction to **semantic data modeling and querying** — a foundational component of linked data applications and the Semantic Web.\n"
]
}
],
"metadata": {
"colab": {
"provenance": []
},
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.10"
}
},
"nbformat": 4,
"nbformat_minor": 0
}