{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "aw_lp5wfXRSl" }, "source": [ "# JSON-LD: A Simple Introduction Using a Person\n", "\n", "**JSON-LD** (JavaScript Object Notation for Linked Data) is a lightweight syntax to express linked data using JSON. It allows you to add semantic meaning to data by referencing concepts from ontologies or controlled vocabularies like [schema.org](https://schema.org/).\n", "\n", "In this notebook, we use the example of a person to demonstrate how to:\n", "- Enrich regular JSON with semantic context\n", "- Link data to external definitions using URIs\n", "- Enable data sharing in a machine-readable, interoperable format\n", "\n", "You can think of JSON-LD as \"JSON + semantics\".\n", "\n", "## From JSON to JSON-LD\n", "\n", "Here’s what JSON-LD adds to regular JSON:\n", "- `@context`: A mapping between your terms (e.g. `\"firstName\"`) and standardized URIs that define their meaning (e.g. `\"schema:givenName\"`).\n", "- `@id`: A globally unique identifier (IRI) for the entity being described.\n", "- `@type`: A type indicator, often from a vocabulary like `schema:Person`.\n", "\n", "These additions allow machines—not just humans—to understand what your data is about.\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## A Person in JSON-LD\n", "\n", "Below is an example of a person described using JSON-LD. This is not just a person named \"Simon Clark\" — it is a semantically described entity with a globally unique identifier (`@id`), and relationships (such as their employer) that are also fully structured as linked data.\n", "\n", "The `@context` block maps the terms used in the JSON document to well-defined concepts from external vocabularies.\n", "\n", "For example:\n", "\n", "```json\n", "\"@context\": \"https://schema.org/\"\n", "```\n", "\n", "This indicates that the terms used (like givenName, birthDate, or affiliation) come from the [schema.org](https://schema.org) vocabulary. 
This mapping enables software agents and data systems to interpret the data consistently, beyond just reading key-value pairs.\n", "\n", "Using JSON-LD in this way makes the data both human-readable and machine-interpretable, which supports integration, validation, and reasoning across systems." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "id": "tNjkjGNEXJXP" }, "outputs": [], "source": [ "import jsonschema\n", "from jsonschema import validate\n", "import json\n", "import rdflib\n", "\n", "# JSON-LD representation of a person (plain JSON plus @context, @id, and @type)\n", "person_data = {\n", "    \"@context\": \"https://schema.org/\",\n", "    \"@id\": \"https://orcid.org/0000-0002-8758-6109\",\n", "    \"@type\": \"Person\",\n", "    \"givenName\": \"Simon\",\n", "    \"familyName\": \"Clark\",\n", "    \"gender\": {\"@type\": \"Male\"},\n", "    \"birthDate\": \"1987-04-23\",\n", "    \"affiliation\": {\n", "        \"@id\": \"https://ror.org/01f677e56\",\n", "        \"name\": \"SINTEF\",\n", "        \"@type\": \"ResearchOrganization\"\n", "    }\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Validating JSON-LD Structure with a JSON Schema\n", "\n", "While JSON-LD enriches data with semantic meaning, it is still fundamentally JSON, which means we can use **JSON Schema** to validate its structure.\n", "\n", "In the code below, we define a **JSON Schema** to validate the structure of a person object. This schema enforces that:\n", "- `givenName` and `familyName` are required strings,\n", "- `birthDate` must follow the `YYYY-MM-DD` format (validated with both a format and a regex),\n", "- `affiliation` and `gender`, when present, must be JSON objects (only `affiliation` is required).\n", "\n", "The `validate_json()` function uses the `jsonschema` Python package to validate the `person_data` object against this schema. 
If the data is valid, the function returns `True` with a success message; otherwise, it returns `False` with the validation error message.\n", "\n", "This is especially useful when:\n", "- Receiving data from users or external systems\n", "- Validating linked data before publishing or storage\n", "- Integrating structured data into APIs or semantic pipelines" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "HDHJuV62bTby", "outputId": "d5537f6e-4325-40ef-dfa4-c7d55159623f" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "JSON data is valid according to the schema.\n" ] } ], "source": [ "person_schema = {\n", "    \"type\": \"object\",\n", "    \"properties\": {\n", "        \"@context\": {\n", "            \"type\": [\"string\", \"object\"]  # object form if using inline mappings\n", "        },\n", "        \"@type\": {\n", "            \"type\": \"string\",\n", "        },\n", "        \"@id\": {\n", "            \"type\": \"string\",\n", "            \"format\": \"uri\"\n", "        },\n", "        \"givenName\": {\n", "            \"type\": \"string\"\n", "        },\n", "        \"familyName\": {\n", "            \"type\": \"string\",\n", "            \"minLength\": 1\n", "        },\n", "        \"birthDate\": {\n", "            \"type\": \"string\",\n", "            \"format\": \"date\",\n", "            \"pattern\": \"^[0-9]{4}-[0-1][0-9]-[0-3][0-9]$\"\n", "        },\n", "        \"gender\": {\n", "            \"type\": \"object\"\n", "        },\n", "        \"affiliation\": {\n", "            \"type\": \"object\"\n", "        }\n", "    },\n", "    \"required\": [\"@context\", \"@type\", \"@id\", \"givenName\", \"familyName\", \"birthDate\", \"affiliation\"]\n", "}\n", "\n", "# Function to validate JSON data against the schema\n", "def validate_json(data, schema):\n", "    try:\n", "        # format_checker enables the \"format\" keywords (e.g. \"date\"), which jsonschema does not enforce by default\n", "        validate(instance=data, schema=schema, format_checker=jsonschema.FormatChecker())\n", "        return True, \"JSON data is valid according to the schema.\"\n", "    except jsonschema.exceptions.ValidationError as ve:\n", "        return False, ve.message\n", "\n", "# Validate the sample JSON data\n", "is_valid, message = validate_json(person_data, person_schema)\n", "print(message)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 
Querying JSON-LD Data with SPARQL and RDFLib\n", "\n", "In this section, we demonstrate how to use `rdflib` to work with JSON-LD data and execute SPARQL queries against it.\n", "\n", "### Step 1: Create an RDF Graph\n", "\n", "We start by creating an RDF graph using `rdflib.Graph()`, which serves as a container for all the triples (subject-predicate-object statements) derived from our data.\n", "\n", "### Step 2: Load Schema.org Vocabulary\n", "\n", "We load the full [Schema.org](https://schema.org) vocabulary into the graph from its latest official JSON-LD release. This gives us access to the class hierarchy and definitions used in our person data, including terms like `schema:Person` and `schema:Organization`.\n", "\n", "### Step 3: Load JSON-LD Person Data\n", "\n", "We convert the `person_data` dictionary into a JSON string and parse it into the RDF graph. This places our instance data in the same graph as the schema definitions, allowing us to query both vocabulary and instance data together.\n", "\n", "### Step 4: Run a SPARQL Query\n", "\n", "We execute a SPARQL query to retrieve the subclasses (direct or indirect) of `schema:Organization` using the `rdfs:subClassOf*` property path. Because `*` matches zero or more steps, `schema:Organization` itself is also returned, and `LIMIT 20` caps the output at 20 rows. This is useful when you want to identify the organization-related types defined in Schema.org.\n", "\n", "### Output\n", "\n", "The result is a list of up to 20 IRIs for `schema:Organization` and its (transitive) subclasses. 
This could include entities like `schema:EducationalOrganization`, `schema:Corporation`, or `schema:ResearchOrganization`.\n", "\n", "This approach demonstrates how JSON-LD + Schema.org + SPARQL can provide a powerful way to:\n", "- Enrich data with formal semantics\n", "- Query both vocabulary and data in a unified RDF graph\n", "- Integrate data across schemas and domains" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "R-mO4FGtbr-L", "outputId": "c0ae5e87-6ade-4ef8-a639-02bf287071a9" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(rdflib.term.URIRef('http://schema.org/Organization'),)\n", "(rdflib.term.URIRef('http://schema.org/GovernmentOrganization'),)\n", "(rdflib.term.URIRef('http://schema.org/Consortium'),)\n", "(rdflib.term.URIRef('http://schema.org/PerformingGroup'),)\n", "(rdflib.term.URIRef('http://schema.org/TheaterGroup'),)\n", "(rdflib.term.URIRef('http://schema.org/MusicGroup'),)\n", "(rdflib.term.URIRef('http://schema.org/DanceGroup'),)\n", "(rdflib.term.URIRef('http://schema.org/OnlineBusiness'),)\n", "(rdflib.term.URIRef('http://schema.org/OnlineStore'),)\n", "(rdflib.term.URIRef('http://schema.org/LibrarySystem'),)\n", "(rdflib.term.URIRef('http://schema.org/SearchRescueOrganization'),)\n", "(rdflib.term.URIRef('http://schema.org/PoliticalParty'),)\n", "(rdflib.term.URIRef('http://schema.org/Corporation'),)\n", "(rdflib.term.URIRef('http://schema.org/Project'),)\n", "(rdflib.term.URIRef('http://schema.org/FundingAgency'),)\n", "(rdflib.term.URIRef('http://schema.org/ResearchProject'),)\n", "(rdflib.term.URIRef('http://schema.org/NewsMediaOrganization'),)\n", "(rdflib.term.URIRef('http://schema.org/MedicalOrganization'),)\n", "(rdflib.term.URIRef('http://schema.org/Dentist'),)\n", "(rdflib.term.URIRef('http://schema.org/MedicalClinic'),)\n" ] } ], "source": [ "# Step 1: Create an RDF Graph\n", "g = rdflib.Graph()\n", "\n", "# Step 2: Load 
Schema.org Vocabulary\n", "g.parse(\"https://schema.org/version/latest/schemaorg-current-http.jsonld\", format=\"json-ld\")\n", "\n", "# Step 3: Load JSON-LD Person Data\n", "person_data_str = json.dumps(person_data)\n", "g.parse(data=person_data_str, format=\"json-ld\")\n", "\n", "# Step 4: Run a SPARQL Query\n", "sparql_query = \"\"\"\n", "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n", "PREFIX schema: <http://schema.org/>\n", "SELECT DISTINCT ?type WHERE {\n", "    ?type rdfs:subClassOf* schema:Organization .\n", "}\n", "LIMIT 20\n", "\"\"\"\n", "\n", "# Execute the SPARQL query\n", "results = g.query(sparql_query)\n", "\n", "# Print the results\n", "for row in results:\n", "    print(row)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Querying Instances of `schema:Organization`\n", "\n", "In this example, we go one step further by querying for actual **instances** of `schema:Organization` (or any of its subclasses) present in the RDF graph.\n", "\n", "### What This SPARQL Query Does\n", "\n", "This SPARQL query performs two key operations:\n", "\n", "1. It uses:\n", "   ```sparql\n", "   ?subclass rdfs:subClassOf* schema:Organization .\n", "   ```\n", "   to find `schema:Organization` and all of its subclasses. The `*` means zero or more steps, so the match covers the class itself as well as its direct and indirect subclasses.\n", "\n", "2. 
It then finds:\n", "   ```sparql\n", "   ?instance rdf:type ?subclass .\n", "   ```\n", "   all **instances** in the graph whose `rdf:type` is one of these classes, meaning they are some kind of organization.\n", "\n", "### Why This Matters\n", "\n", "This allows us to extract not just definitions (as in the previous example), but **real data entries** that correspond to organizations (such as companies, research institutes, or educational organizations) described in your JSON-LD.\n", "\n", "Since our `person_data` includes an `affiliation` field that references a `schema:ResearchOrganization`, this query will match that and return it. The match works without a reasoner because the Schema.org class hierarchy is loaded into the same graph as the instance data.\n", "\n", "### Output\n", "\n", "The output is a list of IRIs identifying organization instances in the graph (at most 10, because of `LIMIT 10`). This provides a way to:\n", "- Discover all known organizations in your data\n", "- Use these IRIs for follow-up queries (e.g., get their name, address, or related persons)\n", "- Analyze structured relationships between people and institutions\n", "\n", "This pattern is central to working with linked data: describing entities with types, and then querying them using semantic relationships." 
] }, { "cell_type": "code", "execution_count": 4, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "h4vvw5hNhih2", "outputId": "77828fc9-bafe-414b-f6f6-65fb117aabc5" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "https://ror.org/01f677e56\n" ] } ], "source": [ "# Define and execute a SPARQL query for all instances of Organization\n", "sparql_query = \"\"\"\n", "PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>\n", "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n", "PREFIX schema: <http://schema.org/>\n", "\n", "SELECT ?instance WHERE {\n", "    ?subclass rdfs:subClassOf* schema:Organization .\n", "    ?instance rdf:type ?subclass .\n", "}\n", "LIMIT 10\n", "\"\"\"\n", "\n", "# Execute the SPARQL query\n", "results = g.query(sparql_query)\n", "\n", "# Print the results\n", "for row in results:\n", "    print(row[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Querying Birth Dates of Persons\n", "\n", "In this example, we execute a SPARQL query to retrieve the birth dates of individuals in the graph who are typed as `schema:Person`.\n", "\n", "### What the Query Does\n", "\n", "This SPARQL query has four parts:\n", "\n", "1. It matches individuals explicitly typed as `schema:Person`:\n", "   ```sparql\n", "   ?subject rdf:type schema:Person .\n", "   ```\n", "\n", "2. It reads the birth date of each person through the `schema:birthDate` property:\n", "   ```sparql\n", "   ?subject schema:birthDate ?bday .\n", "   ```\n", "\n", "3. It selects and returns only the `?bday` values, which represent literal dates.\n", "\n", "4. It includes:\n", "   ```sparql\n", "   LIMIT 10\n", "   ```\n", "   to restrict the results to at most 10 entries (useful for inspection or previewing large datasets).\n", "\n", "### Why This Matters\n", "\n", "This kind of query is useful when you want to extract **attribute values** from structured data. In this case, we’re retrieving **dates of birth** for people in the graph. 
These values can then be used for analytics, filtering, or even plotting demographics.\n", "\n", "### Assumptions\n", "\n", "- The query assumes that `schema:birthDate` is used directly with a literal (e.g., `\"1987-04-23\"`).\n", "- If the birth date is represented as a nested object or typed node, additional handling would be required in the query.\n", "\n", "### Result\n", "\n", "The cell prints the birth dates (as literals) of up to 10 individuals defined in your RDF graph.\n" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "HJfpCj4-sww3", "outputId": "c4ba6b30-7d7e-45d9-c3e4-b0d311e0fdd0" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1987-04-23\n" ] } ], "source": [ "# Define and execute a SPARQL query for the birth dates of all persons\n", "sparql_query = \"\"\"\n", "PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>\n", "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n", "PREFIX schema: <http://schema.org/>\n", "\n", "SELECT ?bday WHERE {\n", "    ?subject rdf:type schema:Person .\n", "    ?subject schema:birthDate ?bday .\n", "}\n", "LIMIT 10\n", "\"\"\"\n", "\n", "# Execute the SPARQL query\n", "results = g.query(sparql_query)\n", "\n", "# Print the results\n", "for row in results:\n", "    print(row[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Summary\n", "\n", "In this notebook, we explored how **JSON-LD** can transform regular JSON into semantically enriched, machine-readable data using well-defined vocabularies like [schema.org](https://schema.org).\n", "\n", "### Key Concepts Covered\n", "\n", "- **JSON-LD Basics**: We structured a `Person` object with fields like `@context`, `@type`, and `@id`, connecting each field to a formal semantic definition.\n", "- **JSON Schema Validation**: We used `jsonschema` to check that our JSON-LD documents are structurally valid before graph conversion.\n", "- **RDF Graph Construction**: Using `rdflib`, we converted JSON-LD data and schema.org into an RDF graph that supports reasoning and 
querying.\n", "- **SPARQL Queries**: We demonstrated three SPARQL queries to:\n", "  - Retrieve all types derived from `schema:Organization`\n", "  - Find all instances of those types\n", "  - List birth dates of individuals\n", "\n", "By combining JSON-LD, RDFLib, and SPARQL, you can:\n", "- Enrich your data with standardized semantics\n", "- Enable interoperability across systems and domains\n", "- Perform structured, meaningful queries over data\n", "- Integrate your metadata with larger knowledge graphs (e.g., Wikidata, Google Knowledge Graph)\n", "\n", "This notebook serves as a practical introduction to **semantic data modeling and querying**, a foundational component of linked data applications and the Semantic Web.\n" ] } ], "metadata": { "colab": { "provenance": [] }, "kernelspec": { "display_name": ".venv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.10" } }, "nbformat": 4, "nbformat_minor": 0 }