{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "aw_lp5wfXRSl"
      },
      "source": [
        "# JSON-LD: A Simple Introduction Using a Person\n",
        "\n",
        "**JSON-LD** (JavaScript Object Notation for Linked Data) is a lightweight syntax to express linked data using JSON. It allows you to add semantic meaning to data by referencing concepts from ontologies or controlled vocabularies like [schema.org](https://schema.org/).\n",
        "\n",
        "In this notebook, we use the example of a person to demonstrate how to:\n",
        "- Enrich regular JSON with semantic context\n",
        "- Link data to external definitions using URIs\n",
        "- Enable data sharing in a machine-readable, interoperable format\n",
        "\n",
        "You can think of JSON-LD as \"JSON + semantics\".\n",
        "\n",
        "## From JSON to JSON-LD\n",
        "\n",
        "Here’s what JSON-LD adds to regular JSON:\n",
        "- `@context`: A mapping between your terms (e.g. `\"firstName\"`) and standardized URIs that define their meaning (e.g. `\"schema:givenName\"`).\n",
        "- `@id`: A globally unique identifier (IRI) for the entity being described.\n",
        "- `@type`: A type indicator, often from a vocabulary like `schema:Person`.\n",
        "\n",
        "These additions allow machines—not just humans—to understand what your data is about.\n",
        "\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## A Person in JSON-LD\n",
        "\n",
        "Below is an example of a person described using JSON-LD. This is not just a person named \"Simon Clark\" — it is a semantically described entity with a globally unique identifier (`@id`), and relationships (such as their employer) that are also fully structured as linked data.\n",
        "\n",
        "The `@context` block maps the terms used in the JSON document to well-defined concepts from external vocabularies.\n",
        "\n",
        "For example:\n",
        "\n",
        "```json\n",
        "\"@context\": \"https://schema.org/\"\n",
        "```\n",
        "\n",
        "This indicates that the terms used (like givenName, birthDate, or affiliation) come from the [schema.org](https://schema.org) vocabulary. This mapping enables software agents and data systems to interpret the data consistently, beyond just reading key-value pairs.\n",
        "\n",
        "Using JSON-LD in this way makes the data both human-readable and machine-interpretable, opening the door to powerful integration, validation, and reasoning across systems."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 1,
      "metadata": {
        "id": "tNjkjGNEXJXP"
      },
      "outputs": [],
      "source": [
        "import jsonschema\n",
        "from jsonschema import validate\n",
        "import json\n",
        "import rdflib\n",
        "\n",
        "# Regular JSON representation of a person\n",
        "person_data = {\n",
        "    \"@context\": \"https://schema.org/\",\n",
        "    \"@id\": \"https://orcid.org/0000-0002-8758-6109\",\n",
        "    \"@type\": \"Person\",\n",
        "    \"givenName\": \"Simon\",\n",
        "    \"familyName\": \"Clark\",\n",
        "    \"gender\": {\"@type\": \"Male\"},\n",
        "    \"birthDate\": \"1987-04-23\",\n",
        "    \"affiliation\": {\n",
        "        \"@id\": \"https://ror.org/01f677e56\",\n",
        "        \"name\": \"SINTEF\",\n",
        "        \"@type\": \"ResearchOrganization\"\n",
        "    }\n",
        "}"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Validating JSON-LD Structure with a JSON Schema\n",
        "\n",
        "While JSON-LD enriches data with semantic meaning, it is still fundamentally JSON — which means we can use **JSON Schema** to validate its structure.\n",
        "\n",
        "In the code below, we define a **JSON Schema** to validate the structure of a person object. This schema enforces that:\n",
        "- `givenName` and `familyName` are required strings,\n",
        "- `birthDate` must follow the `YYYY-MM-DD` format (validated with both a format and a regex),\n",
        "- `affiliation` and `gender` must be valid objects.\n",
        "\n",
        "The `validate_json()` function uses the `jsonschema` Python package to validate the `person_data` object against this schema. If the data is valid, it confirms success; otherwise, it prints a validation error.\n",
        "\n",
        "This is especially useful when:\n",
        "- Receiving data from users or external systems\n",
        "- Validating linked data before publishing or storage\n",
        "- Integrating structured data into APIs or semantic pipelines"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 2,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "HDHJuV62bTby",
        "outputId": "d5537f6e-4325-40ef-dfa4-c7d55159623f"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "JSON data is valid according to the schema.\n"
          ]
        }
      ],
      "source": [
        "person_schema = {\n",
        "    \"type\": \"object\",\n",
        "    \"properties\": {\n",
        "        \"@context\": {\n",
        "            \"type\": [\"string\", \"object\"]  # object form if using inline mappings\n",
        "        },\n",
        "        \"@type\": {\n",
        "            \"type\": \"string\",\n",
        "        },\n",
        "        \"@id\": {\n",
        "            \"type\": \"string\",\n",
        "            \"format\": \"uri\"\n",
        "        },\n",
        "        \"givenName\": {\n",
        "            \"type\": \"string\"\n",
        "        },\n",
        "        \"familyName\": {\n",
        "            \"type\": \"string\",\n",
        "            \"minLength\": 1\n",
        "        },\n",
        "        \"birthDate\": {\n",
        "            \"type\": \"string\",\n",
        "            \"format\": \"date\",\n",
        "            \"pattern\": \"^[0-9]{4}-[0-1][0-9]-[0-3][0-9]$\"\n",
        "        },\n",
        "        \"gender\": {\n",
        "            \"type\": \"object\"\n",
        "        },\n",
        "        \"affiliation\": {\n",
        "            \"type\": \"object\"\n",
        "        }\n",
        "    },\n",
        "    \"required\": [\"@context\", \"@type\", \"@id\", \"givenName\", \"familyName\", \"birthDate\", \"affiliation\"]\n",
        "}\n",
        "\n",
        "# Function to validate JSON data against the schema\n",
        "def validate_json(data, schema):\n",
        "    try:\n",
        "        validate(instance=data, schema=schema)\n",
        "        return True, \"JSON data is valid according to the schema.\"\n",
        "    except jsonschema.exceptions.ValidationError as ve:\n",
        "        return False, ve.message\n",
        "\n",
        "# Validate the sample JSON data\n",
        "is_valid, message = validate_json(person_data, person_schema)\n",
        "print(message)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Querying JSON-LD Data with SPARQL and RDFLib\n",
        "\n",
        "In this section, we demonstrate how to use `rdflib` to work with JSON-LD data and execute SPARQL queries against it.\n",
        "\n",
        "### Step 1: Create an RDF Graph\n",
        "\n",
        "We start by creating an RDF graph using `rdflib.Graph()`, which serves as a container for all the triples (subject-predicate-object statements) derived from our data.\n",
        "\n",
        "### Step 2: Load Schema.org Vocabulary\n",
        "\n",
        "We load the full [Schema.org](https://schema.org) vocabulary into the graph from its latest official JSON-LD release. This gives us access to the class hierarchy and definitions used in our person data, including terms like `schema:Person` and `schema:Organization`.\n",
        "\n",
        "### Step 3: Load JSON-LD Person Data\n",
        "\n",
        "We convert the `person_data` dictionary into a JSON string and parse it into the RDF graph. This integrates our structured data with the schema definitions, allowing us to query both vocabulary and instance data together.\n",
        "\n",
        "### Step 4: Run a SPARQL Query\n",
        "\n",
        "We execute a SPARQL query to retrieve all subclasses (direct or indirect) of `schema:Organization` using the `rdfs:subClassOf*` path operator. This is useful when you want to identify all organization-related types defined in Schema.org.\n",
        "\n",
        "### Output\n",
        "\n",
        "The result is a list of IRIs for types that are (transitively) subclasses of `schema:Organization`. This could include entities like `schema:EducationalOrganization`, `schema:Corporation`, or `schema:ResearchOrganization`.\n",
        "\n",
        "This approach demonstrates how JSON-LD + Schema.org + SPARQL can provide a powerful way to:\n",
        "- Enrich data with formal semantics\n",
        "- Query both vocabulary and data in a unified RDF graph\n",
        "- Integrate data across schemas and domains"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 3,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "R-mO4FGtbr-L",
        "outputId": "c0ae5e87-6ade-4ef8-a639-02bf287071a9"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "(rdflib.term.URIRef('http://schema.org/Organization'),)\n",
            "(rdflib.term.URIRef('http://schema.org/GovernmentOrganization'),)\n",
            "(rdflib.term.URIRef('http://schema.org/Consortium'),)\n",
            "(rdflib.term.URIRef('http://schema.org/PerformingGroup'),)\n",
            "(rdflib.term.URIRef('http://schema.org/TheaterGroup'),)\n",
            "(rdflib.term.URIRef('http://schema.org/MusicGroup'),)\n",
            "(rdflib.term.URIRef('http://schema.org/DanceGroup'),)\n",
            "(rdflib.term.URIRef('http://schema.org/OnlineBusiness'),)\n",
            "(rdflib.term.URIRef('http://schema.org/OnlineStore'),)\n",
            "(rdflib.term.URIRef('http://schema.org/LibrarySystem'),)\n",
            "(rdflib.term.URIRef('http://schema.org/SearchRescueOrganization'),)\n",
            "(rdflib.term.URIRef('http://schema.org/PoliticalParty'),)\n",
            "(rdflib.term.URIRef('http://schema.org/Corporation'),)\n",
            "(rdflib.term.URIRef('http://schema.org/Project'),)\n",
            "(rdflib.term.URIRef('http://schema.org/FundingAgency'),)\n",
            "(rdflib.term.URIRef('http://schema.org/ResearchProject'),)\n",
            "(rdflib.term.URIRef('http://schema.org/NewsMediaOrganization'),)\n",
            "(rdflib.term.URIRef('http://schema.org/MedicalOrganization'),)\n",
            "(rdflib.term.URIRef('http://schema.org/Dentist'),)\n",
            "(rdflib.term.URIRef('http://schema.org/MedicalClinic'),)\n"
          ]
        }
      ],
      "source": [
        "# Step 1: Create an RDF Graph\n",
        "g = rdflib.Graph()\n",
        "\n",
        "# Step 2: Load Schema.org Vocabulary\n",
        "g.parse(\"https://schema.org/version/latest/schemaorg-current-http.jsonld\", format=\"json-ld\")\n",
        "\n",
        "# Step 3: Load JSON-LD Person Data\n",
        "person_data_str = json.dumps(person_data)\n",
        "g.parse(data=person_data_str, format=\"json-ld\")\n",
        "\n",
        "# Step 4: Run a SPARQL Query\n",
        "sparql_query = \"\"\"\n",
        "PREFIX schema: <http://schema.org/>\n",
        "SELECT DISTINCT ?type WHERE {\n",
        "  ?type rdfs:subClassOf* schema:Organization .\n",
        "}\n",
        "LIMIT 20\n",
        "\"\"\"\n",
        "\n",
        "# Execute the SPARQL query\n",
        "results = g.query(sparql_query)\n",
        "\n",
        "# Print the results\n",
        "for row in results:\n",
        "    print(row)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Querying Instances of `schema:Organization`\n",
        "\n",
        "In this example, we go one step further by querying for actual **instances** of `schema:Organization` (or any of its subclasses) present in the RDF graph.\n",
        "\n",
        "### What This SPARQL Query Does\n",
        "\n",
        "This SPARQL query performs two key operations:\n",
        "\n",
        "1. It uses:\n",
        "   ```sparql\n",
        "   ?subclass rdfs:subClassOf* schema:Organization .\n",
        "   ```\n",
        "   to find all types that are subclasses of `schema:Organization`. The `*` means it includes both direct and indirect subclasses.\n",
        "\n",
        "2. It then finds:\n",
        "   ```sparql\n",
        "   ?instance rdf:type ?subclass .\n",
        "   ```\n",
        "   all **instances** in the graph whose `rdf:type` is one of these subclasses — meaning they are some kind of organization.\n",
        "\n",
        "### Why This Matters\n",
        "\n",
        "This allows us to extract not just definitions (as in the previous example), but **real data entries** that correspond to organizations — such as companies, research institutes, or educational organizations — described in your JSON-LD.\n",
        "\n",
        "Since our `person_data` includes an `affiliation` field that references a `schema:ResearchOrganization`, this query will match that and return it.\n",
        "\n",
        "### Output\n",
        "\n",
        "The output is a list of IRIs identifying each organization instance in the graph. This provides a powerful way to:\n",
        "- Discover all known organizations in your data\n",
        "- Use these IRIs for follow-up queries (e.g., get their name, address, or related persons)\n",
        "- Analyze structured relationships between people and institutions\n",
        "\n",
        "This pattern is central to working with linked data: describing entities with types, and then querying them using semantic relationships."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 4,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "h4vvw5hNhih2",
        "outputId": "77828fc9-bafe-414b-f6f6-65fb117aabc5"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "https://ror.org/01f677e56\n"
          ]
        }
      ],
      "source": [
        "# Define and execute a SPARQL query for all instances of Organization\n",
        "sparql_query = \"\"\"\n",
        "PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>\n",
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n",
        "PREFIX schema: <http://schema.org/>\n",
        "\n",
        "SELECT ?instance WHERE {\n",
        "    ?subclass rdfs:subClassOf* schema:Organization .\n",
        "    ?instance rdf:type ?subclass .\n",
        "}\n",
        "LIMIT 10\n",
        "\"\"\"\n",
        "\n",
        "# Execute the SPARQL query\n",
        "results = g.query(sparql_query)\n",
        "\n",
        "# Print the results\n",
        "for row in results:\n",
        "    print(row[0])"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Querying Birth Dates of Persons\n",
        "\n",
        "In this example, we execute a SPARQL query to retrieve the birth dates of individuals in the graph who are typed as `schema:Person`.\n",
        "\n",
        "### What the Query Does\n",
        "\n",
        "This SPARQL query looks for:\n",
        "\n",
        "1. Individuals explicitly typed as `schema:Person`:\n",
        "   ```sparql\n",
        "   ?subject rdf:type schema:Person .\n",
        "   ```\n",
        "\n",
        "2. The associated birth date of each person using the `schema:birthDate` property:\n",
        "   ```sparql\n",
        "   ?subject schema:birthDate ?bday .\n",
        "   ```\n",
        "\n",
        "3. It selects and returns only the `?bday` values, which represent literal dates.\n",
        "\n",
        "4. The query includes:\n",
        "   ```sparql\n",
        "   LIMIT 10\n",
        "   ```\n",
        "   to restrict the results to the first 10 entries (useful for inspection or previewing large datasets).\n",
        "\n",
        "### Why This Matters\n",
        "\n",
        "This kind of query is useful when you want to extract **attribute values** from structured data. In this case, we’re retrieving **dates of birth** for people in the graph. These values can then be used for analytics, filtering, or even plotting demographics.\n",
        "\n",
        "### Assumptions\n",
        "\n",
        "- It assumes that `schema:birthDate` is used directly with a literal (e.g., `\"1987-04-23\"`).\n",
        "- If the birth date is represented as a nested object or typed node, additional handling would be required in the query.\n",
        "\n",
        "### Result\n",
        "\n",
        "The query prints a list of birth dates (as literals) for up to 10 individuals defined in your RDF graph.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 5,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "HJfpCj4-sww3",
        "outputId": "c4ba6b30-7d7e-45d9-c3e4-b0d311e0fdd0"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "1987-04-23\n"
          ]
        }
      ],
      "source": [
        "# Define and execute a SPARQL query for all instances of Organization\n",
        "sparql_query = \"\"\"\n",
        "PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>\n",
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n",
        "PREFIX schema: <http://schema.org/>\n",
        "\n",
        "SELECT ?bday WHERE {\n",
        "    ?subject rdf:type schema:Person .\n",
        "    ?subject schema:birthDate ?bday .\n",
        "}\n",
        "LIMIT 10\n",
        "\"\"\"\n",
        "\n",
        "# Execute the SPARQL query\n",
        "results = g.query(sparql_query)\n",
        "\n",
        "# Print the results\n",
        "for row in results:\n",
        "    print(row[0])"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Summary\n",
        "\n",
        "In this notebook, we explored how **JSON-LD** can transform regular JSON into semantically enriched, machine-readable data using well-defined vocabularies like [schema.org](https://schema.org).\n",
        "\n",
        "### Key Concepts Covered\n",
        "\n",
        "- **JSON-LD Basics**: We structured a `Person` object with fields like `@context`, `@type`, and `@id`, connecting each field to a formal semantic definition.\n",
        "- **JSON Schema Validation**: We used `jsonschema` to ensure that our JSON-LD documents are syntactically valid before graph conversion.\n",
        "- **RDF Graph Construction**: Using `rdflib`, we converted JSON-LD data and schema.org into an RDF graph that supports reasoning and querying.\n",
        "- **SPARQL Queries**: We demonstrated several SPARQL queries to:\n",
        "  - Retrieve all types derived from `schema:Organization`\n",
        "  - Find all instances of those types\n",
        "  - Count people with gender set to `schema:Male`\n",
        "  - List birth dates of individuals\n",
        "\n",
        "By combining JSON-LD, RDFLib, and SPARQL:\n",
        "- You can enrich your data with standardized semantics\n",
        "- Enable interoperability across systems and domains\n",
        "- Perform structured, meaningful queries over data\n",
        "- Integrate your metadata with larger knowledge graphs (e.g., Wikidata, Google Knowledge Graph)\n",
        "\n",
        "This notebook serves as a practical introduction to **semantic data modeling and querying** — a foundational component of linked data applications and the Semantic Web.\n"
      ]
    }
  ],
  "metadata": {
    "colab": {
      "provenance": []
    },
    "kernelspec": {
      "display_name": ".venv",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.10.10"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}