Wikifier documentation

Wikification

To call the JSI Wikifier, send an HTTP GET request to a URL of the following form:

http://www.wikifier.org/annotate-article?text=...&lang=...&...

Alternatively, you can send a POST request with a Content-Type of application/x-www-form-urlencoded and pass the parameters (everything after the ? character) in the request body.
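
As a minimal sketch of the GET form, the query string can be assembled with urllib.parse.urlencode from the Python standard library, which also takes care of escaping the text. The helper name build_wikifier_url is made up for this illustration, and only the text and lang parameters shown in the URL above are used; the function builds the URL but does not send the request.

```python
import urllib.parse

def build_wikifier_url(text, lang="en"):
    # Hypothetical helper: encode the parameters and append them after "?".
    query = urllib.parse.urlencode([("text", text), ("lang", lang)])
    return "http://www.wikifier.org/annotate-article?" + query

print(build_wikifier_url("New York City"))
# http://www.wikifier.org/annotate-article?text=New+York+City&lang=en
```

For a POST request, the same encoded string would go into the request body instead of the URL.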

The server is still in development, so it is occasionally unavailable.

The following parameters are supported:

Output format

The Wikifier returns a JSON response of the following form:

{
  "annotations": [ ... ],
  "spaces":["", " ", " ", "."],
  "words":["New", "York", "City"],
  "ranges": [ ... ]
}

The spaces and words arrays show how the input document has been split into words. It is always the case that spaces has exactly 1 more element than words and that concatenating spaces[0] + words[0] + spaces[1] + words[1] + ... + spaces[N-1] + words[N-1] + spaces[N] (where N is the length of words) is exactly equal to the input document (the one that was passed as the &text=... parameter).
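
The concatenation rule above can be sketched in a few lines of Python. The function name rebuild_text is an assumption made for this example; the sample arrays are the ones from the response shown earlier.

```python
def rebuild_text(spaces, words):
    # spaces always has exactly one more element than words.
    assert len(spaces) == len(words) + 1
    # Interleave spaces[i] + words[i], then append the final spaces[N].
    return "".join(s + w for s, w in zip(spaces, words)) + spaces[-1]

spaces = ["", " ", " ", "."]
words = ["New", "York", "City"]
print(rebuild_text(spaces, words))  # New York City.
```

Comparing the result with the text you sent is a quick way to check that a response was parsed correctly.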

annotations is an array of objects of the following form:

{
  "title":"New York City",
  "url":"http:\/\/en.wikipedia.org\/wiki\/New_York_City",
  "lang":"en",
  "pageRank":0.102831, "cosine":0.662925,
  "secLang": "en",
  "secTitle":"New York City",
  "secUrl":"http:\/\/en.wikipedia.org\/wiki\/New_York_City",
  "wikiDataClasses": [
    {"itemId":"Q515", "enLabel":"city"},
    {"itemId":"Q1549591", "enLabel":"big city"},
    ...
  ],
  "wikiDataClassIds": ["Q515", "Q1549591", ...],
  "dbPediaTypes":["City", "Settlement", "PopulatedPlace", ...],
  "dbPediaIri":"http:\/\/dbpedia.org\/resource\/New_York_City",
  "supportLen":2.000000,
  "support": [
    {"wFrom":0.000000, "wTo":1.000000, "pMentionGivenSurface":0.122591, "pageRank":0.018634},
    {"wFrom":0.000000, "wTo":2.000000, "pMentionGivenSurface":0.483354, "pageRank":0.073469}
  ]
}

ranges is an array of objects of the following form:

{
    "wFrom": 0, "wTo": 1, "pageRank":0.018634, "pMentionGivenSurface":0.122591,
    "candidates": [
        {"title":"New York", "url":"http:\/\/en.wikipedia.org\/wiki\/New_York", "cosine":0.578839, "linkCount":63626, "pageRank":0.049533},
        {"title":"New York City", "url":"http:\/\/en.wikipedia.org\/wiki\/New_York_City", "cosine":0.662925, "linkCount":11589, "pageRank":0.102831},
        {"title":"New York (magazine)", "url":"http:\/\/en.wikipedia.org\/wiki\/New_York_(magazine)", "cosine":0.431092, "linkCount":2159, "pageRank":0.030795},
		...
    ]
}

The first four members are the same as in support; in this particular example, we have wFrom = 0 and wTo = 1, so this object refers to the phrase "New York". The candidates array is a list of all the pages in the Wikipedia that are pointed to by links (from other pages in the Wikipedia) whose anchor text is the same as this phrase; for each such page, we have an object giving its title, Wikipedia URL, cosine similarity with the input document, number of links with this anchor text pointing to this particular page, and the pagerank score of this candidate annotation. For phrases that generate too many candidates, some of these candidates might not participate in the pagerank computation; in that case pageRank is shown as -1 instead.
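
The following sketch shows one way to work with these objects: it recovers the phrase that a pair of wFrom and wTo values refers to, and it orders the candidates of a range by their pagerank score while skipping those marked with -1. The function names are made up for this illustration. The candidate list is an abbreviated copy of the example above, plus one hypothetical entry with pageRank -1 that was added only to show the filtering.

```python
def phrase_for_range(words, spaces, w_from, w_to):
    # wFrom and wTo are inclusive word indices; the support entries
    # report them as floats (e.g. 2.000000), so convert them first.
    w_from, w_to = int(w_from), int(w_to)
    parts = [words[w_from]]
    for i in range(w_from + 1, w_to + 1):
        parts.append(spaces[i])  # the separator that precedes words[i]
        parts.append(words[i])
    return "".join(parts)

def ranked_candidates(range_obj):
    # Drop candidates that did not participate in the pagerank
    # computation (pageRank == -1), then sort by descending score.
    scored = [c for c in range_obj["candidates"] if c["pageRank"] != -1]
    return sorted(scored, key=lambda c: c["pageRank"], reverse=True)

words = ["New", "York", "City"]
spaces = ["", " ", " ", "."]
range_obj = {
    "wFrom": 0, "wTo": 1,
    "candidates": [
        {"title": "New York", "pageRank": 0.049533},
        {"title": "New York City", "pageRank": 0.102831},
        {"title": "New York (magazine)", "pageRank": 0.030795},
        {"title": "Some other page", "pageRank": -1},  # hypothetical entry
    ],
}
print(phrase_for_range(words, spaces, range_obj["wFrom"], range_obj["wTo"]))  # New York
print(ranked_candidates(range_obj)[0]["title"])  # New York City
```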

Part-of-speech output

If the partsOfSpeech=true parameter is used to request part-of-speech information, the resulting JSON object will additionally contain four arrays called verbs, nouns, adjectives, and adverbs. Each of these arrays is of the following form:

"verbs": [
    {"iFrom":27, "iTo":32, "normForm":"offer", "synsetIds":["200706557", "200871623", ...]},
    {"iFrom":78, "iTo":80, "normForm":"have", "synsetIds":["200056930", "200065370", ...]},
    ...
]

For each entry, iFrom and iTo are the indices of the first and last character of that verb. The indices refer to the input text as a sequence of Unicode codepoints (i.e. not as a sequence of bytes that is the result of UTF-8 encoding). You can use these indices to recover the surface form of this verb as it appears in the input text. By contrast, normForm is the lemmatized form (e.g. have instead of has). synsetIds is a list of all the Wordnet synsets that contain this verb.

The nouns, adjectives, and adverbs arrays are of the same form. If you use the verbs=true input parameter instead of partsOfSpeech=true, these arrays are not included in the response.

Part-of-speech processing is only supported on English documents.
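
Because iFrom and iTo are inclusive codepoint indices, recovering the surface form in Python is a plain string slice, since Python 3 strings are indexed by codepoint. The sketch below uses a made-up sentence and a made-up entry (not actual Wikifier output); the first character is non-ASCII to show that the indices count codepoints rather than UTF-8 bytes.

```python
def surface_form(text, entry):
    # iTo is the index of the last character, so the slice ends at iTo + 1.
    return text[entry["iFrom"]:entry["iTo"] + 1]

# "\u017d" is a single codepoint that takes two bytes in UTF-8.
text = "\u017diga has two cats."
entry = {"iFrom": 5, "iTo": 7, "normForm": "have"}  # hypothetical entry
print(surface_form(text, entry))  # has
```

In languages whose strings are indexed by bytes or UTF-16 units, the indices would have to be converted first.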

Sample code in Python 3

Note: the following sample uses POST; if your input document is short, you can also use GET instead.

import urllib.parse, urllib.request, json

def CallWikifier(text, lang="en", threshold=0.8):
    # Prepare the form-encoded request parameters.
    data = urllib.parse.urlencode([
        ("text", text), ("lang", lang),
        ("pageRankSqThreshold", "%g" % threshold),
        ("wikiDataClasses", "true"), ("wikiDataClassIds", "false"),
        ("support", "true"), ("ranges", "false"),
        ("includeCosines", "false"), ("maxMentionEntropy", "3")
        ])
    url = "http://www.wikifier.org/annotate-article"
    # Call the Wikifier and read the response.
    req = urllib.request.Request(url, data=data.encode("utf8"), method="POST")
    with urllib.request.urlopen(req, timeout = 60) as f:
        response = f.read()
        response = json.loads(response.decode("utf8"))
    # Output the annotations.
    for annotation in response["annotations"]:
        print("%s (%s)" % (annotation["title"], annotation["url"]))

CallWikifier("Syria's foreign minister has said Damascus is ready " +
    "to offer a prisoner exchange with rebels.")

Additional functions

The Wikifier also supports the following functions that are not directly related to Wikification but may be of interest anyway.

Cosine similarity

This function measures the similarity between the text of two Wikipedia pages. All markup etc. is ignored for the purposes of this comparison, and the pages are represented as feature vectors under the bag-of-words model (also known as the vector space model). To use this function, make an HTTP request of the following form:

http://www.wikifier.org/get-cosine-similarity?lang=...&title1=...&title2=...

As parameters, provide the language code of the Wikipedia that your pages are from, and the titles of both pages. The result will be a small JSON object of the following form:

{
  "nNonzeroComponents1":3769,
  "nNonzeroComponents2":5641,
  "cosBinVec":0.4847149433737823,
  "cosTfVec":0.9702968299849217,
  "cosTfIdfVec":0.8467604323717902
}

This gives the number of nonzero components in each of the two feature vectors (i.e. the number of distinct words on each page) and the cosine similarity measure between them. In fact, three versions of the cosine measure are provided: one for binary feature vectors, one for TF vectors, and one for TF-IDF vectors.
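
To illustrate what the binary and TF variants measure, here is a small local sketch of cosine similarity over bag-of-words vectors. It is not the server's implementation: splitting on whitespace and lowercasing are assumptions made for this example, and the TF-IDF variant is left out because it needs document frequencies from the whole Wikipedia.

```python
import math
from collections import Counter

def tf_vector(text):
    # Term-frequency vector: word -> number of occurrences.
    return Counter(text.lower().split())

def binary_vector(text):
    # Binary vector: word -> 1 if the word occurs at all.
    return {word: 1 for word in tf_vector(text)}

def cosine(u, v):
    # Dot product divided by the product of the Euclidean norms.
    dot = sum(weight * v.get(word, 0) for word, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

a, b = "new york new york city", "new york city"
print(cosine(binary_vector(a), binary_vector(b)))
print(cosine(tf_vector(a), tf_vector(b)))
```

The two texts contain the same distinct words, so the binary score is (up to rounding) 1, while the TF score is lower because the word counts differ.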

Sample code in Python 3

import urllib.parse, urllib.request, json

def CosineSimilarity(lang, title1, title2):
    # Prepare the URL.
    data = urllib.parse.urlencode([("lang", lang),
        ("title1", title1), ("title2", title2)])
    url = "http://www.wikifier.org/get-cosine-similarity?" + data
    # Call the Wikifier and read the response.
    with urllib.request.urlopen(url, timeout = 60) as f:
        response = f.read()
        response = json.loads(response.decode("utf8"))
    # Return the cosine similarity between the TF-IDF vectors.
    return response["cosTfIdfVec"]

print(CosineSimilarity("en", "New York", "New York City"))

Neighbourhood subgraph

For a given language, its Wikipedia can be thought of as a large graph with a vertex corresponding to each page, and with a directed edge (u, v) wherever a hyperlink exists from a page u to another page v. Where such a link exists, we say that u is a predecessor of v, and v is a successor of u.

The following function will return a subgraph of this graph:

http://www.wikifier.org/get-neigh-graph?lang=...&title=...&nPredLevels=...&nSuccLevels=...

The subgraph consists of the following vertices:

As is usual with subgraphs, it also contains all those edges of the original graph where both endpoints of the edge are vertices belonging to the subgraph.

The result will be a JSON object of the following form:

{
    "nVertices": 123, "nEdges": 1234,
    "titles": [...],
    "successors": [[...], [...], ...]
}

For the purposes of representing this subgraph, its vertices are numbered in an arbitrary order from 0 to nVertices − 1. Then titles[k] gives the title of the Wikipedia page that corresponds to vertex k, and successors[k] is an array containing the numbers of the successors of this vertex. This array may be empty if the vertex does not contain any links to other vertices in the subgraph.

To save space, lists of predecessors are not provided explicitly, but you can easily generate them from the lists of successors.
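
One way to generate them is sketched below: walk over every edge (u, v) in the successor lists and record u as a predecessor of v. The function name is an assumption made for this example, and the small graph is made up.

```python
def predecessors_from_successors(successors):
    # One (initially empty) predecessor list per vertex.
    predecessors = [[] for _ in successors]
    for u, succ in enumerate(successors):
        for v in succ:
            predecessors[v].append(u)  # edge u -> v makes u a predecessor of v
    return predecessors

# Made-up graph with edges 0 -> 1, 0 -> 2, 1 -> 2.
successors = [[1, 2], [2], []]
print(predecessors_from_successors(successors))  # [[], [0], [0, 1]]
```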

Sample code in Python 3

import urllib.parse, urllib.request, json

def NeighGraph(lang, title, nPredLevels, nSuccLevels):
    # Prepare the URL.
    data = urllib.parse.urlencode([("lang", lang), ("title", title),
        ("nPredLevels", nPredLevels), ("nSuccLevels", nSuccLevels)])
    url = "http://www.wikifier.org/get-neigh-graph?" + data
    # Call the Wikifier and read the response.
    with urllib.request.urlopen(url, timeout = 60) as f:
        response = f.read()
        response = json.loads(response.decode("utf8"))
    # Print the edges of the graph.
    nVertices = response["nVertices"]
    titles = response["titles"]
    nEdges = 0
    for u in range(nVertices):
        for v in response["successors"][u]:
            print("%s -> %s" % (titles[u], titles[v]))
            nEdges += 1
    assert nEdges == response["nEdges"]

import sys, io
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding = "utf8",
                              errors = "ignore", line_buffering = True)
NeighGraph("sl", "Ljubljana", 0, 2)