Wikifier documentation
Contents
- Wikification (document annotation)
- Sample code (in Python)
- Part-of-speech tagging
- Cosine similarity measure
- Neighbourhood subgraph
- Concept information
If you use our Wikifier in your work, please cite our paper:
- Janez Brank, Gregor Leban, Marko Grobelnik. Annotating Documents with Relevant Wikipedia Concepts. Proceedings of the Slovenian Conference on Data Mining and Data Warehouses (SiKDD 2017), Ljubljana, Slovenia, 9 October 2017.
Wikification
To call the JSI Wikifier, send an HTTP GET request to a URL of the following form:
http://www.wikifier.org/annotate-article?text=...&lang=...&...
Alternatively, you can send a POST request with a Content-Type of application/x-www-form-urlencoded and pass the parameters (everything after the ? character) in the request body.
The server is currently still in development, so it is occasionally down.
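For example, a minimal GET request could look like this in Python 3 (a sketch; the userKey value is a placeholder, and only the required and most basic parameters are shown):

import urllib.parse, urllib.request, json

# A minimal sketch: annotate a short text with only the basic parameters.
# Replace the userKey placeholder with your own key.
params = urllib.parse.urlencode([
    ("userKey", "insert your user key here"),
    ("text", "New York City is the most populous city in the United States."),
    ("lang", "en")])
url = "http://www.wikifier.org/annotate-article?" + params
with urllib.request.urlopen(url, timeout=60) as f:
    response = json.loads(f.read().decode("utf8"))
for annotation in response["annotations"]:
    print(annotation["title"], annotation["url"])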
The following parameters are supported:
- userKey: a 30-character string that uniquely identifies each user (register here to get one). This parameter is required.
- text: the text of the document that you want to annotate. Use UTF-8 and %-encoding for non-ASCII characters (e.g. text=Beyonc%C3%A9).
- lang: the ISO-639 code of the language of the document. Both 2- and 3-letter codes are supported (e.g. en or eng for English, sl or slv for Slovenian, etc.). See also: list of all the languages currently supported by the JSI Wikifier. You can also use lang=auto to autodetect the language (using CLD2). In this case, the resulting JSON object will also contain a value named languageAutodetectDetails with more information about the results of the autodetection. Note that some of the languages supported by the Wikifier cannot be autodetected by CLD2 (and vice versa).
- secondaryAnnotLanguage: for each annotation, the Wikifier can report, in addition to its name and link in the Wikipedia for the language of the input document, the name and link of the corresponding page in the Wikipedia for a "secondary" language; the secondaryAnnotLanguage parameter specifies the code of this secondary language (default: en, i.e. English).
- wikiDataClasses: should be true or false; determines whether to include, for each annotation, a list of WikiData (concept ID, concept name) pairs for all classes to which this concept belongs (directly or indirectly).
- wikiDataClassIds: like wikiDataClasses, but generates a list of concept IDs only (which makes the resulting JSON output shorter).
- support: should be true or false; determines whether to include, for each annotation, a list of subranges in the input document that support this particular annotation.
- ranges: should be true or false; determines whether to include, for each subrange in the document that looks like a possible mention of a concept, a list of all candidate annotations for that subrange. This will significantly increase the size of the resulting JSON output, so it should only be used if there is a strong need for this data.
- includeCosines: should be true or false; determines whether to include, for each annotation, the cosine similarity between the input document and the Wikipedia page corresponding to that annotation. Currently the cosine similarities are provided for informational purposes only and are not used in choosing the annotations, so you should set this to false to conserve some CPU time if you don't need the cosines for your application.
- maxMentionEntropy: set this to a real number x to ignore all highly ambiguous mentions (i.e. they will contribute no candidate annotations to the process). The heuristic used is to ignore mentions where H(link target | anchor text = this mention) > x. (Default value: −1, which disables this heuristic.)
- maxTargetsPerMention: set this to an integer x to use only the x most frequent candidate annotations for each mention (default value: 20). Note that some mentions appear as the anchor text of links to many different Wikipedia pages, so disabling this heuristic (by setting x = −1) can increase the number of candidate annotations significantly and make the annotation process slower.
- minLinkFrequency: if a link with a particular combination of anchor text and target occurs in very few Wikipedia pages (fewer than the value of minLinkFrequency), this link is completely ignored and the target page is not considered as a candidate annotation for the phrase that matches the anchor text of this link. (Default value: 1, which effectively disables this heuristic.)
- pageRankSqThreshold: set this to a real number x to calculate a threshold for pruning the annotations on the basis of their pagerank score. The Wikifier will compute the sum of the squared pageranks of all the annotations (call it S), sort the annotations by decreasing pagerank, and calculate a threshold such that keeping only the annotations whose pagerank exceeds this threshold brings the sum of their squared pageranks to S · x. Thus, a lower x results in a higher threshold and fewer annotations; see the illustrative sketch after this list. (Default value: −1, which disables this mechanism.) The resulting threshold is reported in the minPageRank field of the JSON result object. If you want the Wikifier to actually discard the annotations whose pagerank is < minPageRank instead of including them in the JSON result object, set the applyPageRankSqThreshold parameter to true (its default value is false).
- partsOfSpeech: should be true or false; determines whether to include information about parts of speech (nouns, verbs, adjectives, adverbs) and their corresponding WordNet synsets. This feature is only supported for English documents; the Brill tagger is used for POS tagging. (Default value: false.)
- verbs: like partsOfSpeech, but only reports verbs.
- nTopDfValuesToIgnore: if a phrase consists entirely of very frequent words, it will be ignored and will not generate any candidate annotations. A word is considered frequent for this purpose if it is one of the nTopDfValuesToIgnore most frequent words (in terms of document frequency) in the Wikipedia of the corresponding language. (Default value: 0, which disables this heuristic. If you want to use this heuristic, we recommend a value of 200 as a good starting point.)
- nWordsToIgnoreFromList: works like nTopDfValuesToIgnore, except that instead of taking the most frequent words from all the words, it uses only the most frequent words from a list provided to the Wikifier at startup. Currently we have such lists for about half of the languages, obtained by taking the most frequent words and manually removing some that seemed actually useful and shouldn't be ignored, e.g. place names. If both this parameter and nTopDfValuesToIgnore are provided, the Wikifier will use the manual ignore list if it is available for the language of the input document; otherwise it will fall back on the behaviour specified by nTopDfValuesToIgnore. (Default value: −1, which disables this heuristic. If you want to use this heuristic, we recommend passing 200 both for this parameter and for nTopDfValuesToIgnore.)
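The pageRankSqThreshold mechanism can be illustrated with the following client-side sketch. This is only our reading of the description above, not the server's actual implementation, and the exact treatment of ties at the threshold may differ:

def pagerank_sq_threshold(pageranks, x):
    # Illustrative sketch only: compute a pruning threshold from a list of
    # annotation pageranks, following the description of pageRankSqThreshold.
    prs = sorted(pageranks, reverse=True)
    total = sum(p * p for p in prs)        # S: sum of squared pageranks
    target = x * total                     # keep annotations until their squared sum reaches S * x
    cumulative, threshold = 0.0, 0.0
    for p in prs:
        if cumulative >= target:
            threshold = p                  # annotations with pagerank <= threshold would be pruned
            break
        cumulative += p * p
    return threshold

# A lower x gives a higher threshold and therefore fewer annotations:
print(pagerank_sq_threshold([0.4, 0.3, 0.2, 0.1], 0.5))   # 0.3
print(pagerank_sq_threshold([0.4, 0.3, 0.2, 0.1], 0.9))   # 0.1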
Output format
The Wikifier returns a JSON response of the following form:
{ "annotations": [ ... ], "spaces":["", " ", " ", "."], "words":["New", "York", "City"], "ranges": [ ... ] }
The spaces
and words
arrays show how the input document has been split into words.
It is always the case that spaces
has exactly 1 more element than words
and that
concatenating spaces[0] + words[0] + spaces[1] + words[1] + ... + spaces[N-1] + words[N-1] + spaces[N]
(where N
is the length of words
) is exactly equal to the input document (the one that
was passed as the &text=...
parameter).
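This invariant can be checked programmatically; a small sketch, assuming response is the JSON object parsed from the Wikifier's output:

def reconstruct(response):
    # Interleave spaces and words; the result equals the original input document.
    spaces, words = response["spaces"], response["words"]
    assert len(spaces) == len(words) + 1
    parts = []
    for i, w in enumerate(words):
        parts.append(spaces[i])
        parts.append(w)
    parts.append(spaces[-1])
    return "".join(parts)

# With the values from the example above:
print(reconstruct({"spaces": ["", " ", " ", "."], "words": ["New", "York", "City"]}))
# -> New York City.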
annotations
is an array of objects of the following form:
{ "title":"New York City", "url":"http:\/\/en.wikipedia.org\/wiki\/New_York_City", "lang":"en", "pageRank":0.102831, "cosine":0.662925, "secLang": "en", "secTitle":"New York City", "secUrl":"http:\/\/en.wikipedia.org\/wiki\/New_York_City", "wikiDataClasses": [ {"itemId":"Q515", "enLabel":"city"}, {"itemId":"Q1549591", "enLabel":"big city"}, ... ], "wikiDataClassIds": ["Q515", "Q1549591", ...], "dbPediaTypes":["City", "Settlement", "PopulatedPlace", ...], "dbPediaIri":"http:\/\/dbpedia.org\/resource\/New_York_City", "supportLen":2.000000, "support": [ {"wFrom":0, "wTo":1, "chFrom": 0, "chTo": 7, "pMentionGivenSurface":0.122591, "pageRank":0.018634}, {"wFrom":0, "wTo":2, "chFrom": 0, "chTo": 12, pMentionGivenSurface":0.483354, "pageRank":0.073469} ] }
- url is the URL of the Wikipedia page corresponding to this annotation, and title is its title;
- lang is the language code of the Wikipedia from which this annotation is taken (currently this is always the language of the input document);
- secUrl and secTitle refer to the equivalent page of the Wikipedia in the language secLang (which is the same language that was passed as the secondaryAnnotLanguage input parameter). (This is more useful when lang != secLang.)
- wikiDataClasses and wikiDataClassIds are lists of the classes to which this concept belongs according to WikiData (using the instanceOf property, and then all their ancestors that can be reached with the subclassOf property);
- dbPediaIri is (one of) the DBPedia IRIs corresponding to this annotation, and dbPediaTypes are the types to which this DBPedia IRI is connected via the http://www.w3.org/1999/02/22-rdf-syntax-ns#type property;
- support is an array of all the subranges in the document that support this particular annotation (see the sketch after this list); for each such subrange, wFrom and wTo are the indices (into words) of the first and last word of the subrange; chFrom and chTo are the indices (into the input document) of the first and last character of the subrange; pageRank is the pagerank of this subrange (not necessarily a very useful value for the user), and pMentionGivenSurface is the probability that, when a link appears in the Wikipedia with this particular subrange as its anchor text, it points to the Wikipedia page corresponding to the current annotation.
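For example, the following sketch prints the surface form of every supporting mention, assuming response is the parsed JSON object (from a request with support=true) and text is the original input document; note that chFrom and chTo are inclusive indices:

def print_support(response, text):
    # For each annotation, show the surface form of every supporting mention.
    for annotation in response["annotations"]:
        for mention in annotation.get("support", []):
            surface = text[mention["chFrom"] : mention["chTo"] + 1]
            print("%s  <-  %r (pMentionGivenSurface = %.4f)" %
                  (annotation["title"], surface, mention["pMentionGivenSurface"]))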
ranges
is an array of objects of the following form:
{ "wFrom": 0, "wTo": 1, "pageRank":0.018634, "pMentionGivenSurface":0.122591, "candidates": [ {"title":"New York", "url":"http:\/\/en.wikipedia.org\/wiki\/New_York", "cosine":0.578839, "linkCount":63626, "pageRank":0.049533}, {"title":"New York City", "url":"http:\/\/en.wikipedia.org\/wiki\/New_York_City", "cosine":0.662925, "linkCount":11589, "pageRank":0.102831}, {"title":"New York (magazine)", "url":"http:\/\/en.wikipedia.org\/wiki\/New_York_(magazine)", "cosine":0.431092, "linkCount":2159, "pageRank":0.030795}, ... ] }
The first four members are the same as in support
; in this particular example,
we have wFrom
= 0 and wTo
= 1, so this object refers to the phrase "New York".
The candidates
array is a list of all the pages in the Wikipedia that are pointed to
by links (from other pages in the Wikipedia) whose anchor text is the same as this phrase; for each
such page, we have an object giving its title, Wikipedia URL, cosine similarity with the input document,
number of links with this anchor text pointing to this particular page, and the pagerank score of this
candidate annotation. For phrases that generate too many candidates, some of these candidates might
not participate in the pagerank computation; in that case pageRank is shown as -1 instead.
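If the request was made with ranges=true, the candidates can be inspected with a sketch like the following (response is the parsed JSON object):

def print_candidates(response):
    # For each range, print its surface form and its candidate annotations,
    # sorted by decreasing pagerank.
    words, spaces = response["words"], response["spaces"]
    for rng in response.get("ranges", []):
        # Reconstruct the surface form; spaces[i + 1] separates words[i] from words[i + 1].
        parts = []
        for i in range(rng["wFrom"], rng["wTo"] + 1):
            parts.append(words[i])
            if i < rng["wTo"]:
                parts.append(spaces[i + 1])
        print("Mention:", "".join(parts))
        for cand in sorted(rng["candidates"], key=lambda c: c["pageRank"], reverse=True):
            print("  %s (pageRank %.4f, linkCount %d)" %
                  (cand["title"], cand["pageRank"], cand["linkCount"]))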
Part-of-speech output
If the partsOfSpeech=true
parameter is used to request part-of-speech information,
the resulting JSON object will additionally contain four arrays called verbs
,
nouns
, adjectives
, and adverbs
. Each of these arrays is
of the following form:
"verbs": [ {"iFrom":27, "iTo":32, "normForm":"offer", "synsetIds":["200706557", "200871623", ...]}, {"iFrom":78, "iTo":80, "normForm":"have", "synsetIds":["200056930", "200065370", ...]}, ... ]
For each entry, iFrom
and iTo
are the indices of the first and last
character of that verb. The indices refer to the input text as a
sequence of Unicode codepoints (i.e. not as a sequence of bytes that is the result of UTF-8 encoding).
You can use these indices to recover the surface form of this verb as it appears in the input text.
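For instance (a sketch, assuming response is the parsed JSON object from a request with partsOfSpeech=true and text is the original input string):

def print_verbs(response, text):
    # iFrom/iTo are inclusive codepoint indices, so slicing with iTo + 1
    # recovers the surface form of each verb as it appears in the input text.
    for verb in response.get("verbs", []):
        surface = text[verb["iFrom"] : verb["iTo"] + 1]
        print("%s (normForm: %s, %d synsets)" %
              (surface, verb["normForm"], len(verb["synsetIds"])))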
By contrast, normForm is the lemmatized form (e.g. have instead of has), and synsetIds is a list of all the WordNet synsets that contain this verb.
The nouns, adjectives, and adverbs arrays have the same form. If you use the verbs=true input parameter instead of partsOfSpeech=true, only the verbs array is included in the response and the other three are omitted.
Part-of-speech processing is only supported on English documents.
Sample code in Python 3
Note: the following sample uses POST; if your input document is short, you can also use GET instead.
import urllib.parse, urllib.request, json

def CallWikifier(text, lang="en", threshold=0.8):
    # Prepare the URL.
    data = urllib.parse.urlencode([
        ("text", text), ("lang", lang),
        ("userKey", "insert your user key here"),
        ("pageRankSqThreshold", "%g" % threshold),
        ("applyPageRankSqThreshold", "true"),
        ("nTopDfValuesToIgnore", "200"),
        ("nWordsToIgnoreFromList", "200"),
        ("wikiDataClasses", "true"),
        ("wikiDataClassIds", "false"),
        ("support", "true"),
        ("ranges", "false"),
        ("minLinkFrequency", "2"),
        ("includeCosines", "false"),
        ("maxMentionEntropy", "3")
        ])
    url = "http://www.wikifier.org/annotate-article"
    # Call the Wikifier and read the response.
    req = urllib.request.Request(url, data=data.encode("utf8"), method="POST")
    with urllib.request.urlopen(req, timeout=60) as f:
        response = f.read()
        response = json.loads(response.decode("utf8"))
    # Output the annotations.
    for annotation in response["annotations"]:
        print("%s (%s)" % (annotation["title"], annotation["url"]))

CallWikifier("Syria's foreign minister has said Damascus is ready " +
             "to offer a prisoner exchange with rebels.")
Additional functions
The Wikifier also supports the following functions that are not directly related to Wikification but may be of interest anyway.
Cosine similarity
This function measures the similarity between the text of two Wikipedia pages. All markup etc. is ignored for the purposes of this comparison, and the pages are represented as feature vectors under the bag-of-words model (also known as the vector space model). To use this function, make an HTTP request of the following form:
http://www.wikifier.org/get-cosine-similarity?lang=...&title1=...&title2=...
As parameters, provide the language code of the Wikipedia that your pages are from, and the titles of both pages. The result will be a small JSON object of the following form:
{ "nNonzeroComponents1":3769, "nNonzeroComponents2":5641, "cosBinVec":0.4847149433737823, "cosTfVec":0.9702968299849217, "cosTfIdfVec":0.8467604323717902 }
This gives the number of nonzero components in each of the two feature vectors (i.e. the number of distinct words on each page) and the cosine similarity measure between them. In fact three versions of the cosine measure are provided: one for binary feature vectors, one for TF vectors and one for TF-IDF vectors.
Sample code in Python 3
import urllib.parse, urllib.request, json

def CosineSimilarity(lang, title1, title2):
    # Prepare the URL.
    data = urllib.parse.urlencode([("lang", lang), ("title1", title1), ("title2", title2)])
    url = "http://www.wikifier.org/get-cosine-similarity?" + data
    # Call the Wikifier and read the response.
    with urllib.request.urlopen(url, timeout=60) as f:
        response = f.read()
        response = json.loads(response.decode("utf8"))
    # Return the cosine similarity between the TF-IDF vectors.
    return response["cosTfIdfVec"]

print(CosineSimilarity("en", "New York", "New York City"))
Neighbourhood subgraph
For a given language, its Wikipedia can be thought of as a large graph with a vertex corresponding to each page, and with a directed edge (u, v) wherever a hyperlink exists from a page u to another page v. Where such a link exists, we say that u is a predecessor of v, and v is a successor of u.
The following function will return a subgraph of this graph:
http://www.wikifier.org/get-neigh-graph?lang=...&title=...&nPredLevels=...&nSuccLevels=...
The subgraph consists of the following vertices:
- The vertex t representing the page whose title is specified by the value of the title parameter.
- Any vertex u such that a directed path, at most nPredLevels edges long, exists from u to t.
- Any vertex v such that a directed path, at most nSuccLevels edges long, exists from t to v.
As is usual with subgraphs, it also contains all those edges of the original graph where both endpoints of the edge are vertices belonging to the subgraph.
The result will be a JSON object of the following form:
{ "nVertices": 123, "nEdges": 1234, "titles": [...], "successors": [[...], [...], ...] }
For the purposes of representing this subgraph, its vertices are numbered
in an arbitrary order from 0 to nVertices − 1. Then titles[k] gives the title of the Wikipedia page that corresponds to vertex k, and successors[k] is an array containing the numbers of the successors of this vertex. This array may be empty if the vertex does not contain any links to other vertices in the subgraph.
To save space, lists of predecessors are not provided explicitly, but you can easily generate them from the lists of successors.
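A minimal sketch of this inversion, assuming response is the parsed JSON object returned by get-neigh-graph:

def predecessors_from_successors(response):
    # Invert the successor lists: if v appears in successors[u], then u is a predecessor of v.
    preds = [[] for _ in range(response["nVertices"])]
    for u, succs in enumerate(response["successors"]):
        for v in succs:
            preds[v].append(u)
    return preds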
Sample code in Python 3
import sys, io, urllib.parse, urllib.request, json

def NeighGraph(lang, title, nPredLevels, nSuccLevels):
    # Prepare the URL.
    data = urllib.parse.urlencode([("lang", lang), ("title", title),
        ("nPredLevels", nPredLevels), ("nSuccLevels", nSuccLevels)])
    url = "http://www.wikifier.org/get-neigh-graph?" + data
    # Call the Wikifier and read the response.
    with urllib.request.urlopen(url, timeout=60) as f:
        response = f.read()
        response = json.loads(response.decode("utf8"))
    # Print the edges of the graph.
    nVertices = response["nVertices"]
    titles = response["titles"]
    nEdges = 0
    for u in range(nVertices):
        for v in response["successors"][u]:
            print("%s -> %s" % (titles[u], titles[v]))
            nEdges += 1
    assert nEdges == response["nEdges"]

sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding="utf8",
    errors="ignore", line_buffering=True)
NeighGraph("sl", "Ljubljana", 0, 2)
Concept information
This function returns a small JSON object with information about a given concept. This is a subset of information that appears in an annotation object when annotating a document. The request should be of the form:
http://www.wikifier.org/concept-info?lang=...&title=...&secLang=...
The result will be of the following form (this example is for lang=en, title=Rome and secLang=it):
{ "title":"Rome", "url":"http:\/\/en.wikipedia.org\/wiki\/Rome", "lang":"en", "secLang":"it", "secTitle":"Roma", "secUrl":"http:\/\/it.wikipedia.org\/wiki\/Roma", "wikiDataItemId": "Q220", "wikiDataClasses": [{"itemId":"Q5119", "enLabel":"capital"}, ...., ], "wikiDataClassIds": ["Q5119", ...], "dbPediaTypes": ["City", "Settlement", "PopulatedPlace", ....], "dbPediaIri": "http:\/\/dbpedia.org\/resource\/Rome"} }