How we built an automated glossary for Namibian legislation

02 September 2019, by Greg Kempe

Legislation often defines terms that have a specific meaning. For instance, Namibia’s Criminal Procedure Act (Act 25 of 2004) defines “charge” as “an indictment, charge sheet, summons or written notice”. These definitions are crucial for the correct interpretation of legislation. It is interesting to explore which Acts define which terms and how those definitions change over time.

Using the Namibian statutes in the Laws.Africa legislation commons, we’ve created a glossary of more than 3000 defined terms and definitions.

The glossary is updated automatically and doesn’t require any human editors, thanks to the machine-friendly legislation in the Laws.Africa commons. We use machine learning to group together similar definitions to make the glossary simpler to work with.

As of the date of this post, there are 3086 terms defined in 297 Acts. The bulk (79%) of the terms are defined in only one Act, 14% in two or three Acts, and the remaining 7% of terms are defined in four or more Acts.

The curious case of “minister”

Some terms are defined in many different Acts with widely varying definitions. This is the case with “minister”.

The term “minister” is defined in 220 Acts. This is a common term since many Acts legislate how the government minister responsible for a particular area (such as Finance or Health) should execute their duties and obligations. The definition in each Act will depend on the subject area being legislated.

For example, here are the definitions of “minister” that relate to Health and Social Services:

“Minister” means the Minister of Health and Social Services;

“Minister” means the Minister responsible for Social Services;

“Minister” means the Minister responsible for Health and Social Services;

The glossary has detected that these definitions are all related to Health and Social Services, and has grouped them together. It also groups together definitions that are identical.

Here are the definitions related to Agriculture:

“Minister” means the Minister responsible for agriculture;

“Minister” means the Minister of Agriculture;

In all, there are 79 groups of related definitions of “minister”.

It’s interesting to notice a trend in the definition of “minister”. Before the 1990s, Acts almost always use the wording “the Minister of X”. From the late 1990s onwards, however, most Acts use the new wording “the Minister responsible for X”.

How old are “youth”?

It’s also interesting to discover definitions that are only slightly different. One might assume that the definition of “youth” would be consistent across the legislation. However, these two Acts define “youth” slightly differently:

“youth” means a young person aged from 16 to 35 years old.

“youth” means an individual aged between 16 and 30 years.

Similarly, the definition of a “minor” for gambling-related purposes is someone under the age of 21, whereas for witness protection purposes it is someone under the age of 18.

“minor” means a person who has not attained the age of 21;

“minor” means a person who is below the age of 18 years;

The glossary is a useful research tool, and there are plenty of other interesting oddities to be found when exploring legislation through the lens of defined terms.

How we built the Glossary

The glossary is built and maintained automatically. As we add new Acts and amend existing ones on Laws.Africa, the platform automatically identifies defined terms, extracts their definitions, groups similar definitions together, and updates the glossary. So how does it do this?

Identifying definitions

At Laws.Africa we mark up legislation using Akoma Ntoso XML. Our platform searches for a definition by looking for a phrase such as ‘“X” means…’ and then marks that up using the Akoma Ntoso <def> and <term> tags.

<p refersTo="#term-day">“<def refersTo="#term-day">day</def>” means the space of time between sunrise and sunset;
</p>
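As a rough illustration of the detection step, here is a minimal sketch that spots a ‘“X” means…’ phrase with a regular expression. The pattern and the function name `find_defined_term` are assumptions made for this example, not the platform's actual code, which likely handles more phrasings.

```python
import re

# Hypothetical pattern: an opening quote, the term, a closing quote, then "means".
# Accepts both curly and straight double quotes.
DEFN_RE = re.compile(r'^\s*[“"](?P<term>[^“”"]+)[”"]\s+means\b')


def find_defined_term(paragraph):
    """Return the defined term if the paragraph looks like a definition, else None."""
    match = DEFN_RE.match(paragraph)
    return match.group('term') if match else None


print(find_defined_term('“day” means the space of time between sunrise and sunset;'))
print(find_defined_term('The Minister may make regulations.'))
```

The first call prints the term `day`; the second prints `None` because the sentence does not open with a quoted term followed by “means”.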

It’s then straightforward to go through all Acts and extract the <def> elements.
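A minimal sketch of that extraction, using only Python's standard library. The sample XML is a hand-written fragment, and the helper names `local_name` and `extract_definitions` are invented for this example. Matching on the local tag name is an assumption that sidesteps the question of which Akoma Ntoso namespace a given document uses.

```python
import xml.etree.ElementTree as ET

# A hand-written fragment for illustration only; real Akoma Ntoso documents
# are namespaced and much larger.
SAMPLE = '''
<body>
  <p refersTo="#term-day">“<def refersTo="#term-day">day</def>” means the space of time between sunrise and sunset;</p>
  <p>This paragraph defines nothing.</p>
</body>
'''


def local_name(tag):
    """Drop any '{namespace}' prefix so the lookup works with or without a namespace."""
    return tag.rsplit('}', 1)[-1]


def extract_definitions(root):
    """Return (term, definition text) pairs, one for each <def> element found."""
    found = []
    for parent in root.iter():
        for child in parent:
            if local_name(child.tag) == 'def':
                term = ''.join(child.itertext()).strip()
                # the definition is the full text of the element holding the <def>
                found.append((term, ''.join(parent.itertext()).strip()))
    return found


for term, text in extract_definitions(ET.fromstring(SAMPLE)):
    print(term, '->', text)
```

Taking the text of the parent element as the definition is a simplification; it works for single-paragraph definitions like the one above.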

Grouping similar definitions

Once we have the terms and their definitions, we can cluster similar definitions together using some simple machine learning.

First, we extract the text from the definitions, strip punctuation, convert the text to lowercase, and replace numbers with a placeholder.

import re
import sys
import unicodedata

# matches runs of digits
num_re = re.compile(r'[0-9]+')

# translation table that deletes every unicode punctuation character (built once)
# https://stackoverflow.com/questions/11066400/remove-punctuation-from-unicode-formatted-strings
punct_table = dict.fromkeys(
    i for i in range(sys.maxunicode + 1) if unicodedata.category(chr(i)).startswith('P'))


def remove_punctuation(text):
    """ Strip punctuation in unicode.
    """
    return text.translate(punct_table)


def defn_text(element):
    """ Extract plain text (lowercase, without punctuation, numbers replaced by N) from a definition XML element.
    """
    # join text elements with spaces, strip punctuation, and convert to lowercase
    text = remove_punctuation(' '.join(element.itertext())).lower()
    # replace numbers with N
    text = num_re.sub('N', text)
    return text


# map a list of definition elements (extracted earlier) to plain text
texts = [defn_text(e) for e in definitions]

Now, for each term, we need to determine which definition texts are similar. We do this by vectorising the text and calculating the cosine similarity between the vectors. Subtracting the similarity from 1 gives what is effectively the “distance” between every pair of definitions.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(texts)
distances = 1 - cosine_similarity(tfidf)
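To make the idea of a cosine “distance” concrete, here is a small standard-library sketch. It is a simplification: it uses raw word counts as the vectors rather than TF-IDF weights, so the numbers will differ from what scikit-learn produces. The function name `cosine_distance` and the sample texts are assumptions for this example.

```python
import math
from collections import Counter


def cosine_distance(text_a, text_b):
    """1 - cosine similarity of two texts, using raw word counts as the vectors."""
    a, b = Counter(text_a.split()), Counter(text_b.split())
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(n * n for n in a.values()))
    norm_b = math.sqrt(sum(n * n for n in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat an empty text as unrelated to everything
    return 1 - dot / (norm_a * norm_b)


# definitions after punctuation stripping and lowercasing
health_1 = 'minister means the minister of health and social services'
health_2 = 'minister means the minister responsible for health and social services'
farming = 'minister means the minister responsible for agriculture'

print(round(cosine_distance(health_1, health_2), 3))
print(round(cosine_distance(health_1, farming), 3))
```

The two Health definitions come out closer to each other than either is to the agriculture definition, which is the property the clustering step relies on.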

Finally, we use agglomerative clustering to group the definitions based on these distances. This gives us a list of cluster labels, one for each definition.

from sklearn.cluster import AgglomerativeClustering

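# note: scikit-learn 1.2 renamed `affinity` to `metric`;
# older versions need affinity='precomputed' instead
clustering = AgglomerativeClustering(
    n_clusters=None, compute_full_tree=True, distance_threshold=0.3,
    metric='precomputed', linkage='complete').fit(distances)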
labels = clustering.labels_
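For readers unfamiliar with complete-linkage clustering, the following is a naive pure-Python sketch of the idea: keep merging the two closest clusters until no pair is closer than the threshold. It is for illustration only. The function name `complete_linkage_clusters` and the distance matrix are made up, and scikit-learn's implementation is far more efficient and may number the clusters differently.

```python
def complete_linkage_clusters(distances, threshold):
    """Naive agglomerative clustering on a square distance matrix.

    Repeatedly merges the two closest clusters, where the distance between
    clusters is the largest distance between any pair of their members
    (complete linkage), and stops once no pair is closer than the threshold.
    Returns one cluster label per item.
    """
    clusters = [[i] for i in range(len(distances))]

    def linkage(c1, c2):
        return max(distances[i][j] for i in c1 for j in c2)

    while len(clusters) > 1:
        # find the closest pair of clusters
        dist, a, b = min(
            (linkage(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters)))
        if dist >= threshold:
            break
        # merge cluster b into cluster a
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]

    labels = [0] * len(distances)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


# Made-up distances: items 0 and 1 are close, item 2 is far from both.
example = [
    [0.0, 0.1, 0.8],
    [0.1, 0.0, 0.7],
    [0.8, 0.7, 0.0],
]
print(complete_linkage_clusters(example, threshold=0.3))  # [0, 0, 1]
```

With complete linkage, every pair of definitions inside a cluster is closer than the threshold, so a single loosely related definition cannot chain unrelated groups together.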

We then use these cluster labels to show related definitions in the glossary.
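One way to turn a list of cluster labels into groups of definitions is sketched below. The helper name `group_by_label` and the hand-picked labels are assumptions for this example; the definitions are taken from the “minister” examples earlier in the post.

```python
from collections import defaultdict


def group_by_label(items, labels):
    """Collect items that share a cluster label, preserving their original order."""
    groups = defaultdict(list)
    for item, label in zip(items, labels):
        groups[label].append(item)
    return list(groups.values())


# Illustrative input: the labels here are made up by hand, not computed.
definitions = [
    '“Minister” means the Minister of Health and Social Services;',
    '“Minister” means the Minister responsible for agriculture;',
    '“Minister” means the Minister responsible for Health and Social Services;',
    '“Minister” means the Minister of Agriculture;',
]
labels = [0, 1, 0, 1]

for group in group_by_label(definitions, labels):
    print(group)
```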

The power of Legislation as Data

Treating legislation as machine-friendly data makes it really simple to build and maintain this glossary automatically, something that would have taken weeks or months of research by humans. This is just a small example of what’s possible with the machine-friendly Laws.Africa legislation commons. It simplifies previously time-consuming tasks and opens up a range of possibilities.

You can explore an automated glossary for all countries and places we have in the Laws.Africa commons. Choose a country, click “Insights” and then click “Glossary”. For example, check out the glossary for Namibian national Acts or the glossary for the City of Cape Town’s by-laws.

What will you build with machine-friendly legislation?
