MuLCAM

Multilingual Corpus for Online Safety and Moderation in Nepal

MuLCAM is Nepal's first open, context-aware linguistic corpus focused on Technology-Facilitated Gender-Based Violence (TFGBV). It documents how harm and abuse appear in Nepali digital spaces across languages, scripts, and cultural contexts, and makes this data openly available for research, moderation, and public-interest technology.

The Problem

Online gender-based violence is rising in Nepal. Complaints have increased sharply in recent years, and women make up a significant proportion of survivors. According to the Cyber Bureau of Nepal Police, 19,730 cybercrime complaints were filed in fiscal year (FY) 2023/24 alone, of which 8,745 related to violence against women. Cases ranged from harassment and impersonation to blackmail and the non-consensual sharing of intimate images. Complaints involving girls numbered 382, and 767 involved individuals from gender and sexual minority groups.

Why Existing Moderation Systems Fail

Nepali digital communication is multilingual, contextual, and often informal. Content in these digital spaces is frequently written in Romanized Nepali and relies on slang, coded language, grawlix (strings of symbols standing in for offensive words), and other forms of obfuscation. Existing moderation systems, which have largely been trained on global or English-language datasets, fail to recognize these patterns.

We have grouped the causes of this failure into four key problem areas:

Invisible Harm

Many TFGBV incidents go unnoticed and unreported across languages and communities. Without context-aware, multilingual moderation systems, instances of online harm remain undetected and unaddressed. This invisibility perpetuates harm and prevents accountability.

Multilingual Moderation Gaps

No multilingual Nepal-based dataset of slurs and harassment language exists. Platforms cite multilingual gaps as a barrier to effective moderation. Most systems are trained on English or global datasets and fail to recognize patterns in Nepali digital communication, including Romanized Nepali, slang, dialects, and coded language.

Tech Developer Gap

Tech developers often lack gendered and intersectional understanding when building moderation systems. Without diverse perspectives and lived experiences informing technical design, tools fail to recognize harm in its actual context and may perpetuate bias.

Evidence Unavailability

A lack of evidence is a major reason for inaction, slow investigations, and delayed justice. Without comprehensive, context-aware datasets documenting how harm appears in Nepali digital spaces, it becomes difficult to hold perpetrators accountable, demonstrate patterns to platform moderators, and build evidence for policy advocacy.

MuLCAM as a Solution

MuLCAM addresses these gaps by building a locally grounded, multilingual corpus that reflects how harm actually appears in Nepali digital spaces.

Our Approach

We approach these problems through corpus development, tool development, and research.

Corpus

Open, Living Multilingual Corpus

An evolving, feedback-driven dataset of slurs, derogatory terms, and contextual harassment across languages.

Grounded in lived experiences and feminist review practices
Forms the foundation for detection, monitoring, and research on TFGBV
Continuously updated with community contributions and feedback
Supports multiple languages and scripts including Devanagari and Romanized Nepali
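
As a rough illustration of what supporting multiple scripts can involve, the sketch below tags a piece of text by the script it is written in, using the Unicode Devanagari block (U+0900 to U+097F). The function name and the labels it returns are assumptions made for this example and do not describe MuLCAM's actual tooling. Note that a "latin" result cannot distinguish Romanized Nepali from English; that distinction needs language-level analysis.

```python
def detect_script(text: str) -> str:
    """Classify text as 'devanagari', 'latin', 'mixed', or 'unknown' (hypothetical labels)."""
    # Count characters in the Unicode Devanagari block.
    devanagari = sum(1 for ch in text if "\u0900" <= ch <= "\u097f")
    # Count ASCII letters, which covers Romanized Nepali as well as English.
    latin = sum(1 for ch in text if ch.isascii() and ch.isalpha())

    if devanagari and latin:
        return "mixed"
    if devanagari:
        return "devanagari"
    if latin:
        return "latin"
    return "unknown"


if __name__ == "__main__":
    for sample in ["नमस्ते", "namaste", "namaste नमस्ते"]:
        print(sample, "->", detect_script(sample))
```

A script tag like this is only a first routing step, for example to decide which normalization or lexicon lookup to apply next.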

Detection & Moderation Tools

Ethical Detection & Moderation Tools (Ongoing)

Tools designed to detect, monitor, and flag gender-based harmful language—without stripping users of agency.

Built as plug-ins, APIs, and security-layer integrations
Enable user-controlled reporting, evidence recording, and handling
Context-aware detection that understands meaning, not just keywords
Respect user privacy and agency while providing safety mechanisms
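
To show why exact keyword matching misses obfuscated spellings, the following minimal sketch compares a naive word lookup with a lookup that first normalizes the text. Everything in it is an assumption for illustration: the substitution table, the function names, and the neutral placeholder term "badword" are not taken from MuLCAM. Normalization of this kind is still keyword-based and does not by itself provide context awareness.

```python
import re
import unicodedata

# Hypothetical substitution table: symbols and digits often swapped in for letters.
SUBSTITUTIONS = str.maketrans({
    "@": "a", "4": "a", "3": "e", "1": "i", "0": "o", "$": "s", "5": "s",
})


def normalize(text: str) -> str:
    """Reduce simple obfuscation so that disguised spellings compare equal."""
    text = unicodedata.normalize("NFKC", text).lower()
    text = text.translate(SUBSTITUTIONS)
    # Keep Latin letters, Devanagari characters, and whitespace; drop the rest.
    text = re.sub(r"[^a-z\u0900-\u097f\s]", "", text)
    # Collapse runs of three or more identical characters ("baaaad" -> "bad").
    return re.sub(r"(.)\1{2,}", r"\1", text)


def naive_match(text: str, lexicon: set) -> bool:
    """Exact lookup of whitespace-separated tokens, with no normalization."""
    return any(token in lexicon for token in text.lower().split())


def normalized_match(text: str, lexicon: set) -> bool:
    """Same lookup, applied after normalizing the text."""
    return any(token in lexicon for token in normalize(text).split())


if __name__ == "__main__":
    lexicon = {"badword"}  # neutral placeholder, not a real corpus entry
    message = "you are a b@dw0rd"
    print(naive_match(message, lexicon))       # the disguised spelling is missed
    print(normalized_match(message, lexicon))  # the disguised spelling is found
```

This approach has clear limits: fully masked letters (for example "b*dword") cannot be recovered, and a matched term may still be harmless in context, which is why annotated examples of both harmful and non-harmful usage matter.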

Overall Approach Architecture

Online Content → Annotation & Lexicon → MuLCAM Corpus → Research, Moderation, Reporting, Policy

Aligned with the Feminist Principles of the Internet

MuLCAM translates feminist values into technical and governance choices.

Safety across languages and contexts

Access

Ensuring all communities can access and benefit from safe digital spaces across languages and cultural contexts.

Governance, accountability, and justice by design

Rights

Promoting governance, accountability, and justice by design in safety systems and moderation tools.

Open knowledge over platform extraction

Economy & Openness

Promoting open knowledge and public-interest technology over proprietary solutions and platform extraction.

Lived experiences at the center

Expression

Centering lived experiences in the design and development of safety tools and moderation systems.

Intersectional, consent-first systems

Embodiment

Building intersectional, consent-first systems that respect user agency and diverse identities.

Design Principles of MuLCAM

Context-aware, not keyword-only: Understands meaning and context, not just individual words
Multilingual by design: Supports Nepali in Devanagari, Romanized Nepali, and other mother tongues of Nepal
Open and public-interest oriented: Freely available for research and public good
Designed for low-resource language contexts: Built with limited-resource languages in mind
Built with feminist and rights-based principles: Grounded in lived experiences and feminist practices

Corpus Structure

Languages: Nepali (Devanagari), Romanized Nepali, and other mother tongues of Nepal
Categories: TFGBV-related harm categories (approximately 12 categories)
Annotated Data: Contextual examples of harmful and non-harmful usage
Lexicon: Curated canonical terms maintained separately
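
One way to picture this structure is as two linked record types: a lexicon entry for each canonical term, and annotated examples that may reference it. The sketch below is hypothetical. The field names, language labels, and the placeholder category "category_01" are assumptions for illustration, and the published schema may differ.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional

# Hypothetical language labels mirroring the three groups listed above.
LANGUAGES = {"nepali_devanagari", "romanized_nepali", "other_mother_tongue"}


@dataclass
class LexiconEntry:
    """A curated canonical term, kept separately from the annotated examples."""
    term_id: str
    canonical_form: str
    variants: List[str] = field(default_factory=list)


@dataclass
class AnnotatedExample:
    """A contextual example, labeled as harmful or non-harmful usage."""
    text: str
    language: str
    is_harmful: bool
    category: Optional[str] = None         # one of the TFGBV harm categories
    lexicon_term_id: Optional[str] = None  # link to a LexiconEntry, if any


def validate(example: AnnotatedExample) -> List[str]:
    """Return a list of problems found in the example (empty if none)."""
    problems = []
    if example.language not in LANGUAGES:
        problems.append("unknown language label: " + example.language)
    if example.is_harmful and example.category is None:
        problems.append("harmful examples need a category")
    return problems


if __name__ == "__main__":
    term = LexiconEntry("t001", "placeholder_term", ["pl4ceholder_term"])
    example = AnnotatedExample(
        text="example sentence using pl4ceholder_term",
        language="romanized_nepali",
        is_harmful=True,
        category="category_01",
        lexicon_term_id=term.term_id,
    )
    print(validate(example))
    print(json.dumps(asdict(example), ensure_ascii=False, indent=2))
```

Keeping the lexicon separate from the examples means one canonical term can be linked to many contexts, including non-harmful ones, without duplicating the term itself.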

Who This Project Is For

Researchers and students
Feminist and civil society organizations
Platform trust & safety teams
Journalists and policymakers
Developers building moderation or reporting tools

Get Involved

Partnerships and Collaborations

MuLCAM welcomes collaboration with feminist and human rights organizations, academic institutions, digital safety initiatives, and open data and civic tech communities.

Support the Project

You can support the project by using and citing the dataset, sharing the project, providing financial or institutional support, and partnering on research and tools.

Contribute to the Corpus

Contributions are accepted through structured submission and review processes. All contributions are quality-checked before inclusion.
