MuLCAM

Multilingual Corpus for Online Safety and Moderation in Nepal

MuLCAM is Nepal's first open, context-aware linguistic corpus focused on Technology-Facilitated Gender-Based Violence (TFGBV). It documents how harm and abuse appear in Nepali digital spaces across languages, scripts, and cultural contexts, and makes this data openly available for research, moderation, and public-interest technology.

The Problem

Online gender-based violence is rising in Nepal: complaints have increased sharply in recent years, and women constitute a significant proportion of survivors. According to the Cyber Bureau of Nepal Police, in fiscal year (FY) 2023/24 alone, 19,730 cybercrime complaints were filed, of which 8,745 were related to violence against women, with cases ranging from harassment and impersonation to blackmail and non-consensual sharing of intimate images. A further 382 complaints involved girls and 767 involved individuals from gender and sexual minority groups.

Why Existing Moderation Systems Fail

Nepali digital communication is multilingual, contextual, and often informal. Content in these spaces mixes Romanized Nepali, slang, coded language, grawlix, and obfuscation. Existing moderation systems, largely trained on global or English-language datasets, fail to recognize these patterns.

We have categorized these causes into the following key problem areas:

Invisible Harm

Many TFGBV incidents go unnoticed and unreported across languages and communities. Without context-aware, multilingual moderation systems, instances of online harm remain undetected and unaddressed. This invisibility perpetuates harm and prevents accountability.

Multilingual Moderation Gaps

No multilingual Nepal-based dataset of slurs and harassment language exists. Platforms cite multilingual gaps as a barrier to effective moderation. Most systems are trained on English or global datasets and fail to recognize patterns in Nepali digital communication, including Romanized Nepali, slang, dialects, and coded language.

Tech Developers Gap

Tech developers often lack gendered and intersectional understanding when building moderation systems. Without diverse perspectives and lived experiences informing technical design, tools fail to recognize harm in its actual context and may perpetuate bias.

Evidence Unavailability

A lack of available evidence is a major reason for inaction, slow investigations, and delayed justice. Without comprehensive, context-aware datasets documenting how harm appears in Nepali digital spaces, it becomes difficult to hold perpetrators accountable, demonstrate patterns to platform moderators, and build evidence for policy advocacy.

MuLCAM as a Solution

MuLCAM addresses these gaps by building a locally grounded, multilingual corpus that reflects how harm actually appears in Nepali digital spaces.

Our Approach

We approach these problems through corpus development, tool development, and research.

Corpus

Open, Living Multilingual Corpus

An evolving, feedback-driven dataset of slurs, derogatory terms, and contextual harassment across languages.

  • Grounded in lived experiences and feminist review practices
  • Forms the foundation for detection, monitoring, and research on TFGBV
  • Continuously updated with community contributions and feedback
  • Supports multiple languages and scripts including Devanagari and Romanized Nepali
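
As a rough sketch of what one entry in such a dataset could look like (the field names below are illustrative assumptions, not the project's published schema), a single record would pair the text with its language, script, harm category, and the contextual judgement behind the label:

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class CorpusEntry:
        """Hypothetical shape of one annotated example in a living corpus."""
        text: str               # the post or comment as it appeared online
        language: str           # e.g. "ne" for Nepali, or another mother tongue
        script: str             # "devanagari" or "romanized"
        category: str           # one of the TFGBV-related harm categories
        is_harmful: bool        # a contextual judgement, not a keyword match
        context_note: str = ""  # why reviewers judged the usage harmful or benign
        reviewer_feedback: List[str] = field(default_factory=list)  # community input over time

    entry = CorpusEntry(
        text="<redacted example>",
        language="ne",
        script="romanized",
        category="harassment",
        is_harmful=True,
        context_note="Derogatory term directed at a named person.",
    )

Keeping reviewer feedback alongside each record is one way the living, feedback-driven nature of the corpus could be expressed in the data itself.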

Detection & Moderation Tools

Ethical Detection & Moderation Tools (Ongoing)

Tools designed to detect, monitor, and flag gender-based harmful language—without stripping users of agency.

  • Built as plug-ins, APIs, and security-layer integrations
  • Enables user-controlled reporting, evidence recording, and handling
  • Context-aware detection that understands meaning, not just keywords
  • Respects user privacy and agency while providing safety mechanisms
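
To make the plug-in and API idea concrete, here is a minimal client-side sketch, assuming a hypothetical HTTP screening endpoint; the URL, function name, and response fields are invented for illustration and are not a released MuLCAM interface:

    import requests  # third-party HTTP client, used here only for illustration

    def screen_text(text: str, language_hint: str = "ne") -> dict:
        """Ask a (hypothetical) moderation service whether the text matches
        known TFGBV patterns, returning the service's judgement to the caller."""
        response = requests.post(
            "https://example.org/mulcam/screen",  # placeholder URL, not a real endpoint
            json={"text": text, "language_hint": language_hint},
            timeout=10,
        )
        response.raise_for_status()
        # Example response shape: {"flagged": true, "category": "harassment", "confidence": 0.87}
        return response.json()

Crucially, the result is handed back to the user or the integrating application: whether to hide the content, report it, or record it as evidence stays a user-controlled decision, in line with the agency-preserving design described above.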

Overall Approach Architecture

Online content feeds annotation and lexicon development, which build the MuLCAM corpus; the corpus in turn supports research, moderation, reporting, and policy work.

Aligned with the Feminist Principles of the Internet

MuLCAM translates feminist values into technical and governance choices.

  • Access: Safety across languages and contexts, ensuring all communities can access and benefit from safe digital spaces.
  • Rights: Promoting governance, accountability, and justice by design in safety systems and moderation tools.
  • Economy & Openness: Promoting open knowledge and public-interest technology over proprietary solutions and platform extraction.
  • Expression: Centering lived experiences in the design and development of safety tools and moderation systems.
  • Embodiment: Building intersectional, consent-first systems that respect user agency and diverse identities.

Design Principles of MuLCAM

  • Context-aware, not keyword-only: Understands meaning and context, not just individual words
  • Multilingual by design: Supports Devanagari, Romanized Nepali, and other mother tongues
  • Open and public-interest oriented: Freely available for research and public good
  • Designed for low-resource language contexts: Built with limited-resource languages in mind
  • Built with feminist and rights-based principles: Grounded in lived experiences and feminist practices
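
As a toy illustration of the "context-aware, not keyword-only" principle (the terms, mappings, and rules below are invented placeholders, not entries from the actual corpus), exact keyword matching misses the obfuscated and Romanized spellings common in Nepali digital spaces, while even simple normalization catches more of them:

    import re
    import unicodedata

    # Invented placeholder term - NOT a real corpus entry.
    LEXICON = {"badword"}

    OBFUSCATION_MAP = str.maketrans({"0": "o", "1": "i", "3": "e", "@": "a", "$": "s"})

    def normalize(token: str) -> str:
        """Collapse common obfuscation (leetspeak, repeated letters, mixed case)
        so that 'B@dw0rd' and 'baadword' map to the same canonical form."""
        token = unicodedata.normalize("NFKC", token).lower()
        token = token.translate(OBFUSCATION_MAP)
        token = re.sub(r"(.)\1+", r"\1", token)  # squeeze repeated characters
        return token

    def keyword_only_flag(text: str) -> bool:
        """A naive matcher: misses obfuscated spellings entirely."""
        return any(word in LEXICON for word in text.lower().split())

    def context_aware_flag(text: str) -> bool:
        """Still simplistic, but normalizes tokens before matching."""
        return any(normalize(word) in LEXICON for word in text.split())

    print(keyword_only_flag("you are a B@dw0rd"))   # False - obfuscation evades it
    print(context_aware_flag("you are a B@dw0rd"))  # True  - normalization catches it

A real system would go further, weighing surrounding words, targets, and intent, but the sketch shows why keyword lists alone are not enough.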

Corpus Structure

  • Languages: Nepali (Devanagari), Romanized Nepali, Other Nepali mother tongues
  • Categories: TFGBV-related harm categories (approximately 12)
  • Annotated Data: Contextual examples of harmful and non-harmful usage
  • Lexicon: Curated canonical terms maintained separately
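
As a hedged sketch of how the lexicon and the annotated data might sit side by side (field names and values are assumptions for illustration; the maintained schema may differ), a canonical term could be kept in the separate lexicon and referenced from contextual examples:

    # Illustrative only - field names and values are assumptions, not the published schema.
    lexicon_entry = {
        "canonical_term": "<redacted term>",           # curated in the separate lexicon
        "script_variants": ["<Devanagari form>", "<Romanized form>"],
        "category": "harassment",                      # one of roughly 12 TFGBV-related categories
    }

    annotated_example = {
        "text": "<redacted post>",
        "language": "ne",
        "lexicon_refs": ["<redacted term>"],           # links back to the canonical term
        "label": "harmful",                            # contextual judgement for this usage
        "context": "directed at a named individual",
    }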

Who This Project Is For

  • Researchers and students
  • Feminist and civil society organizations
  • Platform trust & safety teams
  • Journalists and policymakers
  • Developers building moderation or reporting tools

Get Involved

Partnerships and Collaborations

MuLCAM welcomes collaboration with feminist and human rights organizations, academic institutions, digital safety initiatives, and open data and civic tech communities.

Partner with Us

Support the Project

Support includes using and citing the dataset, sharing the project, providing financial or institutional support, and collaborating on research and tool partnerships.

Learn More

Contribute to the Corpus

Contributions are accepted through structured submission and review processes. All contributions are quality-checked before inclusion.
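
As a rough idea of what an automated first pass in such a review process might check (the rules below are illustrative assumptions, not the project's actual criteria), a submission could be screened for completeness before it reaches human reviewers:

    from typing import Dict, List

    def basic_submission_checks(entry: Dict[str, str]) -> List[str]:
        """Return a list of problems found in a submitted entry; an empty list
        means it can move on to human review. Illustrative only."""
        problems = []
        for required in ("text", "language", "category"):
            if not entry.get(required):
                problems.append(f"missing field: {required}")
        if "http" in entry.get("text", ""):
            problems.append("possible link or identifying URL; needs manual redaction review")
        return problems

    print(basic_submission_checks({"text": "example", "language": "ne", "category": ""}))
    # ['missing field: category']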

Start Contributing