TheDataRideAlongs/ProjectDomino

Project Domino

Open source AI to scale the fight against misinformation

One of the most important steps in stopping the COVID-19 pandemic is influencing mass behavior change so that citizens take appropriate, swift action to mitigate infection and human-to-human contact. Government officials at all levels have advocated misinformed practices, such as dining out or participating in outdoor gatherings, that have amplified the curve rather than flattened it. At the time of writing, poor crisis and emergency risk communication has led to over 32.9M US citizens testing positive (2-20X more are likely untested) and over 584K deaths. The need to influence appropriate behavior and mitigation actions is extreme: the US has shot up from untouched to the 6th most infected nation.

Project Domino accelerates research on developing capabilities for information hygiene at the mass scale necessary for the current national disaster and for future ones. We develop and enable the use of 3 key data capabilities for modern social discourse:

  • Detecting misinformation campaigns
  • Identifying at-risk behavior groups and viable local behavior change influencers
  • Automating high-precision interventions

Scale with community and industry to flip the asymmetries. Key to our approach is focusing on scale:

  • Scaling technology through GPU computing and AI
  • Network effects through open source and providing foundations for others
  • Growth through public/private partnerships that drive a sustainable team growth model

Fighting misinformation is traditionally difficult because of asymmetry: attackers can continuously launch cheap, global-scale attacks, which overwhelms individual defenders. However, through scalable technology and network effects across defenders, we believe we can flip that asymmetry into an advantage for the defenders.

Data

We are collecting, analyzing, and sharing data around:

  • Twitter: Who is saying what and when, including around URLs, especially around COVID and medical misinfo
  • Additional campaigns initiated by our community and partners, such as around voter suppression & pollution
  • Feeds: Correlations against labeled data such as bots, fact checks, and indicators of digital crime
  • Scores: Bots, misinformation, crime, likely location, and more

While we cannot publish the raw data due to compliance restrictions from our data providers, we are happy to support individual projects, such as analyzing and predicting real-world compliance with health policies and identifying bad actors. Please jump into the Slack or contact a project leader on LinkedIn and we'll get you going.

The interventions

We are working with ethics groups to identify safe interventions along the following lines:

  • Targeting of specific underserved issues: Ex: Public health issues such as unsafe social behavior, unsafe medicine, unsafe science, dangerous government policy influence, and adjacent issues such as fake charities, phishing, malware, and hate group propaganda

  • Help top social platforms harden themselves: Trust and safety teams at top social networks need to be able to warn users about misinformation, de-trend it, and potentially take it down before it has served its purpose. The status quo is handling incidents months after the fact. We will provide real-time alert feeds and scoring APIs to help take action during the critical minutes before misinformation gains significant reach.

  • Enable top analysts to investigate coordinated activity: A minority of groups cause the bulk of the misinformation that gets shared. We are building a high-scale analyst environment featuring technology such as GPU-accelerated visual graph analytics and high-memory notebook servers.

  • Help leaders clean up their community: Identify and invite community leaders of at-risk groups to use our tools to detect trending misinformation and sift it out from their regular community content.

  • Alert individuals as they are being manipulated: For manipulated conversations where we have clear intelligence, we are exploring an alert bot that will post the misinformation report directly on the thread, or enable community participants or project partners to do so.

  • Enable other platforms: We expect a growing number of initiatives to benefit from our intelligence and automation capabilities.
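To make the alert-bot intervention concrete, a report it posts on a manipulated thread might carry fields like the ones below. This is purely a sketch: the field names and disclosure text are invented for illustration, not a published schema.

```python
def build_misinfo_report(thread_url, claim, fact_check_url, confidence):
    """Assemble the report an alert bot could post on a manipulated thread.
    All field names here are illustrative, not a published schema."""
    return {
        "thread": thread_url,                  # where the manipulation is happening
        "claim": claim,                        # the misinformation being spread
        "fact_check": fact_check_url,          # supporting evidence for the alert
        "confidence": round(confidence, 2),    # model confidence in the detection
        "disclosure": "Automated misinformation report",
    }
```

A bot, a community participant, or a project partner could render the same payload into a reply, which keeps the detection logic separate from how each channel delivers the warning.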

The technologies

  • Twitter firehose monitor: 100K+ topical tweets/day
  • Data integration pipeline for sources of known scams, fraud, lies, bots, propaganda, extremism, and other misinformation sources
  • Misinformation knowledge graph connecting accounts, posts, reports, and models (1B+ nodes/edges)
  • Automated GPU / graph / machine learning pipeline: general classification (bot, community, ...) and targeted (clinical disinformation, ...): Nvidia RAPIDS, BERT, GNNs, ...
  • Automated alerting & reporting pipeline: Prefect, Streamlit
  • Interactive visual analytics environment for data scientists and analysts: GPU, Graphistry, graph, Jupyter, Streamlit, ML, ...
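As a rough illustration of how these stages fit together — monitor, filter to topic, score, and shape for the knowledge graph — here is a minimal sketch in plain Python. The function names and the toy scoring rule are invented for this sketch; the real pipeline uses Prefect tasks, ML classifiers, and Neo4j rather than in-memory dicts.

```python
def filter_topical(tweets, keywords):
    """Keep only tweets that mention a monitored topic (firehose filter stage)."""
    return [t for t in tweets if any(k in t["text"].lower() for k in keywords)]

def score_bot_likelihood(tweet):
    """Toy stand-in for the ML bot classifier: flag very young, very active accounts."""
    account = tweet["account"]
    return 1.0 if account["age_days"] < 30 and account["tweets_per_day"] > 100 else 0.1

def to_graph_rows(tweets):
    """Shape tweets into (account)-[:POSTED]->(tweet) edges for a graph load."""
    return [
        {"src": t["account"]["id"], "rel": "POSTED", "dst": t["id"],
         "bot_score": score_bot_likelihood(t)}
        for t in tweets
    ]
```

In the production stack each of these would be an orchestrated task so that failures retry independently and the graph load can batch at scale.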

How to help

We are actively seeking several forms of support:

  • Volunteers: The most immediate needs are data engineering, data science, and advisors on marketing/funding/public health

    • Data engineers: Orchestration (Airflow, Prefect.io, Nifi, ...), streaming (Kafka, ...), graph (Neo4j, cuGraph), GPU (RAPIDS), ML (NLP libs), and databases
    • Analysts: OSINT, threat intel, campaign tracking, ...
    • Data scientists: especially around graph, misinformation, neural networks, GNNs, NLP, with backgrounds such as security, fraud, misinformation, marketing
    • Developers & designers: intelligence integrations, website for search & reports, automations, intervention bots, API
    • Marketing: Strategy & implementation
    • Public health and communications: Especially around safe and effective intervention design
    • Legal: Risk & legal analysis for various interventions
  • APIs and Data:

    • Feeds & enriching APIs: Lists and intel on URLs, domains, keywords, emails, topics, blockchain accounts, social media accounts & content, clinical trials, esp. if tunable on topic
    • Scoring: Libraries and APIs around social networks (initially Twitter) structure & content, websites, and news: bot scores, fingerprinting, topic extraction & classification
    • Crawling tech: Social networks and web
  • Software Licenses:

    • SaaS, OSS, Docker preferred
    • Project management
    • Analytics
    • Automation
  • Hardware: Anything you can provide along the lines of:

    • 1 x Database server (CPU): 32+ cores, Ubuntu 18, 64GB+ RAM, ideally backups
    • 1 x Primary analytics server (GPU) - 32+ CPU cores, 128GB+ CPU RAM, 2-8 GPUs, 64GB+ disk, 2-10TB attached SSD
      • GPUs: Nvidia Pascal or later, 12GB minimum, with 32GB strongly preferred (Ex: 32GB Volta)
    • 2 x Secondary / developer servers (GPU) - 8+ CPU cores, 64GB+ CPU RAM, 2-4 GPUs, 64GB+ disk, 1TB attached SSD
      • GPUs: Nvidia Pascal or later, 12GB minimum
    • 3 x Analyst stations (GPU) - 8+ CPU cores, 64GB+ CPU RAM, 1 TB attached SSD, 1-2 GPUs
      • GPUs: Nvidia Pascal or later, 12GB minimum, with 32GB strongly preferred (Ex: 32GB Volta)
    • For each, we can work together to set up Ubuntu with remote SSH admin, Nvidia drivers, & (Nvidia-enabled) Docker
  • Sponsors: Near-term funding until the project finds a more sustainable path is welcome!

    • Federal and private grants
    • Mission-aligned contracts: We are happy to help organizations tackle challenges around social intelligence, such as disaster analytics & anti-misinformation
  • Subprojects: We are focusing near-term on core data pipeline and simple analyses while building up to the discourse-graph-level ones

    • Firehose pipeline:
      • Cleaning up Twitter -> Neo4j/Parquet data conversions
      • Prefect.io/Airflow Firehose tasks: usertimeline, topic search, ... => Parquet/Neo4j
      • Switching from Twarc to Twint for higher-volume ingest
      • Fast large graph (100M row) export of cypher->neo4j->parquet
    • Orchestration tasks:
      • Python entity extractors: tweet -> text, URLs, bitcoin address, topics, ...
      • Python enrichment tasks: external APIs (factcheck, crypto, ...) and lightweight NLP algs (sentiment) and feed back to neo4j
      • Python data feed collectors: scripts that download feeds (factcheck, phishing, ...) and feed back to neo4j
    • Intervention campaigns:
      • Priority: Untrialed drug misinformation - modeling, detection, analysis, report, & alert
      • Scams: Check URLs & blockchain addresses for known badness
      • Mapping bots & misinformers
    • UI tools & automation: Prototypes of
      • Alert feed + scoring API: Misinfo intel for trust & safety teams that is powered by Neo4j and/or Prefect.io/Airflow
      • Community leader tools: Personalized alerting for community leaders to detect & respond to misinfo in their close social networks
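For contributors eyeing the Python entity-extractor tasks above, a minimal sketch might look like the following. The regexes and function name are assumptions for illustration; a production task would also handle shortened URLs, topic extraction, and feeding results back to Neo4j.

```python
import re

# Match http(s) URLs embedded in tweet text.
URL_RE = re.compile(r"https?://\S+")
# Match legacy (P2PKH/P2SH) Bitcoin addresses: base58, starting with 1 or 3.
BTC_RE = re.compile(r"\b[13][a-km-zA-HJ-NP-Z1-9]{25,34}\b")

def extract_entities(tweet_text):
    """Pull URLs and candidate Bitcoin addresses out of raw tweet text."""
    return {
        "urls": URL_RE.findall(tweet_text),
        "btc_addresses": BTC_RE.findall(tweet_text),
    }
```

Each extractor stays a pure function over text so it can be dropped into a Prefect or Airflow task and unit-tested without a database connection.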

Contact

Please contact Leo Meyerovich (CEO @ Graphistry) and Sean Griffin (CEO @ DisasterTech) for support and information.

Community Slack channel: #COVID via an open invite link
