Core Technology

Visual AI Engine: Perceive Interface, Plan Actions

Dual Execution Layers · Three-Level Recognition · Decreasing Costs

The browser layer identifies elements via the DOM/AX Tree; the cloud phone layer identifies them via Android accessibility controls. Three-level perception: fingerprint matching (zero cost) → LLM semantic recognition (low cost) → AI visual recognition (fallback). The AI understands the interface like a human and plans the next action.

Dual Execution Layers · Three-Level Perception · Gets Cheaper Over Time

Dual Execution Layer Architecture

Browser + Cloud Phone, covering all automation scenarios

Browser Execution Layer

DOM / AX Tree Recognition

Obtains page structure via Accessibility Tree, precisely locating each element's position, type, and interactivity. Suitable for web-based social media, e-commerce backends, etc.

DOM Structure Parsing
AX Tree Compression (60-80% Token Reduction)
Fingerprint Browser Sandbox Isolation

Web: X/Twitter, LinkedIn, Facebook Web
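
To make the AX Tree usable by an LLM, the engine compresses it before recognition. The sketch below shows one plausible way to do that in Python: keep only named interactive nodes and emit them as short indexed lines. The node schema (`role`, `name`, `children`) and the role whitelist are illustrative assumptions, not ProMoi's actual format.

```python
# Minimal sketch of AX Tree compression: keep only named interactive nodes
# and emit compact indexed lines for the LLM prompt. The node schema
# (role/name/children) is an illustrative assumption.
INTERACTIVE_ROLES = {"button", "link", "textbox", "checkbox", "combobox"}

def compress_ax_tree(node: dict, lines: list | None = None) -> list[str]:
    if lines is None:
        lines = []
    role, name = node.get("role", ""), (node.get("name") or "").strip()
    if role in INTERACTIVE_ROLES and name:
        lines.append(f"[{len(lines)}] {role}: {name}")   # drop wrappers/styling
    for child in node.get("children", []):
        compress_ax_tree(child, lines)
    return lines

ax_tree = {
    "role": "WebArea", "name": "Home / X", "children": [
        {"role": "generic", "name": "", "children": [
            {"role": "textbox", "name": "Post text"},
            {"role": "button", "name": "Post"},
        ]},
    ],
}
print("\n".join(compress_ax_tree(ax_tree)))
# [0] textbox: Post text
# [1] button: Post
```

In a scheme like this, the model can refer to elements by index instead of echoing the tree back, which keeps both the prompt and the response short.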

Key Focus

Cloud Phone Execution Layer

Android Control Recognition (Key Focus)

Obtains the control tree via the Android Accessibility Service, identifying text, desc, resource-id, bounds, and other attributes. Executes in a real device environment, so it holds up against mobile platform detection.

Android Accessibility Control Tree
Real Device Kernel + Independent Hardware Fingerprint
Bézier Curves + Natural Behavioral Patterns

Mobile: TikTok, Instagram, WhatsApp, Xiaohongshu

💡 Cloud phone is our core advantage: a real device environment + hardware-level sandbox isolation + AI control recognition, built to hold up against platform detection.
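
For concreteness, here is a rough sketch of reading the attributes named above (text, desc, resource-id, bounds) out of a uiautomator-style XML control dump. The dump format is standard Android tooling; the resource IDs shown and the way ProMoi actually consumes the tree are assumptions.

```python
# Sketch: extract text / content-desc / resource-id / bounds from a
# uiautomator-style XML dump. Resource IDs below are illustrative.
import re
import xml.etree.ElementTree as ET

DUMP = """
<hierarchy>
  <node text="" content-desc="Like" resource-id="com.zhiliaoapp.musically:id/like"
        clickable="true" bounds="[984,1410][1080,1506]"/>
  <node text="Add comment..." content-desc="" resource-id="com.zhiliaoapp.musically:id/comment"
        clickable="true" bounds="[0,1620][900,1716]"/>
</hierarchy>
"""

def parse_controls(xml_dump: str) -> list[dict]:
    controls = []
    for node in ET.fromstring(xml_dump).iter("node"):
        if node.get("clickable") != "true":
            continue
        x1, y1, x2, y2 = map(int, re.findall(r"\d+", node.get("bounds", "")))
        controls.append({
            "text": node.get("text", ""),
            "desc": node.get("content-desc", ""),
            "resource_id": node.get("resource-id", ""),
            "center": ((x1 + x2) // 2, (y1 + y2) // 2),  # tap target
        })
    return controls

for c in parse_controls(DUMP):
    print(c)
```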

Three-Level Intelligent Recognition

Mature workflows cost $0; only unknown pages consume tokens

Level 1: Fingerprint Library Matching

Milliseconds · $0 Cost

Generates fingerprints based on control text/desc/resource-id, directly matching previously learned pages.

Hit Rate: 60-80% (Mature Workflows)

Applicable: Previously Learned Pages
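
A minimal sketch of what Level 1 could look like: hash the stable control attributes (text / desc / resource-id) into a page fingerprint and look up the learned action plan. The hashing scheme and library layout are assumptions; the point is that a hit is a plain dictionary lookup and costs no tokens.

```python
# Sketch: fingerprint a page from its stable control attributes and match
# it against previously learned pages. The hashing scheme is an assumption.
import hashlib

def page_fingerprint(controls: list[dict]) -> str:
    keys = sorted(
        f"{c.get('resource_id', '')}|{c.get('desc', '')}|{c.get('text', '')}"
        for c in controls
    )
    return hashlib.sha256("\n".join(keys).encode("utf-8")).hexdigest()[:16]

# fingerprint -> learned action plan; grows as workflows mature
fingerprint_library: dict[str, list[dict]] = {}

def match(controls: list[dict]):
    return fingerprint_library.get(page_fingerprint(controls))  # None = miss
```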

Level 2: LLM Semantic Matching

Seconds · Low Cost

Uses a lightweight LLM to analyze the control tree text, locating target elements through semantic understanding. Tree compression significantly reduces token consumption.

Hit Rate: 20-30%

Applicable: Similar but Changed Pages

Level 3: AI Visual Recognition

Seconds · Metered Cost (Fallback)

Uses a VLM (GPT-4V, Claude Vision, or Qwen-VL) to analyze screenshots, understanding the interface visually.

Hit Rate: 99%+

Applicable: New Pages, Complex Layouts

Key Advantage: Mature workflow execution costs approach zero; only unknown pages consume tokens. Gets cheaper over time.
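
Putting the three levels together, the escalation logic might look like the sketch below. The helper functions are hypothetical stand-ins for the fingerprint library, a lightweight LLM call, and a VLM call; only the ordering and fallback behavior come from the description above.

```python
# Sketch of the three-level cascade. The three helpers are hypothetical
# placeholders; only the escalation order is taken from the text.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Page:
    controls: list
    compressed_controls: str
    screenshot: bytes

def match_fingerprint(controls) -> Optional[dict]: ...              # Level 1 store
def llm_locate(compressed_controls, intent) -> Optional[dict]: ...  # Level 2 LLM
def vlm_locate(screenshot, intent) -> Optional[dict]: ...           # Level 3 VLM

def locate_target(page: Page, intent: str) -> Optional[dict]:
    # Level 1: fingerprint match -- milliseconds, $0
    if (plan := match_fingerprint(page.controls)) is not None:
        return plan
    # Level 2: lightweight LLM over the compressed control tree -- low cost
    if (hit := llm_locate(page.compressed_controls, intent)) is not None:
        return hit
    # Level 3: VLM screenshot analysis -- metered fallback
    return vlm_locate(page.screenshot, intent)
```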

Perceive Interface, Plan Actions

AI doesn't just execute commands; it understands intent and plans paths

Perceive Current Interface

Get control tree/screenshot → Identify interactive elements → Understand page state

Browser: AX Tree + DOM Structure
Cloud Phone: Android Control Tree (text/desc/bounds)
Visual Fallback: Screenshot + VLM Analysis

Plan Next Action

Understand user intent → Analyze current state → Decide optimal action

Intent Understanding: What does the user want to achieve?
State Analysis: What can be done on the current interface?
Action Decision: What should be clicked/typed/swiped next?
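
The perceive/plan split above amounts to a simple loop. The sketch below shows that control flow with hypothetical helpers; it is not ProMoi's API, just the shape of the perceive → plan → act cycle.

```python
# Sketch of the perceive -> plan -> act loop. All three helpers are
# hypothetical placeholders; only the loop shape follows the text.
import time

def perceive() -> dict: ...                             # control tree / screenshot
def plan_next_action(intent: str, state) -> dict: ...   # intent + state -> action
def execute(action) -> None: ...                        # tap / type / swipe

def run_task(intent: str, max_steps: int = 50) -> None:
    for _ in range(max_steps):
        state = perceive()                         # 1. perceive the interface
        action = plan_next_action(intent, state)   # 2. plan the next action
        if not action or action.get("type") == "done":
            return
        execute(action)                            # 3. act, then loop
        time.sleep(action.get("pause", 1.0))       # human-like pacing
```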

Example: TikTok Account Warming Task

User Command: "Warm up TikTok account for 2 hours, browse videos, like, comment"

AI Understanding:

Platform Identified: TikTok

Task Type: Account Warming (Loop Mode)

Duration: 120 minutes

Behavior Weights: Browse 60%, Like 25%, Comment 15%

AI Autonomous Execution: Browse → Random Like → Smart Comment → Rest → Loop
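
The parsed intent above could be expressed as a small task spec that drives a weighted action loop. The weights mirror the example (browse 60%, like 25%, comment 15%); the field names and the rest of the loop are illustrative assumptions.

```python
# Sketch: a task spec derived from the command, plus a weighted picker for
# the next warming action. Keys and structure are illustrative.
import random

task = {
    "platform": "TikTok",
    "mode": "warmup_loop",
    "duration_minutes": 120,
    "weights": {"browse": 0.60, "like": 0.25, "comment": 0.15},
}

def next_action(weights: dict[str, float]) -> str:
    actions, probs = zip(*weights.items())
    return random.choices(actions, weights=probs, k=1)[0]

print([next_action(task["weights"]) for _ in range(10)])
# e.g. ['browse', 'browse', 'like', 'browse', 'comment', ...]
```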

Self-Healing: UI Updates Are No Longer a Problem

Traditional scripts crash on updates; AI adapts automatically

Scenario | Traditional RPA | ProMoi AI
TikTok UI Update | ❌ Script fails, needs manual fix | ✅ AI auto-identifies new layout
Instagram Button Position Change | ❌ Coordinates fail | ✅ Finds button via control tree/vision
LinkedIn New Popup | ❌ Flow interrupts | ✅ Auto-handles popup, continues
Captcha/Exception Page | ❌ Stuck | ✅ 4-level error recovery

Self-Healing Mechanism

When fingerprint matching fails, AI automatically triggers re-learning:

1. Detect: Page doesn't match fingerprint library
2. Analyze: Use LLM/VLM to understand new page structure
3. Update: Auto-update fingerprint library for next match
4. Continue: Seamlessly resume task execution
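
In code, the four steps collapse to a miss-then-learn path, something like the sketch below. `page_fingerprint` mirrors the Level 1 sketch and `analyze_with_llm_or_vlm` is a hypothetical fallback helper.

```python
# Sketch of the self-healing path: on a fingerprint miss, analyze the new
# layout, write it back to the library, and keep going. Helpers are stubs.
def page_fingerprint(controls) -> str: ...        # as in the Level 1 sketch
def analyze_with_llm_or_vlm(page) -> dict: ...    # hypothetical LLM/VLM fallback

def resolve_page(page, fingerprint_library: dict) -> dict:
    fp = page_fingerprint(page.controls)          # 1. detect: known page?
    plan = fingerprint_library.get(fp)
    if plan is None:
        plan = analyze_with_llm_or_vlm(page)      # 2. analyze the new layout
        fingerprint_library[fp] = plan            # 3. update the library
    return plan                                   # 4. continue the task
```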

Natural Behavioral Patterns: Operate Like a Real Person

Not just random delays, but complete human behavior simulation

Bézier Curve Swiping

Cubic Bézier curves generate smooth swipe trajectories with randomized control points, so every path is different.
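
As a concrete illustration, a cubic Bézier swipe path with randomized interior control points can be generated like this; the jitter ranges and point count are illustrative, not ProMoi's tuned values.

```python
# Sketch: a cubic Bezier swipe path with randomized control points.
import random

def bezier_swipe(start, end, steps: int = 30) -> list[tuple[int, int]]:
    (x0, y0), (x3, y3) = start, end
    # Randomize the two interior control points so no two paths repeat.
    x1 = x0 + (x3 - x0) * random.uniform(0.2, 0.4) + random.uniform(-40, 40)
    y1 = y0 + (y3 - y0) * random.uniform(0.2, 0.4) + random.uniform(-40, 40)
    x2 = x0 + (x3 - x0) * random.uniform(0.6, 0.8) + random.uniform(-40, 40)
    y2 = y0 + (y3 - y0) * random.uniform(0.6, 0.8) + random.uniform(-40, 40)
    path = []
    for i in range(steps + 1):
        t = i / steps
        # Cubic Bezier: B(t) = (1-t)^3 P0 + 3(1-t)^2 t P1 + 3(1-t) t^2 P2 + t^3 P3
        x = (1-t)**3*x0 + 3*(1-t)**2*t*x1 + 3*(1-t)*t**2*x2 + t**3*x3
        y = (1-t)**3*y0 + 3*(1-t)**2*t*y1 + 3*(1-t)*t**2*y2 + t**3*y3
        path.append((round(x), round(y)))
    return path

print(bezier_swipe((540, 1600), (540, 400))[:5])  # first few touch points
```

Because the interior control points are re-randomized on every call, no two swipes between the same start and end coordinates follow an identical trajectory.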

Random Offset Clicking

±3 pixel random offset, simulating human clicking imprecision.

Variable Speed Typing

Random typing speed and error rate, simulating real human input habits.

Visual Focus Pauses

Simulating human reading and thinking pauses, not mechanical continuous operation.

All behavior parameters are configurable, so profiles can be tuned to each platform's risk-control strategy.
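
A behavior profile of the kind implied above might look like the following sketch; the parameter names and values are illustrative assumptions.

```python
# Sketch: an illustrative per-platform behavior profile and two helpers
# that apply it. Names and values are assumptions, not shipped defaults.
import random

BEHAVIOR_PROFILE = {
    "tap_offset_px": 3,            # ±3 px random click offset
    "typing_cps": (2.5, 6.0),      # chars/sec, randomized per keystroke
    "typo_rate": 0.02,             # occasional mistype + correction
    "read_pause_s": (1.5, 7.0),    # visual focus pauses between actions
}

def jittered_tap(x: int, y: int, profile=BEHAVIOR_PROFILE) -> tuple[int, int]:
    off = profile["tap_offset_px"]
    return x + random.randint(-off, off), y + random.randint(-off, off)

def keystroke_delay(profile=BEHAVIOR_PROFILE) -> float:
    lo, hi = profile["typing_cps"]
    return 1.0 / random.uniform(lo, hi)
```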

Cost Control: Gets Cheaper Over Time

Three-level recognition architecture makes costs decrease with usage

>85% Fingerprint Match Rate (Mature Workflows)
60-80% Token Savings (AX Tree Compression)
>70% Zero-Cost Operations (After Stable Running)

Use Cases

Dual execution layers cover all automation needs

Social Media Matrix

Multi-channel operations across TikTok, Instagram, and WhatsApp team workspaces. The cloud phone execution layer keeps accounts safe.

Cloud Phone Execution

Cross-Border E-Commerce

Amazon, eBay backend operations, product listing, order processing. Browser execution layer is efficient and stable.

Browser Execution

LinkedIn Lead Generation

Auto search, filter, send connection requests. Browser sandbox isolation ensures account safety.

Browser Execution

Content Operations

Xiaohongshu and Douyin content publishing and engagement. The cloud phone's real device environment holds up against platform detection.

Cloud Phone Execution

Let AI Be Your Eyes and Brain

Perceive interfaces, plan actions, execute autonomously.

No credit card required · Private deployment available