Visual AI Engine: Perceive Interface, Plan Actions
Dual Execution Layers · Three-Level Recognition · Decreasing Costs
The browser layer identifies elements via the DOM and AX Tree; the cloud phone layer identifies them via Android accessibility controls. Perception runs in three levels: fingerprint matching (zero cost) → LLM semantic recognition (low cost) → AI visual recognition (fallback). The AI understands interfaces like a human and plans the next action.
Dual Execution Layer Architecture
Browser + Cloud Phone, covering all automation scenarios
Browser Execution Layer
DOM / AX Tree Recognition
Obtains the page structure from the Accessibility Tree and locates each element's position, type, and interactivity. Suited to web-based social media, e-commerce backends, and similar sites.
Web: X/Twitter, LinkedIn, Facebook Web
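As a rough illustration of AX Tree recognition, the sketch below walks a nested accessibility-tree dictionary and keeps only the interactive nodes. The tree shape (`role`, `name`, `bounds`, `children`), the `INTERACTIVE_ROLES` set, and the `Element` type are assumptions made for this example, not the product's actual data model.

```python
from dataclasses import dataclass

# Roles treated as interactive in this sketch (an assumption, not an exhaustive list).
INTERACTIVE_ROLES = {"button", "link", "textbox", "checkbox", "combobox"}


@dataclass
class Element:
    role: str
    name: str
    bounds: tuple  # (x, y, width, height)


def interactive_elements(node):
    """Walk a nested accessibility-tree dict and collect interactive nodes in document order."""
    found = []
    if node.get("role") in INTERACTIVE_ROLES:
        found.append(Element(node["role"], node.get("name", ""), tuple(node.get("bounds", ()))))
    for child in node.get("children", []):
        found.extend(interactive_elements(child))
    return found


# Hypothetical tree for a web timeline page.
sample_tree = {
    "role": "WebArea", "name": "Home",
    "children": [
        {"role": "heading", "name": "Timeline"},
        {"role": "button", "name": "Post", "bounds": [900, 40, 80, 32]},
        {"role": "group", "children": [
            {"role": "textbox", "name": "Search", "bounds": [300, 40, 400, 32]},
        ]},
    ],
}

elements = interactive_elements(sample_tree)
```

Filtering by role before any matching step keeps headings and layout containers out of the candidate list.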
Cloud Phone Execution Layer
Android Control Recognition (Key Focus)
Obtains the control tree via the Android Accessibility Service, identifying each control's text, desc, resource-id, bounds, and other attributes. Executes in a real device environment, which keeps it compatible with mobile platform detection.
Mobile: TikTok, Instagram, WhatsApp, Xiaohongshu
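To make the control attributes concrete, here is a minimal sketch that parses a UI hierarchy in the XML shape produced by Android's `uiautomator dump` tool. How the product actually serializes its control tree is not stated, so the input format, the `parse_controls` helper, and the sample package name are assumptions.

```python
import re
import xml.etree.ElementTree as ET

# Bounds are written as "[left,top][right,bottom]" in uiautomator dumps.
BOUNDS_RE = re.compile(r"\[(-?\d+),(-?\d+)\]\[(-?\d+),(-?\d+)\]")


def parse_controls(xml_text):
    """Return one dict per <node>, with the attributes named in the text above."""
    controls = []
    for node in ET.fromstring(xml_text).iter("node"):
        match = BOUNDS_RE.fullmatch(node.get("bounds", ""))
        if match is None:
            continue  # skip nodes without usable bounds
        controls.append({
            "text": node.get("text", ""),
            "desc": node.get("content-desc", ""),
            "resource_id": node.get("resource-id", ""),
            "bounds": tuple(int(v) for v in match.groups()),
            "clickable": node.get("clickable") == "true",
        })
    return controls


# Hypothetical two-node hierarchy.
SAMPLE_XML = """<hierarchy>
  <node text="" content-desc="" resource-id="com.example:id/root" clickable="false" bounds="[0,0][1080,1920]">
    <node text="Follow" content-desc="Follow user" resource-id="com.example:id/follow" clickable="true" bounds="[800,300][1000,380]" />
  </node>
</hierarchy>"""

controls = parse_controls(SAMPLE_XML)
```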
💡 Cloud phone is our core advantage: a real device environment, a hardware-level isolated sandbox, and AI control recognition, which together stay compatible with platform detection.
Three-Level Intelligent Recognition
Mature workflows cost $0; only unknown pages consume tokens
Level 1: Fingerprint Library Matching
Generates fingerprints based on control text/desc/resource-id, directly matching previously learned pages.
Hit Rate: 60-80% (Mature Workflows)
Applicable: Previously Learned Pages
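One plausible way to build such a fingerprint is to hash the sorted text/desc/resource-id triples of a page's controls. The sketch below assumes that approach; the `page_fingerprint` and `lookup` helpers and the in-memory `library` dictionary are hypothetical, and a production fingerprint would likely ignore volatile text such as captions or counters.

```python
import hashlib


def page_fingerprint(controls):
    """Hash the (text, desc, resource-id) triples of a page, independent of control order."""
    keys = sorted(
        "|".join((c.get("text", ""), c.get("desc", ""), c.get("resource_id", "")))
        for c in controls
    )
    return hashlib.sha256("\n".join(keys).encode("utf-8")).hexdigest()


# fingerprint -> previously learned targets (hypothetical structure)
library = {}


def lookup(controls):
    """Return the learned targets for this page, or None if the page is unknown."""
    return library.get(page_fingerprint(controls))


page = [
    {"text": "Follow", "desc": "Follow user", "resource_id": "com.example:id/follow"},
    {"text": "", "desc": "Share", "resource_id": "com.example:id/share"},
]
library[page_fingerprint(page)] = {"follow_button": 0}
```

A lookup is a single dictionary read, which is why a hit at this level involves no model call.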
Level 2: LLM Semantic Matching
Uses a lightweight LLM to analyze the control tree's text and locate target elements through semantic understanding. Compressing the tree first significantly reduces token consumption.
Hit Rate: 20-30%
Applicable: Similar but Changed Pages
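The compression technique is not described here, so the following is only one possible sketch: drop non-clickable and unlabeled controls and emit a compact one-line-per-control summary for the LLM prompt. The `compress_controls` helper and its output format are assumptions.

```python
def compress_controls(controls):
    """Summarize clickable, labeled controls as 'index:label#short-id' lines."""
    lines = []
    for index, c in enumerate(controls):
        label = c.get("text") or c.get("desc")
        if not c.get("clickable") or not label:
            continue  # layout containers and static text add tokens but no targets
        short_id = c.get("resource_id", "").rsplit("/", 1)[-1]
        lines.append(f"{index}:{label}" + (f"#{short_id}" if short_id else ""))
    return "\n".join(lines)


# Hypothetical control list for a video page.
sample_controls = [
    {"text": "", "desc": "", "resource_id": "com.example:id/root", "clickable": False},
    {"text": "Follow", "desc": "Follow user", "resource_id": "com.example:id/follow", "clickable": True},
    {"text": "A long video caption", "desc": "", "resource_id": "", "clickable": False},
    {"text": "", "desc": "Share", "resource_id": "", "clickable": True},
]

prompt_body = compress_controls(sample_controls)
```

Keeping the original index in each line lets the model answer with a number that maps straight back to a control.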
Level 3: AI Visual Recognition
Uses a VLM (GPT-4V/Claude Vision/Qwen-VL) to analyze screenshots and understand the interface visually.
Hit Rate: 99%+
Applicable: New Pages, Complex Layouts
✅ Key Advantage: Mature workflow execution costs approach zero; only unknown pages consume tokens. Gets cheaper over time.
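The three levels can be pictured as a cascade that tries the cheapest recognizer first and escalates only on a miss. The sketch below uses stand-in functions in place of real fingerprint, LLM, and VLM calls; every name and the `page` dictionary layout are hypothetical.

```python
def fingerprint_match(page, target):
    # Level 1 stand-in: direct lookup of a previously learned target.
    return page.get("known_targets", {}).get(target)


def llm_match(page, target):
    # Level 2 stand-in: substring match over labels, in place of an LLM call.
    for index, label in enumerate(page.get("labels", [])):
        if target.lower() in label.lower():
            return index
    return None


def vision_match(page, target):
    # Level 3 stand-in: precomputed screenshot hits, in place of a VLM call.
    return page.get("screenshot_hits", {}).get(target)


LEVELS = [("fingerprint", fingerprint_match), ("llm", llm_match), ("vision", vision_match)]


def recognize(page, target, levels=LEVELS):
    """Try each recognizer in order; return (level_name, result) for the first hit."""
    for name, recognizer in levels:
        result = recognizer(page, target)
        if result is not None:
            return name, result
    return None, None


page = {
    "known_targets": {},
    "labels": ["Home", "Follow user", "Share"],
    "screenshot_hits": {"heart icon": (980, 1200)},
}
```

Because the loop returns on the first hit, the more expensive levels run only for pages the cheaper ones cannot resolve.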
Perceive Interface, Plan Actions
AI doesn't just execute commands; it understands intent and plans paths
Perceive Current Interface
Get control tree/screenshot → Identify interactive elements → Understand page state
Plan Next Action
Understand user intent → Analyze current state → Decide optimal action
Example: TikTok Account Warming Task
User Command: "Warm up TikTok account for 2 hours, browse videos, like, comment"
AI Understanding:
• Platform Identified: TikTok
• Task Type: Account Warming (Loop Mode)
• Duration: 120 minutes
• Behavior Weights: Browse 60%, Like 25%, Comment 15%
→ AI Autonomous Execution: Browse → Random Like → Smart Comment → Rest → Loop
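The parsed task above could be represented as a small plan object whose behavior weights drive a weighted random choice of the next action. This is a sketch under that assumption; `WarmupPlan` and `next_actions` are hypothetical names, and the real planner may work differently.

```python
import random
from dataclasses import dataclass


@dataclass
class WarmupPlan:
    platform: str
    duration_minutes: int
    weights: dict  # action name -> percentage weight


def next_actions(plan, count, rng):
    """Draw `count` actions at random, in proportion to the plan's behavior weights."""
    actions = list(plan.weights)
    return rng.choices(actions, weights=[plan.weights[a] for a in actions], k=count)


# Values taken from the example task above.
plan = WarmupPlan("TikTok", 120, {"browse": 60, "like": 25, "comment": 15})
sequence = next_actions(plan, 10, random.Random())
```

Passing the random generator in explicitly makes a session reproducible when a fixed seed is supplied.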
Self-Healing: UI Updates Are No Longer a Problem
Traditional scripts crash on updates; AI adapts automatically
| Scenario | Traditional RPA | ProMoi AI |
|---|---|---|
| TikTok UI Update | ❌ Script fails, needs manual fix | ✅ AI auto-identifies new layout |
| Instagram Button Position Change | ❌ Coordinates fail | ✅ Finds button via control/vision |
| LinkedIn New Popup | ❌ Flow interrupts | ✅ Auto-handles popup, continues |
| Captcha/Exception Page | ❌ Stuck | ✅ 4-level error recovery |
Self-Healing Mechanism
When fingerprint matching fails, AI automatically triggers re-learning:
Detect: Page doesn't match fingerprint library
Analyze: Use LLM/VLM to understand new page structure
Update: Auto-update fingerprint library for next match
Continue: Seamlessly resume task execution
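The four steps above can be sketched as a single lookup function that falls back to analysis on a miss and writes the result back to the library. The `locate_with_healing` helper, the `analyze` callback (standing in for an LLM/VLM call), and the library layout are assumptions for illustration.

```python
def locate_with_healing(page_key, target, library, analyze):
    """Return (location, healed). `healed` is True when the library had to be updated."""
    entry = library.get(page_key)
    if entry is not None and target in entry:
        return entry[target], False  # fingerprint hit, nothing to heal

    # Detect: the page or target is not in the fingerprint library.
    location = analyze(target)  # Analyze: ask the (stand-in) model for the new location
    if location is None:
        raise LookupError(f"could not locate {target!r} on page {page_key!r}")

    library.setdefault(page_key, {})[target] = location  # Update: remember for next time
    return location, True  # Continue: caller resumes the task with the new location


analyze_calls = []


def fake_analyze(target):
    # Stand-in for an LLM/VLM call; records how often it is invoked.
    analyze_calls.append(target)
    return (800, 300, 1000, 380)


fingerprint_library = {}
first = locate_with_healing("fp-123", "follow", fingerprint_library, fake_analyze)
second = locate_with_healing("fp-123", "follow", fingerprint_library, fake_analyze)
```

In this sketch the second call is served from the library, so the model is consulted only once per changed page.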
Natural Behavior Patterns: Operate Like a Real Person
Not just random delays, but complete human behavior simulation
Bézier Curve Swiping
Cubic Bézier curves with randomized control points generate smooth trajectories, so every path is different.
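A minimal sketch of such a swipe: place two control points near the straight line between start and end, displace them randomly, and sample the standard cubic Bézier formula. The `bezier_swipe` function, the `jitter` amount, and the step count are assumptions, not the product's parameters.

```python
import random


def bezier_swipe(start, end, steps=20, jitter=80.0, rng=None):
    """Sample steps+1 points along a cubic Bézier curve with randomized control points."""
    rng = random.Random() if rng is None else rng
    (x0, y0), (x3, y3) = start, end
    # Control points sit at 1/3 and 2/3 of the straight line, then get a random offset.
    x1 = x0 + (x3 - x0) / 3 + rng.uniform(-jitter, jitter)
    y1 = y0 + (y3 - y0) / 3 + rng.uniform(-jitter, jitter)
    x2 = x0 + 2 * (x3 - x0) / 3 + rng.uniform(-jitter, jitter)
    y2 = y0 + 2 * (y3 - y0) / 3 + rng.uniform(-jitter, jitter)
    path = []
    for i in range(steps + 1):
        t = i / steps
        # Bernstein basis weights of the cubic Bézier formula.
        a, b, c, d = (1 - t) ** 3, 3 * (1 - t) ** 2 * t, 3 * (1 - t) * t ** 2, t ** 3
        path.append((a * x0 + b * x1 + c * x2 + d * x3,
                     a * y0 + b * y1 + c * y2 + d * y3))
    return path


# Upward swipe on a hypothetical 1080x1920 screen.
swipe = bezier_swipe((540, 1500), (540, 500))
```

The curve always starts and ends exactly at the requested points; only the route between them varies from call to call.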
Random Offset Clicking
±3 pixel random offset, simulating human clicking imprecision.
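As an illustration, the sketch below applies a ±3 pixel offset around the center of a control's bounds. Measuring the offset from the center and clamping to the control's area are assumptions; `jittered_click` is a hypothetical helper.

```python
import random


def jittered_click(bounds, max_offset=3, rng=None):
    """Pick a click point within ±max_offset pixels of the control's center."""
    rng = random.Random() if rng is None else rng
    left, top, right, bottom = bounds
    cx, cy = (left + right) // 2, (top + bottom) // 2
    x = cx + rng.randint(-max_offset, max_offset)
    y = cy + rng.randint(-max_offset, max_offset)
    # Clamp so that very small controls are never missed.
    return min(max(x, left), right - 1), min(max(y, top), bottom - 1)


point = jittered_click((800, 300, 1000, 380))
```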
Variable Speed Typing
Random typing speed and error rate, simulating real human input habits.
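One way to model this is to emit a list of (key, delay) events, occasionally inserting a wrong letter followed by a backspace. The delay range, the error rate, and the `typing_events` and `apply_events` helpers are illustrative assumptions.

```python
import random

BACKSPACE = "\b"


def typing_events(text, rng, error_rate=0.05, min_delay=0.08, max_delay=0.30):
    """Return (key, delay_seconds) events, with occasional typo-and-correct pairs."""
    events = []
    for ch in text:
        if ch.isalpha() and rng.random() < error_rate:
            wrong = rng.choice("abcdefghijklmnopqrstuvwxyz")
            events.append((wrong, rng.uniform(min_delay, max_delay)))
            events.append((BACKSPACE, rng.uniform(min_delay, max_delay)))
        events.append((ch, rng.uniform(min_delay, max_delay)))
    return events


def apply_events(events):
    """Replay the events to check that the final text is what was intended."""
    out = []
    for key, _delay in events:
        if key == BACKSPACE:
            out.pop()
        else:
            out.append(key)
    return "".join(out)


events = typing_events("nice video!", random.Random())
```

Replaying the events with `apply_events` confirms that every simulated typo is corrected before the comment is submitted.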
Visual Focus Pauses
Simulating human reading and thinking pauses, not mechanical continuous operation.
All behavior parameters are configurable, supporting different platform risk control strategies.
Cost Control: Gets Cheaper Over Time
Three-level recognition architecture makes costs decrease with usage
Fingerprint Match Rate: >85% (Mature Workflows)
Token Savings: 60-80% (AX Tree Compression)
Zero Cost Operations: >70% (After Stable Running)
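To see why a high fingerprint share lowers cost, the sketch below computes an expected cost per recognition step from the share of steps resolved at each level. The per-call prices are placeholder values chosen for illustration, not ProMoi pricing, and the shares are example inputs rather than measured results.

```python
def expected_cost_per_step(shares, prices):
    """Weighted average cost, where shares give the fraction of steps resolved at each level."""
    if abs(sum(shares.values()) - 1.0) > 1e-9:
        raise ValueError("shares must sum to 1")
    return sum(shares[level] * prices[level] for level in shares)


# Hypothetical per-call prices in USD (assumptions for illustration only).
prices = {"fingerprint": 0.0, "llm": 0.001, "vision": 0.01}

# Example mix: most steps hit the fingerprint library, few need vision.
mature = {"fingerprint": 0.70, "llm": 0.25, "vision": 0.05}
# A brand-new workflow where every step needs visual recognition.
new_workflow = {"fingerprint": 0.0, "llm": 0.0, "vision": 1.0}

mature_cost = expected_cost_per_step(mature, prices)
new_cost = expected_cost_per_step(new_workflow, prices)
```

As more pages move from the vision level into the fingerprint library, the weighted average falls toward the fingerprint price of zero.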
Use Cases
Dual execution layers cover all automation needs
Social Media Matrix
Multi-channel team operations across TikTok, Instagram, and WhatsApp. The cloud phone execution layer ensures account safety.
Cloud Phone Execution
Cross-Border E-Commerce
Amazon and eBay backend operations: product listing and order processing. The browser execution layer is efficient and stable.
Browser Execution
LinkedIn Lead Generation
Automatically search, filter, and send connection requests. Browser sandbox isolation ensures account safety.
Browser Execution
Content Operations
Xiaohongshu and Douyin content publishing and engagement. The cloud phone's real device environment stays compatible with platform detection.
Cloud Phone Execution
Let AI Be Your Eyes and Brain
Perceive interfaces, plan actions, execute autonomously.
No credit card required · Private deployment available

