Writing a Crawler with Kotlin + Puppeteer

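Getting the environment set up was frustrating, but the code itself turned out to be simple, thanks to a write-up by a Japanese developer.

Main reference: バックグラウンドで使うpuppeteer with Kotlin ("Using puppeteer in the background with Kotlin")

Environment setup

Add the following to build.gradle: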

dependencies {
    implementation 'org.jetbrains.kotlinx:kotlinx-coroutines-core-js:1.1.1'
}

Using async/await in Kotlin

The API has changed since the reference was written, so part of its code no longer works as published.

import kotlin.coroutines.*
import kotlin.js.Promise

// Suspend until the Promise settles: resume with its value, or throw its error.
suspend fun <T> Promise<T>.await(): T = suspendCoroutine { cont ->
    then({ cont.resume(it) }, { cont.resumeWithException(it) })
}

// Start a suspend block right away and expose its outcome as a Promise.
fun <T> async(x: suspend () -> T): Promise<T> {
    return Promise { resolve, reject ->
        x.startCoroutine(object : Continuation<T> {
            override val context: CoroutineContext = EmptyCoroutineContext

            override fun resumeWith(result: Result<T>) {
                // fold forwards the value as-is, so a null result
                // (nullable T) still resolves instead of throwing.
                result.fold({ resolve(it) }, { reject(it) })
            }
        })
    }
}
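
The Continuation passed to startCoroutine is where a suspend block's outcome ends up, so it is worth seeing how success and failure are delivered. The sketch below is a stand-in, not the code from the reference: it replaces Promise with two plain callbacks (an assumption made so that it runs on any Kotlin target, not only Kotlin/JS), and the names runWithCallbacks and demo are hypothetical.

```kotlin
import kotlin.coroutines.Continuation
import kotlin.coroutines.CoroutineContext
import kotlin.coroutines.EmptyCoroutineContext
import kotlin.coroutines.startCoroutine

// Hypothetical helper: runs a suspend block and reports its outcome through
// two callbacks, playing the role of Promise's resolve and reject.
fun <T> runWithCallbacks(
    block: suspend () -> T,
    onSuccess: (T) -> Unit,
    onFailure: (Throwable) -> Unit
) {
    block.startCoroutine(object : Continuation<T> {
        override val context: CoroutineContext = EmptyCoroutineContext

        override fun resumeWith(result: Result<T>) {
            // fold hands the value over unchanged, so a null result counts
            // as a success; unwrapping it with `!!` would throw instead.
            result.fold(onSuccess, onFailure)
        }
    })
}

// Records what each callback receives for a null result and for an exception.
fun demo(): List<String> {
    val log = mutableListOf<String>()
    runWithCallbacks<String?>({ null }, { log += "ok: $it" }, { log += "failed" })
    runWithCallbacks<Int>({ error("boom") }, { log += "ok: $it" }, { log += "failed: ${it.message}" })
    return log
}
```

Calling demo() should return the two entries "ok: null" and "failed: boom". In the Promise-based version the failure path corresponds to reject, so a caller that ignores the returned Promise never sees the exception.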

Wrapping the Puppeteer API

import kotlin.js.Promise

@Suppress("FunctionName")
@JsModule("puppeteer")
external object Puppeteer {

    class Page {

        fun goto(url: String, options: dynamic): Promise<dynamic>

        fun waitFor(element: String, options: dynamic): Promise<dynamic>

        fun waitFor(num: Int): Promise<dynamic>

        fun content(): Promise<dynamic>

        fun click(selector: dynamic): Promise<dynamic>

        fun close(): Promise<dynamic>

        fun evaluate(pageFunction: Function<dynamic>): Promise<dynamic>

    }

    class Browser {

        fun newPage(): Promise<Page>

        fun close(): Promise<dynamic>

        fun wsEndpoint(): String

    }

    fun launch(options: dynamic): Promise<Browser>
}

Crawler code

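fun main() {
    async {
        // Launch options are built as a plain JS object through a dynamic receiver.
        val browser = Puppeteer.launch(object {}.also { it: dynamic ->
            it.devtools = true
            it.args = arrayOf("--no-sandbox", "--disable-setuid-sandbox")
            it.headless = true
        }).await()
        try {
            val page = browser.newPage().await()
            // Give up on navigation after 10 seconds.
            page.goto("http://www.baidu.com", object {}.also { it: dynamic -> it.timeout = 10 * 1000 }).await()
            // Pause for one second before reading the page.
            page.waitFor(1000).await()
            val content = page.content().await()
            println(content.toString())
        } finally {
            // Always close the browser, even when a step above throws.
            browser.close().await()
        }
    }.catch { console.error(it) } // log failures instead of dropping the rejected Promise
}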

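Note that it.headless = true turns on Chrome's headless mode; set it to false when you need a visible browser window for debugging. Puppeteer's documentation also states that devtools = true forces headless to false, so remove the it.devtools line if you actually want a headless run.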

Author: SinLapis
Link: http://sinlapis.github.io/2019/12/11/kotlin-puppeteer写爬虫/
Copyright Notice: All articles in this blog are licensed under CC BY-NC-SA 4.0 unless stating additionally.