curl_multi optimizations in waitForData #4

Open

divinity76 wants to merge 1 commit into master

Conversation

divinity76

the old code would run the multi handles, then wait until there was new data available on any of the sockets -or- until $timeout, and then return (without fetching the new data).

the new code will run the multi handles, then wait until there is new data available on any of the sockets -or- until $timeout, and if new data becomes available on the sockets before $timeout, it will fetch that data as well.

this should make the function faster (waiting until there is data but not fetching it, like the old function did, was kind of pointless).

also, the new code only runs select() if there are unfinished handles; the old code would run select() even if all handles had finished (that might save some CPU).

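for illustration, here is a minimal sketch of the pattern I'm describing (the function name, signature, and body are my own reading of the behavior, not the actual waitForData code from this repo):

function waitForData_sketch($mh, float $timeout): void
{
    $still_running = null;
    // run the multi handles (perform whatever work is ready)
    do {
        $status = curl_multi_exec($mh, $still_running);
    } while ($status === CURLM_CALL_MULTI_PERFORM);

    // only select() while there are unfinished handles; the old code
    // would select() even when every handle had already finished
    if ($still_running > 0 && curl_multi_select($mh, $timeout) > 0) {
        // new data arrived before $timeout: fetch it now instead of
        // returning and leaving it for the next call
        do {
            $status = curl_multi_exec($mh, $still_running);
        } while ($status === CURLM_CALL_MULTI_PERFORM);
    }
}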
@MartinMajor
Owner

Hi, thanks for the PR! I made this repo many years ago to avoid ever dealing with the raw curl API again, and it worked - I haven't looked at it since. The drawback is that I'm now unable to judge whether your PR really helps, because I would have to study that low-level API again. I did some quick tests and wasn't able to simulate a situation where your code would help, but that doesn't mean there is none; maybe I just ran the wrong tests. I have therefore decided neither to merge this PR nor to close it. This way, other users of this library can decide whether they want to use your optimization or not.

@divinity76
Author

that makes perfect sense!

btw, here is what I personally believe is very close to optimal use of the curl_multi API;
would you mind running some quick tests to check whether this one performs significantly differently?

/**
 * fetch all urls in parallel,
 * warning: all urls must be unique..
 *
 * @param array $urls_unique
 *            urls to fetch
 * @param int $max_connections
 *            (optional, default unlimited) max simultaneous connections
 *            (some websites will auto-ban you for "ddosing" if you send too many requests simultaneously,
 *            and some wifi routers will get unstable with too many connections..)
 * @param array $additional_curlopts
 *            (optional) set additional curl options here, each curl handle will get these options
 * @throws RuntimeException on curl_multi errors
 * @throws RuntimeException on curl_init() / curl_setopt() errors
 * @return array(url=>response,url2=>response2,...)
 */
function curl_fetch_multi_2(array $urls_unique, int $max_connections = PHP_INT_MAX, ?array $additional_curlopts = null)
{
    // $urls_unique = array_unique($urls_unique);
    $ret = array();
    $mh = curl_multi_init();
    // $workers format: [(int)$ch]=url
    $workers = array();
    $max_connections = min($max_connections, count($urls_unique));
    $unemployed_workers = array();
    for ($i = 0; $i < $max_connections; ++ $i) {
        $unemployed_worker = curl_init();
        if (! $unemployed_worker) {
            throw new \RuntimeException("failed creating unemployed worker #" . $i);
        }
        $unemployed_workers[] = $unemployed_worker;
    }
    unset($i, $unemployed_worker);

    $work = function () use (&$workers, &$unemployed_workers, &$mh, &$ret): void {
        assert(count($workers) > 0, "work() called with 0 workers!!");
        $still_running = null;
        for (;;) {
            do {
                $err = curl_multi_exec($mh, $still_running);
            } while ($err === CURLM_CALL_MULTI_PERFORM);
            if ($err !== CURLM_OK) {
                $errinfo = [
                    "multi_exec_return" => $err,
                    "curl_multi_errno" => curl_multi_errno($mh),
                    "curl_multi_strerror" => curl_multi_strerror($err)
                ];
                $errstr = "curl_multi_exec error: " . str_replace([
                    "\r",
                    "\n"
                ], "", var_export($errinfo, true));
                throw new \RuntimeException($errstr);
            }
            if ($still_running < count($workers)) {
                // some workers have finished downloading, process them
                // echo "processing!";
                break;
            } else {
                // no workers finished yet, sleep-wait for workers to finish downloading.
                // echo "select()ing!";
                curl_multi_select($mh, 1);
                // sleep(1);
            }
        }
        while (false !== ($info = curl_multi_info_read($mh))) {
            if ($info['msg'] !== CURLMSG_DONE) {
                // no idea what this is, it's not the message we're looking for though, ignore it.
                continue;
            }
            if ($info['result'] !== CURLE_OK) { // 'result' holds a CURLE_* code, not a CURLM_* code
                $errinfo = [
                    "effective_url" => curl_getinfo($info['handle'], CURLINFO_EFFECTIVE_URL),
                    "curl_errno" => curl_errno($info['handle']),
                    "curl_error" => curl_error($info['handle']),
                    "curl_multi_errno" => curl_multi_errno($mh),
                    "curl_multi_strerror" => curl_multi_strerror(curl_multi_errno($mh))
                ];
                $errstr = "curl_multi worker error: " . str_replace([
                    "\r",
                    "\n"
                ], "", var_export($errinfo, true));
                throw new \RuntimeException($errstr);
            }
            $ch = $info['handle'];
            $ch_index = (int) $ch; // resource id; note: on PHP >= 8, curl handles are objects, not resources
            $url = $workers[$ch_index];
            $ret[$url] = curl_multi_getcontent($ch);
            unset($workers[$ch_index]);
            curl_multi_remove_handle($mh, $ch);
            $unemployed_workers[] = $ch;
        }
    };
    $opts = array(
        CURLOPT_URL => '',
        CURLOPT_RETURNTRANSFER => 1,
        CURLOPT_ENCODING => ''
    );
    if (! empty($additional_curlopts)) {
        // array_merge() would renumber the integer CURLOPT_* keys, so foreach() is used instead
        foreach ($additional_curlopts as $key => $val) {
            $opts[$key] = $val;
        }
    }
    foreach ($urls_unique as $url) {
        while (empty($unemployed_workers)) {
            $work();
        }
        $new_worker = array_pop($unemployed_workers);
        $opts[CURLOPT_URL] = $url;
        if (! curl_setopt_array($new_worker, $opts)) {
            $errstr = "curl_setopt_array failed: " . curl_errno($new_worker) . ": " . curl_error($new_worker) . " " . var_export($opts, true);
            throw new \RuntimeException($errstr);
        }
        $workers[(int) $new_worker] = $url;
        curl_multi_add_handle($mh, $new_worker);
    }
    while (count($workers) > 0) {
        $work();
    }
    foreach ($unemployed_workers as $unemployed_worker) {
        curl_close($unemployed_worker);
    }
    curl_multi_close($mh);
    return $ret;
}
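
a quick usage sketch, in case it helps (the URLs and options here are placeholders for illustration, not anything from this repo):

$responses = curl_fetch_multi_2(array(
    "https://example.com/a", // placeholder url
    "https://example.com/b"  // placeholder url
), 10, array(
    CURLOPT_TIMEOUT => 30
));
echo strlen($responses["https://example.com/a"]) . " bytes fetched\n";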
  • one notable optimization in this one, which async-request lacks, is that it reuses handles rather than repeatedly creating and destroying them; here is the same thing, but without the reuse optimization:
/**
 * fetch all urls in parallel,
 * warning: all urls must be unique..
 *
 * @param array $urls_unique
 *            urls to fetch
 * @param int $max_connections
 *            (optional, default unlimited) max simultaneous connections
 *            (some websites will auto-ban you for "ddosing" if you send too many requests simultaneously,
 *            and some wifi routers will get unstable with too many connections..)
 * @param array $additional_curlopts
 *            (optional) set additional curl options here, each curl handle will get these options
 * @throws RuntimeException on curl_multi errors
 * @throws RuntimeException on curl_init() / curl_setopt() errors
 * @return array(url=>response,url2=>response2,...)
 */
function curl_fetch_multi_1(array $urls_unique, int $max_connections = PHP_INT_MAX, ?array $additional_curlopts = null)
{
    // $urls_unique = array_unique($urls_unique);
    $ret = array();
    $mh = curl_multi_init();
    // $workers format: [(int)$ch]=url
    $workers = array();
    $work = function () use (&$workers, &$mh, &$ret): void {
        assert(count($workers) > 0, "work() called with 0 workers!!");
        $still_running = null;
        for (;;) {
            do {
                $err = curl_multi_exec($mh, $still_running);
            } while ($err === CURLM_CALL_MULTI_PERFORM);
            if ($err !== CURLM_OK) {
                $errinfo = [
                    "multi_exec_return" => $err,
                    "curl_multi_errno" => curl_multi_errno($mh),
                    "curl_multi_strerror" => curl_multi_strerror($err)
                ];
                $errstr = "curl_multi_exec error: " . str_replace([
                    "\r",
                    "\n"
                ], "", var_export($errinfo, true));
                throw new \RuntimeException($errstr);
            }
            if ($still_running < count($workers)) {
                // some workers have finished downloading, process them
                // echo "processing!";
                break;
            } else {
                // no workers finished yet, sleep-wait for workers to finish downloading.
                // echo "select()ing!";
                curl_multi_select($mh, 1);
                // sleep(1);
            }
        }
        while (false !== ($info = curl_multi_info_read($mh))) {
            if ($info['msg'] !== CURLMSG_DONE) {
                // no idea what this is, it's not the message we're looking for though, ignore it.
                continue;
            }
            if ($info['result'] !== CURLE_OK) { // 'result' holds a CURLE_* code, not a CURLM_* code
                $errinfo = [
                    "effective_url" => curl_getinfo($info['handle'], CURLINFO_EFFECTIVE_URL),
                    "curl_errno" => curl_errno($info['handle']),
                    "curl_error" => curl_error($info['handle']),
                    "curl_multi_errno" => curl_multi_errno($mh),
                    "curl_multi_strerror" => curl_multi_strerror(curl_multi_errno($mh))
                ];
                $errstr = "curl_multi worker error: " . str_replace([
                    "\r",
                    "\n"
                ], "", var_export($errinfo, true));
                throw new \RuntimeException($errstr);
            }
            $ch = $info['handle'];
            $ch_index = (int) $ch; // resource id; note: on PHP >= 8, curl handles are objects, not resources
            $url = $workers[$ch_index];
            $ret[$url] = curl_multi_getcontent($ch);
            unset($workers[$ch_index]);
            curl_multi_remove_handle($mh, $ch);
            curl_close($ch);
        }
    };
    $opts = array(
        CURLOPT_URL => '',
        CURLOPT_RETURNTRANSFER => 1,
        CURLOPT_ENCODING => ''
    );
    if (! empty($additional_curlopts)) {
        // array_merge() would renumber the integer CURLOPT_* keys, so foreach() is used instead
        foreach ($additional_curlopts as $key => $val) {
            $opts[$key] = $val;
        }
    }
    foreach ($urls_unique as $url) {
        while (count($workers) >= $max_connections) {
            $work();
        }
        $new_worker = curl_init();
        $opts[CURLOPT_URL] = $url;
        if (! curl_setopt_array($new_worker, $opts)) {
            $errstr = "curl_setopt_array failed: " . curl_errno($new_worker) . ": " . curl_error($new_worker) . " " . var_export($opts, true);
            throw new \RuntimeException($errstr);
        }
        $workers[(int) $new_worker] = $url;
        curl_multi_add_handle($mh, $new_worker);
    }
    while (count($workers) > 0) {
        $work();
    }
    curl_multi_close($mh);
    return $ret;
}

if you can run some quick tests against async-request, it'd be interesting to see whether it makes any noticeable difference at all

@divinity76
Author

divinity76 commented Oct 13, 2020

here is some quick benchmark code I hacked together; I guess it should be fairly easy to swap one of them out for async-request

<?php
declare(strict_types = 1);
$urls = [];
$urls_to_create = 1000;
for ($i = 0; $i < $urls_to_create; ++ $i) {
    // $urls[] = "http://x2.ratma.net?{$i}";
    $urls[] = "http://127.0.0.1:81/?{$i}";
}
$method1_worst = PHP_INT_MIN;
$method1_best = PHP_INT_MAX;
$method2_worst = PHP_INT_MIN;
$method2_best = PHP_INT_MAX;

for ($i = 0; $i < 10; ++ $i) {
    $start = microtime(true);
    curl_fetch_multi_2($urls, 100);
    $end = microtime(true) - $start;
    if ($end > $method2_worst) {
        $method2_worst = $end;
    }
    if ($end < $method2_best) {
        $method2_best = $end;
    }
    // //////
    $start = microtime(true);
    curl_fetch_multi_1($urls, 100);
    $end = microtime(true) - $start;
    if ($end > $method1_worst) {
        $method1_worst = $end;
    }
    if ($end < $method1_best) {
        $method1_best = $end;
    }
}
$data["1_worst"] = $method1_worst;
$data["1_best"] = $method1_best;
$data["2_worst"] = $method2_worst;
$data["2_best"] = $method2_best;
$data["winner"] = ($method1_best === $method2_best ? "tie" : ($method1_best > $method2_best ? "method 2" : "method 1"));
$data["diff"] = number_format(abs($method1_best - $method2_best), 5);
var_export($data);

it generates output like

hans@xDevAd:~/projects/misc$ time php bech.php
array (
  '1_worst' => 2.691694974899292,
  '1_best' => 2.0963218212127686,
  '2_worst' => 2.234786033630371,
  '2_best' => 2.068150043487549,
  'winner' => 'method 2',
  'diff' => '0.02817',
)
real	0m43,616s
user	0m2,539s
sys	0m9,004s
hans@xDevAd:~/projects/misc$ time php bech.php
array (
  '1_worst' => 3.6277899742126465,
  '1_best' => 1.8193140029907227,
  '2_worst' => 2.7415170669555664,
  '2_best' => 1.7830519676208496,
  'winner' => 'method 2',
  'diff' => '0.03626',
)
real	0m45,064s
user	0m2,553s
sys	0m8,544s
hans@xDevAd:~/projects/misc$ time php bech.php
array (
  '1_worst' => 0.1131289005279541,
  '1_best' => 0.09237909317016602,
  '2_worst' => 0.12538981437683105,
  '2_best' => 0.09183907508850098,
  'winner' => 'method 2',
  'diff' => '0.00054',
)
real	0m2,083s
user	0m1,193s
sys	0m0,844s
hans@xDevAd:~/projects/misc$ time php bech.php
array (
  '1_worst' => 0.1123800277709961,
  '1_best' => 0.09178304672241211,
  '2_worst' => 0.14417290687561035,
  '2_best' => 0.0872950553894043,
  'winner' => 'method 2',
  'diff' => '0.00449',
)
real	0m2,088s
user	0m1,162s
sys	0m0,864s

edit: it also appears that method 2 (with the handle re-use optimization) beats method 1 (create/delete/create/delete) every time, so there is a little to gain by doing it (a 28 millisecond speedup above, but only when DNS was involved: the first 2 runs were remote norway->canada, the last 2 were 127.0.0.1), though seemingly not much. I haven't tested either of these against async-request though.
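
for context, my understanding of why handle re-use helps (this is an assumption about libcurl defaults, not something this benchmark isolates): a re-used easy handle keeps its DNS cache and connection pool between transfers, so follow-up requests to the same host can skip the lookup and handshake:

$ch = curl_init();
curl_setopt_array($ch, array(
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_URL => "https://example.com/first" // placeholder url
));
curl_exec($ch); // cold: DNS lookup + TCP (and possibly TLS) handshake
curl_setopt($ch, CURLOPT_URL, "https://example.com/second");
curl_exec($ch); // warm: may re-use the cached DNS entry and the open connection
curl_close($ch);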

also, http://x2.ratma.net is a server in Canada that you're free to spam if you want something remote to test against. it's on a shared (and sometimes busy) 1Gbps connection
