curl_multi optimizations in waitForData #4
base: master
Conversation
The old code would run the multi handles, then wait until there was new data available on any of the sockets -or- until $timeout, and then return without fetching the new data. The new code runs the multi handles, then waits until there is new data available on any of the sockets -or- until $timeout; if new data becomes available on the sockets before $timeout, it fetches that data as well. This should make the function faster (waiting until there is data, and then not fetching it, like the old function did, was kind of dumb). Also, the new code only runs select() if there are unfinished handles, whereas the old code would run select() even if all handles had finished (that might save some CPU).
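Roughly, the pattern described above looks like this (a sketch only, not the actual diff; the standalone function name, the $mh parameter and the return value are placeholders, since the real waitForData is a method on the library's class):
// sketch of the "wait, then also fetch" pattern from the PR description
function waitForData_sketch($mh, float $timeout): int
{
    $stillRunning = 0;
    // run the multi handles
    do {
        $status = curl_multi_exec($mh, $stillRunning);
    } while ($status === CURLM_CALL_MULTI_PERFORM);
    if ($stillRunning > 0) {
        // only select() while there are unfinished handles
        curl_multi_select($mh, $timeout);
        // if data became available before $timeout, fetch it right away
        // instead of returning and leaving it for the next call
        do {
            $status = curl_multi_exec($mh, $stillRunning);
        } while ($status === CURLM_CALL_MULTI_PERFORM);
    }
    return $stillRunning;
}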
Hi, thanks for the PR! I made this repo many years ago to avoid dealing with the raw curl API ever again, and it worked - I haven't looked at it since then. The drawback is that I'm now unable to decide whether your PR really helps, because I would have to study that low-level API again. I did some quick tests and I wasn't able to simulate a situation where your code would help, but that doesn't mean there is none; maybe I just did the wrong tests. Therefore I decided not to merge this PR, nor to close it. This way, other users of this library can decide whether they want to use your optimization or not.
That makes perfect sense! Btw, here is what I personally believe is very close to optimal use of the curl_multi API:
/**
* fetch all urls in parallel,
* warning: all urls must be unique..
*
* @param array $urls_unique
* urls to fetch
* @param int $max_connections
* (optional, default unlimited) max simultaneous connections
* (some websites will auto-ban you for "ddosing" if you send too many requests simultaneously,
and some wifi routers will get unstable with too many connections)
* @param array $additional_curlopts
* (optional) set additional curl options here, each curl handle will get these options
* @throws RuntimeException on curl_multi errors
* @throws RuntimeException on curl_init() / curl_setopt() errors
* @return array(url=>response,url2=>response2,...)
*/
function curl_fetch_multi_2(array $urls_unique, int $max_connections = PHP_INT_MAX, ?array $additional_curlopts = null)
{
// $urls_unique = array_unique($urls_unique);
$ret = array();
$mh = curl_multi_init();
// $workers format: [(int)$ch]=url
$workers = array();
$max_connections = min($max_connections, count($urls_unique));
$unemployed_workers = array();
for ($i = 0; $i < $max_connections; ++ $i) {
$unemployed_worker = curl_init();
if (! $unemployed_worker) {
throw new \RuntimeException("failed creating unemployed worker #" . $i);
}
$unemployed_workers[] = $unemployed_worker;
}
unset($i, $unemployed_worker);
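// $work drives the multi handle until at least one transfer has finished, then
// harvests every finished handle: the response is stored in $ret and the handle
// is returned to the unemployed pool for re-use.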
$work = function () use (&$workers, &$unemployed_workers, &$mh, &$ret): void {
assert(count($workers) > 0, "work() called with 0 workers!!");
$still_running = null;
for (;;) {
do {
$err = curl_multi_exec($mh, $still_running);
} while ($err === CURLM_CALL_MULTI_PERFORM);
if ($err !== CURLM_OK) {
$errinfo = [
"multi_exec_return" => $err,
"curl_multi_errno" => curl_multi_errno($mh),
"curl_multi_strerror" => curl_multi_strerror($err)
];
$errstr = "curl_multi_exec error: " . str_replace([
"\r",
"\n"
], "", var_export($errinfo, true));
throw new \RuntimeException($errstr);
}
if ($still_running < count($workers)) {
// some workers have finished downloading, process them
// echo "processing!";
break;
} else {
// no workers finished yet, sleep-wait for workers to finish downloading.
// echo "select()ing!";
curl_multi_select($mh, 1);
// sleep(1);
}
}
while (false !== ($info = curl_multi_info_read($mh))) {
if ($info['msg'] !== CURLMSG_DONE) {
// no idea what this is, it's not the message we're looking for though, ignore it.
continue;
}
if ($info['result'] !== CURLE_OK) {
$errinfo = [
"effective_url" => curl_getinfo($info['handle'], CURLINFO_EFFECTIVE_URL),
"curl_errno" => curl_errno($info['handle']),
"curl_error" => curl_error($info['handle']),
"curl_multi_errno" => curl_multi_errno($mh),
"curl_multi_strerror" => curl_multi_strerror(curl_multi_errno($mh))
];
$errstr = "curl_multi worker error: " . str_replace([
"\r",
"\n"
], "", var_export($errinfo, true));
throw new \RuntimeException($errstr);
}
$ch = $info['handle'];
$ch_index = (int) $ch;
$url = $workers[$ch_index];
$ret[$url] = curl_multi_getcontent($ch);
unset($workers[$ch_index]);
curl_multi_remove_handle($mh, $ch);
$unemployed_workers[] = $ch;
}
};
$opts = array(
CURLOPT_URL => '',
CURLOPT_RETURNTRANSFER => 1,
CURLOPT_ENCODING => ''
);
if (! empty($additional_curlopts)) {
// I would have used array_merge(), but it does scary stuff with integer keys; foreach() is easier to reason about
foreach ($additional_curlopts as $key => $val) {
$opts[$key] = $val;
}
}
foreach ($urls_unique as $url) {
while (empty($unemployed_workers)) {
$work();
}
$new_worker = array_pop($unemployed_workers);
$opts[CURLOPT_URL] = $url;
if (! curl_setopt_array($new_worker, $opts)) {
$errstr = "curl_setopt_array failed: " . curl_errno($new_worker) . ": " . curl_error($new_worker) . " " . var_export($opts, true);
throw new \RuntimeException($errstr);
}
$workers[(int) $new_worker] = $url;
curl_multi_add_handle($mh, $new_worker);
}
while (count($workers) > 0) {
$work();
}
foreach ($unemployed_workers as $unemployed_worker) {
curl_close($unemployed_worker);
}
curl_multi_close($mh);
return $ret;
}
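For illustration, calling it could look like this (the URLs and the extra curl option here are made up, not part of the original comment):
$responses = curl_fetch_multi_2(
    [
        "https://example.com/a",
        "https://example.com/b",
        "https://example.com/c"
    ],
    2, // at most 2 simultaneous connections
    [
        CURLOPT_TIMEOUT => 10 // applied to every handle
    ]
);
foreach ($responses as $url => $body) {
    echo $url . ": " . strlen($body) . " bytes\n";
}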
/**
* fetch all urls in parallel,
* warning: all urls must be unique..
*
* @param array $urls_unique
* urls to fetch
* @param int $max_connections
* (optional, default unlimited) max simultaneous connections
* (some websites will auto-ban you for "ddosing" if you send too many requests simultaneously,
and some wifi routers will get unstable with too many connections)
* @param array $additional_curlopts
* (optional) set additional curl options here, each curl handle will get these options
* @throws RuntimeException on curl_multi errors
* @throws RuntimeException on curl_init() / curl_setopt() errors
* @return array(url=>response,url2=>response2,...)
*/
function curl_fetch_multi_1(array $urls_unique, int $max_connections = PHP_INT_MAX, ?array $additional_curlopts = null)
{
// $urls_unique = array_unique($urls_unique);
$ret = array();
$mh = curl_multi_init();
// $workers format: [(int)$ch]=url
$workers = array();
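// $work drives the multi handle until at least one transfer has finished, then
// harvests every finished handle: the response is stored in $ret and the handle
// is removed and closed (no re-use in this variant).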
$work = function () use (&$workers, &$mh, &$ret): void {
assert(count($workers) > 0, "work() called with 0 workers!!");
$still_running = null;
for (;;) {
do {
$err = curl_multi_exec($mh, $still_running);
} while ($err === CURLM_CALL_MULTI_PERFORM);
if ($err !== CURLM_OK) {
$errinfo = [
"multi_exec_return" => $err,
"curl_multi_errno" => curl_multi_errno($mh),
"curl_multi_strerror" => curl_multi_strerror($err)
];
$errstr = "curl_multi_exec error: " . str_replace([
"\r",
"\n"
], "", var_export($errinfo, true));
throw new \RuntimeException($errstr);
}
if ($still_running < count($workers)) {
// some workers have finished downloading, process them
// echo "processing!";
break;
} else {
// no workers finished yet, sleep-wait for workers to finish downloading.
// echo "select()ing!";
curl_multi_select($mh, 1);
// sleep(1);
}
}
while (false !== ($info = curl_multi_info_read($mh))) {
if ($info['msg'] !== CURLMSG_DONE) {
// no idea what this is, it's not the message we're looking for though, ignore it.
continue;
}
if ($info['result'] !== CURLE_OK) {
$errinfo = [
"effective_url" => curl_getinfo($info['handle'], CURLINFO_EFFECTIVE_URL),
"curl_errno" => curl_errno($info['handle']),
"curl_error" => curl_error($info['handle']),
"curl_multi_errno" => curl_multi_errno($mh),
"curl_multi_strerror" => curl_multi_strerror(curl_multi_errno($mh))
];
$errstr = "curl_multi worker error: " . str_replace([
"\r",
"\n"
], "", var_export($errinfo, true));
throw new \RuntimeException($errstr);
}
$ch = $info['handle'];
$ch_index = (int) $ch;
$url = $workers[$ch_index];
$ret[$url] = curl_multi_getcontent($ch);
unset($workers[$ch_index]);
curl_multi_remove_handle($mh, $ch);
curl_close($ch);
}
};
$opts = array(
CURLOPT_URL => '',
CURLOPT_RETURNTRANSFER => 1,
CURLOPT_ENCODING => ''
);
if (! empty($additional_curlopts)) {
// I would have used array_merge(), but it does scary stuff with integer keys; foreach() is easier to reason about
foreach ($additional_curlopts as $key => $val) {
$opts[$key] = $val;
}
}
foreach ($urls_unique as $url) {
while (count($workers) >= $max_connections) {
$work();
}
$new_worker = curl_init();
$opts[CURLOPT_URL] = $url;
if (! curl_setopt_array($new_worker, $opts)) {
$errstr = "curl_setopt_array failed: " . curl_errno($new_worker) . ": " . curl_error($new_worker) . " " . var_export($opts, true);
throw new \RuntimeException($errstr);
}
$workers[(int) $new_worker] = $url;
curl_multi_add_handle($mh, $new_worker);
}
while (count($workers) > 0) {
$work();
}
curl_multi_close($mh);
return $ret;
}
If you can run some quick tests against async-request, it'd be interesting to see if it makes any noticeable differences at all.
Here is some quick benchmark code I hacked together; I guess it should be fairly easy to switch out one of them with async-request:
<?php
declare(strict_types = 1);
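// note: assumes the curl_fetch_multi_1() and curl_fetch_multi_2() functions from above are defined (or included) here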
$urls = [];
$urls_to_create = 1000;
for ($i = 0; $i < $urls_to_create; ++ $i) {
// $urls[] = "http://x2.ratma.net?{$i}";
$urls[] = "http://127.0.0.1:81/?{$i}";
}
$method1_worst = PHP_INT_MIN;
$method1_best = PHP_INT_MAX;
$method2_worst = PHP_INT_MIN;
$method2_best = PHP_INT_MAX;
for ($i = 0; $i < 10; ++ $i) {
$start = microtime(true);
curl_fetch_multi_2($urls, 100);
$end = microtime(true) - $start;
if ($end > $method2_worst) {
$method2_worst = $end;
}
if ($end < $method2_best) {
$method2_best = $end;
}
// //////
$start = microtime(true);
curl_fetch_multi_1($urls, 100);
$end = microtime(true) - $start;
if ($end > $method1_worst) {
$method1_worst = $end;
}
if ($end < $method1_best) {
$method1_best = $end;
}
}
$data["1_worst"] = $method1_worst;
$data["1_best"] = $method1_best;
$data["2_worst"] = $method2_worst;
$data["2_best"] = $method2_best;
$data["winner"] = ($method1_best === $method2_best ? "tie" : ($method1_best > $method2_best ? "method 2" : "method 1"));
$data["diff"] = number_format(abs($method1_best - $method2_best), 5);
var_export($data);
It generates outputs like:
hans@xDevAd:~/projects/misc$ time php bech.php
array (
'1_worst' => 2.691694974899292,
'1_best' => 2.0963218212127686,
'2_worst' => 2.234786033630371,
'2_best' => 2.068150043487549,
'winner' => 'method 2',
'diff' => '0.02817',
)
real 0m43,616s
user 0m2,539s
sys 0m9,004s
hans@xDevAd:~/projects/misc$ time php bech.php
array (
'1_worst' => 3.6277899742126465,
'1_best' => 1.8193140029907227,
'2_worst' => 2.7415170669555664,
'2_best' => 1.7830519676208496,
'winner' => 'method 2',
'diff' => '0.03626',
)
real 0m45,064s
user 0m2,553s
sys 0m8,544s
hans@xDevAd:~/projects/misc$ time php bech.php
array (
'1_worst' => 0.1131289005279541,
'1_best' => 0.09237909317016602,
'2_worst' => 0.12538981437683105,
'2_best' => 0.09183907508850098,
'winner' => 'method 2',
'diff' => '0.00054',
)
real 0m2,083s
user 0m1,193s
sys 0m0,844s
hans@xDevAd:~/projects/misc$ time php bech.php
array (
'1_worst' => 0.1123800277709961,
'1_best' => 0.09178304672241211,
'2_worst' => 0.14417290687561035,
'2_best' => 0.0872950553894043,
'winner' => 'method 2',
'diff' => '0.00449',
)
real 0m2,088s
user 0m1,162s
sys 0m0,864s
Edit: it also appears that method 2 (the handle re-use optimization) beats method 1 (create/delete/create/delete) every time, so there's a little to gain by doing it (a 28 millisecond speed increase above, but only when DNS is involved; the first 2 runs were remote Norway -> Canada, the last 2 were against 127.0.0.1), but seemingly not much. I haven't tested either of these against async-request though. Also, http://x2.ratma.net is a server in Canada that you're free to spam if you want something remote to test against; it's on a shared (and sometimes busy) 1Gbps connection.
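If it helps, here is a small sketch of how the benchmark loop could be generalized so another implementation (async-request, for instance) can be slotted in without touching the timing code; the callable-based fetcher interface is just an assumption for illustration, not async-request's actual API:
/**
 * time a fetcher over several runs and return its best/worst wall-clock times.
 *
 * @param callable $fetcher function(array $urls): array, e.g. a wrapper around
 *                          curl_fetch_multi_1/2 or any other implementation
 */
function benchmark_fetcher(callable $fetcher, array $urls, int $runs = 10): array
{
    $best = PHP_INT_MAX;
    $worst = PHP_INT_MIN;
    for ($i = 0; $i < $runs; ++$i) {
        $start = microtime(true);
        $fetcher($urls);
        $elapsed = microtime(true) - $start;
        $best = min($best, $elapsed);
        $worst = max($worst, $elapsed);
    }
    return ["best" => $best, "worst" => $worst];
}
// usage, e.g.:
// $result = benchmark_fetcher(fn (array $urls) => curl_fetch_multi_2($urls, 100), $urls);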