This repository has been archived by the owner on May 28, 2019. It is now read-only.

Merge galaxy3
bluebore committed Sep 12, 2016
2 parents cbe47c9 + 1937ffc commit b903c05
Showing 190 changed files with 34,915 additions and 28 deletions.
57 changes: 34 additions & 23 deletions .gitignore
@@ -22,27 +22,38 @@ _testmain.go
*.exe
*.test
*.prof
*.sconsign*

*.pb.h
*.pb.cc

output/*

# bin files
master
scheduler
agent
initd
thirdsrc
thirdparty
sandbox/*.log
galaxy
initd_cli
initd_cli_server
sandbox/data/
sandbox/work_dir/
sandbox/gc_dir/
sandbox/galaxy.flag
sandbox/sample.json
test_agent
test_initd
# binary
#agent
#appmaster
#appworker
#resman

# dirs
thirdparty/
thirdsrc/
.*/

# pbs
*.pb.*

# ctags
tags
GRTAGS
GTAGS
GPATH

*~
*.swp
*.swo
tmp
galaxy.flag

*.orig

# logs
*.INFO*
*.WARNING*
*.ERROR*
galaxy.flag
5 changes: 1 addition & 4 deletions .travis.yml
@@ -1,10 +1,7 @@
language: cpp
compiler: gcc
env:
- PROTOBUF_VERSION=2.6.0
install:
- sudo apt-get update
- sudo apt-get install libreadline-dev
- sudo apt-get install scons
script:
- sh build.sh
- cd sandbox && ./quick_test.sh
2 changes: 1 addition & 1 deletion LICENSE
@@ -1,4 +1,4 @@
Copyright (c) 2015, The Galaxy Authors.
Copyright (c) 2016, Baidu.com, Inc.
All rights reserved.

Redistribution and use in source and binary forms, with or without
109 changes: 109 additions & 0 deletions README.md
@@ -0,0 +1,109 @@
[![Build Status](https://travis-ci.org/baidu/galaxy.svg?branch=galaxy3)](https://travis-ci.org/baidu/galaxy)

galaxy 3.0

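Galaxy 3.0 Design
=================

# Background

Galaxy 3.0 is a refactoring of Galaxy 2.0. It mainly addresses the following problems:

1. Container management and service management are tightly coupled: every service upgrade, start, or stop entails destroying and rescheduling containers;
2. There is no disk management; only the home disk can be managed;
3. User quotas and accounting are not supported;
4. Machine management features are missing;
5. The naming feature has low availability;
6. The trace feature is incomplete;

# System Architecture

Galaxy 3.0 is architected in two layers, a resource management layer and a service management layer. Each layer uses a master-slave architecture:
1. The resource management layer consists of ResMan (Resource Manager) and the Agents;
2. The service management layer consists of AppMaster and the AppWorkers;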


    +-------------------+-----------------------------+
    |                   |              |              |
    |                   |  MapReduce   |    Spark     |
    |                   |              |              |
    |                   +-----------------------------+
    |                                                 |
    |       Service Management                        | ---> {AppMaster + AppWorkers}
    |                                                 |
    +-------------------------------------------------+
    |                                                 |
    |       Resource Management                       | ---> {ResMan + Agents}
    |                                                 |
    +-------------------------------------------------+

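## 1. Resource Management
Components: ResMan + Agents
A Galaxy cluster has only one active ResMan, which schedules containers by finding, for each container, a machine that satisfies its deployment resource requirements;
ResMan creates and destroys containers by communicating with the Agent deployed on each machine;
Container: a resource-isolated environment based on Linux cgroup and namespace technology;
By default an AppWorker process is started in each container; it is the first process in the container, i.e. the root process;
ResMan exposes no interface to ordinary users; it is used only by internal components and cluster administrators;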

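## 2. Service Management
Components: AppMaster + AppWorkers
AppMaster is the only entry point through which external users operate Galaxy;
A Galaxy cluster usually has only one AppMaster, which is responsible for deploying, updating, starting, and stopping services and for managing their state; it dispatches service instances to containers on the machines, starts them there, and tracks their state;
AppMaster creates containers by calling ResMan's RPC interface, and the AppWorker process is launched automatically inside each container;
The AppWorker process in a container communicates with AppMaster to obtain the commands to execute inside the container, including deploy, start, stop, update, and so on;
AppWorker reports service status to AppMaster, for example whether the hosted service is running normally and the process exit code;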

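## Scheduling Logic

A user-submitted Job consists mainly of two parts: resource requirements + program description
Resource requirements: number of CPU cores, memory size, disk capacity, machine label, port range, mount paths
Program description: deploy command, start command, stop command, update command, version number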
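
The two parts of a Job could be modeled roughly as follows. This is a minimal Python sketch for illustration only: the class names, field names, and units are assumptions and are not taken from the Galaxy protocol definitions.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ResourceRequirement:
    """Hypothetical container resource request (field names are assumptions)."""
    cpu_cores: int
    memory_gb: int
    disk_gb: int
    label: str = ""                       # machine label the job must land on
    port_range: Tuple[int, int] = (0, 0)  # inclusive (low, high) port range
    mount_paths: List[str] = field(default_factory=list)


@dataclass
class ProgramDescription:
    """Hypothetical description of how to run the service inside a container."""
    deploy_cmd: str
    start_cmd: str
    stop_cmd: str
    update_cmd: str
    version: str


@dataclass
class Job:
    name: str
    replicas: int
    resources: ResourceRequirement
    program: ProgramDescription


example_job = Job(
    name="demo",
    replicas=3,
    resources=ResourceRequirement(cpu_cores=2, memory_gb=4, disk_gb=20,
                                  label="ssd", port_range=(8000, 8010),
                                  mount_paths=["/home/work"]),
    program=ProgramDescription(deploy_cmd="tar xf app.tar.gz",
                               start_cmd="./app --port=8000",
                               stop_cmd="pkill app",
                               update_cmd="tar xf app.tar.gz",
                               version="1.0.0"),
)
```

Keeping the resource part separate from the program part mirrors the layering described above: only the resource part matters when placing containers, and only the program part matters when running the service inside them.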

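### 1. ResMan scheduling logic
ResMan periodically queries the Agents to learn the allocatable resources on each Agent;
ResMan continually checks whether any containers are in the Pending state and looks for an Agent with enough resources on which to create them;
A container whose creation fails returns to the Pending state and waits to be rescheduled;
For a container that does not match expectations, ResMan orders the Agent to destroy it, and the container re-enters the Pending state;
ResMan ensures that the number of containers always matches what the user requested;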
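
One pass of the Pending-container loop can be sketched as below. This is a simplified Python illustration, not the actual ResMan code: the `Agent` and `Container` classes, the `Running` state name, the first-fit placement, and the `create` callback (standing in for the RPC to an Agent) are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Agent:
    endpoint: str
    free_cpu: int
    free_mem: int


@dataclass
class Container:
    cid: str
    cpu: int
    mem: int
    state: str = "Pending"
    agent: Optional[str] = None


def find_agent(agents: List[Agent], c: Container) -> Optional[Agent]:
    """First-fit: return the first agent with enough free CPU and memory."""
    for a in agents:
        if a.free_cpu >= c.cpu and a.free_mem >= c.mem:
            return a
    return None


def schedule_once(agents: List[Agent], containers: List[Container],
                  create: Callable[[Agent, Container], bool]) -> None:
    """Try to place every Pending container; failures simply stay Pending."""
    for c in containers:
        if c.state != "Pending":
            continue
        a = find_agent(agents, c)
        if a is None:
            continue  # no agent has room; retry on the next pass
        if create(a, c):
            a.free_cpu -= c.cpu
            a.free_mem -= c.mem
            c.state, c.agent = "Running", a.endpoint
        # On creation failure the container keeps state "Pending".
```

Calling `schedule_once` repeatedly gives the retry behavior described above, because a container that could not be created is picked up again on the next pass.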

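### 2. AppMaster scheduling logic
AppMaster waits for periodic reports from the AppWorkers;
If the service state reported by an AppWorker does not match what AppMaster expects, AppMaster returns commands for the AppWorker to execute;
> a) Deploy: the AppWorker reports that no service is currently running, and AppMaster returns a deploy command;
> b) Start: the AppWorker reports that deployment succeeded, and AppMaster returns a start command;
> c) Update: the AppWorker reports the version of the running service; AppMaster finds a mismatch and returns an update command;
> d) Failure handling: the AppWorker reports a failure (deploy, start, or update); AppMaster records the failure and decides, according to policy, whether the AppWorker should keep retrying;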
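
Cases a) to d) amount to a small decision function that maps a report to the next command. The Python sketch below is illustrative only: the state names, the command strings, and the fixed retry limit used as the "policy" are assumptions, not Galaxy's actual protocol.

```python
from typing import Optional

# Which command to retry after each kind of failure (hypothetical state names).
RETRY_COMMAND = {
    "deploy_failed": "deploy",
    "start_failed": "start",
    "update_failed": "update",
}


def next_command(state: str, version: str, wanted_version: str,
                 failure_count: int, max_retries: int = 3) -> Optional[str]:
    """Return the command for the AppWorker, or None if nothing should be done.

    failure_count is the number of failures already recorded for this worker.
    """
    if state in RETRY_COMMAND:
        # d) retry only while the policy (here: a simple limit) allows it
        return RETRY_COMMAND[state] if failure_count < max_retries else None
    if state == "empty":
        return "deploy"   # a) nothing is running yet
    if state == "deployed":
        return "start"    # b) deployment succeeded
    if state == "running" and version != wanted_version:
        return "update"   # c) version mismatch
    return None           # state already matches expectations
```

Because the function is driven purely by what the worker reports, the AppMaster does not need to push anything; it only answers each report.
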
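## Fault Tolerance

1. Both ResMan and AppMaster have backups, which stay on standby by competing for a lock in Nexus;
2. The Agent tracks the state of each container and reports it to ResMan; when there are too few containers, or containers do not meet ResMan's requirements, scheduling is needed: containers are created or deleted;
3. AppWorker tracks the state of the user program; when the program core-dumps, exits abnormally, or is killed by cgroup, AppWorker reports the state to AppMaster, and AppMaster decides according to the specified policy whether to order AppWorker to relaunch the user's service;
4. Because of machine defects or network partitions, ResMan may believe there are enough containers while AppMaster finds there are too few service instances:
> For example: a bad disk or an occupied port can keep the user service from ever starting;
> In that case, AppMaster can call ResMan's interface to increase the number of containers (up to an upper limit);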
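
The capped increase in item 4 can be sketched as a one-line calculation. This Python sketch assumes a hypothetical helper name and parameters, and assumes the current container count does not already exceed the limit.

```python
def containers_to_request(wanted_instances: int, healthy_instances: int,
                          current_containers: int, max_containers: int) -> int:
    """Return the container count AppMaster would ask ResMan for.

    Extra containers compensate for instances that cannot start, but the
    total never exceeds max_containers (the upper limit).
    """
    missing = max(0, wanted_instances - healthy_instances)
    return min(current_containers + missing, max_containers)
```

For example, with 10 wanted instances, 8 healthy ones, 10 existing containers, and a limit of 11, the function asks for 11 containers rather than 12.
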
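## Service Discovery
1. The SDK discovers the AppMaster address through Nexus;
2. The SDK queries AppMaster to discover the address and current service state of each Job instance;
3. AppMaster periodically synchronizes service addresses and states to third-party naming systems (such as BNS, Nexus, ZK);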

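## Service Update
1. The SDK discovers, through Nexus, the address of the AppMaster for the specified Job;
2. The SDK sends a request to AppMaster; AppMaster propagates the update command to the AppWorkers, and each AppWorker reports its update status back to AppMaster;
3. AppWorker and AppMaster communicate in a pull fashion, so AppMaster can pause a deployment and control its step size according to current conditions;
4. Service updates all happen inside the container and do not involve destroying or creating containers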
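
Because workers pull commands, pausing and step control reduce to deciding what to answer on each poll. The following Python sketch shows one way this could work; the `UpdateController` class, its fields, and the `"update"` command string are hypothetical and not part of Galaxy.

```python
from typing import Optional, Set


class UpdateController:
    """Hands out at most `step` concurrent update commands (illustrative only)."""

    def __init__(self, wanted_version: str, step: int) -> None:
        self.wanted_version = wanted_version
        self.step = step
        self.paused = False
        self.updating: Set[str] = set()  # workers currently being updated

    def on_report(self, worker_id: str, version: str) -> Optional[str]:
        """Called whenever an AppWorker polls; returns "update" or None."""
        if version == self.wanted_version:
            self.updating.discard(worker_id)  # finished, free a slot
            return None
        if worker_id in self.updating:
            return None  # update already in progress for this worker
        if self.paused or len(self.updating) >= self.step:
            return None  # hold back until a slot frees up or we resume
        self.updating.add(worker_id)
        return "update"
```

Setting `paused = True` stops new updates from being handed out without interrupting the ones already in flight, and raising `step` widens the rollout.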

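# Permission and Quota Management Model
1. Cluster: the hosts/agents and services that share one ResMan
2. Pool (machine pool): a host/agent can belong to only one pool, and a pool usually contains many hosts/agents. A cluster may have multiple pools. Pools provide hard isolation of resources and environments and are also the unit of permission assignment.
3. User: a Galaxy user
4. Authority: a particular operation permission that a user holds on a particular pool, such as permission to create, delete, modify, or query Jobs. A user can hold multiple permissions on multiple pools at the same time.
5. Quota: a description of the amount of resources a user owns in the cluster, including CPU quota, memory quota, disk space quota, and a quota on the number of tasks that can be submitted. A user's quota is independent of any specific pool.
6. Label: a label generally identifies a group of machines that share some characteristic; labels and machines are in a many-to-many relationship. Users with permission can label machines in a pool and can specify a label when submitting a task.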
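
The split between per-pool authorities and a cluster-wide quota can be illustrated with a small admission check. This Python sketch is an assumption-laden illustration: the class names, the `"create"` action string, and the units are invented for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class Resources:
    cpu_cores: int = 0
    memory_gb: int = 0
    disk_gb: int = 0
    jobs: int = 0


@dataclass
class User:
    name: str
    quota: Resources  # cluster-wide, independent of any pool
    used: Resources = field(default_factory=Resources)
    # pool name -> set of allowed actions (authorities are granted per pool)
    authorities: Dict[str, Set[str]] = field(default_factory=dict)


def fits(quota: Resources, used: Resources, request: Resources) -> bool:
    """True if the request still fits into what is left of the quota."""
    return (used.cpu_cores + request.cpu_cores <= quota.cpu_cores
            and used.memory_gb + request.memory_gb <= quota.memory_gb
            and used.disk_gb + request.disk_gb <= quota.disk_gb
            and used.jobs + request.jobs <= quota.jobs)


def can_submit(user: User, pool: str, request: Resources) -> bool:
    """A job is admitted only with a "create" authority on the pool AND quota."""
    if "create" not in user.authorities.get(pool, set()):
        return False
    return fits(user.quota, user.used, request)
```

The authority check is keyed by pool while the quota check is not, which reflects the statement above that a user's quota has nothing to do with a specific pool.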

# System Dependencies
1. Nexus for addressing and metadata storage
2. MDT as the trace system for user logs
3. Sofa-PbRPC as the base communication library

80 changes: 80 additions & 0 deletions SConstruct
@@ -0,0 +1,80 @@
protoc = Builder(action='thirdparty/bin/protoc --proto_path=./src/protocol/ --cpp_out=./src/protocol/ $SOURCE')
env_gen = Environment(BUILDERS={'Protoc':protoc})

env_gen.Protoc(['src/protocol/galaxy.pb.h','src/protocol/galaxy.pb.cc'], 'src/protocol/galaxy.proto')
env_gen.Protoc(['src/protocol/resman.pb.h','src/protocol/resman.pb.cc'], 'src/protocol/resman.proto')
env_gen.Protoc(['src/protocol/agent.pb.h','src/protocol/agent.pb.cc'], 'src/protocol/agent.proto')
env_gen.Protoc(['src/protocol/appmaster.pb.h','src/protocol/appmaster.pb.cc'], 'src/protocol/appmaster.proto')
env_gen.Protoc(['src/protocol/appworker.pb.h','src/protocol/appworker.pb.cc'], 'src/protocol/appworker.proto')


env = Environment(
CPPPATH = ['.', 'src', 'src/agent', 'thirdparty/boost_1_57_0/', './thirdparty/include', './thirdparty/rapidjson/include', 'src/utils'] ,
LIBS = ['sofa-pbrpc', 'protobuf', 'snappy', 'glog', 'gflags', 'tcmalloc', 'unwind', 'ins_sdk', 'pthread', 'z', 'rt', 'boost_filesystem', 'gtest', 'common', 'leveldb'],
LIBPATH = ['./thirdparty/lib', './thirdparty/boost_1_57_0/stage/lib'],
CCFLAGS = '-g2 -Wall -Werror -Wno-unused-but-set-variable',
LINKFLAGS = '-Wl,-rpath-link ./thirdparty/boost_1_57_0/stage/lib')

env.Program('resman', Glob('src/resman/*.cc') + Glob('src/utils/*.cc')
+ ['src/protocol/resman.pb.cc', 'src/protocol/galaxy.pb.cc', 'src/protocol/agent.pb.cc'])

env.Program('appmaster', Glob('src/appmaster/*.cc') + Glob('src/utils/*.cc')
+ ['src/protocol/appmaster.pb.cc', 'src/protocol/galaxy.pb.cc', 'src/protocol/resman.pb.cc', 'src/naming/private_sdk.cc'])

env.Program('appworker', Glob('src/appworker/*.cc') + Glob('src/utils/*.cc')
+ ['src/protocol/galaxy.pb.cc', 'src/protocol/appmaster.pb.cc', 'src/protocol/appworker.pb.cc'])

env.Program('agent', Glob('src/agent/*.cc') + Glob('src/utils/*.cc') + Glob('src/agent/*/*.cc')
+ ['src/protocol/agent.pb.cc', 'src/protocol/galaxy.pb.cc', 'src/protocol/resman.pb.cc'])

env.StaticLibrary('galaxy_sdk', ['src/protocol/appmaster.pb.cc', 'src/protocol/galaxy.pb.cc',
'src/sdk/galaxy_sdk_util.cc', 'src/sdk/galaxy_sdk_appmaster.cc'])
env.StaticLibrary('naming_sdk', ['src/protocol/appmaster.pb.cc', 'src/protocol/galaxy.pb.cc', 'src/naming/private_sdk.cc'])

env.Program('galaxy_res_client', Glob('src/client/galaxy_res_*.cc')
+ ['src/client/galaxy_util.cc', 'src/client/galaxy_parse.cc', 'src/sdk/galaxy_sdk_resman.cc',
'src/sdk/galaxy_sdk_util.cc',
'src/protocol/resman.pb.cc', 'src/protocol/galaxy.pb.cc'])

env.Program('galaxy_client', Glob('src/client/galaxy_job_*.cc') + Glob('src/sdk/*.cc')
+ ['src/client/galaxy_util.cc', 'src/client/galaxy_parse.cc',
'src/protocol/appmaster.pb.cc', 'src/protocol/galaxy.pb.cc', 'src/protocol/resman.pb.cc'])

#unittest
agent_unittest_src=Glob('src/test_agent/*.cc')+ Glob('src/agent/*/*.cc') + ['src/agent/agent_flags.cc', 'src/protocol/galaxy.pb.cc', 'src/protocol/agent.pb.cc']
env.Program('agent_unittest', agent_unittest_src)

cpu_tool_src = ['src/example/cpu_tool.cc']
env.Program('cpu_tool', cpu_tool_src)

jail_src = ['src/tools/gjail.cc', 'src/agent/util/input_stream_file.cc']
env.Program('gjail', jail_src)

container_meta_src = ['src/example/container_meta.cc','src/protocol/galaxy.pb.cc', 'src/agent/container/serializer.cc', 'src/agent/util/dict_file.cc']
env.Program('container_meta', container_meta_src)


#example
test_cpu_subsystem_src=['src/agent/cgroup/cpu_subsystem.cc', 'src/agent/cgroup/subsystem.cc', 'src/protocol/galaxy.pb.cc', 'src/agent/util/path_tree.cc', 'src/example/test_cpu_subsystem.cc', 'src/agent/agent_flags.cc', 'src/agent/util/util.cc']
env.Program('test_cpu_subsystem', test_cpu_subsystem_src)

test_cgroup_src=Glob('src/agent/cgroup/*.cc') + ['src/example/test_cgroup.cc', 'src/protocol/galaxy.pb.cc', 'src/agent/agent_flags.cc', 'src/agent/util/input_stream_file.cc', 'src/protocol/agent.pb.cc', 'src/agent/collector/collector_engine.cc', 'src/agent/util/util.cc']
env.Program('test_cgroup', test_cgroup_src)

test_process_src=['src/example/test_process.cc', 'src/agent/container/process.cc']
env.Program('test_process', test_process_src)

test_volum_src=['src/example/test_volum.cc', 'src/protocol/galaxy.pb.cc', 'src/agent/util/path_tree.cc', 'src/agent/agent_flags.cc', 'src/agent/util/user.cc'] + Glob('src/agent/volum/*.cc') + Glob('src/agent/collector/*.cc')
#env.Program('test_volum', test_volum_src);

test_container_src=['src/example/test_contianer.cc', 'src/protocol/galaxy.pb.cc', 'src/agent/agent_flags.cc', 'src/protocol/agent.pb.cc'] + Glob('src/agent/cgroup/*.cc') + Glob('src/agent/container/*.cc') + Glob('src/agent/volum/*.cc') + Glob('src/agent/util/*.cc') + Glob('src/agent/resource/*.cc') + Glob('src/agent/collector/*.cc')
env.Program('test_container', test_container_src);

test_galaxy_parse_src=['src/example/test_galaxy_parse.cc', 'src/client/galaxy_util.cc','src/client/galaxy_parse.cc']
env.Program('test_galaxy_parse', test_galaxy_parse_src);

env.Program('test_filesystem', ['src/example/test_boost_filesystem.cc'])
#env.Program('test_b', ['src/example/test_boost.cc', 'src/agent/util/util.cc'])
env.Program('test_appworker_utils', ['src/example/test_appworker_utils.cc', 'src/appworker/utils.cc'])

env.Program('test_volum_collector', ['src/example/test_volum_collector.cc', 'src/agent/volum/volum_collector.cc', 'src/agent/agent_flags.cc'])
10 changes: 10 additions & 0 deletions build.sh
@@ -0,0 +1,10 @@
#!/bin/bash
echo "-->(1/2), start to install deps ..."
./build_deps.sh
if [ $? -ne 0 ]; then
    echo "fail to install deps!!!"
    exit 1
fi
echo "-->(2/2) call scons to build project"
scons -j 8
