Info
This site provides best practices, sizing, and troubleshooting guidelines to improve the performance of IBM Maximo Application Suite (MAS).
Maximo 7.x Best Practices are also available on the site. Most DB configurations in those best practices still apply to the MAS Manage app.
There are many instance types available in AWS. Based on our benchmarks, we recommend M5 or M6 instances (e.g. m5.4xlarge) for master or worker nodes and P3 or P4 instances for GPU nodes.
Note
Each OCP cluster creates one classic load balancer and two network load balancers in AWS. The AWS classic load balancer has a default idle timeout of 60 seconds. In some cases this is not enough for a long-running transaction (e.g. an asset health check notebook). Consider adjusting this value to what the application needs (e.g. 300 seconds).
Also, monitoring classic load balancer performance is strongly recommended, particularly with IoT-related apps. (Note: Surge Queue Length defaults to a hard-coded limit of 1024. When the queue is full, the TCP handshake will fail.)
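As a sketch, the idle timeout can be inspected and raised with the AWS CLI; the load balancer name below is illustrative, substitute the classic load balancer created for your cluster:

```shell
# Inspect the current attributes (load balancer name is illustrative)
aws elb describe-load-balancer-attributes --load-balancer-name my-ocp-classic-lb

# Raise the idle timeout to 300 seconds for long-running transactions
aws elb modify-load-balancer-attributes \
  --load-balancer-name my-ocp-classic-lb \
  --load-balancer-attributes '{"ConnectionSettings":{"IdleTimeout":300}}'
```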
DocumentDB is a fully managed, MongoDB-compatible database. It can be used by both IBM Suite License Service (SLS) and MAS Core. However, there are functional differences between DocumentDB and MongoDB. Check this link for more details.
Note: DocumentDB does not support retryable writes, so `retryWrites=false` must be set in the SLS and Suite CRs.
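For illustration only (hostname, user, and password are placeholders), a DocumentDB connection string with retryable writes disabled looks like:

```
mongodb://masuser:<password>@docdb-cluster.cluster-example.us-east-1.docdb.amazonaws.com:27017/?tls=true&retryWrites=false
```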
MAS supports MSK, which is a fully managed Apache Kafka service.
Note
EBS storage classes like gp2 and gp3 are supported by OCP in AWS. Note: EBS storage is ReadWriteOnce; a volume can be mounted as read-write by a single node only. io1 and io2 are SSD-based EBS volume types that provide higher performance. Check Amazon EBS volume types for extra information such as throughput, tuning, and cost.
Below is a sample YAML to create an io1 storage class with 100 iopsPerGB.
```yaml
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: io1
provisioner: kubernetes.io/aws-ebs
parameters:
  encrypted: 'true'
  iopsPerGB: '100'
  type: io1
reclaimPolicy: Delete
allowVolumeExpansion: true
volumeBindingMode: Immediate
```
EFS storage can be used as a ReadWriteMany storage class. EFS has different metered throughput modes.
A self-managed OCP cluster can be created with the installer CLI tool, which supports both IPI and UPI modes; it requires you to perform your own maintenance and upgrades. Alternatively, ROSA is a managed Red Hat OpenShift service. Each ROSA cluster comes with a fully managed control plane and compute nodes. Installation, management, maintenance, and upgrades are performed by Red Hat site reliability engineers (SREs) with joint Red Hat and Amazon support.
For the OCP cluster, Premium File Storage is recommended, because MAS and its components need RWX (ReadWriteMany) storage to support a certain level of high availability, as well as for doclinks, JMS storage, and so on.
For an external DB VM, a high-performance storage such as Premium SSD v2 or Ultra Disk is recommended. More performance metrics can be found at https://learn.microsoft.com/en-us/azure/virtual-machines/disks-types.
The MAS core namespace contains several important services required for user login and authentication, application management, MAS adoption metrics, licensing, etc. To understand the functionality of each service/pod in MAS core, check MAS Pods Explained.
The following are the key components/dependencies that require scaling as the number of concurrent MAS users grows.
Caveat
The scaling guidance described below comes from lab benchmark testing; results may vary based on differences in workload, environment, or configuration settings.
MongoDB is a crucial dependency for MAS core services; if not scaled properly, MongoDB can quickly become a bottleneck as the number of concurrent users increases. A common symptom of an undersized MongoDB cluster is liveness probe timeouts and pod restarts of the MAS core services that depend on MongoDB (e.g. coreidp).
For useful MongoDB troubleshooting commands, see MongoDB Troubleshooting.
The following MongoDB metrics are important to monitor.
Tip
When using the ibm.mas_devops collection to install MAS, you can optionally install Grafana with the cluster_monitoring Ansible role. Once Grafana is installed via the cluster_monitoring role, you can then install MongoDB using the mongodb Ansible role, which includes a Grafana dashboard for monitoring the MongoDB cluster.
If you're using a MongoDB cluster hosted by a cloud provider, use the monitoring dashboards provided by the cloud provider.
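As a quick spot check without Grafana, a few server-side counters can be read from mongosh; connection counts and operation rates are the usual starting points:

```javascript
// Run in mongosh against the MAS MongoDB instance.
// Current vs available connections: exhaustion correlates with
// liveness probe timeouts in dependent services such as coreidp.
db.serverStatus().connections

// Cumulative operation counters; sample twice and diff to get ops/sec.
db.serverStatus().opcounters

// Cache pressure for the WiredTiger storage engine.
db.serverStatus().wiredTiger.cache["bytes currently in the cache"]
```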
The following databases and collections in MongoDB are accessed frequently during user login and authentication.
The table below provides general guidance on scaling MongoDB based on the number of concurrent users and the login rate. To scale MongoDB Community Edition, specify the desired CPU/memory limits in the MongoDBCommunity CR.
```yaml
spec:
  statefulSet:
    spec:
      template:
        spec:
          containers:
            - name: mongod
              resources:
                limits:
                  cpu: <cpu limit>
                  memory: <mem limit>
```
| Login rate (logins/minute) | MongoDB CPU limit | MongoDB Memory limit (GB) |
| --- | --- | --- |
| 75 | 2 | 4 |
| 150 | 2 | 4 |
| 300 | 4 | 8 |
| 600 | 6 | 12 |
| 1200 | 8 | 16 |
The table below provides general guidance on scaling the coreidp service based on the number of concurrent users and the login rate. To scale the coreidp service, use the podTemplates workload customization feature in MAS.
| Login rate (logins/minute) | coreidp replicas | coreidp CPU limit | coreidp Memory limit (GB) |
| --- | --- | --- | --- |
| 75 | 1 | 6 | 1 |
| 150 | 1 | 6 | 1 |
| 300 | 1 | 6 | 1 |
| 600 | 2 | 6 | 2 |
| 1200 | 4 | 6 | 3 |
The table below provides general guidance on scaling the licensing-mediator service based on the number of concurrent users and the login rate. The coreidp service calls the licensing-mediator service, which in turn calls the api-licensing service in the SLS namespace for license checkin/checkout operations. To scale the licensing-mediator service, use the podTemplates workload customization feature in MAS.
| Login rate (logins/minute) | licensing-mediator replicas | licensing-mediator CPU limit | licensing-mediator Memory limit (GB) |
| --- | --- | --- | --- |
| 75 | 1 | 1 | 1 |
| 150 | 1 | 1 | 1 |
| 300 | 2 | 2 | 1 |
| 600 | 4 | 3 | 1 |
| 1200 | 6 | 3 | 1 |
The table below provides general guidance on scaling the api-licensing service based on the number of concurrent users and the login rate. To scale the api-licensing service, use the podTemplates workload customization feature in MAS.
| Login rate (logins/minute) | api-licensing replicas | api-licensing CPU limit | api-licensing Memory limit (GB) |
| --- | --- | --- | --- |
| 75 | 1 | 1 | 2 |
| 150 | 1 | 2 | 2 |
| 300 | 2 | 2 | 2 |
| 600 | 2 | 2 | 2 |
| 1200 | 2 | 2 | 2 |
The table below provides general guidance on scaling the coreapi service based on the number of concurrent users and the login rate. To scale the coreapi service, use the podTemplates workload customization feature in MAS.
| Login rate (logins/minute) | coreapi replicas | coreapi CPU limit | coreapi Memory limit (GB) |
| --- | --- | --- | --- |
| 75 | 3 | 1 | 2 |
| 150 | 3 | 1 | 2 |
| 300 | 3 | 1 | 2 |
| 600 | 3 | 2 | 2 |
| 1200 | 3 | 3 | 2 |
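As an illustrative sketch only (the exact field names may differ; consult the MAS podTemplates workload customization documentation for the authoritative schema), a podTemplates override for coreidp at higher login rates might look like:

```yaml
# Hypothetical fragment of the MAS Suite CR; field names follow standard
# Kubernetes pod template conventions and should be verified against the
# MAS podTemplates documentation before use.
spec:
  podTemplates:
    - name: coreidp
      replicas: 2
      containers:
        - name: coreidp
          resources:
            limits:
              cpu: "6"
              memory: 2Gi
```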
IBM Cloud provides both block and file storage for OCP. Both storage types support ReadWriteMany access. If the app requires high-performance disks, consider setting up a custom performance storage class as below.
Block storage sample YAML:
```yaml
allowVolumeExpansion: true
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: block100p
parameters:
  billingType: hourly
  classVersion: "2"
  fsType: ext4
  sizeIOPSRange: |-
    [20-1999]Gi:[100-100]
  type: Performance
provisioner: ibm.io/ibmc-block
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
```
File storage sample YAML:
```yaml
allowVolumeExpansion: true
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: file100p
parameters:
  billingType: hourly
  classVersion: "2"
  fsType: ext4
  sizeIOPSRange: |-
    [20-1999]Gi:[100-100]
  type: Performance
provisioner: ibm.io/ibmc-file
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
```
If the built-in ingress load balancer in OCP is unable to scale to handle "large" workloads (100K+ concurrent device connections), consider provisioning an instance of the IBM Cloud NLB 2.0 (IPVS/keepalived) load balancer.
IBM ROKS is a managed Red Hat OpenShift service in IBM Cloud. Each ROKS cluster comes with a fully managed control plane and compute nodes. Installation, management, maintenance, and upgrades are performed by Red Hat site reliability engineers (SREs) with joint Red Hat and IBM Cloud support.
The MQTT protocol is the preferred messaging protocol for data ingest into the MAS IoT service. HTTP messaging support was added to MAS IoT for low-volume scenarios and is not designed for message rates greater than 1K msgs/sec.
MQTT message ingest rates are 2-3 orders of magnitude faster than HTTP, primarily because HTTP messaging requires a TLS handshake and authentication on every message published. The authentication requires a database lookup for the device authentication token. As such, HTTP messaging puts a strain on the authentication service and the IoT database.
To achieve high data ingest rates with the MAS IoT service, use the MQTT protocol and keep the device connection open while publishing messages.
```
MQTT CONNECT
MQTT PUBLISH (in loop until all messages are published)
```

rather than repeatedly connecting and disconnecting:

```
MQTT CONNECT
MQTT PUBLISH
MQTT DISCONNECT
MQTT CONNECT
MQTT PUBLISH
MQTT DISCONNECT
...
```
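Illustrative arithmetic (the per-operation costs below are assumptions, not measured values) for why a per-message handshake dominates HTTP ingest rates, while MQTT amortizes it over the whole session:

```python
# Assumed, illustrative costs per operation (not lab measurements):
handshake_ms = 80    # TLS handshake + device auth database lookup
publish_ms = 1       # cost to publish a single message

def http_rate():
    # HTTP: every message pays the handshake and auth lookup
    return 1000 / (handshake_ms + publish_ms)

def mqtt_rate(n_messages):
    # MQTT: one handshake amortized over the whole publishing session
    total_ms = handshake_ms + n_messages * publish_ms
    return n_messages * 1000 / total_ms

print(f"HTTP: {http_rate():.0f} msgs/sec")          # ~12 msgs/sec
print(f"MQTT: {mqtt_rate(100_000):.0f} msgs/sec")   # ~999 msgs/sec
```

With these assumed costs the gap is already nearly two orders of magnitude, consistent with the 2-3 orders of magnitude observed in practice.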
The MQTT service in MAS IoT was designed to handle many device connections, each publishing at low rates. As such, a data ingest application for MAS IoT should distribute the load over many MQTT devices or applications in order to maximize message rates. Single device or application connections will be throttled based on the IoT fair use policy (see below).
IoT data ingest throttling limits are per device and are based on the device class (i.e. Device, Gateway, Application). These limits are in place to prevent DoS attacks from rogue (i.e. badly behaving) devices. The throttling limits do not scale with the MAS IoT deployment size. For more information on MAS IoT messaging quotas see https://www.ibm.com/docs/en/mapms/1_cloud?topic=features-quotas
The messaging QoS specified when publishing an MQTT message also has a strong impact on messaging rates.
QoS in order of fastest to slowest: QoS 0 (at most once), QoS 1 (at least once), QoS 2 (exactly once).
QoS >0 performance considerations
```yaml
apiVersion: iot.ibm.com/v1
kind: IoT
metadata:
  name: masinst1
  namespace: mas-masinst1-iot
spec:
  bindings:
    jdbc: system
    kafka: system
    mongo: system
  settings:
    deployment:
      size: medium
```
The deployment size definitions are in the `/opt/ansible/roles/<ibm-iot-operator>/vars` folder, e.g. `/opt/ansible/roles/ibm-iot-actions/vars/size_medium.yml`.
OpenShift HAProxy supports 20k connections per pod. The total number of connections determines how many end devices can connect to the IoT MSProxy. To scale the default ingress controller to 3 replicas:

```shell
oc patch -n openshift-ingress-operator ingresscontroller/default --patch '{"spec":{"replicas": 3}}' --type=merge
```
IoT uses Kafka to process the messages. Follow the Kafka Configuration Reference to configure the best values for Kafka/topic settings such as `retention.ms`, `retention.bytes`, `partitions`, and `replicas` to support the workload.
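If Kafka is provided by AMQ Streams/Strimzi, these settings can be declared on the KafkaTopic CR. The topic and cluster names below are illustrative, and the values are starting points to be tuned against the workload, not recommendations:

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
  name: iot-messages            # illustrative topic name
  labels:
    strimzi.io/cluster: maskafka  # illustrative cluster name
spec:
  partitions: 6
  replicas: 3
  config:
    retention.ms: 86400000      # 24 hours
    retention.bytes: 1073741824 # 1 GiB per partition
```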
Depending on the cloud provider, worker node instances have different network bandwidth, which determines how fast end devices can send requests. Message rate is limited by the message size and the network bandwidth. Achieving higher rates and/or larger messages requires 10Gb Ethernet. Network bandwidth also impacts response latency: the higher the bandwidth, the lower the latency.
The deployment configurations below are recommended as starting values for medium and large workloads.
| Table Name | Columns | Comments |
| --- | --- | --- |
| plusgpermitwork | "ptwclass" ASC, "siteid" ASC, "orgid" ASC, "permitworknum" ASC | |
| plusgpermitwork | "ptwclass" ASC, "status" ASC, "plusgpertypeid" ASC, "permitworknum" ASC | |
| plusgpermitwork | "ptwclass" ASC | |
| plusgpermitwork | "status" ASC, "ptwclass" ASC, "description" ASC | |
| plusgpertype | "pertypenum" ASC, "plusgpertypeid" ASC | |
| workorder | "description" ASC | Add if searching on the description field; creating a text index is better |
| workorder | "status" ASC, "historyflag" ASC, "istask" ASC, "wonum" ASC | Add if searching on the status field |
| plusgoperaction | "recordid" ASC, "class" ASC | |
| plusgshftlogentry | "recordkey" ASC, "orgid" ASC, "siteid" ASC, "createdate" ASC | |
| plusgshiftlog | "shiftnum" ASC, "isshiftlog" ASC, "startdate" ASC | |
| plusgrelatedrec | "relatedreckey" ASC, "relatedrecclass" ASC, "recordkey" ASC | |
| plusgrelatedrec | "recordkey" ASC, "class" ASC, "relatedrecclass" ASC | |
| plusgincperson | "ticketid" ASC | |
| maxsession | "issystem" ASC, "userid" ASC, "clienthost" ASC | |
| ticket | "globalticketid" ASC, "globalticketclass" ASC | |
| report | "reportname" ASC, "appname" ASC, "reportnum" ASC, "runtype" ASC, "userid" ASC | |
| reportrunqueue | "running" ASC, "priority" ASC, "submittime" DESC | |
| Table Name | Columns |
| --- | --- |
| PLUSTWARRTRANS | CONTRACTTYPE, CLAIMID, ASSETNUM, CONTRACTNUM, SITEID |
| PLUSTWARRTRANS | CONTRACTTYPE, CLAIMID, TRANSDATE, PLUSTWARRTRANSID |
| maxuser | status, userid |
| logintracking | "userid" ASC, "attemptresult" ASC, "attemptdate" DESC |
| craftrate | orgid |
| SYNONYMDOMAIN | MAXVALUE, DOMAINID, VALUE |
| CONTLINEASSET | "PLUSTNEWEXTENDEDREASON" ASC, "LOCATION" DESC |
| CONTRACT | CONTRACTTYPE, STATUS, ORGID, CONTRACTNUM |
| LOCHIERARCHY | LOCATION, SYSTEMID, SITEID, PARENT |
| MULTIASSETLOCCI | ISPRIMARY, RECORDKEY, WORKSITEID, RECORDCLASS |
| PLUSTASSETALIAS | ISACTIVE, ALIAS, PLUSTASSETALIASID, DESCRIPTION, ORGID, LANGCODE, ISDEFAULT, ISASSETNUM, HASLD, SITEID, ASSETNUM |
| INVOICELINE | PLUSTCONTRACTNUM, SITEID, INVOICELINENUM, INVOICENUM |
| INVOICE | INVOICENUM, SITEID, STATUS |
| INVOICELINE | INVOICENUM, SITEID, INVOICELINENUM |
| INVOICECOST | "INVOICENUM" ASC, "SITEID" ASC, "ASSETNUM" ASC, "INVOICELINENUM" ASC |
| ASSET | "ASSETNUM" ASC, "SITEID" ASC, "ASSETID" ASC |
| inspectionform | inspformnum, status, orgid |
| inspectionresult | "siteid" ASC, "referenceobject" ASC, "referenceobjectid" ASC |
| PLUSTWARRTRANS | "CONTRACTTYPE" ASC, "CLAIMID" ASC, "PLUSTWARRTRANSID" ASC |
| propertydefault | contracttypeid, orgid |
| plustitemwarr | itemnum, plustpos, orgid, assetid, matusetransid, plustitemwarrid |
| plustitemwarrmtr | plustitemwarrid |
| plustassetalias | assetnum, siteid, isdefault |
| PLUSTWARRTRANS | "CLAIMID" ASC, "SITEID" ASC |
| countbookline | itemnum, countbooknum, siteid |
| countbookline | countbooknum, siteid, orgid |
| countbookline | match, countbooknum, siteid, orgid |
| countbookline | recon, countbooknum, siteid, orgid |
| countbookline | physcnt, countbooknum, siteid, orgid |
| countbook | storeroom, countbooknum, siteid |
| item | itemsetid, itemnum |
| warrantyasset | contractnum, revisionnum, orgid, assetid |
| contractline | CONTRACTNUM, REVISIONNUM, ORGID, CONTRACTLINENUM, contracttype |
| invbalances | "PHYSCNTDATE" DESC, "SITEID" ASC, "ITEMNUM" ASC |
| countbookline | countbooknum, siteid, rotating, itemnum |
| invbalances | location, nextphycntdate |
| inventory | location, siteid, orgid, itemnum |
| countbooksel | countbooknum, siteid |
| mafappdata | ismobile, status |
| asset | assetnum, siteid, status, plustisconsist, plustalias, orgid |
| plustassetalias | assetnum, siteid, isactive |
| invoicecost | assetnum, siteid |
| plustclaim | orgid, contractnum, status |
| plustclaim | assetnum, siteid |
| asset | assetid, moved, plustisconsist, description |
| invoice | siteid, invoicenum, status, orgid |
| wplabor | wplaborid, orgid |
| joblabor | orgid, siteid, jobplanid, jptask |
| workorder | plustcmpnum, siteid, status |
| plustitemwarr | matusetransid |
| contlineasset | assetid, location, locationsite, warrantystartdate, warrantyenddate, contractnum |
| CONTRACT | contractnum, revisionnum, orgid, contracttype, status |
| contlineasset | assetid, orgid, plustfullcoverage, contractnum, revisionnum, contractlinenum |
| plustwpserv | wpservid, orgid |
| plustwarrtrans | matusetransid, covereditemnum |
| contractline | itemnum, conditioncode, linestatus |
| warrantyline | plustcoverservices, plustcovermaterials |
| pm | siteid, status, pmnum, assetnum |
| workorder | siteid, pmnum, status |
| plustwpserv | wonum, siteid |
| plustwarrtrans | servrectransid |
| plustwarrtrans | parentwonum, refwonum, claimid, siteid, linecost |
| plustwpserv | invoicenum, invoicesite |
| invoice | status, invoicenum, siteid |
| maxvars | varname, orgid, varvalue |
| inspectionresult | inspformnum, revision, orgid, siteid, status, asset, location, resultnum |
| plustwpserv | wonum, invoicenum, complete |
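For example, the first recommended index on plusgpermitwork could be created with standard DDL as follows (the index name is illustrative; follow your site's naming conventions):

```sql
CREATE INDEX plusgpermitwork_ndx1
  ON plusgpermitwork (ptwclass ASC, siteid ASC, orgid ASC, permitworknum ASC);
```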
A new MAS instance is required to run MAS Manage workloads at the point at which the DB server can no longer be scaled up. When that point is reached, the customer should plan to create a new MAS instance, backed by a new DB server, and move sites to it.
When describing Maximo transaction latency, it is important to define the boundaries of what constitutes a standard or out-of-the-box Maximo transaction. The description below does just that.
An out-of-the-box Maximo transaction is expected to complete with a latency of 2 seconds or less, where a transaction is defined as the creation, update, or deletion of a single MBO containing no more than one child object and no attachments or binary data (BLOBs). Examples include, but are not limited to:
The following conditions are considered outside the scope of an out-of-the-box Maximo transaction and therefore do not fall under the 2-second latency characterization.
MAS Manage has different bundle types (e.g. all, ui, mea, report, and cron) to configure the app server. Adjust resource settings such as CPU, memory, and replicas to match the workload. The settings are in the ManageWorkspace CR. Below is a sample.
```yaml
apiVersion: apps.mas.ibm.com/v1
kind: ManageWorkspace
...
spec:
  settings:
    deployment:
      serverBundles:
        - bundleType: mea
          isDefault: false
          isMobileTarget: false
          isUserSyncTarget: true
          name: mea
          replica: 1
          routeSubDomain: all
        - bundleType: cron
          isDefault: false
          isMobileTarget: false
          isUserSyncTarget: false
          name: cron
          replica: 1
...
```
```yaml
spec:
  settings:
    resources:
      manageAdmin:
        limits:
          cpu: '2'
          memory: 4Gi
        requests:
          cpu: '0.2'
          memory: 500Mi
      serverBundles:
        limits:
          cpu: '6'
          memory: 10Gi
        requests:
          cpu: '0.2'
          memory: 1Gi
```
Lab tests show that the roundrobin policy is more stable and performs better than the default leastconn policy. Follow this link to update the load balancer policy.
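On OpenShift, the HAProxy balancing policy can be set per route with an annotation; the namespace and route names below are illustrative placeholders for your Manage instance:

```shell
oc -n mas-<instance>-manage annotate route <manage-route> \
  haproxy.router.openshift.io/balance=roundrobin --overwrite
```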
Follow this link to understand Manage pod functionality.
When using IBM Maximo Application Suite (MAS), Manage users receive an error message telling them to reload the application after 2 hours, even while actively working. This 2-hour default is when the LTPA token in Manage expires, redirecting the user back to the MAS login page. Follow Updating LTPA timeout in Manage to increase the default value.
Due to the architecture change, Maximo 8.x (the MAS Manage app) is deployed on WebSphere Liberty Base 21.0.0.5 with OpenJ9. As of WebSphere Liberty 18.0.0.1, the thread growth algorithm grows thread capacity more aggressively to react more rapidly to peak loads. For many environments, this autonomic tuning provided by the Open Liberty thread pool works well with no configuration or tuning by the operator. If necessary, you can adjust the coreThreads and maxThreads values by tuning Liberty.
Follow this link to configure JVM options
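As a sketch, the Liberty default executor's coreThreads and maxThreads are set in server.xml; the values below are illustrative only and should be sized to the measured peak concurrency:

```xml
<server>
    <!-- Illustrative values, not recommendations: the autonomic thread
         pool is usually sufficient without this override. -->
    <executor coreThreads="40" maxThreads="80"/>
</server>
```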
Disk performance is critical for DB performance. Recommend a storage or disk with:
To measure disk performance on Linux, use the `dd` command. The sample command below measures disk performance of the data volume inside a Db2 pod running in OCP.
CAUTION
Make sure the `ddtest` filename is appended to the end of the data path, or the dd command will wipe the Db2 data directory.
```shell
[db2inst1@c-db2wh-manage-db2u-0 - Db2U bludata0]$ dd if=/dev/zero of=path_of_db2_data_directory/ddtest bs=128K count=8192
8192+0 records in
8192+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 2.84314 s, 378 MB/s
[db2inst1@c-db2wh-manage-db2u-0 - Db2U bludata0]$
```
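The throughput dd reports is simply bytes copied divided by elapsed time; the numbers below are taken from the sample run above:

```python
# Reproduce dd's reported throughput from the sample run above.
bytes_copied = 8192 * 128 * 1024      # count * block size (128K)
elapsed_s = 2.84314

mb_per_s = bytes_copied / elapsed_s / 1e6
print(f"{mb_per_s:.0f} MB/s")  # 378 MB/s, matching dd's output
```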
Reducing network latency is key to optimizing performance. Confirm latency is below 50 ms by conducting a ping test. For production environments, we strongly recommend keeping latency below 10 ms and placing the app and DB servers in the same network segment. In cloud deployment scenarios, ensure both the database and the OpenShift cluster are located within the same region, and if possible within the same availability zone (AZ). Use the ping command to evaluate and pinpoint latency issues.
When optimizing large tables in the Manage app, it is recommended to move these tables to a dedicated tablespace on high-throughput disks, coupled with a dedicated buffer cache for enhanced performance. Disk speed and memory availability play crucial roles in this optimization strategy. Additionally, ensure that index statistics are regularly updated, and address any problematic queries to further optimize the system.
DB2 tuning in the Maximo 7.6.x best practices is applicable.
IMPORTANT
The containerized Db2U and Db2WH deployments do NOT support text search (regular Db2 does support text search). As a result, some queries may perform poorly on containerized Db2 relative to Oracle DB and SQL Server, which both support text search.
Searching records by Description on the list page is a typical scenario whose performance can benefit from text search capability of the database, especially if no other indexed attributes are included in the query.
Adding a non-unique index on Description can help if an exact search can be made (Maximo search type = EXACT, or the user types "=" before the search string, e.g. =Text) or if the search is done on the beginning of the string (the user types '%' at the end of the string, e.g. Text%). If possible, adding other fields to the query (either typed by the user or as part of the default), where those attributes are part of an index, can also help. In addition, adding Description to the end of one of these indexes can also show improvement.
Highlights:

```shell
db2 update db cfg using DFT_TABLE_ORG ROW
```
The Maximo 7.6.x best practices are applicable.
Additional settings for MSSQL Server 2019:
```sql
ALTER DATABASE <DB NAME>
SET ALLOW_SNAPSHOT_ISOLATION ON

ALTER DATABASE <DB NAME>
SET READ_COMMITTED_SNAPSHOT ON
```
This was added to the MAS Manage application to assist users with problem code classification (PCC) for Work Orders. See the product documentation for more details.
Inferencing is typically run more frequently than training but is less resource intensive. By default, MAS Manage is configured with a single instance of the AIINFJOB cron task, which is recommended for most workloads.
The predictor pod, where inferencing/prediction occurs, receives a batch of Work Orders to be inferenced from the MAS Manage cron pod running the AIINFJOB cron task. The batch size (or page size, defined on the MXAPIWODETAIL object structure query template) is the best way to control the rate at which Work Orders are inferenced. In the graph below you can see how the total time to inference 100K work orders is influenced by the batch size. With a batch size of 500 Work Orders/request and a 30-second interval for the AIINFJOB cron task, 100K work orders were inferenced in approximately 1.6 hours, compared to a batch size of 10 WOs/request, which took 83 hours.
Important
The recommended batch size is 500 Work Orders/request, and the recommended interval for the AIINFJOB cron task is 30 seconds.
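The lab numbers above can be reproduced with simple arithmetic, assuming each cron task firing processes one batch:

```python
def inference_hours(total_wos, batch_size, interval_s=30):
    """Hours to inference total_wos, assuming one batch per cron firing."""
    batches = total_wos / batch_size
    return batches * interval_s / 3600

print(round(inference_hours(100_000, 500), 1))  # 1.7 hours
print(round(inference_hours(100_000, 10), 1))   # 83.3 hours
```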
The graphs below show the CPU and memory resource utilization of the predictor pod based on the batch size. As you can see, the CPU utilization of the predictor pod increases with the batch size, but the memory utilization remains fairly consistent (i.e. between 4 GB and 5 GB).
For bulk inferencing of large numbers of Work Orders, it is recommended to use the AIINFJOB cron task. However, UI users can also request problem code inferencing on a single work order. In that case the predictor pod receives a single work order, so the per-work-order overhead is much higher. For example, inferencing a batch of 10 or more work orders results in an average inferencing time of 20 milliseconds per work order in the predictor pod, while the inference time for a single work order from the UI is about 120 milliseconds (in the predictor pod); the total time including the MAS Manage API request is about 750 milliseconds. It is therefore much more efficient to inference large numbers of work orders asynchronously using the AIINFJOB cron task and a page size of 500. In other words, don't use the API from a script.
This was added to the MAS Manage application to assist users with problem code classification (PCC) for Work Orders. See the product documentation for more details.
Model training is resource intensive. For this reason, there is a limit of one active model training per MAS Manage instance.
A single model training run requires at least 8 GB of memory. The pipeline pod, where model training occurs, allocates a number of busy processes equal to the number of CPUs on the worker node where the pod is scheduled. At the time of this writing there is no CPU limit set for the pipeline pod, so it will consume as much CPU as is available on the worker node where it is scheduled. In general, the more CPU available to the pipeline pod, the faster training will go.
The three data points on the graph below were taken on a 16 CPU worker node. In these tests a CPU limit was placed on the pipeline pod (not the default; by default the pipeline pod does not have specified limits). As you can see, the training time with an 8 CPU limit was a little more than twice as fast as the training time with a 4 CPU limit. However, when comparing the 16 CPU limit and 8 CPU limit training times, there is very little improvement. This can be attributed to other workloads running on the worker node where the pipeline pod was scheduled, as well as synchronization waits between the training processes/threads. In other words, to improve the training time for the 16 CPU limit test, it would be necessary to schedule the pipeline pod on a worker node with more than 16 CPUs and fewer competing workloads.
Important
Do not train with more than 10K labeled samples; 10K samples is the recommended limit for PCC training.
The training times for a single epoch and different sample sizes are shown below. In general, the larger the labeled sample data set, the longer the training time. You can see below that there is an exception to this rule. When comparing the single-epoch training time between the 1K and 5K sample sizes, the single-epoch training time for the 5K sample size is only 82 minutes compared to 220 minutes for the 1K sample size. This is because there were 30 problem codes in this test, and with a 1K sample size there was an insufficient number of samples per problem code. As a result, the model leveraged watsonx to generate synthetic samples, and this process accounts for the additional training time for the 1K sample set.
Info
The results below show training time for a single epoch. A real training run uses 12 epochs, so the single-epoch training times below should be multiplied by 12 to get the real training time. Note: there is a default timeout of 14400 minutes (10 days) for training to complete.
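Converting the single-epoch lab numbers into full-run estimates is straightforward arithmetic:

```python
EPOCHS = 12  # a real training run uses 12 epochs

def full_training_hours(single_epoch_minutes):
    """Estimate full training time from a single-epoch measurement."""
    return single_epoch_minutes * EPOCHS / 60

print(full_training_hours(82))   # 5K samples: 16.4 hours
print(full_training_hours(220))  # 1K samples: 44.0 hours

# Default training timeout: 14400 minutes is 10 days
print(14400 / 60 / 24)  # 10.0
```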
The lab results indicate a significant correlation between transactions per second (TPS) and database disk IO utilization. This correlation suggests that the level of transactional activity directly impacts the IO workload on the database disk. Conversely, the IO workload acts as a limitation on the system's ability to handle a larger volume of transactions.
When IO is not the limiting factor, increasing the number of MEA pods can positively impact processing performance.
Increasing the Message-Driven Bean (MDB) instances can potentially have a positive impact on system performance. It is recommended to adjust the number of records per message, the number of MDBs, and the batch size. By finding the right balance, you can target resource usage of around 2 cores and 4-7 GB of RAM, which helps ensure efficient utilization without overburdening the MEA pods.
Based on the lab results, it has been observed that a large number of internal error messages has a substantial impact on processing throughput.
Under certain circumstances, the configuration parameter mxe.int.splitdataonpost does not demonstrate a positive impact. To validate its effectiveness, it is recommended to perform a dry run in your specific environment.
To troubleshoot and optimize performance, follow this checklist:

- Set `maxMessageDepth` large enough to avoid message queue overflow; it is recommended to match SIBus's default value of at least 500,000.
- Establish a monitoring system to track essential performance metrics throughout the testing process.
- Begin with a dry run using a single MEA pod to establish a baseline benchmark for performance evaluation.
- Adjust the Message-Driven Bean (MDB) and BatchSize parameters to optimize resource utilization within an appropriate range for the MEA pod.
- Scale up the number of MEA pods as needed to meet performance requirements and accommodate increased workload.
- Continuously monitor and assess the performance of both the database and the application to identify any bottlenecks or areas for improvement.

By following these test methodologies, you can effectively monitor and optimize the performance of your system, ensuring efficient resource utilization while maintaining satisfactory performance.
| Component | Configuration | Adjustable or Scalable | Observation & Best Practice |
| --- | --- | --- | --- |
| JMS / MIF | maxMessageDepth | Yes | Make it large enough; if it is too small, the process fails when the queue is full and may be hard to recover. Recommend 500,000, the same as SIBus |
| | maxEndpoints | Yes | Limits the maxConcurrency |
| | MDB (maxConcurrency) | Yes | Along with BatchSize, impacts processing speed and MEA pod resource utilization |
| | BatchSize (maxBatchSize) | Yes | Along with MDB, impacts processing speed and MEA pod resource utilization |
| Maximo | # of JMS pods | Yes | 1 JMS server works well in the benchmark test; it does not consume significant resources |
| | # of MEA pods | Yes | Able to scale linearly |
| | MEA CPU / MEM usage | Yes | Adjust MDB and BatchSize to keep MEA pod resources in a reasonable range, e.g. 2-3 cores / 4-7 GB |
| | JMS CPU / MEM usage | Yes | Default settings work well in the benchmark test |
| | DB CPU / MEM usage | Yes | Ensure the DB has sufficient resources |
| | DB disk IO util % | Yes, but sometimes hard to adjust | Disk IO throughput is critical for overall processing |
| | DB lock holds | N/A | |
| | DB tuning: long-running queries, # of applications, memory, etc. | Yes | Follow the best practices to tune the DB |
| | Maximo sequence cache | Yes | A reasonable number, e.g. 20 or 50, can reduce DB CPU and processing time |
| | mxe.int.splitdataonpost | Yes | |
| Message | # of records per message | Yes | |
| | Data structure (complexity of the record) | N/A | Impacts performance because of business logic checks |
| | Record quality (records that cannot be processed) | Yes | A large number of integration error messages slows down overall processing |
| Misc | Method & speed of posting messages into the queue | Yes | Ensure messages are posted (written to the queue) as fast as possible; slow pacing lowers the environment's processing capacity |
| | Any other concurrent transactions | N/A | Other concurrent workloads impact processing time |
| | Worker node capacity | Yes | Worker node capacity may limit working pod (e.g. MEA) capacity; pod distribution should also be considered |
As in the MIF/JMS test, the lab results indicate a significant correlation between transactions per second (TPS) and database disk IO utilization. This correlation suggests that the level of transactional activity directly impacts the IO workload on the database disk. Conversely, the IO workload acts as a limitation on the system's ability to handle a larger volume of transactions.

The results also demonstrate a notable connection between disk IO throughput and TPS.

Doubling the number of CRON JVMs and Kafka topic partitions leads to a twofold increase in the maximum TPS. However, this change also results in an enlarged distribution difference, growing from 2% to 10%. Consequently, in the final phase, the overall processing rate diminishes, with the TPS decreasing from 72 to 66, attributed to the Kafka rule, which allows a maximum of 1 consumer per partition.

Increasing the number of partitions may result in better performance for small messages (e.g., 10 assets per message) compared to large messages. Please ensure that there are an adequate number of messages in the queue for processing.

When evaluating the performance of a single MEA JVM, the TPS in MIF/Kafka matches that of JMS. Nevertheless, when multiple processing JVMs are utilized, JMS delivers better performance due to its more equitable workload distribution. From a best-practice standpoint, it is advisable to have one Kafka topic with 6 partitions and multiple Kafka topics for parallel processing.
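The partition/consumer relationship above can be sketched numerically. The helper below is an illustrative model (the per-consumer TPS figure is hypothetical, not a benchmark number): Kafka assigns each partition to at most one consumer in a group, so parallelism is capped by the partition count.

```python
def effective_parallelism(partitions: int, consumers: int) -> int:
    """Kafka assigns each partition to at most one consumer in a group,
    so consumers beyond the partition count sit idle."""
    return min(partitions, consumers)

def max_tps(partitions: int, consumers: int, tps_per_consumer: float) -> float:
    """Upper bound on throughput for one topic, ignoring skew between partitions."""
    return effective_parallelism(partitions, consumers) * tps_per_consumer

# Hypothetical per-consumer rate, for illustration only.
# Doubling both partitions and consumer JVMs doubles the ceiling...
print(max_tps(6, 6, 12.0))    # 72.0
print(max_tps(12, 12, 12.0))  # 144.0
# ...but adding consumers without adding partitions does not:
print(max_tps(6, 12, 12.0))   # 72.0
```

This is why scaling CRON JVMs alone eventually stops helping: once every partition has a consumer, extra consumers are idle.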
We strongly recommend creating a mobile database to support data downloads. Downloading against the online database can significantly impact the performance of Mobile pods, databases, and networks.

To mitigate download failures, consider increasing the timeout value for the ingress controller. The default server/client timeout is set too low, affecting the pass rate. Use the following commands to raise the default values:
oc -n openshift-ingress-operator patch ingresscontroller/default --type=merge -p '{"spec":{"tuningOptions": {"clientTimeout": "300s"}}}'
oc -n openshift-ingress-operator patch ingresscontroller/default --type=merge -p '{"spec":{"tuningOptions": {"serverTimeout": "300s"}}}'
Scaling up the coreapi pod can enhance the downloading experience for the mobile app.
Consider scaling up the mobile pods when the CPU usage of a pod exceeds 4 cores.

Optimal disk throughput for the database is crucial for a smooth app downloading experience.

Observations from lab tests suggest that balanced node resource utilization is crucial for optimal performance. Note that the default topology spread constraint in the ManageWorkspace Custom Resource (CR) is set to "topologyKey: topology.kubernetes.io/zone". In a single-zone cluster, if pods are not evenly distributed across worker nodes, consider setting "topologyKey: topology.kubernetes.io/hostname" instead.
mongostat --username admin --password <password> --authenticationDatabase admin --ssl --sslAllowInvalidCertificates 2
mongotop --username admin --password <password> --authenticationDatabase admin --ssl --sslAllowInvalidCertificates 2
oc logs -n <mongo namespace> <mongo pod name> -c mongod | grep -iE 'Slow query'
db.currentOp({"active" : true,"secs_running" : { "$gt" : 3 },"ns" : /^msg/})
db.killOp("opid")
db.serverStatus().globalLock
db.serverStatus().mem
db.serverStatus().wiredTiger.cache
db.serverStatus().connections
Monitoring your OpenShift clusters is critical for environment health and quality of service. It helps ensure that all deployed workloads are running smoothly and that the environment is properly scoped.

OpenShift Container Platform includes a pre-installed monitoring stack that is based on Prometheus and Grafana. MAS also provides app-level Prometheus metrics and a set of Grafana dashboards for application health. More installation and configuration details can be found in IBM MAS Monitoring.

Best practices for the OpenShift Monitoring Service

enableUserWorkload: false

Below is a sample of the cluster-monitoring-config ConfigMap:
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    enableUserWorkload: true
    prometheusK8s:
      retention: 90d
      resources:
        requests:
          cpu: 200m
          memory: 2Gi
        limits:
          cpu: 2
          memory: 4Gi
      volumeClaimTemplate:
        spec:
          storageClassName: nfs-client
          resources:
            requests:
              storage: 300Gi
    alertmanagerMain:
      volumeClaimTemplate:
        spec:
          storageClassName: nfs-client
          resources:
            requests:
              storage: 20Gi
Note
We highly recommend using the OpenShift Insights Advisor to check for any issues related to the current version, the nodes, and misconfigurations. It is the first step in problem diagnosis.

Steps:
This setting controls how many processes can run within one single container. If it is too small, it can cause a fork-bomb-like failure. For example, a db2w instance may become unavailable when thousands of connections/agents arrive, or OpenShift Container Storage may not behave well with a large number of PVCs.
Out-of-the-box (OOB) values for OCP platforms:

| Platform Version | Default Value |
|---|---|
| IBM ROKS (4.8) | 231239 |
| AWS ROSA | 4096 in OpenShift 4.11 and higher |
| Azure Self-Managed OCP | 1024 |
Steps to check or update the PID limit:

$ oc debug node/$NODE_NAME
$ chroot /host
$ cat /etc/crio/crio.conf
# add / modify the line "pids_limit = <new value>"
# run the commands below to restart the CRI-O service and reboot the worker node
$ systemctl daemon-reload
$ systemctl restart crio
$ shutdown -r now
OpenShift HAProxy supports up to 20k connections per pod. Consider scaling up the ingress pods for any app (like IoT) with a high-volume connection workload.

Scale up the ingress controller:
+oc patch -n openshift-ingress-operator ingresscontroller/default --patch '{"spec":{"replicas": 3}}' --type=merge
One of the most important tunable parameters for HAProxy scalability is the `maxconn` parameter. The router can handle a maximum of 20k concurrent connections by using `oc adm router --max-connections=xxxxx`. This parameter is affected by the node settings `sysctl fs.nr_open` and `sysctl fs.file-max`. HAProxy will not start if maxconn is high but the node settings are low.
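As a rough sanity check for the maxconn-versus-node-limits interaction described above, the sketch below models the file-descriptor math. The two-descriptors-per-connection rule and the overhead constant are illustrative assumptions, not HAProxy's exact accounting:

```python
def haproxy_fd_estimate(maxconn: int, listeners: int = 1, overhead: int = 100) -> int:
    """Rule of thumb: roughly two file descriptors per proxied connection
    (client side + server side), plus listener sockets and fixed overhead.
    Illustrative only - not HAProxy's exact accounting."""
    return 2 * maxconn + listeners + overhead

def maxconn_fits(maxconn: int, fs_file_max: int, fs_nr_open: int) -> bool:
    """HAProxy refuses to start if its fd requirement exceeds the node limits."""
    needed = haproxy_fd_estimate(maxconn)
    return needed <= min(fs_file_max, fs_nr_open)

# 20k connections fit comfortably under typical defaults...
print(maxconn_fits(20_000, fs_file_max=1_048_576, fs_nr_open=1_048_576))  # True
# ...but not on a node with very low limits:
print(maxconn_fits(20_000, fs_file_max=1_024, fs_nr_open=1_024))          # False
```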
Note: OpenShift Container Platform no longer supports modifying Ingress Controller deployments by setting environment variables such as ROUTER_THREADS, ROUTER_DEFAULT_TUNNEL_TIMEOUT, ROUTER_DEFAULT_CLIENT_TIMEOUT, ROUTER_DEFAULT_SERVER_TIMEOUT, and RELOAD_INTERVAL. You can modify the Ingress Controller deployment, but if the Ingress Operator is enabled, the configuration is overwritten.
Starting with OCP 4.10, four load-balancing algorithms are available: source, roundrobin, random, and leastconn; the default is random. In versions before 4.10, there were three algorithms: source, roundrobin, and leastconn, with leastconn as the default. Set an annotation on each route to change the default algorithm if needed, e.g. `haproxy.router.openshift.io/balance=roundrobin`.
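To illustrate what the roundrobin annotation does, here is a minimal simulation of strict-rotation request distribution (the pod names are hypothetical):

```python
from itertools import cycle

def roundrobin_assign(backends, requests):
    """Distribute requests across backends in strict rotation, as the
    haproxy.router.openshift.io/balance=roundrobin annotation would."""
    rotation = cycle(backends)
    return [(req, next(rotation)) for req in requests]

assignments = roundrobin_assign(["pod-a", "pod-b", "pod-c"], range(6))
print([backend for _, backend in assignments])
# ['pod-a', 'pod-b', 'pod-c', 'pod-a', 'pod-b', 'pod-c']
```

By contrast, leastconn tracks the number of open connections per backend and sends each new request to the least-loaded one, which favors long-lived, uneven sessions.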
There is a wide selection of instance types comprising varying combinations of CPU, memory, disk, and network. Below are a few considerations:

The sizing numbers on this page are based on a standard workload. Use them as a reference only.

Use the Sizing Calculation Sheet for MAS sizing.

If using OCS to manage the storage class, the OCS service itself requires a minimum of 3 nodes with 14 cores / 32 GB (note: this is the total requested amount, not per node).

3 OCP nodes will run ODF services. (Note: OCP clusters often contain additional OCP worker nodes which do not run ODF services.) Each OCP node running ODF services has 16 cores / 64 GB memory.

Based on the benchmark results, for sizing we recommend a load of 50-75 users per MAS Manage UI server bundle pod, which is equivalent to a JVM with 2 cores on Maximo 7.6.x.
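The 50-75 users-per-pod guidance translates into a simple pod-count estimate. This sketch is a convenience calculation only, not a substitute for the Sizing Calculation Sheet:

```python
import math

def ui_pods_needed(concurrent_users: int, users_per_pod: int = 50) -> int:
    """MAS Manage UI server bundle pods for a given user load, defaulting
    to the conservative end (50) of the 50-75 users-per-pod guidance."""
    return max(1, math.ceil(concurrent_users / users_per_pod))

print(ui_pods_needed(300))                    # 6 pods at 50 users/pod
print(ui_pods_needed(300, users_per_pod=75))  # 4 pods at 75 users/pod
```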
| App | CPU Request (core) | CPU Limit (core) | Memory Request (GB) | Memory Limit (GB) |
|---|---|---|---|---|
| Add | 6 | 12 | 13 | 26 |
| Assist | 12.4 | 57.7 | 19.46 | 62.38 |
| Core | 1.5 | 18.95 | 6.27 | 32.5 |
| Health | 2.9 | 15.6 | 7.12 | 30.84 |
| HPU | 0.9 | 5.5 | 0.92 | 6.5 |
| IoT | 19.66 | 214.65 | 57.08 | 269 |
| Manage | 2.9 | 11.1 | 4.04 | 17 |
| Monitor | 5.4 | 32.4 | 12.84 | 55.5 |
| Optimizer | 7.4 | 19.3 | 25.57 | 117 |
| Predict | 3.1 | 12.5 | 6.13 | 24.5 |
| *Additional cost* | | | | |
| ocs* | 14 | 32 | 14 | 32 |
| cp4d (with 2 db2w instances)* | 31.59 | 40.7 | 235.39 | 249.70 |
| each additional manage pod* | 1 | 6 | 2 | 10 |
Info

A monitoring system is strongly recommended to track environment health and quality of service.
| Scope | Name | Used for |
|---|---|---|
| OCP | OpenShift Monitoring Service | OpenShift Cluster and MAS |
| DB2 | IBM DSM | DB2 historical and real-time troubleshooting |
| DB2 | db2top | DB2 real-time troubleshooting |
| DBTest | DBTest | A utility to test DB network latency and fetching time |
| Oracle | AWR, StatsPack | Historical troubleshooting |
| JVM | IBM Support Assistant | Heap dump and GC log analysis |
| JVM | MAT | JVM dump analysis |
| Maximo | PerfMon | Maximo UI activity tracing. Note: enabling PerfMon may significantly degrade server performance; recommended for a single user in Dev/Test environments only |
| MongoDB | mongotop | MongoDB real-time troubleshooting |
| HAR | HTTP Archive Viewer | HAR analysis, for web page and client-side (browser) performance |
| SQL | Poor SQL | Online SQL formatter |
| SQL | SQuirreL | Universal SQL client |
| SSL | SSL Shopper | Online certificate decode tool |
| OS | top | Process- and thread-level analysis, hotspot analysis; top is available in most containers and on OCP worker nodes |
| OS | sar | A system command used to monitor system resources such as CPU, memory, disk, and network |
| OCP | `oc debug node/<node name>` | Worker node debugging |
System performance depends on more than the applications and the database. The network architecture affects performance. Application server configuration can hurt or improve performance. The way that you deploy Maximo across servers affects how the products perform. Many other factors come into play in the end-user experience of system performance. Subsequent sections address the following topics:
db2top can be used for real-time diagnosis.

db2top -db <dbname>

list memory allocation:

db2mtrk -i -d -v

list long-running queries:

SELECT ELAPSED_TIME_MIN,SUBSTR(AUTHID,1,10) AS AUTH_ID, AGENT_ID,APPL_STATUS,SUBSTR(STMT_TEXT,1,20) AS SQL_TEXT FROM SYSIBMADM.LONG_RUNNING_SQL WHERE ELAPSED_TIME_MIN > 0 ORDER BY ELAPSED_TIME_MIN DESC;

list backup/restore status:

db2pd -barstats -d <dbname>

list most active tables:

SELECT SUBSTR(TABSCHEMA,1,10) AS SCHEMA,SUBSTR(TABNAME,1,20) AS NAME,TABLE_SCANS,ROWS_READ,ROWS_INSERTED,ROWS_DELETED FROM TABLE(MON_GET_TABLE('','',-1)) ORDER BY ROWS_READ DESC FETCH FIRST 5 ROWS ONLY

list most active indexes:

SELECT SUBSTR(TABSCHEMA,1,10) AS SCHEMA,SUBSTR(TABNAME,1,20) AS NAME,IID,NLEAF, NLEVELS,INDEX_SCANS,KEY_UPDATES,BOUNDARY_LEAF_NODE_SPLITS + NONBOUNDARY_LEAF_NODE_SPLITS AS PAGE_SPLITS FROM TABLE(MON_GET_INDEX('','',-1)) ORDER BY INDEX_SCANS DESC FETCH FIRST 5 ROWS ONLY

list db2 advisor recommendations for a statement:

db2advis -database bludb -s "select * from maximo.ahfactorhistory where ahdriverhistoryid = 123 for read only" -n MAXIMO -q MAXIMO
check for indexes that need to be rebuilt:

db2 reorgchk current statistics on schema 'MAXIMO' > /tmp/reorgchk.log

Any indexes or tables with an `*` in the REORG column are candidates for reorg.
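Scanning a large reorgchk log for those asterisks by eye is tedious; the sketch below automates the check. It assumes a simplified layout where the REORG flags are the last whitespace-separated field of each data row — real db2 reorgchk output has fixed-width columns and section headers, so adapt the parsing to your log:

```python
def reorg_candidates(reorgchk_lines):
    """Flag tables/indexes whose REORG column contains an '*'.
    Assumes the REORG flags are the last field on each data row
    (simplified; real reorgchk output is fixed-width with headers)."""
    candidates = []
    for line in reorgchk_lines:
        fields = line.split()
        if len(fields) >= 2 and "*" in fields[-1]:
            candidates.append(fields[0])
    return candidates

# Hypothetical, simplified rows for illustration:
sample = [
    "MAXIMO.ASSET      120  3456  ---",   # healthy
    "MAXIMO.WORKORDER  300  9876  *-*",   # reorg candidate
]
print(reorg_candidates(sample))  # ['MAXIMO.WORKORDER']
```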
list the query execution plan:

db2expln -database bludb -schema MAXIMO -package % -statement "select * from maximo.ahfactorhistory where ahdriverhistoryid = 123 for read only" -terminal -graph > query1_access_plan.txt

list all indexes for a specific table:

select * from syscat.indexes i where TABNAME ='ITEMSTRUCT'

list insert/update/delete/tablescan stats for a specific table:

SELECT rows_read,rows_inserted,rows_updated,rows_deleted,table_scans FROM TABLE(MON_GET_TABLE('MAXIMO','ASSET',-2))

list insert/update/delete/tablescan stats for all tables:

SELECT SUBSTR(TABSCHEMA,1,10) AS SCHEMA,SUBSTR(TABNAME,1,20) AS NAME,TABLE_SCANS,ROWS_READ,ROWS_INSERTED,ROWS_DELETED FROM TABLE(MON_GET_TABLE('','',-1)) ORDER BY ROWS_READ DESC FETCH FIRST 5 ROWS ONLY

list the top 10 biggest tables:

select creator, name, avgrowsize, card, stats_time, avgrowsize*card as tbsize, npages*t.pagesize/1024/1024 as tbsize_inMB from sysibm.systables t1, syscat.tablespaces t where creator not like 'DB2%' and t1.tbspace=t.tbspace order by tbsize desc fetch first 10 rows only

list data and index size for one table:

select tabschema, tabname, DATA_OBJECT_P_SIZE/1024 as data_inMB, INDEX_OBJECT_P_SIZE/1024 as index_inMB,LONG_OBJECT_P_SIZE/1024 LongObj_inMB, LOB_OBJECT_P_SIZE/1024 as LOB_inMB from table(sysproc.admin_get_tab_info('MAXIMO','WORKORDER'))

explain an error message:

db2 ? <sqlerror>
- `db2pd`: monitor and troubleshoot DB2 databases
- `db2diag`: db2diag log analysis tool
- `db2set`: DB2 global settings
- `db2 get dbm cfg`: DB2 database manager configuration
- `db2 get db cfg`: DB2 database configuration
: db2 database configuration IBM DSM is useful to do both real-time/ historical data diagnosis, find out the expensive sql query, justify cpu spent on sql execution or other e.g. sorting, parsing, fetching, io and so on. It requires pre-configuration.
+A high-level set up:
+notes: This utility requires Java version 11 or higher.
+The DBTest Utility has two modes:
Benchmark Mode (the default): measures database connection time, query execution time, and data fetching time for every 100 records.

Query Mode: displays the query result along with database connection time, query execution time, and data fetching time.

Here is an example demonstrating how to use this utility in the Maximo UI pod.
cd /tmp
curl -L -v -o run-dbtest-in-maxinst-pod.sh https://ibm-mas.github.io/mas-performance/pd/download/DBTest/run-dbtest-in-maxinst-pod.sh
bash run-dbtest-in-maxinst-pod.sh
# change to /tmp
cd /tmp

# download DBTest
curl -L -v -o DBTest.class https://ibm-mas.github.io/mas-performance/pd/download/DBTest/DBTest.class

# set DBURL. If this utility runs in the Maximo UI pod, set DBURL="$MXE_DB_URL"
export DBURL="<jdbc url>"
# or: export DBURL="$MXE_DB_URL"
# or: export DBURL="${MXE_DB_URL}sslTrustStoreLocation=${java_truststore};sslTrustStorePassword=${java_truststore_password};"
export DBUSERNAME='<username>'
export DBPASSWORD='<password>'
export SQLQUERY='select * from maximo.maxattribute'

# execute the utility in benchmark mode
java -classpath .:$(dirname "$(find /opt | grep "oraclethin.jar" | head -n 1)")/* DBTest
Result samples:

Given optimal network latency and a healthy database status, the expected data fetching time is less than 10 milliseconds.

Good result:

Bad result:

java -classpath .:$(dirname "$(find /opt | grep "oraclethin.jar" | head -n 1)")/* DBTest -q
Output sample:
(base) [~/javatool]$ java -classpath .:./lib/* DBTest -q
Dec. 06, 2023 11:49:47 A.M. DBTest getConnection
INFO: Loading Class took: 0.029 seconds
Dec. 06, 2023 11:49:53 A.M. DBTest getConnection
INFO: DB Connecting took: 6.55 seconds
Dec. 06, 2023 11:49:53 A.M. DBTest printResult
INFO: Query Execution took: 0.099 seconds
APP, OPTIONNAME, DESCRIPTION, ESIGENABLED, VISIBLE, ALSOGRANTS, ALSOREVOKES, PREREQUISITE, SIGOPTIONID, LANGCODE, HASLD, ROWSTAMP
---------------------------------------------------------------------------------------------------------------------------------
APIKEY, READ, Access to API Keys application, 0, 1, null, ALL, null, 200004204, EN, 0, 290874862
Dec. 06, 2023 11:49:54 A.M. DBTest printResult
INFO: Fetching Record took: 0.058 seconds
As a result of architectural modifications, Maximo 8.x (MAS Manage app) now operates on WebSphere Liberty Base 21.0.0.5 with OpenJ9 within the OpenShift Container Platform (OCP). It's essential to note that JVM arguments outlined in the 7.x Best Practice documentation may not be relevant or applicable to the Maximo 8.x environment. Here are additional details:
As of WebSphere Liberty 18.0.0.1, the thread growth algorithm is enhanced to grow thread capacity more aggressively so as to react more rapidly to peak loads. For many environments, the autonomic tuning provided by the Open Liberty thread pool works well with no configuration or tuning by the operator. If necessary, you can adjust the `coreThreads` and `maxThreads` values.
-Xgcpolicy:gencon

Gencon is the default policy in OpenJ9; this parameter works in both 7.x and 8.x.

-Xmx or -XX:MaxRAMPercentage (maximum heap size)

If -Xmx is not specified, the JVM uses 75% of the total container memory when -XX:+UseContainerSupport is set. When -Xmx is set, -XX:MaxRAMPercentage is ignored.

-XX:+UseContainerSupport / -XX:-UseContainerSupport

If -XX:+UseContainerSupport is set, the InitialRAMPercentage and MaxRAMPercentage values can be changed. -Xms and -Xmx can override these limits.

-Xmn (nursery space)

Setting the size of the nursery when using this policy can be very important for optimizing performance; 25-33% of the total heap is recommended. Note that the Manage pod memory limit of 10 GB is not the total heap size; heap size is based on the -Xmx or -XX:MaxRAMPercentage setting. The 10 GB also includes memory used by WebSphere for cache and compilation, as well as the Maximo mmi container.
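To make the heap arithmetic above concrete, here is a sketch that derives the default heap from the container limit (75%, per the -XX:MaxRAMPercentage note above) and the recommended -Xmn window. The function names and the 10 GiB example are illustrative:

```python
def heap_from_container(container_mem_mb: int, max_ram_percentage: float = 75.0) -> int:
    """With -XX:+UseContainerSupport and no explicit -Xmx, OpenJ9 sizes the
    heap as MaxRAMPercentage (default 75%) of the container memory limit."""
    return int(container_mem_mb * max_ram_percentage / 100)

def nursery_range_mb(max_heap_mb: int):
    """Recommended -Xmn window: 25-33% of the maximum heap, per the
    gencon guidance above."""
    return (max_heap_mb * 25 // 100, max_heap_mb * 33 // 100)

heap = heap_from_container(10 * 1024)  # 10 GiB Manage pod memory limit
print(heap)                   # 7680 MB heap
print(nursery_range_mb(heap)) # (1920, 2534) MB -Xmn window
```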
-Xgcthreads4

This parameter sets the number of threads that the garbage collector uses for parallel operations. By default, it is set to n-1 in OpenJ9, where n is the number of reported CPUs on the node. You might want to restrict the number of GC threads used by each VM to reduce some overhead.

-XcompilationThreads4

This parameter specifies the number of compilation threads used by the JIT compiler. As with GC threads, you might want to restrict the number of compilation threads used by each VM to reduce some overhead.

-Xshareclasses

This parameter shares class data between running VMs, which can reduce VM startup time once the cache has been created.

-Xdisableexplicitgc (Recommended)

This parameter disables explicit garbage collection, preventing System.gc() calls from triggering garbage collections. For optimal performance, disable explicit garbage collection.

-Djava.net.preferIPv4Stack=true / -Djava.net.preferIPv6Addresses=false

For performance reasons, Maximo recommends setting this property to true. Note: this parameter cannot be applied on hosts that communicate only over IPv6.

-XX:PermSize and -XX:MaxPermSize

The Maximo 7.x Best Practice recommends 320m. If you see an OOM related to PermSize, consider increasing it to 512 MB or higher.

-Xcodecache32m

The maximum value you can specify for -Xcodecache is 32 MB. The JIT compiler might allocate more than one code cache; this is controlled by -Xcodecachetotal, whose default value is 256 MB.

-Xverbosegclog

Enables the verbose GC log for garbage collection analysis.

-Xtune:virtualized (under review)

Optimizes OpenJ9 VM function for virtualized environments, such as a cloud, by reducing OpenJ9 VM CPU consumption when idle.
+Critical Note
IBM Maximo Cluster Performance Insights is offered "AS IS", WITH NO WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING THE WARRANTY OF TITLE, NON-INFRINGEMENT OR NON-INTERFERENCE AND THE IMPLIED WARRANTIES AND CONDITIONS OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.

The IBM Product Support you have purchased with your IBM Maximo Application Suite product does not cover this application extension. Do not attempt to submit an IBM support ticket.

The IBM TechXChange Maximo Community discussions can be leveraged to crowd-source assistance from Maximo experts.
IBM Maximo Cluster Performance Insights (Maximo CPI) is a new utility that uses short- and long-term snapshots to assess adherence to specific best practices for deploying Maximo Application Suite. It can assist in pinpointing areas that need improvement and provide actionable insights for optimizing the MAS deployment.

Maximo clients can conduct a self-assessment to ensure adherence to best practices, optimize resource use, and diagnose performance issues. This process helps in evaluating current practices, identifying areas for improvement, and enhancing overall efficiency and effectiveness.

The utility gathers only metrics data, excluding any sensitive information. It is containerized for ease of use.
+Run on Docker
docker pull quay.io/brianzhu_ibm/mcpi:latest
docker run -dit -p 8888:8888 --name mcpi quay.io/brianzhu_ibm/mcpi:latest
docker exec -it --user root mcpi bash
oc login https://<openshift-master-url>:<port> -u <username> -p <password>
or oc login https://<openshift-master-url>:<port> --token=<token>
collect-metric.sh
Run on OpenShift Cluster
oc login https://<openshift-master-url>:<port> -u <username> -p <password>
or oc login https://<openshift-master-url>:<port> --token=<token>
collect-metric.sh
When trying to diagnose a request timeout problem, it is helpful to rule out gateways/load balancers outside the OCP cluster. Sometimes these external gateways can have short timeouts which reset a connection before the request is completed. The Ping test utility is designed to help diagnose this issue.
+IMPORTANT
As of this writing, the Ping test utility is not part of the base server bundle code and needs to be loaded via a customization archive. This means that the ManageWorkspace CR needs to be updated, which requires a restart of the server bundles (i.e. it will cause a disruption while the server bundle pod is restarted).
Single Customization Archive

spec:
  settings:
    customization:
      customizationArchive: https://ibm.box.com/shared/static/oj9z062b7x8hcywndnv6n57gya7e1ypz.zip
If you already have a customization archive, add it to the customizationList:

spec:
  settings:
    customizationList:
      - customizationArchiveName: archiveAlias1
        customizationArchiveUrl: https://ibm.box.com/shared/static/oj9z062b7x8hcywndnv6n57gya7e1ypz.zip
$ curl https://tenant1.manage.masperf4.ibmmas.com/maximo/ping?timeout=1
{"thread wait time":"1 seconds","status":"ok"}
$
Using the Ping servlet utility to test request timeouts inside the OCP cluster (see the example below).

$ curl https://tenant1.manage.masperf4.ibmmas.com/maximo/ping?timeout=300
$ curl --insecure https://172.30.237.166:443/maximo/ping?timeout=300
{"thread wait time":"300 seconds","status":"ok"}
$
If you receive a response from the request issued to the internal Cluster IP address of the MAS Manage UI service, but do not receive a response when the request is issued externally from outside the cluster, it could be that an external gateway service or load balancer is closing the connection due to a shorter timeout set on the gateway. Check with a network administrator.
+ +' + escapeHtml(summary) +'
' + noResultsText + '
'); + } +} + +function doSearch () { + var query = document.getElementById('mkdocs-search-query').value; + if (query.length > min_search_length) { + if (!window.Worker) { + displayResults(search(query)); + } else { + searchWorker.postMessage({query: query}); + } + } else { + // Clear results for short queries + displayResults([]); + } +} + +function initSearch () { + var search_input = document.getElementById('mkdocs-search-query'); + if (search_input) { + search_input.addEventListener("keyup", doSearch); + } + var term = getSearchTermFromLocation(); + if (term) { + search_input.value = term; + doSearch(); + } +} + +function onWorkerMessage (e) { + if (e.data.allowSearch) { + initSearch(); + } else if (e.data.results) { + var results = e.data.results; + displayResults(results); + } else if (e.data.config) { + min_search_length = e.data.config.min_search_length-1; + } +} + +if (!window.Worker) { + console.log('Web Worker API not supported'); + // load index in main thread + $.getScript(joinUrl(base_url, "search/worker.js")).done(function () { + console.log('Loaded worker'); + init(); + window.postMessage = function (msg) { + onWorkerMessage({data: msg}); + }; + }).fail(function (jqxhr, settings, exception) { + console.error('Could not load worker.js'); + }); +} else { + // Wrap search in a web worker + var searchWorker = new Worker(joinUrl(base_url, "search/worker.js")); + searchWorker.postMessage({init: true}); + searchWorker.onmessage = onWorkerMessage; +} diff --git a/search/search_index.json b/search/search_index.json new file mode 100644 index 0000000..d75c467 --- /dev/null +++ b/search/search_index.json @@ -0,0 +1 @@ +{"config":{"indexing":"full","lang":["en"],"min_search_length":3,"prebuild_index":false,"separator":"[\\s\\-]+"},"docs":[{"location":"","text":"Welcome to MAS Performance Wiki \uf0c1 Info This site will be updated periodically. More topics will be added soon. 
Lab benchmarks are not published, but can be shared upon request, with completion of an NDA. This site provides best practices, sizing and troubleshooting guidelines to improve the performance of IBM Maximo Application Suite (MAS) . Maximo 7.x Best Practices are also available on the site. Most DB configurations in the best practice are still applicable to MAS Manage app.","title":"Home"},{"location":"#welcome-to-mas-performance-wiki","text":"Info This site will be updated periodically. More topics will be added soon. Lab benchmarks are not published, but can be shared upon request, with completion of an NDA. This site provides best practices, sizing and troubleshooting guidelines to improve the performance of IBM Maximo Application Suite (MAS) . Maximo 7.x Best Practices are also available on the site. Most DB configurations in the best practice are still applicable to MAS Manage app.","title":"Welcome to MAS Performance Wiki"},{"location":"mas/aws/bestpractice/","text":"AWS \uf0c1 Instance Type \uf0c1 There are many instance types available in AWS. Based on the benchmark, recommend M5, M6 instances (e.g.M5.4xlarge) as master or worker nodes and P3, P4 as GPU nodes. Note Depending on the regions, some instances may not be available. Use AWS Pricing Calculator to check the instance availability and cost. g4dn can be used as GPU node for test/dev env, but not recommended for production env. If the application requires a good network performance, check Amazon EC2 instance network bandwidth site for more details. For production env, an instance with 10GB ethernet is recommended. Classic Load Balancer Idle Timeout \uf0c1 Each OCP cluster creates 1 class load balancer and 2 network load balancers in AWS. AWS classic load balancer has a default idle time 60 seconds. In some cases, this value is not enough for a long time transaction (e.g. asset health check notebook). Consider to adjust this value to what the application needs (e.g. 300 seconds). 
Also, monitoring classic load-balance performance is strongly recommend, particularly with IoT related app. (Note: Surge Queue Length's defaults to a hardcoded limit of 1024 . When queue is fully, the tcp handshake will fail) Amazon DocumentDB \uf0c1 DocumentDB is a fully managed MongoDB compatibility database. It can be used by both IBM Suite License Service (SLS) and MAS Core. However, there are functional differences between DocumentDB and MongoDB. Check this link for more details. Note: When using DocumentDB, it requires to set RetryWrite=false in SLS and Suite CRs. Amazon MSK \uf0c1 MAS supports MSK which is a fully managed apache Kafka service. Note monitor MSK performance via CloudWatch is strongly recommended. Key metrics include Disk usage by broker, CPU (User) usage by broker, Active Controller Count, Network RX packets by broker, Network TX packets by broker . define an appropriate config for Kafka, MSK and topics. e.g. retention.ms, retention.bytes, partitions and replics to support the workload. AWS Storage \uf0c1 EBS storages like gp2, gp3 are supported by OCP in AWS. Note: EBS storage is ReadWriteOnce. The volume can be mounted as read-write by a single node. io1 and io2 are SSD-based EBS that provides the higher performance. Check Amazon EBS volume types for extra info like throughput, tuning and cost. Below is a sample yaml to create io1 storageclass with 100 iopsPerGB . kind : StorageClass apiVersion : storage.k8s.io/v1 metadata : name : io1 provisioner : kubernetes.io/aws-ebs parameters : encrypted : 'true' iopsPerGB : '100' type : io1 reclaimPolicy : Delete allowVolumeExpansion : true volumeBindingMode : Immediate EFS Storage can be used as ReadWriteMany storageclass. EFS has different metered throughput modes. Bursting Throughput mode is the default. It is inexpensive, but does NOT perform well if all burst credits are used. Monitor BurstCreditBalance metric in CloudWatch. Provisioned Throughput mode is relatively expensive. 
It can drive up to 3 GiBps for read operations and 1 GiBps for write operations per file system More info can be found at Amazon EFS performance Self-managed OCP vs AWS ROSA \uf0c1 A self-managed OCP Cluster can be created by the installer cli tool that supports both IPI and UPI mode. It requires self maintenance and upgrades. Alternatively, ROSA is a managed Red Hat OpenShift Service. Each ROSA cluster comes with a fully managed control plane and compute nodes. Installation, management, maintenance, and upgrades are performed by Red Hat site reliability engineers (SRE) with joint Red Hat and Amazon support.","title":"AWS"},{"location":"mas/aws/bestpractice/#aws","text":"","title":"AWS"},{"location":"mas/aws/bestpractice/#instance-type","text":"There are many instance types available in AWS. Based on the benchmark, recommend M5, M6 instances (e.g.M5.4xlarge) as master or worker nodes and P3, P4 as GPU nodes. Note Depending on the regions, some instances may not be available. Use AWS Pricing Calculator to check the instance availability and cost. g4dn can be used as GPU node for test/dev env, but not recommended for production env. If the application requires a good network performance, check Amazon EC2 instance network bandwidth site for more details. For production env, an instance with 10GB ethernet is recommended.","title":"Instance Type"},{"location":"mas/aws/bestpractice/#classic-load-balancer-idle-timeout","text":"Each OCP cluster creates 1 class load balancer and 2 network load balancers in AWS. AWS classic load balancer has a default idle time 60 seconds. In some cases, this value is not enough for a long time transaction (e.g. asset health check notebook). Consider to adjust this value to what the application needs (e.g. 300 seconds). Also, monitoring classic load-balance performance is strongly recommend, particularly with IoT related app. (Note: Surge Queue Length's defaults to a hardcoded limit of 1024 . 
When queue is fully, the tcp handshake will fail)","title":"Classic Load Balancer Idle Timeout"},{"location":"mas/aws/bestpractice/#amazon-documentdb","text":"DocumentDB is a fully managed MongoDB compatibility database. It can be used by both IBM Suite License Service (SLS) and MAS Core. However, there are functional differences between DocumentDB and MongoDB. Check this link for more details. Note: When using DocumentDB, it requires to set RetryWrite=false in SLS and Suite CRs.","title":"Amazon DocumentDB"},{"location":"mas/aws/bestpractice/#amazon-msk","text":"MAS supports MSK which is a fully managed apache Kafka service. Note monitor MSK performance via CloudWatch is strongly recommended. Key metrics include Disk usage by broker, CPU (User) usage by broker, Active Controller Count, Network RX packets by broker, Network TX packets by broker . define an appropriate config for Kafka, MSK and topics. e.g. retention.ms, retention.bytes, partitions and replics to support the workload.","title":"Amazon MSK"},{"location":"mas/aws/bestpractice/#aws-storage","text":"EBS storages like gp2, gp3 are supported by OCP in AWS. Note: EBS storage is ReadWriteOnce. The volume can be mounted as read-write by a single node. io1 and io2 are SSD-based EBS that provides the higher performance. Check Amazon EBS volume types for extra info like throughput, tuning and cost. Below is a sample yaml to create io1 storageclass with 100 iopsPerGB . kind : StorageClass apiVersion : storage.k8s.io/v1 metadata : name : io1 provisioner : kubernetes.io/aws-ebs parameters : encrypted : 'true' iopsPerGB : '100' type : io1 reclaimPolicy : Delete allowVolumeExpansion : true volumeBindingMode : Immediate EFS Storage can be used as ReadWriteMany storageclass. EFS has different metered throughput modes. Bursting Throughput mode is the default. It is inexpensive, but does NOT perform well if all burst credits are used. Monitor BurstCreditBalance metric in CloudWatch. 
Provisioned Throughput mode is relatively expensive. It can drive up to 3 GiBps for read operations and 1 GiBps for write operations per file system. More info can be found at Amazon EFS performance.","title":"AWS Storage"},{"location":"mas/aws/bestpractice/#self-managed-ocp-vs-aws-rosa","text":"A self-managed OCP cluster can be created with the installer CLI tool, which supports both IPI and UPI modes. Maintenance and upgrades are your own responsibility. Alternatively, ROSA is a managed Red Hat OpenShift Service. Each ROSA cluster comes with a fully managed control plane and compute nodes. Installation, management, maintenance, and upgrades are performed by Red Hat site reliability engineers (SREs) with joint Red Hat and Amazon support.","title":"Self-managed OCP vs AWS ROSA"},{"location":"mas/azure/bestpractice/","text":"Azure \uf0c1 Azure Storage \uf0c1 For the OCP cluster, Premium File Storage is recommended, because MAS and its components need RWX (ReadWriteMany) storage to support a certain level of high availability, as well as doclink and JMS storage\u2026 For an external DB VM, high-performance storage such as Premium SSD, Premium SSD v2, or Ultra Disk is recommended.
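To illustrate the RWX recommendation for Azure, a premium file share StorageClass using the Azure Files CSI driver might look like the sketch below. The class name is a hypothetical placeholder; the provisioner and skuName are standard Azure Files CSI parameters, but verify them against your cluster's driver version.

```yaml
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: azurefiles-premium    # hypothetical name
provisioner: file.csi.azure.com
parameters:
  skuName: Premium_LRS        # premium (SSD-backed) file shares for RWX workloads
reclaimPolicy: Delete
allowVolumeExpansion: true
volumeBindingMode: Immediate
```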
More performance metrics can be found at https://learn.microsoft.com/en-us/azure/virtual-machines/disks-types.","title":"Azure Storage"},{"location":"mas/core/bestpractice/","text":"MAS Core \uf0c1 The MAS core namespace contains several important services required for user login and authentication, application management, MAS adoption metrics, licensing, etc. To understand the functionality of each service/pod in MAS core, check MAS Pods Explained. Scaling MAS core for a large number of concurrent users \uf0c1 The following are the key components/dependencies that require scaling as the number of concurrent MAS users grows. MongoDB (used extensively by coreidp, api-licensing, adoptionusage, and other MAS/SLS microservices) MAS core namespace: coreidp pods licencing-mediator pods coreapi pods (if users log in directly to a MAS application, bypassing the Suite navigator page, this decreases the load on coreapi pods) SLS namespace: api-licensing pods k8s apiserver pods (coreapi pods issue k8s API calls to retrieve information from MAS application CRs, configmaps, etc.) Caveat The scaling guidance described below comes from lab benchmark testing and may vary based on differences in workload, environment, or configuration settings. MongoDB \uf0c1 MongoDB is a crucial dependency for MAS core services; if not scaled properly, MongoDB can quickly become a bottleneck as the number of concurrent users increases. A common symptom of an undersized MongoDB cluster is liveness probe timeouts and pod restarts of the MAS core services that depend on MongoDB (e.g. coreidp). For useful MongoDB troubleshooting commands, see MongoDB Troubleshooting. Key MongoDB metrics to monitor \uf0c1 The following MongoDB metrics are important to monitor. Memory utilization: by default, MongoDB will attempt to cache the active data set in memory (in the WiredTiger cache).
If there are a large number of cache evictions, or the mongod servers are OOM-killed, these can be indicators that the memory allocation is too small; consider increasing the memory allocated to the mongod servers. CPU utilization: check that the mongod servers have not reached their allocated CPU limit. Average read/write latency: average read and write latency should be under 50 milliseconds. If not, it could be due to an undersized MongoDB cluster; check that the MongoDB cluster has sufficient memory allocation and check disk performance. Lock waiters: a large number of lock waiters indicates contention on collections/documents in MongoDB. Tip When using the ibm.mas_devops collection to install MAS, you can optionally install Grafana with the cluster_monitoring Ansible role. Once Grafana is installed via the cluster_monitoring Ansible role, you can then install MongoDB using the mongodb Ansible role, which includes a Grafana dashboard for monitoring the MongoDB cluster. If you're using a MongoDB cluster hosted by a cloud provider, use the monitoring dashboards provided by the cloud provider. Important MongoDB databases and collections \uf0c1 The following databases and collections in MongoDB are accessed frequently during user login and authentication. Database: mas_{{mas-instance-id}}_core Collection: User (user lookup during authentication) Collection: OauthToken (token creation/deletion) Database: {{sls-id}}_sls_licensing Collection: licenses (license check-in/check-out) Database: mas_{{mas-instance-id}}_adoptionusage Collection: users (daily adoption usage statistics) Collection: users_hourly (hourly adoption usage statistics) Scaling MongoDB community \uf0c1 The table below provides some general guidance on scaling MongoDB based on the number of concurrent users and login rate. To scale MongoDB Community Edition, specify the desired CPU/memory limits in the MongoDBCommunity CR.
```yaml
spec:
  statefulSet:
    spec:
      template:
        spec:
          containers:
            - name: mongod
              resources:
                limits:
                  cpu:
```
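The snippet above is cut off at the cpu limit. As an illustration only, a filled-in override could look like the following sketch; the CPU and memory values are hypothetical placeholders, not sizing guidance from this document, so size them to your own benchmark results.

```yaml
# Hypothetical values for illustration; replace with sizes derived from your workload.
spec:
  statefulSet:
    spec:
      template:
        spec:
          containers:
            - name: mongod
              resources:
                limits:
                  cpu: "4"
                  memory: 16Gi
                requests:
                  cpu: "2"
                  memory: 16Gi
```

Setting requests as well as limits helps the scheduler place the mongod pods on nodes with enough capacity, which matters once MongoDB becomes the login-path bottleneck described earlier.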