Instance Type


Many instance types are available in AWS. Based on benchmark results, M5 or M6 instances (e.g. m5.4xlarge) are recommended for master and worker nodes, and P3 or P4 instances for GPU nodes.



  • Some instance types are not available in every region. Use the AWS Pricing Calculator to check instance availability and cost.
  • g4dn instances can be used as GPU nodes in test/dev environments, but they are not recommended for production.
  • If the application requires good network performance, check the Amazon EC2 instance network bandwidth site for details. For production environments, an instance with 10 Gigabit Ethernet is recommended.

Classic Load Balancer Idle Timeout


Each OCP cluster creates 1 classic load balancer and 2 network load balancers in AWS. The AWS classic load balancer has a default idle timeout of 60 seconds. In some cases this is not enough for long-running transactions (e.g. an asset health check notebook). Consider raising the value to what the application needs (e.g. 300 seconds).


Monitoring classic load balancer performance is also strongly recommended, particularly with IoT-related applications. (Note: the surge queue length has a hard-coded limit of 1024. When the queue is full, the TCP handshake fails.)


Amazon DocumentDB


DocumentDB is a fully managed, MongoDB-compatible database. It can be used by both IBM Suite License Service (SLS) and MAS Core. However, there are functional differences between DocumentDB and MongoDB. Check this link for more details.


Note: When using DocumentDB, RetryWrite=false must be set in the SLS and Suite CRs.


Amazon MSK


MAS supports MSK, which is a fully managed Apache Kafka service.

  • Monitoring MSK performance via CloudWatch is strongly recommended. Key metrics include disk usage by broker, CPU (User) usage by broker, Active Controller Count, Network RX packets by broker, and Network TX packets by broker.
  • Define an appropriate configuration for Kafka, MSK, and topics (e.g. retention.ms, retention.bytes, partitions, and replicas) to support the workload.

AWS Storage


EBS storage types such as gp2 and gp3 are supported by OCP in AWS. Note: EBS storage is ReadWriteOnce, so a volume can be mounted as read-write by only a single node. io1 and io2 are SSD-based EBS volume types that provide higher performance. Check Amazon EBS volume types for extra information such as throughput, tuning, and cost.


Below is a sample YAML that creates an io1 storageclass with 100 iopsPerGB.

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: io1
provisioner: kubernetes.io/aws-ebs
parameters:
  encrypted: 'true'
  iopsPerGB: '100'
  type: io1
reclaimPolicy: Delete
allowVolumeExpansion: true
volumeBindingMode: Immediate

EFS storage can be used as a ReadWriteMany storageclass. EFS has different metered throughput modes.

  • Bursting Throughput mode is the default. It is inexpensive, but does NOT perform well once all burst credits are used. Monitor the BurstCreditBalance metric in CloudWatch.
  • Provisioned Throughput mode is relatively expensive. It can drive up to 3 GiBps for read operations and 1 GiBps for write operations per file system.
  • More information can be found at Amazon EFS performance.

Self-managed OCP vs AWS ROSA


A self-managed OCP cluster can be created with the installer CLI tool, which supports both IPI and UPI modes. The customer is responsible for its maintenance and upgrades. Alternatively, ROSA is a managed Red Hat OpenShift Service. Each ROSA cluster comes with a fully managed control plane and compute nodes. Installation, management, maintenance, and upgrades are performed by Red Hat site reliability engineers (SRE) with joint Red Hat and Amazon support.

Azure Storage


For the OCP cluster, Premium File Storage is recommended, because MAS and its components need RWX (ReadWriteMany) storage to support a certain level of high availability, as well as for doclink and JMS storage.


For an external DB VM, high-performance storage such as Premium SSD, Premium SSD v2, or Ultra Disk is recommended. More performance metrics can be found at https://learn.microsoft.com/en-us/azure/virtual-machines/disks-types.

MAS Core


The MAS core namespace contains several important services required for user login and authentication, application management, MAS adoption metrics, licensing, etc. To understand what each service/pod in MAS core does, check MAS Pods Explained.


Scaling MAS core for large number of concurrent users


The following are the key components/dependencies that require scaling as the number of concurrent MAS users grows.

  • MongoDB (used extensively by coreidp, api-licensing, adoptionusage, and other MAS/SLS microservices)
  • MAS core namespace:
    • coreidp pods
    • licensing-mediator pods
    • coreapi pods (if users log in directly to a MAS application, bypassing the Suite navigator page, this decreases the load on coreapi pods)
  • SLS namespace:
    • api-licensing pods
  • k8s apiserver pods (coreapi pods issue k8s API calls to retrieve information from MAS application CRs, configmaps, etc.)



The scaling guidance described below is provided from lab benchmark testing and may vary based on the differences in workload, environment, or configuration settings.




MongoDB is a crucial dependency for MAS core services. If it is not scaled properly, MongoDB can quickly become a bottleneck as the number of concurrent users increases. A common symptom of an undersized MongoDB cluster is liveness probe timeouts and pod restarts in the MAS core services that depend on MongoDB (e.g. coreidp).


For useful MongoDB troubleshooting commands see MongoDB Troubleshooting


Key MongoDB metrics to monitor


The following MongoDB metrics are important to monitor:

  • Memory utilization: by default MongoDB attempts to cache the active data set in memory (in the WiredTiger cache). A large number of cache evictions, or mongod servers being OOM-killed, can indicate that the memory allocation is too small. Consider increasing the memory allocated to the mongod servers.
  • CPU utilization: check that the mongod servers have not reached their allocated CPU limit.
  • Average read/write latency: average read and write latency should be under 50 milliseconds. If it is not, the MongoDB cluster may be undersized. Check that the MongoDB cluster has sufficient memory allocated and check disk performance.
  • Lock waiters: a large number of lock waiters indicates contention on collections/documents in MongoDB.



When using the ibm.mas_devops collection to install MAS, you can optionally install Grafana with the cluster_monitoring ansible role. Once Grafana is installed, you can install MongoDB using the mongodb ansible role, which includes a Grafana dashboard for monitoring the MongoDB cluster.


If you are using a MongoDB cluster hosted by a cloud provider, use the monitoring dashboards provided by the cloud provider.


Important MongoDB databases and collections


The following databases and collections in MongoDB are accessed frequently during user login and authentication.

  • Database: mas_{{mas-instance-id}}_core
    • Collection: User (user lookup during authentication)
    • Collection: OauthToken (token creation/deletion)
  • Database: {{sls-id}}_sls_licensing
    • Collection: licenses (checkin/checkout licenses)
  • Database: mas_{{mas-instance-id}}_adoptionusage
    • Collection: users (daily adoption usage statistics)
    • Collection: users_hourly (hourly adoption usage statistics)

Scaling MongoDB community


The table below provides some general guidance on scaling MongoDB based on the number of concurrent users and login rate. To scale MongoDB community edition, specify the desired CPU/memory limits in the MongoDBCommunity CR.

spec:
  statefulSet:
    spec:
      template:
        spec:
          containers:
          - name: mongod
            resources:
              limits:
                cpu: <cpu limit>
                memory: <mem limit>

Login rate (logins/minute) | MongoDB CPU limit | MongoDB Memory limit (GB)

Scaling coreidp service (MAS core namespace)


The table below provides some general guidance on scaling the coreidp service based on number of concurrent users and login rate. To scale the coreidp service use the podTemplates workload customization feature in MAS.

Login rate (logins/minute) | coreidp replicas | coreidp CPU limit | coreidp Memory limit (GB)

Scaling licensing-mediator service (MAS core namespace)


The table below provides some general guidance on scaling the licensing-mediator service based on number of concurrent users and login rate. The coreidp service calls the licensing-mediator service which in turn calls the api-licensing service in the SLS namespace for license checkin/checkout operations. To scale the licensing-mediator service use the podTemplates workload customization feature in MAS.

Login rate (logins/minute) | licensing-mediator replicas | licensing-mediator CPU limit | licensing-mediator Memory limit (GB)

Scaling api-licensing service (SLS namespace)


The table below provides some general guidance on scaling the api-licensing service based on number of concurrent users and login rate. To scale the api-licensing service use the podTemplates workload customization feature in MAS.

Login rate (logins/minute) | api-licensing replicas | api-licensing CPU limit | api-licensing Memory limit (GB)

Scaling coreapi service (MAS core namespace)


The table below provides some general guidance on scaling the coreapi service based on number of concurrent users and login rate. To scale the coreapi service use the podTemplates workload customization feature in MAS.

Login rate (logins/minute) | coreapi replicas | coreapi CPU limit | coreapi Memory limit (GB)
IBM Cloud


IBM Storage


IBM Cloud provides both block and file storage for OCP. Both storage types support ReadWriteMany access. If the application requires high-performance disks, consider setting up a custom performance storageclass as shown below:


block storage sample yaml

allowVolumeExpansion: true
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: block100p
parameters:
  billingType: hourly
  classVersion: "2"
  fsType: ext4
  sizeIOPSRange: |-
    [20-1999]Gi:[100-100]
  type: Performance
provisioner: ibm.io/ibmc-block
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer

file storage sample yaml

allowVolumeExpansion: true
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: file100p
parameters:
  billingType: hourly
  classVersion: "2"
  fsType: ext4
  sizeIOPSRange: |-
    [20-1999]Gi:[100-100]
  type: Performance
provisioner: ibm.io/ibmc-file
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer

IBM External Load Balancer


If the built-in ingress load balancer in OCP cannot scale to handle "large" workloads (100K+ concurrent device connections), consider provisioning an instance of the IBM Cloud NLB 2.0 (IPVS/KeepAlived) load balancer.




IBM ROKS is a managed Red Hat OpenShift Service in IBM Cloud. Each ROKS cluster comes with a fully managed control plane and compute nodes. Installation, management, maintenance, and upgrades are performed by Red Hat site reliability engineers (SRE) with joint Red Hat and IBM Cloud support.

MQTT vs HTTP Messaging


The MQTT protocol is the preferred messaging protocol for data ingest into the MAS IoT service. HTTP messaging support was added to MAS IoT for low-volume scenarios and is not designed for message rates greater than 1K msgs/sec.


MQTT message ingest rates are 2-3 orders of magnitude faster than HTTP. The primary reason is that HTTP messaging requires a TLS handshake and authentication on every message published. The authentication requires a database lookup for the device authentication token. As such, HTTP messaging puts a strain on the authentication service and the IoT database.


In order to achieve high data ingest rates with MAS IoT service, use the MQTT protocol and keep the device connection open while publishing messages.
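The difference between keeping the connection open and reconnecting for every message can be sketched in Python. This is only an illustration: FakeMqttClient is a hypothetical in-memory stand-in (not a real MQTT library), and the topic string is an assumed placeholder. The sketch simply counts how many connections each pattern opens.

```python
class FakeMqttClient:
    """Hypothetical in-memory stand-in for an MQTT client (not a real library)."""

    def __init__(self):
        self.connects = 0       # number of connections opened
        self.published = []     # (topic, payload) pairs accepted
        self.connected = False

    def connect(self):
        self.connects += 1
        self.connected = True

    def publish(self, topic, payload):
        if not self.connected:
            raise RuntimeError("publish called without an open connection")
        self.published.append((topic, payload))

    def disconnect(self):
        self.connected = False


def publish_all(client, topic, payloads):
    """Recommended pattern: connect once, publish in a loop, then disconnect."""
    client.connect()
    try:
        for payload in payloads:
            client.publish(topic, payload)
    finally:
        client.disconnect()


def publish_each_with_reconnect(client, topic, payloads):
    """Anti-pattern: a new connection (and handshake) for every message."""
    for payload in payloads:
        client.connect()
        client.publish(topic, payload)
        client.disconnect()


TOPIC = "example/device/events"  # assumed placeholder topic
messages = [f"reading-{i}" for i in range(100)]

good, bad = FakeMqttClient(), FakeMqttClient()
publish_all(good, TOPIC, messages)
publish_each_with_reconnect(bad, TOPIC, messages)
print(good.connects, bad.connects)  # 1 100
```

With a real client every extra connection also costs a TLS handshake and an authentication lookup, which is why the first pattern scales better.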


Best practice messaging pattern

MQTT PUBLISH (in loop until all messages are published)

Messaging Anti-pattern


Data Ingest rates, devices, and connections


The MQTT service in MAS IoT was designed to handle many device connections, each publishing at low rates. As such, a data ingest application for MAS IoT should distribute the load over many MQTT devices or applications in order to maximize message rates. Single device or application connections will be throttled based on the IoT Fair Use Policy (see below).


IoT Fair Use Policy


IoT data ingest throttling limits are per device and are based on the device class (i.e. Device, Gateway, Application). These limits are in place to prevent DoS attacks from rogue (i.e. badly behaving) devices. The throttling limits do not scale with the MAS IoT deployment size. For more information on MAS IoT messaging quotas see https://www.ibm.com/docs/en/mapms/1_cloud?topic=features-quotas


Messaging QoS


The messaging QoS specified when publishing an MQTT message also has a strong impact on messaging rates.


QoS in order of fastest to slowest:

  • QoS 0 - at most once (data loss possible, no message persistence or ACKs)
  • QoS 1 - at least once (duplicates are possible, messages persisted and ACKed)
  • QoS 2 - exactly once (application client required to maintain state, messages persisted and two phase commit between client/server)

QoS >0 performance considerations

  • QoS >0 requires disk persistence in the MAS IoT messaging components, so disk I/O performance becomes critical.
  • The MQTT specification provides a form of protocol-level flow control between client and server. The number of unacknowledged messages allowed on the session is negotiated between client and server; if the client has no more available message IDs, it must pause publishing until message IDs become available. Message IDs become available when message ACKs are received. See the MQTT spec for details: https://docs.oasis-open.org/mqtt/mqtt/v5.0/os/mqtt-v5.0-os.html#_QoS_1:_At
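The flow control described above can be pictured as a fixed pool of message IDs. The following is a minimal sketch, not a real client implementation: the InflightWindow class and its method names are hypothetical, and the window size stands in for whatever limit the client and server negotiate (Receive Maximum in MQTT v5).

```python
from collections import deque


class InflightWindow:
    """Hypothetical model of the pool of message IDs for QoS > 0 publishes."""

    def __init__(self, max_inflight):
        # IDs that may be used for new publishes.
        self.free_ids = deque(range(1, max_inflight + 1))
        # IDs sent but not yet acknowledged.
        self.unacked = set()

    def try_send(self):
        """Return a message ID, or None when the publisher must pause."""
        if not self.free_ids:
            return None
        msg_id = self.free_ids.popleft()
        self.unacked.add(msg_id)
        return msg_id

    def on_ack(self, msg_id):
        """An ACK releases the ID so publishing can resume."""
        self.unacked.remove(msg_id)
        self.free_ids.append(msg_id)


window = InflightWindow(max_inflight=2)
first = window.try_send()    # 1
second = window.try_send()   # 2
blocked = window.try_send()  # None: window is full, publisher pauses
window.on_ack(first)
resumed = window.try_send()  # 1 again, released by the ACK
print(first, second, blocked, resumed)
```

The sketch shows why slow ACKs (for example, due to slow disk persistence on the server side) directly limit the publish rate of a single connection.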

Summary of factors that influence data ingestion rate

  • Choice of messaging protocol: MQTT (high volume) vs HTTP (low volume)
  • Messaging pattern: do NOT close MQTT sessions after each message is published. Leave connections open.
  • Number of devices: higher message rates are possible when the load is distributed over more connections
  • Choice of device class: fair use quotas are based on device class (https://www.ibm.com/docs/en/mapms/1_cloud?topic=features-quotas)
  • Choice of QoS: higher levels of messaging guarantees come with higher costs

IoT Deployment

  • The IoT CRD defines 3 default deployment sizes (dev, small, and medium) that control the default settings for pod replicas, CPU, and memory. For production, medium is required.
    Sample YAML for a medium deployment in the IoT CR:
    apiVersion: iot.ibm.com/v1
    kind: IoT
    metadata:
      name: masinst1
      namespace: mas-masinst1-iot
    spec:
      bindings:
        jdbc: system
        kafka: system
        mongo: system
      settings:
        deployment:
          size: medium
  • To adjust the default settings for a deployment size, go to the iot-operator pod and change the corresponding YAML files under the /opt/ansible/roles/<ibm-iot-operator>/vars folder, e.g. /opt/ansible/roles/ibm-iot-actions/vars/size_medium.yml

Connection and OpenShift Ingress Controllers


OpenShift HAProxy supports 20K connections per router pod. The total number of connections determines how many end devices can connect to the IoT MSProxy.

  • By default, IBM ROKS deploys 3 router members, which support 3 x 20K = 60K connections.
  • By default, AWS ROSA deploys 2 router members, which support 2 x 20K = 40K connections.
    • Use the command below to scale up to 3 router members:
      oc patch -n openshift-ingress-operator ingresscontroller/default --patch '{"spec":{"replicas": 3}}' --type=merge
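The capacity arithmetic above can be turned into a small sizing helper. This is a rough sketch that assumes the 20K-connections-per-router figure quoted here holds for the target cluster; the function names are illustrative only.

```python
import math

CONNECTIONS_PER_ROUTER = 20_000  # per-pod figure quoted in the text above


def total_capacity(replicas, per_router=CONNECTIONS_PER_ROUTER):
    """Connections supported by a given number of router members."""
    return replicas * per_router


def router_replicas_needed(device_connections, per_router=CONNECTIONS_PER_ROUTER):
    """Smallest number of router members whose capacity covers the load."""
    if device_connections < 0:
        raise ValueError("device_connections must be non-negative")
    return max(1, math.ceil(device_connections / per_router))


print(total_capacity(3))                 # 60000 (ROKS default)
print(total_capacity(2))                 # 40000 (ROSA default)
print(router_replicas_needed(100_000))   # 5
```

The result can be used as the replicas value in the oc patch command; it is a lower bound and leaves no headroom for non-device traffic.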



IoT uses Kafka to process messages. Follow the Kafka Configuration Reference to choose suitable values for the Kafka/topic settings retention.ms, retention.bytes, partitions, and replicas to support the workload.

Message Rate and Ethernet Network Bandwidth


Depending on the cloud provider, worker node instances have different network bandwidth, which determines how fast end devices can send requests. The message rate is limited by the message size and the bandwidth of the Ethernet network. Achieving higher rates and/or larger messages requires 10 Gigabit Ethernet. Network bandwidth also affects response latency: the higher the bandwidth, the lower the latency.
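As a back-of-the-envelope illustration of how message size and bandwidth bound the message rate, the sketch below divides link bandwidth by message size. It is a theoretical upper bound only: it ignores TCP/TLS/MQTT protocol overhead and any other traffic sharing the link, and the 1 KiB message size is an assumed example value.

```python
def max_message_rate(bandwidth_gbps, message_bytes):
    """Upper bound on messages/sec for a link, ignoring protocol overhead."""
    if message_bytes <= 0:
        raise ValueError("message_bytes must be positive")
    bits_per_second = bandwidth_gbps * 1_000_000_000
    return bits_per_second / (message_bytes * 8)


# Assumed example: 1 KiB messages on 1 Gb and 10 Gb links.
print(round(max_message_rate(1, 1024)))    # 122070
print(round(max_message_rate(10, 1024)))   # 1220703
```

Real rates will be noticeably lower than these bounds, but the ratio between the two link speeds carries over.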


The deployment configurations below are recommended as starting values for medium and large workloads.

  • MSProxy - 4 MSProxies with 1 CPU and 4GB
  • MessageGateWay - 1 MGW with 6 CPUs and 16GB, along with 4 TcpIop threads
MAS Manage Oil and Gas/HSE


Best Practice for Performance

  1. Archiving or cleaning up historical records of "Permit to Work", "Isolation Certificate", "Work Order", and "Operator Log/LogEntry" significantly improves performance.
  2. Adding the indexes below, which were identified in internal benchmark tests, significantly improves performance.
    Remember to update table statistics after adding any new index, because a new index only becomes effective after table statistics are updated.

Indexes Identified in Internal Benchmark Test

Table Name | Columns | Comments
plusgpermitwork | "ptwclass" ASC,"siteid" ASC,"orgid" ASC,"permitworknum" ASC |
plusgpermitwork | "ptwclass" ASC,"status" ASC,"plusgpertypeid" ASC,"permitworknum" ASC |
plusgpermitwork | "ptwclass" ASC |
plusgpermitwork | "status" ASC,"ptwclass" ASC,"description" ASC |
plusgpertype | "pertypenum" ASC,"plusgpertypeid" ASC |
workorder | "description" ASC | Add it if searching on the description field; creating it as a text index is better
workorder | "status" ASC,"historyflag" ASC,"istask" ASC,"wonum" ASC | Add it if searching on the status field
plusgoperaction | "recordid" ASC,"class" ASC |
plusgshftlogentry | "recordkey" ASC,"orgid" ASC,"siteid" ASC,"createdate" ASC |
plusgshiftlog | "shiftnum" ASC,"isshiftlog" ASC,"startdate" ASC |
plusgrelatedrec | "relatedreckey" ASC,"relatedrecclass" ASC,"recordkey" ASC |
plusgrelatedrec | "recordkey" ASC,"class" ASC,"relatedrecclass" ASC |
plusgincperson | "ticketid" ASC |
maxsession | "issystem" ASC, "userid" ASC, "clienthost" ASC |
ticket | "globalticketid" ASC,"globalticketclass" ASC |
report | "reportname" ASC,"appname" ASC,"reportnum" ASC,"runtype" ASC,"userid" ASC |
reportrunqueue | "running" ASC,"priority" ASC,"submittime" DESC |
MAS Manage Transportation


Best Practice for Performance

  1. Because some Transportation applications execute SQL statements that contain a "LIKE" clause, turning off the DB2 Statement Concentrator can significantly lower CPU utilization on the database server (db2 update db cfg for database_name using stmt_conc off).
  2. Adding the indexes below, which were identified in internal benchmark tests, significantly improves performance.
    If not specified, all columns are ASC by default.
    Remember to update table statistics after adding any new index, because a new index only becomes effective after table statistics are updated.

Indexes Identified in Internal Benchmark Test

Table Name | Columns
logintracking | "userid" ASC,"attemptresult" ASC,"attemptdate" DESC
inspectionresult | "siteid" ASC, "referenceobject" ASC, "referenceobjectid" ASC
MAS Manage


At what point is it necessary to partition a MAS Manage workload across more than one MAS instance?


A new MAS instance is required to run MAS Manage workloads at the point in which the DB server can no longer be scaled up. When the DB server can no longer be scaled up, the customer should plan to create a new MAS instance and move sites to the new MAS instance which will be using a new DB server.


Maximo Transaction latency


When describing Maximo transaction latency it is important to define the boundaries of what constitutes a standard or out-of-the-box Maximo transaction. The description below does just that.



  • CRUD refers to create, read, update, and delete operations
  • MBO stands for Maximo Business Object (MBOs are hierarchical in nature)
  • Transaction latency is defined as the elapsed time between when the Maximo server receives the transaction request and when the Maximo server has sent the response. For UI users this also includes the elapsed time between when the Save button is clicked and when control is returned to the user.

Definition of a transaction


An out-of-the-box Maximo transaction is expected to complete with a latency of 2 seconds or less, where a transaction is defined as the creation, update, or deletion of a single MBO containing no more than one child object and with no attachments or binary data (blobs). Examples include, but are not limited to:

  1. Creation of a single WorkOrder object. This includes generation of the WorkOrderStatus and WorkOrderAncestor records.
  2. Update of a single WorkOrder object. For example, changing its status from Approved to Closed.
  3. Deletion of a single WorkOrder object.

Definition restrictions


The following conditions are considered to be outside the scope of an out of the box Maximo transaction, and therefore do not fall under the 2 second latency characterization.

  1. For UI initiated transactions this does not include latency incurred from downloading UI resources (e.g. js, css, png, jpg, etc.)
  2. It applies to out of the box Maximo applications, but not customized applications or out of the box Maximo applications with automation scripts
  3. It does not apply to UI initiated transactions with a large number of xhr requests (or portlets), for example the Maximo start center. Large here means greater than 2 xhr requests per page. Note, Maximo currently supports HTTP/1.1, so xhr requests initiated from a single user UI page are sequential, not concurrent.
  4. It does not apply to customized saved queries.
  5. It does not apply to bulk load requests.
  6. It does not apply to report related transactions

App Server


MAS Manage has different bundle types (e.g. All, UI, MEA, Report, and CRON) for configuring the app server. Adjust resource settings such as CPU, memory, and replicas to match the workload. The settings are in the ManageWorkspace CR. Below is a sample.

apiVersion: apps.mas.ibm.com/v1
kind: ManageWorkspace
spec:
  settings:
    deployment:
      serverBundles:
        - bundleType: mea
          isDefault: false
          isMobileTarget: false
          isUserSyncTarget: true
          name: mea
          replica: 1
          routeSubDomain: all
        - bundleType: cron
          isDefault: false
          isMobileTarget: false
          isUserSyncTarget: false
          name: cron
          replica: 1
    settings:
      resources:
        manageAdmin:
          limits:
            cpu: '2'
            memory: 4Gi
          requests:
            cpu: '0.2'
            memory: 500Mi
        serverBundles:
          limits:
            cpu: '6'
            memory: 10Gi
          requests:
            cpu: '0.2'
            memory: 1Gi



Lab tests show that the roundrobin load balancer policy performs better and more stably than the leastconn policy, which is the default. Follow this link to update the load balancer policy.


Manage Pod Functionality


Follow this link to understand the manage pod functionality.


LTPA timeout


When using IBM Maximo Application Suite (MAS), Manage users receive an error message asking them to reload the application after 2 hours, even while actively working. This default 2-hour timeout is the point at which the LTPA token in Manage expires, after which the user is redirected back to the MAS login page. Follow Updating LTPA timeout in Manage to increase the default value.


WebSphere Liberty


Due to the architecture change, Maximo 8.x (MAS Manage app) is deployed on WebSphere Liberty Base with OpenJ9. As of WebSphere Liberty, the thread growth algorithm is enhanced to grow thread capacity more aggressively to react more rapidly to peak loads. For many environments, this autonomic tuning provided by the Open Liberty thread pool works well with no configuration or tuning by the operator. If necessary, you can adjust coreThreads and maxThreads value by tuning liberty.


Configure JVM options in Manage app


Follow this link to configure JVM options






Disk performance is critical for database performance. Storage or disks with the following characteristics are recommended:

  • disk throughput: > 250 MB/s
  • IOPS: 10 IOPS/GB to 100 IOPS/GB (depending on volume size)

To measure disk performance on Linux use the dd command. The sample command below measures disk performance of the data volume inside a db2 pod running in OCP




Make sure that the ddtest filename is appended to the end of the data path; otherwise the dd command will wipe the db2 data directory.

[db2inst1@c-db2wh-manage-db2u-0 - Db2U bludata0]$ dd if=/dev/zero of=path_of_db2_data_directory/ddtest bs=128K count=8192
8192+0 records in
8192+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 2.84314 s, 378 MB/s
[db2inst1@c-db2wh-manage-db2u-0 - Db2U bludata0]$
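To compare a dd result against the 250 MB/s recommendation, the summary line can be parsed and the throughput recomputed from the byte count and elapsed time. The sketch below is one possible way to do this; it assumes the GNU dd summary format shown in the sample output, and the function name is illustrative.

```python
import re

MIN_THROUGHPUT_MB_S = 250  # recommended minimum from the text above


def dd_throughput_mb_s(summary_line):
    """Compute MB/s (1 MB = 10**6 bytes) from a GNU dd summary line."""
    match = re.search(r"(\d+) bytes .* copied, ([\d.]+) s", summary_line)
    if match is None:
        raise ValueError("not a dd summary line")
    total_bytes = int(match.group(1))
    seconds = float(match.group(2))
    return total_bytes / seconds / 1_000_000


# Summary line taken from the sample dd output above.
line = "1073741824 bytes (1.1 GB, 1.0 GiB) copied, 2.84314 s, 378 MB/s"
rate = dd_throughput_mb_s(line)
print(round(rate))                  # 378
print(rate > MIN_THROUGHPUT_MB_S)   # True
```

The byte count matches the command parameters: 8192 blocks of 128 KiB is 1073741824 bytes.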

Network latency between app and db server


Reducing network latency is key to optimizing performance. Confirm that latency is below 50ms by conducting a ping test. For production environments, it is strongly recommended to keep latency below 10ms and to place the app and db servers in the same network segment. In cloud deployments, ensure both the database and the OpenShift cluster are located in the same region and, where possible, in the same availability zone (AZ). Use the ping command to evaluate and pinpoint latency issues.
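The thresholds above can be checked automatically against ping output. The following sketch assumes the summary line format printed by Linux ping ("rtt min/avg/max/mdev = ..."); the sample line is made-up illustrative data, and the function names and classification labels are hypothetical.

```python
import re


def average_rtt_ms(ping_summary):
    """Extract the average round-trip time (ms) from a Linux ping summary line."""
    match = re.search(r"=\s*([\d.]+)/([\d.]+)/([\d.]+)", ping_summary)
    if match is None:
        raise ValueError("no rtt summary found")
    return float(match.group(2))  # fields are min/avg/max


def classify_latency(rtt_ms):
    """Apply the 10ms (production) and 50ms (maximum) thresholds from the text."""
    if rtt_ms < 10:
        return "within production target"
    if rtt_ms < 50:
        return "acceptable, above production target"
    return "too high"


# Made-up sample line for illustration.
sample = "rtt min/avg/max/mdev = 0.321/0.402/0.512/0.071 ms"
avg = average_rtt_ms(sample)
print(avg, classify_latency(avg))
```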


Large table optimization


When optimizing large tables in the Manage app, it is recommended to transfer these tables to a dedicated tablespace on high-throughput disks, coupled with a dedicated buffer cache for enhanced performance. The speed of the disks and the availability of memory play crucial roles in this optimization strategy. Additionally, ensure that index statistics are regularly updated, and address any problematic queries to further optimize the system.


DB - DB2/DB2wh


DB2 Tuning in Maximo 7.6.x Best practice is applicable.




The containerized DB2U and DB2WH deployments do NOT support text search (regular DB2 has text search). As a result, some queries may perform poorly on containerized DB2 relative to Oracle DB and SQL Server, which both support text search.


Searching records by Description on the list page is a typical scenario whose performance can benefit from the text search capability of the database, especially if no other indexed attributes are included in the query.


Adding a non-unique index on Description can help if an exact search can be made (Maximo search type = EXACT, or the user types "=" before the search string, e.g. =Text) or if the search can be based on the beginning of the string (the user types '%' at the end of the string, e.g. Text%). If possible, adding other fields to the query (either typed by the user or as part of the default) can also help where those attributes are part of an index. In addition, adding Description to the end of one of these indexes can also show improvement.



  • Increase the maxsequence cache to 50.
  • Run runstats and/or reorg periodically to keep index statistics up to date.
  • Separate system storage, user storage, backup storage, transaction log storage, and temporary tablespace storage onto different disks if possible.
  • Use DB2 Performance Diagnosis to troubleshoot and tune the database and SQL.
  • Manage requires row-organized tables. Check the db2w database setting (by default it is column-organized) and update it with db2 update db cfg using DFT_TABLE_ORG ROW
  • Manage does NOT support MMP or table partitioning in the current version, but consider archiving records that are more than 1 year old. Optim is one of the tools that can be used for archiving. See this guide and this video for details.
  • Increase the number of concurrently running statements allowed for a DB2 application. This issue occurs when loading a large amount of data via MIF or API calls. See this link for the tuning.
  • storageclass
    • for IBM Cloud: performance (Custom) block storage with 100+ IOPS for data storage, block gold for system storage, and block silver for backup
    • for AWS: if using EFS for the database, consider Provisioned mode to get constant throughput. For more disk options, see details in this page
  • For db2 registry (db2set):
    • Set db2_workload=maximo. This sets the db cfg variable WLM_ADMISSION_CTRL to NO.
    • Do NOT change the default values for DB2_OVERRIDE_NUM_CPUS and DB2_OVERRIDE_THREADING_DEGREE.
  • Verify that the db2 db cfg variable WLM_ADMISSION_CTRL is set to NO.
  • For db2ucluster CR
    • Do NOT set db2 instance memory. The operator calculates it automatically based on the container memory limit.
    • (Optional) for performance stability, set the container resource request and limit to the same value.
  • For db2 monitor switches
    • The best practice is to turn off all monitor switches except Timestamp in dbm cfg.
    • Turning on monitor switches in dbm cfg turns them on by default for all DB2 sessions, which adds 5%-10% overhead to overall database performance, depending on the workload and database server hardware spec. For this reason, do not turn on monitor switches in dbm cfg.
    • When DB2 monitor data is needed, turn on monitor switches only in a specific session with the following command:
      db2 update monitor switches using BUFFERPOOL on LOCK on SORT on STATEMENT on TIMESTAMP on TABLE on UOW on
      Turn off all monitor switches immediately after collecting the required monitor data with the following command:
      db2 update monitor switches using BUFFERPOOL off LOCK off SORT off STATEMENT off TIMESTAMP off TABLE off UOW off

DB - Oracle

  • Maximo 7.6.x Best practice is applicable
  • additional settings for MSSQL Server 2019
    • compatibility level: if the Maximo database was upgraded from an older version and performance degradation is observed after the upgrade, consider setting the compatibility level to the old version to keep the execution plans the same.
    • isolation level:
          ALTER DATABASE <DB NAME>
          ALTER DATABASE <DB NAME>

Work Order Intelligence Inferencing


This was added to the MAS Manage application to assist users with problem code classification (PCC) for Work Orders. See the product documentation for more details.


PCC model inferencing (batch mode from cron task)


Inferencing is typically run more frequently than training, but is less resource intensive. By default, MAS Manage is configured with a single instance of the AIINFJOB cron task. This is recommended for most workloads.


The predictor pod, where inferencing/prediction occurs, receives a batch of Work Orders to be inferenced from the MAS Manage cron pod running the AIINFJOB cron task. The batch size (or page size, defined on the MXAPIWODETAIL object structure query template) is the best way to control the rate at which Work Orders are inferenced. The graph below shows how the total time to inference 100K work orders is influenced by the batch size. With a batch size of 500 Work Orders/request and a 30 second interval for the AIINFJOB cron task, 100K work orders were inferenced in approximately 1.6 hours, compared to 83 hours with a batch size of 10 Work Orders/request.




The recommended batch size is 500 Work Orders/request and the recommended interval for the AIINFJOB cron task is 30 seconds.
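
The reported totals are consistent with a simple pacing estimate. The sketch below is illustrative only: it assumes one batch is inferenced per cron run, that each batch finishes before the next run starts, and that the same 30 second interval applied to both tests. The function name is hypothetical and not part of MAS.

```python
import math


def estimated_inference_hours(total_work_orders: int, batch_size: int,
                              interval_seconds: float) -> float:
    """Estimate total elapsed time when one batch is processed per cron run.

    Assumption: the cron interval is the pacing factor, i.e. the predictor
    pod finishes each batch before the next cron run starts.
    """
    runs = math.ceil(total_work_orders / batch_size)
    return runs * interval_seconds / 3600


# 100K work orders, 30 second AIINFJOB interval
print(round(estimated_inference_hours(100_000, 500, 30), 2))  # 1.67
print(round(estimated_inference_hours(100_000, 10, 30), 2))   # 83.33
```

Under these assumptions the batch size, not the predictor pod, determines the total time, which is why raising the page size is the most effective control.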




PCC model inferencing required resources (batch mode from cron task)


The graphs below show the CPU and memory resource utilization of the predictor pod based on the batch size. The CPU utilization of the predictor pod increases with the batch size, but the memory utilization remains fairly consistent (between 4GB and 5GB).






PCC model inferencing: batch cron processing vs on-demand single inference


For bulk inferencing of large numbers of Work Orders, use the AIINFJOB cron task. UI users can also request problem code inferencing on a single work order. In this case the predictor pod receives a single work order, and the per-work-order overhead is much higher. For example, inferencing a batch of 10 or more work orders takes an average of 20 milliseconds per work order in the predictor pod, whereas a single work order from the UI takes about 120 milliseconds in the predictor pod and about 750 milliseconds in total including the MAS Manage API request. It is therefore much more efficient to inference large numbers of work orders asynchronously using the AIINFJOB cron task and a page size of 500. In other words, do not drive bulk inferencing by calling the API from a script.
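
To put the per-work-order figures in perspective, the following sketch scales them to 100K work orders. It is a rough illustration that assumes strictly sequential requests. The work order count is an example value and the helper name is hypothetical.

```python
# Per-work-order timings quoted in the text (milliseconds)
BATCH_MS_PER_WO = 20     # predictor pod, batch of 10 or more work orders
SINGLE_MS_PER_WO = 120   # predictor pod, single work order from the UI
SINGLE_TOTAL_MS = 750    # single work order including the MAS Manage API request


def hours_for(work_orders: int, ms_per_work_order: float) -> float:
    """Total time in hours, assuming requests are processed one after another."""
    return work_orders * ms_per_work_order / 1000 / 3600


n = 100_000  # example volume, not a measured workload
print(f"batched, predictor time only: {hours_for(n, BATCH_MS_PER_WO):.2f} h")
print(f"single, predictor time only:  {hours_for(n, SINGLE_MS_PER_WO):.2f} h")
print(f"single, including API call:   {hours_for(n, SINGLE_TOTAL_MS):.2f} h")
```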


Work Order Intelligence Training


This was added to the MAS Manage application to assist users with problem code classification (PCC) for Work Orders. See the product documentation for more details.


PCC model training required resources


Model training is resource intensive. For this reason there is a limit of one active model training per MAS Manage instance.


A single model training requires at least 8GB of memory. The pipeline pod, where model training occurs, allocates a number of busy processes equal to the number of CPUs on the worker node where the pod is scheduled. At the time of this writing there is no CPU limit set for the pipeline pod, so it consumes as much CPU as is available on that worker node. In general, the more CPU is available to the pipeline pod, the shorter the training time.


The three data points on the graph below were taken on a 16 CPU worker node. In these tests a CPU limit was placed on the pipeline pod (by default the pipeline pod has no limits specified). Training with an 8 CPU limit was a little more than twice as fast as training with a 4 CPU limit. However, going from the 8 CPU limit to the 16 CPU limit brought very little improvement. This can be attributed to other workloads running on the worker node where the pipeline pod was scheduled, as well as synchronization waits between the training processes/threads. In other words, to improve the training time for the 16 CPU limit test it would be necessary to schedule the pipeline pod on a worker node with more than 16 CPUs and fewer competing workloads.




Sample sizes for PCC model training




Do not train with more than 10K labeled samples. 10K samples is the recommended limit for PCC training.


The training times for a single epoch and different sample sizes are shown below. In general, the larger the labeled sample data set, the longer the training time. There is an exception to this rule: the single epoch training time for the 5K sample size is only 82 minutes, compared to 220 minutes for the 1K sample size. There were 30 problem codes in this test, and with the 1K sample size there were not enough samples per problem code. As a result, the model leveraged Watson X to generate synthetic samples, and this process accounts for the additional training time for the 1K sample set.




The results below show training time for a single epoch. A real training uses 12 epochs, so multiply the single epoch training times below by 12 to get the real training time. Note that there is a default timeout of 14400 minutes (10 days) for training to complete.
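
As a worked example of the multiplication above, the sketch below applies the 12 epoch factor to the two single epoch times quoted earlier (220 minutes for 1K samples, 82 minutes for 5K samples) and compares the result with the default timeout. It assumes every epoch takes as long as the measured one, which is a simplification. The names are illustrative.

```python
EPOCHS = 12                # epochs used in a real training
TIMEOUT_MINUTES = 14_400   # default training timeout (10 days)

# Single epoch training times quoted in the text (minutes)
single_epoch_minutes = {"1K samples": 220, "5K samples": 82}


def full_training_minutes(single_epoch: float, epochs: int = EPOCHS) -> float:
    """Scale a single epoch time to a full training run."""
    return single_epoch * epochs


for label, minutes in single_epoch_minutes.items():
    total = full_training_minutes(minutes)
    within = total <= TIMEOUT_MINUTES
    print(f"{label}: {total} min (~{total / 60:.1f} h), within timeout: {within}")
```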






Lab Result Highlights

  • +

    The lab results indicate a significant correlation between transactions per second (TPS) and database disk IO utilization. The level of transactional activity directly drives the IO workload on the database disk and, conversely, the IO workload limits the system's ability to handle a larger volume of transactions.

  • +
  • +

    When IO is not the limiting factor, increasing the number of MEA Pods can positively impact the processing performance.

  • +
  • +

    Increasing the number of Message-Driven Bean (MDB) instances can have a positive impact on system performance. Adjust the number of records per message, the number of MDBs and the batch size together. A good target is a resource usage of around 2 cores and 4-7GB of RAM, which keeps utilization efficient without overburdening the MEA pods.

  • +
  • +

    Based on the lab results, it has been observed that a large number of internal error messages have a substantial impact on processing throughput.

  • +
  • +

    Under certain circumstances, the configuration parameter mxe.int.splitdataonpost does not demonstrate a positive impact. Perform a dry run in your specific environment to verify its effectiveness.

  • +

Performance Troubleshooting Checklist


To troubleshoot and optimize performance, follow this checklist:

  • Ensure adherence to best practices for optimizing performance in your DB, Openshift, and MAS environments.
  • +
  • Monitor disk IO utilization of the database and maintain it within acceptable limits to avoid performance degradation due to saturated disk resources.
  • +
  • Adjust the number of records per message and the MDB/batch size to effectively manage resource utilization of MEA pods. Aim for a resource consumption range of approximately 2 cores and 4-7 GB.
  • +
  • Regularly check the message queue to prevent it from becoming empty, ensuring a steady flow of messages for processing.
  • +
  • Minimize the occurrence of integration error messages as they can significantly impact processing throughput. Pay attention to a high volume of internal error messages and investigate the message reprocessing application for further insights.
  • +
  • Set a sufficiently large value for maxMessageDepth to avoid message queue overflow. It is recommended to match SIBus's default value of at least 500,000.
  • +
  • When the need for additional MEA pods arises, consider scaling up the number of worker nodes to accommodate the increased demand effectively.
  • +

Test Methodologies

  • +

    Establish a monitoring system to track essential performance metrics throughout the testing process.

  • +
  • +

    Begin with a dry run using a single MEA pod to establish a baseline benchmark for performance evaluation.

  • +
  • +

    Adjust the Message-Driven Bean (MDB) and BatchSize parameters to optimize resource utilization within an appropriate range for the MEA pod.

  • +
  • +

    Scale up the number of MEA pods as needed to meet performance requirements and accommodate increased workload.

  • +
  • +

    Continuously monitor and assess the performance of both the database and the application to identify any bottlenecks or areas for improvement.

  • +

By following these test methodologies, you can effectively monitor and optimize the performance of your system, ensuring efficient resource utilization and maintaining satisfactory levels of performance.

| Component | Configuration | Adjustable or Scalable | Observation & Best Practice |
|---|---|---|---|
| JMS / MIF | maxMessageDepth | Yes | Make it large enough. If it is too small, the process fails when the queue is full and may be hard to recover. Recommend 500,000, same as SIBus |
| | maxEndpoints | Yes | Limits the maxConcurrency |
| | MDB (maxConcurrency) | Yes | Along with BatchSize, impacts processing speed and MEA pod resource utilization |
| | BatchSize (maxBatchSize) | Yes | Along with MDB, impacts processing speed and MEA pod resource utilization |
| Maximo | # of JMS Pod | Yes | 1 JMS Server works well in the benchmark test. It does not consume significant resources |
| | # of MEA Pod | Yes | Able to scale linearly |
| | MEA CPU / MEM Usage | Yes | Adjust JMS/MDB and BatchSize to keep MEA pod resources in a reasonable range, e.g. (2 - 3 core / 4 - 7G) |
| | JMS CPU / MEM Usage | Yes | Default setting works well in the benchmark test |
| | DB CPU / MEM Usage | Yes | Ensure the DB has sufficient resources |
| | DB Disk IO Util % | Yes, but sometimes hard to adjust | Disk IO throughput is critical for the overall processing |
| | DB Lock Holds | N/A | |
| | DB Tuning: Long Running Query, # of Appl, Memory.. | Yes | Follow the best practice to tune the DB |
| | Maximo Sequence Cache | Yes | A reasonable number, e.g. 20 or 50, can reduce the db cpu and processing time |
| Message | # of record per Message | Yes | |
| | data structure (complexity of the record) | N/A | Impacts performance because of business logic checks |
| | Record Quality (record cannot be processed) | Yes | A large number of internal error messages slows down the overall processing speed |
| Misc | Method & Speed to post message into queue | Yes | Ensure messages are posted (written to the queue) as fast as possible. Slow pacing lowers the env processing capacity. |
| | Any other concurrent transactions | N/A | Other concurrent workloads impact the processing time |
| | Worker Node Capacity | Yes | Worker node capacity may limit working pod (e.g. MEA) capacity. Pod distribution should also be considered. |

MAS Manage MIF/Kafka


Lab Result Highlights

  • +

    As in the MIF/JMS test, the lab results indicate a significant correlation between transactions per second (TPS) and database disk IO utilization. This correlation suggests that the level of transactional activity directly impacts the IO workload on the database disk. Conversely, the IO workload acts as a limitation on the system's ability to handle a larger volume of transactions.

  • +
  • +

    The results also demonstrate a notable connection between the disk IO throughput and the TPS (Transactions Per Second).

  • +
  • +

    Doubling the number of CRON JVMs and Kafka topic partitions leads to a twofold increase in the maximum TPS. However, this change also enlarges the distribution difference, from 2% to 10%. Consequently, in the final phase, the overall processing rate diminishes, with the TPS decreasing from 72 to 66. This is attributed to the Kafka rule that allows a maximum of 1 consumer per partition.

  • +
  • +

    Increasing the number of partitions may result in better performance for small messages (e.g., 10 assets per message) compared to large messages. Please ensure that there is an adequate number of messages in the queue for processing.

  • +
  • +

    When evaluating the performance of a single MEA JVM, the TPS in MIF/Kafka matches that of JMS. Nevertheless, when multiple processing JVMs are used, JMS performs better because its workload distribution is more even. From a best-practice standpoint, it is advisable to have one Kafka topic with 6 partitions and multiple Kafka topics for parallel processing.

  • +
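
The one-consumer-per-partition rule mentioned above can be sketched as follows. This is a simplified model, not Kafka's actual partition assignor: it assumes a single consumer group, treats each processing JVM as one consumer, and splits partitions as evenly as possible. The function names are hypothetical.

```python
from typing import List


def active_consumers(consumers: int, partitions: int) -> int:
    """Within one consumer group a partition is read by at most one consumer,
    so consumers beyond the partition count stay idle."""
    return min(consumers, partitions)


def partitions_per_consumer(consumers: int, partitions: int) -> List[int]:
    """Spread partitions as evenly as possible over the active consumers."""
    active = active_consumers(consumers, partitions)
    if active == 0:
        return []
    base, extra = divmod(partitions, active)
    return [base + 1 if i < extra else base for i in range(active)]


print(active_consumers(8, 6))         # 6 -> two consumers would sit idle
print(partitions_per_consumer(4, 6))  # [2, 2, 1, 1] -> uneven load
print(partitions_per_consumer(3, 6))  # [2, 2, 2] -> even load
```

When the partition count is not a multiple of the consumer count, some consumers own more partitions than others, which is one way an uneven workload distribution can arise.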

MAS Manage Mobile


Tips and Tricks:

  • +

    Strongly recommend creating a mobile database for supporting data downloads. Downloading supporting data online can significantly impact the performance of Mobile Pods, databases, and networks.

  • +
  • +

    To mitigate download failures, consider increasing the timeout values for the ingress controller. The default server/client timeout is set too low, affecting the pass rate. Use the following commands to raise the default values:


    oc -n openshift-ingress-operator patch ingresscontroller/default --type=merge -p '{"spec":{"tuningOptions": {"clientTimeout": "300s"}}}'


    oc -n openshift-ingress-operator patch ingresscontroller/default --type=merge -p '{"spec":{"tuningOptions": {"serverTimeout": "300s"}}}'

  • +
  • +

    Scaling up the coreapi pod can enhance the downloading experience for the mobile app.

  • +
  • +

    Consider scaling up the mobile pods when the CPU usage of a pod exceeds 4 cores.

  • +
  • +

    Optimal disk throughput for the database is crucial for a smooth app downloading experience.

  • +
  • +

    Observations from lab tests suggest that balanced node resource utilization is crucial for optimal performance. The default topology spread constraints in the ManageWorkspace Custom Resource (CR) are set to "topologyKey: topology.kubernetes.io/zone". In a single-zone cluster, if pods are not evenly distributed across worker nodes, consider setting it to "topologyKey: topology.kubernetes.io/hostname" instead.

  • +



MongoDB Troubleshoot:

  • mongostat: mongostat --username admin --password <password> --authenticationDatabase admin --ssl --sslAllowInvalidCertificates 2
  • +
  • mongotop: mongotop --username admin --password <password> --authenticationDatabase admin --ssl --sslAllowInvalidCertificates 2
  • +
  • check mongod log for slow queries (MongoDB community): oc logs -n <mongo namespace> <mongo pod name> -c mongod | grep -iE 'Slow query'
  • +
  • long connection over 3 seconds: db.currentOp({"active" : true,"secs_running" : { "$gt" : 3 },"ns" : /^msg/})
  • +
  • kill long running connection: db.killOp("opid")
  • +
  • locking: db.serverStatus().globalLock
  • +
  • mem: db.serverStatus().mem
  • +
  • wiredTiger cache: db.serverStatus().wiredTiger.cache
  • +
  • concurrent: db.serverStatus().connections
  • +



Monitoring your OpenShift clusters is critical for environment health and quality of service. It helps ensure that all deployed workloads are running smoothly and that the environment is properly scoped.


OpenShift Monitoring Service (Prometheus/Grafana)


OpenShift Container Platform includes a pre-installed monitoring stack based on Prometheus/Grafana. MAS also provides app-level Prometheus metrics and a set of Grafana dashboards for application health. More installation and configuration details can be found in IBM MAS Monitoring.


Best practice for OpenShift Monitoring Service

  • enable user workload monitoring: enableUserWorkload: true
  • consider increasing the Prometheus retention period, whose default value is 24h, and adding persistent volumes
  • consider changing Alertmanager's storage class and size

Below is the sample for configmap cluster-monitoring-config

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    enableUserWorkload: true
    prometheusK8s:
      retention: 90d
      # CPU and memory apply to the Prometheus container, not to the volume claim
      resources:
        requests:
          cpu: 200m
          memory: 2Gi
        limits:
          cpu: 2
          memory: 4Gi
      volumeClaimTemplate:
        spec:
          storageClassName: nfs-client
          resources:
            requests:
              storage: 300Gi
    alertmanagerMain:
      volumeClaimTemplate:
        spec:
          storageClassName: nfs-client
          resources:
            requests:
              storage: 20Gi


  • Besides the OpenShift Monitoring Service (Prometheus/Grafana), there are paid solutions such as IBM Instana, New Relic and Datadog that also support OCP.
  • If the cluster is cloud based, consider using the cloud provider's monitoring tool for additional information such as network, disk and managed services, e.g. AWS CloudWatch, IBM Log Analysis.

OpenShift Container Platform


Cluster Insights Advisor


Using the OpenShift cluster Insights Advisor is highly recommended to check for any issues related to the current version, nodes and misconfigurations. It is the first step in problem diagnosis.



  • Login on OpenShift Console
  • +
  • Go to Administration -> Cluster Settings
  • +
  • Click OpenShift Cluster Manager in Subscription section. It redirects the url to RedHat Hybrid Cloud Console
  • +
  • Click Insights Advisor
  • +

PID limit for docker


This setting controls how many processes can run within a single container. If it is too small, the container can run out of PIDs and fail to create new processes or threads. For example, a db2w instance may become unavailable when thousands of connections/agents arrive, or OpenShift Container Storage may misbehave with a large number of PVCs.


OOB value for OCP platforms:

| Platform Version | Default Value |
|---|---|
| IBM ROKS (4.8) | 231239 |
| AWS ROSA | 4096 in OpenShift 4.11 and higher |
| Azure Self-Managed OCP | 1024 |

Steps to check or update PID limit:

$ oc debug node/$NODE_NAME
$ chroot /host
$ cat /etc/crio/crio.conf
# add / modify the line "pids_limit = <new value>"
# run the commands below to restart the services and reboot the worker node
$ systemctl daemon-reload
$ systemctl restart crio
$ shutdown -r now


HAProxy Router


Ingress Controller


OpenShift HAProxy supports up to 20k connections per pod. Consider scaling up the ingress pods for any app (like IoT) with a high-volume connection workload.


Scale up ingress controller

  • command: oc patch -n openshift-ingress-operator ingresscontroller/default --patch '{"spec":{"replicas": 3}}' --type=merge
  • +

Max Connection


One of the most important tunable parameters for HAProxy scalability is the maxconn parameter. The router handles a maximum of 20k concurrent connections; the value is set by using oc adm router --max-connections=xxxxx. This parameter is constrained by the node settings sysctl fs.nr_open and sysctl fs.file-max. HAProxy will not start if maxconn is high but the node settings are low.


Note: OpenShift Container Platform no longer supports modifying Ingress Controller deployments by setting environment variables such as ROUTER_THREADS, ROUTER_DEFAULT_TUNNEL_TIMEOUT, ROUTER_DEFAULT_CLIENT_TIMEOUT, ROUTER_DEFAULT_SERVER_TIMEOUT, and RELOAD_INTERVAL. You can modify the Ingress Controller deployment, but if the Ingress Operator is enabled, the configuration is overwritten.


Load Balance Algorithm


Starting from OCP 4.10, there have been four load-balancing algorithms available: source, roundrobin, random, and leastconn. The default algorithm is set to random. In earlier versions of OCP, before 4.10, there were three load-balancing algorithms: source, roundrobin, and leastconn. The default algorithm in those versions was leastconn. Set up annotations for each route to change the default algorithm if needed. e.g. haproxy.router.openshift.io/balance=roundrobin


Master and Worker Nodes Consideration


There is a wide selection of instance types that comprise varying combinations of CPU, memory, disk and network. Below are a few considerations:

  • Each worker node reserves about 1 core for internal services. To avoid the side effects of overcommit, 16core/64G is a good starting size for a normal worker node. An 8-core instance may not have sufficient capacity, while with 32-core instances an outage or failure of a single node removes a large share of the cluster capacity.
  • Balanced CPU-memory worker nodes, with a CPU to memory ratio of 1 to 4, typically fit our workload.
  • An instance with a higher memory/cpu ratio, e.g. 8:1, is recommended for database nodes.
  • Use at least 3 worker nodes. This provides high availability while needing less built-in redundant capacity.
  • For the production env, an 8core/32G instance is recommended for master nodes to avoid any bottleneck for the internal services.
  • An instance with 10 Gigabit Ethernet is strongly recommended for the production env.
  • Check the GPU chip type for GPU node selection.
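
The trade-off between small and large worker nodes can be illustrated with a rough calculation. The sketch below assumes a fixed total of 96 cores (an arbitrary example value) and uses the approximate 1 core per node reserved for internal services. The function name is hypothetical.

```python
RESERVED_CORES_PER_NODE = 1  # approximate reservation for internal services


def node_tradeoff(total_cores: int, cores_per_node: int) -> dict:
    """Compare per-node overhead with the capacity lost when one node fails."""
    nodes = total_cores // cores_per_node
    return {
        "nodes": nodes,
        # share of each node consumed by internal services
        "reserved_pct": round(100 * RESERVED_CORES_PER_NODE / cores_per_node, 1),
        # share of cluster capacity lost if a single node goes down
        "lost_on_one_node_failure_pct": round(100 / nodes, 1),
    }


for size in (8, 16, 32):
    print(f"{size}-core nodes:", node_tradeoff(96, size))
```

Smaller nodes lose a larger share to reserved services, while larger nodes lose a larger share of the cluster on a single failure. A 16-core node sits between the two.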

Sizing Guidance


The sizing numbers on this page are based on a standard workload. Use them as reference only.


Sizing Calculation Sheet


Use Sizing Calculation Sheet for MAS sizing.


Factors that impact the sizing consideration

  • storage operator: e.g. ocs, odf...
  • +
  • cp4d services: e.g. db2w, watson studio...
  • +
  • mongodb service
  • +
  • kafka service
  • +

OCS (OpenShift Container Storage)


If using OCS to manage the storage class, the OCS service itself requires a minimum of 3 nodes with 14 core / 32G (Note: this is the total request amount, not per node).


ODF (OpenShift Data Foundation)


3 OCP nodes will run ODF services. (NOTE: OCP clusters often contain additional OCP worker nodes which do not run ODF services.) Each OCP node running ODF services has 16 core / 64 GB memory.


CP4D/DB2W Minimum Resource Requirement

  • When running CP4D/DB2W on an OpenShift worker node, each instance requires at least 6.1 core and 18G ram. Note: an instance pod cannot be scheduled if the node's (total capacity - total limit) is less than 6.1 core or 18G ram.
  • A dedicated worker node or external db is recommended.
  • The db2 operator is an alternative.
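
The scheduling note above can be expressed as a simple headroom check. This is a minimal sketch of the rule as stated in this section, not of the OpenShift scheduler itself. The node sizes and limit totals are example values and the function name is hypothetical.

```python
DB2W_MIN_CORES = 6.1   # minimum cores per CP4D/DB2W instance
DB2W_MIN_MEM_GB = 18   # minimum memory per CP4D/DB2W instance


def has_db2w_headroom(node_cores: float, node_mem_gb: float,
                      total_limit_cores: float, total_limit_mem_gb: float) -> bool:
    """Apply the rule from the text: (total capacity - total limit) must cover
    both the CPU and the memory minimum for one instance."""
    free_cores = node_cores - total_limit_cores
    free_mem_gb = node_mem_gb - total_limit_mem_gb
    return free_cores >= DB2W_MIN_CORES and free_mem_gb >= DB2W_MIN_MEM_GB


# Example: a 16core/64G worker node with different existing limit totals
print(has_db2w_headroom(16, 64, 8, 40))   # True: 8 cores and 24G remain
print(has_db2w_headroom(16, 64, 11, 40))  # False: only 5 cores remain
```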

MAS Manage


Based on the benchmark results, for sizing we recommend 50 - 75 user load per MAS Manage UI server bundle pod, which is equivalent to a JVM with 2 core on Maximo 7.6.x.
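
As a worked example of this guidance, the sketch below converts a concurrent user count into a range of UI server bundle pods using the 50 - 75 users per pod figure. The 300 user load is a hypothetical value and the function name is illustrative.

```python
import math


def manage_ui_pods(concurrent_users: int, users_per_pod: int) -> int:
    """Number of UI server bundle pods needed for a given user load."""
    return max(1, math.ceil(concurrent_users / users_per_pod))


users = 300  # hypothetical concurrent user load
optimistic = manage_ui_pods(users, 75)    # 75 users per pod
conservative = manage_ui_pods(users, 50)  # 50 users per pod
print(f"{users} users -> {optimistic} to {conservative} UI pods")
```

Sizing toward the conservative end leaves headroom for heavier-than-standard user activity.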


MAS Resource Statistics

  • Use the values below as reference only.
  • The footprint depends on the loads and spec settings, e.g. IoT T-shirt size, Manage bundles and replica count.
  • The values below are based on the IoT small T-shirt size and Manage with only the all-in-one bundle and replica = 1.
| App | CPU Request (core) | CPU Limits (core) | Memory Request (GB) | Memory Limits (GB) |
|---|---|---|---|---|
| Additional cost | - | - | - | - |
| cp4d (with 2 db2w instances)* | 31.59 | 40.72 | 35.39 | 249.70 |
| each additional manage pod* | 1 | 6 | 2 | 10 |

Performance Diagnosis




A monitoring system is strongly recommended to track the environment health and the quality of services.


Diagnostic Utility

| Scope | Name | Used for |
|---|---|---|
| OCP | OpenShift Monitoring Service | OpenShift Cluster and MAS |
| DB2 | IBM DSM | DB2 Historical and Realtime Troubleshooting |
| DB2 | db2top | DB2 Realtime Troubleshooting |
| DBTest | DBTest | A utility to test db network latency and fetching time |
| Oracle | AWR, StatsPack | Historical Troubleshooting |
| JVM | IBM Support Assistant | Heap Dump and GC Log Analysis |
| JVM | MAT | JVM Dump Analysis |
| Maximo | PerfMon | Maximo UI Activity Tracing. Note: Enabling PerfMon may significantly degrade server performance. Recommended for a single user in a Dev/Test env only |
| MongoDB | mongotop | MongoDB Realtime Troubleshooting |
| HAR | HTTP Archive Viewer | HAR Analysis - for web page and client side (browser) performance |
| SQL | Poor SQL | Online SQL Formatter |
| SQL | Squirrl | Universal SQL Client |
| SSL | SSL Shopper | Online certificate decode tool |
| OS | top | Process and thread level analysis, hotspot analysis - top is available in most containers and on OCP worker nodes |
| OS | sar | A system command used to monitor system resources such as cpu, memory, disk and network |
| OCP | oc debug node/<node name> | Worker node debugging |

Factors in system performance


System performance depends on more than the applications and the database. The network architecture affects performance. Application server configuration can hurt or improve performance. The way that you deploy Maximo across servers affects the way the products perform. Many other factors come into play in providing the end-user experience of system performance. Subsequent sections in this paper address the following topics:

  • System architecture setup including OCP, Instance Type, Storage
  • +
  • App and DB server configuration
  • +
  • Network issues
  • +
  • Bandwidth
  • +
  • Load balancing
  • +
  • Database tuning
  • +
  • SQL tuning
  • +
  • Scheduled tasks (cron tasks)
  • +
  • Reporting
  • +
  • Integration with other systems using the integration framework
  • +
  • Troubleshooting
  • +

Performance Check List

  • Check node status, e.g. are any worker nodes NotReady?
  • Is any pod or node cpu or memory usage approaching the limit?
  • Has any pod restarted many times recently?
  • Is there any JVM heap dump?
  • Is there any JVM hung thread?
  • Is there any node or pod with a high system or IO wait (20%)?
  • Is there any node memory, disk or pid pressure?
  • Is the response time high (over 2 sec)?
  • Is there any long running (over 2 sec) or high cpu cost query?
  • Is there a network bottleneck (e.g. load-balancer)?
  • Is the app server or db server busy?
    • if app server is busy
      • check the request and limit values for cpu and memory
      • should the number of replicas be increased?
    • if db server is busy
      • check cpu, memory, disk current usage and limit value
      • check any utility in the background. e.g. backup
      • check db lock
      • check if there is any high cost query
      • check disk performance
db2top can be used for a real-time diagnosis.

  • Command: db2top -db <dbname>
    • press h: help screen
    • press I: reset the interval time (default is 2 seconds)
    • press m: memory screen
    • press B: bottleneck screen
    • press b: bufferpool screen
    • press T: Table screen
    • press U: locks screen
    • press u: utility screen to check if runstats is running
    • press D: Dynamic SQL screen
    • Catch High CPU SQL in Dynamic SQL screen, do:
      • Press z and 5 to sort by cpu usage
      • Copy SQL Hashcode
      • Press L and Paste SQL Hashcode
  • Note: Be cautious when taking any snapshot.
  • See more details in the User Manual
Diagnosis Commands

  • list memory allocation:
    db2mtrk -i -d -v
  • list long run query:
  • list backup/restore status:
    db2pd -barstats -d <dbname>
  • list most active tables:
  • list most active indexes:
  • list db2 advise for the statement:
    db2advis -database bludb  -s "select * from maximo.ahfactorhistory where ahdriverhistoryid = 123 for read only"  -n MAXIMO -q MAXIMO
  • check for indexes that need to be rebuilt:
    db2 reorgchk current statistics on schema 'MAXIMO' > /tmp/reorgchk.log
    Any indexes or tables with an * in the REORG column are candidates for reorg.
  • list the query execution plan:
    db2expln -database bludb -schema MAXIMO -package % -statement "select * from maximo.ahfactorhistory where ahdriverhistoryid = 123 for read only" -terminal -graph   > query1_access_plan.txt
  • list all indexes for a specific table:
    select * from syscat.indexes i where TABNAME ='ITEMSTRUCT'
  • list insert/update/delete/tablescan stats for a specific table:
    SELECT rows_read,rows_inserted,rows_updated,rows_deleted,table_scans FROM TABLE(MON_GET_TABLE('MAXIMO','ASSET',-2))
  • list insert/update/delete/tablescan stats for all tables:
  • list top 10 big tables:
    select creator, name, avgrowsize, card, stats_time, avgrowsize*card as tbsize, npages*t.pagesize/1024/1024 as tbsize_inMB from sysibm.systables t1, syscat.tablespaces t where creator not like 'DB2%' and t1.tbspace=t.tbspace order by tbsize desc fetch first 10 rows only
  • list data and index size for one table:
    select tabschema, tabname, DATA_OBJECT_P_SIZE/1024 as data_inMB, INDEX_OBJECT_P_SIZE/1024 as index_inMB,LONG_OBJECT_P_SIZE/1024 LongObj_inMB, LOB_OBJECT_P_SIZE/1024 as LOB_inMB from table(sysproc.admin_get_tab_info('MAXIMO','WORKORDER'))
  • list error message:
    db2 ? <sqlerror>
  • db2pd: monitor and troubleshoot DB2 database command
  • db2diag: db2diag logs analysis tool command
  • db2set: db2 global settings
  • db2 get dbm cfg: db2 database manager configuration
  • db2 get db cfg: db2 database configuration
IBM Data Server Manager (IBM DSM)


IBM DSM is useful for both real-time and historical diagnosis: finding expensive SQL queries and attributing CPU time to SQL execution or to other activity such as sorting, parsing, fetching and IO. It requires pre-configuration.


A high-level set up:

  • Download the latest version of Data Server Manager from IBM developerWorks or IBM Passport Advantage Online, then extract to /opt/ibm/dsm
  • run setup.sh to set up and create admin user
  • run start.sh to start the server, url is http://hostname:11080/console
  • log on the console, select a time period (e.g. peak time) and then generate report.
DBTest Utility


notes: This utility requires Java version 11 or higher.


The DBTest Utility has two modes:


Benchmark Mode (the default): measures database connection time, query execution time and data fetching time for every 100 records.


Query Mode: displays the query result along with database connection time, query execution time and data fetching time.


Here is an example demonstrating how to utilize this utility in the Maximo UI pod.


Run DBTest in MAS Manage maxinst pod

  • go to the maxinst pod in the MAS Manage namespace -> terminal tab, then execute the commands below:

cd /tmp
curl -L -v -o run-dbtest-in-maxinst-pod.sh https://ibm-mas.github.io/mas-performance/pd/download/DBTest/run-dbtest-in-maxinst-pod.sh
bash run-dbtest-in-maxinst-pod.sh

Run DBTest in Maximo UI Pod

  • go to the maximo ui pod -> terminal tab, then execute the commands below:

# change to /tmp
cd /tmp
# download DBTest
curl -L -v -o DBTest.class https://ibm-mas.github.io/mas-performance/pd/download/DBTest/DBTest.class
# set DBURL using ONE of the following three forms
# (when this utility runs in the maximo UI pod, DBURL can be taken from $MXE_DB_URL)
export DBURL="<jdbc url>"
# export DBURL="$MXE_DB_URL"
# export DBURL="${MXE_DB_URL}sslTrustStoreLocation=${java_truststore};sslTrustStorePassword=${java_truststore_password};"
export DBUSERNAME='<username>'
export DBPASSWORD='<password>'
export SQLQUERY='select * from maximo.maxattribute'
# execute the utility in benchmark mode
java -classpath .:$(dirname "$(find /opt | grep "oraclethin.jar" | head -n 1)")/* DBTest

Result Samples:


Given optimal network latency and a healthy database status, the expected data fetching time is less than 10 milliseconds.


Good Result: [screenshot: Good Result]


Bad Result: [screenshot: Bad Result]


Execute the utility in query mode

java -classpath .:$(dirname "$(find /opt | grep "oraclethin.jar" | head -n 1)")/* DBTest -q

Output Sample:

(base) [~/javatool]$ java -classpath .:./lib/* DBTest -q
Dec. 06, 2023 11:49:47 A.M. DBTest getConnection
INFO: Loading Class took: 0.029 seconds
Dec. 06, 2023 11:49:53 A.M. DBTest getConnection
INFO: DB Connecting took: 6.55 seconds
Dec. 06, 2023 11:49:53 A.M. DBTest printResult
INFO: Query Execution took: 0.099 seconds
APIKEY, READ, Access to API Keys application, 0, 1, null, ALL, null, 200004204, EN, 0, 290874862
Dec. 06, 2023 11:49:54 A.M. DBTest printResult
INFO: Fetching Record took: 0.058 seconds
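
The timings in the output sample above can also be checked by a script. The following is a minimal Python sketch, not part of the DBTest utility: it extracts the "INFO: <phase> took: <n> seconds" lines and reports phases that exceed a limit. The function names and the limits are assumptions made for illustration; choose limits that match your own environment.

```python
import re

# Matches lines such as "INFO: DB Connecting took: 6.55 seconds".
# A leading "+" (diff marker) is tolerated.
TIMING_RE = re.compile(
    r"^\+?INFO:\s*(?P<phase>.+?) took:\s*(?P<secs>\d+(?:\.\d+)?) seconds$"
)


def parse_timings(log_text):
    """Return a list of (phase, seconds) tuples found in DBTest output."""
    timings = []
    for line in log_text.splitlines():
        match = TIMING_RE.match(line.strip())
        if match:
            timings.append((match.group("phase"), float(match.group("secs"))))
    return timings


def slow_phases(timings, limits):
    """Return the (phase, seconds) entries that exceed their limit.

    `limits` maps a phase name to a maximum number of seconds. Phases
    without an entry are not checked. The limits are illustrative only.
    """
    return [(p, s) for p, s in timings if p in limits and s > limits[p]]


SAMPLE = """\
INFO: Loading Class took: 0.029 seconds
INFO: DB Connecting took: 6.55 seconds
INFO: Query Execution took: 0.099 seconds
INFO: Fetching Record took: 0.058 seconds
"""

if __name__ == "__main__":
    # Hypothetical limits: 1 s to connect, 10 ms to fetch.
    limits = {"DB Connecting": 1.0, "Fetching Record": 0.010}
    for phase, secs in slow_phases(parse_timings(SAMPLE), limits):
        print(f"{phase}: {secs} s exceeds {limits[phase]} s")
```

Keeping the limits in a dictionary makes it easy to use different thresholds for benchmark mode and query mode.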

As a result of architectural modifications, Maximo 8.x (MAS Manage app) now operates on WebSphere Liberty Base with OpenJ9 within the OpenShift Container Platform (OCP). It's essential to note that JVM arguments outlined in the 7.x Best Practice documentation may not be relevant or applicable to the Maximo 8.x environment. Here are additional details:


WebSphere Liberty


In WebSphere Liberty, the thread growth algorithm is enhanced to grow thread capacity more aggressively so that it reacts more rapidly to peak loads. For many environments, this autonomic tuning provided by the Open Liberty thread pool works well with no configuration or tuning by the operator. If necessary, you can adjust the coreThreads and maxThreads values.


Generic JVM Arguments

  • +



    Gencon is the default policy in OpenJ9; this parameter works in both 7.x and 8.x.

  • +
  • +

    -Xmx or -XX:MaxRAMPercentage (maximum heap size)


    If -Xmx is not specified, the JVM uses 75% of the total container memory when -XX:+UseContainerSupport is set. When -Xmx is set, -XX:MaxRAMPercentage is ignored.

  • +
  • +



    When -XX:+UseContainerSupport is set, the InitialRAMPercentage and MaxRAMPercentage values can be changed. -Xms and -Xmx override these limits.

  • +
  • +

    -Xmn (Nursery Space)


    Setting the size of the nursery when using the gencon policy can be very important for performance. A nursery of 25 - 33% of the total heap is recommended. Note that the 10G memory limit of the Manage pod is not the total heap size: the heap size is determined by the -Xmx or -XX:MaxRAMPercentage setting, and the 10G also covers the memory that WebSphere uses for caches and compilation, as well as the Maximo MMI container.

  • +
  • +



    This parameter sets the number of threads that the garbage collector uses for parallel operations. By default, OpenJ9 sets it to n - 1, where n is the number of CPUs reported on the node. You might want to restrict the number of GC threads used by each VM to reduce overhead.

  • +
  • +



    This parameter specifies the number of compilation threads that the JIT compiler uses. As with the GC threads, you might want to restrict the number of compilation threads used by each VM to reduce overhead.

  • +
  • +



    This parameter shares class data between running VMs, which can reduce the startup time of a VM once the cache has been created.

  • +
  • +

    -Xdisableexplicitgc (Recommended)


    This parameter disables explicit garbage collection, which prevents System.gc() calls from triggering garbage collections. For optimal performance, disable explicit garbage collection.

  • +
  • +



    For performance reasons, Maximo recommends setting this property to true. Note: this parameter cannot be applied on hosts that communicate only over IPv6.

  • +
  • +

    -XX:PermSize and -XX:MaxPermSize


    The Maximo 7.x Best Practice recommends 320m. If you see an out-of-memory (OOM) error for PermSize, consider increasing it to 512m or higher.

  • +
  • +

    -Xcodecache32m


    The maximum value you can specify for -Xcodecache is 32 MB. The JIT compiler might allocate more than one code cache; the total is controlled by -Xcodecachetotal, whose default value is 256 MB.

  • +
  • +



    Enables the verbose GC log for garbage collection analysis.

  • +
  • +

    -Xtune:virtualized (under review)


    Optimizes OpenJ9 VM function for virtualized environments, such as a cloud, by reducing OpenJ9 VM CPU consumption when idle.

  • +
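
The heap, nursery, and GC thread guidance above comes down to simple arithmetic. The following Python sketch works through it for a pod with a 10G memory limit. It assumes the 75% default described above and the 25 - 33% nursery recommendation; the function names are hypothetical and the resulting values are a starting point for tuning, not a recommendation from the source.

```python
def max_heap_mb(container_limit_mb, max_ram_percentage=75.0):
    """Maximum heap in MB when only -XX:MaxRAMPercentage (or its default) applies."""
    return container_limit_mb * max_ram_percentage / 100.0


def nursery_range_mb(heap_mb, low=0.25, high=0.33):
    """Recommended nursery size range (25% to 33% of the heap), in whole MB."""
    return int(heap_mb * low), int(heap_mb * high)


def default_gc_threads(reported_cpus):
    """GC thread default described above: n - 1, but never fewer than one."""
    return max(1, reported_cpus - 1)


if __name__ == "__main__":
    limit_mb = 10 * 1024                      # Manage pod memory limit (10G)
    heap = max_heap_mb(limit_mb)              # 7680.0 MB of heap, not 10240
    low, high = nursery_range_mb(heap)
    print(f"max heap: {heap:.0f} MB")
    print(f"nursery between -Xmn{low}m and -Xmn{high}m")
    print(f"default GC threads on a 16-CPU node: {default_gc_threads(16)}")
```

The example makes the point from the -Xmn description concrete: the nursery percentage applies to the heap (7680 MB here), not to the 10G pod limit.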


Coming soon ...


Critical Note




The IBM Product Support you have purchased with your IBM Maximo Application Suite Product does not cover this Application extension. Do not attempt to submit an IBM support ticket.


The IBM TechXChange Maximo Community discussions can be leveraged to crowd-source assistance from Maximo Experts.


What is IBM Maximo Cluster Performance Insights


IBM Maximo Cluster Performance Insights (Maximo CPI) is a new utility that uses short- and long-term snapshots to check specific best practices for deploying Maximo Application Suite. It can help pinpoint areas that need improvement and provides actionable insights for optimizing the MAS deployment.


Maximo Clients can conduct a self-assessment to ensure adherence to best practices, optimize resource use, and diagnose performance issues. This process helps in evaluating current practices, identifying areas for improvement, and enhancing overall efficiency and effectiveness.


The utility gathers only metrics data, excluding any sensitive information. It is containerized for ease of use.


IBM Maximo Cluster Performance Insights Main Features

  • Identify any missing or incorrect settings that do not follow MAS best practices
  • Offer an in-depth evaluation of the deployed MAS system's performance
  • Provide recommendations for minimizing the size of the MAS deployment to reduce infrastructure costs
  • Identify certificates that have expired or are about to expire
  • Provide suggestions for rebalancing node resource utilization to optimize the workload
  • Send notifications via Slack
  • Offer a platform for customized MAS Manage scheduled scaling

User guide

  • +

    Run on Docker

    • Download the Docker image: docker pull quay.io/brianzhu_ibm/mcpi:latest
    • Run the Docker container: docker run -dit -p 8888:8888 --name mcpi quay.io/brianzhu_ibm/mcpi:latest
    • Data Collection
      • Enter the Docker container: docker exec -it --user root mcpi bash
      • Log in to the OpenShift cluster: oc login https://<openshift-master-url>:<port> -u <username> -p <password> or oc login https://<openshift-master-url>:<port> --token=<token>
      • Run the data collection command: collect-metric.sh
      • Note: when the command finishes, it returns the path to the MHC JSON file. Below is a sample of the output. In this case, the path to the MHC JSON file is /tmp/mhc-2024-08-01-19-36.json [image]
    • Data Review
      • Open the mcpi viewer URL (http://localhost:8888) in the browser
      • Review the data: under "Load a MAS Harmony Checker JSON file from the server's path", enter the path to the MHC JSON file, e.g. /tmp/mhc-2024-08-01-19-36.json. Below is a sample snapshot [image]
  • +
  • +

    Run on OpenShift Cluster

    • Download maximo-cpi-deployment.yaml
    • Log in to the OpenShift cluster console
    • Click + to import YAML, then drag and drop maximo-cpi-deployment.yaml
    • Data Collection
      • Log in to the cluster console
      • Go to the maximo-cpi project
      • Click the mcpi-deployment-xxx pod
      • Go to the Terminal tab
        • Log in to the OpenShift cluster: oc login https://<openshift-master-url>:<port> -u <username> -p <password> or oc login https://<openshift-master-url>:<port> --token=<token>
        • Run the data collection command: collect-metric.sh
        • Note: when the command finishes, it returns the path to the MHC JSON file. See the sample in the Run on Docker section
    • Data Review
      • Go to the maximo-cpi project -> Networking -> Routes
      • Click the mcpi-viewer-route URL
      • Review the data: under "Load a MAS Harmony Checker JSON file from the server's path", enter the path to the MHC JSON file, e.g. /tmp/mhc-2024-08-01-19-36.json. See the sample in the Run on Docker section
  • +

Most Common User Scenarios


1) Best practice for minimizing footprint through Maximo CPI

  • Step 1: Eliminate surplus nodes, if any
  • Step 2: Balance CPU and Memory Request%; align CPU and memory requests to match hardware specifications, such as a ratio of 1:4 or 1:8.
  • Step 3: Continuously reduce the resource requests for pods/containers to enhance utilization. Ideally, aim for resource utilization that exceeds the resource requests and approaches 60–70% of the cluster capacity.
  • Repeat Steps 1 – 3 if needed
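
The checks in Steps 2 and 3 can be expressed as a few ratios. The Python sketch below is an illustration only and is not how Maximo CPI computes its report: the function name and input figures are hypothetical, and it assumes the 1:4 or 1:8 ratio means CPU cores to GiB of memory.

```python
def footprint_report(cap_cpu, cap_mem_gib, req_cpu, req_mem_gib, used_cpu, used_mem_gib):
    """Summarize how requests and real usage compare with cluster capacity."""
    return {
        # Step 2: the two request percentages should be roughly balanced.
        "cpu_request_pct": round(100 * req_cpu / cap_cpu, 1),
        "mem_request_pct": round(100 * req_mem_gib / cap_mem_gib, 1),
        # Step 3: real usage, to compare against the 60-70% target.
        "cpu_used_pct": round(100 * used_cpu / cap_cpu, 1),
        "mem_used_pct": round(100 * used_mem_gib / cap_mem_gib, 1),
        # Step 2: requested GiB per core versus what the hardware offers.
        "hardware_gib_per_core": round(cap_mem_gib / cap_cpu, 2),
        "request_gib_per_core": round(req_mem_gib / req_cpu, 2),
    }


if __name__ == "__main__":
    # Hypothetical cluster: 48 cores and 192 GiB (a 1:4 ratio).
    report = footprint_report(
        cap_cpu=48, cap_mem_gib=192,
        req_cpu=36, req_mem_gib=96,
        used_cpu=12, used_mem_gib=80,
    )
    for key, value in report.items():
        print(f"{key}: {value}")
```

In this made-up example, CPU requests (75%) are far above memory requests (50%) and above real CPU usage (25%), so CPU requests are the first candidate for reduction in Step 3.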

2) Best practice for performance troubleshooting and configuration checking

  • Step 1: Use the heatmap viewer to find the problematic pods and nodes
  • Step 2: Use the Maximo CPI viewer to review the metric details
  • Step 3: Identify the severity and functional impacts
  • Step 4: Vertically and horizontally adjust the pod/service/node and apply the recommended OpenShift configuration if needed
  • Repeat Steps 1 – 4 if needed

3) Rebalance Node Resource

  • Issue Description: Resource usage is unbalanced among the nodes, e.g. some nodes use 80% CPU while others use 20% CPU.
  • Reason: Imbalanced placement. OpenShift schedules the service/pod based on the resource cost increment, not the real resource usage.
  • Solution: Migrate pods from busy nodes to less busy nodes with as few movements as possible. This is a typical bin-packing (NP-hard) problem. Maximo CPI uses a greedy algorithm, since neither the running time nor a strictly minimal number of steps is critical.
  • Actions:       ⚠️ Moving pods can be disruptive at times, as it may cause an outage while the stateful service pod is being relocated.
    • Run node-balance.sh. If an issue is detected, the output provides the movepod commands
    • Run movepod.sh to move the pods.
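
To illustrate the greedy idea described above, here is a small Python sketch. It is not the implementation behind node-balance.sh, whose internals are not shown here: the data layout, the tolerance, and the rule for picking a pod are assumptions. Each round takes the busiest and the least busy node and moves the pod whose usage is closest to half the gap between them.

```python
def plan_moves(nodes, tolerance=10.0, max_moves=20):
    """Greedy rebalancing sketch.

    nodes: {node_name: {pod_name: cpu_usage_pct}}
    Returns a list of (pod, source_node, destination_node) moves.
    """
    nodes = {name: dict(pods) for name, pods in nodes.items()}  # work on a copy
    moves = []
    while len(moves) < max_moves:
        load = {name: sum(pods.values()) for name, pods in nodes.items()}
        src = max(load, key=load.get)
        dst = min(load, key=load.get)
        gap = load[src] - load[dst]
        if gap <= tolerance:
            break  # balanced enough
        # Only pods smaller than the gap narrow the difference when moved.
        candidates = [(cpu, pod) for pod, cpu in nodes[src].items() if 0 < cpu < gap]
        if not candidates:
            break  # no single move helps
        # Greedy choice: the pod closest to half the gap evens things out most.
        cpu, pod = min(candidates, key=lambda c: abs(gap / 2 - c[0]))
        nodes[dst][pod] = nodes[src].pop(pod)
        moves.append((pod, src, dst))
    return moves


if __name__ == "__main__":
    cluster = {
        "node-a": {"p1": 40, "p2": 25, "p3": 15},  # 80% CPU
        "node-b": {"p4": 20},                      # 20% CPU
        "node-c": {"p5": 30, "p6": 20},            # 50% CPU
    }
    for pod, src, dst in plan_moves(cluster):
        print(f"move {pod}: {src} -> {dst}")
```

A greedy pass like this does not guarantee the minimum number of moves, which matches the trade-off stated above. A real tool must also respect the warning about stateful pods, which this sketch ignores.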

4) Scheduled Scaling

  • Modify mas-manage-scheduled-scaling-sample.sh to adjust the parameters, e.g. the time and the number of pod replicas
  • Set up the Slack URL and channel name for notifications if needed

5) Expired and Expiring Certificate

  • Modify cert-expiration-slack-alert-sample.sh to adjust the parameters, e.g. time and expiration-in-days
  • Set up the Slack URL and channel name for notifications if needed
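
The expiration-in-days parameter can be understood as a simple date comparison. The Python sketch below shows that logic only; it is not taken from cert-expiration-slack-alert-sample.sh, and the function name, the 30-day default, and the sample dates are assumptions.

```python
from datetime import datetime, timedelta, timezone


def classify_cert(not_after, now, expiration_in_days=30):
    """Classify a certificate by its expiry timestamp.

    Returns "expired", "expiring" (within expiration_in_days), or "ok".
    """
    if not_after <= now:
        return "expired"
    if not_after - now <= timedelta(days=expiration_in_days):
        return "expiring"
    return "ok"


if __name__ == "__main__":
    now = datetime(2024, 8, 1, tzinfo=timezone.utc)
    # Hypothetical certificates and their "not after" dates.
    certs = {
        "cert-a": datetime(2024, 7, 31, tzinfo=timezone.utc),
        "cert-b": datetime(2024, 8, 15, tzinfo=timezone.utc),
        "cert-c": datetime(2025, 1, 1, tzinfo=timezone.utc),
    }
    for name, not_after in certs.items():
        print(name, classify_cert(not_after, now))
```

Only the "expired" and "expiring" results would need a Slack notification.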


  • Release this utility to the public via IBM Accelerator
  • +
  • Extend metric collection to cover the database performance metrics
  • +
  • Add and enhance the policies for alerting and best practices
  • +
  • Enhance MAS Optimization, Sizing, Re-balance, Scaling, Performance Diagnosis via AI technology
  • +

Ping test Utility


When diagnosing a request timeout problem, it is helpful to rule out gateways/load balancers outside the OCP cluster. Sometimes these external gateways have short timeouts that reset a connection before the request is completed. The Ping test utility is designed to help diagnose this issue.




As of this writing, the Ping test utility is not part of the base server bundle code and needs to be loaded via a customization archive. This means that the ManageWorkspace CR needs to be updated and will require a restart of the server bundles (i.e. it will cause a disruption while the server bundle pod is restarted).


Updating the ManageWorkspace CR

  • Edit the ManageWorkspace CR in the MAS Manage namespace
  • +

Single customization archive:

  settings:
    customization:
      customizationArchive: https://ibm.box.com/shared/static/oj9z062b7x8hcywndnv6n57gya7e1ypz.zip


If you already have a customization archive, add this one to the customizationList:

  settings:
    customizationList:
    - customizationArchiveName: archiveAlias1
      customizationArchiveUrl: https://ibm.box.com/shared/static/oj9z062b7x8hcywndnv6n57gya7e1ypz.zip

  • Wait for the MAS Manage workspace operator to update the server bundle pods with the Ping servlet class and restart the server bundle pods
  • +

Using the Ping servlet utility to test request timeouts outside the OCP cluster

  • Run the following curl command from outside the OCP cluster, using the external hostname of the MAS Manage server bundle pod. The command sends a request to the Ping servlet, which waits for 1 second before responding. If a response is returned, no timeout occurred.

$ curl "https://tenant1.manage.masperf4.ibmmas.com/maximo/ping?timeout=1"
{"thread wait time":"1 seconds","status":"ok"}

  • Change the timeout value to match the timeout that you observe in the problematic request. For example, the Ping request below sets a timeout of 300 seconds. If no response is received, the request timed out, and the same request should be attempted from inside the OCP cluster using the private IP address of the server bundle pod (see Using the Ping servlet utility to test request timeouts inside the OCP cluster below).

$ curl "https://tenant1.manage.masperf4.ibmmas.com/maximo/ping?timeout=300"

Using the Ping servlet utility to test request timeouts inside the OCP cluster

  • Obtain the internal Cluster IP address of the MAS Manage UI service.
  • +
  • Go to maxinst pod in the MAS Manage namespace -> terminal tab, then execute below commands:
  • +
$ curl --insecure
{"thread wait time":"300 seconds","status":"ok"}

If you receive a response to the request issued to the internal Cluster IP address of the MAS Manage UI service, but do not receive a response to the request issued from outside the cluster, an external gateway service or load balancer may be closing the connection because of a shorter timeout set on the gateway. Check with a network administrator.
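
The comparison of the external and internal results can be written down as a small decision helper. The Python sketch below is a hedged illustration: the function names, the 30-second grace period, and the wording of the "inside-cluster" outcome are assumptions, since the text above only covers the case where the internal request succeeds and the external one does not. The /maximo/ping path and the "status":"ok" field are taken from the samples above.

```python
import http.client
import json
import ssl
import urllib.request


def ping_url(base_url, timeout_s):
    """Build the Ping servlet URL for a given wait time in seconds."""
    return f"{base_url.rstrip('/')}/maximo/ping?timeout={timeout_s}"


def ping_ok(base_url, timeout_s, verify_tls=True):
    """Return True if the Ping servlet answers with status "ok"."""
    context = ssl.create_default_context()
    if not verify_tls:  # same effect as curl --insecure
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    try:
        # Wait a little longer than the servlet itself (assumed 30 s grace).
        with urllib.request.urlopen(
            ping_url(base_url, timeout_s), timeout=timeout_s + 30, context=context
        ) as response:
            body = json.loads(response.read().decode("utf-8"))
            return body.get("status") == "ok"
    except (OSError, ValueError, http.client.HTTPException):
        return False  # reset, timeout, or an unreadable response


def diagnose(external_ok, internal_ok):
    """Interpret the two results as described in the text above."""
    if external_ok:
        return "ok"                # no timeout on the external path
    if internal_ok:
        return "external-gateway"  # gateway or load balancer closes the connection
    return "inside-cluster"        # assumption: look at the cluster or the request itself
```

A possible use is `diagnose(ping_ok(external_base, 300), ping_ok(internal_base, 300, verify_tls=False))`, where `external_base` and `internal_base` are placeholders for the external hostname and the internal Cluster IP address of your environment.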

Thus it is absolute. + return path; + } + if (base.substring(base.length-1) === "/") { + // base ends with `/` + return base + path; + } + return base + "/" + path; +} + +function escapeHtml (value) { + return value.replace(/&/g, '&') + .replace(/"/g, '"') + .replace(//g, '>'); +} + +function formatResult (location, title, summary) { + return ''; +} + +function displayResults (results) { + var search_results = document.getElementById("mkdocs-search-results"); + while (search_results.firstChild) { + search_results.removeChild(search_results.firstChild); + } + if (results.length > 0){ + for (var i=0; i < results.length; i++){ + var result = results[i]; + var html = formatResult(result.location, result.title, result.summary); + search_results.insertAdjacentHTML('beforeend', html); + } + } else { + var noResultsText = search_results.getAttribute('data-no-results-text'); + if (!noResultsText) { + noResultsText = "No results found"; + } + search_results.insertAdjacentHTML('beforeend', '

' + noResultsText + '

'); + } +} + +function doSearch () { + var query = document.getElementById('mkdocs-search-query').value; + if (query.length > min_search_length) { + if (!window.Worker) { + displayResults(search(query)); + } else { + searchWorker.postMessage({query: query}); + } + } else { + // Clear results for short queries + displayResults([]); + } +} + +function initSearch () { + var search_input = document.getElementById('mkdocs-search-query'); + if (search_input) { + search_input.addEventListener("keyup", doSearch); + } + var term = getSearchTermFromLocation(); + if (term) { + search_input.value = term; + doSearch(); + } +} + +function onWorkerMessage (e) { + if (e.data.allowSearch) { + initSearch(); + } else if (e.data.results) { + var results = e.data.results; + displayResults(results); + } else if (e.data.config) { + min_search_length = e.data.config.min_search_length-1; + } +} + +if (!window.Worker) { + console.log('Web Worker API not supported'); + // load index in main thread + $.getScript(joinUrl(base_url, "search/worker.js")).done(function () { + console.log('Loaded worker'); + init(); + window.postMessage = function (msg) { + onWorkerMessage({data: msg}); + }; + }).fail(function (jqxhr, settings, exception) { + console.error('Could not load worker.js'); + }); +} else { + // Wrap search in a web worker + var searchWorker = new Worker(joinUrl(base_url, "search/worker.js")); + searchWorker.postMessage({init: true}); + searchWorker.onmessage = onWorkerMessage; +} diff --git a/search/search_index.json b/search/search_index.json new file mode 100644 index 0000000..d75c467 --- /dev/null +++ b/search/search_index.json @@ -0,0 +1 @@ +{"config":{"indexing":"full","lang":["en"],"min_search_length":3,"prebuild_index":false,"separator":"[\\s\\-]+"},"docs":[{"location":"","text":"Welcome to MAS Performance Wiki \uf0c1 Info This site will be updated periodically. More topics will be added soon. 
Lab benchmarks are not published, but can be shared upon request, with completion of an NDA. This site provides best practices, sizing and troubleshooting guidelines to improve the performance of IBM Maximo Application Suite (MAS) . Maximo 7.x Best Practices are also available on the site. Most DB configurations in the best practice are still applicable to MAS Manage app.","title":"Home"},{"location":"#welcome-to-mas-performance-wiki","text":"Info This site will be updated periodically. More topics will be added soon. Lab benchmarks are not published, but can be shared upon request, with completion of an NDA. This site provides best practices, sizing and troubleshooting guidelines to improve the performance of IBM Maximo Application Suite (MAS) . Maximo 7.x Best Practices are also available on the site. Most DB configurations in the best practice are still applicable to MAS Manage app.","title":"Welcome to MAS Performance Wiki"},{"location":"mas/aws/bestpractice/","text":"AWS \uf0c1 Instance Type \uf0c1 There are many instance types available in AWS. Based on the benchmark, recommend M5, M6 instances (e.g.M5.4xlarge) as master or worker nodes and P3, P4 as GPU nodes. Note Depending on the regions, some instances may not be available. Use AWS Pricing Calculator to check the instance availability and cost. g4dn can be used as GPU node for test/dev env, but not recommended for production env. If the application requires a good network performance, check Amazon EC2 instance network bandwidth site for more details. For production env, an instance with 10GB ethernet is recommended. Classic Load Balancer Idle Timeout \uf0c1 Each OCP cluster creates 1 class load balancer and 2 network load balancers in AWS. AWS classic load balancer has a default idle time 60 seconds. In some cases, this value is not enough for a long time transaction (e.g. asset health check notebook). Consider to adjust this value to what the application needs (e.g. 300 seconds). 
Also, monitoring classic load-balance performance is strongly recommend, particularly with IoT related app. (Note: Surge Queue Length's defaults to a hardcoded limit of 1024 . When queue is fully, the tcp handshake will fail) Amazon DocumentDB \uf0c1 DocumentDB is a fully managed MongoDB compatibility database. It can be used by both IBM Suite License Service (SLS) and MAS Core. However, there are functional differences between DocumentDB and MongoDB. Check this link for more details. Note: When using DocumentDB, it requires to set RetryWrite=false in SLS and Suite CRs. Amazon MSK \uf0c1 MAS supports MSK which is a fully managed apache Kafka service. Note monitor MSK performance via CloudWatch is strongly recommended. Key metrics include Disk usage by broker, CPU (User) usage by broker, Active Controller Count, Network RX packets by broker, Network TX packets by broker . define an appropriate config for Kafka, MSK and topics. e.g. retention.ms, retention.bytes, partitions and replics to support the workload. AWS Storage \uf0c1 EBS storages like gp2, gp3 are supported by OCP in AWS. Note: EBS storage is ReadWriteOnce. The volume can be mounted as read-write by a single node. io1 and io2 are SSD-based EBS that provides the higher performance. Check Amazon EBS volume types for extra info like throughput, tuning and cost. Below is a sample yaml to create io1 storageclass with 100 iopsPerGB . kind : StorageClass apiVersion : storage.k8s.io/v1 metadata : name : io1 provisioner : kubernetes.io/aws-ebs parameters : encrypted : 'true' iopsPerGB : '100' type : io1 reclaimPolicy : Delete allowVolumeExpansion : true volumeBindingMode : Immediate EFS Storage can be used as ReadWriteMany storageclass. EFS has different metered throughput modes. Bursting Throughput mode is the default. It is inexpensive, but does NOT perform well if all burst credits are used. Monitor BurstCreditBalance metric in CloudWatch. Provisioned Throughput mode is relatively expensive. 
It can drive up to 3 GiBps for read operations and 1 GiBps for write operations per file system More info can be found at Amazon EFS performance Self-managed OCP vs AWS ROSA \uf0c1 A self-managed OCP Cluster can be created by the installer cli tool that supports both IPI and UPI mode. It requires self maintenance and upgrades. Alternatively, ROSA is a managed Red Hat OpenShift Service. Each ROSA cluster comes with a fully managed control plane and compute nodes. Installation, management, maintenance, and upgrades are performed by Red Hat site reliability engineers (SRE) with joint Red Hat and Amazon support.","title":"AWS"},{"location":"mas/aws/bestpractice/#aws","text":"","title":"AWS"},{"location":"mas/aws/bestpractice/#instance-type","text":"There are many instance types available in AWS. Based on the benchmark, recommend M5, M6 instances (e.g.M5.4xlarge) as master or worker nodes and P3, P4 as GPU nodes. Note Depending on the regions, some instances may not be available. Use AWS Pricing Calculator to check the instance availability and cost. g4dn can be used as GPU node for test/dev env, but not recommended for production env. If the application requires a good network performance, check Amazon EC2 instance network bandwidth site for more details. For production env, an instance with 10GB ethernet is recommended.","title":"Instance Type"},{"location":"mas/aws/bestpractice/#classic-load-balancer-idle-timeout","text":"Each OCP cluster creates 1 class load balancer and 2 network load balancers in AWS. AWS classic load balancer has a default idle time 60 seconds. In some cases, this value is not enough for a long time transaction (e.g. asset health check notebook). Consider to adjust this value to what the application needs (e.g. 300 seconds). Also, monitoring classic load-balance performance is strongly recommend, particularly with IoT related app. (Note: Surge Queue Length's defaults to a hardcoded limit of 1024 . 
When queue is fully, the tcp handshake will fail)","title":"Classic Load Balancer Idle Timeout"},{"location":"mas/aws/bestpractice/#amazon-documentdb","text":"DocumentDB is a fully managed MongoDB compatibility database. It can be used by both IBM Suite License Service (SLS) and MAS Core. However, there are functional differences between DocumentDB and MongoDB. Check this link for more details. Note: When using DocumentDB, it requires to set RetryWrite=false in SLS and Suite CRs.","title":"Amazon DocumentDB"},{"location":"mas/aws/bestpractice/#amazon-msk","text":"MAS supports MSK which is a fully managed apache Kafka service. Note monitor MSK performance via CloudWatch is strongly recommended. Key metrics include Disk usage by broker, CPU (User) usage by broker, Active Controller Count, Network RX packets by broker, Network TX packets by broker . define an appropriate config for Kafka, MSK and topics. e.g. retention.ms, retention.bytes, partitions and replics to support the workload.","title":"Amazon MSK"},{"location":"mas/aws/bestpractice/#aws-storage","text":"EBS storages like gp2, gp3 are supported by OCP in AWS. Note: EBS storage is ReadWriteOnce. The volume can be mounted as read-write by a single node. io1 and io2 are SSD-based EBS that provides the higher performance. Check Amazon EBS volume types for extra info like throughput, tuning and cost. Below is a sample yaml to create io1 storageclass with 100 iopsPerGB . kind : StorageClass apiVersion : storage.k8s.io/v1 metadata : name : io1 provisioner : kubernetes.io/aws-ebs parameters : encrypted : 'true' iopsPerGB : '100' type : io1 reclaimPolicy : Delete allowVolumeExpansion : true volumeBindingMode : Immediate EFS Storage can be used as ReadWriteMany storageclass. EFS has different metered throughput modes. Bursting Throughput mode is the default. It is inexpensive, but does NOT perform well if all burst credits are used. Monitor BurstCreditBalance metric in CloudWatch. 
Provisioned Throughput mode is relatively expensive. It can drive up to 3 GiBps for read operations and 1 GiBps for write operations per file system More info can be found at Amazon EFS performance","title":"AWS Storage"},{"location":"mas/aws/bestpractice/#self-managed-ocp-vs-aws-rosa","text":"A self-managed OCP Cluster can be created by the installer cli tool that supports both IPI and UPI mode. It requires self maintenance and upgrades. Alternatively, ROSA is a managed Red Hat OpenShift Service. Each ROSA cluster comes with a fully managed control plane and compute nodes. Installation, management, maintenance, and upgrades are performed by Red Hat site reliability engineers (SRE) with joint Red Hat and Amazon support.","title":"Self-managed OCP vs AWS ROSA"},{"location":"mas/azure/bestpractice/","text":"Azure \uf0c1 Azure Storage \uf0c1 For OCP cluster, it recommends Premium File Storage, because MAS and its components need RWX (Read/Write/Many permission) storage to support a certain level high availability as well as doclink, jms storage\u2026 For External DB VM, it recommends a high-performance storage like Premium SSD or v2 or Ultra Disk . More performance metric can be found in https://learn.microsoft.com/en-us/azure/virtual-machines/disks-types.","title":"Azure"},{"location":"mas/azure/bestpractice/#azure","text":"","title":"Azure"},{"location":"mas/azure/bestpractice/#azure-storage","text":"For OCP cluster, it recommends Premium File Storage, because MAS and its components need RWX (Read/Write/Many permission) storage to support a certain level high availability as well as doclink, jms storage\u2026 For External DB VM, it recommends a high-performance storage like Premium SSD or v2 or Ultra Disk . 
More performance metric can be found in https://learn.microsoft.com/en-us/azure/virtual-machines/disks-types.","title":"Azure Storage"},{"location":"mas/core/bestpractice/","text":"MAS Core \uf0c1 The MAS core namespace contains several important services required for user login and authentication, application management, MAS adoption metrics, licensing, etc. To understand the insight of each service/pod functionality in MAS core, check MAS Pods Explained . Scaling MAS core for large number of concurrent users \uf0c1 The following are the key components/dependencies that require scaling as the number of concurrent MAS users grows. MongoDB (used extensively by coreidp, api-licensing, adoptionusage, and other MAS/SLS microservices) MAS core namespace: coreidp pods licencing-mediator pods coreapi pods (if users directly login to a MAS application, bypassing the Suite navigator page, this decreases the load on coreapi pods) SLS namespace: api-licensing pods k8s apiserver pods (coreapi pods issue k8s api calls to retrieve information from MAS application CRs, configmaps, etc.) Caveat The scaling guidance described below is provided from lab benchmark testing and may vary based on the differences in workload, environment, or configuration settings. MongoDB \uf0c1 MongoDB is a crucial dependency for MAS core services, if not scaled properly MongoDB can quickly become bottleneck as the number of concurrent users increases. A common symptom of an undersized MongoDB cluster is liveness probe timeouts and pod restarts of the MAS core services which depend on MongoDB (e.g. coreidp). For useful MongoDB troubleshooting commands see MongoDB Troubleshooting Key MongoDB metrics to monitor \uf0c1 The following MongoDB metrics are important to monitor Memory utilization: by default MongoDB will attempt to cache the active data set in memory (in the WiredTiger cache). 
If there are a large number of cache evictions or the mongod servers are oomkilled these can be indicators that the memory allocation is too small. Consider increasing the memory allocated to mongod server. CPU utilization: check that the mongod servers have not reached their allocated cpu limit Average read/write latency: average read and write latency should be under 50 milliseconds. If not it could be due to an undersized MongoDB cluster. Check that the MongoDB cluster has sufficient memory allocation and check disk performance. Lock waiters: a large number of lock waiters indicates contention on collections/documents in MongoDB Tip When using the ibm.mas_devops collection to install MAS you can optionally install Grafana with the cluster_monitoring ansible role . Once Grafana is installed via the cluster_monitoring ansible role you can then install MongoDB using the mongodb ansible role . The mongodb ansible role includes a Grafana dashboard for monitoring the MongoDB cluster. If your using a MongoDB cluster hosted by a cloud provider uses the monitoring dashboards provided by the cloud provider. Important MongoDB databases and collections \uf0c1 The following databases and collections in MongoDB are accessed frequently during user login and authentication. Database: mas_{{mas-instance-id}}_core Collection: User (user lookup during authentication) Collection: OauthToken (token creation/deletion) Database: {{sls-id}}_sls_licensing Collection: licenses (checkin/checkout licenses) Database: mas_{{mas-instance-id}}_adoptionusage Collection: users (daily adoption usage statistics) Collection: users_hourly (hourly adoption usage statistics) Scaling MongoDB community \uf0c1 The table below provides some general guidance on scaling MongoDB based on number of concurrent users and login rate. To scale MongoDB community edition you should specify the desired cpu/mem limits in the MongoDBCommunity CR. 
spec: statefulset: spec: template: spec: containers: - name: mongod resources: limits: cpu: memory: Login rate (logins/minute) MongoDB CPU limit MongoDB Memory limit (GB) 75 2 4 150 2 4 300 4 8 600 6 12 1200 8 16 Scaling coreidp service (MAS core namespace) \uf0c1 The table below provides some general guidance on scaling the coreidp service based on number of concurrent users and login rate. To scale the coreidp service use the podTemplates workload customization feature in MAS. Login rate (logins/minute) coreidp replicas coreidp CPU limit coreidp Memory limit (GB) 75 1 6 1 150 1 6 1 300 1 6 1 600 2 6 2 1200 4 6 3 Scaling licensing-mediator service (MAS core namespace) \uf0c1 The table below provides some general guidance on scaling the licensing-mediator service based on number of concurrent users and login rate. The coreidp service calls the licensing-mediator service which in turn calls the api-licensing service in the SLS namespace for license checkin/checkout operations. To scale the licensing-mediator service use the podTemplates workload customization feature in MAS. Login rate (logins/minute) licensing-mediator replicas licensing-mediator CPU limit licensing-mediator Memory limit (GB) 75 1 1 1 150 1 1 1 300 2 2 1 600 4 3 1 1200 6 3 1 Scaling api-licensing service (SLS namespace) \uf0c1 The table below provides some general guidance on scaling the api-licensing service based on number of concurrent users and login rate. To scale the api-licensing service use the podTemplates workload customization feature in MAS. Login rate (logins/minute) api-licensing replicas api-licensing CPU limit api-licensing Memory limit (GB) 75 1 1 2 150 1 2 2 300 2 2 2 600 2 2 2 1200 2 2 2 Scaling coreapi service (MAS core namespace) \uf0c1 The table below provides some general guidance on scaling the coreapi service based on number of concurrent users and login rate. To scale the coreapi service use the podTemplates workload customization feature in MAS. 
Login rate (logins/minute) coreapi replicas coreapi CPU limit coreapi Memory limit (GB) 75 3 1 2 150 3 1 2 300 3 1 2 600 3 2 2 1200 3 3 2","title":"MAS Core"},{"location":"mas/core/bestpractice/#mas-core","text":"The MAS core namespace contains several important services required for user login and authentication, application management, MAS adoption metrics, licensing, etc. To understand the insight of each service/pod functionality in MAS core, check MAS Pods Explained .","title":"MAS Core"},{"location":"mas/core/bestpractice/#scaling-mas-core-for-large-number-of-concurrent-users","text":"The following are the key components/dependencies that require scaling as the number of concurrent MAS users grows. MongoDB (used extensively by coreidp, api-licensing, adoptionusage, and other MAS/SLS microservices) MAS core namespace: coreidp pods licencing-mediator pods coreapi pods (if users directly login to a MAS application, bypassing the Suite navigator page, this decreases the load on coreapi pods) SLS namespace: api-licensing pods k8s apiserver pods (coreapi pods issue k8s api calls to retrieve information from MAS application CRs, configmaps, etc.) Caveat The scaling guidance described below is provided from lab benchmark testing and may vary based on the differences in workload, environment, or configuration settings.","title":"Scaling MAS core for large number of concurrent users"},{"location":"mas/core/bestpractice/#mongodb","text":"MongoDB is a crucial dependency for MAS core services, if not scaled properly MongoDB can quickly become bottleneck as the number of concurrent users increases. A common symptom of an undersized MongoDB cluster is liveness probe timeouts and pod restarts of the MAS core services which depend on MongoDB (e.g. coreidp). 
For useful MongoDB troubleshooting commands see MongoDB Troubleshooting","title":"MongoDB"},{"location":"mas/core/bestpractice/#key-mongodb-metrics-to-monitor","text":"The following MongoDB metrics are important to monitor. Memory utilization: by default MongoDB will attempt to cache the active data set in memory (in the WiredTiger cache). A large number of cache evictions, or mongod servers being OOM-killed, can indicate that the memory allocation is too small. Consider increasing the memory allocated to the mongod server. CPU utilization: check that the mongod servers have not reached their allocated CPU limit. Average read/write latency: average read and write latency should be under 50 milliseconds. If not, it could be due to an undersized MongoDB cluster. Check that the MongoDB cluster has sufficient memory allocation and check disk performance. Lock waiters: a large number of lock waiters indicates contention on collections/documents in MongoDB. Tip When using the ibm.mas_devops collection to install MAS you can optionally install Grafana with the cluster_monitoring ansible role . Once Grafana is installed via the cluster_monitoring ansible role you can then install MongoDB using the mongodb ansible role . The mongodb ansible role includes a Grafana dashboard for monitoring the MongoDB cluster. If you are using a MongoDB cluster hosted by a cloud provider, use the monitoring dashboards provided by the cloud provider.","title":"Key MongoDB metrics to monitor"},{"location":"mas/core/bestpractice/#important-mongodb-databases-and-collections","text":"The following databases and collections in MongoDB are accessed frequently during user login and authentication. 
Database: mas_{{mas-instance-id}}_core Collection: User (user lookup during authentication) Collection: OauthToken (token creation/deletion) Database: {{sls-id}}_sls_licensing Collection: licenses (checkin/checkout licenses) Database: mas_{{mas-instance-id}}_adoptionusage Collection: users (daily adoption usage statistics) Collection: users_hourly (hourly adoption usage statistics)","title":"Important MongoDB databases and collections"},{"location":"mas/core/bestpractice/#scaling-mongodb-community","text":"The table below provides some general guidance on scaling MongoDB based on number of concurrent users and login rate. To scale MongoDB community edition you should specify the desired cpu/mem limits in the MongoDBCommunity CR. spec: statefulset: spec: template: spec: containers: - name: mongod resources: limits: cpu: memory: Login rate (logins/minute) MongoDB CPU limit MongoDB Memory limit (GB) 75 2 4 150 2 4 300 4 8 600 6 12 1200 8 16","title":"Scaling MongoDB community"},{"location":"mas/core/bestpractice/#scaling-coreidp-service-mas-core-namespace","text":"The table below provides some general guidance on scaling the coreidp service based on number of concurrent users and login rate. To scale the coreidp service use the podTemplates workload customization feature in MAS. Login rate (logins/minute) coreidp replicas coreidp CPU limit coreidp Memory limit (GB) 75 1 6 1 150 1 6 1 300 1 6 1 600 2 6 2 1200 4 6 3","title":"Scaling coreidp service (MAS core namespace)"},{"location":"mas/core/bestpractice/#scaling-licensing-mediator-service-mas-core-namespace","text":"The table below provides some general guidance on scaling the licensing-mediator service based on number of concurrent users and login rate. The coreidp service calls the licensing-mediator service which in turn calls the api-licensing service in the SLS namespace for license checkin/checkout operations. To scale the licensing-mediator service use the podTemplates workload customization feature in MAS. 
Login rate (logins/minute) licensing-mediator replicas licensing-mediator CPU limit licensing-mediator Memory limit (GB) 75 1 1 1 150 1 1 1 300 2 2 1 600 4 3 1 1200 6 3 1","title":"Scaling licensing-mediator service (MAS core namespace)"},{"location":"mas/core/bestpractice/#scaling-api-licensing-service-sls-namespace","text":"The table below provides some general guidance on scaling the api-licensing service based on number of concurrent users and login rate. To scale the api-licensing service use the podTemplates workload customization feature in MAS. Login rate (logins/minute) api-licensing replicas api-licensing CPU limit api-licensing Memory limit (GB) 75 1 1 2 150 1 2 2 300 2 2 2 600 2 2 2 1200 2 2 2","title":"Scaling api-licensing service (SLS namespace)"},{"location":"mas/core/bestpractice/#scaling-coreapi-service-mas-core-namespace","text":"The table below provides some general guidance on scaling the coreapi service based on number of concurrent users and login rate. To scale the coreapi service use the podTemplates workload customization feature in MAS. Login rate (logins/minute) coreapi replicas coreapi CPU limit coreapi Memory limit (GB) 75 3 1 2 150 3 1 2 300 3 1 2 600 3 2 2 1200 3 3 2","title":"Scaling coreapi service (MAS core namespace)"},{"location":"mas/ibmcloud/bestpractice/","text":"IBM Cloud \uf0c1 IBM Storage \uf0c1 IBM Cloud provides both block and file storages for OCP. Both storages support ReadWriteMany access. 
If the app requires high-performance disks, consider setting up a custom performance StorageClass as shown below: block storage sample yaml allowVolumeExpansion: true apiVersion: storage.k8s.io/v1 kind: StorageClass metadata: name: block100p parameters: billingType: hourly classVersion: \"2\" fsType: ext4 sizeIOPSRange: |- [20-1999]Gi:[100-100] type: Performance provisioner: ibm.io/ibmc-block reclaimPolicy: Delete volumeBindingMode: WaitForFirstConsumer file storage sample yaml allowVolumeExpansion: true apiVersion: storage.k8s.io/v1 kind: StorageClass metadata: name: file100p parameters: billingType: hourly classVersion: \"2\" fsType: ext4 sizeIOPSRange: |- [20-1999]Gi:[100-100] type: Performance provisioner: ibm.io/ibmc-file reclaimPolicy: Delete volumeBindingMode: WaitForFirstConsumer IBM External Load Balancer \uf0c1 If the built-in ingress load balancer in OCP is unable to scale to handle \"large\" workloads (100K+ concurrent device connections), consider provisioning an instance of the IBM Cloud NLB 2.0 (IPVS/KeepAlived) load balancer. IBM ROKS \uf0c1 IBM ROKS is a managed Red Hat OpenShift Service in IBM Cloud. Each ROKS cluster comes with a fully managed control plane and compute nodes. Installation, management, maintenance, and upgrades are performed by Red Hat site reliability engineers (SRE) with joint Red Hat and IBM Cloud support.","title":"IBM Cloud"},{"location":"mas/ibmcloud/bestpractice/#ibm-cloud","text":"","title":"IBM Cloud"},{"location":"mas/ibmcloud/bestpractice/#ibm-storage","text":"IBM Cloud provides both block and file storage for OCP. Both storage types support ReadWriteMany access. 
If the app requires high-performance disks, consider setting up a custom performance StorageClass as shown below: block storage sample yaml allowVolumeExpansion: true apiVersion: storage.k8s.io/v1 kind: StorageClass metadata: name: block100p parameters: billingType: hourly classVersion: \"2\" fsType: ext4 sizeIOPSRange: |- [20-1999]Gi:[100-100] type: Performance provisioner: ibm.io/ibmc-block reclaimPolicy: Delete volumeBindingMode: WaitForFirstConsumer file storage sample yaml allowVolumeExpansion: true apiVersion: storage.k8s.io/v1 kind: StorageClass metadata: name: file100p parameters: billingType: hourly classVersion: \"2\" fsType: ext4 sizeIOPSRange: |- [20-1999]Gi:[100-100] type: Performance provisioner: ibm.io/ibmc-file reclaimPolicy: Delete volumeBindingMode: WaitForFirstConsumer","title":"IBM Storage"},{"location":"mas/ibmcloud/bestpractice/#ibm-external-load-balancer","text":"If the built-in ingress load balancer in OCP is unable to scale to handle \"large\" workloads (100K+ concurrent device connections), consider provisioning an instance of the IBM Cloud NLB 2.0 (IPVS/KeepAlived) load balancer.","title":"IBM External Load Balancer"},{"location":"mas/ibmcloud/bestpractice/#ibm-roks","text":"IBM ROKS is a managed Red Hat OpenShift Service in IBM Cloud. Each ROKS cluster comes with a fully managed control plane and compute nodes. Installation, management, maintenance, and upgrades are performed by Red Hat site reliability engineers (SRE) with joint Red Hat and IBM Cloud support.","title":"IBM ROKS"},{"location":"mas/iot/bestpractice/","text":"MAS IoT \uf0c1 MQTT vs HTTP Messaging \uf0c1 The MQTT protocol is the preferred messaging protocol for data ingest into the MAS IoT service. HTTP messaging support was added to MAS IoT for low volume scenarios and is not designed to be used for message rates greater than 1K msgs/sec. MQTT message ingest rates are 2-3 orders of magnitude faster than HTTP. 
The primary reason being that HTTP messaging requires a TLS handshake and authentication on every message published. The authentication requires a database lookup for the device authentication token. As such, HTTP messaging puts a strain on the authentication service and the IoT database. In order to achieve high data ingest rates with MAS IoT service, use the MQTT protocol and keep the device connection open while publishing messages. Best practice messaging pattern \uf0c1 MQTT CONNECT MQTT PUBLISH (in loop until all messages are published) Messaging Anti-pattern \uf0c1 MQTT CONNECT MQTT PUBLISH MQTT DISCONNECT MQTT CONNECT MQTT PUBLISH MQTT DISCONNECT ... Data Ingest rates, devices, and connections \uf0c1 The MQTT service in MAS IoT was designed to handle many device connections, each publishing at low rates. As such, when designing a data ingest application for MAS IoT it should distribute the load over many MQTT devices or applications in order to maximize message rates. Single device or application connections will be throttled based on the IoT Fair use policy (see below). IoT Fair Use Policy \uf0c1 IoT data ingest throttling limits are per device and are based on the device class (i.e. Device, Gateway, Application). These limits are in place to prevent DoS attacks from rogue (i.e. badly behaving) devices. The throttling limits do not scale with the MAS IoT deployment size. For more information on MAS IoT messaging quotas see https://www.ibm.com/docs/en/mapms/1_cloud?topic=features-quotas Messaging QoS \uf0c1 The messaging QoS specified when publishing an MQTT message also has a strong impact on messaging rates. 
QoS in order of fastest to slowest: QoS 0 - at most once (data loss possible, no message persistence or ACKs) QoS 1 - at least once (duplicates are possible, messages persisted and ACKed) QoS 2 - exactly once (application client required to maintain state, messages persisted and two phase commit between client/server) QoS >0 performance considerations: QoS >0 requires disk persistence in MAS IoT messaging components, so disk I/O performance becomes critical. The MQTT specification provides a form of protocol-level flow control between client and server. The number of unacknowledged messages allowed on the session is negotiated between client and server; if the client has no more available message IDs, it must pause publishing until message IDs become available. Message IDs become available when message ACKs are received. See MQTT spec for details: https://docs.oasis-open.org/mqtt/mqtt/v5.0/os/mqtt-v5.0-os.html#_QoS_1:_At Summary of factors that influence data ingestion rate \uf0c1 Choice of messaging protocol: MQTT (high volume) vs HTTP (low volume). Messaging pattern: do NOT close MQTT sessions after each message published; leave connections open. Number of devices: higher message rates are possible when the load is distributed over more connections. Choice of device class: fair use quotas are based on device class (see https://www.ibm.com/docs/en/mapms/1_cloud?topic=features-quotas). Choice of QoS: higher levels of messaging guarantees come with higher costs. IoT Deployment \uf0c1 The IoT CRD defines 3 default deployment sizes (dev, small, medium) that control the default settings for pod replicas, CPU, and memory. For production, medium is required. 
Sample yaml for medium deployment in IoT CR apiVersion: iot.ibm.com/v1 kind: IoT metadata: name: masinst1 namespace: mas-masinst1-iot spec: bindings: jdbc: system kafka: system mongo: system settings: deployment: size: medium If you need to adjust the default settings for a deployment, go to the iot-operator pod and change the corresponding YAML files under the /opt/ansible/roles//vars folder, e.g. /opt/ansible/roles/ibm-iot-actions/vars/size_medium.yml Connection and OpenShift Ingress Controllers \uf0c1 OpenShift HAProxy supports 20K connections per pod. The total number of connections determines how many end devices can connect to IoT MSProxy. By default, IBM ROKS deploys 3 router members, which support 3 x 20K = 60K connections. By default, AWS ROSA deploys 2 router members, which support 2 x 20K = 40K connections. Use the command below to scale up to 3 router members: oc patch -n openshift-ingress-operator ingresscontroller/default --patch '{\"spec\":{\"replicas\": 3}}' --type=merge Kafka \uf0c1 IoT uses Kafka to process messages. Follow the Kafka Configuration Reference to choose appropriate values for the Kafka/topic settings retention.ms, retention.bytes, partitions, and replicas to support the workload. AWS MSK \uf0c1 See https://docs.aws.amazon.com/msk/latest/developerguide/msk-default-configuration.html for configuration details. Monitoring MSK is strongly recommended. Monitoring the classic load balancer is strongly recommended. Message Rate and Ethernet Network Bandwidth \uf0c1 Depending on the cloud provider, worker node instances have different network bandwidth, which determines how fast end devices can send requests. Message rate is limited by the message size and the bandwidth of the Ethernet network. Achieving higher rates and/or larger messages requires 10 Gb Ethernet. Network bandwidth also impacts response latency: the higher the bandwidth, the lower the latency. The deployment configurations below are recommended as starting values for medium and large workloads. 
MSProxy - 4 MSProxies with 1 CPU and 4GB MessageGateWay - 1 MGW 6 CPUs and 16GB along with 4 TcpIop threads","title":"MAS IoT"},{"location":"mas/iot/bestpractice/#mas-iot","text":"","title":"MAS IoT"},{"location":"mas/iot/bestpractice/#mqtt-vs-http-messaging","text":"The MQTT protocol is the preferred messaging protocol for data ingest in to the MAS IoT service. HTTP messaging support was added to MAS IoT for low volume scenarios and is not designed to be used for message rates greater than 1K msgs/sec. MQTT message ingest rates are 2-3 orders of magnitude faster than HTTP. The primary reason being that HTTP messaging requires a TLS handshake and authentication on every message published. The authentication requires a database lookup for the device authentication token. As such, HTTP messaging puts a strain on the authentication service and the IoT database. In order to achieve high data ingest rates with MAS IoT service, use the MQTT protocol and keep the device connection open while publishing messages.","title":"MQTT vs HTTP Messaging"},{"location":"mas/iot/bestpractice/#best-practice-messaging-pattern","text":"MQTT CONNECT MQTT PUBLISH (in loop until all messages are published)","title":"Best practice messaging pattern"},{"location":"mas/iot/bestpractice/#messaging-anti-pattern","text":"MQTT CONNECT MQTT PUBLISH MQTT DISCONNECT MQTT CONNECT MQTT PUBLISH MQTT DISCONNECT ...","title":"Messaging Anti-pattern"},{"location":"mas/iot/bestpractice/#data-ingest-rates-devices-and-connections","text":"The MQTT service in MAS IoT was designed to handle many device connections, each publishing at low rates. As such, when designing a data ingest application for MAS IoT it should distribute the load over many MQTT devices or applications in order to maximize message rates. 
Single device or application connections will be throttled based on the IoT Fair Use Policy (see below).","title":"Data Ingest rates, devices, and connections"},{"location":"mas/iot/bestpractice/#iot-fair-use-policy","text":"IoT data ingest throttling limits are per device and are based on the device class (i.e. Device, Gateway, Application). These limits are in place to prevent DoS attacks from rogue (i.e. badly behaving) devices. The throttling limits do not scale with the MAS IoT deployment size. For more information on MAS IoT messaging quotas see https://www.ibm.com/docs/en/mapms/1_cloud?topic=features-quotas","title":"IoT Fair Use Policy"},{"location":"mas/iot/bestpractice/#messaging-qos","text":"The messaging QoS specified when publishing an MQTT message also has a strong impact on messaging rates. QoS in order of fastest to slowest: QoS 0 - at most once (data loss possible, no message persistence or ACKs) QoS 1 - at least once (duplicates are possible, messages persisted and ACKed) QoS 2 - exactly once (application client required to maintain state, messages persisted and two phase commit between client/server) QoS >0 performance considerations: QoS >0 requires disk persistence in MAS IoT messaging components, so disk I/O performance becomes critical. The MQTT specification provides a form of protocol-level flow control between client and server. The number of unacknowledged messages allowed on the session is negotiated between client and server; if the client has no more available message IDs, it must pause publishing until message IDs become available. Message IDs become available when message ACKs are received. 
See MQTT spec for details: https://docs.oasis-open.org/mqtt/mqtt/v5.0/os/mqtt-v5.0-os.html#_QoS_1:_At","title":"Messaging QoS"},{"location":"mas/iot/bestpractice/#summary-of-factors-that-influence-data-ingestion-rate","text":"Choice of messaging protocol: MQTT (high volume) vs HTTP (low volume). Messaging pattern: do NOT close MQTT sessions after each message published; leave connections open. Number of devices: higher message rates are possible when the load is distributed over more connections. Choice of device class: fair use quotas are based on device class (see https://www.ibm.com/docs/en/mapms/1_cloud?topic=features-quotas). Choice of QoS: higher levels of messaging guarantees come with higher costs.","title":"Summary of factors that influence data ingestion rate"},{"location":"mas/iot/bestpractice/#iot-deployment","text":"The IoT CRD defines 3 default deployment sizes (dev, small, medium) that control the default settings for pod replicas, CPU, and memory. For production, medium is required. Sample yaml for medium deployment in IoT CR apiVersion: iot.ibm.com/v1 kind: IoT metadata: name: masinst1 namespace: mas-masinst1-iot spec: bindings: jdbc: system kafka: system mongo: system settings: deployment: size: medium If you need to adjust the default settings for a deployment, go to the iot-operator pod and change the corresponding YAML files under the /opt/ansible/roles//vars folder, e.g. /opt/ansible/roles/ibm-iot-actions/vars/size_medium.yml","title":"IoT Deployment"},{"location":"mas/iot/bestpractice/#connection-and-openshift-ingress-controllers","text":"OpenShift HAProxy supports 20K connections per pod. The total number of connections determines how many end devices can connect to IoT MSProxy. 
By default, IBM ROKS deploys 3 router members, which support 3 x 20K = 60K connections. By default, AWS ROSA deploys 2 router members, which support 2 x 20K = 40K connections. Use the command below to scale up to 3 router members: oc patch -n openshift-ingress-operator ingresscontroller/default --patch '{\"spec\":{\"replicas\": 3}}' --type=merge","title":"Connection and OpenShift Ingress Controllers"},{"location":"mas/iot/bestpractice/#kafka","text":"IoT uses Kafka to process messages. Follow the Kafka Configuration Reference to choose appropriate values for the Kafka/topic settings retention.ms, retention.bytes, partitions, and replicas to support the workload.","title":"Kafka"},{"location":"mas/iot/bestpractice/#aws-msk","text":"See https://docs.aws.amazon.com/msk/latest/developerguide/msk-default-configuration.html for configuration details. Monitoring MSK is strongly recommended. Monitoring the classic load balancer is strongly recommended.","title":"AWS MSK"},{"location":"mas/iot/bestpractice/#message-rate-and-ethernet-network-bandwidth","text":"Depending on the cloud provider, worker node instances have different network bandwidth, which determines how fast end devices can send requests. Message rate is limited by the message size and the bandwidth of the Ethernet network. Achieving higher rates and/or larger messages requires 10 Gb Ethernet. Network bandwidth also impacts response latency: the higher the bandwidth, the lower the latency. The deployment configurations below are recommended as starting values for medium and large workloads. MSProxy - 4 MSProxies with 1 CPU and 4GB MessageGateWay - 1 MGW with 6 CPUs and 16GB, along with 4 TcpIop threads","title":"Message Rate and Ethernet Network Bandwidth"},{"location":"mas/manage/bestpractice/","text":"MAS Manage \uf0c1 At what point is it necessary to partition a MAS Manage workload across more than one MAS instance? 
\uf0c1 A new MAS instance is required to run MAS Manage workloads at the point in which the DB server can no longer be scaled up. When the DB server can no longer be scaled up, the customer should plan to create a new MAS instance and move sites to the new MAS instance which will be using a new DB server. Maximo Transaction latency \uf0c1 When describing Maximo transaction latency it is important to define the boundaries of what constitutes a standard or out-of-the-box Maximo transaction. The description below does just that. Terminology \uf0c1 CRUD refers to create, update, delete operations MBO stands for Maximo Business Objects (which are hierarchical in nature) Transaction latency is defined as the elapsed time between when the Maximo server receives the transaction request to the time when the Maximo server has sent the response. For UI users this also includes the elapsed time between when the Save button is clicked to when control is returned to the user. Definition of a transaction \uf0c1 An out of the box Maximo transaction is expected to complete with a latency of 2 seconds or less, where a transaction is defined as the creation, update, or deletion of a single MBO, containing no more than one child object and with no attachments or binary data (blobs). Example include, but are not limited to: Creation of a single WorkOrder object. This includes generation of the WorkOrderStatus and WorkOrderAncestor records. Update of a single WorkOrder object. For example, changing states from Approved to Closed. Deletion of a single WorkOrder object. Definition restrictions \uf0c1 The following conditions are considered to be outside the scope of an out of the box Maximo transaction, and therefore do not fall under the 2 second latency characterization. For UI initiated transactions this does not include latency incurred from downloading UI resources (e.g. js, css, png, jpg, etc.) 
It applies to out of the box Maximo applications, but not customized applications or out of the box Maximo applications with automation scripts. It does not apply to UI initiated transactions with a large number of xhr requests (or portlets), for example the Maximo start center. Large here means greater than 2 xhr requests per page. Note, Maximo currently supports HTTP/1.1, so xhr requests initiated from a single user UI page are sequential, not concurrent. It does not apply to customized saved queries. It does not apply to bulk load requests. It does not apply to report related transactions. App Server \uf0c1 MAS Manage provides different bundle types (e.g. All, UI, MEA, Report, and CRON) for configuring the app server. Adjust resource settings such as CPU, memory, and replicas to match the workload. The settings are in the ManageWorkspace CR. A sample is shown below. apiVersion: apps.mas.ibm.com/v1 kind: ManageWorkspace ... spec: settings: deployment: serverBundles: - bundleType: mea isDefault: false isMobileTarget: false isUserSyncTarget: true name: mea replica: 1 routeSubDomain: all - bundleType: cron isDefault: false isMobileTarget: false isUserSyncTarget: false name: cron replica: 1 ... spec: settings: resources: manageAdmin: limits: cpu: '2' memory: 4Gi requests: cpu: '0.2' memory: 500Mi serverBundles: limits: cpu: '6' memory: 10Gi requests: cpu: '0.2' memory: 1Gi Load-Balancer \uf0c1 Lab tests show that the roundrobin policy gives more stable and better performance than the default leastconn policy. Follow this link to update load balancer policy. Manage Pod Functionality \uf0c1 Follow this link to understand the manage pod functionality. LTPA timeout \uf0c1 In IBM Maximo Application Suite (MAS), Manage users receive an error message asking them to reload the application after 2 hours, even while actively working. This default 2-hour timeout is the point at which the LTPA token in Manage expires and the user is redirected back to the MAS login page. 
Follow Updating LTPA timeout in Manage to increase the default value. WebSphere Liberty \uf0c1 Due to the architecture change, Maximo 8.x (MAS Manage app) is deployed on WebSphere Liberty Base with OpenJ9. In WebSphere Liberty, the thread growth algorithm is enhanced to grow thread capacity more aggressively so that it reacts more rapidly to peak loads. For many environments, this autonomic tuning provided by the Open Liberty thread pool works well with no configuration or tuning by the operator. If necessary, you can adjust the coreThreads and maxThreads values by tuning liberty . Configure JVM options in Manage app \uf0c1 Follow this link to configure JVM options. DB \uf0c1 Disk \uf0c1 Disk performance is critical for DB performance. Use storage with disk throughput > 250 MB/s and IOPS of 10 IOPS/GB to 100 IOPS/GB (depending on volume size). To measure disk performance on Linux, use the dd command. The sample command below measures disk performance of the data volume inside a db2 pod running in OCP. CAUTION Make sure that the ddtest filename is appended to the end of the data path, or the dd command will wipe the db2 data directory. [db2inst1@c-db2wh-manage-db2u-0 - Db2U bludata0]$ dd if=/dev/zero of=path_of_db2_data_directory/ddtest bs=128K count=8192 8192+0 records in 8192+0 records out 1073741824 bytes (1.1 GB, 1.0 GiB) copied, 2.84314 s, 378 MB/s [db2inst1@c-db2wh-manage-db2u-0 - Db2U bludata0]$ Network latency between app and db server \uf0c1 Reducing network latency is key to optimizing performance. Confirm latency is below 50ms by conducting a ping test. For production environments, it is strongly recommended to keep the latency below 10ms and to place the app and DB servers in the same network segment. In cloud deployments, ensure both the database and the OpenShift cluster are located within the same region and, where possible, in the same availability zone (AZ). Use the ping command to evaluate and pinpoint latency issues. 
Large table optimization \uf0c1 When optimizing large tables in the Manage app, it is recommended to transfer these tables to a dedicated tablespace on high-throughput disks, coupled with a dedicated buffer cache for enhanced performance. The speed of the disks and the availability of memory play crucial roles in this optimization strategy. Additionally, ensure that index statistics are regularly updated, and address any problematic queries to further optimize the system. DB - DB2/DB2wh \uf0c1 DB2 Tuning in Maximo 7.6.x Best practice is applicable. IMPORTANT The containerized DB2U and DB2WH deployments do NOT support text search (Regular DB2 has text search). As a result, some queries may perform poorly on containerized DB2 relative to Oracle DB and SQL Server, which both support text search. Searching records by Description on the list page is a typical scenario whose performance can benefit from text search capability of the database, especially if no other indexed attributes are included in the query. Adding a non-unique index on Description can help if an exact search can be made (Maximo search type = EXACT or user types is \"=\" before the search string, eg =Text) or the search can be done based on the beginning of the string (user types '%' at end of the string, eg Text%). If possible, adding other fields to the query (either by user typing them or as part of the default, where those attributes are part of an index can also help. In addition, adding Description to the end of one of these indexes can also show improvement. Highlights: increase maxsequence cache to 50 run runstats and/or reorg to update index periodically separate system storage, user storage, backup storage, transaction logs storage, temporary tablespace storage on different disks if possible. Use DB2 Performance Diagnosis to troubleshoot and tuning the db and SQL. Manage requires row-organized tables. 
Check the Db2 Warehouse DB setting (by default it uses column-organized tables) and update it with db2 update db cfg using DFT_TABLE_ORG ROW Manage does NOT support MPP or table partitioning in the current version, but consider archiving records more than 1 year old. Optim is one of the tools that can be used for archiving. See this guide and this video for details. Increase the number of concurrently running statements allowed for a DB2 application. This issue occurs when loading a large amount of data via MIF or API calls. See this link for the tuning. Storage classes: for IBM Cloud, use performance (custom) block storage with 100+ IOPS for data storage, block gold for system, and block silver for backup; for AWS, if using EFS for the DB, consider Provisioned mode to get constant throughput. For more disk options, see details in this page For db2 registry (db2set): Set db2_workload=maximo . This sets the db cfg variable WLM_ADMISSION_CTRL to NO. Do NOT change the default values for DB2_OVERRIDE_NUM_CPUS and DB2_OVERRIDE_THREADING_DEGREE . Verify that the db2 db cfg variable WLM_ADMISSION_CTRL is set to NO. For db2ucluster CR: Do NOT set db2 instance memory. The operator will automatically calculate it based on the container memory limit. (Optional) For performance stability, set the container resource request and limit to the same value. For db2 monitor switches: The best practice is to turn off all monitor switches except Timestamp in dbm cfg. Turning on monitor switches in dbm cfg turns them on by default for all DB2 sessions, which adds 5%-10% overhead to overall database performance, depending on the workload and database server hardware spec. For this reason, do not turn on monitor switches in dbm cfg. 
When we need to take DB2 monitor data, we should only turn on monitor switches in a specific session by the following command: db2 update monitor switches using BUFFERPOOL on LOCK on SORT on STATEMENT on TIMESTAMP on TABLE on UOW on And turn off all monitor switches immediately after getting required monitor data by the following command: db2 update monitor switches using BUFFERPOOL off LOCK off SORT off STATEMENT off TIMESTAMP off TABLE off UOW off DB - Oracle \uf0c1 Maximo 7.6.x Best practice is applicable DB - MSSQL \uf0c1 Maximo 7.6.x Best practice is applicable additional settings for MSSQL Server 2019 compatibility level: if maximo db is upgraded from the old version and the performance degradation is observed after the upgrade, consider to set compatibility level to the old version to keep the execution plan same. isolation level: ALTER DATABASE < DB NAME > SET ALLOW_SNAPSHOT_ISOLATION ON ALTER DATABASE < DB NAME > SET READ_COMMITTED_SNAPSHOT ON","title":"Overview"},{"location":"mas/manage/bestpractice/#mas-manage","text":"","title":"MAS Manage"},{"location":"mas/manage/bestpractice/#at-what-point-is-it-necessary-to-partition-a-mas-manage-workload-across-more-than-one-mas-instance","text":"A new MAS instance is required to run MAS Manage workloads at the point in which the DB server can no longer be scaled up. When the DB server can no longer be scaled up, the customer should plan to create a new MAS instance and move sites to the new MAS instance which will be using a new DB server.","title":"At what point is it necessary to partition a MAS Manage workload across more than one MAS instance?"},{"location":"mas/manage/bestpractice/#maximo-transaction-latency","text":"When describing Maximo transaction latency it is important to define the boundaries of what constitutes a standard or out-of-the-box Maximo transaction. 
The description below does just that.","title":"Maximo Transaction latency"},{"location":"mas/manage/bestpractice/#terminology","text":"CRUD refers to create, read, update, and delete operations. MBO stands for Maximo Business Objects (which are hierarchical in nature). Transaction latency is defined as the elapsed time from when the Maximo server receives the transaction request to when the Maximo server has sent the response. For UI users this also includes the elapsed time from when the Save button is clicked to when control is returned to the user.","title":"Terminology"},{"location":"mas/manage/bestpractice/#definition-of-a-transaction","text":"An out of the box Maximo transaction is expected to complete with a latency of 2 seconds or less, where a transaction is defined as the creation, update, or deletion of a single MBO, containing no more than one child object and with no attachments or binary data (blobs). Examples include, but are not limited to: Creation of a single WorkOrder object. This includes generation of the WorkOrderStatus and WorkOrderAncestor records. Update of a single WorkOrder object. For example, changing states from Approved to Closed. Deletion of a single WorkOrder object.","title":"Definition of a transaction"},{"location":"mas/manage/bestpractice/#definition-restrictions","text":"The following conditions are considered to be outside the scope of an out of the box Maximo transaction, and therefore do not fall under the 2 second latency characterization. For UI initiated transactions, latency incurred from downloading UI resources (e.g. js, css, png, jpg, etc.) is not included. The characterization applies to out of the box Maximo applications, but not to customized applications or out of the box Maximo applications with automation scripts. It does not apply to UI initiated transactions with a large number of xhr requests (or portlets), for example the Maximo start center. Large here means greater than 2 xhr requests per page. 
Note, Maximo currently supports HTTP/1.1, so xhr requests initiated from a single user UI page are sequential, not concurrent. It does not apply to customized saved queries. It does not apply to bulk load requests. It does not apply to report related transactions.","title":"Definition restrictions"},{"location":"mas/manage/bestpractice/#app-server","text":"MAS Manage has different bundle types (e.g. All, UI, MEA, Report and CRON) for configuring the app server. Adjust resource settings such as cpu, memory, and replica count to match the workload. The settings are in the ManageWorkspace CR. Below is a sample. apiVersion : apps.mas.ibm.com/v1 kind : ManageWorkspace ... spec : settings : deployment : serverBundles : - bundleType : mea isDefault : false isMobileTarget : false isUserSyncTarget : true name : mea replica : 1 routeSubDomain : all - bundleType : cron isDefault : false isMobileTarget : false isUserSyncTarget : false name : cron replica : 1 ... spec : settings : resources : manageAdmin : limits : cpu : '2' memory : 4Gi requests : cpu : '0.2' memory : 500Mi serverBundles : limits : cpu : '6' memory : 10Gi requests : cpu : '0.2' memory : 1Gi","title":"App Server"},{"location":"mas/manage/bestpractice/#load-balancer","text":"Lab tests show that the roundrobin policy delivers more stable and better performance than leastconn, which is the default. Follow this link to update the load balancer policy.","title":"Load-Balancer"},{"location":"mas/manage/bestpractice/#manage-pod-functionality","text":"Follow this link to understand the manage pod functionality.","title":"Manage Pod Functionality"},{"location":"mas/manage/bestpractice/#ltpa-timeout","text":"In IBM Maximo Application Suite (MAS), Manage users receive an error message asking them to reload the application after 2 hours, even while actively working. This 2-hour default is the point at which the LTPA token in Manage expires and the user is redirected back to the MAS login page. 
Follow Updating LTPA timeout in Manage to increase the default value.","title":"LTPA timeout"},{"location":"mas/manage/bestpractice/#websphere-liberty","text":"Due to the architecture change, Maximo 8.x (MAS Manage app) is deployed on WebSphere Liberty Base with OpenJ9. In WebSphere Liberty, the thread growth algorithm is enhanced to grow thread capacity more aggressively so that it reacts more rapidly to peak loads. For many environments, this autonomic tuning provided by the Open Liberty thread pool works well with no configuration or tuning by the operator. If necessary, you can adjust the coreThreads and maxThreads values by tuning Liberty.","title":"WebSphere Liberty"},{"location":"mas/manage/bestpractice/#configure-jvm-options-in-manage-app","text":"Follow this link to configure JVM options.","title":"Configure JVM options in Manage app"},{"location":"mas/manage/bestpractice/#db","text":"","title":"DB"},{"location":"mas/manage/bestpractice/#disk","text":"Disk performance is critical for db performance. Use storage with disk throughput above 250 MB/s and IOPS of 10 IOPS/GB to 100 IOPS/GB (depending on volume size). To measure disk performance on Linux, use the dd command. The sample command below measures disk performance of the data volume inside a db2 pod running in OCP. CAUTION Make sure that the ddtest filename is appended to the end of the data path, or the dd command will wipe the db2 data directory. [db2inst1@c-db2wh-manage-db2u-0 - Db2U bludata0]$ dd if=/dev/zero of=path_of_db2_data_directory/ddtest bs=128K count=8192 8192+0 records in 8192+0 records out 1073741824 bytes (1.1 GB, 1.0 GiB) copied, 2.84314 s, 378 MB/s [db2inst1@c-db2wh-manage-db2u-0 - Db2U bludata0]$","title":"Disk"},{"location":"mas/manage/bestpractice/#network-latency-between-app-and-db-server","text":"Reducing network latency is key to optimizing performance. Confirm latency is below 50ms by conducting a ping test. 
For production environments, it is strongly recommended to keep the latency below 10ms and to place the app and db servers in the same network segment. In cloud deployments, ensure that the database and the OpenShift cluster are located in the same region and, where possible, in the same availability zone (AZ). Use the ping command to evaluate and pinpoint latency issues.","title":"Network latency between app and db server"},{"location":"mas/manage/bestpractice/#large-table-optimization","text":"When optimizing large tables in the Manage app, it is recommended to move these tables to a dedicated tablespace on high-throughput disks, coupled with a dedicated buffer cache for enhanced performance. The speed of the disks and the availability of memory play crucial roles in this optimization strategy. Additionally, ensure that index statistics are regularly updated, and address any problematic queries to further optimize the system.","title":"Large table optimization"},{"location":"mas/manage/bestpractice/#db-db2db2wh","text":"DB2 Tuning in Maximo 7.6.x Best practice is applicable. IMPORTANT The containerized DB2U and DB2WH deployments do NOT support text search (regular DB2 has text search). As a result, some queries may perform poorly on containerized DB2 relative to Oracle DB and SQL Server, which both support text search. Searching records by Description on the list page is a typical scenario whose performance can benefit from the text search capability of the database, especially if no other indexed attributes are included in the query. Adding a non-unique index on Description can help if an exact search can be made (Maximo search type = EXACT, or the user types \"=\" before the search string, e.g. =Text) or if the search can be done based on the beginning of the string (the user types '%' at the end of the string, e.g. Text%). 
If possible, adding other fields to the query (either typed by the user or as part of the default), where those attributes are part of an index, can also help. In addition, adding Description to the end of one of these indexes can also show improvement. Highlights: Increase the maxsequence cache to 50. Run runstats and/or reorg periodically to keep indexes up to date. Separate system storage, user storage, backup storage, transaction log storage, and temporary tablespace storage onto different disks if possible. Use DB2 Performance Diagnosis to troubleshoot and tune the db and SQL. Manage requires row-organized tables. Check the db2w db setting (column-organized by default) and update it with db2 update db cfg using DFT_TABLE_ORG ROW. Manage does NOT support MMP or table partitioning in the current version, but consider archiving records more than one year old. Optim is one of the tools that can be used for archiving; see this guide and this video for details. Increase the number of concurrently running statements allowed for a DB2 application. This issue occurs when loading a large amount of data via MIF or API calls. See this link for the tuning. Storage class: for IBM Cloud, use performance (Custom) block storage with 100+ IOPS for data storage, block gold for system storage, and block silver for backup. For AWS, if using EFS for the db, consider Provisioned mode to get constant throughput. For more disk options, see the details in this page. For the db2 registry (db2set): Set db2_workload=maximo . This sets the db cfg variable WLM_ADMISSION_CTRL to NO. Do NOT change the default values for DB2_OVERRIDE_NUM_CPUS and DB2_OVERRIDE_THREADING_DEGREE . Verify that the db2 db cfg variable WLM_ADMISSION_CTRL is set to NO. For the db2ucluster CR: Do NOT set db2 instance memory. The operator automatically calculates it based on the container memory limit. (Optional) For performance stability, set the container resource request and limit to the same value. 
For db2 monitor switches: The best practice is to turn off all monitor switches except Timestamp in dbm cfg. Turning on monitor switches in dbm cfg turns them on by default for all DB2 sessions, which adds 5%-10% overhead to overall database performance, depending on the workload and the database server hardware spec. For that reason, do not turn on monitor switches in dbm cfg. When DB2 monitor data is needed, turn on monitor switches only in a specific session with the following command: db2 update monitor switches using BUFFERPOOL on LOCK on SORT on STATEMENT on TIMESTAMP on TABLE on UOW on Turn off all monitor switches immediately after collecting the required monitor data with the following command: db2 update monitor switches using BUFFERPOOL off LOCK off SORT off STATEMENT off TIMESTAMP off TABLE off UOW off","title":"DB - DB2/DB2wh"},{"location":"mas/manage/bestpractice/#db-oracle","text":"The Maximo 7.6.x Best practice is applicable.","title":"DB - Oracle"},{"location":"mas/manage/bestpractice/#db-mssql","text":"The Maximo 7.6.x Best practice is applicable. Additional settings for MSSQL Server 2019: Compatibility level: if the Maximo db was upgraded from an older version and performance degradation is observed after the upgrade, consider setting the compatibility level to the older version to keep the execution plans the same. Isolation level: ALTER DATABASE < DB NAME > SET ALLOW_SNAPSHOT_ISOLATION ON ALTER DATABASE < DB NAME > SET READ_COMMITTED_SNAPSHOT ON","title":"DB - MSSQL"},{"location":"mas/manage/woi-infer/","text":"Work Order Intelligence Inferencing: This was added to the MAS Manage application to assist users with problem code classification (PCC) for Work Orders. See the product documentation for more details. PCC model inferencing (batch mode from cron task): Inferencing is typically run more frequently than training, but is less resource intensive. 
By default, MAS Manage is configured with a single instance of the AIINFJOB cron task. This is recommended for most workloads. The predictor pod where inferencing/prediction occurs receives a batch of Work Orders to be inferenced from the MAS Manage cron pod running the AIINFJOB cron task. The batch size (or page size, defined on the MXAPIWODETAIL object structure query template) is the best way to control the rate at which Work Orders are inferenced. In the graph below you can see how the total time to inference 100K work orders is influenced by the batch size. With a batch size of 500 Work Orders/request and a 30 second interval for the AIINFJOB cron task 100K work orders were inferenced in approximately 1.6 hours. Compared to a batch size of 10 WO/request which took 83 hours. Important The recommended batch size is 500 Work Orders/request and the recommended interval for the AIINFJOB cron task is 30 seconds. PCC model inferencing required resources (batch mode from cron task) \uf0c1 The graphs below show the CPU and memory resource utilization of the predictor pod based on the batch size. As you can see, the CPU utilization of the predictor pod increases with the batch size, but the memory utilization remains fairly consistent (i.e. between 4GB - 5GB) PCC model inferencing: batch cron processing vs on-demand single inference \uf0c1 For bulk inferencing of large numbers of Work Orders it is recommended to use the AIINFJOB cron task. However, UI users can also request problem code inferencing on a single work order. In this case the predictor pod will receive a single work order and as a result the overhead of processing a single work order is much higher. For example, to inference a batch of 10 or more work orders will result in an average inferencing time of 20 milliseconds per work order in the predictor pod, but the inference time for a single work order from the UI is about 120 milliseconds (in the predictor pod). 
The total time including the MAS Manage API request is about 750 milliseconds. It is therefore much more efficient to inference large numbers of work orders asynchronously using the AIINFJOB cron task and a page size of 500. In other words, don't use the API from a script.","title":"Inferencing"},{"location":"mas/manage/woi-infer/#work-order-intelligence-inferencing","text":"This was added to the MAS Manage application to assist users with problem code classification (PCC) for Work Orders. See the product documentation for more details.","title":"Work Order Intelligence Inferencing"},{"location":"mas/manage/woi-infer/#pcc-model-inferencing-batch-mode-from-cron-task","text":"Inferencing is typically run more frequently than training, but is less resource intensive. By default, MAS Manage is configured with a single instance of the AIINFJOB cron task. This is recommended for most workloads. The predictor pod where inferencing/prediction occurs receives a batch of Work Orders to be inferenced from the MAS Manage cron pod running the AIINFJOB cron task. The batch size (or page size, defined on the MXAPIWODETAIL object structure query template) is the best way to control the rate at which Work Orders are inferenced. In the graph below you can see how the total time to inference 100K work orders is influenced by the batch size. With a batch size of 500 Work Orders/request and a 30 second interval for the AIINFJOB cron task 100K work orders were inferenced in approximately 1.6 hours. Compared to a batch size of 10 WO/request which took 83 hours. Important The recommended batch size is 500 Work Orders/request and the recommended interval for the AIINFJOB cron task is 30 seconds.","title":"PCC model inferencing (batch mode from cron task)"},{"location":"mas/manage/woi-infer/#pcc-model-inferencing-required-resources-batch-mode-from-cron-task","text":"The graphs below show the CPU and memory resource utilization of the predictor pod based on the batch size. 
As you can see, the CPU utilization of the predictor pod increases with the batch size, but the memory utilization remains fairly consistent (i.e. between 4GB - 5GB)","title":"PCC model inferencing required resources (batch mode from cron task)"},{"location":"mas/manage/woi-infer/#pcc-model-inferencing-batch-cron-processing-vs-on-demand-single-inference","text":"For bulk inferencing of large numbers of Work Orders it is recommended to use the AIINFJOB cron task. However, UI users can also request problem code inferencing on a single work order. In this case the predictor pod will receive a single work order and as a result the overhead of processing a single work order is much higher. For example, to inference a batch of 10 or more work orders will result in an average inferencing time of 20 milliseconds per work order in the predictor pod, but the inference time for a single work order from the UI is about 120 milliseconds (in the predictor pod). The total time including the MAS Manage API request is about 750 milliseconds. It is therefore much more efficient to inference large numbers of work orders asynchronously using the AIINFJOB cron task and a page size of 500. In other words, don't use the API from a script.","title":"PCC model inferencing: batch cron processing vs on-demand single inference"},{"location":"mas/manage/woi-train/","text":"Work Order Intelligence Training \uf0c1 This was added to the MAS Manage application to assist users with problem code classification (PCC) for Work Orders. See the product documentation for more details. PCC model training required resources \uf0c1 Model training is resource intensive. For this reason there is a limit of one active model training per MAS Manage instance. A single model training requires at least 8GB of memory. The pipeline pod, where model training occurs, will allocate a number of busy processes equal to the number of CPU on the worker node where the pod is scheduled. 
At the time of this writing there is no CPU limit set for the pipeline pod, so it will consume as much CPU as is available on the worker node where it is scheduled. In general, the more CPU is available to the pipeline pod, the shorter the training time. The three data points on the graph below were taken on a 16 CPU worker node. In the tests below a CPU limit was placed on the pipeline pod (not the default, i.e. by default the pipeline pod does not have specified limits). As you can see, training with an 8 CPU limit was a little more than twice as fast as training with a 4 CPU limit. However, when comparing the 16 CPU limit and 8 CPU limit training times, there is very little improvement. This can be attributed to other workloads running on the worker node where the pipeline pod was scheduled, as well as to synchronization waits between the training processes/threads. In other words, to improve the training time for the 16 CPU limit test it would be necessary to schedule the pipeline pod on a worker node with more than 16 CPU and fewer competing workloads. Sample sizes for PCC model training: Important Do not train with more than 10K labeled samples. 10K samples is the recommended limit for PCC training. The training times for a single epoch and different sample sizes are shown below. In general, the larger the size of the labeled sample data set, the longer the training time will be. You can see below there is an exception to this rule. When comparing the single epoch training time between the 1K sample size and the 5K sample size, you can see that the single epoch training time for the 5K sample size is only 82 minutes compared to 220 minutes for the 1K sample size. This is because there were 30 problem codes in this test, and with the 1K sample size there was an insufficient number of samples per problem code. 
As a result, the model leveraged Watson X to generate synthetic samples and this process accounts for the additional training time for the 1K sample set. Info The results below show training time for a single epoch. For a real training, 12 epochs is used and therefore the single epoch training times below should be multiplied by 12 to get the real training time. Note, there is a default timeout of 14400 minutes (or 10 days) for training to complete.","title":"Training"},{"location":"mas/manage/woi-train/#work-order-intelligence-training","text":"This was added to the MAS Manage application to assist users with problem code classification (PCC) for Work Orders. See the product documentation for more details.","title":"Work Order Intelligence Training"},{"location":"mas/manage/woi-train/#pcc-model-training-required-resources","text":"Model training is resource intensive. For this reason there is a limit of one active model training per MAS Manage instance. A single model training requires at least 8GB of memory. The pipeline pod, where model training occurs, will allocate a number of busy processes equal to the number of CPU on the worker node where the pod is scheduled. At the time of this writing there is no CPU limit set for the pipeline pod, so it will consume as much CPU resources as are available on the worker node where it is scheduled. In general, the more CPU is available to the pipeline pod the faster training time will go. The three data points on the graph below were taken on a 16 CPU worker node. In the tests below a cpu limit was placed on the pipeline pod (not the default, i.e. by default the pipeline pod does not have specified limits). As you can see the training time with an 8 CPU limit was a little more than twice as fast as the training time with a 4 CPU limit. However, when comparing the 16 CPU limit and 8 CPU limit training time, there is very little improvement. 
This can be attributed to other workloads running on the worker node where the pipeline pod was scheduled, as well as to synchronization waits between the training processes/threads. In other words, to improve the training time for the 16 CPU limit test it would be necessary to schedule the pipeline pod on a worker node with more than 16 CPU and fewer competing workloads.","title":"PCC model training required resources"},{"location":"mas/manage/woi-train/#sample-sizes-for-pcc-model-training","text":"Important Do not train with more than 10K labeled samples. 10K samples is the recommended limit for PCC training. The training times for a single epoch and different sample sizes are shown below. In general, the larger the size of the labeled sample data set, the longer the training time will be. You can see below there is an exception to this rule. When comparing the single epoch training time between the 1K sample size and the 5K sample size, you can see that the single epoch training time for the 5K sample size is only 82 minutes compared to 220 minutes for the 1K sample size. This is because there were 30 problem codes in this test, and with the 1K sample size there was an insufficient number of samples per problem code. As a result, the model leveraged Watson X to generate synthetic samples and this process accounts for the additional training time for the 1K sample set. Info The results below show training time for a single epoch. A real training run uses 12 epochs, so the single epoch training times below should be multiplied by 12 to get the real training time. 
Note, there is a default timeout of 14400 minutes (or 10 days) for training to complete.","title":"Sample sizes for PCC model training"},{"location":"mas/manage-industry-solutions/ong-hse/bestpractice/","text":"MAS Manage Oil and Gas/HSE \uf0c1 Best Practice for Performance \uf0c1 Archive or clean historical records of \"Permit to Work\", \"Isolation Certificate\", \"Work Order\" and \"Operator Log/LogEntry\" will help a lot on performance. Adding below indexes which we identified in internal benchmark test will help a lot on performance. Please do remember to update table statistics after adding any new index, since new index will only be effective after updating table statistics. Indexes Identified in Internal Benchmark Test \uf0c1 Table Name Columns Comments plusgpermitwork \"ptwclass\" ASC,\"siteid\" ASC,\"orgid\" ASC,\"permitworknum\" ASC plusgpermitwork \"ptwclass\" ASC,\"status\" ASC,\"plusgpertypeid\" ASC,\"permitworknum\" ASC plusgpermitwork \"ptwclass\" ASC plusgpermitwork \"status\" ASC,\"ptwclass\" ASC,\"description\" ASC plusgpertype \"pertypenum\" ASC,\"plusgpertypeid\" ASC workorder \"description\" ASC Add it if search on description field, create as text index is better workorder \"status\" ASC,\"historyflag\" ASC,\"istask\" ASC,\"wonum\" ASC Add it if search on status field plusgoperaction \"recordid\" ASC,\"class\" ASC plusgshftlogentry \"recordkey\" ASC,\"orgid\" ASC,\"siteid\" ASC,\"createdate\" ASC plusgshiftlog \"shiftnum\" ASC,\"isshiftlog\" ASC,\"startdate\" ASC plusgrelatedrec \"relatedreckey\" ASC,\"relatedrecclass\" ASC,\"recordkey\" ASC plusgrelatedrec \"recordkey\" ASC,\"class\" ASC,\"relatedrecclass\" ASC plusgincperson \"ticketid\" ASC maxsession \"issystem\" ASC, \"userid\" ASC, \"clienthost\" ASC ticket \"globalticketid\" ASC,\"globalticketclass\" ASC report \"reportname\" ASC,\"appname\" ASC,\"reportnum\" ASC,\"runtype\" ASC,\"userid\" ASC reportrunqueue \"running\" ASC,\"priority\" ASC,\"submittime\" DESC","title":"Oil and 
Gas/HSE"},{"location":"mas/manage-industry-solutions/ong-hse/bestpractice/#mas-manage-oil-and-gashse","text":"","title":"MAS Manage Oil and Gas/HSE"},{"location":"mas/manage-industry-solutions/ong-hse/bestpractice/#best-practice-for-performance","text":"Archiving or cleaning up historical records of \"Permit to Work\", \"Isolation Certificate\", \"Work Order\" and \"Operator Log/LogEntry\" improves performance significantly. Adding the indexes below, which were identified in an internal benchmark test, also improves performance significantly. Remember to update table statistics after adding any new index, because a new index only becomes effective after table statistics are updated.","title":"Best Practice for Performance"},{"location":"mas/manage-industry-solutions/ong-hse/bestpractice/#indexes-identified-in-internal-benchmark-test","text":"Table Name Columns Comments plusgpermitwork \"ptwclass\" ASC,\"siteid\" ASC,\"orgid\" ASC,\"permitworknum\" ASC plusgpermitwork \"ptwclass\" ASC,\"status\" ASC,\"plusgpertypeid\" ASC,\"permitworknum\" ASC plusgpermitwork \"ptwclass\" ASC plusgpermitwork \"status\" ASC,\"ptwclass\" ASC,\"description\" ASC plusgpertype \"pertypenum\" ASC,\"plusgpertypeid\" ASC workorder \"description\" ASC Add it if searching on the description field; creating it as a text index is better workorder \"status\" ASC,\"historyflag\" ASC,\"istask\" ASC,\"wonum\" ASC Add it if searching on the status field plusgoperaction \"recordid\" ASC,\"class\" ASC plusgshftlogentry \"recordkey\" ASC,\"orgid\" ASC,\"siteid\" ASC,\"createdate\" ASC plusgshiftlog \"shiftnum\" ASC,\"isshiftlog\" ASC,\"startdate\" ASC plusgrelatedrec \"relatedreckey\" ASC,\"relatedrecclass\" ASC,\"recordkey\" ASC plusgrelatedrec \"recordkey\" ASC,\"class\" ASC,\"relatedrecclass\" ASC plusgincperson \"ticketid\" ASC maxsession \"issystem\" ASC, \"userid\" ASC, \"clienthost\" ASC ticket \"globalticketid\" ASC,\"globalticketclass\" ASC report \"reportname\" ASC,\"appname\" ASC,\"reportnum\" ASC,\"runtype\" ASC,\"userid\" ASC 
reportrunqueue \"running\" ASC,\"priority\" ASC,\"submittime\" DESC","title":"Indexes Identified in Internal Benchmark Test"},{"location":"mas/manage-industry-solutions/transportation/bestpractice/","text":"MAS Manage Transportation \uf0c1 Best Practice for Performance \uf0c1 Due to some Transportation applications execute SQLs contain \"Like\" clause, turn off DB2 Statement Concentrator can make CPU utilization much lower on database server (db2 update db cfg for database_name using stmt_conc off). Adding below indexes which we identified in internal benchmark test will help a lot on performance. If not specified, all columns are ASC by default. Please do remember to update table statistics after adding any new index, since new index will only be effective after updating table statistics. Indexes Identified in Internal Benchmark Test \uf0c1 Table Name Columns PLUSTWARRTRANS CONTRACTTYPE,CLAIMID,ASSETNUM,CONTRACTNUM,SITEID PLUSTWARRTRANS CONTRACTTYPE,CLAIMID,TRANSDATE,PLUSTWARRTRANSID maxuser status,userid logintracking \"userid\" ASC,\"attemptresult\" ASC,\"attemptdate\" DESC craftrate orgid SYNONYMDOMAIN MAXVALUE,DOMAINID,VALUE CONTLINEASSET \"PLUSTNEWEXTENDEDREASON\" ASC, \"LOCATION\" DESC CONTRACT CONTRACTTYPE,STATUS,ORGID,CONTRACTNUM LOCHIERARCHY LOCATION,SYSTEMID,SITEID,PARENT MULTIASSETLOCCI ISPRIMARY,RECORDKEY,WORKSITEID,RECORDCLASS PLUSTASSETALIAS ISACTIVE,ALIAS,PLUSTASSETALIASID,DESCRIPTION,ORGID,LANGCODE,ISDEFAULT,ISASSETNUM,HASLD,SITEID,ASSETNUM INVOICELINE PLUSTCONTRACTNUM,SITEID,INVOICELINENUM,INVOICENUM INVOICE INVOICENUM,SITEID,STATUS INVOICELINE INVOICENUM,SITEID,INVOICELINENUM INVOICECOST \"INVOICENUM\" ASC, \"SITEID\" ASC, \"ASSETNUM\" ASC, \"INVOICELINENUM\" ASC ASSET \"ASSETNUM\" ASC, \"SITEID\" ASC, \"ASSETID\" ASC inspectionform inspformnum,status,orgid inspectionresult \"siteid\" ASC, \"referenceobject\" ASC, \"referenceobjectid\" ASC PLUSTWARRTRANS \"CONTRACTTYPE\" ASC, \"CLAIMID\" ASC, \"PLUSTWARRTRANSID\" ASC propertydefault 
contracttypeid,orgid plustitemwarr itemnum,plustpos,orgid,assetid,matusetransid,plustitemwarrid plustitemwarrmtr plustitemwarrid plustassetalias assetnum,siteid,isdefault PLUSTWARRTRANS \"CLAIMID\" ASC, \"SITEID\" ASC countbookline itemnum,countbooknum,siteid countbookline countbooknum,siteid,orgid countbookline match,countbooknum,siteid,orgid countbookline recon,countbooknum,siteid,orgid countbookline physcnt,countbooknum,siteid,orgid countbook storeroom,countbooknum,siteid item itemsetid,itemnum warrantyasset contractnum,revisionnum,orgid,assetid contractline CONTRACTNUM,REVISIONNUM,ORGID,CONTRACTLINENUM,contracttype invbalances \"PHYSCNTDATE\" DESC, \"SITEID\" ASC, \"ITEMNUM\" ASC countbookline countbooknum,siteid,rotating,itemnum invbalances location,nextphycntdate inventory location,siteid,orgid,itemnum countbooksel countbooknum,siteid mafappdata ismobile,status asset assetnum,siteid,status,plustisconsist,plustalias,orgid plustassetalias assetnum,siteid,isactive invoicecost assetnum,siteid plustclaim orgid,contractnum,status plustclaim assetnum,siteid asset assetid,moved,plustisconsist,description invoice siteid,invoicenum,status,orgid wplabor wplaborid,orgid joblabor orgid,siteid,jobplanid,jptask workorder plustcmpnum,siteid,status plustitemwarr matusetransid contlineasset assetid,location,locationsite,warrantystartdate,warrantyenddate,contractnum CONTRACT contractnum,revisionnum,orgid,contracttype,status contlineasset assetid,orgid,plustfullcoverage,contractnum,revisionnum,contractlinenum plustwpserv wpservid,orgid plustwarrtrans matusetransid,covereditemnum contractline itemnum,conditioncode,linestatus warrantyline plustcoverservices,plustcovermaterials pm siteid,status,pmnum,assetnum workorder siteid,pmnum,status plustwpserv wonum,siteid plustwarrtrans servrectransid plustwarrtrans parentwonum,refwonum,claimid,siteid,linecost plustwpserv invoicenum,invoicesite invoice status,invoicenum,siteid maxvars varname,orgid,varvalue inspectionresult 
inspformnum,revision,orgid,siteid,status,asset,location,resultnum plustwpserv wonum,invoicenum,complete","title":"Transportation"},{"location":"mas/manage-industry-solutions/transportation/bestpractice/#mas-manage-transportation","text":"","title":"MAS Manage Transportation"},{"location":"mas/manage-industry-solutions/transportation/bestpractice/#best-practice-for-performance","text":"Because some Transportation applications execute SQL statements that contain a \"Like\" clause, turning off the DB2 Statement Concentrator can considerably lower CPU utilization on the database server (db2 update db cfg for database_name using stmt_conc off). Adding the indexes below, which were identified in an internal benchmark test, improves performance significantly. If not specified, all columns are ASC by default. Remember to update table statistics after adding any new index, because a new index only becomes effective after table statistics are updated.","title":"Best Practice for Performance"},{"location":"mas/manage-industry-solutions/transportation/bestpractice/#indexes-identified-in-internal-benchmark-test","text":"Table Name Columns PLUSTWARRTRANS CONTRACTTYPE,CLAIMID,ASSETNUM,CONTRACTNUM,SITEID PLUSTWARRTRANS CONTRACTTYPE,CLAIMID,TRANSDATE,PLUSTWARRTRANSID maxuser status,userid logintracking \"userid\" ASC,\"attemptresult\" ASC,\"attemptdate\" DESC craftrate orgid SYNONYMDOMAIN MAXVALUE,DOMAINID,VALUE CONTLINEASSET \"PLUSTNEWEXTENDEDREASON\" ASC, \"LOCATION\" DESC CONTRACT CONTRACTTYPE,STATUS,ORGID,CONTRACTNUM LOCHIERARCHY LOCATION,SYSTEMID,SITEID,PARENT MULTIASSETLOCCI ISPRIMARY,RECORDKEY,WORKSITEID,RECORDCLASS PLUSTASSETALIAS ISACTIVE,ALIAS,PLUSTASSETALIASID,DESCRIPTION,ORGID,LANGCODE,ISDEFAULT,ISASSETNUM,HASLD,SITEID,ASSETNUM INVOICELINE PLUSTCONTRACTNUM,SITEID,INVOICELINENUM,INVOICENUM INVOICE INVOICENUM,SITEID,STATUS INVOICELINE INVOICENUM,SITEID,INVOICELINENUM INVOICECOST \"INVOICENUM\" ASC, \"SITEID\" ASC, \"ASSETNUM\" ASC, \"INVOICELINENUM\" ASC ASSET \"ASSETNUM\" ASC, \"SITEID\" ASC, \"ASSETID\" ASC 
inspectionform inspformnum,status,orgid inspectionresult \"siteid\" ASC, \"referenceobject\" ASC, \"referenceobjectid\" ASC PLUSTWARRTRANS \"CONTRACTTYPE\" ASC, \"CLAIMID\" ASC, \"PLUSTWARRTRANSID\" ASC propertydefault contracttypeid,orgid plustitemwarr itemnum,plustpos,orgid,assetid,matusetransid,plustitemwarrid plustitemwarrmtr plustitemwarrid plustassetalias assetnum,siteid,isdefault PLUSTWARRTRANS \"CLAIMID\" ASC, \"SITEID\" ASC countbookline itemnum,countbooknum,siteid countbookline countbooknum,siteid,orgid countbookline match,countbooknum,siteid,orgid countbookline recon,countbooknum,siteid,orgid countbookline physcnt,countbooknum,siteid,orgid countbook storeroom,countbooknum,siteid item itemsetid,itemnum warrantyasset contractnum,revisionnum,orgid,assetid contractline CONTRACTNUM,REVISIONNUM,ORGID,CONTRACTLINENUM,contracttype invbalances \"PHYSCNTDATE\" DESC, \"SITEID\" ASC, \"ITEMNUM\" ASC countbookline countbooknum,siteid,rotating,itemnum invbalances location,nextphycntdate inventory location,siteid,orgid,itemnum countbooksel countbooknum,siteid mafappdata ismobile,status asset assetnum,siteid,status,plustisconsist,plustalias,orgid plustassetalias assetnum,siteid,isactive invoicecost assetnum,siteid plustclaim orgid,contractnum,status plustclaim assetnum,siteid asset assetid,moved,plustisconsist,description invoice siteid,invoicenum,status,orgid wplabor wplaborid,orgid joblabor orgid,siteid,jobplanid,jptask workorder plustcmpnum,siteid,status plustitemwarr matusetransid contlineasset assetid,location,locationsite,warrantystartdate,warrantyenddate,contractnum CONTRACT contractnum,revisionnum,orgid,contracttype,status contlineasset assetid,orgid,plustfullcoverage,contractnum,revisionnum,contractlinenum plustwpserv wpservid,orgid plustwarrtrans matusetransid,covereditemnum contractline itemnum,conditioncode,linestatus warrantyline plustcoverservices,plustcovermaterials pm siteid,status,pmnum,assetnum workorder siteid,pmnum,status plustwpserv wonum,siteid 
plustwarrtrans servrectransid plustwarrtrans parentwonum,refwonum,claimid,siteid,linecost plustwpserv invoicenum,invoicesite invoice status,invoicenum,siteid maxvars varname,orgid,varvalue inspectionresult inspformnum,revision,orgid,siteid,status,asset,location,resultnum plustwpserv wonum,invoicenum,complete","title":"Indexes Identified in Internal Benchmark Test"},{"location":"mas/mif-jms/bestpractice/","text":"MAS Manage MIF/JMS \uf0c1 Lab Result Highlights \uf0c1 The lab results indicate a significant correlation between the transaction per second (TPS) and database disk IO utilization. This correlation suggests that the level of transactional activity directly impacts the IO workload on the database disk. Conversely, the IO workload acts as a limitation on the system's ability to handle a larger volume of transactions. When IO is not the limiting factor, increasing the number of MEA Pods can positively impact the processing performance. Increasing the Message-Driven Bean (MDB) instances can potentially have a positive impact on system performance. It is recommended to adjust the number of records per message, the # of MDB and the batch size. By finding the right balance, you can target a resource usage of around 2 cores and 4-7GB of RAM that can help ensure efficient utilization without overburdening the MEA pods. Based on the lab results, it has been observed that a large number of internal error messages have a substantial impact on processing throughput. Under a certain circumstance, the configuration parameter mxe.int.splitdataonpost does not demonstrate a positive impact. To validate its effectiveness, it is recommended to perform a dry run in your specific environment for verification. Performance Troubleshooting Checklist \uf0c1 To troubleshoot and optimize performance, follow this checklist: Ensure adherence to best practices for optimizing performance in your DB, Openshift, and MAS environments. 
Monitor disk IO utilization of the database and maintain it within acceptable limits to avoid performance degradation due to saturated disk resources. Adjust the number of records per message and the MDB/batch size to effectively manage resource utilization of MEA pods. Aim for a resource consumption range of approximately 2 cores and 4-7 GB. Regularly check the message queue to prevent it from becoming empty, ensuring a steady flow of messages for processing. Minimize the occurrence of integration error messages as they can significantly impact processing throughput. Pay attention to a high volume of internal error messages and investigate the message reprocessing application for further insights. Set a sufficiently large value for maxMessageDepth to avoid message queue overflow. It is recommended to match SIBus's default value of at least 500,000. When the need for additional MEA pods arises, consider scaling up the number of worker nodes to accommodate the increased demand effectively. Test Methodologies \uf0c1 Establish a monitoring system to track essential performance metrics throughout the testing process. Begin with a dry run using a single MEA pod to establish a baseline benchmark for performance evaluation. Adjust the Message-Driven Bean (MDB) and BatchSize parameters to optimize resource utilization within an appropriate range for the MEA pod. Scale up the number of MEA pods as needed to meet performance requirements and accommodate increased workload. Continuously monitor and assess the performance of both the database and the application to identify any bottlenecks or areas for improvement. By following these test methodologies, you can effectively monitor and optimize the performance of your system, ensuring efficient resource utilization and maintaining satisfactory levels of performance. 
Major Performance Related Factors for MIF \uf0c1 Component Configuration Adjustable or Scalable Observation & Best Practice JMS / MIF maxMessageDepth Yes Make it large enough. If it is too small, when the queue is full, the process fails and may be hard to recover. Recommend 500,000, the same as SIBus maxEndpoints Yes Limit the maxConcurrency MDB(maxConcurrency) Yes Along with BatchSize, impacts processing speed and MEA pod resource utilization BatchSize(maxBatchSize) Yes Along with MDB, impacts processing speed and MEA pod resource utilization Maximo # of JMS Pod Yes 1 JMS Server works well in benchmark test. It does not consume significant resources # of MEA Pod Yes Scales linearly MEA CPU / MEM Usage Yes Adjust JMS/MDB and BatchSize to keep MEA pod resources in a reasonable range, e.g. (2-3 cores / 4-7 GB) JMS CPU / MEM Usage Yes Default setting works well in the benchmark test DB CPU / MEM Usage Yes Ensure DB has sufficient resources DB Disk IO Util % Yes, but sometimes it is hard to adjust Disk IO throughput is critical for the overall processing DB Lock Holds N/A DB Tuning: Long Running Query, # of Appl, Memory.. Yes Follow the best practice to tune DB Maximo Sequence Cache Yes a reasonable # e.g. 20 or 50 can reduce the db cpu and processing time mxe.int.splitdataonpost Yes Message # of record per Message Yes data structure (complexity of the record) N/A Impacts performance because of business logic check Record Quality (record cannot be processed) Yes A large number of int error messages slows down the overall processing speed Misc Method & Speed to post message into queue Yes Ensure message post (writing to queue) is as fast as possible. A slow pace lowers the env processing capacity. Any other concurrent transactions N/A other concurrent workloads impact the processing time Worker Node Capacity Yes Worker Node Capacity may limit working pod (e.g. MEA) capacity. 
Pod distribution should also be considered.","title":"JMS"},{"location":"mas/mif-jms/bestpractice/#mas-manage-mifjms","text":"","title":"MAS Manage MIF/JMS"},{"location":"mas/mif-jms/bestpractice/#lab-result-highlights","text":"The lab results indicate a significant correlation between the transaction per second (TPS) and database disk IO utilization. This correlation suggests that the level of transactional activity directly impacts the IO workload on the database disk. Conversely, the IO workload acts as a limitation on the system's ability to handle a larger volume of transactions. When IO is not the limiting factor, increasing the number of MEA Pods can positively impact the processing performance. Increasing the Message-Driven Bean (MDB) instances can potentially have a positive impact on system performance. It is recommended to adjust the number of records per message, the # of MDB and the batch size. By finding the right balance, you can target a resource usage of around 2 cores and 4-7GB of RAM that can help ensure efficient utilization without overburdening the MEA pods. Based on the lab results, it has been observed that a large number of internal error messages have a substantial impact on processing throughput. Under a certain circumstance, the configuration parameter mxe.int.splitdataonpost does not demonstrate a positive impact. To validate its effectiveness, it is recommended to perform a dry run in your specific environment for verification.","title":"Lab Result Highlights"},{"location":"mas/mif-jms/bestpractice/#performance-troubleshooting-checklist","text":"To troubleshoot and optimize performance, follow this checklist: Ensure adherence to best practices for optimizing performance in your DB, Openshift, and MAS environments. Monitor disk IO utilization of the database and maintain it within acceptable limits to avoid performance degradation due to saturated disk resources. 
Adjust the number of records per message and the MDB/batch size to effectively manage resource utilization of MEA pods. Aim for a resource consumption range of approximately 2 cores and 4-7 GB. Regularly check the message queue to prevent it from becoming empty, ensuring a steady flow of messages for processing. Minimize the occurrence of integration error messages as they can significantly impact processing throughput. Pay attention to a high volume of internal error messages and investigate the message reprocessing application for further insights. Set a sufficiently large value for maxMessageDepth to avoid message queue overflow. It is recommended to match SIBus's default value of at least 500,000. When the need for additional MEA pods arises, consider scaling up the number of worker nodes to accommodate the increased demand effectively.","title":"Performance Troubleshooting Checklist"},{"location":"mas/mif-jms/bestpractice/#test-methodologies","text":"Establish a monitoring system to track essential performance metrics throughout the testing process. Begin with a dry run using a single MEA pod to establish a baseline benchmark for performance evaluation. Adjust the Message-Driven Bean (MDB) and BatchSize parameters to optimize resource utilization within an appropriate range for the MEA pod. Scale up the number of MEA pods as needed to meet performance requirements and accommodate increased workload. Continuously monitor and assess the performance of both the database and the application to identify any bottlenecks or areas for improvement. 
By following these test methodologies, you can effectively monitor and optimize the performance of your system, ensuring efficient resource utilization and maintaining satisfactory levels of performance.","title":"Test Methodologies"},{"location":"mas/mif-jms/bestpractice/#major-performance-related-factors-for-mif","text":"Component Configuration Adjustable or Scalable Observation & Best Practice JMS / MIF maxMessageDepth Yes Make it large enough. If it is too small, when the queue is full, the process fails and may be hard to recover. Recommend 500,000, the same as SIBus maxEndpoints Yes Limit the maxConcurrency MDB(maxConcurrency) Yes Along with BatchSize, impacts processing speed and MEA pod resource utilization BatchSize(maxBatchSize) Yes Along with MDB, impacts processing speed and MEA pod resource utilization Maximo # of JMS Pod Yes 1 JMS Server works well in benchmark test. It does not consume significant resources # of MEA Pod Yes Scales linearly MEA CPU / MEM Usage Yes Adjust JMS/MDB and BatchSize to keep MEA pod resources in a reasonable range, e.g. (2-3 cores / 4-7 GB) JMS CPU / MEM Usage Yes Default setting works well in the benchmark test DB CPU / MEM Usage Yes Ensure DB has sufficient resources DB Disk IO Util % Yes, but sometimes it is hard to adjust Disk IO throughput is critical for the overall processing DB Lock Holds N/A DB Tuning: Long Running Query, # of Appl, Memory.. Yes Follow the best practice to tune DB Maximo Sequence Cache Yes a reasonable # e.g. 20 or 50 can reduce the db cpu and processing time mxe.int.splitdataonpost Yes Message # of record per Message Yes data structure (complexity of the record) N/A Impacts performance because of business logic check Record Quality (record cannot be processed) Yes A large number of int error messages slows down the overall processing speed Misc Method & Speed to post message into queue Yes Ensure message post (writing to queue) is as fast as possible. 
A slow pace lowers the env processing capacity. Any other concurrent transactions N/A other concurrent workloads impact the processing time Worker Node Capacity Yes Worker Node Capacity may limit working pod (e.g. MEA) capacity. Pod distribution should also be considered.","title":"Major Performance Related Factors for MIF"},{"location":"mas/mif-kafka/bestpractice/","text":"MAS Manage MIF/Kafka \uf0c1 Lab Result Highlights \uf0c1 As in the MIF/JMS test, the lab results indicate a significant correlation between the transactions per second (TPS) and database disk IO utilization. This correlation suggests that the level of transactional activity directly impacts the IO workload on the database disk. Conversely, the IO workload acts as a limitation on the system's ability to handle a larger volume of transactions. The results also demonstrate a notable connection between the disk IO throughput and the TPS (Transactions Per Second). Doubling the number of CRON JVMs and Kafka topic partitions leads to a twofold increase in the maximum TPS. However, this change also results in an enlarged distribution difference, growing from 2% to 10%. Consequently, in the final phase, the overall processing rate diminishes, with the TPS decreasing from 72 to 66, attributed to the Kafka rule that allows a maximum of 1 consumer per partition. Increasing the number of partitions may result in better performance for small messages (e.g., 10 assets per message) compared to large messages. Please ensure that there is an adequate number of messages in the queue for processing. When evaluating the performance of a single MEA JVM, the TPS in MIF/Kafka matches that of JMS. Nevertheless, when multiple processing JVMs are utilized, JMS performs better due to its more equitable workload distribution. 
From a best-practice standpoint, it is advisable to have one Kafka topic with 6 partitions and multiple Kafka topics for parallel processing.","title":"Kafka"},{"location":"mas/mif-kafka/bestpractice/#mas-manage-mifkafka","text":"","title":"MAS Manage MIF/Kafka"},{"location":"mas/mif-kafka/bestpractice/#lab-result-highlights","text":"As in the MIF/JMS test, the lab results indicate a significant correlation between the transactions per second (TPS) and database disk IO utilization. This correlation suggests that the level of transactional activity directly impacts the IO workload on the database disk. Conversely, the IO workload acts as a limitation on the system's ability to handle a larger volume of transactions. The results also demonstrate a notable connection between the disk IO throughput and the TPS (Transactions Per Second). Doubling the number of CRON JVMs and Kafka topic partitions leads to a twofold increase in the maximum TPS. However, this change also results in an enlarged distribution difference, growing from 2% to 10%. Consequently, in the final phase, the overall processing rate diminishes, with the TPS decreasing from 72 to 66, attributed to the Kafka rule that allows a maximum of 1 consumer per partition. Increasing the number of partitions may result in better performance for small messages (e.g., 10 assets per message) compared to large messages. Please ensure that there is an adequate number of messages in the queue for processing. When evaluating the performance of a single MEA JVM, the TPS in MIF/Kafka matches that of JMS. Nevertheless, when multiple processing JVMs are utilized, JMS performs better due to its more equitable workload distribution. 
From a best-practice standpoint, it is advisable to have one Kafka topic with 6 partitions and multiple Kafka topics for parallel processing.","title":"Lab Result Highlights"},{"location":"mas/mobile/bestpractice/","text":"MAS Manage Mobile \uf0c1 Tips and Tricks: \uf0c1 Strongly recommend creating a mobile database for supporting data downloads. Online support downloading can significantly impact the performance of Mobile Pods, databases, and networks. To mitigate download failures, consider increasing the timeout value for the ingress controller. The default server/client timeout is set too low, affecting the pass rate. Use the following commands to raise the default value: oc -n openshift-ingress-operator patch ingresscontroller/default --type=merge -p '{\"spec\":{\"tuningOptions\": {\"clientTimeout\": \"300s\"}}}' oc -n openshift-ingress-operator patch ingresscontroller/default --type=merge -p '{\"spec\":{\"tuningOptions\": {\"serverTimeout\": \"300s\"}}}' Scaling up the coreapi pod can enhance the downloading experience for the mobile app. Consider scaling up the mobile pods when the CPU usage of a pod exceeds 4. Optimal disk throughput for the database is crucial for a smooth app downloading experience. Observations from lab tests suggest that balanced node resource utilization is crucial for optimal performance. It is worth noting that the default topology spread constraints in the ManageWorkspace Custom Resource (CR) are set to \"topologyKey: topology.kubernetes.io/zone\". However, in a single-zone cluster, if the pods are not evenly distributed across worker nodes, consider setting it to \"topologyKey: topology.kubernetes.io/hostname\" instead.","title":"Mobile"},{"location":"mas/mobile/bestpractice/#mas-manage-mobile","text":"","title":"MAS Manage Mobile"},{"location":"mas/mobile/bestpractice/#tips-and-tricks","text":"Strongly recommend creating a mobile database for supporting data downloads. 
Online support downloading can significantly impact the performance of Mobile Pods, databases, and networks. To mitigate download failures, consider increasing the timeout value for the ingress controller. The default server/client timeout is set too low, affecting the pass rate. Use the following commands to raise the default value: oc -n openshift-ingress-operator patch ingresscontroller/default --type=merge -p '{\"spec\":{\"tuningOptions\": {\"clientTimeout\": \"300s\"}}}' oc -n openshift-ingress-operator patch ingresscontroller/default --type=merge -p '{\"spec\":{\"tuningOptions\": {\"serverTimeout\": \"300s\"}}}' Scaling up the coreapi pod can enhance the downloading experience for the mobile app. Consider scaling up the mobile pods when the CPU usage of a pod exceeds 4. Optimal disk throughput for the database is crucial for a smooth app downloading experience. Observations from lab tests suggest that balanced node resource utilization is crucial for optimal performance. It is worth noting that the default topology spread constraints in the ManageWorkspace Custom Resource (CR) are set to \"topologyKey: topology.kubernetes.io/zone\". 
However, in a single-zone cluster, if the pods are not evenly distributed across worker nodes, consider setting it to \"topologyKey: topology.kubernetes.io/hostname\" instead.","title":"Tips and Tricks:"},{"location":"mas/mongodb/bestpractice/","text":"MongoDB \uf0c1 MongoDB Troubleshoot: \uf0c1 mongostat: mongostat --username admin --password --authenticationDatabase admin --ssl --sslAllowInvalidCertificates 2 mongotop: mongotop --username admin --password --authenticationDatabase admin --ssl --sslAllowInvalidCertificates 2 check mongod log for slow queries (MongoDB community): oc logs -n -c mongod | grep -iE 'Slow query' operations running longer than 3 seconds: db.currentOp({\"active\" : true,\"secs_running\" : { \"$gt\" : 3 },\"ns\" : /^msg/}) kill a long-running operation: db.killOp(\"opid\") locking: db.serverStatus().globalLock mem: db.serverStatus().mem wiredTiger cache: db.serverStatus().wiredTiger.cache concurrent: db.serverStatus().connections","title":"MongoDB"},{"location":"mas/mongodb/bestpractice/#mongodb","text":"","title":"MongoDB"},{"location":"mas/mongodb/bestpractice/#mongodb-troubleshoot","text":"mongostat: mongostat --username admin --password --authenticationDatabase admin --ssl --sslAllowInvalidCertificates 2 mongotop: mongotop --username admin --password --authenticationDatabase admin --ssl --sslAllowInvalidCertificates 2 check mongod log for slow queries (MongoDB community): oc logs -n -c mongod | grep -iE 'Slow query' operations running longer than 3 seconds: db.currentOp({\"active\" : true,\"secs_running\" : { \"$gt\" : 3 },\"ns\" : /^msg/}) kill a long-running operation: db.killOp(\"opid\") locking: db.serverStatus().globalLock mem: db.serverStatus().mem wiredTiger cache: db.serverStatus().wiredTiger.cache concurrent: db.serverStatus().connections","title":"MongoDB Troubleshoot:"},{"location":"mas/monitoring/guidance/","text":"Monitoring \uf0c1 Monitoring your OpenShift clusters is critical for environment health and quality of service. 
It helps ensure that all deployed workloads are running smoothly and that the environment is properly scoped. OpenShift Monitoring Service (Prometheus/Grafana) \uf0c1 OpenShift Container Platform includes a pre-installed monitoring stack that is based on Prometheus and Grafana. MAS also provides app-level Prometheus metrics and a set of Grafana dashboards for application health. More installation and configuration details can be found in IBM MAS Monitoring Best practice for OpenShift Monitoring Service enable User Workload: enableUserWorkload: true consider increasing the Prometheus retention policy, whose default value is 24h, and adding persistent volumes consider changing Alert Manager's storage class and size Below is the sample for configmap cluster-monitoring-config apiVersion : v1 kind : ConfigMap metadata : name : cluster-monitoring-config namespace : openshift-monitoring data : config.yaml : | enableUserWorkload: true prometheusK8s: retention: 90d volumeClaimTemplate: spec: storageClassName: nfs-client resources: requests: cpu: 200m storage: 300Gi memory: 2Gi limits: cpu: 2 memory: 4Gi alertmanagerMain: volumeClaimTemplate: spec: storageClassName: nfs-client resources: requests: storage: 20Gi Note Besides the OpenShift Monitoring Service (Prometheus/Grafana), there are other paid solutions like IBM Instana , New Relic , Datadog that also support OCP. If the cluster is cloud based, consider using the cloud provider's monitoring tool for additional info like network, disk, managed services. e.g. AWS CloudWatch, IBM Log Analysis...","title":"Monitoring"},{"location":"mas/monitoring/guidance/#monitoring","text":"Monitoring your OpenShift clusters is critical for environment health and quality of service. 
It helps ensure that all deployed workloads are running smoothly and that the environment is properly scoped.","title":"Monitoring"},{"location":"mas/monitoring/guidance/#openshift-monitoring-service-promethusgrafana","text":"OpenShift Container Platform includes a pre-installed monitoring stack that is based on Prometheus and Grafana. MAS also provides app-level Prometheus metrics and a set of Grafana dashboards for application health. More installation and configuration details can be found in IBM MAS Monitoring Best practice for OpenShift Monitoring Service enable User Workload: enableUserWorkload: true consider increasing the Prometheus retention policy, whose default value is 24h, and adding persistent volumes consider changing Alert Manager's storage class and size Below is the sample for configmap cluster-monitoring-config apiVersion : v1 kind : ConfigMap metadata : name : cluster-monitoring-config namespace : openshift-monitoring data : config.yaml : | enableUserWorkload: true prometheusK8s: retention: 90d volumeClaimTemplate: spec: storageClassName: nfs-client resources: requests: cpu: 200m storage: 300Gi memory: 2Gi limits: cpu: 2 memory: 4Gi alertmanagerMain: volumeClaimTemplate: spec: storageClassName: nfs-client resources: requests: storage: 20Gi Note Besides the OpenShift Monitoring Service (Prometheus/Grafana), there are other paid solutions like IBM Instana , New Relic , Datadog that also support OCP. If the cluster is cloud based, consider using the cloud provider's monitoring tool for additional info like network, disk, managed services. e.g. AWS CloudWatch, IBM Log Analysis...","title":"OpenShift Monitoring Service (Prometheus/Grafana)"},{"location":"mas/ocp/bestpractice/","text":"OpenShift Container Platform \uf0c1 Cluster Insights Advisor \uf0c1 Highly recommend using the OpenShift cluster Insights Advisor to check for any issues related to the current version, nodes and misconfigurations. It is the first step in problem diagnosis. 
Steps: Log in to the OpenShift Console Go to Administration -> Cluster Settings Click OpenShift Cluster Manager in Subscription section. It redirects to the Red Hat Hybrid Cloud Console Click Insights Advisor PID limit for docker \uf0c1 This setting controls how many processes can run within a single container. If it is too small, containers can fail to create new processes or threads (fork failures). E.g. a db2w instance may become unavailable when thousands of connections/agents arrive, or OpenShift Container Storage may not behave well with a large number of PVCs. OOB value for OCP platforms: Platform Version Default Value IBM ROKS (4.8) 231239 AWS ROSA 4096 in OpenShift 4.11 and higher Azure Self-Managed OCP 1024 Steps to check or update PID limit: $ oc debug node/$NODE_NAME $ chroot /host $ cat /etc/crio/crio.conf # add / modify the line \"pids_limit = \" # run the commands below to restart services and reboot the worker node $ systemctl daemon-reload $ systemctl restart crio $ shutdown -r now HAProxy Router \uf0c1 Ingress Controller \uf0c1 OpenShift HAProxy supports up to 20k connections per pod. Consider scaling up the ingress pods for any app (like IoT) with a high-volume connection workload. Scale up ingress controller command: oc patch -n openshift-ingress-operator ingresscontroller/default --patch '{\"spec\":{\"replicas\": 3}}' --type=merge Max Connection \uf0c1 One of the most important tunable parameters for HAProxy scalability is the maxconn parameter. The router can handle a maximum number of 20k concurrent connections by using oc adm router --max-connections=xxxxx . This parameter will be impacted by node settings sysctl fs.nr_open and sysctl fs.file-max . HAProxy will not start if maxconn is high, but node setting is low. Note: OpenShift Container Platform no longer supports modifying Ingress Controller deployments by setting environment variables such as ROUTER_THREADS, ROUTER_DEFAULT_TUNNEL_TIMEOUT, ROUTER_DEFAULT_CLIENT_TIMEOUT, ROUTER_DEFAULT_SERVER_TIMEOUT, and RELOAD_INTERVAL. 
You can modify the Ingress Controller deployment, but if the Ingress Operator is enabled, the configuration is overwritten. Load Balance Algorithm \uf0c1 Starting from OCP 4.10, there have been four load-balancing algorithms available: source, roundrobin, random, and leastconn . The default algorithm is set to random . In earlier versions of OCP, before 4.10, there were three load-balancing algorithms: source, roundrobin, and leastconn . The default algorithm in those versions was leastconn . Set up annotations for each route to change the default algorithm if needed. e.g. haproxy.router.openshift.io/balance=roundrobin Master and Worker Nodes Consideration \uf0c1 There is a wide selection of instance types that comprise varying combinations of CPU, memory, disk and network. Below are a few considerations: Each worker node will reserve about 1 core for internal services. In order to avoid the side effect of overcommit, 16core/64G is a good starting type for a normal worker node. An 8-core instance may not have sufficient capacity, while losing a 32-core instance to an outage or failure removes a large share of cluster capacity. Using balanced CPU-memory worker nodes typically fits our workload: the ratio of CPU to memory is 1 to 4. An instance with a higher memory/cpu ratio e.g. 8:1 is recommended for database nodes. Use at least 3 worker nodes. This gives high availability while needing less built-in redundant capacity. For the production env, an 8core/32G instance is recommended for master nodes to avoid any bottleneck for the internal services. An instance with 10GB ethernet is strongly recommended for the production env. 
Check the GPU chip type for gpu node selection.","title":"OpenShift Container Platform"},{"location":"mas/ocp/bestpractice/#openshift-container-platform","text":"","title":"OpenShift Container Platform"},{"location":"mas/ocp/bestpractice/#cluster-insights-advisor","text":"Highly recommend using the OpenShift cluster Insights Advisor to check for any issues related to the current version, nodes and misconfigurations. It is the first step in problem diagnosis. Steps: Log in to the OpenShift Console Go to Administration -> Cluster Settings Click OpenShift Cluster Manager in Subscription section. It redirects to the Red Hat Hybrid Cloud Console Click Insights Advisor","title":"Cluster Insights Advisor"},{"location":"mas/ocp/bestpractice/#pid-limit-for-docker","text":"This setting controls how many processes can run within a single container. If it is too small, containers can fail to create new processes or threads (fork failures). E.g. a db2w instance may become unavailable when thousands of connections/agents arrive, or OpenShift Container Storage may not behave well with a large number of PVCs. OOB value for OCP platforms: Platform Version Default Value IBM ROKS (4.8) 231239 AWS ROSA 4096 in OpenShift 4.11 and higher Azure Self-Managed OCP 1024 Steps to check or update PID limit: $ oc debug node/$NODE_NAME $ chroot /host $ cat /etc/crio/crio.conf # add / modify the line \"pids_limit = \" # run the commands below to restart services and reboot the worker node $ systemctl daemon-reload $ systemctl restart crio $ shutdown -r now","title":"PID limit for docker"},{"location":"mas/ocp/bestpractice/#haproxy-router","text":"","title":"HAProxy Router"},{"location":"mas/ocp/bestpractice/#ingress-controller","text":"OpenShift HAProxy supports up to 20k connections per pod. Consider scaling up the ingress pods for any app (like IoT) with a high-volume connection workload. 
Scale up ingress controller command: oc patch -n openshift-ingress-operator ingresscontroller/default --patch '{\"spec\":{\"replicas\": 3}}' --type=merge","title":"Ingress Controller"},{"location":"mas/ocp/bestpractice/#max-connection","text":"One of the most important tunable parameters for HAProxy scalability is the maxconn parameter. The router can handle a maximum number of 20k concurrent connections by using oc adm router --max-connections=xxxxx . This parameter will be impacted by node settings sysctl fs.nr_open and sysctl fs.file-max . HAProxy will not start if maxconn is high, but node setting is low. Note: OpenShift Container Platform no longer supports modifying Ingress Controller deployments by setting environment variables such as ROUTER_THREADS, ROUTER_DEFAULT_TUNNEL_TIMEOUT, ROUTER_DEFAULT_CLIENT_TIMEOUT, ROUTER_DEFAULT_SERVER_TIMEOUT, and RELOAD_INTERVAL. You can modify the Ingress Controller deployment, but if the Ingress Operator is enabled, the configuration is overwritten.","title":"Max Connection"},{"location":"mas/ocp/bestpractice/#load-balance-algorithm","text":"Starting from OCP 4.10, there have been four load-balancing algorithms available: source, roundrobin, random, and leastconn . The default algorithm is set to random . In earlier versions of OCP, before 4.10, there were three load-balancing algorithms: source, roundrobin, and leastconn . The default algorithm in those versions was leastconn . Set up annotations for each route to change the default algorithm if needed. e.g. haproxy.router.openshift.io/balance=roundrobin","title":"Load Balance Algorithm"},{"location":"mas/ocp/bestpractice/#master-and-worker-nodes-consideration","text":"There is a wide selection of instance types that comprise varying combinations of CPU, memory, disk and network. Below are a few considerations: Each worker node will reserve about 1 core for internal services. 
To avoid the side effects of overcommit, 16 cores / 64G is a good starting size for a normal worker node. An 8-core instance may not have sufficient capacity, while losing a 32-core instance to an outage or failure removes a large share of cluster capacity. Balanced CPU-memory worker nodes typically fit our workload: the ratio of CPU to memory is 1 to 4. An instance with a higher memory/CPU ratio, e.g. 8:1, is recommended for database nodes. Use at least 3 worker nodes. This provides high availability while requiring less built-in redundant capacity. For the production env, 8 cores / 32G is recommended for master nodes to avoid any bottleneck for the internal services. An instance with 10 Gb Ethernet is strongly recommended for the production env. Check the GPU chip type for GPU node selection.","title":"Master and Worker Nodes Consideration"},{"location":"mas/sizing/guidance/","text":"Sizing Guidance The sizing numbers on this page are based on a standard workload. Use them as reference only. Sizing Calculation Sheet Use the Sizing Calculation Sheet for MAS sizing. Factors that impact the sizing consideration storage operator: e.g. ocs, odf... cp4d services: e.g. db2w, watson studio... mongodb service kafka service OCS (OpenShift Container Storage) If OCS is used to manage the storage class, the OCS service itself requires a minimum of 3 nodes with 14 cores / 32G in total (Note: this is the total request amount, not per node). ODF (OpenShift Data Foundation) 3 OCP nodes will run ODF services. (NOTE: OCP clusters often contain additional OCP worker nodes which do not run ODF services.) Each OCP node running ODF services has: 16 cores / 64 GB memory CP4D/DB2W Minimum Resource Requirement When running CP4D/DB2W on an OpenShift worker node, each instance requires at least 6.1 cores and 18G RAM. Note: an instance pod cannot be scheduled if the node's (total capacity - total limit) is less than 6.1 cores or 18G RAM, so a dedicated worker node or an external DB is recommended. 
The Db2 operator is an alternative. MAS Manage Based on the benchmark results, for sizing we recommend a load of 50 - 75 users per MAS Manage UI server bundle pod, which is equivalent to a JVM with 2 cores on Maximo 7.6.x. MAS Resource Statistics Use the values below as reference only. The footprint depends on the load and spec settings, e.g. IoT T-shirt size, Manage bundle and replica count. The values below are based on the IoT small T-shirt size and Manage with only the all-in-one bundle and replicas = 1. App CPU Request (core) CPU Limits (core) Memory Request (GB) Memory Limits (GB) Add 6 12 13 26 Assist 12.4 57.7 19.46 62.38 Core 1.5 18.95 6.27 32.5 Health 2.9 15.6 7.12 30.84 HPU 0.9 5.5 0.92 6.5 IoT 19.66 214.65 57.08 269 Manage 2.9 11.1 4.04 17 Monitor 5.4 32.4 12.84 55.5 Optimizer 7.4 19.3 25.57 117 Predict 3.1 12.5 6.13 24.5 Additional cost: ocs* 14 32 14 32 cp4d (with 2 db2w instances)* 31.59 40.7 235.39 249.70 each additional manage pod* 1 6 2 10","title":"Sizing Guidance"},{"location":"mas/sizing/guidance/#sizing-guidance","text":"The sizing numbers on this page are based on a standard workload. Use them as reference only.","title":"Sizing Guidance"},{"location":"mas/sizing/guidance/#sizing-calculation-sheet","text":"Use the Sizing Calculation Sheet for MAS sizing.","title":"Sizing Calculation Sheet"},{"location":"mas/sizing/guidance/#factors-that-impact-the-sizing-consideration","text":"storage operator: e.g. ocs, odf... cp4d services: e.g. db2w, watson studio... 
mongodb service kafka service","title":"Factors that impact the sizing consideration"},{"location":"mas/sizing/guidance/#ocs-openshift-container-storage","text":"If OCS is used to manage the storage class, the OCS service itself requires a minimum of 3 nodes with 14 cores / 32G in total (Note: this is the total request amount, not per node).","title":"OCS (OpenShift Container Storage)"},{"location":"mas/sizing/guidance/#odf-openshift-data-foundation","text":"3 OCP nodes will run ODF services. (NOTE: OCP clusters often contain additional OCP worker nodes which do not run ODF services.) Each OCP node running ODF services has: 16 cores / 64 GB memory","title":"ODF (OpenShift Data Foundation)"},{"location":"mas/sizing/guidance/#cp4ddb2w-minimum-resource-requirement","text":"When running CP4D/DB2W on an OpenShift worker node, each instance requires at least 6.1 cores and 18G RAM. Note: an instance pod cannot be scheduled if the node's (total capacity - total limit) is less than 6.1 cores or 18G RAM, so a dedicated worker node or an external DB is recommended. The Db2 operator is an alternative.","title":"CP4D/DB2W Minimum Resource Requirement"},{"location":"mas/sizing/guidance/#mas-manage","text":"Based on the benchmark results, for sizing we recommend a load of 50 - 75 users per MAS Manage UI server bundle pod, which is equivalent to a JVM with 2 cores on Maximo 7.6.x.","title":"MAS Manage"},{"location":"mas/sizing/guidance/#mas-resource-statistics","text":"Use the values below as reference only. The footprint depends on the load and spec settings, e.g. 
IoT T-shirt size, Manage bundle and replic # the value below is based on IoT small T-shirt size and Manage with only all-in-one bundle and replic =1 App CPU Request (core) CPU Limits (core) Memory Rquest (GB) Memory Limits(GB) Add 6 12 13 26 Assist 12.4 57.7 19.46 62.38 Core 1.5 18.95 6.27 32.5 Health 2.9 15.6 7.12 30.84 HPU 0.9 5.5 0.92 6.5 IoT 19.66 214.65 57.08 269 Manage 2.9 11.1 4.04 17 Monitor 5.4 32.4 12.84 55.5 Optimizer 7.4 19.3 25.57 117 Predict 3.1 12.5 6.13 24.5 Additional cost - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - ocs* 14 32 14 32 cp4d (with 2 db2w instances)* 31.59 40.7 235.39 249.70 each additional manage pod* 1 6 2 10","title":"MAS Resource Statistics"},{"location":"maximo-7/bestpractice/","text":"Info The next version of Maximo 7 is MAS Manage 8.x. This page includes the best practice documents for Maximo EAM v7 only. The DB section in each best practice is still applicable to MAS Manage. Best Practices for Maximo 7.x \uf0c1 Maximo Best Practices for System Performance 7.6.x (Version 1.3) Maximo Anywhere 7.6.2 Best Practices (2018 Edition) Maximo Performance Tuning Tips: Additional tips on top of Best Practices for System Performance Maximo Best Practices for System Performance 7.5.x (Version 2.1) Improving Start Center performance in Maximo System Properties to Monitor and Troubleshoot Performance Maximo Report Performance Guide 7.6.x (Version 1)","title":"Best Practice"},{"location":"maximo-7/bestpractice/#best-practices-for-maximo-7x","text":"Maximo Best Practices for System Performance 7.6.x (Version 1.3) Maximo Anywhere 7.6.2 Best Practices (2018 Edition) Maximo Performance Tuning Tips: Additional tips on top of Best Practices for System Performance Maximo Best Practices for System Performance 7.5.x (Version 2.1) Improving Start Center performance in Maximo System Properties to Monitor and Troubleshoot Performance Maximo Report Performance Guide 7.6.x (Version 1)","title":"Best Practices for Maximo 
7.x"},{"location":"pd/checklist/","text":"Performance Diagnosis \uf0c1 Info A monitoring system is strongly recommended to track the environment health and the quality of services. Diagnostic Utility \uf0c1 Scope Name Used for OCP OpenShift Monitoring Service OpenShift Cluster and MAS DB2 IBM DSM DB2 Historical and Realtime Troubleshooting DB2 db2top DB2 Realtime Troubleshooting DBTest DBTest An utility to test db network latency and fetching time Oracle AWR, StatsPack Historical Troubleshooting JVM IBM Support Assistant Heap Dump and GC Log Analysis JVM MAT JVM Dump Analysis Maximo PerfMon - Maximo UI Activity Tracing - Note : Enabling PerfMon may significantly degrade server performance. - Recommend for a single user with Dev/Test env only MongoDB mongotop MongoDB Realtime Troubleshooting HAR HTTP Archive Viewer HAR Analysis - for web page and client side (browser) performance SQL Poor SQL Online SQL Formatter SQL Squirrl Universal SQL Client SSL SSL Shopper Online certificate decode tool OS top Process and thread level analysis, hotspot analysis - top is available in most containers and on OCP worker nodes OS sar a system command be used to monitor system resources like cpu, memory, disk, network... OCP oc debug node/ Worker node debugging Factors in system performance \uf0c1 System performance depends on more than the applications and the database. The network architecture affects performance. Application server configuration can hurt or improve performance. The way that you deploy Maximo across servers affects the way the products perform. Many other factors come into play in providing the end-user experience of system performance. 
Subsequent sections in this paper address the following topics: System architecture setup including OCP, Instance Type, Storage App and DB server configuration Network issues Bandwidth Load balancing Database tuning SQL tuning Scheduled tasks (cron tasks) Reporting Integration with other systems using the integration framework Troubleshooting Performance Check List Check node status, e.g. any NotReady worker nodes. Is any pod or node CPU or memory usage approaching its limit? Has any pod restarted many times recently? Are there any JVM heap dumps? Are there any JVM hung threads? Does any node or pod show high system or I/O wait (20%)? Is any node under memory, disk or PID pressure? Is the response time high (over 2 sec)? Are there any long-running (over 2 sec) or high-CPU-cost queries? Is there a network bottleneck (e.g. load balancer)? Is the app server or the DB server busy? If the app server is busy: check the request and limit values for CPU and memory; should the number of replicas be increased? If the DB server is busy: check current CPU, memory and disk usage and limit values; check any utility running in the background, e.g. 
backup check db lock check if there is any high cost query check disk performance","title":"Diagnosis Check List"},{"location":"pd/checklist/#performance-diagnosis","text":"Info A monitoring system is strongly recommended to track the environment health and the quality of services.","title":"Performance Diagnosis"},{"location":"pd/checklist/#diagnostic-utility","text":"Scope Name Used for OCP OpenShift Monitoring Service OpenShift Cluster and MAS DB2 IBM DSM DB2 Historical and Realtime Troubleshooting DB2 db2top DB2 Realtime Troubleshooting DBTest DBTest An utility to test db network latency and fetching time Oracle AWR, StatsPack Historical Troubleshooting JVM IBM Support Assistant Heap Dump and GC Log Analysis JVM MAT JVM Dump Analysis Maximo PerfMon - Maximo UI Activity Tracing - Note : Enabling PerfMon may significantly degrade server performance. - Recommend for a single user with Dev/Test env only MongoDB mongotop MongoDB Realtime Troubleshooting HAR HTTP Archive Viewer HAR Analysis - for web page and client side (browser) performance SQL Poor SQL Online SQL Formatter SQL Squirrl Universal SQL Client SSL SSL Shopper Online certificate decode tool OS top Process and thread level analysis, hotspot analysis - top is available in most containers and on OCP worker nodes OS sar a system command be used to monitor system resources like cpu, memory, disk, network... OCP oc debug node/ Worker node debugging","title":"Diagnostic Utility"},{"location":"pd/checklist/#factors-in-system-performance","text":"System performance depends on more than the applications and the database. The network architecture affects performance. Application server configuration can hurt or improve performance. The way that you deploy Maximo across servers affects the way the products perform. Many other factors come into play in providing the end-user experience of system performance. 
Subsequent sections in this paper address the following topics: System architecture setup including OCP, Instance Type, Storage App and DB server configuration Network issues Bandwidth Load balancing Database tuning SQL tuning Scheduled tasks (cron tasks) Reporting Integration with other systems using the integration framework Troubleshooting","title":"Factors in system performance"},{"location":"pd/checklist/#performance-check-list","text":"Check node status, e.g. any NotReady worker nodes. Is any pod or node CPU or memory usage approaching its limit? Has any pod restarted many times recently? Are there any JVM heap dumps? Are there any JVM hung threads? Does any node or pod show high system or I/O wait (20%)? Is any node under memory, disk or PID pressure? Is the response time high (over 2 sec)? Are there any long-running (over 2 sec) or high-CPU-cost queries? Is there a network bottleneck (e.g. load balancer)? Is the app server or the DB server busy? If the app server is busy: check the request and limit values for CPU and memory; should the number of replicas be increased? If the DB server is busy: check current CPU, memory and disk usage and limit values; check any utility running in the background, e.g. backup; check DB locks; check for any high-cost queries; check disk performance","title":"Performance Check List"},{"location":"pd/db2-performance-diagnosis/","text":"DB2TOP db2top can be used for real-time diagnosis. Command: db2top -db press h : help screen press I : reset the interval time (default is 2 seconds) press m : memory screen press B : bottleneck screen press b : bufferpool screen press T : Table screen press U : locks screen press u : utility screen to check if runstats is running press D : Dynamic SQL screen To catch high-CPU SQL in the Dynamic SQL screen: Press z and 5 to sort by CPU usage Copy the SQL hashcode Press L and paste the SQL hashcode Notes: Be cautious when taking any snapshot. 
See more details on User Manual Diagnosis Commands \uf0c1 list memory allocation: db2mtrk -i -d \u2013v list long run query: SELECT ELAPSED_TIME_MIN,SUBSTR(AUTHID,1,10) AS AUTH_ID, AGENT_ID,APPL_STATUS,SUBSTR(STMT_TEXT,1,20) AS SQL_TEXT FROM SYSIBMADM.LONG_RUNNING_SQL WHERE ELAPSED_TIME_MIN > 0 ORDER BY ELAPSED_TIME_MIN DESC; list backup/restore status: db2pd -barstats -d list most active tables: SELECT SUBSTR(TABSCHEMA,1,10) AS SCHEMA,SUBSTR(TABNAME,1,20) AS NAME,TABLE_SCANS,ROWS_READ,ROWS_INSERTED,ROWS_DELETED FROM TABLE(MON_GET_TABLE('','',-1)) ORDER BY ROWS_READ DESC FETCH FIRST 5 ROWS ONLY list most active indexes: SELECT SUBSTR(TABSCHEMA,1,10) AS SCHEMA,SUBSTR(TABNAME,1,20) AS NAME,IID,NLEAF, NLEVELS,INDEX_SCANS,KEY_UPDATES,BOUNDARY_LEAF_NODE_SPLITS + NONBOUNDARY_LEAF_NODE_SPLITS AS PAGE_SPLITS FROM TABLE(MON_GET_INDEX('','',-1)) ORDER BY INDEX_SCANS DESC FETCH FIRST 5 ROWS ONLY list db2 advise for the statement: db2advis -database bludb -s \"select * from maximo.ahfactorhistory where ahdriverhistoryid = 123 for read only\" -n MAXIMO -q MAXIMO checking for indexes the need to be rebuilt db2 reorgchk current statistics on schema 'MAXIMO' > /tmp/reorgchk.log Any indexes or tables with an * in the REORG column, indicate that they are candidates for reorg. 
list the query execution plan: db2expln -database bludb -schema MAXIMO -package % -statement \"select * from maximo.ahfactorhistory where ahdriverhistoryid = 123 for read only\" -terminal -graph > query1_access_plan.txt list all indexes for a specific table: select * from syscat.indexes i where TABNAME ='ITEMSTRUCT' list insert/update/delete/tablescan stats for a specific table: SELECT rows_read,rows_inserted,rows_updated,rows_deleted,table_scans FROM TABLE(MON_GET_TABLE('MAXIMO','ASSET',-2)) list insert/update/delete/tablescan stats for all tables: SELECT SUBSTR(TABSCHEMA,1,10) AS SCHEMA,SUBSTR(TABNAME,1,20) AS NAME,TABLE_SCANS,ROWS_READ,ROWS_INSERTED,ROWS_DELETED FROM TABLE(MON_GET_TABLE('','',-1)) ORDER BY ROWS_READ DESC FETCH FIRST 5 ROWS ONLY\" list top 10 big tables: select creator, name, avgrowsize, card, stats_time, avgrowsize*card as tbsize, npages*t.pagesize/1024/1024 as tbsize_inMB from sysibm.systables t1, syscat.tablespaces t where creator not like 'DB2%' and t1.tbspace=t.tbspace order by tbsize desc fetch first 10 rows only list data and index size for one table: select tabschema, tabname, DATA_OBJECT_P_SIZE/1024 as data_inMB, INDEX_OBJECT_P_SIZE/1024 as index_inMB,LONG_OBJECT_P_SIZE/1024 LongObj_inMB, LOB_OBJECT_P_SIZE/1024 as LOB_inMB from table(sysproc.admin_get_tab_info('MAXIMO','WORKORDER')) list error message: db2 ? db2pd : monitor and troubleshoot DB2 database command db2diag : db2diag logs analysis tool command db2set : db2 global settings db2 get dbm cfg : db2 database manager configuration db2 get db cfg : db2 database configuration IBM Data Server Manager (IBM DSM) \uf0c1 IBM DSM is useful to do both real-time/ historical data diagnosis, find out the expensive sql query, justify cpu spent on sql execution or other e.g. sorting, parsing, fetching, io and so on. It requires pre-configuration. 
A high-level set up: Download the latest version of Data Server Manager from IBM developerWorks or IBM Passport Advantage Online , then extract to /opt/ibm/dsm run setup.sh to set up and create admin user run start.sh to start the server, url is http://hostname:11080/console log on the console, select a time period (e.g. peak time) and then generate report.","title":"DB2 Performance Diagnosis"},{"location":"pd/db2-performance-diagnosis/#db2top","text":"db2top can be used for a real-time diagnosis. Command: db2top -db press h : help screen press I : reset the interval time (default is 2 seconds) press m : memory screen press B : bottleneck screen press b : bufferpool screen press T : Table screen press U : locks screen press u : utility screen to check if runstat is running press D : Dynamic SQL screen Catch High CPU SQL in Dynamic SQL screen, do: Press z and 5 to sort by cpu usage Copy SQL Hashcode Press L and Paste SQL Hashcode Notes: Be cautions when taking any snapshot. See more details on User Manual","title":"DB2TOP"},{"location":"pd/db2-performance-diagnosis/#diagnosis-commands","text":"list memory allocation: db2mtrk -i -d \u2013v list long run query: SELECT ELAPSED_TIME_MIN,SUBSTR(AUTHID,1,10) AS AUTH_ID, AGENT_ID,APPL_STATUS,SUBSTR(STMT_TEXT,1,20) AS SQL_TEXT FROM SYSIBMADM.LONG_RUNNING_SQL WHERE ELAPSED_TIME_MIN > 0 ORDER BY ELAPSED_TIME_MIN DESC; list backup/restore status: db2pd -barstats -d list most active tables: SELECT SUBSTR(TABSCHEMA,1,10) AS SCHEMA,SUBSTR(TABNAME,1,20) AS NAME,TABLE_SCANS,ROWS_READ,ROWS_INSERTED,ROWS_DELETED FROM TABLE(MON_GET_TABLE('','',-1)) ORDER BY ROWS_READ DESC FETCH FIRST 5 ROWS ONLY list most active indexes: SELECT SUBSTR(TABSCHEMA,1,10) AS SCHEMA,SUBSTR(TABNAME,1,20) AS NAME,IID,NLEAF, NLEVELS,INDEX_SCANS,KEY_UPDATES,BOUNDARY_LEAF_NODE_SPLITS + NONBOUNDARY_LEAF_NODE_SPLITS AS PAGE_SPLITS FROM TABLE(MON_GET_INDEX('','',-1)) ORDER BY INDEX_SCANS DESC FETCH FIRST 5 ROWS ONLY list db2 advise for the statement: db2advis 
-database bludb -s \"select * from maximo.ahfactorhistory where ahdriverhistoryid = 123 for read only\" -n MAXIMO -q MAXIMO checking for indexes the need to be rebuilt db2 reorgchk current statistics on schema 'MAXIMO' > /tmp/reorgchk.log Any indexes or tables with an * in the REORG column, indicate that they are candidates for reorg. list the query execution plan: db2expln -database bludb -schema MAXIMO -package % -statement \"select * from maximo.ahfactorhistory where ahdriverhistoryid = 123 for read only\" -terminal -graph > query1_access_plan.txt list all indexes for a specific table: select * from syscat.indexes i where TABNAME ='ITEMSTRUCT' list insert/update/delete/tablescan stats for a specific table: SELECT rows_read,rows_inserted,rows_updated,rows_deleted,table_scans FROM TABLE(MON_GET_TABLE('MAXIMO','ASSET',-2)) list insert/update/delete/tablescan stats for all tables: SELECT SUBSTR(TABSCHEMA,1,10) AS SCHEMA,SUBSTR(TABNAME,1,20) AS NAME,TABLE_SCANS,ROWS_READ,ROWS_INSERTED,ROWS_DELETED FROM TABLE(MON_GET_TABLE('','',-1)) ORDER BY ROWS_READ DESC FETCH FIRST 5 ROWS ONLY\" list top 10 big tables: select creator, name, avgrowsize, card, stats_time, avgrowsize*card as tbsize, npages*t.pagesize/1024/1024 as tbsize_inMB from sysibm.systables t1, syscat.tablespaces t where creator not like 'DB2%' and t1.tbspace=t.tbspace order by tbsize desc fetch first 10 rows only list data and index size for one table: select tabschema, tabname, DATA_OBJECT_P_SIZE/1024 as data_inMB, INDEX_OBJECT_P_SIZE/1024 as index_inMB,LONG_OBJECT_P_SIZE/1024 LongObj_inMB, LOB_OBJECT_P_SIZE/1024 as LOB_inMB from table(sysproc.admin_get_tab_info('MAXIMO','WORKORDER')) list error message: db2 ? 
db2pd : monitor and troubleshoot DB2 database command db2diag : db2diag logs analysis tool command db2set : db2 global settings db2 get dbm cfg : db2 database manager configuration db2 get db cfg : db2 database configuration","title":"Diagnosis Commands"},{"location":"pd/db2-performance-diagnosis/#ibm-data-server-manager-ibm-dsm","text":"IBM DSM is useful to do both real-time/ historical data diagnosis, find out the expensive sql query, justify cpu spent on sql execution or other e.g. sorting, parsing, fetching, io and so on. It requires pre-configuration. A high-level set up: Download the latest version of Data Server Manager from IBM developerWorks or IBM Passport Advantage Online , then extract to /opt/ibm/dsm run setup.sh to set up and create admin user run start.sh to start the server, url is http://hostname:11080/console log on the console, select a time period (e.g. peak time) and then generate report.","title":"IBM Data Server Manager (IBM DSM)"},{"location":"pd/dbtest/","text":"DBTest Utility \uf0c1 notes: This utility requires Java version 11 or higher . The DBTest Utility has two modes: Benchmark Mode (the default): is to measure database connection time, query execution time and data fetching time for every 100 records. Query Mode: is to display the query result with database connection time, query execution time and data fetching time. Here is an example demonstrating how to utilize this utility in the Maximo UI pod. 
Run DBTest in MAS Manage maxinst pod \uf0c1 go to maxinst pod in the MAS Manage namespace -> terminal tab, then execute below commands: cd /tmp curl -L -v -o run-dbtest-in-maxinst-pod.sh https://ibm-mas.github.io/mas-performance/pd/download/DBTest/run-dbtest-in-maxinst-pod.sh bash run-dbtest-in-maxinst-pod.sh Run DBTest in Maximo UI Pod \uf0c1 go to maximo ui pod -> terminal tab, then execute below commands: # change to /tmp cd /tmp # download DBTest curl -L -v -o DBTest.class https://ibm-mas.github.io/mas-performance/pd/download/DBTest/DBTest.class # set DBURL. If this utility is in maximo UI pod, set DBURL=\"$MXE_DB_URL\" export DBURL = \"\" or export DBURL = \" $MXE_DB_URL \" or export DBURL = \" ${ MXE_DB_URL } sslTrustStoreLocation= ${ java_truststore } ;sslTrustStorePassword= ${ java_truststore_password } ;\" export DBUSERNAME = '' export DBPASSWORD = '' export SQLQUERY = 'select * from maximo.maxattribute' # execute the utility in benchmark mode java -classpath .: $( dirname \" $( find /opt | grep \"oraclethin.jar\" | head -n 1 ) \" ) /* DBTest Result Samples: Given optimal network latency and a healthy database status, the expected data fetching time is less than 10 milliseconds. Good Result: Bad Result: Execute the utility in query mode \uf0c1 java -classpath .: $( dirname \" $( find /opt | grep \"oraclethin.jar\" | head -n 1 ) \" ) /* DBTest -q Output Sample: (base) [~/javatool]$ java -classpath .:./lib/* DBTest -q Dec. 06, 2023 11:49:47 A.M. DBTest getConnection INFO: Loading Class took: 0.029 seconds Dec. 06, 2023 11:49:53 A.M. DBTest getConnection INFO: DB Connecting took: 6.55 seconds Dec. 06, 2023 11:49:53 A.M. 
DBTest printResult INFO: Query Execution took: 0.099 seconds APP, OPTIONNAME, DESCRIPTION, ESIGENABLED, VISIBLE, ALSOGRANTS, ALSOREVOKES, PREREQUISITE, SIGOPTIONID, LANGCODE, HASLD, ROWSTAMP --------------------------------------------------------------------------------------------------------------------------------- APIKEY, READ, Access to API Keys application, 0, 1, null, ALL, null, 200004204, EN, 0, 290874862 Dec. 06, 2023 11:49:54 A.M. DBTest printResult INFO: Fetching Record took: 0.058 seconds","title":"DBTest Utility"},{"location":"pd/dbtest/#dbtest-utility","text":"notes: This utility requires Java version 11 or higher . The DBTest Utility has two modes: Benchmark Mode (the default): is to measure database connection time, query execution time and data fetching time for every 100 records. Query Mode: is to display the query result with database connection time, query execution time and data fetching time. Here is an example demonstrating how to utilize this utility in the Maximo UI pod.","title":"DBTest Utility"},{"location":"pd/dbtest/#run-dbtest-in-mas-manage-maxinst-pod","text":"go to maxinst pod in the MAS Manage namespace -> terminal tab, then execute below commands: cd /tmp curl -L -v -o run-dbtest-in-maxinst-pod.sh https://ibm-mas.github.io/mas-performance/pd/download/DBTest/run-dbtest-in-maxinst-pod.sh bash run-dbtest-in-maxinst-pod.sh","title":"Run DBTest in MAS Manage maxinst pod"},{"location":"pd/dbtest/#run-dbtest-in-maximo-ui-pod","text":"go to maximo ui pod -> terminal tab, then execute below commands: # change to /tmp cd /tmp # download DBTest curl -L -v -o DBTest.class https://ibm-mas.github.io/mas-performance/pd/download/DBTest/DBTest.class # set DBURL. 
If this utility is in maximo UI pod, set DBURL=\"$MXE_DB_URL\" export DBURL = \"\" or export DBURL = \" $MXE_DB_URL \" or export DBURL = \" ${ MXE_DB_URL } sslTrustStoreLocation= ${ java_truststore } ;sslTrustStorePassword= ${ java_truststore_password } ;\" export DBUSERNAME = '' export DBPASSWORD = '' export SQLQUERY = 'select * from maximo.maxattribute' # execute the utility in benchmark mode java -classpath .: $( dirname \" $( find /opt | grep \"oraclethin.jar\" | head -n 1 ) \" ) /* DBTest Result Samples: Given optimal network latency and a healthy database status, the expected data fetching time is less than 10 milliseconds. Good Result: Bad Result:","title":"Run DBTest in Maximo UI Pod"},{"location":"pd/dbtest/#execute-the-utility-in-query-mode","text":"java -classpath .: $( dirname \" $( find /opt | grep \"oraclethin.jar\" | head -n 1 ) \" ) /* DBTest -q Output Sample: (base) [~/javatool]$ java -classpath .:./lib/* DBTest -q Dec. 06, 2023 11:49:47 A.M. DBTest getConnection INFO: Loading Class took: 0.029 seconds Dec. 06, 2023 11:49:53 A.M. DBTest getConnection INFO: DB Connecting took: 6.55 seconds Dec. 06, 2023 11:49:53 A.M. DBTest printResult INFO: Query Execution took: 0.099 seconds APP, OPTIONNAME, DESCRIPTION, ESIGENABLED, VISIBLE, ALSOGRANTS, ALSOREVOKES, PREREQUISITE, SIGOPTIONID, LANGCODE, HASLD, ROWSTAMP --------------------------------------------------------------------------------------------------------------------------------- APIKEY, READ, Access to API Keys application, 0, 1, null, ALL, null, 200004204, EN, 0, 290874862 Dec. 06, 2023 11:49:54 A.M. DBTest printResult INFO: Fetching Record took: 0.058 seconds","title":"Execute the utility in query mode"},{"location":"pd/jvm-performance-insight/","text":"As a result of architectural modifications, Maximo 8.x (MAS Manage app) now operates on WebSphere Liberty Base with OpenJ9 within the OpenShift Container Platform (OCP). 
It's essential to note that JVM arguments outlined in the 7.x Best Practice documentation may not be relevant or applicable to the Maximo 8.x environment. Here are additional details: WebSphere Liberty \uf0c1 As of WebSphere Liberty, the thread growth algorithm is enhanced to grow thread capacity more aggressively so as to react more rapidly to peak loads. For many environments, this autonomic tuning provided by the Open Liberty thread pool works well with no configuration or tuning by the operator. If necessary, you can adjust coreThreads and maxThreads value. Generic JVM Arguments \uf0c1 -Xgcpolicy:gencon Gencon is the default policy in OpenJ9, this parameter works in both 7.x and 8.x -Xmx or -XX:MaxRAMPercentage (maximum heap size) If not specifying -Xmx value, JVM uses 75% of total container memory when -XX:+UseContainerSupport is set. When -Xmx is set, -XX:MaxRAMPercentage will be ignored. -XX:+UseContainerSupport/-XX:-UseContainerSupport If -XX:+UseContainerSupport is set, it allows to change the InitialRAMPercentage and MaxRAMPercentage values. -Xms and -Xmx can overwrite the limits. -Xmn (Nursery Space) Setting the size of the nursery when using this policy can be very important to optimize performance. 25 - 33% of total heap is recommended. Please note manage pod limited memory is 10G that is not Total heap size. Heap size is based on (-Xmx or -XX:MaxRAMPercentage) setting. 10G also includes memory used by websphere for cache, compilation as well maximo mmi container. -Xgcthreads4 This parameter is used to set the number of threads that the Garbage Collector uses for parallel operations. By default, it is set to n -1 in OpenJ9 where n is the number of reported cpu on the node. You might want to restrict the number of GC threads used by each VM to reduce some overhead. -XcompilationThreads4 This parameter is used to specify the number of compilation threads that are used by the JIT compiler. 
As with the GC threads, you might want to restrict the number of compilation threads used by each VM to reduce overhead. -Xshareclasses This parameter is used to share class data between running VMs, which can reduce the startup time for a VM once the cache has been created. -Xdisableexplicitgc (Recommended) This parameter disables explicit garbage collection, which prevents any System.gc() calls from triggering garbage collections. For optimal performance, disable explicit garbage collection. -Djava.net.preferIPv4Stack=true/-Djava.net.preferIPv6Addresses=false For performance reasons, Maximo recommends setting java.net.preferIPv4Stack to true. Note: this parameter cannot be applied on hosts that communicate only over IPv6. -XX:PermSize and -XX:MaxPermSize The Maximo 7.x best practice recommends 320m. If you see an OOM for PermSize, consider increasing it to 512m or higher. -Xcodecache32m The maximum value you can specify for -Xcodecache is 32 MB. The JIT compiler might allocate more than one code cache. The total is controlled by -Xcodecachetotal, whose default value is 256 MB. -Xverbosegclog Enables the verbose GC log for garbage collection analysis. -Xtune:virtualized (under review) Optimizes OpenJ9 VM function for virtualized environments, such as a cloud, by reducing OpenJ9 VM CPU consumption when idle.","title":"JVM Performance Insight"},{"location":"pd/jvm-performance-insight/#websphere-liberty","text":"In WebSphere Liberty, the thread growth algorithm is enhanced to grow thread capacity more aggressively so as to react more rapidly to peak loads. For many environments, this autonomic tuning provided by the Open Liberty thread pool works well with no configuration or tuning by the operator. 
If necessary, you can adjust the coreThreads and maxThreads values.","title":"WebSphere Liberty"},{"location":"pd/jvm-performance-insight/#generic-jvm-arguments","text":"-Xgcpolicy:gencon Gencon is the default policy in OpenJ9; this parameter works in both 7.x and 8.x. -Xmx or -XX:MaxRAMPercentage (maximum heap size) If no -Xmx value is specified, the JVM uses 75% of total container memory when -XX:+UseContainerSupport is set. When -Xmx is set, -XX:MaxRAMPercentage is ignored. -XX:+UseContainerSupport/-XX:-UseContainerSupport If -XX:+UseContainerSupport is set, the InitialRAMPercentage and MaxRAMPercentage values can be changed. -Xms and -Xmx can override the limits. -Xmn (Nursery Space) Setting the size of the nursery when using the gencon policy can be very important for performance. 25 - 33% of the total heap is recommended. Please note that the Manage pod memory limit of 10G is not the total heap size. The heap size is based on the -Xmx or -XX:MaxRAMPercentage setting. The 10G also includes memory used by WebSphere for cache and compilation, as well as the Maximo mmi container. -Xgcthreads4 This parameter sets the number of threads that the garbage collector uses for parallel operations. By default, it is set to n - 1 in OpenJ9, where n is the number of CPUs reported on the node. You might want to restrict the number of GC threads used by each VM to reduce overhead. -XcompilationThreads4 This parameter specifies the number of compilation threads that are used by the JIT compiler. As with the GC threads, you might want to restrict the number of compilation threads used by each VM to reduce overhead. -Xshareclasses This parameter is used to share class data between running VMs, which can reduce the startup time for a VM once the cache has been created. -Xdisableexplicitgc (Recommended) This parameter disables explicit garbage collection, which prevents any System.gc() calls from triggering garbage collections. 
For optimal performance, disable explicit garbage collection. -Djava.net.preferIPv4Stack=true/-Djava.net.preferIPv6Addresses=false For performance reasons, Maximo recommends to set this property to true. Note: this parameter can not be applied on the hosts that only communicate with ipv6. -XX:PermSize and -XX:MaxPermSize Maximo 7.x BP recommends 320m. If seeing an OOM for PermSize, consider to increase to 512MI or higher. -Xcodecache32m The maximum value you can specify for -Xcodecache is 32 MB. JIT compiler might allocate more than one code cache. It is controlled by -Xcodecachetotal which default value is 256MB. -Xverbosegclog Enable verbose gc log for the garbage collection analysis -Xtune:virtualized (under review) Optimizes OpenJ9 VM function for virtualized environments, such as a cloud, by reducing OpenJ9 VM CPU consumption when idle.","title":"Generic JVM Arguments"},{"location":"pd/pingtest/","text":"Ping test Utility \uf0c1 When trying to diagnose a request timeout problem it is helpful to rule out gateways/load balancers outside the OCP cluster. Sometimes these external gateways can have short timeouts which are resetting a connection before the request is completed. The Ping test Utility is designed to help diagnose this issue. IMPORTANT As of this writing the Ping test utility is not part of the base server bundle code and needs to be loaded via a customization archive. This means that the ManageWorkspace CR needs to be updated and will require a restart of the server bundles (i.e. it will cause a disruption while the server bundle pod is restarted). 
Updating the ManageWorkspace CR \uf0c1 Edit the ManageWorkspace CR in the MAS Manage namespace Single Customization Archive spec: settings: customization: customizationArchive: https://ibm.box.com/shared/static/oj9z062b7x8hcywndnv6n57gya7e1ypz.zip In case you already have a customization archive add to the customizationList spec: settings: customizationList: - customizationArchiveName: archiveAlias1 customizationArchiveUrl: https://ibm.box.com/shared/static/oj9z062b7x8hcywndnv6n57gya7e1ypz.zip Wait for the MAS Manage workspace operator to update the server bundle pods with the Ping servlet class and restart the server bundle pods Using the Ping servlet utility to test request timeouts outside the OCP cluster \uf0c1 Run the following curl command outside the OCP cluster using the external hostname of the MAS Manage server bundle pod. The command below will send a request to the Ping servlet which will wait for 1 second before responding. If a response is returned it means no timeout occurred. $ curl https://tenant1.manage.masperf4.ibmmas.com/maximo/ping?timeout=1 {\"thread wait time\":\"1 seconds\",\"status\":\"ok\"} $ Change the timeout value to match the timeout that you are observing in problematic request. For example, the Ping request below sets a timeout of 300 seconds. If no response is received it means the request timed out and the same request should be attempted from inside the OCP cluster using the private IP address of the server bundle pod (see Using the Ping servlet utility to test request timeouts inside the OCP cluster below). $ curl https://tenant1.manage.masperf4.ibmmas.com/maximo/ping?timeout=300 Using the Ping servlet utility to test request timeouts inside the OCP cluster \uf0c1 Obtain the internal Cluster IP address of the MAS Manage UI service. 
Go to the maxinst pod in the MAS Manage namespace -> Terminal tab, then execute the commands below: $ curl --insecure {\"thread wait time\":\"300 seconds\",\"status\":\"ok\"} $ If you receive a response to the request issued to the internal Cluster IP address of the MAS Manage UI service, but do not receive a response to the request issued from outside the cluster, an external gateway service or load balancer may be closing the connection because of a shorter timeout set on the gateway. Check with a network administrator.","title":"PingTest Utility"},{"location":"pd/pingtest/#ping-test-utility","text":"When trying to diagnose a request timeout problem, it is helpful to rule out gateways/load balancers outside the OCP cluster. Sometimes these external gateways have short timeouts that reset a connection before the request is completed. The Ping test Utility is designed to help diagnose this issue. IMPORTANT As of this writing, the Ping test utility is not part of the base server bundle code and needs to be loaded via a customization archive. This means that the ManageWorkspace CR needs to be updated, which requires a restart of the server bundles (i.e. 
it will cause a disruption while the server bundle pod is restarted).","title":"Ping test Utility"},{"location":"pd/pingtest/#updating-the-manageworkspace-cr","text":"Edit the ManageWorkspace CR in the MAS Manage namespace Single Customization Archive spec: settings: customization: customizationArchive: https://ibm.box.com/shared/static/oj9z062b7x8hcywndnv6n57gya7e1ypz.zip In case you already have a customization archive add to the customizationList spec: settings: customizationList: - customizationArchiveName: archiveAlias1 customizationArchiveUrl: https://ibm.box.com/shared/static/oj9z062b7x8hcywndnv6n57gya7e1ypz.zip Wait for the MAS Manage workspace operator to update the server bundle pods with the Ping servlet class and restart the server bundle pods","title":"Updating the ManageWorkspace CR"},{"location":"pd/pingtest/#using-the-ping-servlet-utility-to-test-request-timeouts-outside-the-ocp-cluster","text":"Run the following curl command outside the OCP cluster using the external hostname of the MAS Manage server bundle pod. The command below will send a request to the Ping servlet which will wait for 1 second before responding. If a response is returned it means no timeout occurred. $ curl https://tenant1.manage.masperf4.ibmmas.com/maximo/ping?timeout=1 {\"thread wait time\":\"1 seconds\",\"status\":\"ok\"} $ Change the timeout value to match the timeout that you are observing in problematic request. For example, the Ping request below sets a timeout of 300 seconds. If no response is received it means the request timed out and the same request should be attempted from inside the OCP cluster using the private IP address of the server bundle pod (see Using the Ping servlet utility to test request timeouts inside the OCP cluster below). 
$ curl https://tenant1.manage.masperf4.ibmmas.com/maximo/ping?timeout=300","title":"Using the Ping servlet utility to test request timeouts outside the OCP cluster"},{"location":"pd/pingtest/#using-the-ping-servlet-utility-to-test-request-timeouts-inside-the-ocp-cluster","text":"Obtain the internal Cluster IP address of the MAS Manage UI service. Go to the maxinst pod in the MAS Manage namespace -> Terminal tab, then execute the commands below: $ curl --insecure {\"thread wait time\":\"300 seconds\",\"status\":\"ok\"} $ If you receive a response to the request issued to the internal Cluster IP address of the MAS Manage UI service, but do not receive a response to the request issued from outside the cluster, an external gateway service or load balancer may be closing the connection because of a shorter timeout set on the gateway. Check with a network administrator.","title":"Using the Ping servlet utility to test request timeouts inside the OCP cluster"},{"location":"pd/ptbp/","text":"Performance Test Best Practice \uf0c1 Listed below are the performance tools used by the lab for performance testing. JMeter \uf0c1 JMeter Best Practice Rational Performance Tester \uf0c1 Performance Test Best Practices for Rational Performance Tester","title":"Performance Test Best Practice"},{"location":"pd/ptbp/#performance-test-best-practice","text":"Listed below are the performance tools used by the lab for performance testing.","title":"Performance Test Best Practice"},{"location":"pd/ptbp/#jmeter","text":"JMeter Best Practice","title":"JMeter"},{"location":"pd/ptbp/#rational-performance-tester","text":"Performance Test Best Practices for Rational Performance Tester","title":"Rational Performance Tester"},{"location":"pd/mcpi/maximo-cpi/","text":"Coming soon ... 
\uf0c1 Critical Note IBM Maximo Cluster Performance Insights is offered \"AS IS\", WITH NO WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING THE WARRANTY OF TITLE, NON-INFRINGEMENT OR NON-INTERFERENCE AND THE IMPLIED WARRANTIES AND CONDITIONS OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. The IBM Product Support you have purchased with your IBM Maximo Application Suite Product does not cover this Application extension. Do not attempt to submit an IBM support ticket. The IBM TechXChange Maximo Community discussions can be leveraged to crowd-source assistance from Maximo Experts. What is IBM Maximo Cluster Performance Insights \uf0c1 IBM Maximo Cluster Performance Insights (Maximo CPI) is a new utility that uses short- and long-term snapshots to address specific best practices for deployment of Maximo Application Suite. It can assist in pinpointing areas that need improvement and provide actionable insights for optimizing the MAS deployment. Maximo Clients can conduct a self-assessment to ensure adherence to best practices, optimize resource use, and diagnose performance issues. This process helps in evaluating current practices, identifying areas for improvement, and enhancing overall efficiency and effectiveness. The utility gathers only metrics data, excluding any sensitive information. It is containerized for ease of use. 
IBM Maximo Cluster Performance Insights Main Features \uf0c1 Identify any missing or incorrect settings that not follows MAS Best Practice Offer an in-depth evaluation of the deployed MAS system's performance Provide recommendations for minimizing the size of the MAS deployment to reduce infrastructure costs Identify certificates that have expired or are about to expire Provide suggestion for rebalancing the node resource utilization to optimize the workload Capacity to send a notification via slack Offer a platform for customized MAS Manage schedule scaling User guide \uf0c1 Run on Docker Download the docker container: docker pull quay.io/brianzhu_ibm/mcpi:latest Run the docker container: docker run -dit -p 8888:8888 --name mcpi quay.io/brianzhu_ibm/mcpi:latest Data Collection enter into the docker container: docker exec -it --user root mcpi bash login on OpenShift Cluster: oc login https://: -u -p or oc login https://: --token= execute data collection command: collect-metric.sh note: when the command finishes executing, it returns the path to the MHC JSON file. Below is a sample of the returning. In this case, the path to the MHC JSON file is /tmp/mhc-2024-08-01-19-36.json Data Review launch the mcpi viewer url ( http://localhost:8888 ) in the browser review the data: Under Load a MAS Harmony Checker JSON file from the server's path , enter the path to the MHC JSON file e.g. /tmp/mhc-2024-08-01-19-36.json Below is the sample snapshot Run on OpenShift Cluster Download maximo-cpi-deployment.yaml Login on OpenShift Cluster Console Click + to import YAML, then Drag and drop maximo-cpi-deployment.yaml Data Collection login into the cluster console go to maximo-cpi project click on mcpi-deployment-xxx pod go to Terminal tab login on OpenShift Cluster: oc login https://: -u -p or oc login https://: --token= execute data collection command: collect-metric.sh note: when the command finishes executing, it returns the path to the MHC JSON file. 
See the sample in the Run on Docker section Data Review go to maximo-cpi project -> Networking -> Routes click on mcpi-viewer-route url review the data: Under Load a MAS Harmony Checker JSON file from the server's path , enter the path to the MHC JSON file e.g. /tmp/mhc-2024-08-01-19-36.json See the sample in the Run on Docker section Most Common User Scenarios \uf0c1 1) Best practice to minimizing footprint through Maximo CPI \uf0c1 Step 1: Eliminate the surplus nodes if exist Step 2: Balance CPU and Memory Request%; Align CPU and Memory Requests to match hardware specifications, such as a ratio of 1:4 or 1:8. Step 3: Continuously reduce the resource requests for pods/containers to enhance utilization. Ideally, aim for resource utilization that exceeds the resource requests and approaches 60\u201370% of the cluster capacity. Repeat Step 1 \u2013 3 if needed 2) Best practice for performance troubleshooting and configuration checking \uf0c1 Step 1: Heatmap viewer provides the problematic pods and nodes Step 2: Maximo CPI viewer provides the metric details Step 3: Identify the severity and functional impacts Step 4: Vertically and horizontally adjust the pod/service/node and apply the recommended OpenShift Configuration if needed Repeat Step 1 \u2013 4 if needed 3) Rebalance Node Resource \uf0c1 Issue Description: Observe the unbalance resource usage among the nodes. E.g. some nodes use 80% cpu, but the other uses 20% cpu. Reason: Imbalanced placement OpenShift schedules the service / pod based on the resource cost increment , not the real resource usage. Solution: migrate pods from busy nodes to non-busy nodes with min movements. This is a typical bin-packing (NP-Hard) problem. Maximo CPI uses the greedy algorithm since the time and minimum steps are not critical. Actions: \u26a0\ufe0f Moving pods can be disruptive at times, as it may cause an outage while the stateful service pod is being relocated. execute node-balance.sh . 
The output provides a movepod command if any issue is detected. Execute movepod.sh to move the pods. 4) Scheduled Scaling \uf0c1 Modify mas-manage-scheduled-scaling-sample.sh to adjust the parameters, e.g. time and pod replica number. Set up the Slack URL and channel name for notification if needed. 5) Expired and Expiring Certificate \uf0c1 Modify cert-expiration-slack-alert-sample.sh to adjust the parameters, e.g. time and expiration-in-days. Set up the Slack URL and channel name for notification if needed. Upcoming \uf0c1 Release this utility to the public via IBM Accelerator Extend metric collection to cover the database performance metrics Add and enhance the policies for alerting and best practices Enhance MAS Optimization, Sizing, Re-balance, Scaling, Performance Diagnosis via AI technology","title":"IBM Maximo Cluster Performance Insights"},{"location":"pd/mcpi/maximo-cpi/#coming-soon","text":"Critical Note IBM Maximo Cluster Performance Insights is offered \"AS IS\", WITH NO WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING THE WARRANTY OF TITLE, NON-INFRINGEMENT OR NON-INTERFERENCE AND THE IMPLIED WARRANTIES AND CONDITIONS OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. The IBM Product Support you have purchased with your IBM Maximo Application Suite Product does not cover this Application extension. Do not attempt to submit an IBM support ticket. The IBM TechXChange Maximo Community discussions can be leveraged to crowd-source assistance from Maximo Experts.","title":"Coming soon ..."},{"location":"pd/mcpi/maximo-cpi/#what-is-ibm-maximo-cluster-performance-insights","text":"IBM Maximo Cluster Performance Insights (Maximo CPI) is a new utility that uses short- and long-term snapshots to address specific best practices for deployment of Maximo Application Suite. It can assist in pinpointing areas that need improvement and provide actionable insights for optimizing the MAS deployment. 
Maximo Clients can conduct a self-assessment to ensure adherence to best practices, optimize resource use, and diagnose performance issues. This process helps in evaluating current practices, identifying areas for improvement, and enhancing overall efficiency and effectiveness. The utility gathers only metrics data, excluding any sensitive information. It is containerized for ease of use.","title":"What is IBM Maximo Cluster Performance Insights"},{"location":"pd/mcpi/maximo-cpi/#ibm-maximo-cluster-performance-insights-main-features","text":"Identify any missing or incorrect settings that do not follow MAS best practices Offer an in-depth evaluation of the deployed MAS system's performance Provide recommendations for minimizing the size of the MAS deployment to reduce infrastructure costs Identify certificates that have expired or are about to expire Provide suggestions for rebalancing the node resource utilization to optimize the workload Capacity to send a notification via Slack Offer a platform for customized MAS Manage scheduled scaling","title":"IBM Maximo Cluster Performance Insights Main Features"},{"location":"pd/mcpi/maximo-cpi/#user-guide","text":"Run on Docker Download the Docker image: docker pull quay.io/brianzhu_ibm/mcpi:latest Run the Docker container: docker run -dit -p 8888:8888 --name mcpi quay.io/brianzhu_ibm/mcpi:latest Data Collection Enter the Docker container: docker exec -it --user root mcpi bash Log in to the OpenShift cluster: oc login https://: -u -p or oc login https://: --token= Execute the data collection command: collect-metric.sh Note: when the command finishes executing, it returns the path to the MHC JSON file. Below is a sample of the output. In this case, the path to the MHC JSON file is /tmp/mhc-2024-08-01-19-36.json Data Review Launch the mcpi viewer URL ( http://localhost:8888 ) in the browser Review the data: Under Load a MAS Harmony Checker JSON file from the server's path , enter the path to the MHC JSON file e.g. 
/tmp/mhc-2024-08-01-19-36.json Below is the sample snapshot Run on OpenShift Cluster Download maximo-cpi-deployment.yaml Login on OpenShift Cluster Console Click + to import YAML, then Drag and drop maximo-cpi-deployment.yaml Data Collection login into the cluster console go to maximo-cpi project click on mcpi-deployment-xxx pod go to Terminal tab login on OpenShift Cluster: oc login https://: -u -p or oc login https://: --token= execute data collection command: collect-metric.sh note: when the command finishes executing, it returns the path to the MHC JSON file. See the sample in the Run on Docker section Data Review go to maximo-cpi project -> Networking -> Routes click on mcpi-viewer-route url review the data: Under Load a MAS Harmony Checker JSON file from the server's path , enter the path to the MHC JSON file e.g. /tmp/mhc-2024-08-01-19-36.json See the sample in the Run on Docker section","title":"User guide"},{"location":"pd/mcpi/maximo-cpi/#most-common-user-scenarios","text":"","title":"Most Common User Scenarios"},{"location":"pd/mcpi/maximo-cpi/#1-best-practice-to-minimizing-footprint-through-maximo-cpi","text":"Step 1: Eliminate the surplus nodes if exist Step 2: Balance CPU and Memory Request%; Align CPU and Memory Requests to match hardware specifications, such as a ratio of 1:4 or 1:8. Step 3: Continuously reduce the resource requests for pods/containers to enhance utilization. Ideally, aim for resource utilization that exceeds the resource requests and approaches 60\u201370% of the cluster capacity. 
Repeat Step 1 \u2013 3 if needed","title":"1) Best practice to minimizing footprint through Maximo CPI"},{"location":"pd/mcpi/maximo-cpi/#2-best-practice-for-performance-troubleshooting-and-configuration-checking","text":"Step 1: Heatmap viewer provides the problematic pods and nodes Step 2: Maximo CPI viewer provides the metric details Step 3: Identify the severity and functional impacts Step 4: Vertically and horizontally adjust the pod/service/node and apply the recommended OpenShift Configuration if needed Repeat Step 1 \u2013 4 if needed","title":"2) Best practice for performance troubleshooting and configuration checking"},{"location":"pd/mcpi/maximo-cpi/#3-rebalance-node-resource","text":"Issue Description: Resource usage is unbalanced across the nodes, e.g. some nodes use 80% CPU while others use 20% CPU. Reason: Imbalanced placement. OpenShift schedules the service/pod based on the resource cost increment, not the real resource usage. Solution: Migrate pods from busy nodes to non-busy nodes with minimal movements. This is a typical bin-packing (NP-Hard) problem. Maximo CPI uses a greedy algorithm, since neither the run time nor a minimum number of steps is critical. Actions: \u26a0\ufe0f Moving pods can be disruptive at times, as it may cause an outage while the stateful service pod is being relocated. Execute node-balance.sh . The output provides a movepod command if any issue is detected. Execute movepod.sh to move the pods.","title":"3) Rebalance Node Resource"},{"location":"pd/mcpi/maximo-cpi/#4-scheduled-scaling","text":"Modify mas-manage-scheduled-scaling-sample.sh to adjust the parameters, e.g. time and pod replica number. Set up the Slack URL and channel name for notification if needed.","title":"4) Scheduled Scaling"},{"location":"pd/mcpi/maximo-cpi/#5-expired-and-expiring-certificate","text":"Modify cert-expiration-slack-alert-sample.sh to adjust the parameters, e.g. 
time and expiration-in-days set up the slack url and channel name for notification if needed","title":"5) Expired and Expiring Certificate"},{"location":"pd/mcpi/maximo-cpi/#upcoming","text":"Release this utility to the public via IBM Accelerator Extend metric collection to cover the database performance metrics Add and enhance the policies for alerting and best practices Enhance MAS Optimization, Sizing, Re-balance, Scaling, Performance Diagnosis via AI technology","title":"Upcoming"}]} \ No newline at end of file diff --git a/search/worker.js b/search/worker.js new file mode 100644 index 0000000..8628dbc --- /dev/null +++ b/search/worker.js @@ -0,0 +1,133 @@ +var base_path = 'function' === typeof importScripts ? '.' : '/search/'; +var allowSearch = false; +var index; +var documents = {}; +var lang = ['en']; +var data; + +function getScript(script, callback) { + console.log('Loading script: ' + script); + $.getScript(base_path + script).done(function () { + callback(); + }).fail(function (jqxhr, settings, exception) { + console.log('Error: ' + exception); + }); +} + +function getScriptsInOrder(scripts, callback) { + if (scripts.length === 0) { + callback(); + return; + } + getScript(scripts[0], function() { + getScriptsInOrder(scripts.slice(1), callback); + }); +} + +function loadScripts(urls, callback) { + if( 'function' === typeof importScripts ) { + importScripts.apply(null, urls); + callback(); + } else { + getScriptsInOrder(urls, callback); + } +} + +function onJSONLoaded () { + data = JSON.parse(this.responseText); + var scriptsToLoad = ['lunr.js']; + if (data.config && data.config.lang && data.config.lang.length) { + lang = data.config.lang; + } + if (lang.length > 1 || lang[0] !== "en") { + scriptsToLoad.push('lunr.stemmer.support.js'); + if (lang.length > 1) { + scriptsToLoad.push('lunr.multi.js'); + } + if (lang.includes("ja") || lang.includes("jp")) { + scriptsToLoad.push('tinyseg.js'); + } + for (var i=0; i < lang.length; i++) { + if (lang[i] != 'en') 
{ + scriptsToLoad.push(['lunr', lang[i], 'js'].join('.')); + } + } + } + loadScripts(scriptsToLoad, onScriptsLoaded); +} + +function onScriptsLoaded () { + console.log('All search scripts loaded, building Lunr index...'); + if (data.config && data.config.separator && data.config.separator.length) { + lunr.tokenizer.separator = new RegExp(data.config.separator); + } + + if (data.index) { + index = lunr.Index.load(data.index); + data.docs.forEach(function (doc) { + documents[doc.location] = doc; + }); + console.log('Lunr pre-built index loaded, search ready'); + } else { + index = lunr(function () { + if (lang.length === 1 && lang[0] !== "en" && lunr[lang[0]]) { + this.use(lunr[lang[0]]); + } else if (lang.length > 1) { + this.use(lunr.multiLanguage.apply(null, lang)); // spread operator not supported in all browsers: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Operators/Spread_operator#Browser_compatibility + } + this.field('title'); + this.field('text'); + this.ref('location'); + + for (var i=0; i < data.docs.length; i++) { + var doc = data.docs[i]; + this.add(doc); + documents[doc.location] = doc; + } + }); + console.log('Lunr index built, search ready'); + } + allowSearch = true; + postMessage({config: data.config}); + postMessage({allowSearch: allowSearch}); +} + +function init () { + var oReq = new XMLHttpRequest(); + oReq.addEventListener("load", onJSONLoaded); + var index_path = base_path + '/search_index.json'; + if( 'function' === typeof importScripts ){ + index_path = 'search_index.json'; + } + oReq.open("GET", index_path); + oReq.send(); +} + +function search (query) { + if (!allowSearch) { + console.error('Assets for search still loading'); + return; + } + + var resultDocuments = []; + var results = index.search(query); + for (var i=0; i < results.length; i++){ + var result = results[i]; + doc = documents[result.ref]; + doc.summary = doc.text.substring(0, 200); + resultDocuments.push(doc); + } + return resultDocuments; +} + +if( 
'function' === typeof importScripts ) { + onmessage = function (e) { + if (e.data.init) { + init(); + } else if (e.data.query) { + postMessage({ results: search(e.data.query) }); + } else { + console.error("Worker - Unrecognized message: " + e); + } + }; +} diff --git a/sitemap.xml b/sitemap.xml new file mode 100644 index 0000000..dfb04b8 --- /dev/null +++ b/sitemap.xml @@ -0,0 +1,133 @@ + + + + https://ibm-mas.github.io/mas-performance/ + 2024-08-16 + daily + + + https://ibm-mas.github.io/mas-performance/mas/aws/bestpractice/ + 2024-08-16 + daily + + + https://ibm-mas.github.io/mas-performance/mas/azure/bestpractice/ + 2024-08-16 + daily + + + https://ibm-mas.github.io/mas-performance/mas/core/bestpractice/ + 2024-08-16 + daily + + + https://ibm-mas.github.io/mas-performance/mas/ibmcloud/bestpractice/ + 2024-08-16 + daily + + + https://ibm-mas.github.io/mas-performance/mas/iot/bestpractice/ + 2024-08-16 + daily + + + https://ibm-mas.github.io/mas-performance/mas/manage/bestpractice/ + 2024-08-16 + daily + + + https://ibm-mas.github.io/mas-performance/mas/manage/woi-infer/ + 2024-08-16 + daily + + + https://ibm-mas.github.io/mas-performance/mas/manage/woi-train/ + 2024-08-16 + daily + + + https://ibm-mas.github.io/mas-performance/mas/manage-industry-solutions/ong-hse/bestpractice/ + 2024-08-16 + daily + + + https://ibm-mas.github.io/mas-performance/mas/manage-industry-solutions/transportation/bestpractice/ + 2024-08-16 + daily + + + https://ibm-mas.github.io/mas-performance/mas/mif-jms/bestpractice/ + 2024-08-16 + daily + + + https://ibm-mas.github.io/mas-performance/mas/mif-kafka/bestpractice/ + 2024-08-16 + daily + + + https://ibm-mas.github.io/mas-performance/mas/mobile/bestpractice/ + 2024-08-16 + daily + + + https://ibm-mas.github.io/mas-performance/mas/mongodb/bestpractice/ + 2024-08-16 + daily + + + https://ibm-mas.github.io/mas-performance/mas/monitoring/guidance/ + 2024-08-16 + daily + + + https://ibm-mas.github.io/mas-performance/mas/ocp/bestpractice/ + 
2024-08-16 + daily + + + https://ibm-mas.github.io/mas-performance/mas/sizing/guidance/ + 2024-08-16 + daily + + + https://ibm-mas.github.io/mas-performance/maximo-7/bestpractice/ + 2024-08-16 + daily + + + https://ibm-mas.github.io/mas-performance/pd/checklist/ + 2024-08-16 + daily + + + https://ibm-mas.github.io/mas-performance/pd/db2-performance-diagnosis/ + 2024-08-16 + daily + + + https://ibm-mas.github.io/mas-performance/pd/dbtest/ + 2024-08-16 + daily + + + https://ibm-mas.github.io/mas-performance/pd/jvm-performance-insight/ + 2024-08-16 + daily + + + https://ibm-mas.github.io/mas-performance/pd/pingtest/ + 2024-08-16 + daily + + + https://ibm-mas.github.io/mas-performance/pd/ptbp/ + 2024-08-16 + daily + + + https://ibm-mas.github.io/mas-performance/pd/mcpi/maximo-cpi/ + 2024-08-16 + daily + + \ No newline at end of file diff --git a/sitemap.xml.gz b/sitemap.xml.gz new file mode 100644 index 0000000..14784fd Binary files /dev/null and b/sitemap.xml.gz differ