Service Availability Goals and Disaster Recovery PDF

Summary

This document outlines service availability goals and disaster recovery strategies for a specific service. It details objectives, metrics, and procedures to ensure high availability and efficient processing. The document also discusses handling Delta FactoryVolume creation and aborting jobs, highlighting special handling for certain operations.

Full Transcript

Created by Charles Paclat, last modified on Mar 23, 2023

Service Availability Goals
This defines our uptime goal for the SFCP in any given region.

The SFCP is a tool that its customers use to optimize their Lifecycle Management (LCM) operations by preprocessing their application binaries into block volumes that represent golden masters of their products. It is used as part of the Release Management process for these services. At this time the only customer for this service is the Fusion Application as a Service (FAaaS).

Our goal is to be available 99.7% of the time in every region in which we are deployed.

The team implements DevOps with on-call to ensure that any failures are addressed immediately.

Service Level Objectives
The Fusion Applications and FAaaS have a very tight release train and demand a high level of throughput for processing potential releases so that they can be tested and verified in the preproduction systems. The main metrics on which the SFCP will be judged are:

- The rate of Master FactoryVolume creations
  - This is actually at a higher volume in the non-production tenancies that are used for validation.
  - This speaks to the rate at which new variants are processed and requires that the SFCP be horizontally scalable.
- The time to create a Master FactoryVolume
  - This impacts the time from release delivery until testing can begin.
  - For Master Block Volumes this is about 2 hours.
- The rate of Delta FactoryVolume creations
- The time to create a Delta FactoryVolume

Master FactoryVolume Processing

Delta FactoryVolume Processing

Recovery Point Objectives
This documents the agreements that we have with the Kiev service with regard to the reliability of our stored buckets and what happens in the event of failure.

Kiev guarantees twice-weekly full backups and twice-daily partial backups.
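The 99.7% availability goal and the twice-daily partial backups stated above can be turned into rough budgets with simple arithmetic. The following is a minimal sketch, not an agreed SLO calculation: the function names are hypothetical, the 30-day month and 365-day year are assumptions, and the restore-age figure assumes the two daily backups are evenly spaced.

```python
HOURS_PER_DAY = 24


def downtime_budget_hours(availability_pct: float, period_days: int) -> float:
    """Hours of permitted unavailability in a period for a given availability goal."""
    return (1 - availability_pct / 100) * period_days * HOURS_PER_DAY


def worst_case_restore_age_hours(backups_per_day: int) -> float:
    """Longest gap between consecutive backups, assuming they are evenly spaced."""
    return HOURS_PER_DAY / backups_per_day


# Assumed periods: a 30-day month and a 365-day year.
monthly_budget = downtime_budget_hours(99.7, 30)
yearly_budget = downtime_budget_hours(99.7, 365)
restore_age = worst_case_restore_age_hours(2)

print(f"Downtime budget per 30-day month: {monthly_budget:.2f} h")
print(f"Downtime budget per 365-day year: {yearly_budget:.2f} h")
print(f"Worst-case restore point age:     {restore_age:.1f} h")
```

Under these assumptions the goal allows roughly 2.16 hours of downtime per month per region, and a restore point can be up to 12 hours old.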
Disaster Recovery

The SFCP is a regional service that performs operations on behalf of the FACP. Each region is independent of the others and there is no benefit to being able to shift the processing to another region.

In the event of a regional failure the SFCP will be restored when the region is resumed. The worst case scenario is that the most recent restore point is 1/2 day out of date. Any operations that failed or were lost will need to be repeated by the FACP. None of these affect live customers.

In the event that a region is unavailable, the testing for FACP would move to another region and will be able to proceed accordingly.

Created by Russell Genna, last modified on Sep 06, 2024

Overview
The FACP team uses SFaaS primarily to create two types of FactoryVolume (FV) resources: Master Volumes (MV) and Delta Volumes (DV).

In the Gen2 Storage Factory there was special handling implemented for DVs which is difficult to support in SFaaS:

- If a DV (or MV) job was aborted after it had already completed, it resulted in a no-op where the MV/DV just created was not affected.
- When a DV was revised, the previous revision of the DV would be automatically deleted after some time; however, the backups associated with older DV revisions were not deleted. They would remain in existence and were tracked in the MasterVolumeInfo meta-data stored in object storage. Only when the current/latest revision of the DV was deleted would the old backups also be cleaned up by the deletion process.

In SFaaS this special handling presents some challenges due to how FactoryVolumes are being managed:

1. When FACP calls abort via the integration client on an FV, it translates to a DELETE call on that FV.
   If the FV being aborted has reached the Active lifecycle state, calling DELETE in this situation will result in the current revision of the DV FV being deleted. This includes the MasterVolumeInfo meta-data in OSS, the BVs & BV backups, etc., which is undesirable.
2. During the DV FV create workflow, once a certain point is reached in the creation process the old revision of the DV FV has its expiration date set, which will result in the older FV revision being automatically deleted after a predetermined amount of time (e.g. 24 hours). However, if DELETE is called on the DV FV actively being created, via an abort call for instance, even if we account for the situation in #1 to prevent deletion of the meta-data, once the expiration date is set it's possible we would delete the DV FV that FACP is still using.
3. In SFaaS we do not maintain a history of the backups of previous DV FV revisions as described above. Once an older revision of the DV is deleted, the associated block volume backups are also deleted, which can be problematic if FACP is still attempting to use this backup for pod provisioning.

Current DV FV Special Handling
The following is a list of the special handling we've already implemented in SFaaS in order to account for the behavior in the Gen2 SF:

1. In the FV create workflow during the VALIDATE_PAYLOAD step, after downloading and validating the existence of the StorageTemplate, if the meta-data indicates that we are creating a delta volume, the factoryVolumeType is set to the value of "delta" for use in subsequent processing.
2. In the FV delete workflow WAIT_CREATE_CANCELED step, a list call is used to identify the FV create workflow's corresponding WorkRequest (WR). If that WR is not in a "Succeeded" status then the WF DeleteFactoryVolume.State is updated with the "creationAborted" flag set to "true" to signal subsequent handling later in the delete WF.
3. In the FV delete workflow SET_INITIAL_STATE step, if the "creationAborted" flag is set to "true" and the FV has a factoryVolumeType value of "delta", then in order to avoid deleting the OSS meta-data and BVs/BV backups associated with the current DV FV revision, the OSS object names and BV/BV backup ocids will not be populated into the DeleteFactoryVolume.State data.
4. In the SFaaS integration client, when the abortJob method is called, if the factoryVolumeType value of the FV is "delta" and the FV is Active, the DELETE is not called on the FV and the abort call becomes a no-op.

Remaining DV FV Handling Concerns
The current DV FV special handling does help prevent the unwanted deletion of the FV meta-data and BVs; however, it still falls short in some cases:

1. Due to the concurrent processing nature of this system, even with this special handling there are still gaps where the DV FV can end up in the bad state we are trying to avoid. For instance, in case #4, when the check is performed on the DV FV to see if it's "Active", it might not be active when checked but could reach the Active state in between that check and the resulting delete call.
2. Since SFaaS does not maintain a history of the BV backups from old revisions, if FACP calls abort on a DV FV creation and the FV had reached the Active lifecycle state then they will not consider that creation successful. If that's the case then they will also not be tracking the BVs which were produced by the FV which was just created. This means they are still using the BVs from the previous revision of the DV FV. However, since the DV FV creation was successful on the SFaaS side, it will have set the expiration date on the previous FV revision which FACP is still referencing, meaning it will be deleted out from under them after some time.
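The abortJob guard from item 4 of the current handling, and the check-then-act gap described in the first concern above, can be illustrated with a toy in-memory model. This is only a sketch under stated assumptions: FactoryVolume, abort_job, the before_delete hook and the "Creating" state name are hypothetical and are not the SFaaS integration client API. The hook stands in for the create workflow making progress between the state check and the DELETE call.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class FactoryVolume:
    factory_volume_type: str      # "delta" for a DV; other values are assumed
    lifecycle_state: str          # "Active" per the text; "Creating" is an assumed name
    deleted: bool = False


def abort_job(fv: FactoryVolume,
              before_delete: Optional[Callable[[FactoryVolume], None]] = None) -> str:
    """Abort with the delta guard: an Active delta volume is left untouched."""
    if fv.factory_volume_type == "delta" and fv.lifecycle_state == "Active":
        return "no-op"
    # Window between the check and the DELETE call. The hook simulates the
    # concurrent create workflow reaching Active inside this window.
    if before_delete is not None:
        before_delete(fv)
    fv.deleted = True             # stands in for the DELETE call on the FV
    return "deleted"


def finish_creation(fv: FactoryVolume) -> None:
    fv.lifecycle_state = "Active"


active = FactoryVolume("delta", "Active")
creating = FactoryVolume("delta", "Creating")
raced = FactoryVolume("delta", "Creating")

result_active = abort_job(active)                                  # guard applies
result_creating = abort_job(creating)                              # plain abort
result_raced = abort_job(raced, before_delete=finish_creation)     # guard bypassed

print(result_active, result_creating, result_raced, raced.lifecycle_state)
```

In the third call the volume is Active by the time it is deleted, which is exactly the state the guard was meant to protect. Closing that window would need the check and the delete to be a single conditional operation on the service side, which is beyond this sketch.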
Additional Short Term DV FV Special Handling Proposals
There is no short term solution for the issues presented by the gaps in the current handling which are caused by concurrency; this most likely would have to be addressed by a long term solution.

The following is the short term proposal to avoid deleting BVs/BV backups that FACP is still using:

a. In the FV create workflow, when setting the expiration date of a previous revision, the value will be set to a date further in the future (e.g. 2 months), which will give plenty of time for FACP to move on to a newer revision before its deletion.
   - This could be accomplished with a change to the limits flock to set the "sf-revision-expiration-age-seconds" property to the desired value.
b. In order to avoid unnecessary resource consumption which would be the result of (a), the integration client would detect if the old DV FV's expiration date could be shortened.
   - The StorageFactoryClient.getMasterVolumeInfo() method is only called by FACP after successfully creating a set of DV FVs (i.e. FA, IDM & OHS) when they are ready to update their Kiev database. When this method is called, if the FV's factoryVolumeType is set to the value of "delta", the client would identify the previous DV FV revision and call the FV UPDATE endpoint to decrease the expiration date to a normal value (e.g. 24 hours).
   - To support this the FV UPDATE endpoint must be modified to allow setting the expiration date value. This would require spec changes and changes in the FV UPDATE implementation.

Created by Balaji Mani, last modified on Oct 25, 2024

The CreateMultiTemplateReleaseSnapshotWorkflow gets the details of the ReleaseSnapshotDetails and, for each of the supportedFeaturesPerTemplate in the ReleaseSnapshotDetails, a new sub workflow is created to start CreateMasterVolumeForReleaseSnapshotWorkflow.
The CreateMultiTemplateReleaseSnapshotWorkflow waits for each of the sub workflows to reach a terminal state. When all the workflows are complete, if any of the sub workflows have been TERMINATED, ABORTED or FAILED then we try to fail the entire workflow. This then leads to the creation of a new work request which goes through the entire process, which could be time consuming when there was only one failure in the entire release set.

Hence, to overcome the pain of re-creating the entire workflow, revival can be used to recreate only those master volumes that failed. The below changes are to be made to CreateMultiTemplateReleaseSnapshotWorkflow and CreateMasterVolumeForReleaseSnapshotWorkflow, making them revivable so that when the workflows fail they can resume from where they failed and request creation of only those master volumes that have failed.

CreateMultiTemplateReleaseSnapshotWorkflow

CREATE_MASTER_VOLUME_SUB_FLOWS
- Gets the details of the release and triggers the CreateMasterVolumesForReleaseSnapshotWorkflow for each of the release templates.
- Updates the release snapshot details and workflow state with the workflow id and template name details.

WAIT_FOR_COMPLETION
- Gets the workflow id for each of the templates and polls the sub workflows to see if all the workflows have completed.
- If at least one of the workflows is in a non-terminal state, the step continues to wait for it to complete.
- If all the workflows are in a terminal state and
  - at least one of them is TERMINATED, ABORTED or FAILED, then the workflow moves to CLEANUP_FAILED_WORKFLOW.
  - all the sub workflows are successful, then the workflow moves to UPDATE_KIEV.

CLEANUP_FAILED_WORKFLOW
- Updates the details of SFaaS ocids relevant to facp volume id in release snapshot.
- Updates the state of release snapshot, release and work request as FAILED.

WAIT_FOR_COMPLETION
- If all of the workflows are in a terminal state and at least one of them is TERMINATED, ABORTED or FAILED, then store the details of the master volumes and throw a NonRetryableException instead of calling CLEANUP_FAILED_WORKFLOW.

Changes to the workflow
- Remove the catchStep from the WorkflowDefinitionParams annotation.
- Add the workflow to WorkflowUtil.geWorkflowWithStateInterceptor method.

CreateMasterVolumesForReleaseSnapshotWorkflow

CREATE_MASTER_VOLUME
- Gets the relevant template details for creating the master volumes.
- Triggers the storage factory workflows for each volume type and stores the details of the fa volume id and sf volume id in the state.
- Stores the details of the master volume with state active in kiev.

WAIT_FOR_COMPLETION
- Gets the session ids from the state and gets the status of each of the storage factory workflows.
- If all of the workflows have moved to FINISHED then we move to the UPDATE_KIEV step.
- If one of the workflows is in TERMINATED, ABORTED or FAILED state then we throw a NonRetryableException, which moves the workflow to CLEANUP_FAILED_WORKFLOW.
- If one of the workflows is in QUEUED, INITIALIZED, RESUMING or RUNNING state, the workflow waits for it to complete for a specified time limit before throwing a NonRetryableException, in which case the workflow moves to CLEANUP_FAILED_WORKFLOW.
- If the workflow is not in any of the above mentioned states then a NonRetryableException is thrown, moving the workflow to CLEANUP_FAILED_WORKFLOW.

CLEANUP_FAILED_WORKFLOW
- Marks the work request as failed with an internal service error.
- Gets the session ids for each of the jobs and checks if the corresponding Storage Factory workflows are in FINISHED state.
- If any of them are not in FINISHED state then an abort operation is called.

CREATE_MASTER_VOLUME
- After submitting the requests to storage factory, store the details of the volumes that are created in the state, which can be used to submit the request in the REVIVE_MASTER_VOLUME step during revival.

REVIVE_MASTER_VOLUME
- Check the status of the workflows calling to storage factory.
- If any of them have failed then submit the request to storage factory using the details stored in the state.
- If any of the volumes is in a non-terminal state then wait for the storage factory job to be completed.
- Move the workflow to the WAIT_FOR_COMPLETION step.

WAIT_FOR_COMPLETION
- Check to see if the workflow is in revival and move the workflow to the REVIVE_MASTER_VOLUME step.
- Check if all of the requests are in a terminal state. If one of them is still in progress then wait for the workflow to complete until timeout.
- If all the requests are in a terminal state and at least one of them is not in FINISHED state then call abort on those that have failed and throw a NonRetryableException.
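The WAIT_FOR_COMPLETION decision described above can be sketched as a pure function over the storage factory job statuses. This is an illustrative model only: evaluate_wait_step, the action labels and the job keys are hypothetical names, and the behavior on timeout (abort everything not FINISHED and throw) is an assumption, since the text only says the step waits "until timeout".

```python
from typing import Dict, List, Tuple

# Status names are taken from the workflow description above.
IN_PROGRESS = {"QUEUED", "INITIALIZED", "RESUMING", "RUNNING"}


def evaluate_wait_step(statuses: Dict[str, str], timed_out: bool) -> Tuple[str, List[str]]:
    """Return (action, jobs_to_abort) for one polling pass.

    Actions (hypothetical labels):
      UPDATE_KIEV          every job FINISHED
      WAIT                 some job still in progress and the time limit not reached
      THROW_NON_RETRYABLE  otherwise; abort the jobs that did not finish
    """
    not_finished = sorted(job for job, status in statuses.items() if status != "FINISHED")
    if not not_finished:
        return ("UPDATE_KIEV", [])
    still_running = any(status in IN_PROGRESS for status in statuses.values())
    if still_running and not timed_out:
        return ("WAIT", [])
    # Covers failed jobs, unknown statuses, and (assumed) the timeout case.
    return ("THROW_NON_RETRYABLE", not_finished)


all_done = evaluate_wait_step({"fa": "FINISHED", "ohs": "FINISHED"}, timed_out=False)
waiting = evaluate_wait_step({"fa": "RUNNING", "ohs": "FAILED"}, timed_out=False)
failed = evaluate_wait_step(
    {"fa": "FINISHED", "ohs": "FAILED", "idm": "ABORTED"}, timed_out=False
)

print(all_done, waiting, failed)
```

Keeping the decision separate from the polling loop makes each branch easy to test in isolation; on revival, the list of unfinished jobs is what a REVIVE_MASTER_VOLUME step would resubmit.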
Changes to the workflow
- Remove the catchStep from the WorkflowDefinitionParams annotation.
- Add the workflow to WorkflowUtil.geWorkflowWithStateInterceptor method.

Questions:
- After the FACP workflow is revived, how do we check whether the volume failed or was revived in SF?
- Currently we are seeing revive even in the succeeded workflows in the UI. What is the use of that?
- Should abort be called if there is a failure in one of the block volume creations in CreateMasterVolumeForReleaseSnapshotWorkflow?

Created by Prerna Malik, last modified on Oct 21, 2024

- Overview
- Storage Factory changes to support Multi template release
  - Template Changes
  - Processing Definition changes
- Backward compatibility
- Test Plan
  - Storage Factory Testing
  - Integration Test Plan with FACP
- References

Overview
As of today a Release Set / Release Snapshot is made up of volumes that are created from only one type of FA template, with STARTER and CDRM always being separate releases. The existing categories supported by SF are listed below.

- starter
- starter_adb
- cdrm
- cdrm_adb

A release set can support only one of the above categories. FACP will be supporting multi-template releases starting 25.01 (Multi-Template Releases). Along similar lines, SF needs to provide support to process the multi template requests. Starting 25.01 SF will be supporting additional categories.

Storage Factory changes to support Multi template release
The table below shows the existing and new categories which will be added by Storage Factory. The highlighted categories are the new additions. Since the feature will be rolled out starting 25.01 from FACP, storage factory changes will be available with the release starting 25.01.
Template Type                       | SF Category      | Example Template                  | Expected Volumes
Legacy Consolidated (IDM with EXA)  | starter          | STARTER_TEMPLATE.tar.gz           | FA,OHS,IDM
                                    | cdrm             | CDRM_TEMPLATE.tar.gz              | FA,OHS,IDM
ADB-S (IDM with ADB)                | starter_adb      | STARTER_TEMPLATE_ADB.tar.gz       | FA,OHS,IDM
                                    | cdrm_adb         | CDRM_TEMPLATE_ADB.tar.gz          | FA,OHS,IDM
IDCS (IDCS with EXA)                | starter_idcs     | STARTER_TEMPLATE_IDCS.tar.gz      | FA,OHS
                                    | cdrm_idcs        | CDRM_TEMPLATE_IDCS.tar.gz         | FA,OHS
Converged (IDCS with ADB)           | starter_adb_idcs | STARTER_TEMPLATE_CONVERGED.tar.gz | FA,OHS
                                    | cdrm_adb_idcs    | CDRM_TEMPLATE_CONVERGED.tar.gz    | FA,OHS

Template Changes
Add new categories for fa_main and ohs_main for both platforms arm64 and x86.

Category: cdrm
  Volume Purpose: fa_main, ohs_main, falcm_casdelta_fa, falcm_casdelta_ohs, scratch, grcrepo, essbase, swapvol
  Platform: x86 and arm64
  Existing categories:
    "categories": [
      "cdrm",
      "cdrm_adb"
    ]
  Updated categories:
    "categories": [
      "cdrm",
      "cdrm_adb",
      "cdrm_idcs",
      "cdrm_adb_idcs"
    ]

Category: starter
  Platform: x86 and arm64
  Existing categories:
    "categories": [
      "starter",
      "starter_adb"
    ]
  Updated categories:
    "categories": [
      "starter",
      "starter_adb",
      "starter_idcs",
      "starter_adb_idcs"
    ]

Processing Definition changes
** Below changes applied to both starter and cdrm specific categories.
Uploading bulk_seed.tar.gz

put
bulk_seed
${ARTIFACTS_BUCKET}
${FA_VERSION}/STARTER_ADB/bulk_seed

put
bulk_seed
${ARTIFACTS_BUCKET}
${FA_VERSION}/CDRM_ADB/bulk_seed

bulk_seed.tar.gz

put
bulk_seed.tar.gz
${ARTIFACTS_BUCKET}
${FA_VERSION}/bulk_seed.tar.gz

bulk_seed.tar.gz

- starter_adb and starter_adb_idcs will be uploaded to the same location, i.e. ${FA_VERSION}/STARTER_ADB/bulk_seed
- cdrm_adb and cdrm_adb_idcs will be uploaded to the same location, i.e. ${FA_VERSION}/CDRM_ADB/bulk_seed
- starter_idcs will be uploaded to the default location ${FA_VERSION}/bulk_seed.tar.gz

Creating dbmetadata.json

/scratch/tools/sfwrapper-tool/bin/createDBMetadataJson.sh
${volume-root}
${volume-root}
parent

put
dbmetadata.json
${ARTIFACTS_BUCKET}
${FA_VERSION}/dbmetadata.json

put
dbmetadata.json
${ARTIFACTS_BUCKET}
${FA_VERSION}/${shape}_${category_upper}/dbmetadata.json

dbmetadata.json

No changes required for starter_adb_idcs and cdrm_adb_idcs.

For starter_idcs the location should be

${FA_VERSION}/${shape}_${STARTER}/dbmetadata.json

For cdrm_idcs the location should be ${FA_VERSION}/${shape}_${CDRM}

Uploading db.properties and db.tar.gz

${TEMPLATE_FIRST_LEVEL_EXTRACT_LOCATION}/db/adb/db.properties
db/adb/db.properties

${TEMPLATE_FIRST_LEVEL_EXTRACT_LOCATION}/idm/adb/db.properties
idm/adb/db.properties

put
db/adb/db.properties
${ARTIFACTS_BUCKET}
${FA_VERSION}/STARTER_ADB/FA/db.properties

put
idm/adb/db.properties
${ARTIFACTS_BUCKET}
${FA_VERSION}/STARTER_ADB/IDM/db.properties

db/adb/db.properties

idm/adb/db.properties

${TEMPLATE_FIRST_LEVEL_EXTRACT_LOCATION}/db/adb/db.properties
db/adb/db.properties

${TEMPLATE_FIRST_LEVEL_EXTRACT_LOCATION}/idm/adb/db.properties
idm/adb/db.properties

put
db/adb/db.properties
${ARTIFACTS_BUCKET}
${FA_VERSION}/CDRM_ADB/FA/db.properties

put
idm/adb/db.properties
${ARTIFACTS_BUCKET}
${FA_VERSION}/CDRM_ADB/IDM/db.properties

db/adb/db.properties

idm/adb/db.properties

pdb_fusiondb

pdb_oiddb

${TEMPLATE_FIRST_LEVEL_EXTRACT_LOCATION}/db/pdb/dbfs.tar.gz
pdb_fusiondb

${TEMPLATE_FIRST_LEVEL_EXTRACT_LOCATION}/idm/oiddb/pdb/dbfs.tar.gz
pdb_oiddb
true

${TEMPLATE_FIRST_LEVEL_EXTRACT_LOCATION}/db/pdb/pdbmanifest.xml
pdb_fusiondb/pdbmanifest.xml

${TEMPLATE_FIRST_LEVEL_EXTRACT_LOCATION}/idm/oiddb/pdb/pdbmanifest.xml
pdb_oiddb/pdbmanifest.xml
true

db.tar.gz

pdb_fusiondb
pdb_oiddb

put
db.tar.gz
${ARTIFACTS_BUCKET}
${FA_VERSION}/db.tar.gz

put
db.tar.gz
${ARTIFACTS_BUCKET}
${FA_VERSION}/${shape}_${category_upper}/db.tar.gz

db.tar.gz

- idm/adb/db.properties processing is not applicable for starter_idcs or cdrm_idcs.
- For starter_adb and starter_adb_idcs, db.properties will be uploaded to the existing path for starter_adb.
- For cdrm_adb and cdrm_adb_idcs, db.properties will be uploaded to the existing path for cdrm_adb.
- db.tar.gz is not applicable to adb specific categories.
- db.properties processing is not applicable to idcs specific categories like starter_idcs or cdrm_idcs.
- For non adb categories db.tar.gz will be uploaded to OSS at ${FA_VERSION}/db.tar.gz and ${FA_VERSION}/${shape}_${category_upper}/db.tar.gz, where category_upper will only be either starter or cdrm (no idcs).

Uploading security.tar.gz

put
security.tar.gz
${ARTIFACTS_BUCKET}
${FA_VERSION}/ADB/security.tar.gz

put
security.tar.gz
${ARTIFACTS_BUCKET}
${FA_VERSION}/ADB/security.tar.gz

put
security.tar.gz
${ARTIFACTS_BUCKET}
${FA_VERSION}/security.tar.gz

- For adb specific categories (starter_adb, starter_adb_idcs, cdrm_adb, cdrm_adb_idcs) security.tar.gz will be uploaded to ${FA_VERSION}/ADB/security.tar.gz
- In all other cases it will be uploaded to ${FA_VERSION}/security.tar.gz.
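The category rules for security.tar.gz and db.tar.gz stated above can be summarized as a small lookup. This is a sketch of the stated rules only, not the processing definition itself: the function names are hypothetical, and the version string in the usage lines is just an illustrative value.

```python
# Categories the text calls "adb specific".
ADB_CATEGORIES = {"starter_adb", "starter_adb_idcs", "cdrm_adb", "cdrm_adb_idcs"}


def security_object_name(fa_version: str, category: str) -> str:
    """Object name for security.tar.gz: an ADB sub-folder for adb categories,
    the version root in all other cases."""
    if category in ADB_CATEGORIES:
        return f"{fa_version}/ADB/security.tar.gz"
    return f"{fa_version}/security.tar.gz"


def uploads_db_tar(category: str) -> bool:
    """db.tar.gz is uploaded only for non adb categories."""
    return category not in ADB_CATEGORIES


# Illustrative version value; any FA_VERSION string behaves the same way.
version = "11.13.24.10.0"
adb_path = security_object_name(version, "starter_adb_idcs")
default_path = security_object_name(version, "cdrm_idcs")

print(adb_path)
print(default_path)
print(uploads_db_tar("starter_idcs"), uploads_db_tar("cdrm_adb"))
```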

Backward compatibility
Since SF won't make any changes to existing categories, existing categories will work as expected.
Once this feature is rolled out, SF will remove the support for IDCS_ENABLED flag based handling. FACP should send the new categories starting 25.01. The equivalent new categories are listed below.

Existing category | IDCS_ENABLED | New category
starter_adb       | true         | starter_adb_idcs
cdrm_adb          | true         | cdrm_adb_idcs
starter           | true         | starter_idcs
cdrm              | true         | cdrm_idcs

Test Plan

Storage Factory Testing

Master Volume Creation using (starter_idcs and cdrm_idcs)
To verify the master volume creation using newly defined templates.
Verify the volumes created in storage factory.

Category starter_idcs, IDCS_ENABLED flag is false
Session FA_XS-X86_multi_template_starter_idcs_test_2
SF gpw
/scratch/tools/storagefactory-tool/bin/storagefactory-tool.sh get -rv 11.13.24.10.0 -p fa_main -q FA_XS-X86_prermali_starter_idcs_3 -sh XS -ca starter_idcs

Category starter, IDCS_ENABLED flag is false
Session FA_XS-X86_multi_template_starter_test_3
SF gpw

Category starter_adb_idcs, IDCS_ENABLED flag is true
Session inst-4xu8q-sf-test-flex-ncgjuk-london-1-ad-3 FA_S-ARM64-IDCS-ADB-test6]
/scratch/tools/storagefactory-tool/bin/storagefactory-tool.sh get -rv 11.13.24.07.0 -p fa_main -q 926b2708-b37e-4f71-8bb5-cdeda6be798b-starter_adb_idcs_1 -sh S -ca starter_adb_idcs -pt ARM64
2024-09-10 03:37:16[ ] [ ] INFO TemplateCategoryValidator:62 - Found TEMPLATE_LOC=/net/fatemplates.appsdev.fusionappsdphx1.oraclevcn.com/FA_TEMPLATES/ADBPROV_STARTER_ADB_FAOVM_T36208429_27857
2024-09-10 03:37:16[ ] [ ] WARN TemplateCategoryValidator:75 - Category substring IDCS not found in the file /net/fatemplates.appsdev.fusionappsdphx1.oraclevcn.com/FA_TEMPLATES/ADBPROV_STARTER_ADB_FAOVM_T36208429_27857
2024-09-10 03:37:16[ ] [ ] INFO
TemplateCategoryValidator:83 - Category [STARTER_ADB_IDCS] found in file /net/fatemplates.appsdev.fusionappsdphx1.oraclevcn.com/FA_TEMPLATES/ADBPROV_STARTER_ADB_FAOVM_T36208429_27857
2024-09-10 03:37:16[ ] [ ] INFO TemplateCategoryValidator:40 - Validate category under /scratch/fa_main5474267368209710030/backup
2024-09-10 03:37:16[ ] [ ] INFO TemplateCategoryValidator:52 - Validating templateProperties /scratch/fa_main5474267368209710030/backup/initial-data/casrepos/fa/fa-factory/template.properties against category STARTER_ADB_IDCS
2024-09-10 03:37:16[ ] [ ] INFO ProcessExecutor:113 - Executing command [cat /scratch/fa_main5474267368209710030/backup/initial-data/casrepos/fa/fa-factory/template.properties |grep TEMPLATE_LOC | cut -d'=' -f2]
2024-09-10 03:37:16[ ] [ ] INFO ProcessExecutor:132 - Execution completed with result ExecutorResult(exitCode=0, errorResponse=, response=/net/fatemplates.appsdev.fusionappsdphx1.oraclevcn.com/FA_TEMPLATES/ADBPROV_STARTER_ADB_FAOVM_T36208429_27857) for the command cat /scratch/fa_main5474267368209710030/backup/initial-data/casrepos/fa/fa-factory/template.properties |grep TEMPLATE_LOC | cut -d'=' -f2
2024-09-10 03:37:16[ ] [ ] INFO TemplateCategoryValidator:62 - Found TEMPLATE_LOC=/net/fatemplates.appsdev.fusionappsdphx1.oraclevcn.com/FA_TEMPLATES/ADBPROV_STARTER_ADB_FAOVM_T36208429_27857
2024-09-10 03:37:16[ ] [ ] WARN TemplateCategoryValidator:75 - Category substring IDCS not found in the file /net/fatemplates.appsdev.fusionappsdphx1.oraclevcn.com/FA_TEMPLATES/ADBPROV_STARTER_ADB_FAOVM_T36208429_27857
2024-09-10 03:37:16[ ] [ ] INFO TemplateCategoryValidator:83 - Category [STARTER_ADB_IDCS] found in file /net/fatemplates.appsdev.fusionappsdphx1.oraclevcn.com/FA_TEMPLATES/ADBPROV_STARTER_ADB_FAOVM_T36208429_27857
2024-09-10 03:37:16[ ] [ ] INFO ValidatorInterceptor:156 - ############### TemplateCategoryValidator execution success ################
2024-09-10 03:37:16[ ] [

Category starter_adb_idcs, IDCS_ENABLED flag is false
Session inst-4xu8q-sf-test-flex-ncgjuk-london-1-ad-3 FA_S-ARM64-IDCS-ADB-test7]
/scratch/tools/storagefactory-tool/bin/storagefactory-tool.sh get -rv 11.13.24.07.0 -p fa_main -q 926b2708-b37e-4f71-8bb5-cdeda6be798b-starter_adb_idcs_2 -sh S -ca starter_adb_idcs -pt ARM64

Test storage-factory tool query APIs get, list volumes for the new categories.
To verify that the query APIs work well with the newly defined categories.
/scratch/tools/storagefactory-tool/bin/storagefactory-tool.sh get -rv 11.13.24.07.0 -p fa_main -q 926b2708-b37e-4f71-8bb5-cdeda6be798b-starter_adb_idcs_2 -sh S -ca starter_adb_idcs -pt ARM64
{
  "template-ref" : {
    "purpose" : "fa_main",
    "shape" : "S",
    "category" : "starter_adb_idcs",
    "platform" : "ARM64",
    "storageType" : "block"
  },
  "release-info" : {
    "version" : "11.13.24.07.0",
    "qualifier" : "926b2708-b37e-4f71-8bb5-cdeda6be798b-starter_adb_idcs_2"
  },
  "volume-uuid" : "fbd0c8ae-573c-45b5-a333-7c991b33ebc0",
  "content-size" : "64",
  "region-master-volumes" : {
    "uk-london-1" : {
      "backup-id" : "ocid1.volumebackup.oc1.uk-london-1.abwgiljssl3fa74o574u5lrjz25rpl2mr5c57fyssxzsi4eixm5zjdv33z7q",
      "domain-master-volumes" : {
        "ncGJ:UK-LONDON-1-AD-2" : {
          "id" : "ocid1.volume.oc1.uk-london-1.abwgiljrw252qrx3erucpwev7pa5iakglrsz4el7emp2lp4sfrf5l2mgvgoq",
          "size" : "300"
        },
        "ncGJ:UK-LONDON-1-AD-3" : {
          "id" : "ocid1.volume.oc1.uk-london-1.abwgiljshtr2e3q6c53fwocy7ht3o4b4merypwjb33pertvd2a2a4j2hhjzq",
          "size" : "300"
        },
        "ncGJ:UK-LONDON-1-AD-1" : {
          "id" : "ocid1.volume.oc1.uk-london-1.abwgiljtzfe5odmlgcyqid6z2kiv6gm53enojzoexd7svoqerqxkjtrkb3ya",
          "size" : "300"
        }
      },
      "backup-history" : [ {
        "id" : "ocid1.volumebackup.oc1.uk-london-1.abwgiljssl3fa74o574u5lrjz25rpl2mr5c57fyssxzsi4eixm5zjdv33z7q",
        "creationDate" : "2024-09-10T05:19:35.156Z"
      } ]
    }
  },
  "staging-artifacts" : [ {
    "archive-type" : "tar.gz",
    "ossRegion" : "uk-london-1",
    "ossNamespace" : "id2sxilrhi0w",
    "bucketName" : "db_artifacts",
    "objectName" : "11.13.24.07.0-00000-FABSPFLT.762021.240529.1038/STARTER_ADB/bulk_seed/bulkseed_export_stats.txt",
    "artifactPath" : "/scratch/fa_main6098478811496481219/bulk_seed/bulkseed_export_stats.txt",
    "artifactUse" : "ossput"
  }, {
    "archive-type" : "tar.gz",
    "ossRegion" : "uk-london-1",
    "ossNamespace" : "id2sxilrhi0w",
    "bucketName" : "db_artifacts",
    "objectName" : "11.13.24.07.0-00000-FABSPFLT.762021.240529.1038/STARTER_ADB/bulk_seed/bulk_seed_only_table_01.dmp",
    "artifactPath" : "/scratch/fa_main6098478811496481219/bulk_seed/bulk_seed_only_table_01.dmp",
    "artifactUse" : "ossput"
  }, {
    "archive-type" : "tar.gz",
    "ossRegion" : "uk-london-1",
    "ossNamespace" : "id2sxilrhi0w",
    "bucketName" : "db_artifacts",
    "objectName" : "11.13.24.07.0-00000-FABSPFLT.762021.240529.1038/STARTER_ADB/bulk_seed/bulk_seed_only_table_09.dmp",
    "artifactPath" : "/scratch/fa_main6098478811496481219/bulk_seed/bulk_seed_only_table_09.dmp",
    "artifactUse" : "ossput"
  }, {
    "archive-type" : "tar.gz",
    "ossRegion" : "uk-london-1",
    "ossNamespace" : "id2sxilrhi0w",
    "bucketName" : "db_artifacts",
    "objectName" : "11.13.24.07.0-00000-FABSPFLT.762021.240529.1038/STARTER_ADB/bulk_seed/bulk_seed_only_table_03.dmp",
    "artifactPath" : "/scratch/fa_main6098478811496481219/bulk_seed/bulk_seed_only_table_03.dmp",
    "artifactUse" : "ossput"
  }, {
    "archive-type" : "tar.gz",
    "ossRegion" : "uk-london-1",
    "ossNamespace" : "id2sxilrhi0w",
    "bucketName" : "db_artifacts",
    "objectName" : "11.13.24.07.0-00000-FABSPFLT.762021.240529.1038/STARTER_ADB/bulk_seed/bulk_seed_only_table_04.dmp",
    "artifactPath" : "/scratch/fa_main6098478811496481219/bulk_seed/bulk_seed_only_table_04.dmp",
    "artifactUse" : "ossput"
  }, {
    "archive-type" : "tar.gz",
    "ossRegion" : "uk-london-1",
    "ossNamespace" : "id2sxilrhi0w",
    "bucketName" : "db_artifacts",
    "objectName" : "11.13.24.07.0-00000-FABSPFLT.762021.240529.1038/STARTER_ADB/bulk_seed/bulk_seed_only_table_08.dmp",
    "artifactPath" : "/scratch/fa_main6098478811496481219/bulk_seed/bulk_seed_only_table_08.dmp",
    "artifactUse" : "ossput"
  }, {
    "archive-type" : "tar.gz",
    "ossRegion" : "uk-london-1",
    "ossNamespace" : "id2sxilrhi0w",
    "bucketName" : "db_artifacts",
    "objectName" : "11.13.24.07.0-00000-FABSPFLT.762021.240529.1038/STARTER_ADB/bulk_seed/bulk_seed_only_table_06.dmp",
    "artifactPath" : "/scratch/fa_main6098478811496481219/bulk_seed/bulk_seed_only_table_06.dmp",
    "artifactUse" : "ossput"
  }, {
    "archive-type" : "tar.gz",
    "ossRegion" : "uk-london-1",
    "ossNamespace" : "id2sxilrhi0w",
    "bucketName" : "db_artifacts",
    "objectName" : "11.13.24.07.0-00000-FABSPFLT.762021.240529.1038/STARTER_ADB/bulk_seed/bulk_seed_only_table_05.dmp",
    "artifactPath" : "/scratch/fa_main6098478811496481219/bulk_seed/bulk_seed_only_table_05.dmp",
    "artifactUse" : "ossput"
  }, {
    "archive-type" : "tar.gz",
    "ossRegion" : "uk-london-1",
    "ossNamespace" : "id2sxilrhi0w",
    "bucketName" : "db_artifacts",
    "objectName" : "11.13.24.07.0-00000-FABSPFLT.762021.240529.1038/STARTER_ADB/bulk_seed/bulk_seed_only_table_07.dmp",
    "artifactPath" : "/scratch/fa_main6098478811496481219/bulk_seed/bulk_seed_only_table_07.dmp",
    "artifactUse" : "ossput"
  }, {
    "archive-type" : "tar.gz",
    "ossRegion" : "uk-london-1",
    "ossNamespace" : "id2sxilrhi0w",
    "bucketName" : "db_artifacts",
    "objectName" : "11.13.24.07.0-00000-FABSPFLT.762021.240529.1038/STARTER_ADB/bulk_seed/import_directive.properties",
    "artifactPath" : "/scratch/fa_main6098478811496481219/bulk_seed/import_directive.properties",
    "artifactUse" : "ossput"
  }, {
    "archive-type" : "tar.gz",
    "ossRegion" : "uk-london-1",
    "ossNamespace" : "id2sxilrhi0w",
    "bucketName" : "db_artifacts",
    "objectName" : "11.13.24.07.0-00000-FABSPFLT.762021.240529.1038/STARTER_ADB/bulk_seed/bulk_seed_only_table_02.dmp",
    "artifactPath" : "/scratch/fa_main6098478811496481219/bulk_seed/bulk_seed_only_table_02.dmp",
    "artifactUse" : "ossput"
  }, {
    "archive-type" : "tar.gz",
    "ossRegion" : "uk-london-1",
    "ossNamespace" : "id2sxilrhi0w",
    "bucketName" : "db_artifacts",
    "objectName" : "11.13.24.07.0-00000-FABSPFLT.762021.240529.1038/STARTER_ADB/bulk_seed/export_only_table.log",
    "artifactPath" : "/scratch/fa_main6098478811496481219/bulk_seed/export_only_table.log",
    "artifactUse" : "ossput"
  }, {
    "archive-type" : "tar.gz",
    "ossRegion" : "uk-london-1",
    "ossNamespace" : "id2sxilrhi0w",
    "bucketName" : "db_artifacts",
    "objectName" : "11.13.24.07.0-00000-FABSPFLT.762021.240529.1038/fnd_db_resource_plan.sql",
    "artifactPath" : "/scratch/fa_main6098478811496481219/data/casrepos/fa/APPLTOP/fmw/atgpf/atgpf/applcore/db/sql/common/fnd_db_resource_plan.sql",
    "artifactUse" : "ossput"
  }, {
    "archive-type" : "tar.gz",
    "ossRegion" : "uk-london-1",
    "ossNamespace" : "id2sxilrhi0w",
    "bucketName" : "db_artifacts",
    "objectName" : "11.13.24.07.0-00000-FABSPFLT.762021.240529.1038/STARTER_ADB/FA/db.properties",
    "artifactPath" : "/scratch/fa_main6098478811496481219/db/adb/db.properties",
    "artifactUse" : "ossput"
  }, {
    "archive-type" : "tar.gz",
    "ossRegion" : "uk-london-1",
    "ossNamespace" : "id2sxilrhi0w",
    "bucketName" : "db_artifacts",
    "objectName" : "11.13.24.07.0-00000-FABSPFLT.762021.240529.1038/fusion_trust.jks",
    "artifactPath" : "/scratch/fa_main6098478811496481219/data/instances/APPLTOP/instance/keystores/fusion_trust.jks",
    "artifactUse" : "ossput"
  }, {
    "archive-type" : "tar.gz",
    "ossRegion" : "uk-london-1",
    "ossNamespace" : "id2sxilrhi0w",
    "bucketName" : "db_artifacts",
    "objectName" : "11.13.24.07.0-00000-FABSPFLT.762021.240529.1038/security.tar.gz",
    "artifactPath" : "/scratch/fa_main6098478811496481219/security.tar.gz",
    "artifactUse" : "ossput"
  }, {
    "archive-type" : "tar.gz",
    "ossRegion" : "uk-london-1",
    "ossNamespace" : "id2sxilrhi0w",
    "bucketName" : "db_artifacts",
    "objectName" : "11.13.24.07.0-00000-FABSPFLT.762021.240529.1038/bulk_seed.size.txt",
    "artifactPath" : "/scratch/fa_main6098478811496481219/bulk_seed.size.txt",
    "artifactUse" : "ossput"
  }, {
    "type" : "BULK_SEED_V3",
    "name" : "bulkseed",
    "version" : "11.13.24.07.0-00000-FABSPFLT.762021.240529.1038",
    "prefix" : "FA-TEMPLATES",
    "archive-type" : "zip",
    "ossRegion" : "uk-london-1",
    "ossNamespace" : "id2sxilrhi0w",
    "bucketName" : "artifactRepository",
"artifactUse" : "template", 759 "checksum" : { 760 "algorithm-type" : "MD5", 761 "value" : "bRTApUqUb/LhluOKLYN34Q==-2" 762 } 763 }, { 764 "type" : "TEMPLATE_FA_MAIN_V2_CONSOLIDATED", 765 "name" : "STARTER_TEMPLATE_CONVERGED", 766 "version" : "11.13.24.07.0-00000-FABSPFLT.762021.240529.1038", 767 "prefix" : "FA-TEMPLATES", 768 "archive-type" : "tar.gz", 769 "ossRegion" : "uk-london-1", 770 "ossNamespace" : "id2sxilrhi0w", 771 "bucketName" : "artifactRepository", 772 "artifactUse" : "template", 773 "checksum" : { 774 "algorithm-type" : "MD5", 775 "value" : "GShANXI5QLe+sgbOjiy9YA==-448" 776 } 777 }, { 778 "type" : "TEMPLATE_EXTRA_V2_CONSOLIDATED", 779 "name" : "STARTER_TEMPLATE_CONVERGED", 780 "version" : "11.13.24.07.0-00000-FABSPFLT.762021.240529.1038", 781 "prefix" : "FA-TEMPLATES", 782 "archive-type" : "tar.gz", 783 "ossRegion" : "uk-london-1", 784 "ossNamespace" : "id2sxilrhi0w", 785 "bucketName" : "artifactRepository", 786 "artifactUse" : "template", 787 "checksum" : { 788 "algorithm-type" : "MD5", 789 "value" : "GShANXI5QLe+sgbOjiy9YA==-448" 790 } 791 }, { 792 "type" : "TEMPLATE_EXTRA_REPOSITORY_V2", 793 "name" : "Repository", 794 "version" : "11.13.24.07.0-00000-FABSPFLT.762021.240529.1038", 795 "prefix" : "FA-TEMPLATES", 796 "archive-type" : "tar.gz", 797 "ossRegion" : "uk-london-1", 798 "ossNamespace" : "id2sxilrhi0w", 799 "bucketName" : "artifactRepository", 800 "artifactUse" : "repository", 801 "checksum" : { 802 "algorithm-type" : "MD5", 803 "value" : "rVhb6SfifmMldcMIHmGDtQ==-48" 804 } 805 }, { untitled text 51 Page 23/115 806 "type" : "LCM_ARTIFACTS_LEGACY", 807 "name" : "vertex_mu", 808 "version" : "2024-08", 809 "prefix" : "lcm", 810 "archive-type" : "zip", 811 "ossRegion" : "uk-london-1", 812 "ossNamespace" : "id2sxilrhi0w", 813 "bucketName" : "artifactRepository", 814 "artifactUse" : "lcmArtifacts", 815 "checksum" : { 816 "algorithm-type" : "MD5", 817 "value" : "FXC8f6eDlO2BGYJPjwX9mg==-2" 818 } 819 }, { 820 "type" : "LCM_ARTIFACTS", 821 "name" : 
"fa-vm-assembly", 822 "version" : "2024.09.00.1708/fa-vm-assembly-2024.09.00.1708", 823 "prefix" : "com/oracle/fa/lcm", 824 "archive-type" : "tgz", 825 "ossRegion" : "uk-london-1", 826 "ossNamespace" : "id2sxilrhi0w", 827 "bucketName" : "artifactRepository", 828 "artifactUse" : "lcmArtifacts", 829 "checksum" : { 830 "algorithm-type" : "MD5", 831 "value" : "69j3JuMLVq4euN5NfdnyvA==" 832 } 833 }, { 834 "type" : "LCM_AGENT", 835 "name" : "facp-agent-api", 836 "version" : "1.1.363-dist", 837 "prefix" : "com/oracle/facontrolplane/agent/facp-agent-api/1.1.363", 838 "archive-type" : "zip", 839 "ossRegion" : "uk-london-1", 840 "ossNamespace" : "id2sxilrhi0w", 841 "bucketName" : "artifactRepository", 842 "artifactUse" : "lcmAgent", 843 "checksum" : { 844 "algorithm-type" : "MD5", 845 "value" : "Ur1zaIjZ+st5iOv8UmfB8A==" 846 } 847 } ], 848 "post-repo-staging-artifacts" : [ ], 849 "properties" : [ { 850 "name" : "FA_VERSION", 851 "value" : "11.13.24.07.0-00000-FABSPFLT.762021.240529.1038" 852 }, { 853 "name" : "ADB_ENABLED", 854 "value" : "true" 855 }, { 856 "name" : "LCM_ARTIFACTS_SET_ID", untitled text 51 Page 24/115 857 "value" : "LCMArtifactSet_2024.09.00.1708_1.1.363_2024-08" 858 }, { 859 "name" : "RELEASE_SNAPSHOT_ID", 860 "value" : "926b2708-b37e-4f71-8bb5-cdeda6be798b-starter_adb_idcs_2" 861 }, { 862 "name" : "opss-store-id", 863 "value" : … "ocid1.vaultsecret.oc1.uk-london-1. … amaaaaaaaxki4uqa7nora6wg526m7gychhjrgkzgf5pgzjas6qlqjojgeqga" 864 }, { 865 "name" : "opss-import-id", 866 "value" : … "ocid1.vaultsecret.oc1.uk-london-1. … amaaaaaaaxki4uqaqy2hwz6o46vwscwmqasqk6k5qb3u5pheu6qbp2xotpua" 867 }, { 868 "name" : "REVISION_NUMBER", 869 "value" : "1" 870 } ] 871 } 872 Delta patching on volumes created via new template 873 To verify if patch works well with the volumes created with new … template. 
Delta patching for Category starter_adb_idcs
Session inst-4xu8q-sf-test-flex-ncgjuk-london-1-ad-3 … DV_POST_REPO_P4FA_GSI_SEP_10]$

Integration Test Plan with FACP

1 FACP release set creation using multi template
To verify the master volume creation using multi-template (number of volumes created).
Verify the volumes created in storage factory.
To demonstrate, a release that has all 4 templates will have the following volumes.

Starter (using a single size and a single platform to keep the list short):
Legacy Consolidated: M.X86.FA.1, M.X86.OHS.1, M.X86.IDM.1
ADB-S: M.X86.FA.2, M.X86.OHS.2, M.X86.IDM.2
IDCS: M.X86.FA.3, M.X86.OHS.3
Converged: M.X86.FA.4, M.X86.OHS.4

2 Scheduled Patching activity for multi-template release
To verify that factory patching goes through for volumes created via multi-template.
Verify the volumes created in storage factory.

3 FACP release set creation using a single newly defined template (IDCS or Converged)
To verify the master volume creation using a single template.
Verify the volumes created in storage factory.

4 Backward-compatibility test to verify existing templates (starter or starter_adb)
To verify the master volume creation using a single template.
Verify the volumes created in storage factory.

References
Multi-Template Releases


Created by Faizaan Qureshi, last modified on Oct 21, 2024

[Figure: State transitions for a DP Instance in SFaaS, drawn across two lanes, "OCI - Compute" and "Kiev - Bucket". States: Provisioning; Running (not loaded in Kiev); Running-Available (metadata state: Available); Running-Assigned (metadata state: Assigned); Running-TerminateNow (metadata state: TerminateNow); Running-TerminateLater (metadata state: TerminateLater); Terminating (metadata state: Terminating); Terminated. Transitions are labeled A, B, C, D, E, F, G, H, I, J1, J2, K, L1, L2, L3 and M. Annotations: C and D are marked "loading in Kiev"; J1/J2 are marked "terminating compute instance in OCI"; "List all attached Block Volumes, then insert their metadata in a DPInstanceBV bucket"; "Kick off termination for attached Block Volume"; "If stuck in terminating for too long, raise an alert".]

Update Pool Configuration process:
Replacing an Instance Pool's current OS Image with a new one does not need to happen immediately across all Instances of the pool, so this process can run once a day to check whether the Spectre property for the latest OS Image names a different image than the one used by the pool's current configuration.

Get the list of all active customers and their RUNNING/SCALING Instance Pools.
Iterate over each Instance Pool:
  if the Spectre property has the same OS Image, skip and proceed to the next Instance Pool
  if the Spectre property has a different OS Image, then:
    create a new Instance Configuration in which all values remain the same as in the current configuration used by the pool, except that the OS Image takes the value set in the Spectre property
    replace the pool's current configuration with the newly created one

The table below shows how the different processes cause the state transitions in the diagram above.

A and B
Compute starts (A) and completes Instance provisioning and transitions state to RUNNING (B)

C, D, F and H
Get all RUNNING instances from an Active/Scaling pool.
Iterate over each instance and check whether it has the latest matching image:
  if it does: check whether this instance has its metadata in Kiev,
    if it does: skip
    if not: insert a new metadata record with state Available (C)
  if not: check whether this Instance has its metadata in Kiev,
    if not: add instance metadata in Kiev with state = TerminateNow (D)
    if it does: start a Kiev transaction, call GET for the same instance and check its state,
      if AVAILABLE: change state to TerminateNow (F)
      if ASSIGNED: change state to TerminateLater (H)

J (J1 and J2)
Iterate over all Instances with metadata in Kiev where state == TerminateNow:
  Using the OCI SDK, get the list of all Block Volumes attached to the instance.
  Insert all these BV OCIDs in the DPInstanceBlockVolume bucket, with the Instance OCID as the relational key.
  Start a Kiev transaction and change the metadata state to Terminating (J1). Query to obtain the list of all attached Block Volumes, then insert their metadata in a new Kiev bucket called DPInstanceBV (to be deleted later). Within the same transaction, also call Terminate on the actual OCI instance (J2).

E
Get all Queued jobs, sorted by priority and date.
Iterate over each job:
  start a Kiev transaction and get an AVAILABLE instance for the job's customer from the Kiev metadata bucket,
  if found: in the same transaction, mark the Job as READY and the instance as ASSIGNED (E)
  if not found: return, logging a message that no active instance could be found for the job

G and I
When everything that is needed from the assigned DP instance is completed, call a helper method that creates a Kiev transaction to update the instance metadata state:
  If the current state is Assigned, change it to Available (G)
  If the current state is TerminateLater, change it to TerminateNow (I). This means that the instance was created with an older configuration; now that the job it was running has completed, the instance is ready to be terminated and then replaced by the pool.

L (L1, L2 and L3)
The process queries Kiev for all Instance metadata records where state == Terminating, then iterates over each record:
  Check whether the Instance is still in Terminating:
    TERMINATING: if it has taken longer than a configured time, emit metrics that trigger an alarm (so that on-call can pay attention to this instance); if the configured time has not yet passed, allow more time by moving on to the next instance.
    TERMINATED: since the actual Instance has terminated, delete its metadata record from Kiev and check for any Block Volumes that were attached to it and were inserted in the DPInstanceBV bucket by the (A) Loading and Scaling Process as part of the (J1) transition.
      Iterate over each BV record and call terminate.

L1
Compute completes Instance termination and transitions state to TERMINATED (L1)

(C) Termination Process:
This process monitors the completion of instance termination and kicks off deletion of the attached Block Volumes, so it can be seen as more of a resource cleaning process (to reclaim capacity). Because we reuse DP Instances after they are done processing a job (provided their image is up to date), we should not expect many instances in TERMINATING state each day. The volume will only be higher on days when the pool's Instance Configuration changes and all existing instances become out of date. Running this process 2-3 times a day should therefore be enough to make sure there are no leaked resources when we terminate an instance.

Scaling Instance Pool:
The Loading and Scaling Process (A) (as mentioned in the table above) will perform this task in its last step.

Iterate over the list of all Active customers:
  Get the count of all queued jobs for the customer.
  Get the count of all RUNNING instances for the customer that are not ASSIGNED to a job (both Running and Running-Available, as shown in the state diagram above).
  Calculate the number of Instances needed to process the currently Queued jobs; call it X.
  If X is bigger than the current Instance count for the pool (Scale Out):
    Update the instance count for the pool. The new instances created as part of Scale Out should become available for jobs in the next run of Process (A).
  If X is smaller than the current Instance count for the pool (Scale In): (this can be tricky, as it can potentially terminate instances that are currently processing jobs)
    To reduce the instance count for the pool we will call detach with the terminate option, which not only terminates an instance without replacing it with a new one, but also reduces the pool's instance count (start a Kiev transaction, set the instance metadata state to Terminating, get the list of all attached Block Volumes, insert their metadata in the DPInstanceBV Kiev bucket, and call detach with the terminate option on the Instance).
    This step is applied in the following order:
      First, start with Running instances in the pool that do not have metadata entries in Kiev, and see whether the pool's instance count reduces to X (repeat this step until all such Running instances are gone before going to the next step).
      If X is still smaller than the pool's instance count, continue with instances in the pool that have metadata state == TERMINATE_NOW (i.e. instances in Running-TerminateNow state, as shown in the state diagram above), and see whether the pool's instance count reduces to X (repeat this step until all Running-TerminateNow instances are gone before going to the next step).
      If X is still smaller than the pool's instance count, continue with Running instances in the pool that have metadata state == AVAILABLE (i.e. instances in Running-Available state, as shown in the state diagram above), and see whether the pool's instance count reduces to X (repeat this step until all Running-Available instances are gone before going to the next step).
      If X is still smaller than the pool's instance count, stop the Scale-In process here (as we cannot terminate instances that are currently processing jobs) and let the next run of Process (A) take care of further reduction in the instance count.


Created by Muthu Puranam, last modified on Nov 05, 2024

Introduction:
We want to add a dynamic scaling ability to the SFAAS Instance Pool. The idea is to dynamically scale up the number of instances when more FV creation jobs are queued, and to scale down when the number of queued jobs is much lower than the number of available instances.

This wiki is to discuss the various optimization techniques we want to consider.

We are considering doing it in 2 steps:

Manually handling scaling, by exposing OPS API endpoints.
Auto scaling based on demand.

Manually scaling instances:
Instead of allowing ops to increase or decrease the pool size arbitrarily, we will allow only a certain % (10?) change to the pool size using this endpoint. We need to set up a maximum limit for the number of instances that the pool can be scaled up to (need inputs for this limit).

Also, we want to make sure the number of instances in a pool does not go below a minimum pool size (5?).
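
The limits on manual scaling described above can be sketched as follows. This is a minimal illustration in Python rather than the service's implementation: the function name `clamp_manual_resize` is hypothetical, and the 10% step, the minimum of 5 and the maximum of 40 are placeholder values, since the text leaves all three as open questions.

```python
import math


def clamp_manual_resize(current_size: int, requested_size: int,
                        max_change_pct: float = 10.0,
                        min_pool_size: int = 5,
                        max_pool_size: int = 40) -> int:
    """Return the pool size actually applied for a manual resize request.

    All defaults are hypothetical placeholders, not agreed limits.
    """
    # Largest change allowed in one call; always allow at least one instance
    # so that small pools can still be resized.
    max_step = max(1, math.floor(current_size * max_change_pct / 100.0))

    # Limit the requested size to +/- max_step around the current size.
    lower = current_size - max_step
    upper = current_size + max_step
    limited = min(max(requested_size, lower), upper)

    # Never leave the [min_pool_size, max_pool_size] range.
    return min(max(limited, min_pool_size), max_pool_size)
```

With these placeholder values, a request to grow a pool of 20 to 30 instances would be limited to 22, and a request to shrink a pool of 5 would leave it at 5.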

We will have a targetPoolSize property in the DPInstancePool class, and we will use that property to scale instances up or down. By default targetPoolSize will be set to MaxPoolSize from limits.

Adding instances:
Let us add endpoints to manually add instances to, or remove instances from, the instance pool.

/dpInstancePools/{dpInstancePoolId}/actions/addInstancesToPool/{additonalInstances}

This is a PUT command that will add "additonalInstances" to the instance pool. Ops can observe that FV requests have been waiting in the queue for a long time and can add instances as they see fit.

We will use ComputeManagement::updateInstancePool(UpdateInstancePoolRequest) for this.

Removing Instances:
/dpInstancePools/{dpInstancePoolId}/actions/removeInstancesFromPool/{fewerInstances}

This is a DELETE command that will remove "fewerInstances" from the instance pool. When ops observe that no FV requests have been waiting in the queue for long and the pool has more instances in assigned state, they can take some off from the pool.

During the InstancePoolScalingNanny::cleanupTerminatedInstances call we will compare totalInstances of the pool against the Instance pool target size. If totalInstances is more than the target size, we will mark available instances for termination and decrement the totalInstances value.

Right now, we terminate an instance when:

It is reserved and reservation not expired.
agentVersion from limits and the instance agent version are different.
DpInstanceConfiguration and the instance pool have different instanceConfigurationId.

We will add a policy to remove instances if totalInstances is more than targetSize.

Dynamic Scaling of Instances:
We will have the InstancePoolScaling nanny with the following state fields:

targetSize: Size of the instance pool we want to keep based on current job queue trends. It should be within the range minPoolSize < targetSize < maxPoolSize.
minPoolSize: Minimum pool size we want to maintain, so we don't run out of instances at any time.
maxPoolSize: Maximum number of instances we want to use during extreme situations.
lastReadQueueSize: We will save the last read queue size for comparison with the current read.
QueueTrendBuffer: A sliding window or circular buffer.
BufferSize: Size of the circular buffer.
Reading frequency: How frequently Nanny reads the queue size (let's keep it as 5 minutes?).
incrementStepSize will be in geometric progression (1,2,4,8,..)
decrementStepSize will be in geometric progression (1,2,4,8,..)

Let us start with target size = maxPoolSize, that is, with an instance pool of maxPoolSize instances. Nanny will periodically look at the queue size and compare it with lastReadQueueSize.

if (lastRead < currentRead) add +1 to QueueTrendBuffer, else -1. If the sum of the buffer elements is positive, we increment instances, else we decrement them.

When we see the queue is growing we use ComputeManagement::updateInstancePool to set the new target size.

If we want to reduce the number of instances in the instance pool, we will use the StorageFactoryInstancePoolProvider::detachInstance method. This will be invoked only on Available instances.

At the end of each nanny run, we will check the target size and the pool size. In a loop, we will either add an instance if targetSize > poolSize or detach an instance otherwise, until targetSize = pool size.

Nanny reads the current queue size and compares it with the last read, entering +1 in the ring buffer if the queue is growing, else -1.
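
The trend tracking described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the nanny's actual code: the class name `QueueTrendScaler` is hypothetical, the bounds are treated as inclusive, and the rule that the step doubles while the trend keeps its direction and resets to 1 when it flips is one possible reading of "geometric progression" that the text does not spell out.

```python
from collections import deque


class QueueTrendScaler:
    """Tracks queue growth in a ring buffer and derives a target pool size."""

    def __init__(self, min_pool_size: int, max_pool_size: int, buffer_size: int = 5):
        self.min_pool_size = min_pool_size
        self.max_pool_size = max_pool_size
        # Start at the maximum, as the text proposes.
        self.target_size = max_pool_size
        self.last_read_queue_size = 0
        # A deque with maxlen drops the oldest entry, like a circular buffer.
        self.trend = deque(maxlen=buffer_size)
        self.step = 1        # current geometric step: 1, 2, 4, 8, ...
        self.direction = 0   # +1 growing, -1 shrinking, 0 not yet known

    def observe(self, current_queue_size: int) -> int:
        """Record one queue reading and return the new target size."""
        growing = self.last_read_queue_size < current_queue_size
        self.trend.append(1 if growing else -1)
        self.last_read_queue_size = current_queue_size

        # Positive sum means the queue has mostly been growing.
        new_direction = 1 if sum(self.trend) > 0 else -1
        # Assumption: double the step while the direction holds, else reset.
        if new_direction == self.direction:
            self.step *= 2
        else:
            self.step = 1
        self.direction = new_direction

        proposed = self.target_size + new_direction * self.step
        # Keep the target inside the configured pool size limits.
        self.target_size = max(self.min_pool_size,
                               min(self.max_pool_size, proposed))
        return self.target_size
```

A caller would invoke `observe` once per reading interval and then add or detach instances until the pool size matches the returned target.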
[Figure: flowchart of the dynamic scaling decision. A ring buffer tracks the queue trend (example contents: 1, 1, -1, -1, 1). A decision box tests SumOfBuffer > 0 with Yes and No branches leading to "Increase instances in GP" and "Decrease instances in GP" (GP = geometric progression). Other labels: Increase instances, Decrease instances, Min PoolSize, MaxPoolSize, and the step values 1, 3, 5, 7.]

Testing:
We have to come up with proper stress testing scenarios. I will add a section for testing this feature. Testing will include starting the Instance pool with the maximum instances allowed as the target size and checking that it scales down for normal loads and stabilizes.

Then increase the load and verify it scales up. Then reduce the load and check that it scales down as expected.


Created by Charles Paclat, last modified on Oct 28, 2024

Problem Statement
The SFCP makes use of Shepherd and ODO to manage its Control Planes. It uses a configuration whereby an InstanceConfiguration is created and then bound to an InstancePool. That InstancePool is referred to by an ODO pool. The ODO pool discovers the instances in that Instance Pool and uses them as the target for the ODO API and Worker applications that are deployed to it.

When changes are made to the InstanceConfiguration in shepherd, the configuration is replaced with a new one and the InstancePool is updated to leverage that new configuration. In the event that an instance is terminated, that instance is replaced with one that leverages the new configuration.

There is a limitation in the implementation of OCI Instance Pool (a.k.a. Scaling) whereby the instances are not automatically replaced when the instance configuration of an instance pool is updated. As a result, any existing instances keep running with the old configuration.

In the coming milestones there are use cases where the instances need to be replaced due to external intercoms.

Converting to use Oracle Linux 8 based images.
Disabling the IMDS v1 for the purposes of security.

References:
https://confluence.oci.oraclecorp.com/pages/viewpage.action?spaceKey=EGREEN&title=Migration
https://confluence.oci.oraclecorp.com/display/EGREEN/Overlay

Goals
The goal is to design a standard process by which we can use shepherd to orchestrate the replacement of the instances that are created as part of the Storage Factory Infrastructure flock. There are at present 31 deployed regions for the SFCP, which dictates that this process should be automated almost entirely through shepherd. This includes:

Replacing all of the CP instances for both work and api.
Replacing all of the instances that are used as jump hosts.

The only exception to this is that the IPs of the jump hosts are encoded in the ssh configs for the system. These new IPs must be discovered and replaced in the regional stage information of the project.

Design
This design is based on a solution that was implemented by the ADP team (ADP Solution). Our goal is to accomplish the same ends without requiring a sequenced series of commits.

Control Plane Updates
The Control Plane is based on Instance Pool and Instance Configuration. The solution is based on a side effect of the way in which the Instance Pool is implemented. When the size of an Instance Pool is expanded, the new instances are created with the replacement instance configuration. These new instances are discovered by the ODO service and added to the ODO Pool as well. Unfortunately, the current version of the application is not automatically installed on these additional hosts. To effect that change the application must be explicitly deployed at the target version. This can be accomplished using the application deployment flock.

Implementation
To make this a part of our normal standard of operations when replacing the instance configuration, an additional input is to be added to the flock that indicates whether this is going to replace the instance configuration.

Shepherd Variables
Shepherd has added support for release-specific override variables. These are supported only at the topmost level and therefore each of...

shepherd-dev/infrastructure/generic_region/variables.tf
shepherd-beta/infrastructure/generic_region/variables.tf
shepherd/infrastructure/generic_region/variables.tf

variable "cp_instance_multiplier" {
  description = "Used when replacing the instances in the CP. Set this to 2 if you want to double the instances that are available in the CP."
  type        = number
  default     = 1
}

The presence of this variable can be leveraged when creating a release to increase the size of all of the instance pools.

The diagram below shows that the user can specify the value of "2" for "cp_instance_multiplier".

When this release is planned, all of the instance pools are doubled in size because of logic added to the infrastructure main of the form:

cp_api_instance_pool_base_size = local.ad_count == 3 ? local.region_specific_info.cp_api_instance_config.pool_size.AD_COUNT_3 : local.region_specific_info.cp_api_instance_config.pool_size.AD_COUNT_1
cp_api_instance_pool_size = local.cp_api_instance_pool_base_size * var.cp_instance_multiplier

The resulting plan demonstrates the behavior. By specifying the multiplier the counts are doubled.

This is released and the new instances are created with the updated instance configuration.

A subsequent release without the variable set will result in the older versions being terminated, leaving only the new instances.
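
The pool-size arithmetic across the two releases can be illustrated with a small Python sketch. It mirrors the Terraform expression above for illustration only; the function name and the example sizes (6 instances for three ADs, 2 for one AD) are hypothetical and not taken from any region's configuration.

```python
def cp_api_instance_pool_size(ad_count: int, pool_size: dict,
                              cp_instance_multiplier: int = 1) -> int:
    """Base pool size for the region's AD count, times the release multiplier."""
    base_size = pool_size["AD_COUNT_3"] if ad_count == 3 else pool_size["AD_COUNT_1"]
    return base_size * cp_instance_multiplier


# Hypothetical region settings, for illustration only.
example_pool_size = {"AD_COUNT_3": 6, "AD_COUNT_1": 2}

# Release 1: multiplier set to 2, so the pool doubles and the extra instances
# are created from the updated instance configuration.
doubled = cp_api_instance_pool_size(3, example_pool_size, cp_instance_multiplier=2)

# Release 2: multiplier left at its default of 1, so the pool returns to its
# base size.
restored = cp_api_instance_pool_size(3, example_pool_size)
```

With these example sizes the pool goes from 6 to 12 instances in the first release and back to 6 in the second.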
Created by Charles Paclat, last modified on Aug 04, 2023
As detailed in the CRISP Problem Overview, there is a move away from the use of "Service Principals" and towards the use of "Resource Principals" when managing interactions between tenancies. The CRISP Approved Solution using Resource Principals is the solution that will guide this design. See also: Resource Principal Everything you need to know

Design Summary
The basic design is that the customer registers with our service and, as a result, we internally create a data plane instance on their behalf. We refer to the collective resources as a "Storage Factory" since it is a factory for factory volumes. The Storage Factory OCID uniquely identifies that resource for the purposes of resource principals. This allows a customer to write the required policies that enable the StorageFactory to access OCI resources on their behalf.

The Control Plane nodes are enabled for the use of Service Principals and are responsible for creating and managing a Resource Principal on behalf of each customer using the associated Storage Factory OCID. An internal service will ensure that any worker threads acting on behalf of a customer are seeded with the required credentials.

As each of our stages has its own tenancy and service principal, we are required to scope the resource name to the stage with a "resource_prefix":

sfcpinfratest - dev_storageFactory
sfcpbeta - beta_storageFactory
sfcppreprod - preprod_storageFactory
sfcpprod - storageFactory

[Figure: request flow between the customer tenancy (Tenancy - U), the FACP Service, the SFCP Service S1, Resource R1 (storage factory), OS Service S2 and Block Storage Service S2]
1. Please create the resource R1 (an instance pool) in compartment C1.
2. Service S1 checks authorization and creates the instance pool, and this request workflow is complete.
3. Next create the FV/FVD in compartment C1.
4. OS Service S2: read the contents of the templates from OSS using RP authority. Block Storage Service S2: clone a block volume into the compartment using RP authority.

This enables the customer to write the policies shown below. These allow the Storage Factory that represents the customer's registration in our tenancy to do work against the originating tenancy.

statements = [
  # Policy to read from the OSS bucket containing the application templates
  # TODO: these are the policies required for the data plane instance for a particular customer to interact with the customer tenancy
  "allow any-user to read buckets in compartment ${var.service_compartment_name} where all { target.bucket.name='${var.service_application_bucket_name}', request.principal.id='${var.customer_ocid}', request.principal.type='storageFactory' }",

  # Write to an OSS bucket containing the results of the volume creations (this is there only for backward compatibility during a transition period)
  "allow any-user to manage objects in compartment ${var.service_compartment_name} where all { target.bucket.name='${var.service_volume_info_bucket_name}', request.principal.id='${var.customer_ocid}', request.principal.type='storageFactory' }",

  # Manage block volumes in a specific compartment; allows the service to manage the block volumes in the customer's compartment
  "allow any-user to manage volumes in compartment ${var.service_bv_compartment_name} where all { request.principal.id='${var.customer_ocid}', request.principal.type='storageFactory' }",
]

Design Details
Enabling the Volume Factory Resource Type
Current Service Principals

storage-factory-dev - tenant is that of sfcpinfratest; this one does not have a separate hosting tenancy ID.
  allowed_dg_for_s2s = ocid1.dynamicgroup.oc1..aaaaaaaaisrsltwsrirwl4m7ls7mt5lex5jnrx3a5hfqzwfeeomrve65gzya - sf_control_plane_dg. Allows the control plane host to acquire service principals.
  resource_types_allowed = dev_storageFactory
storage-factory - hosting tenant is sfcpbeta, tenantId - sfcpspbeta
  allowed_dg_for_s2s = ocid1.dynamicgroup.oc1..aaaaaaaakkmtybgz4ypol7zs2khwynhrtyop54lhczvrqybadq2im2mtbstq
  resource_types_allowed = beta_storageFactory

This claims the resource type for our service tenancy and allows our control plane hosts to act as the "service" by obtaining Service Principals using:

S2SAuthenticationDetailsProvider.builder().useInstancePrincipals().build();

resource "serviceprincipal_service_principal" "sfcp_service_principal" {
  service_name      = var.service_principal_name
  # Without a separate service principal tenancy, the service tenancy is the tenant and there is no hosting tenant
  tenant_id         = (var.service_principal_tenancy_id == "") ? var.tenancy_ocid : var.service_principal_tenancy_id
  hosting_tenant_id = (var.service_principal_tenancy_id == "") ? null : var.tenancy_ocid
  cm_link           = var.cm_link
  owner_email       = var.owner_email
  properties = {
    allowed_dg_for_s2s     = oci_identity_dynamic_group.sf_control_plane_dg.id
    resource_types_allowed = var.allowed_resource_types
  }
  depends_on = [oci_identity_dynamic_group.sf_control_plane_dg, oci_identity_policy.sf_service_principal_policies]
}

With this complete, the compute instances in our control plane can use our Service Principals to create and sign "Resource Principal Session Tokens" (RPST).
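The stage-scoped resource type names used above (the "resource_prefix" list in the Design Summary and the resource_types_allowed values) can be captured in a small lookup. The following is a minimal sketch only: the class and method names are hypothetical and are not part of the SFCP code base; only the tenancy names and resource type strings come from this page.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: resolves the stage-scoped resource type for a stage tenancy. */
final class StageResourceTypes {
    // Tenancy name -> resource type, exactly as listed in the Design Summary.
    private static final Map<String, String> TYPE_BY_TENANCY;

    static {
        Map<String, String> m = new LinkedHashMap<>();
        m.put("sfcpinfratest", "dev_storageFactory");
        m.put("sfcpbeta", "beta_storageFactory");
        m.put("sfcppreprod", "preprod_storageFactory");
        m.put("sfcpprod", "storageFactory");
        TYPE_BY_TENANCY = Collections.unmodifiableMap(m);
    }

    private StageResourceTypes() {
    }

    /** Returns the resource type for the given stage tenancy, or fails for an unknown tenancy. */
    static String resourceTypeFor(String stageTenancy) {
        String type = TYPE_BY_TENANCY.get(stageTenancy);
        if (type == null) {
            throw new IllegalArgumentException("Unknown stage tenancy: " + stageTenancy);
        }
        return type;
    }
}
```

Failing on an unknown tenancy, rather than falling back to the unprefixed production name, is a design choice of this sketch so that a misconfigured stage cannot silently claim the production resource type.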

Customer Tenancy Registration
Customer registration is a two-step process. The first step is to create a StorageFactory definition in their compartment, and the second is to associate a CustomerRegistration with that StorageFactory.

In order to perform this step they will need to authorize shepherd to manage StorageFactory and CustomerDefinition resources in their compartment. In order to interact with FactoryVolumeDefinition and FactoryVolume they will need to authorize the WFaaS worker node instances in their compartment as well.

Control Plane Integration
The Control Plane hosts are capable of acting with Service Principals on behalf of the service. A new ResourcePrincipalManager internal service is introduced that is responsible for retrieving a Resource Principal Session Token for each customer "Volume Factory" and maintaining a cache of those tokens. The example from the Spruce Auth Libraries provides the model on which this service is to be based. The session tokens allocated are good for 12 hours and are managed with a local cache refresh policy of every 6 hours to ensure we are always up to date.

When executing WFaaS actions on behalf of a customer, a WorkerFunctionInterceptor will acquire the latest token and push it onto the thread that is executing the request. The use of a thread local variable here simplifies access to this resource principal by allowing any services to which the workflow delegates to acquire a client that is usable only for that customer tenancy. In addition to the ResourcePrincipal itself, the thread local variable includes a context that enables the thread to access the key material that is used in the signing process. This will be used to maintain this same information on the data plane instance.
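
The caching and thread local behaviour described above can be sketched as follows. This is an illustration under stated assumptions, not the ResourcePrincipalManager implementation: the class name, the token fetcher function and the injectable clock are hypothetical, and the token is modelled as a plain string. Only the 6-hour refresh interval (against a 12-hour token lifetime) and the use of a thread local come from this page.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.function.Supplier;

/** Hypothetical sketch of a per-customer RPST cache with a 6-hour refresh policy. */
final class RpstCacheSketch {
    // Tokens are good for 12 hours; refreshing after 6 hours keeps a valid one in hand.
    static final Duration REFRESH_AFTER = Duration.ofHours(6);

    /** A cached token together with the time it was fetched. */
    private static final class Entry {
        final String token;
        final Instant fetchedAt;

        Entry(String token, Instant fetchedAt) {
            this.token = token;
            this.fetchedAt = fetchedAt;
        }
    }

    // Token pushed onto the worker thread for the duration of one action.
    private static final ThreadLocal<String> CURRENT_TOKEN = new ThreadLocal<>();

    private final Map<String, Entry> cache = new ConcurrentHashMap<>();
    private final Function<String, String> tokenFetcher; // assumption: Storage Factory OCID -> RPST
    private final Supplier<Instant> clock;               // injectable so the policy can be tested

    RpstCacheSketch(Function<String, String> tokenFetcher, Supplier<Instant> clock) {
        this.tokenFetcher = tokenFetcher;
        this.clock = clock;
    }

    /** Returns the cached token, fetching a new one when none exists or it is 6 hours old. */
    String tokenFor(String storageFactoryOcid) {
        Instant now = clock.get();
        Entry entry = cache.compute(storageFactoryOcid, (key, existing) -> {
            boolean stale = existing == null
                    || !now.isBefore(existing.fetchedAt.plus(REFRESH_AFTER));
            return stale ? new Entry(tokenFetcher.apply(key), now) : existing;
        });
        return entry.token;
    }

    /** Runs an action with the customer's token bound to the current thread, then clears it. */
    <T> T runAs(String storageFactoryOcid, Supplier<T> action) {
        CURRENT_TOKEN.set(tokenFor(storageFactoryOcid));
        try {
            return action.get();
        } finally {
            CURRENT_TOKEN.remove();
        }
    }

    /** Token bound to the current thread, or null when no customer action is running. */
    static String currentToken() {
        return CURRENT_TOKEN.get();
    }
}
```

Clearing the thread local in a finally block matters when worker threads are pooled: without it, a thread could carry one customer's token into the next customer's action.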

Details
Every request to the SFCP is associated with a particular compartment that is owned by the customer. This target compartment is evaluated to determine the OCID of the tenancy for which the request is being performed. These two elements are stored as common arguments to all workflows.

When loading the state from Kiev in the WorkflowStateInterceptor, these properties are propagated onto the thread using ResourcePrincipalContext support.

Data Plane Integration
Each Data Plane host is managed as part of a Data Plane Instance Pool that is created on behalf of a customer. For that reason we can safely push the required signing material onto the hosts and be assured that they are segregated from other customers. To enable this interaction, a new resourcePrincipal resource is added to the SFCP Agent API. This API allows the Control Plane Nanny process that maintains the Data Plane instances to ensure that the required key material is seeded onto the Data Plane hosts. By taking this approach, the code on the Data Plane can simply use the normal ResourcePrincipalAuthenticationDetailsProvider when calling OCI services to interact with the customer tenancy.

The API is responsible for establishing the system properties that are used whenever any process is launched by the agent.

OCI_RESOURCE_PRINCIPAL_PRIVATE_PEM
OCI_RESOURCE_PRINCIPAL_PRIVATE_PEM_PASSPHRASE
OCI_RESOURCE_PRINCIPAL_RPST
OCI_RESOURCE_PRINCIPAL_REGION

The API will accept the signing material and write it to the correct location on the host such that these environment variables resolve.

Reference Implementation
A library from the security products team that seems to be getting some adoption and provides a solution that can simply be imported from Maven.

https://bitbucket.oci.oraclecorp.com/projects/SPRUCE/repos/auth-providers-lib/browse

Our data plane code for handling Resource Principal is now in this directory:
https://bitbucket.oci.oraclecorp.com/projects/ODA/repos/bots-server-infra/browse/bots-runti[…]main/java/oracle/cloud/bots/security/resourcePrincipal
The main class is here:
https://bitbucket.oci.oraclecorp.com/projects/ODA/repos/bots-server-infra/browse/bots-runti[…]ity/resourcePrincipal/ResourcePrincipalV21Manager.java
Our control plane code for handling Resource Principal has not moved, but for reference, it is here:
https://bitbucket.oci.oraclecorp.com/projects/ODA/repos/oda-control-plane-app/browse/servic[…]incipaltokenservice/ResourcePrincipalTokenService.java

Another good example:

https://bitbucket.oci.oraclecorp.com/projects/DI/repos/dicom/browse/common/dicom-security/src/main/java/com/oracle/dicom/resourceprincipal/ResourcePrincipalSessionTokenGenerator.java

Created by Amar Thangavel Balakrishnan, last modified on Jun 28, 2024
Overview
This wiki captures the new design of Resource Principals in SFaaS.

Problem Statement
In the existing setup, the Control Plane requests the RPT and RPST. These are then sent to the Data Plane through the agent endpoint in the Data Plane. That is, the Control Plane calls the Data Plane endpoint and passes the RPST blob to the Data Plane. The Data Plane in turn sets the environment variables below before invoking a Data Plane tool such as mv-create.

Environment Variable List
OCI_RESOURCE_PRINCIPAL_PRIVATE_PEM
OCI_RESOURCE_PRINCIPAL_PRIVATE_PEM_PASSPHRASE
OCI_RESOURCE_PRINCIPAL_RPST
OCI_RESOURCE_PRINCIPAL_REGION

[Figure: existing token flow. Components shown: Compute API, PKI Server, PKI Agent, Instance Metadata Service, Identity Data Plane, Block Volume API, SFaaS Service Tenancy (Control Plane, Work Flow End Point), External Services/Endpoints, Token Service, OCI SDK, Data Plane Host (Agent End Points, OCI SDK)]
1. provision/renew cert
2. ip info
3. instance info
4. cert
5. cert
6. cert
7. instance principal based on cert
8. IPST
9. getRPT(IPST)
10. AuthZ request(IPST)
11. AuthZ response
12. service cert
13. SPST
14. RPT, SPST
14. retrieve tags (Compute API)
15. getRPST(RPT, SPST, IPST)
16. RPST
17. RPST Blob & DP Payload
18. BV request(RPST)
19. AuthZ request(RPST)
20. AuthZ response
21. BV response

Limitations
The Control Plane has to push the new service token (RPST) to the Data Plane every time.
If the service token expires in the Data Plane, the Data Plane should return an error code to the Control Plane; the Control Plane then gets a new service token and calls the Data Plane endpoint to pass the service token and proceed.
Debugging in the Data Plane VMs is difficult, as there is no service token available in the Data Plane.

Proposed Solution
Expose a new endpoint in the Control Plane to get the RPST blob. Whenever the Data Plane needs a service token, it calls the Control Plane endpoint to get one. For this, the Data Plane should keep the Control Plane endpoint URLs in environment variables.

Environment Variable List
OCI_RESOURCE_PRINCIPAL_RPT_ENDPOINT
OCI_RESOURCE_PRINCIPAL_RPST_ENDPOINT

Created by Muthu Puranam, last modified by Faizaan Qureshi on Apr 25, 2024
Use cases
1. Debug an existing job:
1.1. Requirement:
For a target FVJob (that may have something unusual going on, and needs debugging), developers should have the ability to hold the DPInstance assigned to this job, such that the instance does not get recycled or reused.

1.2. Implementation Approach:
On the DOPE UI, there will be a new Debug DP Instance button on the Factory Volume Details page (and a new corresponding Ops API for the FactoryVolume resource). This button will only be enabled if the Factory Volume is in CREATING state.

API end point: /factoryVolumes/{factoryVolumeId}/actions/reserveDpInstance

In the API implementation, to turn on debug mode, validate that the job corresponding to the FactoryVolume resource is still RUNNING (i.e. the FactoryVolume resource is still CREATING). If validation is successful, the assigned DPInstance will get the ReservedForDebug flag set (with timeReserved set to now). Effectively, when the code that releases DPInstances for COMPLETED/FAILED jobs sees a DPInstance with the ReservedForDebug flag set, it will skip the instance release, so developers will have this instance held for debugging.

reservedForDebug = true;
timeReserved = now;
timeReleased = 0;
reservedBy = ;
releasedBy = null;
reservationDuration = 24; // the number of hours this reservation can last - unless the instance is actively processing a job

Note: The same ReservedForDebug flag should be considered during the OS Image and Agent version checks, in order to skip recycling of this instance.

2. Reserve an Available instance for testing:
2.1. Requirement:
Based on the idea that all Instances in a fleet are the same, the user will have an option to reserve an Available Instance from an Instance Pool of their choice.

2.2. Implementation Approach:
On the DOPE UI, the user will have an option to call Reserve Available Instance, either on the Instance Pool list page (by clicking the 3 dots on the right side of their Instance Pool entry) or on the Instance Pool Details page (by clicking a button under Actions, at the top).

API end point: /dpInstancePool/{dpInstancePoolId}/actions/reserveAvailableInstance

In the API implementation, a new FVJob record will be created with:

FactoryVolumeJobDetails {
  ...
  workflowVersion = null;
  factoryVolumeType = reserveAvailableInstance;
  resourceName = ;
  ...
}

When InstanceAssignmentNanny sees a QUEUED job with the above values, it will know that this is not an FVJob for volume creation. Instead, it finds an Available Instance from the pool that does not already have the ReservedForDebug flag set, and then sets the flag on that Instance (with timeReserved set to now), so that this Available Instance is not assigned to any job other than the one that has the OCID for this instance as part of the request payload (see #3 for details).

3. Start a new job on a particular instance:
3.1. Requirement:
In the CreateFactoryVolume POST request, the user should be able to pass the OCID of a reserved instance of their choice, as an override in FreeForm tags.

3.2. Implementation Approach:
In the API implementation, add validation code in the createFactoryVolume() method in the FactoryVolumeService class. It will check that the requested Instance has the ReservedForDebug flag set and is still in Available state; otherwise it returns a user-friendly error stating the validation failure and does not create any new FVJob entry.
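
The validation described in 3.2 can be sketched as below. This is a minimal illustration under assumptions: the DpInstance type, the LifecycleState enum, the method name and the error messages are hypothetical stand-ins for the real DAL model and for FactoryVolumeService; only the two checks (ReservedForDebug set, instance still Available) come from this page.

```java
import java.util.Optional;

/** Hypothetical sketch of the reserved-instance override check for createFactoryVolume(). */
final class ReservedInstanceValidation {
    /** Assumed subset of DPInstance lifecycle states relevant to this check. */
    enum LifecycleState { AVAILABLE, ASSIGNED }

    /** Minimal stand-in for the DPInstance record. */
    static final class DpInstance {
        final String ocid;
        final boolean reservedForDebug;
        final LifecycleState state;

        DpInstance(String ocid, boolean reservedForDebug, LifecycleState state) {
            this.ocid = ocid;
            this.reservedForDebug = reservedForDebug;
            this.state = state;
        }
    }

    private ReservedInstanceValidation() {
    }

    /**
     * Returns an empty Optional when the override is valid, otherwise a
     * user-friendly error message. The caller creates no FVJob on error.
     */
    static Optional<String> validateOverride(DpInstance requested) {
        if (requested == null) {
            return Optional.of("The requested instance was not found.");
        }
        if (!requested.reservedForDebug) {
            return Optional.of("Instance " + requested.ocid + " is not reserved for debug.");
        }
        if (requested.state != LifecycleState.AVAILABLE) {
            return Optional.of("Instance " + requested.ocid + " is not in Available state.");
        }
        return Optional.empty();
    }
}
```

Returning the message instead of throwing keeps the decision to reject the request, and the choice of HTTP status, with the API layer.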

If API validation succeeds, an FVJob will be created as it is today. When InstanceAssignmentNanny is getting an instance for an FVJob that has the reserved instance override set in FreeForm tags, it will get that exact DPInstance and assign it to the job.

4. Release Reserved Instance:
4.1. Requirement:
There needs to be a way to release a reserved instance, i.e. allow it to get recycled or reused like a regular instance, based on other details.

4.2. Implementation Approach:
Note: A reserved instance will change its lifecycle states like a regular instance, except that it will not be included in the Agent Version and Image Version checks, and it will not be picked up by the termination process, as long as the ReservedForDebug flag is set to true. So effectively the system will only allow reserving an Instance that is in Available / Assigned state, and once reserved, the instance can only move back and forth between the Available and Assigned states until the reservation ends. When the reservation ends, the instance should always go towards Termination and should never be reused.

There will be 2 routes to reset the ReservedForDebug flag back to false, and also set timeReserved to 0.

4.2.1. Reservation Expires
During the Instance cleanup step of InstancePoolScalingNanny, it will check whether a DPInstance with the ReservedForDebug flag set to true has a timeReserved that is older than the reservation hours set for the instance (i.e. the allowed reservation duration has expired), and if so update:

reservedForDebug = false;
timeReleased = now;
releasedBy = "InstancePoolScalingNanny";
reservationHours = 0; // the number of hours this reservation can last - unless the instance is actively processing a job
lifecycleState ->
  if (Available) {set TerminateNow;}
  if (Assigned) {set TerminateLater;}

4.2.2. Reservation Released by user
The user can call release on an instance (i.e. within the allowed duration of the reservation period). The DOPE UI will have an option on the Instance Details page, or on the Instance list page, to call Release Instance Reservation.

API end point: /dpInstance/{dpInstanceId}/actions/releaseReservation

In the API implementation, simply set the following values on the DPInstance, and allow existing automated processes to further progress the instance in its life.

reservedForDebug = false;
timeReleased = now;
releasedBy = ;
reservationHours = 0; // the number of hours this reservation can last - unless the instance is actively processing a job
lifecycleState ->
  if (Available) {set TerminateNow;}
  if (Assigned) {set TerminateLater;}

5. Extend Reservation Period:
5.1. Requirement:
There needs to be a way for users to extend the current reservation on an instance.

5.2. Implementation Approach:
The DOPE UI will have an option on the Instance Details page, or on the Instance list page, to call Extend Instance Reservation (only visible for DPInstances that have the reservedForDebug flag set to true).

API end point: /dpInstance/{dpInstanceId}/actions/extendReservation

In the API implementation, first validate that this instance has the reservedForDebug flag set to true, then simply add 24 more hours to the reservation field on the DPInstance:

reservationHours += 24; // the number of hours this reservation can last - unless the instance is actively processing a job

6. DAL Model changes:
Additional fields for DPInstanceDetails:

DPInstanceDetails {
  ...
  boolean ReservedForDebu
