Cognizant Crisis Ops Team Site Communications Protocol PDF

Summary

This document outlines communication protocols for the Cognizant crisis ops team. Protocols cover crisis chat, paging, and communications with Googlers/local leads, along with detailed procedures for various scenarios.

Full Transcript


Crisis ops team site: Cognizant
Self link: go/cognizant-crisis-ops

Communications

Crisis chat protocol
- Begin pinging in Crisis Chat after sending an email that requires attention (use @). Use @ when asking questions directed to an IC. If there is no response after 30 min, send a direct ping. If there is no response to the direct ping after another 30 min, page. Always page immediately if the issue is time-sensitive (a timing sketch follows this section).
- Copy/paste screenshot images directly into Crisis Chat rather than linking to the screenshot. ICs now need extra sign-in authentication on mobile devices to view screenshots via links, so embedding the image directly in chat lets everyone view it immediately.
- If an IC doesn't provide a solid rationale for a triage decision, follow up to ensure we provide one in email comms and Rocking Horse (drives go/respond).
- If there is back-and-forth in Crisis Chat about multiple incidents at the same time, try to preface a new ping with the alert you're referring to (when it makes sense), e.g. "air quality maps #published" or "hurricane delta map update LGTM" rather than just "published."
- For non-emergencies, e.g. suggesting turndown, assume primary ICs won't be online in the early A.M. hours. Try to hold off on pinging the ICs directly until business hours in these cases; there is no need to ping the secondary unless a page rolls over or it's an emergency.
- When posting information about a detected event in go/crisis-chat, please include the RH suggestion for quick reference.

Paging protocol
To mitigate a scenario where both the primary and secondary oncall fail to respond to pages, we are implementing the following ops protocol:
- If no one responds to an escalation (primary or secondary IC) for more than 10 min, generate another manual escalation to the rotation.
- If there is no response on the second attempt, page the IC that last handed off in the other rotation timezone (e.g. if Aynsley handed off to Ryan, page Aynsley after two failed escalation attempts).

Scenarios when you should page the IC:
- A Googler is asking to respond to an incident: page the primary IC oncall (even if you disagree).
- In case of a malfunction in a specific language: for English or the local language, page the on-call IC according to the current process.
- Escalating for UNKNOWN responses that meet launch criteria.
- For Friday overnights with a U.S. IC, please don't page unless it is a manmade/violence event (check in with the IC in the morning).

- The max SLA for live SOS content to go live on SRP with troubleshooting is 15-20 min. If it takes longer for a change to surface, page Eng oncall.
- If there is an issue affecting a live alert (e.g. the alert suddenly disappears from SRP) and an IC is unavailable via Chat, page the IC and they can decide whether to page Eng. If they do, ICs can then coordinate directly with Eng oncall on the issue.
- To page Eng on-call, go to the Eng on-call site, click "How do I page the oncall?" and follow the instructions. When paging an Eng oncaller, include in the text of the page something like "Please update go/crisis-chat with progress, and if you need more details." This ensures the oncaller communicates in the proper channel and closes the loop for Ops.
- Eng oncall SLA is 30 min, so ping them directly if it's been 1 hr with no response.
- When in doubt, please page ICs! If you are dealing with a time-sensitive question/issue, it's better to ring their phone right away than to wait for a response to a chat tag.
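The ping-and-page timings above (30-minute chat tag window, 30-minute direct-ping window, then a page, plus the 10-minute double-escalation rule) can be summarized as a small decision helper. The sketch below is purely illustrative: the function names and return strings are hypothetical and are not part of any Crisis Ops tooling.

```python
from datetime import timedelta

# Illustrative sketch of the chat escalation ladder described above, assuming
# we only track how long a question has gone unanswered. Not Crisis Ops tooling.

CHAT_TAG_WAIT = timedelta(minutes=30)        # after @-tagging the IC in Crisis Chat
DIRECT_PING_WAIT = timedelta(minutes=30)     # after a direct ping
ESCALATION_ACK_WAIT = timedelta(minutes=10)  # after a manual escalation with no ack

def next_chat_step(elapsed: timedelta, time_sensitive: bool) -> str:
    """Return the next action for an unanswered question directed at an IC."""
    if time_sensitive:
        return "page"                        # time-sensitive issues go straight to a page
    if elapsed < CHAT_TAG_WAIT:
        return "wait"                        # still within the 30 min chat-tag window
    if elapsed < CHAT_TAG_WAIT + DIRECT_PING_WAIT:
        return "direct ping"                 # 30-60 min with no response: ping directly
    return "page"                            # >60 min with no response: page

def next_paging_step(failed_escalations: int) -> str:
    """Double-escalation rule when neither primary nor secondary acks within 10 min."""
    if failed_escalations == 0:
        return "manual escalation to the rotation"
    if failed_escalations == 1:
        return "second manual escalation to the rotation"
    return "page the IC who last handed off in the other rotation timezone"

# Example: one hour with no response to a non-urgent question -> page.
print(next_chat_step(timedelta(hours=1), time_sensitive=False))  # "page"
```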
Rocking Horse Mailer Protocol
- Try to send mailer comms for post-escalation triage calls as soon as the decision has been made, so stakeholders know what action we'll be taking as early as possible and we can properly track latencies.
- When sending emails for US-based incidents, there is no need to manually copy/paste local leads as CCs (they should all already be in existing groups).
- When starting an escalation, please specify in Crisis Chat whether or not it is an auto-response and, by extension, whether or not an IC will need to be involved for launch approval.

Email Alias: [email protected] (Crisis Ops + Aynsley, Ryan)

ENG & IC ON-CALL INFO
- Eng on-call site
- IC on-call site

Comms with Googlers / Local Leads
- When local Googlers jump onto Crisis Chat to provide support or request a change to an alert, best practice is for Ops shift leads (if online) to act as the main POC. Sometimes an IC will take the lead in comms from Googlers, but if they're unavailable or not very responsive, you have the expertise to let Googlers know whether a request falls within policy or not, e.g. SOS naming conventions, accounts/resources to include, etc. Feel free to use these templates for guidance.
- If an ask is more of a judgment call, e.g. sensitivities around deciding whether the incident should be named a 'bombing' vs. 'explosion', an IC should approve. You can always reach out to an IC over chat directly if you're unsure how to handle requests/asks/push-back from Googlers.
- You can refer to this updated doc with Crisis Response & partner team POCs (to be added to the team site) for context on what role a Googler might have. If someone is not on this list, it most likely means they're a local lead and/or someone who isn't on the core team and may not be familiar with policy or process.
- If a Googler requests access to the impressions dashboard or another seemingly sensitive document, ping the IC to handle it directly with the Googler.
- If someone replies to one of your emails requesting a change to go/sos-contacts or a mailing list, please reply to the requester and tag aynsley@ in an email to make those changes. Be sure to remove everyone else from the to: and cc: lists, since we don't want to spam the sos-announce groups. Occasionally, the contact will only reply to the person sending out the Monitor/Escalation/Launch/etc. email, so be sure to check those every so often, since it's easy for those to get lost in all of the communications we send out.
- Marketing may reach out with HPP questions - stable links: typically the alert title query in English or the local language, searched organically. Triggering is based on the user's device language settings (you can use the translated title as the query for the local language).

Daily Responsibilities

Getting Settled
1. Drop a line in the Cognizant group chat showing you are online and starting your shift.
2. Prepare the Crisis Response sync doc for your incoming shift.
3. Open the task tracker and keep it updated throughout your shift.
4. Meet with the previous shift for the handoff sync to discuss logged crises, crises that may worsen, etc.
5. Check your email for any Ack Updates and weekly bulletins from Aynsley. Make sure to read them thoroughly.
6. Ensure all SLS email threads, GDE escalations and Twitter road closure bugs have been closed out.
7. Check the Primaries chat and sync doc for handoff notes from the previous shift.

Detecting
1. Keep Rocking Horse open and make sure notifications are enabled on your desktop. Claim all RH signals within 5 minutes.
2. Have Dataminr open and log incidents that have not been picked up algorithmically by Rocking Horse.
3. Use other detection sources throughout your shift, such as InciWeb, NHC, JTWC, Pagasa, etc.

Triaging
1. Perform an incident monitoring sweep in Rocking Horse periodically (2-3 times per shift) to stay up to date on what is being logged and what we should keep an eye on (for potential later escalations).
2. Reply to any Report an Incident Form submissions that may come through.
3. Send a Monitor email for any incident scoring Unknown.
4. Escalate and launch SOS Alerts as needed.
5. If a card is launched, complete the Postmortem summary in the rollup after launching an SOS Alert.

Live SOS Alert monitoring
1. Complete the SOS QA log: within 5 minutes of launching a card, and every 3 hours afterwards. Complete Triggering QA immediately after launch, 3 hours after launch, and after turndown. If we have a live alert for a hurricane (or another moving crisis), check GDACS every 3 hours for the accurate location of the eye of the storm. (A cadence sketch follows the Backup Crisis Detection reminder below.)
2. Research help & information throughout the lifecycle of an SOS Alert.
3. Take screenshots of the SOS Alert.
4. Suggest turndown of SOS Alerts that are in the recovery phase or have dipped below threshold metrics.

Turning down an SOS Alert
1. If an alert meets our turndown criteria, suggest turndown in the Crisis Response Ops chat. If an IC agrees, send the turndown email.
2. Wait 60 minutes, and if no one disagrees with the decision, complete all turndown responsibilities.

Ack System
The acking process involves acknowledging 2 things:
1. Ack Updates: may come out more frequently and unscheduled. Example
2. Aynsley's bulletins: these bulletins are typically sent on Friday and should be reviewed/ack'd as soon as possible - avoid going multiple days without responding to this email (unless you are OOO). Example

Feedback forms
- Positive User Feedback/Media Reports form: if you come across exceptionally positive feedback via Listnr or you read a news report that mentions our SOS Alert, please fill out this form.
- Report Graphic Imagery form: please use this form to report graphic imagery (photo, video, GIF, etc.) found when performing crisis response workflows in Dataminr.

Proactive Projects
Refer to the list below for current proactive projects. Please note that crisis workflows take precedence, but you are welcome to contribute to these projects during downtime. *If you have ideas for potential proactive projects, let your Shift Lead know!

Current proactive projects
1. SLS Crosscheck. Goal: to reduce dependency on SLS by leveraging the existing Google Translate formulas for triggering workflows. Those interested in this project can work in any of the language tabs, inputting ldap in col A and whether the query should trigger in col L. [see screenshot]
2. H&I Database. Goal: to compile relevant and authoritative Help & Information. Those interested in this project can refer to this playbook for current instructions. The 'To-Do' tab is of utmost priority.
3. H&I Research Project. Goal: to compile H&I for limited coverage areas. Those interested in this project can follow the instructions on the overview tab.
4. Extreme Weather Monitoring Pilot

Current Trackers
- Crisis Response Time Tracker
- Time Tracker Guide
- Task Tracker

Backup Crisis Detection
Reminder that you should have Rocking Horse open at all times so that any incoming signals can be immediately claimed and triaged asap.
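The live SOS Alert monitoring cadence above (SOS QA log within 5 minutes of launch and every 3 hours afterwards; Triggering QA at launch, 3 hours later, and at turndown; GDACS checks every 3 hours for moving crises) can be laid out as a simple schedule. The sketch below is a hypothetical illustration only; real checks are recorded in the SOS QA log, not generated by code.

```python
from datetime import datetime, timedelta

# Hypothetical sketch of the live SOS Alert monitoring cadence described above.
def qa_schedule(launch_time: datetime, hours_live: int, moving_crisis: bool = False):
    """Return (time, task) pairs for the first `hours_live` hours of a live alert."""
    tasks = [
        (launch_time, "Triggering QA (immediately after launch)"),
        (launch_time + timedelta(minutes=5), "SOS QA log (within 5 min of launch)"),
        (launch_time + timedelta(hours=3), "Triggering QA (3 hours after launch)"),
    ]
    for h in range(3, hours_live + 1, 3):
        tasks.append((launch_time + timedelta(hours=h), "SOS QA log (every 3 hours)"))
        if moving_crisis:  # e.g. a hurricane: re-check GDACS for the eye of the storm
            tasks.append((launch_time + timedelta(hours=h), "Check GDACS for storm location"))
    # Triggering QA is also repeated once more after turndown (time not known in advance).
    return sorted(tasks)

# Example: the first 9 hours of a hurricane alert launched at 12:00 UTC.
for when, task in qa_schedule(datetime(2024, 5, 8, 12, 0), hours_live=9, moving_crisis=True):
    print(when.strftime("%H:%M"), task)
```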
Incident detection: Crisis Ops is responsible for detecting and logging all natural and manmade disaster event types, while also merging the appropriate signals.

Incident Monitoring: Crisis Ops has taken on the responsibility of updating all crises logged within the Monitoring tab. In general, you can expect to check fast-developing situations like active shooters or terrorist attacks every 5-10 min at the start of the event, whereas slower-moving crises like hurricanes can be checked every hour or so, depending on severity.

GDE escalations: These bugs should be filed for all unlaunched natural disasters in the Monitoring tab that involve high priority roads (primary highways and secondary roads) impacted by the event. When performing a monitoring sweep, be sure to pay close attention to any incidents that may qualify. The bug template can be found at go/crisis-monitor-road. When filing, be sure to fill in all fields with the necessary information.

GDE Escalations
The bug template can be found at go/crisis-monitor-road. Crisis Ops should file a GDE escalation for natural disasters through go/crisis-monitor-road when incidents we are monitoring are affecting major roadways or causing major disruption to an affected area. Bugs should be filed when High Priority Roads (HPR) are impacted. These include:
- Primary highways (roads that form the main network of city-to-city routes)
- Secondary roads (roads that act as connectors a) between Primary Highways or b) from Major or Minor Arterial roads to Primary Highways)
- Limited Access (high-quality roads that are accessible from both ramps and intersections)
- Controlled Access (dual-carriageway roads that can only be accessed via ramps)
Fill out the bracketed text on go/crisis-monitor-road and click create. Use this example bug as reference. Once you file, an email will appear with the subject "[Crisis - Monitor - Road updates] ".
NOTE: We should not use intel sourced from Dataminr for GDE escalations or the Twitter road closure workflow.

Logging incidents in Rocking Horse
Event/Incident Naming Convention: [geographic area] + [crisis type]
- Geographic Area = smallest affected area
- Crisis Type = name of the type of event occurring
- Examples: Namie 4.9M earthquake, Collier County wildfires, Brooklyn floods, Hurricane Harvey
Crisis Type: Select the crisis type from the dropdown menu that most accurately represents the incident.
Location: When typing in a location for a log, 10 MID suggestions are displayed. Clicking on a location MID links to KG. After selecting from the dropdown, you can click the MID link to view the data in Hume to ensure the correct location is selected. If there is no data for a location, you can select "freetext" and manually input the country code.
Event Start Time: Time the event started in UTC (military time). Example: 05-29-2019 17:20
Directly Affected: Estimate of how many people may become injured, killed, displaced or evacuated. When we don't know the exact number of people but we know the number of households evacuated, we can multiply that number by 3. For example, if 40 households were displaced in a flood, we can assume that affects ~120 people.
Indirectly Affected: Estimate of how many people may have their commutes/flights disrupted, school/office closed, utilities interrupted, etc.
Is the Incident Contained: Is the situation under control or is it ongoing?
Media Coverage:
- Local: limited to news outlets local to where the crisis is occurring
- National: limited to media outlets based in the country where the crisis is occurring
- International: the crisis is featured on the home page or "world" page of preferred media outlets (CNN, BBC, AFP, AP, Reuters, Al Jazeera and NYT)
When in doubt about what level of coverage an article/media outlet represents, Google the media outlet in question for a better understanding of what it covers.
Reports are breaking news: (Man-made incidents only) Did the crisis occur within the past hour or so? If yes, mark True.
Lead Time or Latency: Determine the progress status of the incident by asking the following questions: How much lead time do we have on the crisis? What is the likely urgency for us to act? Is the crisis in progress? Well in advance? Or has it passed?
Est Long Term Impact: Determine whether the incident will have a long-term impact by asking the following questions: Will local infrastructure be permanently damaged? How long will it take this area to recover after the crisis has passed?
Notes: (Manual template below)
Summary of incident: ...
Directly Affected Rationale: ...
Indirectly Affected Rationale: ...
Latest: ...
International: [link if applicable]
Infrastructure issues/damage, e.g. highway closures:
Wildfire template:
Containment: XX% ; Acres: XXX
Mandatory evacuation: # XX ; Evacs under advisory: # XX ; Time of last evac notice: MM/DD XX:XX UTC
Visualization?: Yes/No

Merging signals in Rocking Horse
Sometimes, incidents coming into RH will relate to an event that has already been logged and therefore do not require a manual signal. In these instances, we will merge the signals.
How to merge new signals:
1. Navigate to the New tab and locate the signal that needs to be merged
2. Set the Triage Label of this signal to the same Triage Label as the incident you'd like to merge it with
3. Navigate to the tab of the Triage Label you selected (Monitor, Launched, Archived)
4. Select the two boxes in the left-hand corners of the signal and the original log
5. Once both are selected, click the merge button
How to unmerge a signal:
1. Locate the NC4/manual/RTB signal you wish to unmerge
2. Select the unmerge button
3. Proceed to triage the unmerged signal
4. Edit the unmerged manual signal title to include "duplicate" and set the triage label to "No Response - Noise"
Additionally, Rocking Horse contains an auto-clustering feature that is intended to automatically suggest incident signals that appear to be similar and might need to be merged. You can find additional context about this feature in the Rocking Horse Playbook and the Auto-Clustering Suggestion playbook.

Monitoring Incidents in Rocking Horse
Check each crisis in the Monitor tab for new developments and to see if any of the previously logged fields related to the event have changed, such as "# Directly Affected." If anything has changed, update the notes, update the field in RH and take the appropriate action.

Monitoring protocol
1. Thoroughly and routinely check the status of the crisis through Dataminr, news websites, and other sources for developments
2. For fast-developing situations like active shooters or terrorist attacks, a general rule is to check every 5-10 min. For slower-moving crises like hurricanes, a general rule is to check every hour or so
3. Leave up-to-date notes / delete any irrelevant information
4. Use local language queries for certain incidents to get better, more updated information
5. Leave links to no more than the 3 most recent articles.
Please follow the note template of leaving articles at the bottom of the note section and inputting any international articles.
6. Look for the 'green eye' symbol for Notable Updates to NC4 signals. NC4 incidents in Rocking Horse tend to overwrite the same signal with new information. The Notable Updates feature is intended to help Ops identify critical changes that could affect triage decisions or the status of live alerts. See the Notable Updates playbook for more details.
7. Regularly check to see if media coverage changes for an incident, which may trigger a 'Respond' score depending on the crisis and metrics. Local, National and International media coverage can be determined in the following ways:
- Local: limited to news outlets local to where the crisis is occurring; the source is not recognized or published nationally. Examples of local/regional news sources: RDNewsNow, indonews.id, KWQC (Mississippi), etc.
- National: limited to news outlets based in the country where the crisis is occurring that are recognized nationally (or 1-2 international sources are reporting on the incident). Examples of national news sources: Times of India, CNN Indonesia, NZ Herald, The Standard (Kenya), NBC, etc.
- International: the crisis is reported on by 3 or more international sources (CNN, BBC, AP, AFP, Reuters, Al Jazeera, and NYT)

Triage Labels and tabs
New incidents coming into RH will most likely require a Manual Signal (log). If the response recommendation states "Add Manual Signal", then the incident should be logged. The exception to this is for Out of Scope or Noise incidents. Below is the general protocol for OOS/Noisy signals. Always gut-check incidents that may have potential mass casualties and/or international coverage.
- No Response - Out of Scope incidents are militaristic or political incidents, or incidents located in a war-torn/conflict area.
- No Response - Noise incidents are any signals that have no relation to a crisis (e.g. President Biden steps out to get the new Dairy Queen Blizzard; anniversary of the Christchurch mass shooting).
Continue to do your research and gut-checks on whether or not the incident should be moved to the Monitor tab or marked as No Response - Below Threshold from the point of adding a Manual Signal (logging). If we do not foresee the event worsening, we can set the triage label to No Response - Below Threshold directly after logging. The chart below represents the path for triaging incidents that come through RH.

Temporary Asks
NextDoor pilot [5/8/24 - present]
For any U.S. SOS Alert, we are reviewing and adding available NextDoor accounts providing relevant, timely and accurate updates to the Help & Info card. Please review the playbook for more info.

Mix and Match
As of 11/7/23, bushfires in Australia have the capability to use the Mix and Match feature, with PAs and SOS appearing simultaneously. Ops should still escalate such incidents to an IC when a PA is active, as the feature can be glitchy in production.
The Mix and Match LE has officially launched! As a reminder, this experiment will surface an active Public Alerts summary as an additional card within the SOS alert for floods & wildfires on OSRP (mobile) only. Additionally, Cardmaker preview forces this experiment on, so you may see the Public Alert card included in an alert when viewed in Cardmaker preview or if you are in the experiment as a user. Please remember:
- 'Mix and Match' will apply to flood and wildfire alerts, specifically
- 'Mix and Match' will appear based on the crisis types above and the affected area overlap
- The code for this feature works to match the PA that would appear on SRP if there were no SOS alert
For reference on what this will look like, please review the images below:
- Mock instance of Mix and Match appearing on an SOS alert
- Side-by-side comparison showing an alert with and without Mix and Match LE
Please note this card is different from the default Emergency Alerts carousel showing related Public Alerts. If we receive negative feedback about the Emergency Alerts carousel or unrelated alerts are surfacing in the carousel, you can uncheck this card from the alert.
To recap, the only responsibility that lies with Crisis Ops is to keep an eye out for any identifiable negative user feedback (internal or external) pertaining to PAs on the alert. If there is negative feedback present, follow these steps:
1. Disable the PA via this checkbox in the Triggering tab of Cardmaker
2. File a bug in the SOS Maker component to playmobil@ (cc aynsley@, mayaek@, and yaelsp@) detailing the exact rationale for takedown. Additionally, provide screenshots of the negative user feedback used to justify disabling the PA.

Filing Bugs
Please view go/sos-behavior-and-bugs to learn about operational feature behavior and known bugs.

Tool malfunction / product issue
**READ FIRST: For all issues related to tools/SOS products, make sure to loop in the IC to avoid any duplicate bugs. Find bug reporting, triage and management procedures for the SOS product listed here.

SLS form
If the SLS form is malfunctioning, take a screenshot of any relevant details and include timestamps of the error in a group chat with @virginiaf, @heyryan, and @aynsley.

Search
If the IC confirms, file bugs related to how SOS Alerts are served to end-users through Google Search (except for triggering bugs) via go/new-sos-search-bug, e.g. the Local Info tab showing an error.
File via go/bad when you encounter an issue on Search where offensive or irrelevant info is surfacing, e.g. an off-topic Knowledge Panel, news stories that violate policy, etc.
1. When you open go/bad you will be prompted with this question. Select "Found it as part of core work for my product."
2. You will then be prompted with this question. Please select the product where you found the issue, e.g. "Search Features / User Interface", "News", etc.
3. You will then fill out the rest of the form, providing:
   1. Search query (exact text with no extra characters; if there is no query (e.g. News app, Discover), answer N/A)
   2. URL (please add the URL for News- or Discover-related issues)
   3. How you heard about the issue, if applicable
   4. Screenshot(s) of the issue
   5. Search location
   6. Level of severity (*Use this option ONLY if it's an all-hands-on-deck issue that requires immediate triage due to the level of user, PR or brand risk.)

Maps
If you encounter an issue on Maps related to SOS alerts, make sure to follow these steps:
1. Loop in the IC on the issue (if the feedback came through Listnr, make sure to include the link to Listnr) and ask if they would like us to file a bug
2. If the IC confirms, use go/new-sos-geo-bug to assign the bug to the Geo Crisis Team. Include as many details as you can, including screenshots and links
Overview Map Bug: The overview map will not update to reflect the incident title in SOS alerts that have been expanded (e.g. McKinney fire to Siskiyou County wildfires). This is a KNOWN issue - please do not page Eng if this occurs.

Cardmaker / Rocking Horse
If you and/or others are experiencing access issues with Cardmaker or another internal tool, take screenshots and let an IC know.
If the issue persists, file a bug and CC crisis-response-just-ics@. If the issue is work-stopping, e.g. Rocking Horse is down, first file a tracking bug with a screenshot, then page Eng oncall and reference the bug.
1. Cardmaker: File a bug by clicking here. In case of a malfunction in a specific language:
   - English or local language: page the on-call IC according to the current process
   - Other languages: report a Cardmaker bug (go/new-sos-maker-bug) and uncheck the specific language that manifests an issue, e.g. a local lead or Googler reports an issue
2. Rocking Horse: File a bug by clicking here. NC4 failure banner: revert to the manual workflow and page Eng oncall with screenshots. See the playbook for more info.

Public Alerts
If you encounter an issue on the Public Alerts dashboard, make sure to follow these steps:
1. Loop in the IC on the issue
2. If the IC confirms, use go/new-publicalerts-bug to assign the bug to the Public Alerts team. Include as many details as you can, including screenshots and links

Top stories
Ops should now file a go/badstream report if Top Stories/News isn't working post-launch. As a part of the post-launch check process, if top stories are not appearing, please file a go/badstream report [Screenshot], with details for the query and a screenshot. This is detailed as step #10 in the post-launch responsibilities.

GDE Escalations
Ops should file a GDE escalation for natural disasters through go/crisis-monitor-road when incidents we are monitoring are affecting major roadways or causing major disruption to an affected area. Bugs should be filed when High Priority Roads (HPR) are impacted. These include:
- Primary highways (roads that form the main network of city-to-city routes)
- Secondary roads (roads that act as connectors a) between Primary Highways or b) from Major or Minor Arterial roads to Primary Highways)
- Limited Access (high-quality roads that are accessible from both ramps and intersections)
- Controlled Access (dual-carriageway roads that can only be accessed via ramps)
Fill out the bracketed text on go/crisis-monitor-road and click create. Use this example bug as reference. Once you file, an email will appear with the subject "[Crisis - Monitor - Road updates] ".

Road closure twitter monitoring (RCTM) bugs
Once we send a Launched email, this will initiate a workflow on the GDE team. We provide support to GDE by replying to the Buganizer thread with Twitter handles related to road closures in the area. *Note that this bug also sometimes results from a submission on go/crisis-monitor-road.
1. In your Buganizer email filter you will see an email with this subject: [Crisis name] Road Closure Twitter Monitoring. Click into the Buganizer link to reply to the bug and provide Twitter handles. Example thread: https://buganizer.corp.google.com/issues/221771099
2. Crisis Ops will search for official Twitter handles that should be monitored. These accounts must be governmental or authoritative. **Do not source these accounts through Dataminr due to the nature of our contract.
   - Local DOT handle(s) are a good place to start, but we can also include police, fire dept, etc.
   - Avoid news orgs, news sites, and citizens (bias)
   - Ensure that tweets include information about road closures
Please reply to these bugs within the hour we receive them.

Mailer bounceback
If you receive a bounceback like this after sending Mailer comms, it means there's a problem with the GDE component that normally auto-spawns bugs based on 'Launched' emails. This issue should be fixed, but if you get the bounceback again, let aynsley@ or heyryan@ know and then create the alert manually in the geo>data>escalation>crisis response component so it isn't missed. You can use this bug as an example.

Nav polygon
In case of a failure, please override the pre-populated polygon by drawing it manually and flag for an IC to file a bug. Examples of a failure include: the affected area polygon appears broken, the alert can't be displayed, the drawing of the polygon doesn't seem right on the map (i.e. it erroneously covers a neighboring city), etc.

Air Quality Cards
If the AQ card is showing 'unavailable data' with a wildfire alert, a primary should take down the Air Quality card and ask the IC oncall to file a bug in the AQ component here.

Triaging
Always add triage rationales in Rocking Horse any time we get a triage directive from an IC, regardless of whether an email is sent for the incident. Stakeholders and execs view triage rationales via go/respond and they are our record of how we handled certain incidents, so please be comprehensive and write as though the public can view them.
1. Crisis Ops manages the triaging process. They log all incidents in Rocking Horse and assign triage labels.
2. Rocking Horse's algorithm will have recommended an action: No Response, Monitor, or Respond.
   - If the Rocking Horse score is a Respond for a MAN-MADE disaster, let the other primaries know. SOS Crisis Ops should then immediately send the escalation email, followed by the activation email, and start drafting the SOS Alert.
   - If the Rocking Horse score is a Respond for a NATURAL disaster, SOS Alert Ops will NOT send an escalation email UNLESS the natural disaster is:
     - A statewide/region-wide disaster (e.g. California wildfires)
     - An event crossing state lines (e.g. Winter Storm Uri)
     - Merging multiple alerts (e.g. changing to a singular August Complex fire from 3 unique alerts)
     - Re-launching for old events that have restarted (fires, hurricanes, etc.)
3. If Crisis Ops determines that we should not respond to the incident, they will assign the "No Response - Below Threshold or Out of Scope" label in Rocking Horse. This will move the incident to the Archived tab in Rocking Horse.
4. If Crisis Ops determines we should actively monitor the incident, they will assign the "Monitor" label in Rocking Horse and send a Monitor email. This will move the incident to the Monitor tab in Rocking Horse. Crisis Ops will let the other primaries know anytime an incident is logged and the recommended action changes (e.g. the Rocking Horse score changes from "Unknown" to "Respond" due to updates in metrics or coverage).
5. If it's decided we should respond to an incident, Crisis Ops will assign the "Escalate" label in Rocking Horse. This will move the incident to the "Launched" tab in Rocking Horse. Crisis Ops should follow escalation protocol.
For more support on determining which email to send, please look at the New Mailer instructions.

Post-launch
Crisis Ops should set the triage label to 'launched' after filling out go/sos-board. New signals may appear over time and need to be merged with the existing launched RH entry (if relevant): primaries set the triage label to 'claimed', review the update, set the label to 'launched', and merge them together. More steps regarding post-launch here.

Turndown
Crisis Ops should set the triage label to 'launch-inactivated' after archiving the alert. More steps regarding turndown here.

Report an Incident Form
Crisis Ops should be signed up to receive notifications every time a form is submitted to the tracker.
To do so, go to Tools, then Notification Rules, and select the appropriate setting.
1. When a form is submitted through the 'Report an Incident' form, Crisis Ops must immediately triage the incident to determine the proper response.
2. Once the status of the incident has been determined (e.g. previously logged, out of scope, etc.) and marked accordingly in column K, Crisis Ops must immediately reply to the Googler who submitted, using one of the 'Report an Incident' boilerplate email templates provided in this doc.
   - Compose an email with the subject line: Report An Incident Form
   - Address the email to: [Submitter's LDAP], [email protected], [email protected]
   - For the body of the email, select a boilerplate template according to the incident scenario from the playbook.
   - Note that there are times when the IC will choose to be the one to respond to the submitter. In such cases, Crisis Ops can forgo sending the email and mark so accordingly in column L.
3. Fill out everything after column J in the Report an Incident Form spreadsheet.
4. If the submitter replies to the email with additional questions/information, or if they ping you individually or through go/crisis-chat, you can direct them to the IC by plussing them into the email thread or tagging them in go/crisis-chat.

Escalating
When tagging an IC in Crisis Chat for an escalation, please explicitly state whether it is a soft escalation or a hard escalation that will be followed up with a Mailer.
To mitigate a scenario where both the primary and secondary oncall fail to respond to pages, follow this protocol:
- If no one responds to an escalation (primary or secondary IC) for more than 10 min, generate another manual escalation to the rotation
- If there is no response on the second attempt, page the IC that last handed off in the other rotation timezone (e.g. if Aynsley handed off to Ryan, page Aynsley after two failed escalation attempts)
When to follow a higher escalation threshold: Check the IC oncall schedule and note whether the IC is US or non-US. If the IC is currently in their local night-time hours, only page for urgent incidents, e.g. a mass shooting, terrorist attack, etc.

Hard escalation (paging) criteria:
- A statewide/region-wide disaster
- An event crossing state lines
- Any man-made incident that scores Unknown and 10 or more have been killed
- Any case that we're unsure how to handle when the incident scores Unknown and 10 or more have been killed
- A potential or confirmed act of terrorism, regardless of Rocking Horse recommendation (especially if numerous civilian lives are threatened)
- A Googler is asking to respond to an incident (even if you disagree)
- We have life-saving H&I that should override an active flood GIA
Note: When in doubt, please page ICs! If you are dealing with a time-sensitive question/issue, it's better to ring their phone right away than to wait for a response to a chat tag.

If an incident warrants a hard escalation, follow the steps below:
1. Send an escalation email from the Mailer. The triage rationale should be a fairly brief explanation of the action taken, like the rationales from the Mailer tool. It should be as specific as possible, including numbers that describe the action taken and help show impact or severity. Assign the "Escalating" triage label to the corresponding task in Rocking Horse. Note: PgMs have added cr-alert-ops@ as a CC to Escalation Manager, which will allow us to see both the escalation message sent to ICs and when an IC acks the escalation.
2. Open Crisis Chat and introduce the situation, e.g. "We just escalated the Tokyo earthquake." If a crisis has a recommended action of "Respond", immediately follow the escalation email with an Activation email, announce in Crisis Chat that you are drafting an alert, and follow the launching instructions.
   **If the event is man-made and scoring "Respond", immediately following escalation, use this template to let the IC know that we are activating: "We are activating for [X] due to the incident scoring RESPOND - no action is required from you, we are moving forward with a launch, but please let us know if you have any concerns about activation!" Highlight whether the incident is receiving a significant amount of media attention, the number of people directly affected, and our ability to provide specific help & information (if applicable).
   If the incident was "Do Not Respond" or "Unknown", wait for IC confirmation before sending the Activation email. Then follow up with the regular launch process and comms protocol.

Soft Escalation Criteria
- An incident that needs guidance but doesn't require immediate attention.
- An incident is scoring Respond but has a live public alert for the crisis. If there is a live public alert, the IC will likely suggest holding off on launching until the public alert has expired. *As of 11/7/23, bushfires in Australia have the capability to use the Mix and Match feature with PAs and SOS appearing simultaneously. Ops does not need to escalate such incidents to an IC when a PA is active. If a public alert expires and we launch a basic alert, but then the public alert reappears, ping an IC (they may prefer to turn the alert down in favor of the public alert).
- A violent incident (i.e. a shooting) seems to be contained, but the event is garnering national media attention, there are 10+ casualties (or media is suggesting potential mass casualties), and we have found potentially useful H&I.
Note: If the IC doesn't respond to a soft escalation/tag in Crisis Chat after 30 minutes, ping them directly with the question. If the IC doesn't respond to the direct ping after ~30 minutes, page them. Do not page for turndown suggestions unless the situation is timely and the IC is unresponsive in chat. If a US IC is oncall overnight, consider waiting until morning to escalate (unless urgent). If the situation is timely, e.g. a work-stopping issue, production issue, man-made event, fast-paced natural disaster with minimal forecast, etc., page them without waiting the designated buffer time. If the time buffer has elapsed, provide a rationale for why you're paging for a soft escalation in the text, e.g. "we have not received a response in Crisis Chat after ~1 hour and are following protocol by paging to address the situation."

If an incident warrants a soft escalation, follow the steps below:
1. Tag the current IC on-call in Crisis Chat and introduce the situation, e.g. "Hi [IC ldap], we are soft escalating the Austin shooting, which has a live public alert and seems to be nearing containment." Provide any additional rationale on the incident, such as media attention, H&I availability, etc.
2. After a decision has been reached, send either an Activation or Monitor email with the IC's rationale clearly stated. If activating, follow the regular launch and launch comms protocol.

Unique Circumstances:
We generally do not override Public Alerts for earthquakes in Japan, regardless of whether the RH score is Respond.
Escalate an earthquake in Japan only if there is substantially valuable H&I available that the PA is not providing. Overriding should be consulted with the IC and with the Japan team. We do not override Google Initiated Alerts unless we can provide critical H&I such as helplines or shelter/evac info.
If an event is political in nature and has a recommended action of Respond, touch base with the IC in Crisis Chat or ping them before escalating. If they are unresponsive, escalate via the Mailer but include a detailed rationale acknowledging that the event is political.
1. Examples of political incidents include, but are not limited to: protests, military action, and issues occurring in disputed territories. Use best judgment in determining whether something is political.
If the event is receiving a lot of international attention but is no longer ongoing, send a Monitor email and ping the IC in Crisis Chat letting them know about the incident. This helps other Google stakeholders understand the situation, whether we end up escalating/responding or not, in case they decide to respond in another way, e.g. Marketing posting a black ribbon of solidarity on the Google homepage for the affected country.
If an exact location is unclear at the point an incident should be escalated, continue to escalate and explain the scenario to ICs.
1. We typically do not launch if we can't pinpoint the exact location, due to current product limitations.
2. We would only launch an alert without a map if we have incredibly helpful H&I, but there's still a risk with Geo, so this would be a rare situation.
3. When determining the exact location, it's more about making sure we're assertive/confident with where we're marking the pin and less about needing an authoritative source to confirm the location. There's more at stake if we're confirming via indirect sources. Make sure to always gut check with the IC.
The IC can approve a man-made launch if local authorities have not publicly confirmed an event, as long as all 3 of the below elements are true:
1. The situation is still active
2. We have an exact location
3. At least 3 reputable major news outlets are reporting on the situation

Gut-Check Protocol
Escalation thresholds for ICs during their off-hours: Check the IC oncall schedule. On night shift, if any of the non-U.S. ICs are on, you can escalate at the normal threshold. If a U.S. IC is on, please follow a lower-threshold escalation path.
While the recommended action provided by Rocking Horse provides guidance on the appropriate response to an incident, it isn't foolproof. Below are guidelines on when to diverge from Rocking Horse recommendations (a decision sketch follows this list). See the Escalating page for more details.
1. For MAN-MADE disasters scoring Respond, always hard escalate. **After escalating, use this template to let the IC know we are activating a man-made disaster scoring Respond: "We are activating for [X] due to the incident scoring RESPOND - no action is required from you, we are moving forward with a launch, but please let us know if you have any concerns about activation!"
2. For NATURAL disasters scoring Respond, move forward with an auto launch. Do NOT send an escalation email UNLESS the disaster is:
   - Statewide/region-wide (e.g. California wildfires)
   - Crossing state lines (e.g. Winter Storm Uri)
   - Merging multiple alerts (e.g. changing to a singular August Complex fire from 3 unique alerts)
   - Re-launching for old events that have restarted (fires, hurricanes, etc.)
   Note: If you're unsure of a response (even if its score is an auto-response), still escalate due to the score, but hold off on activation and reach out to an IC; just because it's an auto-respond doesn't always mean we'll respond.
3. You should hard escalate when:
   - A man-made incident scores UNKNOWN and 10 or more have been killed.
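The gut-check guidance above, together with the Triaging section, can be condensed into a small decision helper. The sketch below is a hypothetical summary of those rules, not an official Crisis Ops tool; the real judgment calls (what counts as statewide, political, or an exception) stay with Ops and the IC.

```python
# Hypothetical condensation of the gut-check rules above; not an official tool.
# Inputs mirror the fields discussed in this doc (RH score, disaster type, etc.).

def gut_check(rh_score: str, man_made: bool, statewide: bool = False,
              crossing_state_lines: bool = False, merging_alerts: bool = False,
              relaunch: bool = False, killed: int = 0) -> str:
    """Return the suggested next step for an incident, per the guidelines above."""
    score = rh_score.lower()
    if score == "respond":
        if man_made:
            return "hard escalate, then activate using the man-made RESPOND template"
        # Natural disasters scoring Respond auto-launch unless an exception applies.
        if statewide or crossing_state_lines or merging_alerts or relaunch:
            return "send an escalation email before launching"
        return "auto launch (no escalation email)"
    if score == "unknown" and man_made and killed >= 10:
        return "hard escalate"
    return "use judgment: monitor, soft escalate, or no response"

# Example: a natural disaster scoring Respond that crosses state lines.
print(gut_check("Respond", man_made=False, crossing_state_lines=True))
```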
