Securing Application Logs with Amazon Comprehend

Security is one of the most commonly overlooked aspects of application architecture design. Combine this low priority with the increasing value of personal user data, and security breaches become one of the surest ways for companies to lose user trust, face legal charges, and, in the long run, fail.

Various governments have developed data compliance laws that set minimum standards for how data stakeholders handle data securely. These laws govern the collecting, processing, storing, and sharing of personal and sensitive information to protect individuals’ privacy and data security. Although regulations vary by country or region, such as GDPR in Europe, CCPA in California, and HIPAA in the United States, the main idea is the same: to protect sensitive data collected and used by any company or institution.

Compliance, from the technical side

Data stakeholders adhere to data compliance regulations by implementing several key practices. A few examples of these practices include:

- Obtaining explicit consent from users before collecting and processing personal data
- Employing robust data encryption and security measures
- Conducting regular audits and assessments, especially when guidelines are updated
- Limiting data retention and strictly monitoring access to data

To better understand how these practices are employed, we will take a deep dive into some data encryption and security measures used in application logging.

Log Security Methods

Application logs serve as crucial records of an application’s activity and are used for various purposes, primarily for troubleshooting, monitoring, and auditing. These logs typically contain detailed information about events, errors, and transactions within an application, including timestamps, user actions, system states, and more. However, in some cases, application logs may unintentionally contain Personally Identifiable Information (PII) because they capture user interactions and system responses.

To comply with data governance laws like GDPR, CCPA, HIPAA, and others, stakeholders must apply stricter governance to any data containing PII, including application logs. Several methods can be applied to application logs, such as PII abstraction, access control policies, and log retention policies.

PII Abstraction

Ideally, application logs should not include PII. However, some use cases (especially for Business Intelligence purposes) rely on and benefit heavily from basic personal data, resulting in a net positive experience for users, provided that the PII is kept secure.

To protect sensitive personal data, PII abstraction methods are used. PII abstraction makes PII inaccessible or nonsensical to anyone outside the intended personnel who may gain access to the logs. Two of the many possible implementations are redaction and hashing.

[SAMPLE LOG] 9626308c-e26d-4a45-a9a0-ce00de003e9d	UserID: 8320, Username: 
nalvarado, Email: twood@example.com, Item: Laptop - Brand X, Price: $399.99, 
Address: 688 Harrison Landing Apt. 610
Nathanland, DC 43062, IP: 48.1.114.194, UserAgent: Mozilla/5.0 (Windows 98) 
AppleWebKit/536.2 (KHTML, like Gecko) Chrome/50.0.890.0 Safari/536.2, 
SessionID: eff82e9e-e2a9-45aa-8f78-cf841903b0e8

Above is a sample application log with no abstraction implemented. Notice that anyone who can access it would be able to obtain sensitive personal information, such as the person’s email, home address, and IP address.

To avoid cases like this, the PII may be redacted or hashed upon logging. Redacted logs replace every character of the PII with an arbitrary character (like an asterisk or an X). A redacted log would look like the following:

0ff1416c-d624-4ba8-bb88-bd8bc2680b2a	UserID: 9119, Username: tbrown, Email: 
XXXXXXXXXXXXXXXXXXXXXXXXXX, Item: Laptop - Brand X, Price: $59.99, Address: 
XXXXXXXXXXXXXXXXXXXXXXXXXX, IP: XXXXXXXXXXXXXXX, UserAgent: Mozilla/5.0 
(Windows NT 10.0) AppleWebKit/536.2 (KHTML, like Gecko) Chrome/XXXXXXXXXX 
Safari/536.2, SessionID: ce9fc2f2-fe3b-4b84-baf5-f011a05963dd
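
As a minimal sketch of redaction, assuming the position of the sensitive value is already known, the helper below overwrites a character range with X's while keeping the length of the log line unchanged. The name `redact_span` and the shortened log line are illustrative and not part of the original pipeline.

```python
def redact_span(text: str, start: int, end: int, mask_char: str = "X") -> str:
    """Replace text[start:end] with mask characters of the same length."""
    if not 0 <= start <= end <= len(text):
        raise ValueError("span is outside the text")
    return text[:start] + mask_char * (end - start) + text[end:]


# Shortened, illustrative log line based on the sample log above.
log_line = "UserID: 8320, Email: twood@example.com, Item: Laptop - Brand X"
email = "twood@example.com"
start = log_line.index(email)
redacted = redact_span(log_line, start, start + len(email))
print(redacted)
```

Because the mask has the same length as the original value, the offsets of everything after the redacted span stay the same.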

On the other hand, the logs could be processed to hash the PII instead. Hashing uses a one-way hash function to map the actual value to a fixed-length digest (the same concept is used in storing passwords). Even if the digest is accessed, it cannot feasibly be reversed to recover the original value. Note that common hash functions such as SHA-256 are public, so low-entropy values like IP addresses can still be guessed by hashing candidate values and comparing the results; mixing in a secret key or salt that is kept private mitigates this. This allows the data stakeholders to store the data and ensure its security as long as that secret is not compromised.

fa76296f-8855-4baf-a3ff-99c570f96e69	UserID: 8977, Username: 
332aebf7492b28c74c6cb24327de9e7155162472066e8585d19db593a573ce658c5e14e2d3780d
2eb17d358055a3a098e1594…, Item: Laptop - Brand X, Price: $19.99, Address: 
f5ef4ac82196c2fe1a69f75b31c3c9c9f2cb85f9140162ee7f30ccb185b4b4045b8d9a…, IP: 
5cead27ee10292642878d866bcaaa6fe5f…, UserAgent: Mozilla/5.0 (X11; Linux i686) 
AppleWebKit/535.1 (KHTML, like Gecko) Chrome/15.0.848.0 Safari/535.1, 
SessionID: 0bd815c8-efe0-4464-be5d-79bb995c117c
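
The following is a minimal sketch of hashing a PII value with Python's standard library. It contrasts a plain SHA-256 digest with a keyed HMAC-SHA-256 digest, which is one common way to keep guessable values from being matched by anyone who lacks the key. The names `hash_pii`, `keyed_hash_pii`, and `SECRET_KEY` are hypothetical, and the key shown is a placeholder; how a real key is stored and rotated is outside the scope of this sketch.

```python
import hashlib
import hmac

# Placeholder key for illustration only; a real key would come from a secrets store.
SECRET_KEY = b"replace-with-a-real-secret"


def hash_pii(value: str) -> str:
    """Plain SHA-256: deterministic, but guessable for low-entropy values."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def keyed_hash_pii(value: str, key: bytes = SECRET_KEY) -> str:
    """HMAC-SHA-256: the digest cannot be recomputed without the key."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


ip = "48.1.114.194"
print(hash_pii(ip))        # 64 hexadecimal characters
print(keyed_hash_pii(ip))  # a different 64-character digest
```

Both functions return the same digest for the same input, so hashed values can still be grouped or joined for analysis without exposing the raw value.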

Implementation of PII Abstraction

As mentioned earlier, for cases that do not require the raw information contained by PII, the logging of PII would be unnecessary. For other cases, PII Abstraction should be implemented as early in the lifetime of a log as possible.

PII Abstraction is implemented in two steps: first, a detection technique locates the PII in the text; then each detected value is either hashed or redacted with placeholder characters.
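
One way to apply abstraction early is at the point where the log record is created. The sketch below assumes Python's standard `logging` module and a single illustrative email pattern; the class name `RedactEmailFilter` and the logger name `orders` are hypothetical. It shows the idea only, and a real deployment would need to cover more PII types.

```python
import logging
import re

# Illustrative pattern only; a real deployment would cover more PII types.
EMAIL_RE = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")


class RedactEmailFilter(logging.Filter):
    """Mask email addresses before any handler formats or stores the record."""

    def filter(self, record: logging.LogRecord) -> bool:
        message = record.getMessage()  # merge msg and args into the final text
        record.msg = EMAIL_RE.sub(lambda m: "X" * len(m.group()), message)
        record.args = None  # args are already merged into msg
        return True  # keep the record, now redacted


logger = logging.getLogger("orders")
logger.addFilter(RedactEmailFilter())
```

With the filter attached to the logger, a call such as `logger.info("Email: %s", user_email)` reaches the handlers with the address already masked, so the raw value is never written to a file or shipped to a log store.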

In the examples above, Regular Expressions (RegEx) and Amazon Comprehend were used to identify the PII.

Regular Expressions (RegEx)

Regular Expressions, often abbreviated as Regex or RegExp, are powerful text patterns used in computer science and programming to search for, match, and manipulate text strings based on specific patterns or rules. Below are examples of patterns that common PII types follow and the RegEx that can be used to detect them.

pii_patterns = {
    'email': r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b',
    'ip_address': r'\b(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b',
    'phone_number': r'\b\d{1,3}[-.\s]?\d{1,3}[-.\s]?\d{4}\b',
    'credit_card': r'\b(?:\d{4}-?){4}(?=\s|$)(?![\d-])',
    'ssn': r'\b(?!000|666|9\d\d)\d{3}[-\s]?(?!00)\d{2}[-\s]?(?!0000)\d{4}\b'
}

Each pattern above detects the PII type it is designed for. For example, the ‘email’ pattern matches the following:

- john.doe@example.com
- alice_smith123@subdomain.domain.co.uk
- support@my-website.net

RegEx offers a straightforward way of detecting PII based on pattern definitions, assuming the patterns hold consistently across all the text. However, as seen in pii_patterns, these patterns are rather cryptic, and every expected format has to be hard-coded. In cases where PII doesn’t follow a definite format that can be inferred from the text itself, such as names or street addresses, RegEx may fail, leaving PII visible and the data mishandled.
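
As a rough sketch of how such patterns can be applied, the snippet below scans a log line with two of the patterns and reports what was found. The function name `find_pii` and the shortened log line are illustrative, and the IP pattern is written in a compact but equivalent form.

```python
import re

# Two of the patterns from pii_patterns, compiled once for reuse.
PII_PATTERNS = {
    "email": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    # Same IPv4 rule as above, with the repeated octet group folded into {3}.
    "ip_address": re.compile(
        r"\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}"
        r"(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b"
    ),
}


def find_pii(text):
    """Return (pii_type, matched_text) pairs for every pattern hit."""
    hits = []
    for pii_type, pattern in PII_PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((pii_type, match.group()))
    return hits


log_line = (
    "Username: nalvarado, Email: twood@example.com, "
    "Address: 688 Harrison Landing Apt. 610, IP: 48.1.114.194"
)
print(find_pii(log_line))
# The email and IP address are reported; the street address is not,
# because none of the patterns describes free-form addresses.
```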

In these cases, another option for detecting the PII would be to use Amazon Comprehend.

Amazon Comprehend

Amazon Comprehend is a natural language processing service that helps users understand and analyze text data, providing insights into its content, sentiment, entities, and language. Comprehend has a “Detect PII” feature, which is designed to identify and extract PII from text, including data like names, addresses, and social security numbers, among many others.

Here is a code sample of how Amazon Comprehend can be used for these logs.

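import hashlib

import boto3

# AWS Comprehend client (region and credentials come from the AWS configuration)
comprehend = boto3.client(service_name='comprehend')

def handle_pii(data, pii_handling_mode="redact"):
    """Function to redact or hash PII using AWS Comprehend."""
    # Detect PII entities using AWS Comprehend
    response = comprehend.detect_pii_entities(Text=data, LanguageCode='en')
    # Each entity carries a Type, a Score, and BeginOffset/EndOffset into data
    pii_entities = response['Entities']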

For cases where the data has highly inconsistent text patterns, using Amazon Comprehend would generally provide better PII detection results.
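
Each entity returned by Comprehend includes a confidence `Score` and a `Type` alongside its offsets, so the results can be filtered before any text is replaced. The sketch below works on a hand-written response in that shape, so it runs without AWS credentials. The function name `select_entities`, the 0.8 threshold, and the sample scores and offsets are assumptions for illustration, not recommended values.

```python
def select_entities(response, min_score=0.8, allowed_types=None):
    """Keep detected entities above a confidence threshold, optionally by type."""
    selected = []
    for entity in response.get("Entities", []):
        if entity["Score"] < min_score:
            continue
        if allowed_types is not None and entity["Type"] not in allowed_types:
            continue
        selected.append(entity)
    return selected


# Hand-written response in the Comprehend shape; scores and offsets are made up.
text = "Email: twood@example.com, IP: 48.1.114.194"
sample_response = {
    "Entities": [
        {"Score": 0.99, "Type": "EMAIL", "BeginOffset": 7, "EndOffset": 24},
        {"Score": 0.42, "Type": "IP_ADDRESS", "BeginOffset": 30, "EndOffset": 42},
    ]
}
for entity in select_entities(sample_response):
    print(entity["Type"], text[entity["BeginOffset"]:entity["EndOffset"]])
```

A lower threshold redacts more aggressively at the cost of masking text that is not PII, so the right value depends on how sensitive the logs are.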

Whichever the situation may be, after using either RegEx or Comprehend to detect the PII, simple logic can be used to either redact or hash it. Below is a sample code snippet that works from a list of pii_entities (detected by RegEx or Comprehend) and rebuilds the data without PII.

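    cleaned_data = data
    # Replace from the last entity to the first so that earlier offsets stay
    # valid even when a replacement (such as a hash) changes the text length.
    for entity in sorted(pii_entities, key=lambda e: e['BeginOffset'], reverse=True):
        start = entity['BeginOffset']
        end = entity['EndOffset']
        pii_text = data[start:end]

        if pii_handling_mode == "redact":
            replacement_text = 'X' * len(pii_text)
        elif pii_handling_mode == "hash":
            replacement_text = hashlib.sha256(pii_text.encode()).hexdigest()
        else:
            raise ValueError(f"Unknown PII handling mode: {pii_handling_mode}")

        # Accumulate replacements instead of restarting from the raw data
        cleaned_data = cleaned_data[:start] + replacement_text + cleaned_data[end:]

    return cleaned_data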
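
Since the replacement logic only needs offsets, RegEx matches can be described with the same `BeginOffset` and `EndOffset` keys that Comprehend returns, letting one routine serve both detectors. The following is a self-contained sketch under that assumption; the names `regex_entities` and `abstract_pii` are hypothetical, and the routine assumes the detected spans do not overlap.

```python
import hashlib
import re

EMAIL_RE = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")


def regex_entities(text, pattern, pii_type):
    """Describe regex matches with the same offset keys Comprehend returns."""
    return [
        {"Type": pii_type, "BeginOffset": m.start(), "EndOffset": m.end()}
        for m in pattern.finditer(text)
    ]


def abstract_pii(text, entities, mode="redact"):
    """Redact or hash each entity span; assumes spans do not overlap."""
    cleaned = text
    # Work right to left so earlier offsets stay valid as lengths change.
    for entity in sorted(entities, key=lambda e: e["BeginOffset"], reverse=True):
        start, end = entity["BeginOffset"], entity["EndOffset"]
        pii_text = text[start:end]
        if mode == "redact":
            replacement = "X" * len(pii_text)
        elif mode == "hash":
            replacement = hashlib.sha256(pii_text.encode("utf-8")).hexdigest()
        else:
            raise ValueError(f"unknown mode: {mode}")
        cleaned = cleaned[:start] + replacement + cleaned[end:]
    return cleaned


line = "Email: a@example.com, Backup: b@example.org"
entities = regex_entities(line, EMAIL_RE, "EMAIL")
print(abstract_pii(line, entities, mode="redact"))
print(abstract_pii(line, entities, mode="hash"))
```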

Final Remarks

As the amount of data generated by every person increases over time, the opportunities for misuse of this data also increase. Observing strict compliance with data governance laws minimizes data misuse. It also gives everyone a sense of security that their data, identity, and sensitive information will not be in danger, even when some technical failures occur.

From the developer side, no application or data procedure can be guaranteed to be 100% secure. Still, consistently following the guidelines will help make them as secure as possible.

Aside from relying on data stakeholders to employ advanced techniques to secure our data, we should remember that data generation begins with us as individuals. Keeping our personal data out of the open is equally important, since exposing it poses a considerable risk. Adopting better data practices will help ensure our security in this digital age.

Thank you to everyone who read this article, and happy learning!

Resources:

https://aws.amazon.com/security/

https://aws.amazon.com/comprehend/

Written by: Lesmon Andres Lenin Saluta

Lesmon is a data practitioner and currently the Product Data Scientist at Angkas. He oversees the development and innovation of product data analysis and modeling towards impactful solutions and to make better-informed business decisions. He also has a genuine passion for mentoring. He is a Data Science Fellowship Mentor at Eskwelabs, where he imparts knowledge and nurtures the next generation of Data Practitioners. Outside of work, Lesmon is a freshman at the University of the Philippines - Diliman, a scholar taking up a degree in BS Computer Science.
