About Data Streams

What are Data Streams?

  • A data stream is a sequence of data records that a source emits continuously over time.
  • Data streams are unbounded: they are potentially infinite, so the complete data set is never available at once.
  • Data streams can originate from various sources such as:
    • sensors
    • social media feeds
    • financial transactions
    • website clickstreams
    • etc.

Characteristics of Data Streams

  • Continuous Flow: Data arrives continuously, with no defined end.
  • High Volume: Data streams often involve a high volume of data being generated in real-time.
  • Variety: Data streams can contain diverse types of data:
    • structured data (e.g., records with a fixed format)
    • semi-structured data (e.g., data elements expressed in JSON or XML, but with no strict schema)
    • unstructured data (e.g., text documents, images, video, audio)
  • Velocity: Data streams have a high velocity, meaning that data is generated and needs to be processed rapidly to derive insights in near real-time.
  • Real-time Processing: Due to the continuous nature of data streams and the need for timely insights, processing and analysis of data streams often occur in real-time or near real-time.
  • Dynamic: Data streams can be dynamic in nature, with data characteristics such as volume, velocity, and variety potentially changing over time.
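
Because a stream has no end, results have to be computed incrementally as records arrive rather than after loading everything. The following is a minimal Python sketch of that idea, not a production design. The `sensor_readings` source and its simulated values are hypothetical, invented only for illustration.

```python
import itertools
import random
from typing import Iterator


def sensor_readings(seed: int = 0) -> Iterator[float]:
    """Hypothetical unbounded source: yields one simulated reading at a time, forever."""
    rng = random.Random(seed)
    while True:
        yield 20.0 + rng.uniform(-1.0, 1.0)


def running_mean(stream: Iterator[float]) -> Iterator[float]:
    """Yield the mean of all values seen so far, keeping only two numbers in memory."""
    count = 0
    total = 0.0
    for value in stream:
        count += 1
        total += value
        yield total / count


# The source never ends, so we take only the first five results.
first_five = list(itertools.islice(running_mean(sensor_readings()), 5))
```

The generator keeps only a count and a running total, so memory use stays constant no matter how long the stream runs.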

Processing a Data Stream

  • A stream processing platform (such as Amazon Kinesis Data Streams) partitions stream data into shards.
  • Each shard receives a sequence of data records, which are routed to it based on each record's partition key.
  • Shards are processed concurrently by consumers running on AWS compute services (such as EC2, AWS Lambda, EKS/ECS, or EMR).
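
The routing of records to shards can be simulated locally. Kinesis Data Streams is documented as hashing the partition key with MD5 into a 128-bit integer and assigning the record to the shard whose hash key range contains that value. The Python sketch below imitates that behavior without calling AWS. The function names `shard_ranges` and `shard_for_key`, and the assumption that shards split the hash space evenly, are illustrative choices only. Real shard ranges can be uneven after shards are split or merged.

```python
import hashlib
from typing import List, Tuple

HASH_SPACE = 2 ** 128  # MD5 digests are 128 bits wide


def shard_ranges(num_shards: int) -> List[Tuple[int, int]]:
    """Split the hash space into contiguous, inclusive ranges (assumed even)."""
    size = HASH_SPACE // num_shards
    ranges = []
    for i in range(num_shards):
        start = i * size
        # The last shard absorbs any remainder so the whole space is covered.
        end = HASH_SPACE - 1 if i == num_shards - 1 else (i + 1) * size - 1
        ranges.append((start, end))
    return ranges


def shard_for_key(partition_key: str, ranges: List[Tuple[int, int]]) -> int:
    """Return the index of the shard whose range contains the key's MD5 hash."""
    digest = hashlib.md5(partition_key.encode("utf-8")).hexdigest()
    hash_value = int(digest, 16)
    for index, (start, end) in enumerate(ranges):
        if start <= hash_value <= end:
            return index
    raise ValueError("hash value not covered by any shard range")


ranges = shard_ranges(4)
# Records with the same partition key always land on the same shard.
same_shard = shard_for_key("sensor-42", ranges) == shard_for_key("sensor-42", ranges)
```

Since the shard depends only on the key's hash, all records that share a partition key are routed to the same shard, while distinct keys spread across shards and can be processed in parallel.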