Data streams are continuous flows of data generated over time.
They are potentially infinite: a stream has no predefined end.
Data streams can originate from various sources such as:
sensors
social media feeds
financial transactions
website clickstreams
etc.
Characteristics of Data Streams
Continuous Flow: Data arrives continuously, with no predefined end.
High Volume: Data streams often involve large volumes of data generated in real time.
Variety: Data streams can contain diverse types of data:
structured data (e.g., records with a fixed format)
semi-structured data (e.g., data elements expressed in JSON or XML, but with no strict schema)
unstructured data (e.g., text documents, images, video, audio)
Velocity: Data is generated at high speed and must be processed rapidly to derive insights in near real time.
Real-time Processing: Because streams are continuous and insights are time-sensitive, processing and analysis often occur in real time or near real time.
Dynamic: The characteristics of a stream, such as its volume, velocity, and variety, can change over time.
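As an illustration of the continuous-flow and real-time points above, here is a minimal Python sketch of processing records one at a time with bounded memory. The function name sliding_average, the window size, and the sample readings are all assumptions made for this example; a real stream would come from a live source rather than a list.

```python
from collections import deque
from typing import Iterable, Iterator


def sliding_average(stream: Iterable[float], window_size: int) -> Iterator[float]:
    """Yield the mean of the most recent `window_size` values after each record."""
    if window_size < 1:
        raise ValueError("window_size must be at least 1")
    window = deque(maxlen=window_size)  # the oldest value is dropped automatically
    total = 0.0
    for value in stream:
        if len(window) == window_size:
            total -= window[0]  # this value is about to be evicted
        window.append(value)
        total += value
        yield total / len(window)


# Stand-in for a live sensor feed (hypothetical values).
readings = [10.0, 12.0, 11.0, 15.0, 14.0]
averages = list(sliding_average(readings, window_size=3))
```

Because the function is a generator, it never needs the whole stream in memory: it keeps only the last window_size values, so it would work the same way on an unbounded source.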
Processing a Data Stream
A stream processing platform (like AWS Kinesis) partitions stream data into Shards.
Each shard receives a sequence of data records that are directed to it by a Partition Key.
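The routing of records to shards can be sketched in a few lines of Python. Kinesis Data Streams hashes each partition key with MD5 into a 128-bit integer and assigns the record to the shard whose hash key range contains that value. The sketch below assumes, for simplicity, that the hash space is split evenly across shards; shard_for_key is a hypothetical helper written for this example, not part of any AWS SDK.

```python
import hashlib

HASH_SPACE = 2 ** 128  # an MD5 digest is a 128-bit value


def shard_for_key(partition_key: str, shard_count: int) -> int:
    """Map a partition key to a shard index (hypothetical helper, not a Kinesis API)."""
    if shard_count < 1:
        raise ValueError("shard_count must be at least 1")
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_value = int.from_bytes(digest, "big")
    # Assumption: each shard owns an equal, contiguous slice of the hash space.
    return hash_value * shard_count // HASH_SPACE


# Records that share a partition key always land on the same shard,
# so their relative order is preserved within that shard.
for key in ["sensor-1", "sensor-2", "sensor-1"]:
    print(key, "->", shard_for_key(key, shard_count=4))
```

A practical consequence is that the choice of partition key determines how evenly load spreads: a key with few distinct values concentrates records on a small number of shards.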
Shards are processed concurrently by various AWS compute services (EC2, AWS Lambda, EKS/ECS, EMR).
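Concurrent shard processing can be imitated locally with one worker per shard. This is only a sketch: the in-memory shards dictionary, the shard names, and process_shard are assumptions for illustration, and a thread pool stands in for the AWS compute services listed above.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory shards; each list holds records in arrival order.
shards = {
    "shard-0": [3, 1, 4],
    "shard-1": [1, 5],
    "shard-2": [9, 2, 6],
}


def process_shard(shard_id, records):
    """Consume one shard's records in order and return a per-shard total."""
    total = 0
    for record in records:
        total += record
    return shard_id, total


# One worker per shard: shards run concurrently, while records inside
# a single shard are still handled sequentially.
with ThreadPoolExecutor(max_workers=len(shards)) as pool:
    futures = [pool.submit(process_shard, sid, recs) for sid, recs in shards.items()]
    totals = dict(future.result() for future in futures)
```

Keeping each shard on a single worker is what preserves per-shard ordering; parallelism comes from having many shards, not from splitting one shard across workers.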