You are given a naïve CUDA kernel that computes the element-wise product of two large float arrays, A and B, and stores the result in C. The computation is memory-bandwidth-bound, yet the reference implementation reaches only 3% of peak GPU memory throughput on an A100. Your task is to rewrite the kernel (and any necessary host setup) so that the full program achieves at least 85% of theoretical global-memory bandwidth when A, B, and C each contain 2^28 (about 268 million) floats.

You may introduce additional kernels, but the final arrays must reside in global memory, and the host-side timing must include all GPU work. You may not use Tensor Cores, cuBLAS, or any external library; everything must be written in CUDA C++.

Your submission will be compiled with nvcc -O3 -arch=sm_80 and run on an isolated A100-SXM4-40GB. The evaluator measures the elapsed wall-clock time of the single call cudaMemcpy(C, d_C, nBytes, cudaMemcpyDeviceToHost) that follows your kernel launch(es); this time must correspond to an effective bandwidth of ≥ 85% of the 1555 GB/s theoretical peak (about 1322 GB/s). The relative L2 error of your result against the reference implementation must also be < 1e-6.
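
As a rough guide to the two acceptance checks, the host-side sketch below turns a measured time into an effective bandwidth and computes a relative L2 error. It is not the evaluator's code. The helper names (bytes_moved, effective_gbps, max_seconds, relative_l2_error) are hypothetical, and two things are assumed rather than stated in the task: that the bytes counted are two reads plus one write per element, and that the relative L2 error is ||C - C_ref||_2 / ||C_ref||_2.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Figures taken from the task statement.
constexpr double kPeakGBps = 1555.0;                     // theoretical peak, GB/s
constexpr double kTargetFraction = 0.85;                 // required fraction of peak
constexpr std::size_t kNumElems = std::size_t{1} << 28;  // floats per array

// ASSUMPTION: traffic is two reads (A, B) plus one write (C) per element.
// The task statement does not say which bytes the evaluator counts.
inline double bytes_moved(std::size_t n) {
    return 3.0 * static_cast<double>(n) * sizeof(float);
}

// Effective bandwidth in GB/s (1 GB = 1e9 bytes) for a measured time.
inline double effective_gbps(std::size_t n, double seconds) {
    assert(seconds > 0.0);
    return bytes_moved(n) / seconds / 1e9;
}

// Longest elapsed time (in seconds) that still meets the target,
// under the same byte-counting assumption.
inline double max_seconds(std::size_t n) {
    return bytes_moved(n) / (kTargetFraction * kPeakGBps * 1e9);
}

// ASSUMPTION: relative L2 error = ||c - ref||_2 / ||ref||_2.
// Sums are accumulated in double to limit rounding in the check itself.
inline double relative_l2_error(const std::vector<float>& c,
                                const std::vector<float>& ref) {
    assert(c.size() == ref.size());
    double num = 0.0, den = 0.0;
    for (std::size_t i = 0; i < c.size(); ++i) {
        const double r = static_cast<double>(ref[i]);
        const double d = static_cast<double>(c[i]) - r;
        num += d * d;
        den += r * r;
    }
    if (den > 0.0) return std::sqrt(num / den);
    // All-zero reference: report 0 only if c is all zeros too.
    return num == 0.0 ? 0.0 : std::numeric_limits<double>::infinity();
}
```

Under these assumptions, max_seconds(kNumElems) works out to roughly 2.4 ms for 2^28 floats per array, and a run passes when effective_gbps(kNumElems, t) is at least 0.85 * 1555 GB/s and relative_l2_error(C, C_ref) is below 1e-6. If the evaluator counts bytes differently, only bytes_moved needs to change.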